## Female vs. male differential expression analysis of LT cells with MAST

**LT-HSCs**


Run this model:

`zlmCond_all <- zlm(formula = ~Female + n_genes + leiden, sca=sca)`

Comparisons:

Compare both replicates: old LT and new LT.


Done with this Docker image:

`docker run --rm -d --name scanpy -p 8883:8888 -e JUPYTER_ENABLE_LAB=YES -v /Users/efast/Documents/:/home/jovyan/work r_scanpy:vs5`

In [1]:
import scanpy as sc
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import colors
import seaborn as sb
from gprofiler import GProfiler

import rpy2.rinterface_lib.callbacks
import logging

from rpy2.robjects import pandas2ri
import anndata2ri

In [2]:
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

# Automatically convert rpy2 outputs to pandas dataframes
pandas2ri.activate()
anndata2ri.activate()
%load_ext rpy2.ipython

plt.rcParams['figure.figsize']=(8,8) #rescale figures
sc.settings.verbosity = 3
#sc.set_figure_params(dpi=200, dpi_save=300)
sc.logging.print_versions()

scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.17.3 scipy==1.3.0 pandas==0.25.3 scikit-learn==0.22.2.post1 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1


In [3]:
%%R
# Load libraries from correct lib Paths for my environment - ignore this!
.libPaths(.libPaths()[c(3,2,1)])

# Load all the R libraries we will be using in the notebook
library(scran)
library(ggplot2)
library(plyr)
library(MAST)

## Old LT

In [4]:
# load data

adata = sc.read('./sc_objects/old_LT_preprocessed.h5ad', cache = True)

In [5]:
#Create new Anndata object for use in MAST with non-batch corrected data as before
adata_raw = adata.copy()
adata_raw = sc.AnnData(X=adata.raw.X, obs=adata.obs, var=adata.raw.var)
adata_raw.obs['n_genes'] = (adata_raw.X > 0).sum(1) # recompute number of genes expressed per cell
adata = None
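
The `n_genes` recomputation above counts, for every cell, how many genes have a non-zero value. As a minimal sketch of that idea on a plain list-of-lists matrix (the toy counts and the helper name `genes_detected_per_cell` are made up for illustration and are not part of the notebook):

```python
# Toy count matrix: rows are cells, columns are genes (hypothetical values).
counts = [
    [0, 3, 1, 0],
    [2, 0, 0, 0],
    [1, 1, 4, 2],
]


def genes_detected_per_cell(matrix):
    """Return, for each cell (row), the number of genes with a value > 0."""
    return [sum(1 for value in row if value > 0) for row in matrix]


n_genes = genes_detected_per_cell(counts)
print(n_genes)
```

In the notebook the same count is taken along axis 1 of `adata_raw.X`, so it reflects the raw (non-batch-corrected) matrix rather than the value stored during preprocessing.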

In [6]:
adata_raw.obs.head()

Unnamed: 0,n_genes,percent_mito,n_counts,Female,rXist,Female_cat,leiden
CAGTAACAGCGATAGC-1,1971,0.023572,4115.0,True,2.586473,True,Activated
CCTTACGTCAACGAAA-1,1973,0.008254,5331.0,True,2.515928,True,Metabolism
GATCGTAAGGAACTGC-1,1409,0.038104,3123.0,True,3.163349,True,Activated
TGAGCCGAGAAGGTTT-1,1901,0.029093,4709.0,False,0.01,False,Activated
GGCTGGTCATTACCTT-1,2293,0.044041,6494.0,False,0.01,False,Metabolism


### Run MAST on total cells - Select genes expressed in >5% of cells (no adaptive thresholding)

In [7]:
%%R -i adata_raw

#Convert SingleCellExperiment to SingleCellAssay type as required by MAST
sca <- SceToSingleCellAssay(adata_raw, class = "SingleCellAssay")

#Scale Gene detection rate
colData(sca)$n_genes = scale(colData(sca)$n_genes)

# filter genes based on hard cutoff (have to be expressed in more than 5% of all cells)
freq_expressed <- 0.05
expressed_genes <- freq(sca) > freq_expressed
sca <- sca[expressed_genes,]
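
The hard cutoff above keeps a gene only if the fraction of cells expressing it is strictly greater than 0.05. A rough Python sketch of that filter, assuming "expressed" means a value above zero (the toy matrix, gene names, and helper names are hypothetical):

```python
def detection_frequency(matrix):
    """Fraction of cells (rows) in which each gene (column) has a value > 0."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    return [sum(1 for row in matrix if row[g] > 0) / n_cells for g in range(n_genes)]


def filter_genes(gene_names, matrix, min_freq=0.05):
    """Keep genes whose detection frequency is strictly above min_freq."""
    freqs = detection_frequency(matrix)
    return [name for name, f in zip(gene_names, freqs) if f > min_freq]


# 40 toy cells: gene_a is detected in 1 cell, gene_b in 2, gene_c in 3.
toy_genes = ["gene_a", "gene_b", "gene_c"]
toy_matrix = [[int(cell < 1), int(cell < 2), int(cell < 3)] for cell in range(40)]

kept = filter_genes(toy_genes, toy_matrix)
print(kept)
```

Because the comparison is strict, a gene detected in exactly 5% of cells (`gene_b` here) is dropped along with the rarer one.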

#### everything

background:  
`zlmCond_all <- zlm(formula = ~Female + n_genes + leiden, sca=sca)` (this runs the model)

The formula has the measurement variable (gene expression) on the left-hand side (LHS) and predictors present in colData on the right-hand side (RHS). It models the expression of genes while controlling for cluster, condition, sex, and n_genes.

Questions I can ask:

- sex differences controlling for treatments
- sex differences controlling for clusters (not necessary to analyze all the clusters)
- overall gene expression changes in treatment
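
To make the right-hand side of the formula concrete, here is a small sketch of how one cell's covariates might expand into model columns under treatment (dummy) coding. The column names mirror the contrast names printed by MAST in this notebook; the choice of `Quiescent` as the reference cluster and the helper `design_row` are assumptions for illustration only:

```python
def design_row(female, n_genes_scaled, leiden, leiden_levels):
    """Expand one cell's covariates into treatment-coded model columns.

    The first entry of leiden_levels is treated as the reference level and
    therefore gets no column of its own (assumption for this sketch).
    """
    _reference, *other_levels = leiden_levels
    row = {
        "(Intercept)": 1.0,
        "FemaleTRUE": 1.0 if female else 0.0,
        "n_genes": n_genes_scaled,
    }
    for level in other_levels:
        row["leiden" + level] = 1.0 if leiden == level else 0.0
    return row


# Assumed level order; the reference level is not confirmed by the notebook.
levels = ["Quiescent", "Activated", "Interferon", "Metabolism"]
example = design_row(female=True, n_genes_scaled=-0.5, leiden="Metabolism", leiden_levels=levels)
print(example)
```

Read this way, the `FemaleTRUE` coefficient is the female vs. male difference for cells in the same cluster and with the same scaled gene detection rate.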


In [8]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~Female + n_genes + leiden, sca=sca) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table
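
The summary table reports a continuous component (`C`), a discrete component (`D`), and a combined hurdle row (`H`). As far as I understand MAST's likelihood ratio test, the hurdle statistic is the sum of the two component chi-square statistics, with the degrees of freedom summed as well. A minimal sketch of that combination, assuming one degree of freedom per component (as for a single TRUE/FALSE covariate such as `Female`), so that the combined statistic has 2 degrees of freedom and the chi-square tail probability reduces to `exp(-x / 2)`; the function name and the input values are hypothetical:

```python
import math


def hurdle_lrt_pvalue(chisq_continuous, chisq_discrete):
    """P-value of the summed statistic, assuming 1 df per component (2 df total).

    For a chi-square distribution with 2 degrees of freedom the survival
    function is exactly exp(-x / 2), so no statistics library is needed.
    """
    combined = chisq_continuous + chisq_discrete
    return math.exp(-combined / 2.0)


# Hypothetical component statistics for a single gene.
p_hurdle = hurdle_lrt_pvalue(chisq_continuous=3.0, chisq_discrete=1.0)
print(p_hurdle)
```

This is only meant to show why the `H` rows are the ones used for p-values further down; the actual numbers come from `summary(..., doLRT=TRUE)`.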

In [9]:
%%R
head(summaryDt_all)

       primerid component         contrast   Pr..Chisq.       ci.hi       ci.lo
1 0610007P14Rik         C       FemaleTRUE 4.368316e-01  0.04506621 -0.01943866
2 0610007P14Rik         C  leidenActivated 1.635218e-03 -0.02110383 -0.08998700
3 0610007P14Rik         C leidenInterferon 4.026422e-01  0.09121898 -0.22752488
4 0610007P14Rik         C leidenMetabolism 7.656793e-01  0.04324560 -0.05878036
5 0610007P14Rik         C          n_genes 3.078488e-37 -0.09693541 -0.13055782
6 0610007P14Rik         C      (Intercept)           NA  1.33243472  1.27797353
         coef           z
1  0.01281377   0.7786865
2 -0.05554542  -3.1609177
3 -0.06815295  -0.8381484
4 -0.00776738  -0.2984297
5 -0.11374662 -13.2613491
6  1.30520413  93.9440615


In [11]:
%%R -o female_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: p-values adjusted with p.adjust (Benjamini-Hochberg)
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # create new table that keeps only rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table by FDR
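
For reference, `p.adjust(..., 'fdr')` applies the Benjamini-Hochberg adjustment. The sketch below reimplements that adjustment in plain Python and then applies the same 0.01 cutoff; the gene names and p-values are invented for illustration, and `bh_adjust` is a hypothetical helper, not something the notebook defines:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap at 1.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical genes and raw p-values.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
raw_p = [0.001, 0.04, 0.03, 0.5]

fdr = bh_adjust(raw_p)
significant = sorted(
    (g for g, q in zip(genes, fdr) if q < 0.01),
    key=lambda g: fdr[genes.index(g)],
)
print(fdr)
print(significant)
```

Note that the adjustment depends on how many genes are tested, so the 5% expression filter applied earlier also affects which genes pass the FDR cutoff.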

In [12]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [13]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MAST_raw_old_LT.csv')
female_all.to_csv('./write/MAST_female_old_LT.csv')

In [14]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

## New LT

In [15]:
# load data
adata = sc.read('./sc_objects/new_ct_LT_preprocessed.h5ad', cache = True)

In [16]:
#Create new Anndata object for use in MAST with non-batch corrected data as before
adata_raw = adata.copy()
adata_raw = sc.AnnData(X=adata.raw.X, obs=adata.obs, var=adata.raw.var)
adata_raw.obs['n_genes'] = (adata_raw.X > 0).sum(1) # recompute number of genes expressed per cell
adata = None

In [17]:
adata_raw.obs.head()

Unnamed: 0,n_genes,percent_mito,n_counts,Female,rXist,Female_cat,leiden
AAACCCACACAGAGCA-1,2664,0.049227,7699.0,False,0.01,False,Activated
AAACCCAGTATCGTGT-1,2539,0.054656,8032.0,False,0.01,False,Activated
AAACCCAGTCTGTCAA-1,3203,0.05021,9978.0,True,3.23099,True,Quiescent
AAACCCAGTGAACTAA-1,2778,0.061296,8043.0,True,2.988065,True,Quiescent
AAACCCATCGAAATCC-1,2738,0.050149,7697.0,True,3.091328,True,Quiescent


### Run MAST on total cells - Select genes expressed in >5% of cells (no adaptive thresholding)

In [18]:
%%R -i adata_raw

#Convert SingleCellExperiment to SingleCellAssay type as required by MAST
sca <- SceToSingleCellAssay(adata_raw, class = "SingleCellAssay")

#Scale Gene detection rate
colData(sca)$n_genes = scale(colData(sca)$n_genes)

# filter genes based on hard cutoff (have to be expressed in more than 5% of all cells)
freq_expressed <- 0.05
expressed_genes <- freq(sca) > freq_expressed
sca <- sca[expressed_genes,]

#### everything

background:  
`zlmCond_all <- zlm(formula = ~Female + n_genes + leiden, sca=sca)` (this runs the model)

The formula has the measurement variable (gene expression) on the left-hand side (LHS) and predictors present in colData on the right-hand side (RHS). It models the expression of genes while controlling for cluster, condition, sex, and n_genes.

Questions I can ask:

- sex differences controlling for treatments
- sex differences controlling for clusters (not necessary to analyze all the clusters)
- overall gene expression changes in treatment


In [19]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~Female + n_genes + leiden, sca=sca) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table

In [20]:
%%R
head(summaryDt_all)

       primerid component         contrast    Pr..Chisq.       ci.hi
1 0610009B22Rik         C       FemaleTRUE  2.790238e-01  0.02581479
2 0610009B22Rik         C  leidenActivated  4.312328e-01  0.01056834
3 0610009B22Rik         C leidenInterferon  3.173157e-01  0.05781854
4 0610009B22Rik         C leidenMetabolism  2.007469e-01  0.04768522
5 0610009B22Rik         C          n_genes 1.359192e-160 -0.14035658
6 0610009B22Rik         C      (Intercept)            NA  0.92054609
         ci.lo         coef           z
1 -0.007422376  0.009196208   1.0845832
2 -0.024795586 -0.007113623  -0.7885123
3 -0.178739232 -0.060460344  -1.0018702
4 -0.009972459  0.018856383   1.2819742
5 -0.159401431 -0.149879006 -30.8490154
6  0.890599447  0.905572769 118.5368237


In [21]:
%%R -o female_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: p-values adjusted with p.adjust (Benjamini-Hochberg)
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # create new table that keeps only rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table by FDR

In [22]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [23]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MAST_raw_new_LT.csv')
female_all.to_csv('./write/MAST_female_new_LT.csv')

In [24]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)
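
With both replicates processed, one possible way to compare them is to look for genes that pass the FDR filter in both result tables and change in the same direction. The sketch below assumes each table has a `primerid` and a `coef` column, as in the merged result above; the in-memory CSV text, gene names, values, and helper names are hypothetical stand-ins for the files written to `./write/`:

```python
import csv
import io


def load_hits(handle):
    """Map gene id -> logFC coefficient from an open CSV file or buffer."""
    return {row["primerid"]: float(row["coef"]) for row in csv.DictReader(handle)}


def concordant_hits(old_hits, new_hits):
    """Genes present in both tables whose coefficients share the same sign."""
    shared = sorted(set(old_hits) & set(new_hits))
    return [g for g in shared if (old_hits[g] > 0) == (new_hits[g] > 0)]


# Hypothetical stand-ins for the old and new significant-gene tables.
old_csv = "primerid,coef,FDR\ngene1,2.5,1e-50\ngene2,0.3,0.001\ngene3,-0.2,0.005\n"
new_csv = "primerid,coef,FDR\ngene1,2.1,1e-40\ngene2,-0.4,0.002\ngene4,0.5,0.003\n"

old_hits = load_hits(io.StringIO(old_csv))
new_hits = load_hits(io.StringIO(new_csv))
shared = concordant_hits(old_hits, new_hits)
print(shared)
```

To run this on the real output, the `io.StringIO` buffers would be replaced by file handles opened on the saved CSVs, after checking the column names they actually contain.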

In [25]:
!pip list

Package                Version            
---------------------- -------------------
alembic                1.3.0              
anndata                0.7.1              
anndata2ri             1.0.2              
async-generator        1.10               
attrs                  19.3.0             
backcall               0.1.0              
bleach                 3.1.0              
blinker                1.4                
certifi                2019.11.28         
certipy                0.1.3              
cffi                   1.13.2             
chardet                3.0.4              
conda                  4.7.12             
conda-package-handling 1.6.0              
cryptography           2.8                
cycler                 0.10.0             
decorator              4.4.1              
defusedxml             0.6.0              
entrypoints            0.3                
get-version            2.1                
gprofiler-official     1.0.0              
h5py       