## 04b LSK by-cluster differential analysis with MAST (full)

Celltype:

**LSK - by cluster**


Run this model:

`zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca)`

Comparisons:

all cells
- male vs female in all cells controlling for chemical treatment etc
- chemical treatments controlling for sex and n_genes

each cluster
- male vs female within each cluster controlling for chemical treatment etc
- chemical treatments controlling for sex and n_genes


Run with this Docker image:

`docker run --rm -d --name test_eva -p 8883:8888 -e JUPYTER_ENABLE_LAB=YES -v /Users/efast/Documents/:/home/jovyan/work r_scanpy:vs4`


In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])?  y


In [2]:
import scanpy as sc
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import colors
import seaborn as sb
from gprofiler import GProfiler

import rpy2.rinterface_lib.callbacks
import logging

from rpy2.robjects import pandas2ri
import anndata2ri

In [3]:
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

# Automatically convert rpy2 outputs to pandas dataframes
pandas2ri.activate()
anndata2ri.activate()
%load_ext rpy2.ipython

plt.rcParams['figure.figsize']=(8,8) #rescale figures
sc.settings.verbosity = 3
#sc.set_figure_params(dpi=200, dpi_save=300)
sc.logging.print_versions()

scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.17.3 scipy==1.3.0 pandas==0.25.3 scikit-learn==0.22.2.post1 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1


In [4]:
%%R
# Load libraries from correct lib Paths for my environment - ignore this!
.libPaths(.libPaths()[c(3,2,1)])

# Load all the R libraries we will be using in the notebook
library(scran)
library(ggplot2)
library(plyr)
library(MAST)

In [5]:
# load data

adata = sc.read('./sc_objects/MPP_preprocessed.h5ad', cache = True)

In [6]:
#Create new Anndata object for use in MAST with non-batch corrected data as before
adata_raw = adata.copy()
adata_raw.X = adata.raw.X
adata_raw.obs['n_genes'] = (adata_raw.X > 0).sum(1) # recompute number of genes expressed per cell
adata = None
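
As a small illustration of the `n_genes` line above, and of the z-scoring that `scale()` applies to the same column in the MAST cell further down, the sketch below uses a made-up count matrix. `toy_counts` and the other variable names are assumptions for illustration and are not part of the notebook's data.

```python
import numpy as np

# Toy count matrix (hypothetical): 4 cells (rows) x 5 genes (columns).
toy_counts = np.array([
    [0, 3, 0, 1, 0],
    [2, 0, 0, 0, 0],
    [1, 1, 4, 0, 2],
    [0, 0, 0, 0, 7],
])

# Number of genes detected per cell: count the non-zero entries in each row.
n_genes = (toy_counts > 0).sum(axis=1)

# z-score the detection counts; ddof=1 gives the sample standard deviation,
# which is what R's scale() divides by with its default arguments.
n_genes_scaled = (n_genes - n_genes.mean()) / n_genes.std(ddof=1)

print(n_genes)
print(n_genes_scaled.round(3))
```

After scaling, the covariate has mean 0 and unit sample standard deviation, so its coefficient in the model is per standard deviation of detected genes.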

In [7]:
adata_raw.obs.head()

Unnamed: 0,assignment,batch,counts,demux_type,hto_type,rna_type,sample,select_cells,n_counts,log_counts,n_genes,percent_mito,Female,Female_cat,Female_str,sex_sample,rXist,leiden,umap_density_sample,umap_density_assignment
AAACGAAGTTGGACCC-0,MPP3/4,batch1,902.0,singlet,background,signal,ct,1.0,13508.0,9.511259,3988,0.057805,True,True,True,ct_true,3.158928,1,0.637315,0.847134
AAAGGATCACGCTGAC-0,MPP,batch1,869.0,singlet,background,signal,ct,1.0,18172.0,9.807637,4622,0.049802,True,True,True,ct_true,3.078445,0,0.783623,0.335968
AAAGGATGTAGTCTGT-0,MPP3/4,batch1,694.0,singlet,background,signal,ct,1.0,8688.0,9.070044,3151,0.059602,True,True,True,ct_true,3.20668,0,0.866588,0.514442
AAAGGGCAGCAGCGAT-0,MPP,batch1,848.0,singlet,background,signal,ct,1.0,8510.0,9.049468,2986,0.056965,False,False,False,ct_false,-0.049406,0,0.675819,0.257006
AAAGGTATCTTCGACC-0,MPP3/4,batch1,3446.0,singlet,signal,signal,ct,1.0,15875.0,9.672815,4297,0.046851,True,True,True,ct_true,2.978796,3,0.668953,0.709803


### Run MAST on all cells - select genes expressed in >5% of cells (no adaptive thresholding)

In [8]:
%%R -i adata_raw

#Convert SingleCellExperiment to SingleCellAssay type as required by MAST
sca <- SceToSingleCellAssay(adata_raw, class = "SingleCellAssay")

#Scale Gene detection rate
colData(sca)$n_genes = scale(colData(sca)$n_genes)

# filter genes based on hard cutoff (have to be expressed in at least 5% of all cells)
freq_expressed <- 0.05
expressed_genes <- freq(sca) > freq_expressed
sca <- sca[expressed_genes,]

#rename the sample to condition and make the ct the control
cond<-factor(colData(sca)$sample)
cond<-relevel(cond,"ct")
colData(sca)$condition<-cond

#Create data subsets for the different subpopulations 0-activated, 1- quiescent, 2-metabolism
sca_0 <- subset(sca, with(colData(sca), leiden=='0'))
sca_1 <- subset(sca, with(colData(sca), leiden=='1'))
sca_2<- subset(sca, with(colData(sca), leiden=='2'))
sca_3<- subset(sca, with(colData(sca), leiden=='3'))
sca_4<- subset(sca, with(colData(sca), leiden=='4'))
sca_5<- subset(sca, with(colData(sca), leiden=='5'))
sca_6<- subset(sca, with(colData(sca), leiden=='6'))
sca_7<- subset(sca, with(colData(sca), leiden=='7'))

#Filter out non-expressed genes in the subsets
print("Dimensions before subsetting:")
print(dim(sca_0))
print(dim(sca_1))
print(dim(sca_2))
print(dim(sca_3))
print(dim(sca_4))
print(dim(sca_5))
print(dim(sca_6))
print(dim(sca_7))
print("")

sca_0_filt = sca_0[rowSums(assay(sca_0)) != 0, ]
sca_1_filt = sca_1[rowSums(assay(sca_1)) != 0, ]
sca_2_filt = sca_2[rowSums(assay(sca_2)) != 0, ]
sca_3_filt = sca_3[rowSums(assay(sca_3)) != 0, ]
sca_4_filt = sca_4[rowSums(assay(sca_4)) != 0, ]
sca_5_filt = sca_5[rowSums(assay(sca_5)) != 0, ]
sca_6_filt = sca_6[rowSums(assay(sca_6)) != 0, ]
sca_7_filt = sca_7[rowSums(assay(sca_7)) != 0, ]

print("Dimensions after subsetting:")
print(dim(sca_0_filt))
print(dim(sca_1_filt))
print(dim(sca_2_filt))
print(dim(sca_3_filt))
print(dim(sca_4_filt))
print(dim(sca_5_filt))
print(dim(sca_6_filt))
print(dim(sca_7_filt))

[1] "Dimensions before subsetting:"
[1] 9988 1724
[1] 9988 1715
[1] 9988 1431
[1] 9988 1219
[1] 9988 1096
[1] 9988  733
[1] 9988  198
[1] 9988   75
[1] ""
[1] "Dimensions after subsetting:"
[1] 9987 1724
[1] 9988 1715
[1] 9988 1431
[1] 9988 1219
[1] 9988 1096
[1] 9987  733
[1] 9986  198
[1] 9940   75
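
The hard 5% cutoff used in the cell above can be sketched in Python as follows. This is a minimal illustration on toy data, assuming that `freq()` reports the fraction of cells in which a gene has non-zero expression; the function name `frequently_expressed` and the toy matrix are hypothetical.

```python
import numpy as np

def frequently_expressed(expr, min_freq=0.05):
    """Return a boolean mask of genes detected in more than min_freq of cells.

    expr is a genes x cells array, the same orientation as the R objects above.
    """
    detection_freq = (expr > 0).mean(axis=1)  # fraction of cells with expr > 0
    return detection_freq > min_freq

# Toy data (hypothetical): 3 genes x 40 cells.
expr = np.zeros((3, 40))
expr[0, :10] = 1.5   # detected in 25% of cells: kept
expr[1, :2] = 2.0    # detected in exactly 5% of cells: dropped (strict >)
expr[2, :1] = 0.7    # detected in 2.5% of cells: dropped

mask = frequently_expressed(expr)
filtered = expr[mask, :]
print(mask, filtered.shape)
```

Because the comparison is strict, a gene detected in exactly 5% of cells is removed, which mirrors `freq(sca) > freq_expressed`.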


#### everything

background:  
`zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca) # this runs the model`

`zlm` takes a formula with the measurement variable (gene expression) on the left-hand side (LHS) and predictors present in `colData` on the right-hand side (RHS). Here it models gene expression controlling for condition, sex and `n_genes`; cluster is handled by fitting each cluster subset separately.

Questions I can ask:
- sex differences controlling for treatments
- sex differences controlling for clusters (not necessary to analyze all the clusters)
- overall gene expression changes with treatment
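
To make the contrast names that appear in the MAST output (`conditionGCSF`, `FemaleTRUE`, `n_genes`) more concrete, here is a rough pandas sketch of the treatment-coded design matrix that a formula like `~condition + Female + n_genes` implies, with `ct` as the reference level. The metadata values are invented, and the construction only imitates R's default coding; it is not how MAST builds its matrix internally.

```python
import pandas as pd

# Toy cell metadata (hypothetical values; column names follow adata_raw.obs).
obs = pd.DataFrame({
    "sample": ["ct", "GCSF", "pIC", "ct", "indo", "dmPGE2"],
    "Female": [True, False, True, False, True, False],
    "n_genes": [0.3, -1.2, 0.8, 0.1, -0.4, 0.4],
})

# Put the control first so it becomes the reference level,
# in the spirit of relevel(cond, "ct").
levels = ["ct"] + sorted(set(obs["sample"]) - {"ct"})
condition = obs["sample"].astype(pd.CategoricalDtype(categories=levels))

# One indicator column per non-reference level (treatment coding).
design = pd.get_dummies(
    condition, prefix="condition", prefix_sep="", drop_first=True
).astype(int)
design.insert(0, "(Intercept)", 1)
design["FemaleTRUE"] = obs["Female"].astype(int)
design["n_genes"] = obs["n_genes"]

print(design)
```

Each `condition...` coefficient is then read as the difference between that treatment and `ct` at fixed sex and `n_genes`, and `FemaleTRUE` as the female vs male difference at fixed treatment and `n_genes`.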


In [9]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table

In [10]:
%%R
head(summaryDt_all)

       primerid component        contrast  Pr..Chisq.        ci.hi       ci.lo
1 0610009B22Rik         C      FemaleTRUE 0.522246061  0.008639874 -0.01702627
2 0610009B22Rik         C   conditionGCSF 0.389144188  0.012637791 -0.03247755
3 0610009B22Rik         C conditiondmPGE2 0.002807226 -0.012209495 -0.05855493
4 0610009B22Rik         C   conditionindo 0.780917069  0.018997649 -0.02528676
5 0610009B22Rik         C    conditionpIC 0.004000110 -0.010186267 -0.05347368
6 0610009B22Rik         C         n_genes 0.000000000 -0.146569743 -0.15958655
          coef           z
1 -0.004193196  -0.6404168
2 -0.009919881  -0.8619068
3 -0.035382211  -2.9926513
4 -0.003144554  -0.2783468
5 -0.031829971  -2.8823901
6 -0.153078148 -46.0984931


In [11]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sorts the table


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sorts the table


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
indo_all = indo_all[order(indo_all$FDR),] # sorts the table

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sorts the table
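
The per-contrast reformatting above (take the hurdle `H` p-values, join the `logFC` coefficients, adjust, filter, sort) can be sketched in pandas as shown below. This is an illustrative re-implementation under assumptions, not the code the notebook ran: `bh_adjust` and `extract_contrast` are hypothetical helper names, the toy table is invented, the column names are assumed to match the data.table ones, and missing p-values are assumed absent (R's `p.adjust` treats `NA` separately).

```python
import numpy as np
import pandas as pd

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (the 'fdr'/'BH' method of p.adjust)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)[::-1]                   # largest p-value first
    scaled = p[order] * n / np.arange(n, 0, -1)   # p * n / rank
    adjusted = np.minimum(1.0, np.minimum.accumulate(scaled))
    out = np.empty(n)
    out[order] = adjusted                         # back to the input order
    return out

def extract_contrast(summary_dt, contrast, fdr_cutoff=0.01):
    """Merge hurdle p-values with logFC coefficients for one contrast."""
    rows = summary_dt[summary_dt["contrast"] == contrast]
    pvals = rows.loc[rows["component"] == "H", ["primerid", "Pr(>Chisq)"]]
    logfc = rows.loc[rows["component"] == "logFC", ["primerid", "coef"]]
    result = pvals.merge(logfc, on="primerid")
    result["FDR"] = bh_adjust(result["Pr(>Chisq)"])
    significant = result[result["FDR"] < fdr_cutoff]   # keep FDR below cutoff
    return significant.sort_values("FDR").reset_index(drop=True)

# Toy summary table (hypothetical genes and numbers).
toy = pd.DataFrame({
    "primerid": ["GeneA", "GeneA", "GeneB", "GeneB", "GeneC", "GeneC"],
    "component": ["H", "logFC"] * 3,
    "contrast": ["FemaleTRUE"] * 6,
    "Pr(>Chisq)": [0.001, np.nan, 0.5, np.nan, 0.004, np.nan],
    "coef": [np.nan, 1.2, np.nan, 0.1, np.nan, -0.8],
})
female_toy = extract_contrast(toy, "FemaleTRUE")
print(female_toy)
```

Writing the step as a function of the contrast name would also remove the five near-identical copies in each cell, since only the contrast string changes between them.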

In [12]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [13]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_all.csv')
female_all.to_csv('./write/MPP_MAST_female_all.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_all.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_all.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_all.csv')
indo_all.to_csv('./write/MPP_MAST_indo_all.csv')
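
The same six tables are written again for every cluster below, with only the file-name suffix changing (`all`, `0`, `1`, ...). A small helper along these lines could generate the paths from that suffix. `mast_output_paths` and `TABLE_NAMES` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Table names as they appear in the CSV file names of this notebook.
TABLE_NAMES = ["raw", "female", "GCSF", "pIC", "dmPGE2", "indo"]

def mast_output_paths(label, out_dir="./write", prefix="MPP_MAST"):
    """Map each table name to its CSV path for one label, e.g. 'all' or '0'."""
    return {
        name: Path(out_dir) / f"{prefix}_{name}_{label}.csv"
        for name in TABLE_NAMES
    }

paths = mast_output_paths("all")
print(paths["female"].name)
```

With such a mapping, a save cell could reduce to a loop that calls `.to_csv(path)` on each DataFrame pulled from R, given a dict that pairs the table names with those DataFrames.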

#### cluster 0

In [14]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [15]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [16]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_0) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table

In [17]:
%%R
head(summaryDt_all)

       primerid component        contrast   Pr..Chisq.        ci.hi       ci.lo
1 0610009B22Rik         C      FemaleTRUE 9.246865e-01  0.029627914 -0.02689569
2 0610009B22Rik         C   conditionGCSF 8.683093e-01  0.048916755 -0.04127037
3 0610009B22Rik         C conditiondmPGE2 8.855204e-01  0.121182045 -0.14044286
4 0610009B22Rik         C   conditionindo 1.268386e-01  0.009556984 -0.07789180
5 0610009B22Rik         C    conditionpIC 1.448011e-01  0.011387514 -0.07837585
6 0610009B22Rik         C         n_genes 3.434844e-51 -0.143352234 -0.18155685
          coef            z
1  0.001366111   0.09474016
2  0.003823194   0.16617276
3 -0.009630407  -0.14429247
4 -0.034167407  -1.53156822
5 -0.033494166  -1.46267607
6 -0.162454542 -16.66840738


In [18]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sorts the table


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sorts the table


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
indo_all = indo_all[order(indo_all$FDR),] # sorts the table

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sorts the table

In [19]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [20]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_0.csv')
female_all.to_csv('./write/MPP_MAST_female_0.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_0.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_0.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_0.csv')
indo_all.to_csv('./write/MPP_MAST_indo_0.csv')

#### cluster 1

In [21]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [22]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [23]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_1) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table

In [24]:
%%R
head(summaryDt_all)

       primerid component        contrast   Pr..Chisq.        ci.hi       ci.lo
1 0610009B22Rik         C      FemaleTRUE 4.874937e-01  0.034752273 -0.01657955
2 0610009B22Rik         C   conditionGCSF 6.374720e-03 -0.016202727 -0.09851058
3 0610009B22Rik         C conditiondmPGE2 4.189599e-05 -0.052274193 -0.14742852
4 0610009B22Rik         C   conditionindo 5.668172e-02  0.001122156 -0.08085728
5 0610009B22Rik         C    conditionpIC 2.084126e-01  0.015682932 -0.07184023
6 0610009B22Rik         C         n_genes 4.475820e-35 -0.109721722 -0.14901110
          coef           z
1  0.009086361   0.6938753
2 -0.057356652  -2.7316221
3 -0.099851358  -4.1134243
4 -0.039867560  -1.9063070
5 -0.028078647  -1.2575674
6 -0.129366410 -12.9069751


In [25]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sorts the table


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sorts the table


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
indo_all = indo_all[order(indo_all$FDR),] # sorts the table

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sorts the table

In [26]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [27]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_1.csv')
female_all.to_csv('./write/MPP_MAST_female_1.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_1.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_1.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_1.csv')
indo_all.to_csv('./write/MPP_MAST_indo_1.csv')

#### cluster 2

In [28]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [29]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [30]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_2) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table

In [31]:
%%R
head(summaryDt_all)

       primerid component        contrast   Pr..Chisq.       ci.hi        ci.lo
1 0610009B22Rik         C      FemaleTRUE 1.569078e-01  0.06000048 -0.009410614
2 0610009B22Rik         C   conditionGCSF 2.215770e-01  0.08063364 -0.018353553
3 0610009B22Rik         C conditiondmPGE2 6.738794e-02  0.14859984 -0.004396470
4 0610009B22Rik         C   conditionindo 3.264536e-01  0.07352890 -0.024188497
5 0610009B22Rik         C    conditionpIC 8.369710e-01  0.06168840 -0.049881808
6 0610009B22Rik         C         n_genes 2.394047e-54 -0.16323686 -0.202575379
          coef           z
1  0.025294933   1.4285082
2  0.031140046   1.2331568
3  0.072101687   1.8473218
4  0.024670200   0.9896437
5  0.005903297   0.2074075
6 -0.182906120 -18.2258725


In [32]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sorts the table


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sorts the table


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
indo_all = indo_all[order(indo_all$FDR),] # sorts the table

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sorts the table

In [33]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [34]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_2.csv')
female_all.to_csv('./write/MPP_MAST_female_2.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_2.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_2.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_2.csv')
indo_all.to_csv('./write/MPP_MAST_indo_2.csv')

#### cluster 3

In [35]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [36]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [37]:
%%R 
#Define & run hurdle model 
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_3) # this runs the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # extracts the data, gives datatable with summary of fit, doLRT=TRUE extracts likelihood ratio test p-value
summaryDt_all <- summaryCond_all$datatable # reformats into a table

In [38]:
%%R
head(summaryDt_all)

       primerid component        contrast   Pr..Chisq.       ci.hi        ci.lo
1 0610009B22Rik         C      FemaleTRUE 2.239121e-01  0.01125951 -0.048014166
2 0610009B22Rik         C   conditionGCSF 5.939550e-01  0.06313577 -0.036158912
3 0610009B22Rik         C conditiondmPGE2 3.891061e-01  0.06712875 -0.026171538
4 0610009B22Rik         C   conditionindo 2.375743e-02  0.10337077  0.007410254
5 0610009B22Rik         C    conditionpIC 3.682567e-01  0.07421308 -0.027544802
6 0610009B22Rik         C         n_genes 5.968189e-26 -0.11042659 -0.158334976
         coef           z
1 -0.01837733  -1.2153424
2  0.01348843   0.5324925
3  0.02047861   0.8603903
4  0.05539051   2.2626683
5  0.02333414   0.8988802
6 -0.13438078 -10.9952154


In [39]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
female_all = female_all[order(female_all$FDR),] # sorts the table


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values from p.adjust
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=F] # keep only the rows with FDR<0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sorts the table


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sort by ascending FDR


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
indo_all = indo_all[order(indo_all$FDR),] # sort by ascending FDR

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sort by ascending FDR
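
The FDR step in the cell above relies on R's `p.adjust` with method `'fdr'`, which is an alias for the Benjamini-Hochberg procedure. As a rough illustration of what that adjustment and the `FDR<0.01` filter do, here is a minimal pure-Python sketch. The function names `benjamini_hochberg` and `significant_genes` and the example p-values are hypothetical, and the sketch assumes there are no missing p-values (R's `p.adjust` handles `NA` separately).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values, sorted from the largest to the smallest p-value.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(genes, pvalues, threshold=0.01):
    """Keep (gene, FDR) pairs with FDR below the threshold, sorted by FDR."""
    fdr = benjamini_hochberg(pvalues)
    kept = [(gene, q) for gene, q in zip(genes, fdr) if q < threshold]
    return sorted(kept, key=lambda pair: pair[1])


# Made-up p-values for four hypothetical genes.
example_p = [0.01, 0.04, 0.03, 0.005]
example_fdr = benjamini_hochberg(example_p)
example_hits = significant_genes(["a", "b", "c", "d"], example_p, threshold=0.03)
```

Note that the filter keeps the rows below the threshold, so a gene missing from `female_all` or the treatment tables was not significant at FDR < 0.01.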

In [40]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [41]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_3.csv')
female_all.to_csv('./write/MPP_MAST_female_3.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_3.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_3.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_3.csv')
indo_all.to_csv('./write/MPP_MAST_indo_3.csv')
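
The same six files are written once per cluster, with only the numeric suffix changing. One way to avoid copy-paste slips in the file names is to build the paths from the cluster label. This is only a sketch: `mast_output_paths` is a hypothetical helper, not part of the notebook, and it simply reproduces the naming pattern used in the cell above.

```python
def mast_output_paths(cluster, out_dir="./write", prefix="MPP_MAST"):
    """Build the CSV path for each result table of one cluster."""
    names = ["raw", "female", "GCSF", "pIC", "dmPGE2", "indo"]
    return {name: f"{out_dir}/{prefix}_{name}_{cluster}.csv" for name in names}


paths_cluster_3 = mast_output_paths(3)
# In the notebook this could be combined with the exported tables, e.g.:
# tables = {"raw": MAST_raw_all, "female": female_all, ...}
# for name, path in paths_cluster_3.items():
#     tables[name].to_csv(path)
```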

#### cluster 4

In [42]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [43]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [44]:
%%R 
# Define and fit the hurdle model on the cluster 4 cells only (the _all variable names are reused from the all-cells run)
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_4) # fits the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # summarises the fit; doLRT=TRUE adds likelihood ratio test p-values
summaryDt_all <- summaryCond_all$datatable # extracts the summary as a data.table
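
The filtering cells in this notebook take their p-values from the rows with `component=='H'`, the combined hurdle test. As MAST describes it, the hurdle statistic is the sum of the likelihood ratio statistics of the discrete (`D`) and continuous (`C`) components, with their degrees of freedom added. The sketch below illustrates this for the single-coefficient case, which is an assumption made here for simplicity: each component contributes 1 degree of freedom, so the combined test has 2, and the chi-square survival function with 2 degrees of freedom is exactly `exp(-x/2)`. The function name and the example statistics are hypothetical.

```python
import math


def hurdle_pvalue(lambda_discrete, lambda_continuous):
    """Combined p-value for one coefficient tested in both model components.

    Assumes 1 degree of freedom per component, so the summed statistic is
    compared against a chi-square distribution with 2 degrees of freedom,
    whose survival function is exp(-x / 2).
    """
    combined = lambda_discrete + lambda_continuous
    return math.exp(-combined / 2.0)


# Made-up likelihood ratio statistics for one gene and one contrast.
p_combined = hurdle_pvalue(3.0, 1.0)
```

A gene can therefore reach significance in the `H` rows through a change in the fraction of expressing cells, a change in expression level among expressing cells, or both.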

In [45]:
%%R
head(summaryDt_all)

       primerid component        contrast   Pr..Chisq.       ci.hi       ci.lo
1 0610009B22Rik         C      FemaleTRUE 3.568153e-01  0.02052194 -0.05815414
2 0610009B22Rik         C   conditionGCSF 8.104646e-02  0.47901843 -0.02314259
3 0610009B22Rik         C conditiondmPGE2 3.027419e-01  0.22457713 -0.06801481
4 0610009B22Rik         C   conditionindo 9.760643e-01  0.25487731 -0.24706703
5 0610009B22Rik         C         n_genes 7.906074e-39 -0.15249893 -0.19459720
6 0610009B22Rik         C     (Intercept)           NA  0.77168387  0.47601809
          coef            z
1 -0.018816100  -0.93748646
2  0.227937921   1.77931021
3  0.078281161   1.04875247
4  0.003905139   0.03049714
5 -0.173548064 -16.15970919
6  0.623850979   8.27099732


In [46]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
female_all = female_all[order(female_all$FDR),] # sort by ascending FDR


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sort by ascending FDR


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sort by ascending FDR


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
indo_all = indo_all[order(indo_all$FDR),] # sort by ascending FDR

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sort by ascending FDR

In [47]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [48]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_4.csv')
female_all.to_csv('./write/MPP_MAST_female_4.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_4.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_4.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_4.csv')
indo_all.to_csv('./write/MPP_MAST_indo_4.csv')

#### cluster 5

In [49]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [50]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [51]:
%%R 
# Define and fit the hurdle model on the cluster 5 cells only (the _all variable names are reused from the all-cells run)
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_5) # fits the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # summarises the fit; doLRT=TRUE adds likelihood ratio test p-values
summaryDt_all <- summaryCond_all$datatable # extracts the summary as a data.table

In [52]:
%%R
head(summaryDt_all)

       primerid component        contrast   Pr..Chisq.        ci.hi       ci.lo
1 0610009B22Rik         C      FemaleTRUE 9.361584e-02  0.006175291 -0.08598613
2 0610009B22Rik         C   conditionGCSF 1.000000e+00 -0.060540833 -0.50901154
3 0610009B22Rik         C conditiondmPGE2 1.000000e+00           NA          NA
4 0610009B22Rik         C   conditionindo 1.000000e+00  0.344282726  0.06120433
5 0610009B22Rik         C    conditionpIC 1.000000e+00           NA          NA
6 0610009B22Rik         C         n_genes 4.243372e-23 -0.123405931 -0.17436250
         coef          z
1 -0.03990542  -1.697309
2 -0.28477618  -2.489131
3          NA         NA
4  0.20274353   2.807491
5          NA         NA
6 -0.14888422 -11.453192
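
In the cluster 5 output above, the `conditiondmPGE2` and `conditionpIC` rows have `NA` coefficients and a p-value of 1. A plausible reason, which is an assumption and not checked in this notebook, is that too few cluster 5 cells from those treatments express the gene for the coefficient to be estimated. Before interpreting the filtered tables it may help to count how often this happens per contrast. The sketch below does that for rows copied from the printed output, with `NA` written as `None`; `count_missing_coefs` is a hypothetical helper.

```python
import math
from collections import Counter


def count_missing_coefs(rows):
    """Count, per contrast, the rows whose coefficient is None or NaN."""
    missing = Counter()
    for row in rows:
        coef = row["coef"]
        if coef is None or (isinstance(coef, float) and math.isnan(coef)):
            missing[row["contrast"]] += 1
    return dict(missing)


# Continuous-component rows for 0610009B22Rik, copied from the output above.
cluster_5_rows = [
    {"contrast": "FemaleTRUE", "coef": -0.03990542},
    {"contrast": "conditionGCSF", "coef": -0.28477618},
    {"contrast": "conditiondmPGE2", "coef": None},
    {"contrast": "conditionindo", "coef": 0.20274353},
    {"contrast": "conditionpIC", "coef": None},
    {"contrast": "n_genes", "coef": -0.14888422},
]
missing_per_contrast = count_missing_coefs(cluster_5_rows)
```

Applied to the full exported table, a contrast with many missing coefficients would suggest that its per-cluster results rest on very few cells.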


In [53]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
female_all = female_all[order(female_all$FDR),] # sort by ascending FDR


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sort by ascending FDR


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sort by ascending FDR


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
indo_all = indo_all[order(indo_all$FDR),] # sort by ascending FDR

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sort by ascending FDR

In [54]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [55]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_5.csv')
female_all.to_csv('./write/MPP_MAST_female_5.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_5.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_5.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_5.csv')
indo_all.to_csv('./write/MPP_MAST_indo_5.csv')

#### cluster 6

In [56]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [57]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [58]:
%%R 
# Define and fit the hurdle model on the cluster 6 cells only (the _all variable names are reused from the all-cells run)
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_6) # fits the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # summarises the fit; doLRT=TRUE adds likelihood ratio test p-values
summaryDt_all <- summaryCond_all$datatable # extracts the summary as a data.table

In [59]:
%%R
head(summaryDt_all)

       primerid component      contrast Pr..Chisq.       ci.hi      ci.lo
1 0610009B22Rik         C    FemaleTRUE 0.10105019  0.01002705 -0.1118778
2 0610009B22Rik         C conditionGCSF 0.63827942  0.23384126 -0.3799781
3 0610009B22Rik         C conditionindo 0.62368761  0.23009341 -0.3822517
4 0610009B22Rik         C       n_genes 0.00257002 -0.03436084 -0.1569249
5 0610009B22Rik         C   (Intercept)         NA  0.74202129  0.5372715
6 0610009B22Rik         D    FemaleTRUE 0.53504709  0.40185442 -0.7470822
         coef          z
1 -0.05092538 -1.6375377
2 -0.07306842 -0.4666242
3 -0.07607913 -0.4870207
4 -0.09564289 -3.0589158
5  0.63964638 12.2460064
6 -0.17261389 -0.5889220
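
In the cluster 6 output above, the continuous-component rows for this gene list only `FemaleTRUE`, `conditionGCSF`, `conditionindo`, `n_genes`, and the intercept. `conditiondmPGE2` and `conditionpIC` do not appear, presumably because cluster 6 contains no cells from those treatments (an assumption based on the first six rows only). If so, the reformatting cell for this cluster would produce empty `dmPGE2_all` and `pIC_all` tables, and the corresponding CSV files would contain no genes. A minimal sketch of a check for this, with a hypothetical helper name:

```python
EXPECTED_CONTRASTS = [
    "FemaleTRUE",
    "conditionGCSF",
    "conditiondmPGE2",
    "conditionindo",
    "conditionpIC",
]


def missing_contrasts(present, expected=EXPECTED_CONTRASTS):
    """Return the expected contrasts that are absent from the fitted model."""
    present_set = set(present)
    return [contrast for contrast in expected if contrast not in present_set]


# Contrast labels seen in the cluster 6 output above.
cluster_6_contrasts = ["FemaleTRUE", "conditionGCSF", "conditionindo",
                       "n_genes", "(Intercept)"]
absent_in_cluster_6 = missing_contrasts(cluster_6_contrasts)
```

An empty result table for a missing contrast means the treatment was not testable in that cluster, not that the treatment had no effect.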


In [60]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
female_all = female_all[order(female_all$FDR),] # sort by ascending FDR


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sort by ascending FDR


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sort by ascending FDR


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
indo_all = indo_all[order(indo_all$FDR),] # sort by ascending FDR

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sort by ascending FDR

In [61]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [62]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_6.csv')
female_all.to_csv('./write/MPP_MAST_female_6.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_6.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_6.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_6.csv')
indo_all.to_csv('./write/MPP_MAST_indo_6.csv')

#### cluster 7

In [63]:
%%R
# list all variables 
ls()

 [1] "adata_raw"         "cond"              "dmPGE2_all"       
 [4] "expressed_genes"   "female_all"        "freq_expressed"   
 [7] "GCSF_all"          "indo_all"          "MAST_raw_all"     
[10] "pIC_all"           "result_all_dmPGE2" "result_all_Female"
[13] "result_all_GCSF"   "result_all_indo"   "result_all_pIC"   
[16] "sca"               "sca_0"             "sca_0_filt"       
[19] "sca_1"             "sca_1_filt"        "sca_2"            
[22] "sca_2_filt"        "sca_3"             "sca_3_filt"       
[25] "sca_4"             "sca_4_filt"        "sca_5"            
[28] "sca_5_filt"        "sca_6"             "sca_6_filt"       
[31] "sca_7"             "sca_7_filt"        "summaryCond_all"  
[34] "summaryDt_all"     "zlmCond_all"      


In [64]:
%%R
# remove previous variables

rm(zlmCond_all)
rm(summaryDt_all)
rm(summaryCond_all)
rm(MAST_raw_all)

In [65]:
%%R 
# Define and fit the hurdle model on the cluster 7 cells only (the _all variable names are reused from the all-cells run)
zlmCond_all <- zlm(formula = ~condition + Female + n_genes, sca=sca_7) # fits the model
summaryCond_all <- summary(zlmCond_all, doLRT=TRUE) # summarises the fit; doLRT=TRUE adds likelihood ratio test p-values
summaryDt_all <- summaryCond_all$datatable # extracts the summary as a data.table

In [66]:
%%R
head(summaryDt_all)

       primerid component        contrast Pr..Chisq.       ci.hi       ci.lo
1 0610009B22Rik         C      FemaleTRUE 0.17215376  0.03564091 -0.20239787
2 0610009B22Rik         C   conditionGCSF 1.00000000  0.47774010  0.12466209
3 0610009B22Rik         C conditiondmPGE2 1.00000000  0.19183367 -0.10632129
4 0610009B22Rik         C   conditionindo 1.00000000  0.38206226 -0.01790786
5 0610009B22Rik         C    conditionpIC 1.00000000          NA          NA
6 0610009B22Rik         C         n_genes 0.01136561 -0.02423996 -0.16466593
         coef          z
1 -0.08337848 -1.3730437
2  0.30120109  3.3439822
3  0.04275619  0.5621278
4  0.18207720  1.7844570
5          NA         NA
6 -0.09445295 -2.6366116


In [67]:
%%R -o female_all -o GCSF_all -o dmPGE2_all -o indo_all -o pIC_all

# reformat for female
result_all_Female <- merge(summaryDt_all[contrast=='FemaleTRUE' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='FemaleTRUE' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_Female[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
female_all = result_all_Female[result_all_Female$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
female_all = female_all[order(female_all$FDR),] # sort by ascending FDR


# reformat for GCSF
result_all_GCSF <- merge(summaryDt_all[contrast=='conditionGCSF' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionGCSF' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_GCSF[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
GCSF_all = result_all_GCSF[result_all_GCSF$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
GCSF_all = GCSF_all[order(GCSF_all$FDR),] # sort by ascending FDR


# reformat for dmPGE2
result_all_dmPGE2 <- merge(summaryDt_all[contrast=='conditiondmPGE2' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditiondmPGE2' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_dmPGE2[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
dmPGE2_all = result_all_dmPGE2[result_all_dmPGE2$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
dmPGE2_all = dmPGE2_all[order(dmPGE2_all$FDR),] # sort by ascending FDR


# reformat for indo
result_all_indo <- merge(summaryDt_all[contrast=='conditionindo' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionindo' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_indo[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
indo_all = result_all_indo[result_all_indo$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
indo_all = indo_all[order(indo_all$FDR),] # sort by ascending FDR

# reformat for pIC
result_all_pIC <- merge(summaryDt_all[contrast=='conditionpIC' & component=='H',.(primerid, `Pr(>Chisq)`)], #P-vals
                  summaryDt_all[contrast=='conditionpIC' & component=='logFC', .(primerid, coef)],
                  by='primerid') #logFC coefficients
#Correct for multiple testing (FDR correction) and filtering
result_all_pIC[,FDR:=p.adjust(`Pr(>Chisq)`, 'fdr')] # add an FDR column: Benjamini-Hochberg adjusted p-values
pIC_all = result_all_pIC[result_all_pIC$FDR<0.01,, drop=FALSE] # keep only the rows with FDR < 0.01
pIC_all = pIC_all[order(pIC_all$FDR),] # sort by ascending FDR

In [68]:
%%R -o MAST_raw_all

MAST_raw_all <- summaryDt_all

In [69]:
# save files as .csvs

MAST_raw_all.to_csv('./write/MPP_MAST_raw_7.csv')
female_all.to_csv('./write/MPP_MAST_female_7.csv')
GCSF_all.to_csv('./write/MPP_MAST_GCSF_7.csv')
pIC_all.to_csv('./write/MPP_MAST_pIC_7.csv')
dmPGE2_all.to_csv('./write/MPP_MAST_dmPGE2_7.csv')
indo_all.to_csv('./write/MPP_MAST_indo_7.csv')

In [70]:
sc.logging.print_versions()
pd.show_versions()

scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.17.3 scipy==1.3.0 pandas==0.25.3 scikit-learn==0.22.2.post1 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.76-linuxkit
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 41.6.0.post20191101
Cython           : None
pytest           : 5.3.5
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.

In [71]:
%%R

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    tools     stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] Matrix_1.2-17               MAST_1.12.0                
 [3] plyr_1.8.4                  ggplot2_3.2.1              
 [5] scran_1.14.1                SingleCellExperiment_1.8.0 
 [7] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
 [9] BiocParallel_1.20.0         matrixStats_