# Differential expression
Which genes are upregulated in tumor cells relative to pnc hyperplastic cells?  
**Prerequisites**  
Perform batch correction on the Shiraishi *et al.* data first; see `batch-correction-scgen.ipynb`.

## Introduction
Concordance across differential expression (DEG) approaches is low, and pseudobulk analyses outperform cell-level analyses ([source](https://www.sc-best-practices.org/conditions/differential_gene_expression.html)).  
However, we have only one scRNA-seq sample per phenotype (gnp, pnc, tumor), so pseudobulk aggregation would leave a single observation per group with no replicates. Therefore, we have to test at the cell level.  
To estimate concordance, we will run several analyses and compare their results:
- t-test on single-cell data (scanpy [`rank_genes_groups`](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html))
- generalized linear model (GLM) on single-cell data (MAST, implemented here)
- GLM on bulk RNA-seq (edgeR, to be implemented in a new notebook)
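
One simple way to quantify concordance between the methods listed above is the Jaccard index of their significant gene sets. The following is a minimal sketch in base R, not part of the original analysis: the helper `jaccard` and the gene lists in `deg_sets` are hypothetical placeholders, and in practice each list would hold the genes that pass the FDR cutoff for one method.

```r
# Hypothetical helper: Jaccard index between two sets of gene names
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)  # undefined when both sets are empty
  length(intersect(a, b)) / union_size
}

# Toy gene lists standing in for each method's significant genes (not real results)
deg_sets <- list(
  ttest = c("GeneA", "GeneB", "GeneC", "GeneD"),
  mast  = c("GeneB", "GeneC", "GeneE"),
  edger = c("GeneC", "GeneD", "GeneE", "GeneF")
)

# Pairwise concordance matrix across methods
methods <- names(deg_sets)
concordance <- sapply(methods, function(m1) {
  sapply(methods, function(m2) jaccard(deg_sets[[m1]], deg_sets[[m2]]))
})
print(round(concordance, 2))
```

Values close to 1 indicate that two methods call nearly the same genes; values close to 0 indicate little overlap.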


## MAST
Following the tutorial at https://www.sc-best-practices.org/conditions/differential_gene_expression.html#single-cell-specific

In [None]:
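Sys.setenv(LANGUAGE = "en") # show R messages in English; set to "ja" if you prefer Japanese

library(zellkonverter) # readH5AD
library(MAST)          # SceToSingleCellAssay, zlm
library(data.table)    # `:=` and .() syntax used on the results table below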

In [None]:
# load the merged AnnData file as a SingleCellExperiment
path <- 'out/shiraishi_merge.h5ad'
data <- readH5AD(path, verbose = TRUE)
data

In [None]:
# create a MAST object
sca <- SceToSingleCellAssay(data, class = "SingleCellAssay")

# keep only cells annotated as proliferative or differentiated
sca <- sca[, colData(sca)$annotation %in% c("ProliferativeCells", "DifferentiatedCells")]
sca

In [None]:
# cellular detection rate: scaled number of genes detected in each cell
cdr2 <- colSums(assay(sca) > 0)
colData(sca)$ngeneson <- scale(cdr2)
# store the sample of origin (gnp, pnc, tumor) as a factor
label <- factor(colData(sca)$sample)
# set pnc as the reference level, so coefficients are expressed relative to pnc
label <- relevel(label, "pnc")
colData(sca)$label <- label
# define and fit the hurdle model (no trailing comma after the last argument)
zlmCond <- zlm(formula = ~ngeneson + label,
               sca = sca,
               method = 'bayesglm',
               ebayes = TRUE,
               strictConvergence = FALSE)
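
To make the covariate and reference-level steps in the cell above concrete, here is a small base R sketch on a toy matrix. The matrix `expr` and the three labels are made-up values, not data from this analysis. It shows how the detection-rate covariate is computed and why the coefficient tested later is named `labeltumor`.

```r
# Toy expression matrix: 4 genes x 3 cells (made-up values, filled column by column)
expr <- matrix(c(0, 2, 0, 1,
                 3, 1, 0, 2,
                 0, 0, 0, 4),
               nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Number of genes detected per cell, then centered and scaled
n_detected <- colSums(expr > 0)
ngeneson <- as.numeric(scale(n_detected))

# One label per cell; pnc becomes the reference (first) level
label <- relevel(factor(c("tumor", "pnc", "gnp")), ref = "pnc")
print(levels(label))

# Design-matrix columns: non-reference levels get their own coefficient
design_cols <- colnames(model.matrix(~ ngeneson + label))
print(design_cols)
```

Because pnc is the reference level, the design matrix contains `labelgnp` and `labeltumor` columns, and `labeltumor` captures the tumor vs. pnc difference.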

In [None]:
# perform likelihood-ratio test for the condition that we are interested in    
summaryCond <- summary(zlmCond, doLRT='labeltumor')
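
For intuition about what the likelihood-ratio test above evaluates, the sketch below reproduces the idea of a two-part hurdle test for a single simulated gene in base R. It is an illustration under simplifying assumptions, not MAST's implementation: it uses plain `glm` and `lm` fits instead of `bayesglm` with empirical Bayes shrinkage, omits the `ngeneson` covariate, and all data are simulated.

```r
set.seed(1)
n <- 200
label <- factor(rep(c("pnc", "tumor"), each = n / 2), levels = c("pnc", "tumor"))

# Simulated gene: detection probability and positive expression both differ by label
detected <- rbinom(n, 1, ifelse(label == "tumor", 0.7, 0.4))
expr <- ifelse(detected == 1,
               rnorm(n, mean = ifelse(label == "tumor", 2.5, 2.0), sd = 0.5),
               0)
d <- data.frame(expr, detected, label)

# Discrete part: logistic regression on whether the gene is detected
disc_full <- glm(detected ~ label, family = binomial, data = d)
disc_null <- glm(detected ~ 1, family = binomial, data = d)
chisq_disc <- as.numeric(2 * (logLik(disc_full) - logLik(disc_null)))

# Continuous part: Gaussian model on cells where the gene is detected
pos <- d[d$detected == 1, ]
cont_full <- lm(expr ~ label, data = pos)
cont_null <- lm(expr ~ 1, data = pos)
chisq_cont <- as.numeric(2 * (logLik(cont_full) - logLik(cont_null)))

# Hurdle test: sum the two statistics; one dropped coefficient per part gives df = 2
chisq_hurdle <- chisq_disc + chisq_cont
p_hurdle <- pchisq(chisq_hurdle, df = 2, lower.tail = FALSE)
print(c(discrete = chisq_disc, continuous = chisq_cont, hurdle = chisq_hurdle, p = p_hurdle))
```

The combined statistic corresponds to the `component == 'H'` rows that are extracted from the MAST summary table in the next cell.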

In [None]:
# get the table with log-fold changes and p-values
summaryDt <- summaryCond$datatable
result <- merge(summaryDt[contrast=='labeltumor' & component=='H', .(primerid, `Pr(>Chisq)`)], # hurdle p-values
                summaryDt[contrast=='labeltumor' & component=='logFC', .(primerid, coef)],     # logFC coefficients
                by='primerid')
# MAST reports coefficients on the natural-log scale; divide by log(2) to get log2 values comparable to edgeR
result[, coef := coef / log(2)]
# correct for multiple testing (Benjamini-Hochberg)
result[, FDR := p.adjust(`Pr(>Chisq)`, 'fdr')]
# keep genes significant at 1% FDR
result <- result[FDR < 0.01]

# drop genes with missing values (e.g. no logFC estimate)
result <- stats::na.omit(as.data.frame(result))
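
The log conversion and multiple-testing steps above can be checked by hand on a toy table. This is a base R sketch with made-up numbers, not results from this analysis; the data frame `toy` is hypothetical and only mirrors the columns used above.

```r
# Toy table mimicking the merged MAST output (made-up numbers)
toy <- data.frame(
  gene = paste0("gene", 1:5),
  p    = c(0.001, 0.01, 0.02, 0.04, 0.5),
  coef = c(log(4), log(2), -log(2), 0.1, 0)  # natural-log fold changes
)

# Natural log to log2: log2(x) = ln(x) / ln(2)
toy$log2FC <- toy$coef / log(2)

# Benjamini-Hochberg adjustment ('fdr' is an alias for 'BH' in p.adjust)
toy$FDR <- p.adjust(toy$p, method = "fdr")

# Keep genes passing the 1% FDR cutoff
sig <- toy[toy$FDR < 0.01, ]
print(toy)
print(sig)
```

Here a natural-log coefficient of `log(4)` becomes a log2 fold change of 2, and only `gene1` survives the cutoff: its raw p-value of 0.001 is adjusted to 0.005, whereas the raw p-value of 0.01 for `gene2` is adjusted to 0.025.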

In [None]:
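library(data.table)
# columns after the merge are: primerid, Pr(>Chisq), coef, FDR
colnames(result) <- c('gene','p','log2FC','FDR')
# sort by decreasing log2 fold change (most upregulated in tumor first)
setorder(result, -log2FC)
# make sure the output directory exists before writing
dir.create('out/deg', recursive = TRUE, showWarnings = FALSE)
fwrite(result, 'out/deg/tumor_pnc_mast_deg.tsv', sep='\t')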