# Perform MAST Differential Expression Tests

In this notebook, we retrieve our CD4 and CD8 T cell data and our cell type labels, then perform MAST differential gene expression tests. Within each cell type, each drug treatment at each timepoint is compared to the DMSO-only control from the same timepoint.

To balance cell counts, we'll group cells by treatment or control and cell type, then use the minimum number of cells across all samples. For example, to test CD4 Naive T cells under Bortezomib treatment, we'll examine the number of CD4 Naive cells in Bortezomib and DMSO at 4, 24, and 72 hours and randomly sample based on the minimum counts from all 6 samples.
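As a worked example of this balancing rule, the sketch below uses made-up cell counts to find the sample size for one treatment and cell type. The `counts` data frame and its numbers are hypothetical, not values from this dataset.

```r
# Hypothetical counts of CD4 Naive cells per sample (illustrative only)
counts <- data.frame(
    treatment = rep(c("bortezomib", "dmso"), each = 3),
    timepoint = rep(c(4, 24, 72), times = 2),
    n_cells   = c(1200, 950, 640, 1500, 1100, 800)
)

# The sample size is the smallest count among all 6 samples
n_sample <- min(counts$n_cells)
n_sample
```

With these toy numbers, every Bortezomib and DMSO sample for this cell type would be downsampled to 640 cells, so each timepoint comparison is 640 vs. 640.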

We'll then perform comparisons between treatment and control at each timepoint (e.g. CD4 Naive with Bortezomib at 4 hr vs. CD4 Naive with DMSO at 4 hr).

For MAST, we also need to include the Cellular Detection Rate (CDR) as a covariate to control for differences in the number of genes detected per cell.

## Load packages

hise: The Human Immune System Explorer R SDK package  
purrr: Functional programming tools  
furrr: Parallelization of functional programming using the `future` framework  
dplyr: Dataframe handling functions  
ggplot2: plotting functions  
Seurat: single cell genomics methods  
MAST: single cell differential expression tests

In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(purrr)
quiet_library(furrr)
quiet_library(dplyr)
quiet_library(ggplot2)
quiet_library(Seurat)
quiet_library(MAST)

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the Seurat objects and cell type labels based on file UUIDs. These will be placed in the `cache/` subdirectory by default.

In [2]:
file_uuids <- list(
    "7bdac6ef-e5e5-4150-b4f3-9c1a1e250334", # CD4 T cell Seurat object
    "46438bc4-cde6-4ae6-b349-9c513dd9d16f", # CD8 T cell Seurat object
    "ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04", # CD4 type labels
    "4d6aade9-288c-452f-8f0d-ac59e539f4cc"  # CD8 type labels
)

In [3]:
fres <- cacheFiles(file_uuids)

## Select cells


In [4]:
cd4_labels <- read.csv("cache/ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04/cd4_cell_type_labels_2023-09-05.csv")
cd8_labels <- read.csv("cache/4d6aade9-288c-452f-8f0d-ac59e539f4cc/cd8_cell_type_labels_2023-09-05.csv")

In [5]:
all_labels <- rbind(cd4_labels, cd8_labels)

In [6]:
head(all_labels)

| | barcodes | treatment | timepoint | predicted.celltype.l1.score | predicted.celltype.l1 | predicted.celltype.l2.score | predicted.celltype.l2 | predicted.celltype.l3.score | predicted.celltype.l3 | aifi_cell_type |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | 2da9d348fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.7379073 | CD4 Naive | 0.7379073 | CD4 Naive | t_cd4_naive |
| 2 | 2daec6d2fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 1.0 | CD4 Naive | 1.0 | CD4 Naive | t_cd4_naive |
| 3 | 2db119d2fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.6491493 | CD4 TCM | 0.4892181 | CD4 TCM_1 | t_cd4_naive |
| 4 | 2db582c4fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.8972198 | CD4 Naive | 0.8972198 | CD4 Naive | t_cd4_naive |
| 5 | 2db6727efb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.3939763 | CD4 TCM | 0.2974696 | CD4 Naive | t_cd4_naive |
| 6 | 2dc35a20fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.6306972 | CD4 Naive | 0.6306972 | CD4 Naive | t_cd4_naive |


Exclude untreated cells - we won't use these for our treatment comparisons

In [7]:
all_labels <- all_labels %>%
  filter(treatment != "untreated")

Get counts of each cell type for each sample:

In [8]:
count_summary <- all_labels %>%
  group_by(treatment, timepoint, aifi_cell_type) %>%
  summarise(n_cells = n(),
            .groups = "keep") %>%
  ungroup()

Add a column for DMSO counts per type and timepoint

In [9]:
count_summary <- count_summary %>%
  ungroup() %>%
  group_by(aifi_cell_type, timepoint) %>%
  mutate(n_dmso = n_cells[treatment == "dmso"]) %>%
  ungroup() %>%
  filter(treatment != "dmso")

Regroup by treatment and cell type, and use treatment and DMSO counts to find minimums for sampling

In [10]:
type_minimums <- count_summary %>%
  group_by(treatment, aifi_cell_type) %>%
  # Minimum across all timepoints for both the treatment and its DMSO controls
  mutate(n_sample = min(c(n_cells, n_dmso))) %>%
  ungroup()

In [11]:
comp_list <- map(
    1:nrow(type_minimums),
    function(i) {
        as.list(type_minimums[i,])
    }
)

## Sample cells for each test

Here, we'll sample cells for comparisons and generate a table of foreground and background cells to use for analysis.

In [12]:
sampled_comp_cells <- map(
    comp_list,
    function(comp) {
        set.seed(3030)
        
        tp <- comp$timepoint
        ct <- comp$aifi_cell_type
        
        fg_treat <- comp$treatment
        bg_treat <- "dmso"

        n_sample <- comp$n_sample

        fg_cells <- all_labels %>%
          filter(treatment == fg_treat,
                 timepoint == tp,
                 aifi_cell_type == ct) %>%
          sample_n(n_sample)
        bg_cells <- all_labels %>%
          filter(treatment == bg_treat,
                 timepoint == tp,
                 aifi_cell_type == ct) %>%
          sample_n(n_sample)

        rbind(bg_cells, fg_cells)
    }
)
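The same balanced sampling can be sketched in base R on a toy label table. `toy_labels` and `sample_group()` are hypothetical names used only for illustration; because the seed is set before drawing, re-running the sketch selects the same cells.

```r
# Toy label table: 10 treated and 6 DMSO cells (hypothetical data)
toy_labels <- data.frame(
    barcodes  = paste0("cell", 1:16),
    treatment = rep(c("bortezomib", "dmso"), times = c(10, 6)),
    stringsAsFactors = FALSE
)

# Draw n rows without replacement from one treatment group
sample_group <- function(labels, treat, n) {
    group <- labels[labels$treatment == treat, ]
    group[sample(nrow(group), n), ]
}

# Size of the smaller group
n_sample <- min(table(toy_labels$treatment))

set.seed(3030)
fg <- sample_group(toy_labels, "bortezomib", n_sample)
bg <- sample_group(toy_labels, "dmso", n_sample)
balanced <- rbind(bg, fg)

# Both groups now contribute the same number of cells
table(balanced$treatment)
```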

## Build matrices for each test

Now, we'll use the selected cells to build a data matrix for each comparison.

We'll use these together with the cell metadata to run MAST. Because some steps in the analysis are single-threaded, I've found that it's quite efficient to build a list of datasets and run MAST on each comparison using its own core, rather than rely on the built-in parallelization provided by MAST. This may not be possible on larger datasets, where duplication of data in RAM would be prohibitive (i.e. the shared DMSO control data), but on the scale used for this analysis we should get away with it.

In [13]:
cd4_so <- readRDS("cache/7bdac6ef-e5e5-4150-b4f3-9c1a1e250334/filtered_cd4_te_seurat.rds")
cd8_so <- readRDS("cache/46438bc4-cde6-4ae6-b349-9c513dd9d16f/filtered_cd8_te_seurat.rds")

In [14]:
all_so <- merge(cd4_so, cd8_so)

In [15]:
all_so <- NormalizeData(
    all_so, 
    normalization.method = "LogNormalize"
)
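For reference, Seurat documents `LogNormalize` as dividing each cell's counts by that cell's total, multiplying by `scale.factor` (10,000 by default), and applying `log1p`. The sketch below reproduces that arithmetic on a toy count matrix in base R. `toy_counts` is hypothetical, and this illustrates the formula rather than Seurat's own implementation.

```r
# Toy counts: 3 genes (rows) x 2 cells (columns), hypothetical values
toy_counts <- matrix(
    c(1, 0, 3,   # cell1
      0, 2, 2),  # cell2
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

scale_factor <- 1e4

# Divide each column by its total, scale, then take log(1 + x)
toy_norm <- log1p(sweep(toy_counts, 2, colSums(toy_counts), "/") * scale_factor)
toy_norm
```

Zero counts stay at zero after this transformation, which is what allows the normalized matrix to remain sparse.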

In [16]:
all_mat <- all_so[["RNA"]]@data
rm(cd4_so)
rm(cd8_so)
rm(all_so)

In [17]:
sampled_comp_mats <- map(
    sampled_comp_cells,
    function(meta) {
        all_mat[,meta$barcodes]
    }
)

## Filter genes prior to testing

Before we perform differential tests, we'll remove genes that have low expression in both test groups. For our purposes, we keep a gene only if it is expressed in more than 5% of either the treatment or the DMSO control cells. Note that we don't require that the gene be expressed in more than 5% of **both** groups: a difference between low/no expression in one group and > 5% expression in the other group would still be good to capture.

In [18]:
min_gene_frac <- 0.05

In [19]:
filtered_comp_mats <- map2(
    sampled_comp_cells,
    sampled_comp_mats,
    function(meta, mat) {
        fg_meta <- meta %>%
          filter(treatment != "dmso")
        bg_meta <- meta %>%
          filter(treatment == "dmso")

        fg_mat <- mat[,fg_meta$barcodes]
        bg_mat <- mat[,bg_meta$barcodes]

        # diff() on the column pointers of the transposed sparse matrix is a fast way
        # to count the non-zero entries (expressing cells) for each gene
        fg_fracs <- diff(t(fg_mat)@p) / ncol(fg_mat)
        bg_fracs <- diff(t(bg_mat)@p) / ncol(bg_mat)

        keep_genes <- fg_fracs > min_gene_frac | bg_fracs > min_gene_frac

        mat[keep_genes,]
    }
)
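The pointer trick can be checked on a toy matrix. In a column-compressed sparse matrix (`dgCMatrix`), the `p` slot stores column offsets, so `diff()` of `p` on the transposed matrix gives the number of non-zero entries per gene. The matrix below is hypothetical and only illustrates the calculation.

```r
library(Matrix)

# Toy expression: 3 genes x 4 cells; first 2 cells treated, last 2 DMSO
toy_mat <- Matrix(
    c(1, 0, 0, 0,   # geneA: detected in 1 of 2 treated cells only
      0, 0, 0, 0,   # geneB: never detected
      2, 1, 3, 1),  # geneC: detected in every cell
    nrow = 3, byrow = TRUE, sparse = TRUE,
    dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

fg_mat <- toy_mat[, 1:2]
bg_mat <- toy_mat[, 3:4]

# Fraction of cells with a non-zero value for each gene
fg_fracs <- diff(t(fg_mat)@p) / ncol(fg_mat)
bg_fracs <- diff(t(bg_mat)@p) / ncol(bg_mat)

# Same answer as the slower, more explicit dense calculation
stopifnot(all(fg_fracs == rowMeans(as.matrix(fg_mat) != 0)))

# Keep genes passing the threshold in either group
keep_genes <- fg_fracs > 0.05 | bg_fracs > 0.05
rownames(toy_mat)[keep_genes]
```

Here `geneA` is kept even though it is absent from the DMSO cells, which is the one-sided case the filter is designed to retain.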

In [20]:
map_int(filtered_comp_mats, nrow)

## Compute CDR values

To account for technical factors that may affect gene detection in each cell, we'll also need to calculate a Cellular Detection Rate (CDR) value for each cell, and use it as a covariate in our MAST model.

Since we just filtered genes, we'll compute the CDR for each cell from the remaining genes used for analysis, and add that to our metadata. Here, the CDR is the number of detected genes per cell, centered and scaled.

In [21]:
sampled_comp_cells <- map2(
    sampled_comp_cells,
    filtered_comp_mats,
    function(meta, mat) {
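        # Number of detected (non-zero) genes per cell, from the column pointers
        gene_counts <- diff(mat@p)

        # scale() returns a one-column matrix; as.numeric() stores a plain vector
        meta %>%
          mutate(cdr = as.numeric(scale(gene_counts)))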
    }
)
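As a small illustration on hypothetical data, the number of detected genes per cell is the per-column non-zero count, and `scale()` centers it to mean 0 and standard deviation 1. The toy matrix and the `cdr` vector below are assumptions for demonstration only.

```r
library(Matrix)

# Toy matrix: 4 genes x 3 cells (hypothetical values)
toy_mat <- Matrix(
    c(1, 0, 2,
      0, 0, 1,
      3, 0, 1,
      1, 1, 1),
    nrow = 4, byrow = TRUE, sparse = TRUE
)

# Non-zero entries per column = genes detected per cell
gene_counts <- diff(toy_mat@p)
gene_counts

# Centered and scaled detection rate, flattened to a plain numeric vector
cdr <- as.numeric(scale(gene_counts))
cdr
```

The three cells have 3, 1, and 4 detected genes, so the second cell receives the lowest scaled value.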

## Run MAST tests

Now, we have all of the pieces we need to run our MAST tests easily. We'll run these across multiple cores using `furrr`'s `future_map2()`.

This takes around 15-16 minutes to run.

In [22]:
future::plan(future::multisession, workers = 12)

In [23]:
quietly <- function(...) { suppressMessages(suppressWarnings(...)) }

In [24]:
zlm_res_list <- future_map2(
    sampled_comp_cells,
    filtered_comp_mats,
    function(meta, mat) {
        treatments <- unique(meta$treatment)
        fg_treat <- treatments[treatments != "dmso"]
        treat_levels <- c("dmso", fg_treat)
        meta$treatment <- factor(meta$treatment, levels = treat_levels)
        
        fdat <- data.frame(x = rownames(mat))
        rownames(fdat) <- rownames(mat)
        
        sca <- FromMatrix(
            exprsArray = as.matrix(mat),
            cData = meta,
            fData = fdat)

        zlm_res <- quietly(
            zlm(formula = formula(~ treatment + cdr), 
                sca = sca,
                method = "bayesglm",
                ebayes = TRUE,
                parallel = FALSE
            )
        )

        zlm_res
    }
)

`fData` has no primerid.  I'll make something up.

`cData` has no wellKey.  I'll make something up.

Assuming data assay in position 1, with name et is log-transformed.

### Extract results based on treatment

This takes an additional 6 minutes or so.

In [25]:
lrt_list <- paste0("treatment", type_minimums$treatment)

mast_res_list <- future_map2(
    zlm_res_list,
    lrt_list,
    function(zlm_res, lrt_vars) {
        suppressMessages(
            MAST::summary(
                object = zlm_res, 
                doLRT = lrt_vars)
        )
    }
)
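The `doLRT` names follow R's standard naming of model coefficients: the factor name followed by the non-reference level. The base-R sketch below uses hypothetical metadata to show why setting `dmso` as the first factor level yields a coefficient called `treatmentbortezomib`.

```r
# Hypothetical metadata for one comparison
meta <- data.frame(
    treatment = c("bortezomib", "dmso", "bortezomib", "dmso"),
    cdr       = c(0.5, -1.2, 0.9, -0.2)
)

# Put the control first so it becomes the reference level
meta$treatment <- factor(meta$treatment, levels = c("dmso", "bortezomib"))

# Column names of the design matrix are the coefficient names
design <- model.matrix(~ treatment + cdr, data = meta)
colnames(design)
```

The treatment column of the design matrix is 1 for treated cells and 0 for DMSO cells, so its coefficient describes the treatment effect relative to the control.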

### Format results

In [26]:
formatted_mast_res <- map2(
    mast_res_list,
    comp_list,
    function(res, comp) {
        all_res <- res$datatable %>%
          as.data.frame() %>%
          filter(contrast == paste0("treatment", comp$treatment))

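        # Split by model component: C (continuous), D (discrete), H (hurdle), logFC
        split_res <- split(all_res, all_res$component)

        # all_res is already filtered to this contrast, so no second filter is needed.
        # The columns added below assume genes are in the same order in every component.
        split_res$H %>%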
          dplyr::rename(
              gene = primerid,
              nomP = `Pr(>Chisq)`) %>%
          mutate(
              fg = comp$treatment,
              bg = "dmso",
              aifi_cell_type = comp$aifi_cell_type,
              timepoint = comp$timepoint,
              n_sample = comp$n_sample,
              logFC = split_res$logFC$coef,
              coef_C = split_res$C$coef,
              coef_D = split_res$D$coef) %>%
          select(aifi_cell_type, timepoint, fg, bg, n_sample,
                 gene, coef_C, coef_D, logFC, nomP) %>%
          mutate(adjP = p.adjust(nomP, method = "BH"))
                 
    }
)
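Because `p.adjust()` is called inside the per-comparison function, the Benjamini-Hochberg correction is applied within each comparison rather than across all tests at once. A toy example with hypothetical p-values shows the adjustment itself:

```r
# Hypothetical nominal p-values for 4 genes in one comparison
nomP <- c(0.001, 0.01, 0.03, 0.5)

# BH multiplies each p-value by n / rank, then enforces monotonicity
adjP <- p.adjust(nomP, method = "BH")
adjP
```

With four tests, the smallest p-value is multiplied by 4/1 and the largest by 4/4, giving adjusted values of 0.004, 0.02, 0.04, and 0.5.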

In [27]:
all_res <- do.call(rbind, formatted_mast_res)

In [28]:
head(all_res)

| | aifi_cell_type | timepoint | fg | bg | n_sample | gene | coef_C | coef_D | logFC | nomP | adjP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | t_cd4_cm | 4 | bortezomib | dmso | 648 | A1BG-AS1 | 0.028411773 | -0.12714495 | -0.00860394 | 0.7403701 | 0.999237 |
| 2 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAGAB | 0.032287238 | -0.25574995 | -0.05429509 | 0.1455986 | 0.9918874 |
| 3 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAK1 | -0.009058149 | 0.04769567 | 0.0148158 | 0.8762539 | 0.999237 |
| 4 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAMDC | 0.062294844 | 0.19250673 | 0.04313545 | 0.1602309 | 0.9918874 |
| 5 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAMP | 0.018080684 | -0.16673307 | -0.01001203 | 0.7332528 | 0.999237 |
| 6 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AARS | 0.019996604 | -0.13135182 | -0.01430162 | 0.6785156 | 0.999237 |


In [29]:
sig_res <- all_res %>%
  mutate(treat_time = paste0(fg, "_", timepoint)) %>%
  filter(adjP < 0.01)

In [30]:
table(sig_res$treat_time, sig_res$aifi_cell_type)

                  
                   t_cd4_cm t_cd4_em t_cd4_naive t_cd4_treg t_cd8_memory t_cd8_naive
  bortezomib_24         528      129         247         16          256         486
  bortezomib_72        3532     1640        5802        370         1090        3315
  dexamethasone_24      560      182         870         63          105         290
  dexamethasone_4       433      167         580         42           87         204
  lenalidomide_24       197       14         686          0            4         176
  lenalidomide_4        114       12         188          0            0          88
  lenalidomide_72        36        0         292          0            1          43

## Generate output files

For downstream use, we'll output the table of MAST differential expression results for all comparisons.

In [31]:
dir.create("output")

In [32]:
write.csv(all_res,
          paste0("output/all_mast_deg_",Sys.Date(),".csv"),
          quote = FALSE, row.names = FALSE)

## Store results in HISE

Finally, we store the output file in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the FileIDs from the original fileset for reference.

In [33]:
study_space_uuid <- "40df6403-29f0-4b45-ab7d-f46d420c422e"
title <- paste("VRd TEA-seq MAST DEG", Sys.Date())

In [34]:
out_files <- list.files(
    "output",
    full.names = TRUE
)
out_list <- as.list(out_files)

In [35]:
out_list

In [36]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = file_uuids,
    store = "project",
    doPrompt = FALSE
)

In [37]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] MAST_1.26.0                 SingleCellExperiment_1.22.0
 [3] SummarizedExperiment_1.30.2 Biobase_2.60.0             
 [5] GenomicRanges_1.52.0        GenomeInfoDb_1.36.1        
 [7] IRanges_2.34.1              S4Vectors_0.38.1           
 [9] BiocGenerics_0.46.0         MatrixGenerics_1.12.3 