# Linear modeling for differential ADT expression

In this notebook, we retrieve our CD4 and CD8 T cells and our cell type labels, then perform differential expression analysis of ADT data using linear models. Within each cell type, each drug treatment at each timepoint is compared to the DMSO-only control from the same timepoint.

To balance cell counts, we'll group cells by treatment or control and cell type, then downsample to the minimum number of cells across all samples in the group. For example, to test CD4 Naive T cells under Bortezomib treatment, we'll count the CD4 Naive cells in the Bortezomib and DMSO samples at 4, 24, and 72 hours, then randomly sample the smallest of those 6 counts from each sample.

We'll then perform comparisons between treatment and control at each timepoint (e.g. CD4 Naive w/Bortezomib @ 4 hr vs. CD4 Naive w/DMSO @4 hr).

## Load packages

hise: The Human Immune System Explorer R SDK package  
purrr: Functional programming tools  
dplyr: Dataframe handling functions  
Seurat: single cell genomics methods  

In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(purrr)
quiet_library(dplyr)
quiet_library(Seurat)

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the Seurat objects and cell type labels based on file UUIDs. The files are placed in the `cache/` subdirectory by default.

In [2]:
file_uuids <- list(
    "7bdac6ef-e5e5-4150-b4f3-9c1a1e250334", # CD4 T cell Seurat object
    "46438bc4-cde6-4ae6-b349-9c513dd9d16f", # CD8 T cell Seurat object
    "ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04", # CD4 type labels
    "4d6aade9-288c-452f-8f0d-ac59e539f4cc"  # CD8 type labels
)

In [None]:
fres <- cacheFiles(file_uuids)

## Select cells


In [None]:
cd4_labels <- read.csv("cache/ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04/cd4_cell_type_labels_2023-09-05.csv")
cd8_labels <- read.csv("cache/4d6aade9-288c-452f-8f0d-ac59e539f4cc/cd8_cell_type_labels_2023-09-05.csv")

In [None]:
all_labels <- rbind(cd4_labels, cd8_labels)

In [None]:
head(all_labels)

Exclude untreated cells - we won't use these for our treatment comparisons

In [None]:
all_labels <- all_labels %>%
  filter(treatment != "untreated")

Get counts of each cell type for each sample:

In [None]:
count_summary <- all_labels %>%
  group_by(treatment, timepoint, aifi_cell_type) %>%
  summarise(n_cells = n(),
            .groups = "keep") %>%
  ungroup()

Add a column for DMSO counts per type and timepoint

In [None]:
count_summary <- count_summary %>%
  ungroup() %>%
  group_by(aifi_cell_type, timepoint) %>%
  mutate(n_dmso = n_cells[treatment == "dmso"]) %>%
  ungroup() %>%
  filter(treatment != "dmso")

Regroup by treatment and cell type, and use treatment and DMSO counts to find minimums for sampling

In [None]:
type_minimums <- count_summary %>%
  group_by(treatment, aifi_cell_type) %>%
  mutate(n_sample = min(c(n_cells, n_dmso)))
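
To make the grouping concrete, here is a minimal sketch on made-up counts (the numbers and the `toy_counts` name are hypothetical, not taken from the dataset). It mirrors the steps above: attach the DMSO count for each timepoint, then take the minimum over all treatment and DMSO counts within a treatment and cell type.

```r
library(dplyr)

# Hypothetical counts for one cell type: one treatment plus DMSO at 3 timepoints
toy_counts <- data.frame(
  treatment = rep(c("bortezomib", "dmso"), each = 3),
  timepoint = rep(c(4, 24, 72), times = 2),
  aifi_cell_type = "t_cd4_naive",
  n_cells = c(900, 750, 820, 1000, 640, 910)
)

toy_minimums <- toy_counts %>%
  # pair each treatment row with the DMSO count from the same timepoint
  group_by(aifi_cell_type, timepoint) %>%
  mutate(n_dmso = n_cells[treatment == "dmso"]) %>%
  ungroup() %>%
  filter(treatment != "dmso") %>%
  # smallest count across all treatment and DMSO samples for this cell type
  group_by(treatment, aifi_cell_type) %>%
  mutate(n_sample = min(c(n_cells, n_dmso))) %>%
  ungroup()

toy_minimums
```

All three timepoints receive the same `n_sample`, the smallest of the six counts, so every sample in the comparison set is drawn down to one common size.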

In [None]:
comp_list <- map(
    1:nrow(type_minimums),
    function(i) {
        as.list(type_minimums[i,])
    }
)

## Sample cells for each test

Here, we'll sample cells for comparisons and generate a table of foreground and background cells to use for analysis.

In [None]:
sampled_comp_cells <- map(
    comp_list,
    function(comp) {
        set.seed(3030)
        
        tp <- comp$timepoint
        ct <- comp$aifi_cell_type
        
        fg_treat <- comp$treatment
        bg_treat <- "dmso"

        n_sample <- comp$n_sample

        fg_cells <- all_labels %>%
          filter(treatment == fg_treat,
                 timepoint == tp,
                 aifi_cell_type == ct) %>%
          sample_n(n_sample)
        bg_cells <- all_labels %>%
          filter(treatment == bg_treat,
                 timepoint == tp,
                 aifi_cell_type == ct) %>%
          sample_n(n_sample)

        rbind(bg_cells, fg_cells)
    }
)
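
The sampling step can be illustrated on a small table. This is a sketch under assumed data: `toy_labels` and its barcodes are hypothetical, and a single cell type and timepoint are implied so only `treatment` needs filtering.

```r
library(dplyr)

# Hypothetical labels: 8 treated cells and 5 control cells
toy_labels <- data.frame(
  barcodes = paste0("cell_", 1:13),
  treatment = c(rep("bortezomib", 8), rep("dmso", 5)),
  stringsAsFactors = FALSE
)

n_sample <- 5  # the smaller of the two group sizes

set.seed(3030)  # fix the RNG state so the same cells are drawn on every run

fg_cells <- toy_labels %>%
  filter(treatment == "bortezomib") %>%
  sample_n(n_sample)
bg_cells <- toy_labels %>%
  filter(treatment == "dmso") %>%
  sample_n(n_sample)

# Background rows first, then foreground rows
balanced <- rbind(bg_cells, fg_cells)
table(balanced$treatment)
```

Calling `set.seed()` inside the function, as the notebook does, means each comparison is reproducible on its own, regardless of the order in which comparisons are run.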

## Build matrices for each test

Now, we'll use the selected cells to build a data matrix for each comparison.

We'll use these together with the cell metadata to run `lm()`.

In [None]:
cd4_so <- readRDS("cache/7bdac6ef-e5e5-4150-b4f3-9c1a1e250334/filtered_cd4_te_seurat.rds")
cd8_so <- readRDS("cache/46438bc4-cde6-4ae6-b349-9c513dd9d16f/filtered_cd8_te_seurat.rds")

In [None]:
all_so <- merge(cd4_so, cd8_so)

In [None]:
DefaultAssay(all_so) <- "ADT"

In [15]:
all_so <- NormalizeData(
    all_so,
    normalization.method = "CLR"
)

In [16]:
# Pull the normalized ADT values (not RNA), since ADT is the assay we normalized above
all_mat <- all_so[["ADT"]]@data
rm(cd4_so)
rm(cd8_so)
rm(all_so)

In [17]:
sampled_comp_mats <- map(
    sampled_comp_cells,
    function(meta) {
        all_mat[,meta$barcodes]
    }
)
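
Selecting columns by barcode returns them in the order requested, which is what keeps each matrix aligned with its metadata rows. A minimal base R sketch (the `toy_mat` and `toy_meta` objects are hypothetical; a dense matrix stands in for the sparse matrix stored in the Seurat object):

```r
# Hypothetical feature-by-cell matrix with barcodes as column names
toy_mat <- matrix(
  1:12,
  nrow = 3,
  dimnames = list(c("CD27", "CD11c", "CD16"), paste0("cell_", 1:4))
)

# Hypothetical metadata for a sampled subset of cells, in sampled order
toy_meta <- data.frame(
  barcodes = c("cell_3", "cell_1"),
  treatment = c("dmso", "bortezomib"),
  stringsAsFactors = FALSE
)

# Character indexing picks columns by name, in the order given
sub_mat <- toy_mat[, toy_meta$barcodes]

# Column i of sub_mat now corresponds to row i of toy_meta
stopifnot(identical(colnames(sub_mat), toy_meta$barcodes))
```

A check like the final `stopifnot()` is a cheap safeguard before passing matrix values and metadata to `lm()` together.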

In [3]:
type_labels <- read.csv("../04_perturb_tea-seq_cell-type_labeling/data/aifi_cell_type_labels.csv") %>%
  select(barcodes, aifi_cell_type)

In [4]:
so <- readRDS("../03_perturb_tea-seq_preprocessing/data/seurat_objects/allcells_filtered_tea_so.rds")

In [6]:
adt_mat <- so[["ADT"]]@data

In [7]:
meta <- so@meta.data %>%
  left_join(type_labels) %>%
  filter(!is.na(aifi_cell_type)) %>%
  filter(aifi_cell_type != "t_cd8_mait")

[1m[22mJoining, by = "barcodes"


Here, we downsample to the same number of cells for each treatment and its DMSO control within a cell type, across all timepoints. This allows comparisons between timepoints, but the number of cells still differs between cell types, so interpret comparisons between cell types with care.

In [8]:
downsample <- meta %>%
  group_by(aifi_cell_type, treatment, timepoint) %>%
  summarise(n_cells = n(), .groups = "keep") %>%
  group_by(aifi_cell_type, treatment) %>%
  summarise(min_cells = min(n_cells), .groups = "keep") %>%
  group_by(aifi_cell_type) %>%
  mutate(downsample = ifelse(min_cells[treatment == "dmso"] < min_cells, min_cells[treatment == "dmso"], min_cells))

In [9]:
downsample

aifi_cell_type,treatment,min_cells,downsample
<chr>,<chr>,<int>,<int>
t_cd4_cm,bortezomib,1158,660
t_cd4_cm,dexamethasone,1112,660
t_cd4_cm,dmso,660,660
t_cd4_cm,lenalidomide,1435,660
t_cd4_cm,untreated,4944,660
t_cd4_em,bortezomib,502,301
t_cd4_em,dexamethasone,481,301
t_cd4_em,dmso,301,301
t_cd4_em,lenalidomide,668,301
t_cd4_em,untreated,1979,301
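
The capping rule in the last `mutate()` is an element-wise minimum against the DMSO count. As a sketch, the same result can be obtained with `pmin()`; the counts below are copied from the `t_cd4_cm` rows of the output above, and the `toy_min` and `toy_ds` names are introduced only for illustration.

```r
library(dplyr)

# Per-treatment minimum counts for t_cd4_cm, copied from the table above
toy_min <- data.frame(
  aifi_cell_type = "t_cd4_cm",
  treatment = c("bortezomib", "dexamethasone", "dmso", "lenalidomide", "untreated"),
  min_cells = c(1158, 1112, 660, 1435, 4944),
  stringsAsFactors = FALSE
)

toy_ds <- toy_min %>%
  group_by(aifi_cell_type) %>%
  # cap every treatment at the DMSO minimum for the same cell type
  mutate(downsample = pmin(min_cells, min_cells[treatment == "dmso"])) %>%
  ungroup()

toy_ds
```

Because DMSO has the fewest cells for this cell type, every treatment is capped at 660, matching the `downsample` column shown in the output.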


In [10]:
conditions <- meta %>%
  filter(treatment != "untreated") %>%
  select(aifi_cell_type, treatment, timepoint) %>%
  unique() %>%
  mutate(fg = treatment,
         bg = ifelse(treatment != "dmso", "dmso", "untreated"))

In [11]:
conditions[13,]

,aifi_cell_type,treatment,timepoint,fg,bg
,<chr>,<chr>,<int>,<chr>,<chr>
11641,t_cd4_naive,dmso,72,dmso,untreated


In [12]:
lm_res <- map_dfr(
    1:nrow(conditions),
    function(i) {
        set.seed(3030)
        
        ct <- conditions$aifi_cell_type[i]
        tp <- conditions$timepoint[i]
        fg <- conditions$fg[i]
        bg <- conditions$bg[i]
        
        ds <- downsample %>%
          filter(aifi_cell_type == ct,
                 treatment == fg)
        ds <- ds$downsample
        
        meta <- meta %>%
          filter(aifi_cell_type == ct,
                 # include 0 here so we don't lose untreated cells for dmso comparisons
                 timepoint %in% c(tp, 0))
        
        # filtering here based on treatment will drop the untreated cells from non-dmso comparisons
        fg_meta <- meta %>%
          filter(treatment == fg) %>%
          sample_n(ds)
        bg_meta <- meta %>%
          filter(treatment == bg) %>%
          sample_n(ds)
        
        fg_mat <- adt_mat[,fg_meta$barcodes]
        bg_mat <- adt_mat[,bg_meta$barcodes]
        
        lm_mat <- cbind(fg_mat, bg_mat)
        lm_meta <- rbind(fg_meta, bg_meta)
        
        map_dfr(rownames(lm_mat),
            function(feat) {
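                dat <- data.frame(
                    # Use the background as the reference level so the estimate is fg - bg.
                    # With a character column, lm() sorts levels alphabetically, so the
                    # sign of the estimate would depend on the treatment names.
                    treatment = factor(lm_meta$treatment, levels = c(bg, fg)),
                    val = lm_mat[feat,]
                )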
                names(dat)[2] <- feat
                
                lm_res <- lm(
                    formula = as.formula(paste0("`",feat,"` ~ treatment")), 
                    data = dat)
                
                coef <- summary(lm_res)$coefficients
                
                data.frame(
                    aifi_cell_type = ct,
                    timepoint = tp,
                    fg = fg,
                    bg = bg,
                    n_downsample = ds,
                    feature = feat,
                    estimate = coef[2,1],
                    std_error = coef[2,2],
                    t_value = coef[2,3],
                    nomP = coef[2,4],
                    fg_mean = mean(fg_mat[feat,]),
                    bg_mean = mean(bg_mat[feat,]),
                    fc = mean(fg_mat[feat,]) / mean(bg_mat[feat,])
                )
        })
    }
)
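
`lm()` converts a character predictor into a factor whose levels are sorted alphabetically, and the second row of the coefficient table describes the second level relative to the first. The sign of the estimate therefore depends on how the treatment name sorts relative to `dmso`. A minimal sketch on simulated values (the `toy` data frame and its numbers are hypothetical) showing how fixing the level order makes the estimate foreground minus background:

```r
set.seed(1)

# Hypothetical values: foreground "bortezomib" shifted up by about 1 vs. "dmso"
toy <- data.frame(
  treatment = rep(c("bortezomib", "dmso"), each = 50),
  val = c(rnorm(50, mean = 2), rnorm(50, mean = 1)),
  stringsAsFactors = FALSE
)

fg_mean <- mean(toy$val[toy$treatment == "bortezomib"])
bg_mean <- mean(toy$val[toy$treatment == "dmso"])

# Default: alphabetical levels make "bortezomib" the reference,
# so the second coefficient is dmso minus bortezomib
fit_default <- lm(val ~ treatment, data = toy)
est_default <- summary(fit_default)$coefficients[2, 1]

# Explicit levels: background first, so the second coefficient
# is foreground minus background
toy$treatment <- factor(toy$treatment, levels = c("dmso", "bortezomib"))
fit_releveled <- lm(val ~ treatment, data = toy)
est_releveled <- summary(fit_releveled)$coefficients[2, 1]

c(default = est_default,
  releveled = est_releveled,
  fg_minus_bg = fg_mean - bg_mean)
```

With only two groups, the slope equals the difference in group means, so the two fits give estimates of equal size and opposite sign. The p-value is unaffected by the level order; only the direction of the effect changes.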

In [13]:
lm_res$adjP <- p.adjust(lm_res$nomP, method = "BH")

In [14]:
head(lm_res)

,aifi_cell_type,timepoint,fg,bg,n_downsample,feature,estimate,std_error,t_value,nomP,fg_mean,bg_mean,fc,adjP
,<chr>,<int>,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,t_cd4_cm,72,lenalidomide,dmso,660,CD11c,-0.007439843,0.006111401,-1.217371,0.223681,0.1085019,0.11594174,0.9358312,0.5858312
2,t_cd4_cm,72,lenalidomide,dmso,660,CD278,0.010920693,0.012857006,0.8493963,0.395815,0.54506449,0.5341438,1.0204452,0.738955
3,t_cd4_cm,72,lenalidomide,dmso,660,CD11b,0.001346363,0.002957545,0.45523,0.6490187,0.02949852,0.02815216,1.0478245,0.8873589
4,t_cd4_cm,72,lenalidomide,dmso,660,CD16,0.006983361,0.005876437,1.1883664,0.2349032,0.12810037,0.12111701,1.057658,0.6025386
5,t_cd4_cm,72,lenalidomide,dmso,660,CD21,0.007479,0.008476136,0.8823597,0.3777432,0.30070889,0.29322989,1.0255056,0.7247399
6,t_cd4_cm,72,lenalidomide,dmso,660,CD27,-0.114406107,0.020247241,-5.6504543,1.958662e-08,2.13313049,2.24753659,0.9490971,3.972035e-07


In [15]:
write.csv(
    lm_res,
    "data/dep_treatment-ds_lm_results.csv")

In [16]:
sessionInfo()

R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] H5weaver_1.2.0     rhdf5_2.42.0       Matrix_1.5-3       data.table_1.14.6 
[5] SeuratObject_4.1.3 Seurat_4.3.0       purrr_1.0.0        dplyr_1.0.10      

loaded via a namespace (and not attached):
  [1] Rtsne_0.16             colorspace_2.0-3       deldir_1.0-6          
  [4] ellipsis_0.3