# Linear modeling for differential ADT expression

In this notebook, we retrieve our CD4 and CD8 T cells and our cell type labels, then perform differential expression analysis of antibody-derived tag (ADT) data using linear models. Within each cell type, we compare each drug treatment at each timepoint to the DMSO-only control from the same timepoint.

To balance cell counts, we'll group cells by treatment or control and cell type, then downsample every sample in the group to the minimum cell count found across those samples. For example, to test CD4 Naive T cells under Bortezomib treatment, we'll count the CD4 Naive cells in Bortezomib and DMSO at 4, 24, and 72 hours, then randomly sample each of the 6 samples down to the smallest of those counts.

We'll then perform comparisons between treatment and control at each timepoint (e.g. CD4 Naive w/Bortezomib @ 4 hr vs. CD4 Naive w/DMSO @ 4 hr).

## Load packages

hise: The Human Immune System Explorer R SDK package  
purrr: Functional programming tools  
dplyr: Dataframe handling functions  
Seurat: Single-cell genomics methods  

In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(purrr)
quiet_library(dplyr)
quiet_library(Seurat)

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the Seurat objects and cell type labels based on their file UUIDs. The files are placed in the `cache/` subdirectory by default.

In [2]:
file_uuids <- list(
    "7bdac6ef-e5e5-4150-b4f3-9c1a1e250334", # CD4 T cell Seurat object
    "46438bc4-cde6-4ae6-b349-9c513dd9d16f", # CD8 T cell Seurat object
    "ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04", # CD4 type labels
    "4d6aade9-288c-452f-8f0d-ac59e539f4cc"  # CD8 type labels
)

In [3]:
fres <- cacheFiles(file_uuids)

## Select cells


In [4]:
cd4_labels <- read.csv("cache/ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04/cd4_cell_type_labels_2023-09-05.csv")
cd8_labels <- read.csv("cache/4d6aade9-288c-452f-8f0d-ac59e539f4cc/cd8_cell_type_labels_2023-09-05.csv")

In [5]:
all_labels <- rbind(cd4_labels, cd8_labels)

In [6]:
head(all_labels)

| | barcodes | treatment | timepoint | predicted.celltype.l1.score | predicted.celltype.l1 | predicted.celltype.l2.score | predicted.celltype.l2 | predicted.celltype.l3.score | predicted.celltype.l3 | aifi_cell_type |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | 2da9d348fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.7379073 | CD4 Naive | 0.7379073 | CD4 Naive | t_cd4_naive |
| 2 | 2daec6d2fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 1.0 | CD4 Naive | 1.0 | CD4 Naive | t_cd4_naive |
| 3 | 2db119d2fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.6491493 | CD4 TCM | 0.4892181 | CD4 TCM_1 | t_cd4_naive |
| 4 | 2db582c4fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.8972198 | CD4 Naive | 0.8972198 | CD4 Naive | t_cd4_naive |
| 5 | 2db6727efb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.3939763 | CD4 TCM | 0.2974696 | CD4 Naive | t_cd4_naive |
| 6 | 2dc35a20fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD4 T | 0.6306972 | CD4 Naive | 0.6306972 | CD4 Naive | t_cd4_naive |


Exclude untreated cells, which we won't use for our treatment comparisons:

In [7]:
all_labels <- all_labels %>%
  filter(treatment != "untreated")

Get counts of each cell type for each sample:

In [8]:
count_summary <- all_labels %>%
  group_by(treatment, timepoint, aifi_cell_type) %>%
  summarise(n_cells = n(),
            .groups = "keep") %>%
  ungroup()

Add a column with the DMSO count for the same cell type and timepoint, then drop the DMSO rows:

In [9]:
count_summary <- count_summary %>%
  ungroup() %>%
  group_by(aifi_cell_type, timepoint) %>%
  mutate(n_dmso = n_cells[treatment == "dmso"]) %>%
  ungroup() %>%
  filter(treatment != "dmso")

Regroup by treatment and cell type, then take the minimum of the treatment and DMSO counts across all timepoints as the number of cells to sample:

In [10]:
type_minimums <- count_summary %>%
  group_by(treatment, aifi_cell_type) %>%
  mutate(n_sample = min(c(n_cells, n_dmso)))
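
To make the balancing logic concrete, here is a minimal sketch on made-up counts. The `toy_counts` table and its numbers are hypothetical and not taken from this dataset. It repeats the same steps as above: attach the DMSO count per cell type and timepoint, drop the DMSO rows, then take the minimum across all timepoints for each treatment and cell type.

```r
library(dplyr)

# Hypothetical counts for one cell type, DMSO plus one treatment, three timepoints
toy_counts <- data.frame(
  treatment      = rep(c("dmso", "bortezomib"), each = 3),
  timepoint      = rep(c(4, 24, 72), times = 2),
  aifi_cell_type = "t_cd4_naive",
  n_cells        = c(900, 850, 700, 800, 650, 400)
)

toy_minimums <- toy_counts %>%
  # Attach the matching DMSO count to every row of the same type and timepoint
  group_by(aifi_cell_type, timepoint) %>%
  mutate(n_dmso = n_cells[treatment == "dmso"]) %>%
  ungroup() %>%
  filter(treatment != "dmso") %>%
  # Minimum over all treatment and DMSO samples for this treatment and type
  group_by(treatment, aifi_cell_type) %>%
  mutate(n_sample = min(c(n_cells, n_dmso))) %>%
  ungroup()

toy_minimums$n_sample
```

In this toy case the smallest of the six samples has 400 cells, so all three timepoints are assigned `n_sample = 400`.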

In [11]:
# Split the table into one list per comparison (one row each)
comp_list <- map(
    seq_len(nrow(type_minimums)),
    function(i) {
        as.list(type_minimums[i,])
    }
)

## Sample cells for each test

Here, we'll sample cells for comparisons and generate a table of foreground and background cells to use for analysis.

In [12]:
sampled_comp_cells <- map(
    comp_list,
    function(comp) {
        set.seed(3030)
        
        tp <- comp$timepoint
        ct <- comp$aifi_cell_type
        
        fg_treat <- comp$treatment
        bg_treat <- "dmso"

        n_sample <- comp$n_sample

        fg_cells <- all_labels %>%
          filter(treatment == fg_treat,
                 timepoint == tp,
                 aifi_cell_type == ct) %>%
          sample_n(n_sample)
        bg_cells <- all_labels %>%
          filter(treatment == bg_treat,
                 timepoint == tp,
                 aifi_cell_type == ct) %>%
          sample_n(n_sample)

        rbind(bg_cells, fg_cells)
    }
)
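
Because the foreground and background are both drawn with the same `n_sample`, each sampled table should hold two treatments with equal cell counts. A quick check along these lines can confirm that before modeling. `is_balanced()` and `toy_meta` are hypothetical names used only for this sketch.

```r
# Hypothetical helper: TRUE when a sampled table holds exactly two treatments
# (one of them "dmso") with the same number of cells in each
is_balanced <- function(meta) {
  counts <- table(meta$treatment)
  length(counts) == 2 &&
    "dmso" %in% names(counts) &&
    counts[[1]] == counts[[2]]
}

# Toy stand-in for one element of sampled_comp_cells
toy_meta <- data.frame(
  barcodes  = paste0("cell_", 1:6),
  treatment = rep(c("dmso", "bortezomib"), each = 3)
)

is_balanced(toy_meta)        # 3 vs 3 cells
is_balanced(toy_meta[-1, ])  # 2 vs 3 cells after dropping one DMSO cell
```

Applied to the real data, the same helper could be run over every comparison with `all(map_lgl(sampled_comp_cells, is_balanced))`.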

## Build matrices for each test

Now, we'll use the selected cells to build a data matrix for each comparison.

We'll use these together with the cell metadata to run `lm()`.

In [13]:
cd4_so <- readRDS("cache/7bdac6ef-e5e5-4150-b4f3-9c1a1e250334/filtered_cd4_te_seurat.rds")
cd8_so <- readRDS("cache/46438bc4-cde6-4ae6-b349-9c513dd9d16f/filtered_cd8_te_seurat.rds")

In [14]:
all_so <- merge(cd4_so, cd8_so)

In [15]:
DefaultAssay(all_so) <- "ADT"

In [18]:
all_so <- NormalizeData(
    all_so, 
    normalization.method = "CLR",
    margin = 2
)

Normalizing across cells
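
With `margin = 2`, the centered log-ratio (CLR) is computed within each cell (column) rather than within each feature. The sketch below reimplements the transform in base R on a toy matrix so the arithmetic is visible. The formula reflects our understanding of the CLR variant used by Seurat 4 (a `log1p`-based geometric mean) and should be treated as an assumption; `clr_cell()` and `toy_adt` are hypothetical names.

```r
# Assumed form of Seurat's CLR: log1p of each count divided by a
# log1p-based geometric mean of the values in the same vector
clr_cell <- function(x) {
  log1p(x / exp(sum(log1p(x[x > 0]), na.rm = TRUE) / length(x)))
}

# Toy ADT counts: 3 features (rows) x 2 cells (columns), filled by column
toy_adt <- matrix(
  c(10, 0, 5,
    100, 20, 50),
  nrow = 3,
  dimnames = list(c("CD3", "CD4", "CD8"), c("cell_a", "cell_b"))
)

# margin = 2: apply the transform to each column (cell)
toy_clr <- apply(toy_adt, 2, clr_cell)
toy_clr
```

Under this form, the normalized values are never negative and a raw count of zero stays at zero.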



In [19]:
all_mat <- all_so[["ADT"]]@data
rm(cd4_so)
rm(cd8_so)
rm(all_so)

In [20]:
sampled_comp_mats <- map(
    sampled_comp_cells,
    function(meta) {
        all_mat[,meta$barcodes]
    }
)
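
Subsetting by barcode assumes that every sampled barcode is a column name of the matrix, and the modeling step relies on the columns coming back in the same order as the metadata rows. A minimal sketch of both points on toy data (`toy_mat` and `toy_meta` are hypothetical stand-ins for `all_mat` and one sampled metadata table):

```r
# Toy matrix standing in for all_mat: features in rows, barcodes in columns
toy_mat <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("feat_", 1:3), paste0("bc_", 1:4))
)
toy_meta <- data.frame(barcodes = c("bc_4", "bc_2"))

# Confirm every sampled barcode is a column before subsetting
missing_bc <- setdiff(toy_meta$barcodes, colnames(toy_mat))
stopifnot(length(missing_bc) == 0)

# Character indexing returns columns in the order of the metadata rows
toy_sub <- toy_mat[, toy_meta$barcodes, drop = FALSE]
colnames(toy_sub)
```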

In [23]:
lm_res <- map2_dfr(
    sampled_comp_cells,
    sampled_comp_mats,
    function(meta, mat) {
        set.seed(3030)
        
        treatments <- unique(meta$treatment)
        fg_treat <- treatments[treatments != "dmso"]
        treat_levels <- c("dmso", fg_treat)
        meta$treatment <- factor(meta$treatment, levels = treat_levels)
        
        ct <- meta$aifi_cell_type[1]
        tp <- meta$timepoint[1]
        fg <- fg_treat
        bg <- "dmso"
        
        ds <- nrow(meta) / 2
        
        # Fit one linear model per ADT feature
        map_dfr(rownames(mat),
            function(feat) {
                dat <- data.frame(
                    treatment = meta$treatment, 
                    val = mat[feat,]
                )
                names(dat)[2] <- feat
                
                fit <- lm(
                    formula = as.formula(paste0("`",feat,"` ~ treatment")), 
                    data = dat)
                
                # Row 2 holds the treatment effect relative to DMSO
                coefs <- summary(fit)$coefficients
                
                # Group means of normalized values, computed once and reused
                fg_mean <- mean(dat[[feat]][dat$treatment == fg])
                bg_mean <- mean(dat[[feat]][dat$treatment != fg])
                
                data.frame(
                    aifi_cell_type = ct,
                    timepoint = tp,
                    fg = fg,
                    bg = bg,
                    n_downsample = ds,
                    feature = feat,
                    estimate = coefs[2,1],
                    std_error = coefs[2,2],
                    t_value = coefs[2,3],
                    nomP = coefs[2,4],
                    fg_mean = fg_mean,
                    bg_mean = bg_mean,
                    fc = fg_mean / bg_mean
                )
        })
    }
)
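
For a model with a single two-level factor, the second coefficient row is the difference between the group means, and its p-value matches a pooled-variance two-sample t-test. The sketch below checks this on simulated values; the data and the `"drug"` label are made up for illustration.

```r
set.seed(1)

# Simulated normalized values for 20 control and 20 treated cells
toy <- data.frame(
  treatment = factor(rep(c("dmso", "drug"), each = 20), levels = c("dmso", "drug")),
  val = c(rnorm(20, mean = 1.0), rnorm(20, mean = 1.5))
)

fit <- lm(val ~ treatment, data = toy)
coefs <- summary(fit)$coefficients

# Row 2 is the treatment effect relative to the reference level ("dmso")
estimate <- coefs[2, "Estimate"]
p_lm <- coefs[2, "Pr(>|t|)"]

# The same quantities from group means and a pooled-variance t-test
mean_diff <- mean(toy$val[toy$treatment == "drug"]) -
  mean(toy$val[toy$treatment == "dmso"])
p_t <- t.test(val ~ treatment, data = toy, var.equal = TRUE)$p.value

all.equal(estimate, mean_diff)
all.equal(p_lm, p_t)
```

Setting `"dmso"` as the first factor level is what makes the estimate read as treatment minus control.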

In [27]:
lm_res$adjP <- p.adjust(lm_res$nomP, method = "BH")
lm_res$logFC <- log2(lm_res$fc)
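
Note that `p.adjust()` is applied here to every row of `lm_res` at once, so the Benjamini-Hochberg correction spans all cell types, timepoints, treatments, and features together. As a small illustration of what the adjustment does, here is a sketch on hypothetical p-values (the 0.05 cutoff is chosen only for the example):

```r
# Hypothetical nominal p-values from four tests
toy_p <- c(0.001, 0.01, 0.03, 0.5)

# Benjamini-Hochberg: p * n / rank, then enforce monotonicity
toy_adj <- p.adjust(toy_p, method = "BH")
toy_adj

# Number of tests passing an illustrative 0.05 threshold
sum(toy_adj < 0.05)
```

With four tests, the adjusted values are 0.004, 0.02, 0.04, and 0.5, so three tests pass the threshold.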

## Generate output files

For downstream use, we'll write the table of differential expression (DEP) results to a CSV file.

In [37]:
# Create the output directory only if it doesn't exist yet
if (!dir.exists("output")) dir.create("output")


In [38]:
write.csv(lm_res,
          paste0("output/all_lm_dep_",Sys.Date(),".csv"),
          quote = FALSE, row.names = FALSE)

## Store results in HISE

Finally, we store the output file in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the file UUIDs of the original input files (passed as `inputFileIds`) for reference.

In [39]:
study_space_uuid <- "40df6403-29f0-4b45-ab7d-f46d420c422e"
title <- paste("VRd TEA-seq lm DEP", Sys.Date())

In [40]:
out_files <- list.files(
    "output",
    full.names = TRUE
)
out_list <- as.list(out_files)

In [41]:
out_list

In [42]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = file_uuids,
    store = "project",
    doPrompt = FALSE
)

In [43]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SeuratObject_4.1.3 Seurat_4.3.0.1     dplyr_1.1.3        purrr_1.0.2       
[5] hise_2.16.0       

loaded via a namespace (and not attached):
  [1] bitops_1.0-7           deldir_1.0-9           pbapply_1.7-2         
  [4] gridExtra_2.3          rlang_1.1.1            magrittr_2.0.3        
  [7] RcppAnnoy_