# Intersect DEG sets to identify possible conflicts

In this notebook, we compare the outputs of our previous MAST DEG analysis for each condition to identify overlapping genes and perform statistical tests to determine if these overlaps occur more frequently than expected by chance.

To simplify this analysis somewhat, we'll restrict comparisons to gene sets within each cell type, rather than comparing changes across different cell types.

To identify directional effects, we'll consider up- and down-regulated genes separately for this analysis.

Because the statistical power of the MAST results depends on cell number, we'll use the top 500 up-regulated and top 500 down-regulated genes for each condition, ranked by nominal P-value.

After selecting and overlapping these gene sets, hypergeometric tests will be used to determine if overlaps occur more frequently than expected by chance. 

For use with R's `phyper`, the variables will be as follows:  
```
phyper(
    q = n_ol - 1,                  # successes
    m = n_fg_genes,                # number of possible successes
    n = n_expr_genes - n_fg_genes, # number of non-successes
    k = n_bg_genes,                # number of draws
    lower.tail = FALSE
)
```
Here, `n_ol` is the number of overlapping genes between the two sets, `n_expr_genes` is the number of genes detected in both conditions (see the MAST notebook for the detection threshold), `n_fg_genes` is the number of genes in one of the two sets (arbitrarily designated as foreground, fg), and `n_bg_genes` is the number of genes in the second set (background, bg). Because `lower.tail = FALSE` returns P(X > q), passing `q = n_ol - 1` gives the probability of observing at least `n_ol` overlapping genes. The test is symmetric in the two sets, so swapping the `fg` and `bg` designations should not change the result.
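
As a minimal sketch of this parameterization, the toy numbers below (hypothetical, not taken from the MAST results) compute the overlap P-value with each set in turn treated as the foreground, and compare the upper tail against an explicit sum of `dhyper` probabilities.

```r
# Toy values for illustration only; none of these come from the real DEG results
n_expr_genes <- 10000  # genes detected in both conditions
n_fg_genes   <- 500    # size of the foreground set
n_bg_genes   <- 300    # size of the background set
n_ol         <- 25     # observed overlap

# P(overlap >= n_ol) with the first set as foreground
p_fg <- phyper(
    q = n_ol - 1,
    m = n_fg_genes,
    n = n_expr_genes - n_fg_genes,
    k = n_bg_genes,
    lower.tail = FALSE
)

# Same test with the roles of the two sets swapped
p_bg <- phyper(
    q = n_ol - 1,
    m = n_bg_genes,
    n = n_expr_genes - n_bg_genes,
    k = n_fg_genes,
    lower.tail = FALSE
)

# The upper tail written out as a sum of point probabilities
p_sum <- sum(dhyper(
    x = n_ol:min(n_fg_genes, n_bg_genes),
    m = n_fg_genes,
    n = n_expr_genes - n_fg_genes,
    k = n_bg_genes
))

# Overlap expected by chance, for comparison with n_ol
expected_ol <- n_fg_genes * n_bg_genes / n_expr_genes

all.equal(p_fg, p_bg)   # swapping fg and bg gives the same P-value
all.equal(p_fg, p_sum)  # phyper upper tail matches the explicit sum
```

Comparing `n_ol` to `expected_ol` gives a quick sense of the effect size behind the P-value.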

## Load packages

hise: The Human Immune System Explorer R SDK package  
purrr: Functional programming tools  
dplyr: Dataframe handling functions  
tidyr: Data tidying functions  
tibble: A modern dataframe implementation  

In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(purrr)
quiet_library(dplyr)
quiet_library(tidyr)
quiet_library(tibble)

## Set parameters

The main parameter for this analysis is `n_top`, the number of DEGs to use per condition and direction of differential expression.

In [2]:
n_top <- 500

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the DEG results based on their file UUIDs. The files will be placed in the `cache/` subdirectory by default.

In [3]:
file_uuid <- list("fc83b89f-fd26-43b8-ac91-29c539703a45")

In [4]:
fres <- cacheFiles(file_uuid)

submitting request as query ID first...

retrieving files using fileIDS...



## Load and prepare DEG results

In [5]:
deg <- read.csv("cache/fc83b89f-fd26-43b8-ac91-29c539703a45/all_mast_deg_2023-09-06.csv")

Add a direction column to make it easy to split up- and down-regulated genes. When `logFC` is missing, the sign of `coef_D` is used instead.

In [6]:
deg <- deg %>%
  mutate(
      direction_sign = ifelse(
          is.na(logFC),
          sign(coef_D),
          sign(logFC)
      ),
      direction = ifelse(
          direction_sign == 1,
          "up", "dn"
      )
  )
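
To make the fallback logic concrete, here is a small sketch on a toy data frame (the genes and values are hypothetical). It uses base R rather than `mutate()` so that it runs without any packages, and it includes a row with a sign of 0 to show how that edge case is labeled.

```r
# Toy DEG table for illustration only; values are made up
toy_deg <- data.frame(
    gene   = c("G1", "G2", "G3", "G4", "G5"),
    logFC  = c(1.2, -0.8, NA, NA, 0),
    coef_D = c(0.5, -0.3, 0.9, -1.1, 0.2)
)

# Use the sign of logFC when available, otherwise fall back to coef_D
toy_deg$direction_sign <- ifelse(
    is.na(toy_deg$logFC),
    sign(toy_deg$coef_D),
    sign(toy_deg$logFC)
)

# Only a sign of exactly 1 is labeled "up"; -1 and 0 are both labeled "dn"
toy_deg$direction <- ifelse(toy_deg$direction_sign == 1, "up", "dn")

toy_deg[, c("gene", "direction_sign", "direction")]
```

With this rule, a gene whose `logFC` is exactly 0 (G5 here) lands in the "dn" group, which may be worth checking for if such rows exist in the real results.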

Add a column for result grouping and split the gene sets by those groups. For this analysis, we need to group by treatment and timepoint *within* each cell type.

In [7]:
deg <- deg %>%
  mutate(
      treat_time = ifelse(
          timepoint == 4,
          paste0(fg, "_0", timepoint), # add a 0 for 4hr to help with sorting
          paste0(fg, "_", timepoint)
      )
  )
# split by cell type
type_deg <- split(deg, deg$aifi_cell_type)
# split within each cell type
type_deg <- map(
    type_deg,
    function(deg) { split(deg, deg$treat_time) }
)
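
The resulting object is a two-level list: cell type first, then condition. The sketch below rebuilds that structure on a toy table using base R (`lapply` in place of `map`). The treatment name `drugA` and the 24 hr timepoint are hypothetical placeholders; the last two lines show why the 4 hr label is zero-padded.

```r
# Toy table for illustration only; "drugA" and the 24 hr timepoint are made up
deg_toy <- data.frame(
    aifi_cell_type = c("t_cd4_naive", "t_cd4_naive", "t_cd8_naive", "t_cd8_naive"),
    treat_time     = c("drugA_04", "drugA_24", "drugA_04", "drugA_04"),
    gene           = c("G1", "G2", "G1", "G3"),
    stringsAsFactors = FALSE
)

# First level: one data frame per cell type
by_type <- split(deg_toy, deg_toy$aifi_cell_type)

# Second level: within each cell type, one data frame per condition
nested <- lapply(by_type, function(d) split(d, d$treat_time))

type_names <- names(nested)              # cell types
cond_names <- names(nested$t_cd4_naive)  # conditions within one cell type
n_rows     <- nrow(nested$t_cd8_naive$drugA_04)

# Character sorting compares digit by digit, so padding keeps 4 hr ahead of 24 hr
unpadded <- sort(c("drugA_4", "drugA_24"))
padded   <- sort(c("drugA_04", "drugA_24"))
```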

## Perform overlaps

In [8]:
type_overlaps <- map2( # For each cell type
    type_deg, names(type_deg),
    function(cond_list, cell_type) {
        print(cell_type)
        map2_dfr( # For each condition as "foreground"
            cond_list, names(cond_list),
            function(fg_deg, fg_cond) {
                
                all_fg_deg_up <- fg_deg %>%
                  filter(direction == "up")
                all_fg_deg_dn <- fg_deg %>%
                  filter(direction == "dn")
                
                map2_dfr( # Compare to every condition as "background"
                    cond_list, names(cond_list),
                    function(bg_deg, bg_cond) {
                        
                        common_genes <- intersect(fg_deg$gene, bg_deg$gene)
                        n_common <- length(common_genes)
                        
                        fg_deg_up <- all_fg_deg_up %>%
                          filter(gene %in% common_genes) %>%
                          arrange(nomP) %>%
                          head(n_top)
                        fg_deg_dn <- all_fg_deg_dn %>%
                          filter(gene %in% common_genes) %>%
                          arrange(nomP) %>%
                          head(n_top)
                        
                        bg_deg_up <- bg_deg %>%
                          filter(gene %in% common_genes) %>%
                          filter(direction == "up") %>%
                          arrange(nomP) %>%
                          head(n_top)
                        bg_deg_dn <- bg_deg %>%
                          filter(gene %in% common_genes) %>%
                          filter(direction == "dn") %>%
                          arrange(nomP) %>%
                          head(n_top)
                        
                        ol <- list(fg_up_bg_up = intersect(fg_deg_up$gene, bg_deg_up$gene),
                                   fg_up_bg_dn = intersect(fg_deg_up$gene, bg_deg_dn$gene),
                                   fg_dn_bg_dn = intersect(fg_deg_dn$gene, bg_deg_dn$gene),
                                   fg_dn_bg_up = intersect(fg_deg_dn$gene, bg_deg_up$gene)
                               )
                        ol <- map(ol, sort)
                        n_ol <- unname(map_int(ol, length))
                        ol <- map_chr(ol, paste, collapse = ";")

                        tibble(
                            aifi_cell_type = cell_type,
                            fg_treatment = sub("_.+", "", fg_cond),
                            fg_timepoint = as.numeric(sub(".+_", "", fg_cond)),
                            fg_direction = c("up", "up", "dn", "dn"),
                            bg_treatment = sub("_.+", "", bg_cond),
                            bg_timepoint = as.numeric(sub(".+_", "", bg_cond)),
                            bg_direction = c("up", "dn", "dn", "up"),
                            n_common = n_common,
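                            # use the actual set sizes, which can be smaller than n_top
                            n_fg = c(nrow(fg_deg_up), nrow(fg_deg_up), nrow(fg_deg_dn), nrow(fg_deg_dn)),
                            n_bg = c(nrow(bg_deg_up), nrow(bg_deg_dn), nrow(bg_deg_dn), nrow(bg_deg_up)),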
                            n_ol = n_ol,
                            ol_genes = ol
                        )
                        
                    })

            })
    })

[1] "t_cd4_cm"
[1] "t_cd4_em"
[1] "t_cd4_naive"
[1] "t_cd4_treg"
[1] "t_cd8_memory"
[1] "t_cd8_naive"
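
For a single pair of conditions, the four directional overlaps computed in the cell above can be sketched on toy gene sets as follows (hypothetical genes, base R only). The concordant overlaps are up/up and dn/dn; the discordant ones (up/dn and dn/up) are presumably the "possible conflicts" this notebook is looking for.

```r
# Toy gene sets for illustration only
fg_toy <- data.frame(
    gene      = c("A", "B", "C", "D"),
    direction = c("up", "up", "dn", "dn"),
    stringsAsFactors = FALSE
)
bg_toy <- data.frame(
    gene      = c("A", "C", "B", "D"),
    direction = c("up", "up", "dn", "dn"),
    stringsAsFactors = FALSE
)

# Split each table into its up- and down-regulated gene vectors
fg_sets <- split(fg_toy$gene, fg_toy$direction)
bg_sets <- split(bg_toy$gene, bg_toy$direction)

# Same four comparisons, in the same order, as in the main loop
ol_toy <- list(
    fg_up_bg_up = intersect(fg_sets$up, bg_sets$up),
    fg_up_bg_dn = intersect(fg_sets$up, bg_sets$dn),
    fg_dn_bg_dn = intersect(fg_sets$dn, bg_sets$dn),
    fg_dn_bg_up = intersect(fg_sets$dn, bg_sets$up)
)

# Overlap sizes, one per comparison
n_ol_toy <- lengths(ol_toy)
n_ol_toy
```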


## Perform hypergeometric tests

In [9]:
type_overlap_stats <- map(
    type_overlaps,
    function(type_ol) {
           type_ol %>%
              mutate(nomP = phyper(n_ol - 1, n_fg, n_common - n_fg, n_bg, lower.tail = FALSE))
    })

## Remove self-comparisons and adjust P-values

In [10]:
all_overlap_stats <- do.call(rbind, type_overlap_stats)
all_overlap_stats <- all_overlap_stats %>%
  filter(paste(fg_treatment, fg_timepoint) != paste(bg_treatment, bg_timepoint))

In [11]:
all_overlap_stats <- all_overlap_stats %>%
  mutate(adjP = p.adjust(nomP, method = "BH"))
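
As a rough illustration of what the Benjamini-Hochberg adjustment does, the sketch below applies `p.adjust` to a handful of toy P-values (hypothetical, not from this analysis) and reproduces the result step by step.

```r
# Toy nominal P-values for illustration only
nom_p <- c(0.04, 0.001, 0.5, 0.01, 0.02)

# Built-in Benjamini-Hochberg adjustment
adj_p <- p.adjust(nom_p, method = "BH")

# The same adjustment written out step by step
n_tests <- length(nom_p)
ord     <- order(nom_p, decreasing = TRUE)      # largest P-value first
ranks   <- n_tests:1                            # rank of each sorted P-value
scaled  <- nom_p[ord] * n_tests / ranks         # p * n / rank
manual  <- pmin(1, cummin(scaled))[order(ord)]  # enforce monotonicity, restore input order

all.equal(adj_p, manual)
```

Because the adjustment scales with the number of tests, filtering out the self-comparisons before calling `p.adjust` keeps those rows from influencing `adjP`.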

## Generate output files

In [12]:
out_overlap_stats <- all_overlap_stats %>%
  rename(group1_treatment = fg_treatment,
         group1_timepoint = fg_timepoint,
         group1_direction = fg_direction,
         n_group1 = n_fg,
         group2_treatment = bg_treatment,
         group2_timepoint = bg_timepoint,
         group2_direction = bg_direction,
         n_group2 = n_bg) %>%
  select(aifi_cell_type, 
         group1_treatment, group1_timepoint, group1_direction,
         group2_treatment, group2_timepoint, group2_direction,
         n_common, n_group1, n_group2, n_ol, nomP, adjP, ol_genes)

In [13]:
dir.create("output")

“'output' already exists”


In [14]:
out_file <- paste0("output/mast_deg_overlap_analysis_", Sys.Date(), ".csv")
write.csv(
    out_overlap_stats,
    out_file,
    row.names = FALSE,
    quote = FALSE
)

## Store results in HISE

Finally, we store the output file in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the FileIDs from the original fileset for reference.

In [15]:
study_space_uuid <- "40df6403-29f0-4b45-ab7d-f46d420c422e"
title <- paste("VRd TEA-seq MAST Overlap Analysis", Sys.Date())

In [16]:
uploadFiles(
    files = list(out_file),
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = file_uuid,
    store = "project",
    doPrompt = FALSE
)

[1] "Cannot determine the current notebook."
[1] "1) /home/jupyter/repro-vrd-tea-seq/02-mast-deg-testing/02-R_deg_result_overlaps.ipynb"
[1] "2) /home/jupyter/repro-vrd-tea-seq/figures/Supp-Fig-10_treatment_comparisons.ipynb"
[1] "3) /home/jupyter/repro-vrd-tea-seq/figures/Figure-R_all_type_overlaps (1).ipynb"


Please select (1-3)  1


## Session information

In [17]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_3.2.1 tidyr_1.3.0  dplyr_1.1.3  purrr_1.0.2  hise_2.16.0 

loaded via a namespace (and not attached):
 [1] crayon_1.5.2     vctrs_0.6.3      httr_1.4.7       cli_3.6.1       
 [5] rlang_1.1.1      generics_0.1.3   assertthat_0.2.1 jsonlite_1.8.7  
 [9] glue_1.6.2       RCurl_1.98-1.12  htmltools_0.5.6