# Enrichment of TF motifs near DEGs

### Concept

This notebook uses a simple approach to test for enrichment of transcription factor (TF) motifs near differentially expressed genes (DEGs).

Here, we ask a gene-centric question: how many DEGs have a given motif associated with them, based on ArchR's nearestGene annotation for peaks?

DEG -> peaks with that nearestGene -> CIS-BP motif annotation

We'll ignore whether the peak is differentially accessible or correlated with gene expression.


### Test variables
To test for enrichment, we'll use hypergeometric tests with the following variables, computed separately for each test group:

The set of possible genes is the intersection of all nearestGene values and all genes tested in the DEG results. (n_all_genes)

The number of possible successes is the number of possible genes with a nearestGene annotation for a given motif. (n_motif_genes)

The number of draws is the number of DEGs. (n_deg)

The number of successes is the number of DEGs with a nearestGene annotation for a given motif. (n_ol)

For use with R's `phyper`, the variables will be as follows:  
```
phyper(
    q = n_ol - 1,                    # successes - 1, so the upper tail is P(X >= n_ol)
    m = n_motif_genes,               # number of possible successes
    n = n_all_genes - n_motif_genes, # number of non-successes
    k = n_deg,                       # number of draws
    lower.tail = FALSE
)
```
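
As a quick check of why `q = n_ol - 1` is used: with `lower.tail = FALSE`, `phyper` returns P(X > q), so subtracting one gives P(X >= n_ol). The sketch below uses made-up counts (not values from this dataset) and compares the result against summing `dhyper` directly.

```r
# Toy counts, chosen only for illustration (not from the real data)
n_all_genes   <- 1000  # gene universe
n_motif_genes <- 100   # genes with the motif nearby
n_deg         <- 50    # DEGs drawn from the universe
n_ol          <- 10    # DEGs that also have the motif nearby

# Upper tail via phyper: P(X >= n_ol)
p_phyper <- phyper(
    q = n_ol - 1,
    m = n_motif_genes,
    n = n_all_genes - n_motif_genes,
    k = n_deg,
    lower.tail = FALSE
)

# Same probability by summing point masses from n_ol up to the largest possible overlap
max_ol <- min(n_deg, n_motif_genes)
p_sum <- sum(dhyper(
    x = n_ol:max_ol,
    m = n_motif_genes,
    n = n_all_genes - n_motif_genes,
    k = n_deg
))

# TRUE if the two calculations agree within numerical tolerance
all.equal(p_phyper, p_sum)
```

Without the `- 1`, the p-value would exclude the observed overlap itself and be too small.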

### FDR cutoff for DEGs

In [1]:
fdr_cutoff <- 0.01

In [2]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(ArchR)
quiet_library(dplyr)
quiet_library(purrr)
quiet_library(ggplot2)



## Retrieve files

Now, we'll use the HISE SDK package to retrieve the DEG results and motif annotations based on their file UUIDs. These will be placed in the `cache/` subdirectory by default.

In [3]:
deg_uuid <- list("fc83b89f-fd26-43b8-ac91-29c539703a45")

In [4]:
deg_res <- cacheFiles(deg_uuid)
deg_file <- list.files(
    paste0("cache/",deg_uuid),
    recursive = TRUE, full.names = TRUE
)

submitting request as query ID first...

retrieving files using fileIDS...



[1] "Initiating file download for all_mast_deg_2023-09-06.csv"
[1] "Download successful."


In [5]:
peak_uuids <- list(
    "44367c6e-74f5-489e-ae42-7c9320fa9d1a",
    "08e92bce-208e-463a-8297-76d2d5e2e404",
    "b5f0fe70-ef2b-4bba-9bd5-942f11220dd2",
    "9eb067cc-0f7b-40c3-ac87-55b4ecc98119",
    "a6c975fb-04d7-4a7b-89a8-b3486745db94",
    "248ef256-14ea-4a7a-b3f4-af414966ff86"
)

In [6]:
peak_res <- cacheFiles(peak_uuids)
peak_files <- list.files(
    paste0("cache/",peak_uuids),
    recursive = TRUE, full.names = TRUE
)

[1] "Initiating file download for peak-GRanges-t_cd4_cm_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-GRanges-t_cd4_em_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-GRanges-t_cd4_naive_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-GRanges-t_cd4_treg_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-GRanges-t_cd8_memory_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-GRanges-t_cd8_naive_2023-10-16.rds"
[1] "Download successful."


In [7]:
anno_uuids <- list(
    "1a89076d-06c1-4daf-b4b1-a58d289c0689",
    "785d6c93-92d5-40ca-b8ba-82e7fbc48c38",
    "e3a8dec3-72c2-4130-bb5d-e79e6207c47c",
    "246d2418-70e6-4826-8654-2e128daaa347",
    "a4328ddf-916e-4217-9b97-c3cf6fe931f1",
    "dedd7406-21cd-4d8b-a55b-8d5ae6839b45"
)

In [8]:
anno_res <- cacheFiles(anno_uuids)
anno_files <- list.files(
    paste0("cache/",anno_uuids),
    recursive = TRUE, full.names = TRUE
)

[1] "Initiating file download for peak-motif-matches-t_cd4_cm_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-motif-matches-t_cd4_em_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-motif-matches-t_cd4_naive_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-motif-matches-t_cd4_treg_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-motif-matches-t_cd8_memory_2023-10-16.rds"
[1] "Download successful."
[1] "Initiating file download for peak-motif-matches-t_cd8_naive_2023-10-16.rds"
[1] "Download successful."


### Get peak annotations and metadata per cell type

In [9]:
peak_types <- sub(".+-t(.+)_20.+", "t\\1", peak_files)
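
To see what this pattern extracts, here is a small sketch on two example paths. The file names come from the download log above; the `cache/some-uuid/` prefix is a placeholder, since the exact cache layout is an assumption.

```r
# Placeholder paths: file names from the download log, directory prefix assumed
example_files <- c(
    "cache/some-uuid/peak-GRanges-t_cd4_cm_2023-10-16.rds",
    "cache/some-uuid/peak-motif-matches-t_cd8_memory_2023-10-16.rds"
)

# Capture everything between "-t" and the "_20" that starts the date,
# then put the leading "t" back in front of the captured text
example_types <- sub(".+-t(.+)_20.+", "t\\1", example_files)
example_types
# [1] "t_cd4_cm"     "t_cd8_memory"
```

`sub()` returns its input unchanged when the pattern does not match, so a file with a different naming scheme would end up with its full path as its name.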

In [10]:
get_peak_meta <- function(peak_file) {
    peaks <- readRDS(peak_file)
    meta <- as.data.frame(elementMetadata(peaks))
    meta
}

In [11]:
type_peak_meta <- map(
    peak_files,
    get_peak_meta
)
names(type_peak_meta) <- peak_types

In [12]:
anno_types <- sub(".+-t(.+)_20.+", "t\\1", anno_files)

In [13]:
type_peak_anno <- map(
    anno_files,
    readRDS
)
names(type_peak_anno) <- anno_types

### Generate gene lists for each motif

In [14]:
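motif_gene_sets <- map2(
    type_peak_anno,
    # Match metadata to annotations by cell type name, not by position,
    # because the two file lists may be sorted differently
    type_peak_meta[names(type_peak_anno)],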
    function(peak_anno, peak_meta) {
        res <- map(colnames(peak_anno),
            function(motif) {
                target_meta <- peak_meta[peak_anno[,motif],]
                genes <- target_meta$nearestGene
                unique(genes)
        })
        names(res) <- colnames(peak_anno)
        res
    }
)
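
A minimal sketch of the same lookup on toy data, using base R `lapply` in place of `map`. It assumes the motif annotation is a logical peak-by-motif matrix whose rows are in the same order as the peak metadata; the gene and motif names are hypothetical.

```r
# Toy peak metadata: one row per peak, with its nearest gene
toy_meta <- data.frame(
    nearestGene = c("GENE_A", "GENE_B", "GENE_A", "GENE_C"),
    stringsAsFactors = FALSE
)

# Toy motif matches: rows are peaks (same order as toy_meta), columns are motifs
toy_anno <- matrix(
    c(TRUE,  FALSE,
      FALSE, TRUE,
      TRUE,  TRUE,
      FALSE, FALSE),
    ncol = 2, byrow = TRUE,
    dimnames = list(NULL, c("TF1_1", "TF2_2"))
)

# For each motif, keep the peaks that match and collect their unique nearest genes
toy_sets <- lapply(
    colnames(toy_anno),
    function(motif) {
        unique(toy_meta$nearestGene[toy_anno[, motif]])
    }
)
names(toy_sets) <- colnames(toy_anno)
toy_sets
# $TF1_1
# [1] "GENE_A"
# $TF2_2
# [1] "GENE_B" "GENE_A"
```

A gene counts once per motif no matter how many of its peaks carry that motif, which is what makes the test gene-centric.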

In [15]:
n_motif_genes_list <- map(motif_gene_sets, map_int, length)

### Get DEG per condition

In [16]:
all_deg <- read.csv(deg_file)

In [17]:
head(all_deg)

| | aifi_cell_type | timepoint | fg | bg | n_sample | gene | coef_C | coef_D | logFC | nomP | adjP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | t_cd4_cm | 4 | bortezomib | dmso | 648 | A1BG-AS1 | 0.028411773 | -0.12714495 | -0.00860394 | 0.7403701 | 0.999237 |
| 2 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAGAB | 0.032287238 | -0.25574995 | -0.05429509 | 0.1455986 | 0.9918874 |
| 3 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAK1 | -0.009058149 | 0.04769567 | 0.0148158 | 0.8762539 | 0.999237 |
| 4 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAMDC | 0.062294844 | 0.19250673 | 0.04313545 | 0.1602309 | 0.9918874 |
| 5 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AAMP | 0.018080684 | -0.16673307 | -0.01001203 | 0.7332528 | 0.999237 |
| 6 | t_cd4_cm | 4 | bortezomib | dmso | 648 | AARS | 0.019996604 | -0.13135182 | -0.01430162 | 0.6785156 | 0.999237 |


In [18]:
all_deg <- all_deg %>%
  mutate(
      dir_sign = ifelse(
          is.na(logFC),
          sign(coef_D),
          sign(logFC)),
      direction = ifelse(
          dir_sign == 1,
          "up", "dn")
    )

In [19]:
all_deg$test_group <- paste0(all_deg$fg, "_", all_deg$timepoint, "_", all_deg$aifi_cell_type)

In [20]:
split_deg <- split(all_deg, all_deg$test_group)
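
The direction labeling and grouping in the cells above can be sketched on a toy table with base R. The column names follow the DEG table; the gene names, coefficients, and the second timepoint are made up for illustration.

```r
# Toy DEG table with the columns used above (values are hypothetical)
toy_deg <- data.frame(
    aifi_cell_type = c("t_cd4_cm", "t_cd4_cm", "t_cd8_naive"),
    timepoint = c(4, 4, 24),
    fg = c("bortezomib", "bortezomib", "bortezomib"),
    gene = c("GENE_A", "GENE_B", "GENE_A"),
    coef_D = c(0.2, -0.3, 0.1),
    logFC = c(0.5, NA, -0.4),
    stringsAsFactors = FALSE
)

# Use the sign of logFC, falling back to coef_D where logFC is missing
dir_sign <- ifelse(
    is.na(toy_deg$logFC),
    sign(toy_deg$coef_D),
    sign(toy_deg$logFC)
)
toy_deg$direction <- ifelse(dir_sign == 1, "up", "dn")

# One group per treatment, timepoint, and cell type
toy_deg$test_group <- paste0(
    toy_deg$fg, "_", toy_deg$timepoint, "_", toy_deg$aifi_cell_type
)
toy_split <- split(toy_deg, toy_deg$test_group)

toy_deg$direction
# [1] "up" "dn" "dn"
```

Note that a sign of 0 would also be labeled "dn" under this rule.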

### Test for enrichment

In [21]:
overlap_results <- map_dfr(
    split_deg,
    function(deg) {
        ct <- deg$aifi_cell_type[1]
        tp <- deg$timepoint[1]
        treat <- deg$fg[1]

        peak_anno <- type_peak_anno[[ct]]
        peak_meta <- type_peak_meta[[ct]]
        
        all_genes <- intersect(
            deg$gene,
            peak_meta$nearestGene
        )
        n_all_genes <- length(all_genes)
        
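        # Restrict DEGs to the gene universe so that the number of draws
        # is consistent with n_all_genes
        sig_deg <- deg %>%
          filter(adjP < fdr_cutoff, gene %in% all_genes)
        n_sig_deg <- nrow(sig_deg)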

        motif_genes <- motif_gene_sets[[ct]]
        tf_names <- sub("_.+", "", names(motif_genes))
        motif_genes <- motif_genes[tf_names %in% deg$gene]
        
        motif_genes <- map(
            motif_genes,
            function(mg) {
                intersect(mg, all_genes)
            }
        )
        n_motif_genes <- map_int(motif_genes, length)
        
        up_deg <- sig_deg %>%
          filter(direction == "up")
        n_up_deg <- nrow(up_deg)

        dn_deg <- sig_deg %>%
          filter(direction == "dn")
        n_dn_deg <- nrow(dn_deg)
        
        # Overlapping genes between each motif gene set and each DEG list
        sig_ol <- map(
            motif_genes,
            function(gene_set) {
                sort(intersect(gene_set, sig_deg$gene))
            }
        )
        n_sig_ol <- map(sig_ol, length)
        sig_ol <- map(sig_ol, paste, collapse = ";")
        up_ol <- map(
            motif_genes,
            function(gene_set) {
                sort(intersect(gene_set, up_deg$gene))
            }
        )
        n_up_ol <- map(up_ol, length)
        up_ol <- map(up_ol, paste, collapse = ";")
        dn_ol <- map(
            motif_genes,
            function(gene_set) {
                sort(intersect(gene_set, dn_deg$gene))
            }
        )
        n_dn_ol <- map(dn_ol, length)
        dn_ol <- map(dn_ol, paste, collapse = ";")
        
        sig_hyper_res <- pmap_dfr(
            list(
                n_ol = n_sig_ol,
                ol = sig_ol,
                n_motif_genes = n_motif_genes,
                motif_names = names(n_motif_genes)
            ),
            function(n_ol, ol, n_motif_genes, motif_names) {
                tf <- sub("_.+", "", motif_names)
                
                data.frame(
                    treatment = treat,
                    timepoint = tp,
                    aifi_cell_type = ct,
                    direction = "all",
                    motif_id = motif_names,
                    tf_gene = tf,
                    tf_logFC = deg$logFC[deg$gene == tf],
                    tf_adjP = deg$adjP[deg$gene == tf],
                    n_all_genes = n_all_genes,
                    n_motif_genes = n_motif_genes,
                    n_deg = n_sig_deg,
                    n_ol = n_ol,
                    nomP = phyper(n_ol - 1, n_motif_genes, n_all_genes - n_motif_genes, n_sig_deg, lower.tail = FALSE),
                    ol_genes = ol
                )
            }
        )
        up_hyper_res <- pmap_dfr(
            list(
                n_ol = n_up_ol,
                ol = up_ol,
                n_motif_genes = n_motif_genes,
                motif_names = names(n_motif_genes)
            ),
            function(n_ol, ol, n_motif_genes, motif_names) {
                tf <- sub("_.+", "", motif_names)

                data.frame(
                    treatment = treat,
                    timepoint = tp,
                    aifi_cell_type = ct,
                    direction = "up",
                    motif_id = motif_names,
                    tf_gene = tf,
                    tf_logFC = deg$logFC[deg$gene == tf],
                    tf_adjP = deg$adjP[deg$gene == tf],
                    n_all_genes = n_all_genes,
                    n_motif_genes = n_motif_genes,
                    n_deg = n_up_deg,
                    n_ol = n_ol,
                    nomP = phyper(n_ol - 1, n_motif_genes, n_all_genes - n_motif_genes, n_up_deg, lower.tail = FALSE),
                    ol_genes = ol
                )
            }
        )
        dn_hyper_res <- pmap_dfr(
            list(
                n_ol = n_dn_ol,
                ol = dn_ol,
                n_motif_genes = n_motif_genes,
                motif_names = names(n_motif_genes)
            ),
            function(n_ol, ol, n_motif_genes, motif_names) {
                tf <- sub("_.+", "", motif_names)

                data.frame(
                    treatment = treat,
                    timepoint = tp,
                    aifi_cell_type = ct,
                    direction = "dn",
                    motif_id = motif_names,
                    tf_gene = tf,
                    tf_logFC = deg$logFC[deg$gene == tf],
                    tf_adjP = deg$adjP[deg$gene == tf],
                    n_all_genes = n_all_genes,
                    n_motif_genes = n_motif_genes,
                    n_deg = n_dn_deg,
                    n_ol = n_ol,
                    nomP = phyper(n_ol - 1, n_motif_genes, n_all_genes - n_motif_genes, n_dn_deg, lower.tail = FALSE),
                    ol_genes = ol
                )
            }
        )
        
        rbind(sig_hyper_res,
              up_hyper_res,
              dn_hyper_res)
    }
)
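
To make the counting inside the function concrete, here is a toy version for a single motif and a single DEG list. All gene names are hypothetical, and the `toy_` variables stand in for `all_genes`, one entry of `motif_genes`, and `sig_deg$gene`.

```r
# Toy gene universe: genes both tested for DE and present as a nearestGene
toy_universe <- c("G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8")

# Toy motif gene set; G9 lies outside the universe and is dropped
toy_motif_genes <- intersect(c("G1", "G2", "G3", "G9"), toy_universe)

# Toy significant DEGs, all inside the universe
toy_deg_genes <- c("G1", "G2", "G5")

toy_n_all   <- length(toy_universe)                              # 8
toy_n_motif <- length(toy_motif_genes)                           # 3
toy_n_deg   <- length(toy_deg_genes)                             # 3
toy_n_ol    <- length(intersect(toy_motif_genes, toy_deg_genes)) # 2

# P(overlap >= 2) when drawing 3 genes from 8, of which 3 carry the motif
toy_p <- phyper(
    toy_n_ol - 1,
    toy_n_motif,
    toy_n_all - toy_n_motif,
    toy_n_deg,
    lower.tail = FALSE
)
toy_p
# 16/56, about 0.286
```

The value can be checked by hand: (choose(3, 2) * choose(5, 1) + choose(3, 3)) / choose(8, 3) = (15 + 1) / 56.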

In [22]:
overlap_results$adjP <- p.adjust(overlap_results$nomP, method = "BH")
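
For reference, a small sketch of what the Benjamini-Hochberg adjustment does to a vector of p-values. The values are made up and are not from this analysis.

```r
# Toy nominal p-values (hypothetical)
toy_nom_p <- c(0.01, 0.02, 0.03, 0.5)

# BH scales each p-value by n / rank, then enforces monotonicity
toy_adj_p <- p.adjust(toy_nom_p, method = "BH")
toy_adj_p
# [1] 0.04 0.04 0.04 0.50
```

In the cell above, the adjustment is applied once across all test groups, motifs, and directions ("all", "up", "dn") together, so the number of tests in the correction is the full number of rows in `overlap_results`.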

In [23]:
if (!dir.exists("output")) {
    dir.create("output")
}


In [24]:
write.csv(
    overlap_results,
    paste0("output/deg_motif_enrichment_",Sys.Date(),".csv"),
    row.names = FALSE,
    quote = FALSE
)

## Store results in HISE

In [32]:
study_space_uuid <- "40df6403-29f0-4b45-ab7d-f46d420c422e"
title <- paste("VRd TEA-seq T Cell DEG TF Motif Enrichment", Sys.Date())

In [33]:
out_list <- list(paste0("output/deg_motif_enrichment_",Sys.Date(),".csv"))
input_ids <- c(deg_uuid, peak_uuids, anno_uuids)

In [34]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = input_ids,
    store = "project",
    doPrompt = FALSE
)

In [35]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] purrr_1.0.2                 dplyr_1.1.3                
 [3] rhdf5_2.44.0                SummarizedExperiment_1.30.2
 [5] Biobase_2.60.0              MatrixGenerics_1.12.3      
 [7] Rcpp_1.0.11                 Matrix_1.6-1.1             
 [9] GenomicRanges_1.52.0        GenomeInfoDb