# Modified GSEA of MAST DEGs against Hallmark Pathways: Remove PSM genes

This notebook contains an analysis of bortezomib DEGs after removal of proteasomal (PSM) genes.

The goal of this analysis is to determine which pathway enrichment results depend strongly on proteasome-related genes, and to identify additional pathways that may be enriched once the dominant effect of this tightly related gene set is removed.

The analysis itself is identical to the generalized notebook for GSEA of MAST DEGs, with the addition of the **DEG Filtering** section, which selects only bortezomib conditions and removes the set of proteasomal genes described in Mao, 2021:  

Mao, Y. Structure, Dynamics and Function of the 26S Proteasome. in Macromolecular Protein Complexes III: Structure and Function (eds. Harris, J. R. & Marles-Wright, J.) 1–151 (Springer International Publishing, 2021).

## Setup

For this analysis, we'll compare our DEGs to the MSigDB Hallmark Gene Sets, available in the `msigdbr` package. We'll need to install this package if it's not already present.

In [1]:
ip <- installed.packages()
if(!"msigdbr" %in% rownames(ip)) {
    install.packages("msigdbr", upgrade = "never")
}

## Load packages

hise: The Human Immune System Explorer R SDK package  
purrr: Functional programming tools  
dplyr: Dataframe handling functions  
fgsea: Fast Gene Set Enrichment Analysis  
msigdbr: MSigDB gene sets

In [2]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(purrr)
quiet_library(dplyr)
quiet_library(fgsea)
quiet_library(msigdbr)

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the MAST DEG results file based on its UUID. This will be placed in the `cache/` subdirectory by default.

In [3]:
file_uuid <- list(
    "fc83b89f-fd26-43b8-ac91-29c539703a45"
)

In [4]:
fres <- cacheFiles(file_uuid)

submitting request as query ID first...

retrieving files using fileIDS...



In [5]:
psm_genes <- read.csv("../common/gene_sets/mao_proteasome_genes.csv")

### Prepare DEG lists

To rank genes, we'll convert nomP to -log10(nomP), and incorporate the direction of differential expression by multiplying by the direction of effect size (sign(logFC)).

In [6]:
all_deg <- read.csv("cache/fc83b89f-fd26-43b8-ac91-29c539703a45/all_mast_deg_2023-09-06.csv")
all_deg$treat_time_type <- paste0(
    all_deg$fg, "_", 
    all_deg$timepoint, "_", 
    all_deg$aifi_cell_type)

Prior to ranking, we'll need to resolve missing `logFC` values. These can occur if one of the groups used for DEG analysis had no expression of the gene.

In [7]:
all_deg %>%
  filter(is.na(logFC)) %>%
  head()

| | aifi_cell_type (chr) | timepoint (int) | fg (chr) | bg (chr) | n_sample (int) | gene (chr) | coef_C (dbl) | coef_D (dbl) | logFC (dbl) | nomP (dbl) | adjP (dbl) | treat_time_type (chr) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | t_cd4_em | 4 | bortezomib | dmso | 180 | TFDP1 | NA | 3.196997 | NA | 3.817725e-05 | 0.1481468 | bortezomib_4_t_cd4_em |
| 2 | t_cd4_treg | 4 | bortezomib | dmso | 78 | ABCA3 | NA | -2.453613 | NA | 0.01337154 | 0.9999222 | bortezomib_4_t_cd4_treg |
| 3 | t_cd4_treg | 4 | bortezomib | dmso | 78 | AC005070.3 | NA | -2.259963 | NA | 0.03110707 | 0.9999222 | bortezomib_4_t_cd4_treg |
| 4 | t_cd4_treg | 4 | bortezomib | dmso | 78 | AC006504.5 | NA | -2.470421 | NA | 0.01243915 | 0.9999222 | bortezomib_4_t_cd4_treg |
| 5 | t_cd4_treg | 4 | bortezomib | dmso | 78 | AC007686.3 | NA | -2.265379 | NA | 0.03041017 | 0.9999222 | bortezomib_4_t_cd4_treg |
| 6 | t_cd4_treg | 4 | bortezomib | dmso | 78 | AC010754.1 | NA | -2.272969 | NA | 0.02952072 | 0.9999222 | bortezomib_4_t_cd4_treg |


When this occurs, we can use the sign of `coef_D` to determine the direction of expression change, rather than using the missing `logFC` value.

In [8]:
all_deg <- all_deg %>%
  mutate(direction = ifelse(
      is.na(logFC),
      sign(coef_D), # if missing logFC, use coef_D
      sign(logFC) # otherwise, use logFC
  ))
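
As a quick illustration of this fallback rule, here is a minimal base R sketch on toy data. The gene names and values are hypothetical and are not taken from the MAST results. It also counts any rows where neither `logFC` nor `coef_D` gives a usable sign, since those rows would end up with a rank of zero or `NA`.

```r
# Toy data (hypothetical values): GENE_B has a missing logFC
toy_deg <- data.frame(
    gene   = c("GENE_A", "GENE_B", "GENE_C"),
    logFC  = c(1.2, NA, -0.8),
    coef_D = c(0.5, -2.4, -0.3)
)

# Same rule as the cell above, written in base R
toy_deg$direction <- ifelse(
    is.na(toy_deg$logFC),
    sign(toy_deg$coef_D), # fall back to coef_D when logFC is missing
    sign(toy_deg$logFC)   # otherwise use logFC
)

# Rows without a usable direction (NA, or exactly zero)
n_unresolved <- sum(is.na(toy_deg$direction) | toy_deg$direction == 0)

print(toy_deg)
print(n_unresolved)
```

Running the same count on `all_deg` after the `mutate()` step would show whether any genes remain without a direction.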

We also need to avoid `nomP` values of 0. Because `-log10(0)` is `Inf`, these would produce infinite ranking values after log transformation. We'll convert these to `1e-300` so that they have a non-zero value and a finite score.

In [9]:
all_deg <- all_deg %>%
  mutate(nomP = ifelse(
      nomP == 0,
      1e-300, # if zero, change to 1e-300
      nomP # otherwise, keep the value
  ))
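
The effect of this floor can be seen on a small vector of toy p-values (hypothetical values, base R only). This is a sketch of the reasoning, not part of the analysis itself.

```r
# Toy p-values (hypothetical), including an exact zero
p <- c(0.05, 1e-10, 0)

# Without a floor, the zero becomes an infinite score
raw_score <- -log10(p)

# With the floor used above, every score is finite
p_floor <- 1e-300
clamped <- ifelse(p == 0, p_floor, p)
clamped_score <- -log10(clamped)

print(is.finite(raw_score))
print(is.finite(clamped_score))
```

The floor of `1e-300` is still representable as a double, so the clamped genes receive the largest score in the list (about 300) rather than an infinite one.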

## DEG Filtering

Filter to select bortezomib conditions

In [10]:
all_deg <- all_deg %>%
  filter(fg == "bortezomib")

Filter to remove proteasomal genes, using the Mao, 2021 gene list loaded above as `psm_genes`

In [11]:
all_deg <- all_deg %>%
  filter(!gene %in% psm_genes$gene)
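
Before relying on this filter, it can help to confirm which proteasomal genes were actually present in the DEG table and which were never tested. The sketch below uses toy data frames (`toy_deg` and `toy_psm` are hypothetical stand-ins for `all_deg` and `psm_genes`); the only assumption carried over from the notebook is that both have a `gene` column.

```r
# Toy stand-ins (hypothetical) for all_deg and psm_genes
toy_deg <- data.frame(
    gene = c("PSMA1", "PSMB5", "ACTB", "GAPDH", "PSMA1"),
    treat_time_type = c("a", "a", "a", "b", "b")
)
toy_psm <- data.frame(gene = c("PSMA1", "PSMB5", "PSMD14"))

# Rows that match the proteasome gene list
is_psm <- toy_deg$gene %in% toy_psm$gene

# Genes that will be removed, and listed genes that never appear
removed   <- unique(toy_deg$gene[is_psm])
not_found <- setdiff(toy_psm$gene, toy_deg$gene)

# Equivalent of filter(!gene %in% psm_genes$gene)
filtered <- toy_deg[!is_psm, , drop = FALSE]

print(removed)
print(not_found)
print(nrow(filtered))
```

A long `not_found` list on the real data could point to a gene symbol mismatch between the two files rather than genuinely absent genes.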

## Rank and Split for analysis

In [12]:
deg_list <- split(all_deg, all_deg$treat_time_type)

In [13]:
deg_list <- map(
    deg_list,
    function(deg) {
        deg %>%
          mutate(rank_val = -log10(nomP) * direction) %>%
          arrange(desc(rank_val))
    }
)

In [14]:
rank_list <- map(
    deg_list,
    function(deg) {
        v <- deg$rank_val
        names(v) <- deg$gene
        v
    }
)
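
The ranking and naming steps above can be sketched end to end on toy data, together with a few sanity checks on the resulting vector. The data are hypothetical, and the checks are a suggestion rather than part of the original workflow: duplicated gene names or non-finite values in a rank vector can lead to warnings or errors downstream.

```r
# Toy DEG table (hypothetical values)
toy_deg <- data.frame(
    gene      = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
    nomP      = c(1e-4, 0.5, 0.03, 1e-6),
    direction = c(1, 1, -1, -1)
)

# Signed -log10(p), sorted from most up- to most down-regulated
toy_deg$rank_val <- -log10(toy_deg$nomP) * toy_deg$direction
toy_deg <- toy_deg[order(toy_deg$rank_val, decreasing = TRUE), ]

# Named numeric vector: values are ranks, names are gene symbols
toy_ranks <- setNames(toy_deg$rank_val, toy_deg$gene)

# Sanity checks on the rank vector
checks <- c(
    unique_names  = anyDuplicated(names(toy_ranks)) == 0,
    finite_values = all(is.finite(toy_ranks)),
    sorted_desc   = !is.unsorted(rev(unname(toy_ranks)))
)
print(checks)
```

The same `checks` could be computed for each element of `rank_list` with `map()`.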

## Prepare Gene Sets

For use with `fgsea`, we need a named list of the Hallmark gene sets.

In [15]:
hallmark <- msigdbr(species = "human", category = "H")

In [16]:
hallmark_list <- split(hallmark, hallmark$gs_name)
hallmark_list <- map(hallmark_list, "gene_symbol")
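
To show the shape of the named list, here is a minimal base R sketch on a toy long-format table. The set and gene names are hypothetical; only the `gs_name` and `gene_symbol` column names are taken from the cells above. Splitting the symbol column directly gives the same kind of structure in one step.

```r
# Toy long-format gene set table (hypothetical sets and genes)
toy_sets <- data.frame(
    gs_name     = c("SET_ONE", "SET_ONE", "SET_TWO", "SET_TWO", "SET_TWO"),
    gene_symbol = c("GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_D"),
    stringsAsFactors = FALSE
)

# One character vector of gene symbols per gene set
toy_list <- split(toy_sets$gene_symbol, toy_sets$gs_name)

# Number of genes in each set
print(lengths(toy_list))
```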

We'll also need a data.frame with the gene sets for our output files. This will include the display labels specified in `common/gene_sets/hallmark_names.csv`.

In [17]:
hallmark_names <- read.csv("../common/gene_sets/hallmark_names.csv")

In [18]:
hallmark_df <- data.frame(
    pathway = names(hallmark_list),
    n_pathway_genes = map_int(hallmark_list, length),
    pathway_genes = map_chr(hallmark_list, paste, collapse = ";")
)
hallmark_df <- hallmark_df %>%
  left_join(hallmark_names)

[1m[22mJoining with `by = join_by(pathway)`


## Run GSEA

In [19]:
parallel_param <- BiocParallel::MulticoreParam(
    workers = 4, 
    progressbar = FALSE
)

In [20]:
fgsea_res <- map(
    rank_list,
    function(ranks) {
        fgsea(
            pathways = hallmark_list,
            stats    = ranks,
            minSize  = 10,
            maxSize  = 500,
            BPPARAM  = parallel_param
        )
    }
)

“There are ties in the preranked stats (0.04% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.”
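
To gauge how much this warning matters for a given condition, the tied values in a rank vector can be inspected directly. The sketch below uses a toy vector (hypothetical values) and a simple count of duplicated values; it is one way to estimate the tie fraction and may not match fgsea's internal calculation exactly.

```r
# Toy rank vector (hypothetical): GENE_B and GENE_C share a value
toy_ranks <- c(GENE_A = 3.2, GENE_B = 1.1, GENE_C = 1.1,
               GENE_D = -0.7, GENE_E = -2.5)

# Count values that repeat an earlier value
n_tied   <- sum(duplicated(toy_ranks))
pct_tied <- 100 * n_tied / length(toy_ranks)

# Names of all genes involved in a tie
tied_values <- unique(toy_ranks[duplicated(toy_ranks)])
tied_genes  <- names(toy_ranks)[toy_ranks %in% tied_values]

print(pct_tied)
print(tied_genes)
```

A small tie fraction, like the 0.04% reported above, affects only a handful of genes per list.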


### Format results

In [21]:
deg_meta <- map(
    deg_list,
    function(deg) {
        list(
            fg = deg$fg[1],
            bg = deg$bg[1],
            timepoint = deg$timepoint[1],
            aifi_cell_type = deg$aifi_cell_type[1]
        )
    }
)

In [22]:
head(fgsea_res[[1]])

| pathway (chr) | pval (dbl) | padj (dbl) | log2err (dbl) | ES (dbl) | NES (dbl) | size (int) | leadingEdge (list) |
|---|---|---|---|---|---|---|---|
| HALLMARK_ADIPOGENESIS | 0.001786125 | 0.01529758 | 0.45505987 | 0.6234839 | 1.6915193 | 120 | UBC, TALDO1, NMT1, BAZ2A, SOD1, ACO2, RTN3, GBE1, MAP4K3, GPD2, RREB1, SQOR, YWHAG, PEX14, RIOK3 |
| HALLMARK_ALLOGRAFT_REJECTION | 0.007759173 | 0.04558514 | 0.40701792 | -0.4353999 | -1.467458 | 101 | CCND2, PTPRC, CD2, LCP2, ETS1, ITK, CD3G, CD40LG, IL2RA, HLA-E, SOCS1, FYB1, B2M, TIMP1, STAT1, ST8SIA4, GBP2, IFNAR2, IRF4, STAT4, TRAT1, CD47, CD3E, ITGAL, IL2RB, IL7, GPR65, TAP1, LCK, NPM1, IL4R, TLR1, CD74, WAS, ACVR2A |
| HALLMARK_ANDROGEN_RESPONSE | 0.165165165 | 0.33751142 | 0.19002331 | -0.3814166 | -1.2004249 | 66 | ARID5B, IQGAP2, MYL12A, FKBP5, B2M, CDK6, STK39, ACTN1, GPD1L, TNFAIP8, INPP4B, MAF, RPS6KA3, PTK2B, LMAN1 |
| HALLMARK_APICAL_JUNCTION | 0.088414634 | 0.24444046 | 0.26635066 | -0.4091203 | -1.3024985 | 70 | PTPRC, ITGB1, ACTB, FYB1, ACTN1, EVL, PTEN, MYL12B, ICAM2 |
| HALLMARK_APICAL_SURFACE | 0.521348315 | 0.61258427 | 0.08312913 | -0.3940781 | -0.953058 | 17 | GATA3, IL2RB, FLOT2, AKAP7, CROCC, MAL, B4GALT1 |
| HALLMARK_APOPTOSIS | 0.508333333 | 0.61258427 | 0.06011861 | 0.3659088 | 0.9659397 | 91 | DAP3, SQSTM1, GSR, SOD1, DNAJA1, HSPB1, BAX, LMNA, MADD, BID, ADD1 |


In [23]:
formatted_fgsea_res <- map2_dfr(
    fgsea_res,
    deg_meta,
    function(res, meta) {
        res %>%
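          mutate(
              # fgsea's `size` is the number of pathway genes found in the ranks,
              # so count the leading edge genes before collapsing the list column
              n_leadingEdge = lengths(leadingEdge),
              leadingEdge = map_chr(leadingEdge, paste, collapse = ";"),
              fg = meta$fg,
              bg = meta$bg,
              timepoint = meta$timepoint,
              aifi_cell_type = meta$aifi_cell_type
          ) %>%
          left_join(hallmark_df, by = "pathway") %>%
          rename(nomP = pval,
                 adjP = padj) %>%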
          select(fg, bg, timepoint, aifi_cell_type,
                 pathway_label, NES, nomP, adjP, 
                 n_leadingEdge, n_pathway_genes,
                 leadingEdge, pathway_genes) %>%
          arrange(desc(NES))

    }
)

## Write output file

Write the formatted GSEA results as a .csv for later use. We remove `row.names` and set `quote = FALSE` to simplify the outputs and increase compatibility with other tools.

In [24]:
# Create the output directory only if it doesn't already exist
if(!dir.exists("output")) {
    dir.create("output")
}

In [25]:
out_file <- paste0("output/bortezomib_no-PSM_hallmark_gsea_res_", Sys.Date(), ".csv")
write.csv(
    formatted_fgsea_res,
    out_file,
    row.names = FALSE,
    quote = FALSE
)
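
Because `quote = FALSE` writes fields without quoting, any comma inside a value would shift columns when the file is read back. The sketch below shows one way to check for this on a toy results table (hypothetical values and labels) by scanning for commas and doing a round trip through a temporary file. It is a suggested safeguard, not part of the original workflow.

```r
# Toy results table (hypothetical values)
toy_res <- data.frame(
    pathway_label = c("Pathway one", "Pathway two"),
    NES           = c(1.7, -1.4),
    leadingEdge   = c("GENE_A;GENE_B", "GENE_C"),
    stringsAsFactors = FALSE
)

# Flag any column containing a comma, which would break an unquoted CSV
has_comma <- vapply(
    toy_res,
    function(col) any(grepl(",", as.character(col), fixed = TRUE)),
    logical(1)
)

# Round trip through a temporary file and compare dimensions
tmp <- tempfile(fileext = ".csv")
write.csv(toy_res, tmp, row.names = FALSE, quote = FALSE)
round_trip <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

print(has_comma)
print(identical(dim(round_trip), dim(toy_res)))
```

Collapsing gene lists with `;`, as done for `leadingEdge` and `pathway_genes`, keeps those columns safe for unquoted output.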

## Store results in HISE

Finally, we store the output file in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the FileIDs from the original fileset for reference. These are the IDs we used above to retrieve the DEG results (`file_uuid`).

In [26]:
study_space_uuid <- "40df6403-29f0-4b45-ab7d-f46d420c422e"
title <- paste("VRd TEA-seq Hallmark Bor No-PSM GSEA Analysis", Sys.Date())

In [27]:
out_list <- as.list(out_file)

In [28]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = file_uuid,
    store = "project",
    doPrompt = FALSE
)

In [29]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] msigdbr_7.5.1 fgsea_1.26.0  dplyr_1.1.3   purrr_1.0.2   hise_2.16.0  

loaded via a namespace (and not attached):
 [1] utf8_1.2.3          generics_0.1.3      bitops_1.0-7       
 [4] lattice_0.21-8      digest_0.6.33       magrittr_2.0.3     
 [7] evaluate_0.21       grid_4.3.1          pbdZMQ_0.3-10      
