# Transcriptomics Tutorials
This series of notebooks showcases transcriptomic analysis of gene expression files. The series consists of the following notebooks:
- Notebook 1: Expression Data Transformation
- Notebook 2: Differential Expression Analysis
- Notebook 3: Gene Set Enrichment Analysis
- Notebook 4: Gene Co-Expression Analysis
- Notebook 5: Gene Regulatory Network

# Notebook 3: Gene Set Enrichment Analysis

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

In this notebook, we will use `WebGestalt` (WEB-based GEne SeT AnaLysis Toolkit) through its R package, <a href="https://cran.r-project.org/web/packages/WebGestaltR/index.html">WebGestaltR</a>, to perform Gene Set Enrichment Analysis (GSEA).

## 1. Preparing your environment

<b>Launch spec:</b> 
- App name: JupyterLab with Python, R, Stata, ML
- Kernel: R
- Instance type: mem1_ssd1_v2_x16
- Cost: < $0.25
- Runtime: ~15 min


<b>Data description:</b> The input for this notebook is a table of differential expression analysis (DESeq2) results produced in Notebook 2. The file contains the DESeq2 result columns for 26,260 genes.


<b>Package dependency:</b>

| Package | License | 
| --- | --- |
| tidyverse | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| WebGestaltR | <a href="https://cran.r-project.org/web/licenses/LGPL-2">LGPL-2 </a>, <a href="https://cran.r-project.org/web/licenses/LGPL-2.1">LGPL-2.1 </a>, <a href="https://cran.r-project.org/web/licenses/LGPL-3">LGPL-3 </a> |
| org.Hs.eg.db | <a href="https://opensource.org/licenses/Artistic-2.0">artistic-2.0 </a> |

**Install Packages**

Uncomment the install commands if you are comfortable with the library licenses and want to install the libraries and run the parts of the notebook that depend on them.

_Note: Package installation takes ~10 minutes_

In [None]:
# Update and install dependencies for WebGestaltR
system("apt-get update")
system("apt-get install -y libfontconfig1-dev")

In [None]:
# Install the library WebGestaltR from CRAN (Package for GSEA)
# install.packages("WebGestaltR")         
# Install the library tidyverse from CRAN (Required for data handling)
# install.packages("tidyverse")           
# Install the library org.Hs.eg.db from Bioconductor (Human-specific annotation package; requires the BiocManager package)
# BiocManager::install("org.Hs.eg.db")    

**Declare input and output file names**

In Notebook 2: Differential Expression Analysis, we used DESeq2 to perform differential expression analysis and saved the results on the DNAnexus platform. Here we declare the name of the input file to download and the name of the output file this notebook will produce.

In [None]:
# Input file
de_file <- "CPTAC-3_deseq2_all_genes.csv"

# Output file
enrichment_file <- "CPTAC-3_top_GO_terms_differential_expression.csv"

**Download Data**

We download the input file using the `dx` command-line client from dx-toolkit: `dx download <file_name>`.

In [None]:
system(paste("dx download", de_file))
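
`system()` returns the command's exit status instead of raising an error, so a failed download would otherwise surface only later, when the file is read. A minimal sketch of a stricter wrapper follows; `run_or_stop` is a hypothetical helper name and is not part of the original notebook or of dx-toolkit.

```r
# Hypothetical helper: run a shell command and stop if it exits non-zero.
run_or_stop <- function(cmd) {
    status <- system(cmd)  # system() returns the exit status (0 = success)
    if (status != 0) {
        stop("Command failed with exit status ", status, ": ", cmd)
    }
    invisible(status)
}

# Possible usage in this notebook (assumes the dx CLI is available, as above):
# run_or_stop(paste("dx download", de_file))
# stopifnot(file.exists(de_file))
```

The same wrapper could be applied to the `dx upload` call at the end of the notebook.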

_Note: At this point, we suggest creating a snapshot of the environment for reuse (DNAnexus/Create Snapshot). Once a snapshot is created, it can be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 2. Load Libraries

In [None]:
library("WebGestaltR")
library("org.Hs.eg.db")
library("tidyverse")

## 3. Load, transform, and filter data

In [14]:
# Read in differential gene expression analysis results
gene_set <- read_csv(
    file = de_file,
    show_col_types = FALSE,
    name_repair = "minimal")
colnames(gene_set)
dim(gene_set)

# Filter by log2 fold change and adjusted p-value
# Sort by p-value (most significant first) and assign rank
sig_gene_set <- gene_set %>%
    rename(ensembl_id = 1) %>%
    filter(log2FoldChange > 2) %>%
    filter(padj < 1.0e-10) %>%
    arrange(pvalue) %>%
    separate(ensembl_id, c("ensembl_gene_id", NA)) %>%
    mutate(rank = row_number())
    
head(sig_gene_set, n = 2)
tail(sig_gene_set, n = 2)
dim(sig_gene_set)

ensembl_gene_id,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,rank
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
ENSG00000135245,4759.353,5.962794,0.1381325,43.16722,0,0,1
ENSG00000185633,10623.994,7.803186,0.1456195,53.58615,0,0,2


ensembl_gene_id,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,rank
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
ENSG00000268658,51.50565,2.539657,0.3767215,6.741469,1.567931e-11,6.480948e-11,886
ENSG00000152207,675.44134,2.34944,0.350062,6.711496,1.926392e-11,7.902366e-11,887


By filtering the data based on `log2FoldChange` and `padj`, we reduce the number of genes from around 26k to under 1k (887 in the output shown above).
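
The `separate()` call above splits the first column at non-alphanumeric characters and keeps only the first piece. Assuming the input identifiers carry an Ensembl version suffix (the IDs and version numbers below are invented for illustration), this leaves the stable gene ID. A base R sketch of the same filter, sort, and rank steps on toy data, written without tidyverse so it runs on its own:

```r
# Toy stand-in for the DESeq2 table; all values are invented for illustration.
toy <- data.frame(
    ensembl_id     = c("ENSG00000000001.1", "ENSG00000000002.12",
                       "ENSG00000000003.5", "ENSG00000000004.3"),
    log2FoldChange = c(5.9, 1.2, 2.5, 7.8),
    pvalue         = c(1e-40, 1e-30, 1e-12, 1e-50),
    padj           = c(1e-38, 1e-28, 5e-11, 1e-48),
    stringsAsFactors = FALSE
)

# Keep strongly up-regulated, highly significant genes.
# which() drops rows where the condition is NA, as dplyr::filter() does.
keep <- which(toy$log2FoldChange > 2 & toy$padj < 1e-10)
sig <- toy[keep, ]

# Sort by p-value (most significant first) and assign rank
sig <- sig[order(sig$pvalue), ]
sig$rank <- seq_len(nrow(sig))

# Drop everything from the first "." onward to get the stable gene ID
sig$ensembl_gene_id <- sub("\\..*$", "", sig$ensembl_id)

sig[, c("ensembl_gene_id", "log2FoldChange", "padj", "rank")]
```

Three of the four toy genes pass both thresholds, and the gene with the smallest p-value receives rank 1.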

In [None]:
# Create input for GSEA
gsea_gene_set <- sig_gene_set %>% select(ensembl_gene_id, rank)
head(gsea_gene_set, 3)
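
WebGestaltR's GSEA mode takes a two-column table of gene identifiers and scores, which is why only `ensembl_gene_id` and `rank` are kept. Before submitting, it can help to confirm that the table has no missing or duplicated identifiers, since stripping version suffixes could in principle collapse two rows onto one ID. A sketch of such a check, where `check_gsea_input` is a hypothetical helper and not part of WebGestaltR:

```r
# Hypothetical helper: basic sanity checks on a two-column (ID, score) table.
check_gsea_input <- function(df) {
    stopifnot(is.data.frame(df), ncol(df) == 2)
    ids <- df[[1]]
    scores <- df[[2]]
    problems <- character(0)
    if (anyNA(ids) || anyNA(scores)) {
        problems <- c(problems, "missing values")
    }
    if (anyDuplicated(ids) > 0) {
        problems <- c(problems, "duplicated gene IDs")
    }
    if (!is.numeric(scores)) {
        problems <- c(problems, "non-numeric scores")
    }
    problems  # an empty character vector means no problems were found
}

# Toy examples with invented identifiers
ok  <- data.frame(ensembl_gene_id = c("ENSG01", "ENSG02"), rank = 1:2)
dup <- data.frame(ensembl_gene_id = c("ENSG01", "ENSG01"), rank = 1:2)
check_gsea_input(ok)
check_gsea_input(dup)
```

In this notebook the call would be `check_gsea_input(gsea_gene_set)`; the helper indexes columns by position, so it accepts tibbles as well as plain data frames.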

## 4. WebGestaltR

In [None]:
# We can select from a variety of databases for enrichment analysis
listGeneSet() %>% filter(str_detect(name, '^geneontology'))

In [None]:
# Run WebGestaltR function while suppressing warnings
enrichment_result <- suppressWarnings(WebGestaltR(
    enrichMethod = "GSEA",
    organism = "hsapiens",
    enrichDatabase = "geneontology_Biological_Process",
    interestGene = gsea_gene_set,
    interestGeneType = "ensembl_gene_id",
    sigMethod = "fdr",
    fdrThr = 0.1,
    minNum = 3,
    maxNum = 500,
    perNum = 1000,
    isOutput = FALSE,
    nThreads = 16,
    isParallel = TRUE
))
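
Because warnings are suppressed in the call above, a run in which no gene set passes the FDR threshold can go unnoticed. As an assumption worth verifying against the WebGestaltR documentation, the function may then return `NULL` rather than an empty table, which would make the `arrange()` step in the next section fail. A small defensive sketch, with `has_results` as a hypothetical helper name:

```r
# Hypothetical helper: TRUE only for a data frame with at least one row.
has_results <- function(x) {
    is.data.frame(x) && nrow(x) > 0
}

# Toy checks with stand-in values (not real WebGestaltR output)
has_results(NULL)
has_results(data.frame(geneSet = character(0)))
has_results(data.frame(geneSet = "example_set"))

# Possible usage in this notebook:
# if (!has_results(enrichment_result)) {
#     stop("No enriched gene sets returned; consider relaxing fdrThr")
# }
```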

## 5. Review results and upload to project

In [None]:
# Review results
enrichment_result <- enrichment_result %>%
    arrange(FDR) %>%
    select(geneSet, description, enrichmentScore, normalizedEnrichmentScore,
           pValue, FDR, size, userId, link)

select(enrichment_result, -userId) %>% head(5)

In [None]:
# Export the data and save it to our project
write_csv(enrichment_result, file = enrichment_file)
system(paste("dx upload", enrichment_file))