# Transcriptomics Tutorials
This series of notebooks showcases transcriptomic analysis of gene expression data files. The series consists of the following notebooks:
- Notebook 1: Expression Data Transformation
- Notebook 2: Differential Expression Analysis
- Notebook 3: Gene Set Enrichment Analysis
- Notebook 4: Gene Co-Expression Analysis
- Notebook 5: Gene Regulatory Network

# Notebook 2: Differential Expression Analysis
This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

In this notebook, we compare normal and primary tumor tissue samples from the kidney to identify differentially expressed genes, using the DESeq2 algorithm. DESeq2 performs differential gene expression analysis based on the negative binomial distribution. It expects raw (un-normalized) gene expression counts as input, applies its own normalization internally, and identifies genes that are differentially expressed in one set of samples with respect to another.
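To make the negative binomial assumption concrete, the sketch below simulates counts for a single hypothetical gene in base R. The mean (`mu`) and dispersion (`alpha`) are illustrative values chosen for this example, not estimates from the CPTAC-3 data. Under this model the variance is `mu + alpha * mu^2`, which is larger than the variance a Poisson model would give (`mu`).

```r
# Illustrative only: simulate negative binomial counts for one hypothetical gene.
set.seed(1)
mu    <- 100   # assumed mean count
alpha <- 0.2   # assumed dispersion

# rnbinom() expresses the dispersion as size = 1 / alpha
sim_counts <- rnbinom(n = 10000, mu = mu, size = 1 / alpha)

# Variance implied by the model versus variance seen in the simulation
expected_var <- mu + alpha * mu^2
observed_var <- var(sim_counts)

c(mean = mean(sim_counts),
  observed_var = observed_var,
  expected_var = expected_var)
```

The observed variance should land close to the expected one and well above the mean, which is the overdispersion that count-based RNA-seq models are designed to handle.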

## 1. Preparing your environment

<b>Launch spec:</b> 
- App name: JupyterLab with Python, R, Stata, ML
- Kernel: R
- Instance type: mem1_ssd1_v2_x16
- Cost: < $0.25
- Runtime: ~15 min


<b>Data description:</b> The inputs for this notebook are 
1. A matrix of samples and their respective gene expression counts. This file has the expression counts of 60,483 genes for 60 samples (30 normal, 30 tumor).
2. A summary file giving the file names and IDs of normal tissue and tumor samples.

<b>Package dependency:</b>

| Package | License | 
| --- | --- |
| tidyverse | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| DESeq2 | <a href="https://www.gnu.org/licenses/lgpl-3.0.en.html">LGPL (>= 3)</a> |

**Install Packages**

Uncomment the install commands if you are comfortable with the library licenses and want to install the libraries and run the parts of the notebook that depend on them.

_Note: Package installation takes ~10 minutes_

In [None]:
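# Install the library tidyverse from CRAN
# install.packages("tidyverse")
# Install the library DESeq2 from Bioconductor
# (BiocManager itself comes from CRAN, if it is not already available)
# install.packages("BiocManager")
# BiocManager::install("DESeq2")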
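Before uncommenting the install commands, it can help to see which packages are already present, since installation is the slowest step. The following is a minimal sketch using base R's `requireNamespace()`; the variable names are chosen for this example only.

```r
# Report which of the required packages are not yet installed.
required_pkgs <- c("tidyverse", "DESeq2")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
is_installed <- vapply(
    required_pkgs,
    requireNamespace,
    logical(1),
    quietly = TRUE
)

missing_pkgs <- required_pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
    message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
    message("All required packages are installed.")
}
```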

**Declare input and output file names**

In Notebook 1: Expression Data Transformation, we generated a counts matrix file (CPTAC-3_gene_expression_count_matrix.csv) from individual gene expression files and saved it in our project on the DNAnexus platform. We also have a manifest file (CPTAC-3_pheno_summary.csv) giving the phenotype and file IDs of the samples we are about to analyze. Set the names of the input files to download and the name of the output file this notebook will produce.

In [None]:
# Input files
counts_file <- "CPTAC-3_gene_expression_count_matrix.csv"
pheno_file <- "CPTAC-3_pheno_summary.csv"

# Output file
deseq_results_file <- "CPTAC-3_deseq2_all_genes.csv"

**Download Data**

We download these files using the dx-toolkit CLI command `dx download <file_name>`.

In [None]:
system(paste("dx download", counts_file))
system(paste("dx download", pheno_file))
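`system()` returns the command's exit status rather than raising an R error, so a failed download can go unnoticed until the files are read. As a hedged sketch, the check below confirms that the expected files exist in the working directory; `find_missing_files` is a hypothetical helper written for this example, not part of dx-toolkit.

```r
# Hypothetical helper: return the subset of `files` not found locally.
find_missing_files <- function(files) {
    files[!file.exists(files)]
}

# File names as declared earlier in this notebook
input_files <- c("CPTAC-3_gene_expression_count_matrix.csv",
                 "CPTAC-3_pheno_summary.csv")

missing_files <- find_missing_files(input_files)
if (length(missing_files) > 0) {
    message("Missing input files: ", paste(missing_files, collapse = ", "))
} else {
    message("All input files are present.")
}
```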

_Note: At this point, we suggest creating a snapshot of the environment for reuse (DNAnexus > Create Snapshot). Once a snapshot is created, the object may be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 2. Load Libraries

In [None]:
library(DESeq2)
library(tidyverse)

## 3. Load Data

In [None]:
# Read the counts dataframe
counts_df <- read_csv(counts_file, show_col_types = FALSE)
colnames(counts_df)[1:5]
dim(counts_df)

In [None]:
# Read the summary table
summary_df <- read_csv(pheno_file, show_col_types = FALSE)
colnames(summary_df)
dim(summary_df)
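Later steps assume that every sample ID in the summary table appears as a column of the counts matrix. The sketch below shows one way to check this on toy data. The column names `gene`, `normal_file_ids`, and `primary_tumor_file_ids` follow the ones used in this notebook, while the toy values are made up.

```r
# Toy stand-ins for counts_df and summary_df (values are made up).
toy_counts <- data.frame(
    gene = c("g1", "g2"),
    n1 = c(5, 0), n2 = c(7, 3),
    t1 = c(50, 1), t2 = c(42, 0)
)
toy_summary <- data.frame(
    normal_file_ids = c("n1", "n2"),
    primary_tumor_file_ids = c("t1", "t2")
)

expected_ids <- c(toy_summary$normal_file_ids,
                  toy_summary$primary_tumor_file_ids)
matrix_ids   <- setdiff(colnames(toy_counts), "gene")

# Both differences are empty when the two files agree.
ids_missing_from_matrix  <- setdiff(expected_ids, matrix_ids)
ids_missing_from_summary <- setdiff(matrix_ids, expected_ids)

length(ids_missing_from_matrix) == 0 && length(ids_missing_from_summary) == 0
```

The same comparison can be run on `counts_df` and `summary_df` in place of the toy objects.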

## 4. Subset, annotate, and transform source data


#### Filter out low expression genes
Remove every gene for which at least one sample shows no expression (i.e., an expression value of 0).

In [None]:
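# Keep only genes with no zero counts in any sample:
# `.[] == 0` marks the zero entries, and rowCounts() (available here via the
# packages attached with DESeq2) counts them per row
counts_df <- counts_df %>%
    filter(rowCounts(.[] == 0) == 0) %>%
    column_to_rownames(var = "gene")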

head(counts_df, 3)
dim(counts_df)
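The effect of this filter is easiest to see on a small example. The sketch below applies the same rule in base R with `rowSums()` to a made-up table of three genes; it is an equivalent formulation for illustration, not the code used above.

```r
# Toy counts table: only g1 is expressed in every sample.
toy_counts <- data.frame(
    gene = c("g1", "g2", "g3"),
    s1 = c(10, 0, 4),
    s2 = c(12, 8, 0),
    s3 = c(9, 5, 6)
)

# Count the zero entries per gene across the sample columns
count_cols <- setdiff(colnames(toy_counts), "gene")
n_zero <- rowSums(toy_counts[, count_cols] == 0)

# Keep genes with no zero in any sample
filtered <- toy_counts[n_zero == 0, ]
filtered$gene
```

Note that a single zero in any one sample is enough to drop a gene, so this criterion is strict.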

#### Create phenotype dataframe
The first column of this dataframe is the sample ID and the second column is the phenotype. The phenotype is binary: each sample is either normal or tumor.

In [None]:
nor <- summary_df %>%
  select(normal_file_ids) %>%
  rename(sample_id = normal_file_ids) %>%
  mutate(sample_type = "normal")

tum <- summary_df %>%
  select(primary_tumor_file_ids) %>%
  rename(sample_id = primary_tumor_file_ids) %>%
  mutate(sample_type = "tumor")

pheno_table <- nor %>%
  bind_rows(tum)

head(pheno_table, 3)
dim(pheno_table)

## 5. Run DESeq

#### Build a DESeq dataset and perform analysis

In [None]:
# Sort counts by sample order in phenotype
sample_id <- pheno_table %>% pull(sample_id)
counts_df <- counts_df[, sample_id]

# Specify the sample condition (tumor vs. normal); factor levels sort
# alphabetically, so "normal" becomes the reference level
condition <- as.factor(pheno_table$sample_type)

# Create DESeq object and run DESeq
deseq_object <- DESeqDataSetFromMatrix(
    countData = counts_df,
    colData = DataFrame(condition),
    design = ~ condition)
deseq_dataset <- DESeq(deseq_object)

# Extract results
results <- results(deseq_dataset)
results
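By default, DESeq2 treats the first level of the design factor as the reference, so the direction of `log2FoldChange` depends on the level order. The base R sketch below, on a made-up vector of sample types, shows how the levels are ordered and how `relevel()` can state the reference explicitly.

```r
# Made-up sample types, deliberately not in alphabetical order
sample_type <- c("tumor", "normal", "tumor", "normal")

condition <- as.factor(sample_type)
levels(condition)   # alphabetical order: "normal" "tumor"

# State the reference level explicitly instead of relying on sort order
condition <- relevel(condition, ref = "normal")
levels(condition)[1]
```

With "normal" as the reference, a positive `log2FoldChange` indicates higher expression in tumor samples.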

## 6. Plot results

#### Convert results to a data.frame

In [None]:
results_df <- data.frame(results) %>%
    rownames_to_column(var = "gene")
head(results_df, 3)
dim(results_df)

#### Create plot 

In [None]:
volcano_plot <- ggplot(results_df, aes(x = log2FoldChange, y = -log10(padj))) +
    geom_point(alpha = 0.5, size = 3) +
    labs(x = "log2 fold change", y = "-log10 adjusted p-value") +
    ggtitle("Differential Gene Expression, Volcano Plot")

volcano_plot
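A volcano plot is often read alongside explicit thresholds. The sketch below labels genes in a made-up results table as up, down, or not significant. The cutoffs (`padj_cutoff`, `lfc_cutoff`) are assumptions chosen for illustration, and the "in tumor" labels assume that normal is the reference level. Genes with a missing `padj` are treated as not significant here.

```r
# Made-up results with the same column names as results_df
toy_results <- data.frame(
    gene = c("g1", "g2", "g3", "g4"),
    log2FoldChange = c(2.5, -1.8, 0.3, 3.0),
    padj = c(0.001, 0.02, 0.8, NA)
)

padj_cutoff <- 0.05   # assumed significance threshold
lfc_cutoff  <- 1      # assumed minimum absolute log2 fold change

# A gene is significant only if padj is present and both cutoffs are met
is_sig <- !is.na(toy_results$padj) &
    toy_results$padj < padj_cutoff &
    abs(toy_results$log2FoldChange) > lfc_cutoff

toy_results$direction <- ifelse(
    !is_sig, "not significant",
    ifelse(toy_results$log2FoldChange > 0, "up in tumor", "down in tumor")
)
toy_results
```

The resulting `direction` column could be mapped to point color in the plot, if desired.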

## 7. Export tabular results for all genes to the platform

In [None]:
write_csv(results_df, file = deseq_results_file)
system(paste("dx upload", deseq_results_file))