# Transcriptomics Tutorials
This series of notebooks showcases transcriptomic analysis of gene expression files. The series consists of the following notebooks:
- Notebook 1: Expression Data Transformation
- Notebook 2: Differential Expression Analysis
- Notebook 3: Gene Set Enrichment Analysis
- Notebook 4: Gene Co-Expression Analysis
- Notebook 5: Gene Regulatory Network

# Notebook 5: Gene Regulatory Network

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

In this notebook, we perform Gene Regulatory Network analysis to identify regulator genes and their target genes using the R package GENIE3. The GENIE3 (GEne Network Inference with Ensemble of trees) algorithm predicts Gene Regulatory Networks (GRN) based on regression tree-based (Random Forests or Extra-Trees) methods.

## 1. Preparing your environment

<b>Launch spec:</b> 
- App name: JupyterLab with Python, R, Stata, ML
- Kernel: R
- Instance type: mem1_ssd1_v2_x16
- Cost: < $0.15
- Runtime: ~8 min


<b>Data description:</b> File input for this notebook is a matrix of samples and their respective gene expression counts. This file has the expression counts of 60,483 genes for 60 samples (30 normal, 30 tumor).

<b>Package dependency:</b>

| Package | License | 
| --- | --- |
| tidyverse | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| GENIE3 | <a href="https://cran.r-project.org/web/licenses/GPL-2">GPL-2 </a>, <a href="https://cran.r-project.org/web/licenses/GPL-3">GPL-3 </a> |
| igraph | <a href="https://cran.r-project.org/web/licenses/GPL-2">GPL-2 </a>, <a href="https://cran.r-project.org/web/licenses/GPL-3">GPL-3 </a> |

**Install Packages**

Uncomment the install commands if you are comfortable with each library's license and want to install the libraries and run the parts of the notebook that depend on them.

_Note: Package installation takes ~5 minutes_

In [None]:
# Install GENIE3 from Bioconductor and its parallelization helpers from CRAN
# install.packages("BiocManager")  # Provides BiocManager::install(); skip if already installed
# install.packages("doRNG")        # Reproducible parallel loops, used by GENIE3 when nCores > 1
# install.packages("doParallel")   # Parallel backend for running GENIE3 on multiple cores
# BiocManager::install("GENIE3")   # Package for Gene Regulatory Network analysis
# Install igraph and tidyverse from CRAN
# install.packages("igraph")       # Package for plotting networks
# install.packages("tidyverse")    # Required for data handling

**Declare input and output file names**

In Notebook 1: Expression Data Transformation, we generated a counts matrix file (CPTAC-3_gene_expression_count_matrix.csv) from individual gene expression files and saved it in our project on the DNAnexus platform. Here we declare the name of the input file to download and the name of the output file this notebook will produce.

In [None]:
# Input files
counts_file <- "CPTAC-3_gene_expression_count_matrix.csv"

# Output file
wam_file <- "CPTAC-3_genie3_weighted_adjacency_matrix.csv"

**Download data**

We download the input file using the dx-toolkit CLI command `dx download <file_name>`.

In [None]:
system(paste("dx download", counts_file))
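
When re-running the notebook, it can help to skip the download if the file is already present and to fail loudly if `dx download` returns a non-zero exit status. The following is a minimal sketch under those assumptions; `build_dx_download()` and `download_if_missing()` are hypothetical helper names introduced here for illustration, not part of dx-toolkit.

```r
# Hypothetical helper: build a `dx download` command with a shell-quoted file name
build_dx_download <- function(file_name) {
    paste("dx download", shQuote(file_name, type = "sh"))
}

# Hypothetical helper: download only when the file is not already present locally
download_if_missing <- function(file_name) {
    if (file.exists(file_name)) {
        return(invisible(FALSE))
    }
    # system() returns the exit status of the command (0 means success)
    status <- system(build_dx_download(file_name))
    if (status != 0) {
        stop("dx download failed for ", file_name)
    }
    invisible(TRUE)
}

# Show the command that would be run for the input file of this notebook
build_dx_download("CPTAC-3_gene_expression_count_matrix.csv")
```

Quoting the file name is a defensive choice for names that contain spaces; for the file name used in this notebook, the plain `paste()` call behaves the same way.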

_Note: At this point, we suggest creating a snapshot of the environment for reuse --> DNAnexus/Create Snapshot. Once a snapshot is created, it may be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 2. Load Libraries

In [None]:
library(doRNG)
library(GENIE3)
library(igraph)
library(tidyverse)

## 3. Load Data

In [6]:
# Read in sample expression counts
counts_df <- read_csv(counts_file, show_col_types = FALSE)
colnames(counts_df)[1:5]
dim(counts_df)

## 4. Subset and transform source data

In [None]:
# Set seed for repeatable randomization 
set.seed(200)

# Find all genes with no zero counts in any sample (row minimum > 0)
genes <- column_to_rownames(counts_df, var = "gene") %>%
    apply(1, FUN = min) %>%
    stack() %>%
    rename(ensembl_id = ind) %>%
    filter(values > 0)

# Subset the counts table to genes without zero counts, strip the Ensembl version suffix,
# and then randomly sample 1,000 genes (for the purpose of this tutorial)
counts_matrix <- counts_df %>%
    filter(gene %in% genes$ensembl_id) %>%
    separate(gene, c("gene", NA)) %>%
    column_to_rownames("gene") %>%
    sample_n(1000) %>%
    as.matrix()

head(counts_matrix, 3)
dim(counts_matrix)
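
To make the filtering logic concrete, here is a small base R sketch on a made-up matrix. It assumes gene IDs follow the versioned Ensembl pattern (`ID.version`) and uses `sub()` as a stand-in for the `separate()` call above; the IDs and counts are invented for illustration.

```r
# Toy counts matrix: rows are genes (versioned Ensembl-style IDs), columns are samples.
# The IDs and values are made up for illustration.
toy_counts <- matrix(
    c(5, 3, 8,
      0, 2, 4,
      7, 1, 9),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        c("ENSG00000000001.1", "ENSG00000000002.4", "ENSG00000000003.2"),
        c("sample_a", "sample_b", "sample_c")))

# Keep genes whose minimum count across samples is above zero
keep <- apply(toy_counts, 1, min) > 0
toy_filtered <- toy_counts[keep, , drop = FALSE]

# Drop the version suffix (everything from the first "." onward)
rownames(toy_filtered) <- sub("\\..*$", "", rownames(toy_filtered))

toy_filtered
```

The second gene is removed because one of its samples has a count of 0, which mirrors the `filter(values > 0)` step on the row minima.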

## 5. Infer regulatory network (GENIE3)
<a href="https://bioconductor.org/packages/release/bioc/html/GENIE3.html">GENIE3</a> is an R package for inferring GRNs from gene expression counts data. For additional details regarding modeling and parameter selection, please see this public <a href="https://bioconductor.org/packages/release/bioc/vignettes/GENIE3/inst/doc/GENIE3.html">GENIE3 tutorial</a>.

#### Infer regulatory network and generate weighted adjacency matrix

In [None]:
weight_mat <- GENIE3(
    exprMatrix = counts_matrix,
    nCores = 16,
    returnMatrix = TRUE,
    verbose = FALSE)

head(weight_mat, 3)
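
In the weight matrix, rows correspond to candidate regulators and columns to targets, so entry `[i, j]` scores the link from gene `i` to gene `j`. The base R sketch below reshapes a made-up 3 x 3 matrix into one row per directed link. It is an approximation for illustration only, not GENIE3's own implementation, and dropping self-links here is a choice made for this sketch.

```r
# Toy weight matrix: rows are candidate regulators, columns are targets.
# Gene names and weights are made up for illustration.
toy_weights <- matrix(
    c(0.0, 0.6, 0.1,
      0.2, 0.0, 0.7,
      0.3, 0.4, 0.0),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("g1", "g2", "g3"), c("g1", "g2", "g3")))

# Reshape to long format: one row per (regulator, target) pair
toy_links <- as.data.frame(as.table(toy_weights), stringsAsFactors = FALSE)
colnames(toy_links) <- c("regulatoryGene", "targetGene", "weight")

# Drop self-links and sort by decreasing weight
toy_links <- toy_links[toy_links$regulatoryGene != toy_links$targetGene, ]
toy_links <- toy_links[order(toy_links$weight, decreasing = TRUE), ]
rownames(toy_links) <- NULL

toy_links
```

Note that the matrix is not symmetric: the weight for `g1 -> g2` differs from the weight for `g2 -> g1`, which is why each gene pair yields two candidate edges.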

#### Transform weighted adjacency matrix and create a data.frame of subsetted "edges"

In [None]:
# Create a link list (one row per regulator-target pair) from the weighted adjacency matrix
wam_linked_list <- getLinkList(weightMatrix = weight_mat) %>%
    mutate_if(is.factor, as.character)

# Self-join the link list so that each edge (A -> B) is paired with its reverse (B -> A)
regulatory_join <- inner_join(
    x = wam_linked_list,
    y = wam_linked_list,
    by = c("regulatoryGene" = "targetGene", "targetGene" = "regulatoryGene"))

# Find all cases where the "left" weight is greater than the "right" weight
# for a given pair of edges with identical vertices and rename columns
regulatory_left <- regulatory_join %>%
    filter(weight.x > weight.y) %>%
    select(-weight.y) %>%
    rename(weight = weight.x, regulator = regulatoryGene, target = targetGene)

# Find all cases where the "right" weight is greater than the "left" weight
# for a given pair of edges with identical vertices and rename columns
regulatory_right <- regulatory_join %>%
    filter(weight.x < weight.y) %>%
    select(-weight.x) %>%
    rename(weight = weight.y, target = regulatoryGene, regulator = targetGene)

# Bind rows together
network_df <- regulatory_left %>%
    bind_rows(regulatory_right) %>%
    distinct()
    
head(network_df, 3)
dim(network_df)
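
The effect of this step is that, for every gene pair, only the stronger of the two directed edges is kept. The base R sketch below reproduces that outcome on made-up data, using `merge()` as a stand-in for `inner_join()`. Because the self-join contains each pair in both orientations, a single `weight.x > weight.y` filter is enough in this toy example; ties are dropped.

```r
# Toy link list with both directions for each gene pair (made-up weights)
toy_link_list <- data.frame(
    regulatoryGene = c("g1", "g2", "g1", "g3", "g2", "g3"),
    targetGene     = c("g2", "g1", "g3", "g1", "g3", "g2"),
    weight         = c(0.6, 0.2, 0.1, 0.3, 0.7, 0.4),
    stringsAsFactors = FALSE)

# Pair every edge with its reverse edge (A -> B with B -> A)
paired <- merge(
    x = toy_link_list, y = toy_link_list,
    by.x = c("regulatoryGene", "targetGene"),
    by.y = c("targetGene", "regulatoryGene"),
    suffixes = c(".x", ".y"))

# Keep an edge only when it is stronger than its reverse
toy_network <- paired[paired$weight.x > paired$weight.y,
                      c("regulatoryGene", "targetGene", "weight.x")]
colnames(toy_network) <- c("regulator", "target", "weight")

toy_network
```

Three gene pairs go in with six directed edges, and three edges come out, one per pair.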

## 6. Build network graph

#### Build graph object and identify "hub" regulator genes

In [None]:
# Subset edges by weight, keeping top edges
q_weight <- quantile(network_df$weight, probs = (0.999))
network_subset_df <- filter(network_df, weight > q_weight)

# Create a graph from the subsetted data.frame
net <- graph_from_data_frame(network_subset_df, directed = TRUE)

# Identify "hub" regulator genes: nodes whose out-degree exceeds the 90th percentile
# (stored as out_degree to avoid shadowing igraph's degree() function)
out_degree <- degree(net, mode = "out")
q_degree <- quantile(out_degree, probs = 0.90)
degree_color <- if_else(out_degree > q_degree, "blue", "yellow")
names(degree_color) <- names(out_degree)
hub_genes <- enframe(degree_color) %>% filter(value == "blue") %>% pull(name)

# Get a list of the genes regulated by hub genes
hub_target_df <- network_subset_df %>%
  filter(regulator %in% hub_genes) %>%
  arrange(regulator, desc(weight)) %>%
  rename(regulator_hub = regulator)

head(hub_target_df, 3)
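
The hub definition used here is "out-degree above the 90th percentile of all nodes in the graph". The sketch below illustrates that rule in base R on a made-up edge list, without igraph. It assumes, as in the graph above, that nodes appearing only as targets count with an out-degree of 0.

```r
# Toy edge list (made-up genes): r1 regulates four targets, r2 and r3 regulate one each
toy_edges <- data.frame(
    regulator = c("r1", "r1", "r1", "r1", "r2", "r3"),
    target    = c("t1", "t2", "t3", "t4", "t1", "t2"),
    stringsAsFactors = FALSE)

# Out-degree of every node; nodes that only appear as targets get 0
nodes <- union(toy_edges$regulator, toy_edges$target)
degree_table <- table(factor(toy_edges$regulator, levels = nodes))
toy_out_degree <- setNames(as.integer(degree_table), names(degree_table))

# Hubs are nodes whose out-degree exceeds the 90th percentile
cutoff <- quantile(toy_out_degree, probs = 0.90)
toy_hubs <- names(toy_out_degree)[toy_out_degree > cutoff]

toy_hubs
```

Because the cutoff is a percentile over all nodes, the number of hubs depends on how many target-only nodes the graph contains, not only on the regulators.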

#### Plot network with gene hubs (blue)

In [None]:
plot(
    x = net,
    vertex.size = 5,
    vertex.color = degree_color,
    vertex.label = NA,
    edge.arrow.size = 0,
    edge.width = 2
)

## 7. Upload all edges to project

In [None]:
# Export the weighted adjacency matrix as a link list (all edges, unfiltered)
write_csv(wam_linked_list, file = wam_file)
system(paste("dx upload", wam_file))