# Transcriptomics Tutorials
This series of notebooks showcases transcriptomic analysis of gene expression files. The series consists of the following notebooks:
- Notebook 1: Expression Data Transformation
- Notebook 2: Differential Expression Analysis
- Notebook 3: Gene Set Enrichment Analysis
- Notebook 4: Gene Co-Expression Analysis
- Notebook 5: Gene Regulatory Network


# Notebook 1: Expression Data Transformation
This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

In this notebook, we transform the individual gene expression files into a gene counts matrix with one gene per row and one sample per column.

## 1. Preparing your environment

<b>Launch spec:</b> 
- App name: JupyterLab with Python, R, Stata, ML
- Kernel: R
- Instance type: mem1_ssd1_v2_x16
- Cost: < $0.2
- Runtime: ≈ 5 min


<b>Data description:</b> The file inputs for this notebook are:
1. A set of 60 individual gene expression files stored in the `Input` folder in our project. 
2. A summary file giving the file names and IDs of normal tissue and tumor samples.

<b>Package dependency:</b> 

| Package | License | 
| --- | --- |
| tidyverse | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |

**Install Packages**

Uncomment the install commands if you are comfortable with the library license and want to install the library and run the parts of the notebook that depend on it.

In [None]:
# Install the library tidyverse from CRAN 
# install.packages("tidyverse")

**Declare input and output file names**

We have downloaded individual gene expression files from GDC and saved them to our DNAnexus project. We also created a manifest file containing details of the downloaded files, such as their file IDs, file names, and download dates, and saved it as a CSV in our project. In the cell below, set the names of the files to download and the name of this notebook's output file.

In [None]:
# Input files
pheno_file <- "CPTAC-3_pheno_summary.csv"
input_data_folder <- "Input"

# Output file
counts_file <- "CPTAC-3_gene_expression_count_matrix.csv"

**Download Data**
 
To download content from our project to the local JupyterLab instance, we can use the following dx-toolkit CLI commands:
- `dx download <file_name>`: downloads a file
- `dx download -r <folder_name>`: downloads the contents of a folder recursively

In [None]:
file_download_cmd <- paste("dx download", pheno_file)
system(file_download_cmd)

folder_download_cmd <- paste("dx download -r", input_data_folder)
system(folder_download_cmd)

system(paste("ls",  input_data_folder), intern = TRUE)
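A file or folder name containing spaces or shell metacharacters would break a command built with plain `paste()`. The following is a minimal sketch, not part of the original workflow, that quotes the name with base R's `shQuote()` before the command is passed to `system()`. The helper `build_dx_download_cmd()` is a hypothetical name introduced here for illustration; it is not a dx-toolkit function.

```r
# Hypothetical helper (not part of dx-toolkit): builds a `dx download`
# command string with the target name quoted for a POSIX shell.
build_dx_download_cmd <- function(name, recursive = FALSE) {
    flag <- if (recursive) "-r " else ""
    paste0("dx download ", flag, shQuote(name, type = "sh"))
}

file_cmd <- build_dx_download_cmd("CPTAC-3_pheno_summary.csv")
folder_cmd <- build_dx_download_cmd("Input", recursive = TRUE)
print(file_cmd)
print(folder_cmd)

# system() returns the exit status of the command; a non-zero value
# signals failure. Left commented out so the sketch runs without dx:
# status <- system(file_cmd)
# if (status != 0) stop("Download failed: ", file_cmd)
```

Checking the exit status, as in the commented lines, makes a failed download stop the notebook early instead of surfacing later as a missing-file error.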

**Load Library**

In [None]:
library(tidyverse)

## 2. Load data

In [6]:
summary_df <- read_csv(pheno_file, show_col_types = FALSE)
colnames(summary_df)
dim(summary_df)

Now, let's preview the data with `head()`, which returns the first n rows of the data. Printing the result also shows the column names and the column data types.

In [None]:
head(summary_df, 3)

## 3. Read the input files and form the counts matrix

The STAR gene counts file consists of 4 columns: the gene ID, followed by three count columns that correspond to different strandedness options:

- column 1: gene ID
- column 2: counts for unstranded RNA-seq
- column 3: counts for the 1st read strand aligned with RNA
- column 4: counts for the 2nd read strand aligned with RNA

We will use only the gene ID and unstranded RNA-seq counts columns for this analysis.
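The column selection can be illustrated on a small in-memory example. This is a sketch under stated assumptions: the `#gene` and `unstranded` headers match the names selected later in this notebook, while the other two header names (`stranded_first`, `stranded_second`) and all count values are made up for illustration.

```r
library(readr)
library(dplyr)

# Toy stand-in for one counts file (made-up values; the last two
# header names are assumptions, not taken from the real files).
toy_lines <- c(
    "#gene\tunstranded\tstranded_first\tstranded_second",
    "N_unmapped\t100\t100\t100",
    "ENSG00000000003.15\t25\t12\t13",
    "ENSG00000000005.6\t4\t1\t3"
)
toy_text <- paste0(paste(toy_lines, collapse = "\n"), "\n")

# I() tells readr to treat the string as literal data rather than a path
toy_counts <- read_tsv(I(toy_text), show_col_types = FALSE)

# Keep only the gene ID and the unstranded counts, with simpler names
toy_selected <- toy_counts %>%
    select(gene = `#gene`, value = unstranded)

print(toy_selected)
```

Rows such as `N_unmapped` are still present at this stage; they are summary rows rather than genes, and are dropped later by the `ENSG` filter.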

#### Read in the individual gene expression files and join them to form a tibble

In [None]:
# Transform summary file for easy iteration
# Extract normal sample file name and id
nor <- summary_df %>%
    select(normal_file_ids, normal_file_names) %>%
    rename(file_id = normal_file_ids, file_name = normal_file_names)

# Extract tumor sample file name and id
tum <- summary_df %>%
    select(primary_tumor_file_ids, primary_tumor_file_names) %>%
    rename(file_id = primary_tumor_file_ids, file_name = primary_tumor_file_names)

# Append normal and tumor file name and ids
samples <- nor %>%
    bind_rows(tum)
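Before reading the files, it can help to confirm that every file named in the manifest was actually downloaded. The sketch below uses a hypothetical helper, `find_missing_files()`, and a temporary folder with made-up file names; neither is part of the original notebook.

```r
# Hypothetical helper: returns the entries of `file_names` that are
# not present in `folder`.
find_missing_files <- function(file_names, folder) {
    file_names[!file.exists(file.path(folder, file_names))]
}

# Demonstration with a temporary folder and made-up file names
demo_folder <- file.path(tempdir(), "demo_input")
dir.create(demo_folder, showWarnings = FALSE)
file.create(file.path(demo_folder, "sample_a.tsv"))

missing_files <- find_missing_files(c("sample_a.tsv", "sample_b.tsv"),
                                    demo_folder)
print(missing_files)
```

In this notebook, the equivalent call would be `find_missing_files(samples$file_name, input_data_folder)`, run before the working directory is changed; an empty result means all files are in place.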

In [None]:
setwd(input_data_folder)

In [None]:
# Initialize data structure
tb_counts_long <- tibble()

# Iterate over sample data, and read in both 
# tumor and normal samples to create a "long" tibble
for (i in seq_len(nrow(samples))) {
    # Read in file
    tb_tmp <- read_tsv(file = samples$file_name[i],
                    col_names = TRUE,
                    show_col_types = FALSE) %>%
        select("#gene", "unstranded") %>%
        rename(gene = "#gene", value = "unstranded") %>%
        mutate(id = samples$file_id[i])
    
    # Add file contents to existing data structure
    tb_counts_long <- tb_counts_long %>%
        bind_rows(tb_tmp)
}
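Growing a tibble inside a loop copies the accumulated data on every iteration. An alternative sketch, shown here on toy files with made-up values, reads each file into a list with `purrr::map2()` and binds the list once. The helper `read_one_sample()` and the `demo_*` names are hypothetical and used only for illustration.

```r
library(readr)
library(dplyr)
library(purrr)

# Write two small toy counts files to a temporary folder (made-up values)
demo_dir <- file.path(tempdir(), "demo_counts")
dir.create(demo_dir, showWarnings = FALSE)
demo_samples <- tibble(
    file_id = c("id_1", "id_2"),
    file_name = c("s1.tsv", "s2.tsv")
)
for (i in seq_len(nrow(demo_samples))) {
    write_tsv(
        tibble(`#gene` = c("ENSG_A", "ENSG_B"), unstranded = c(i, 10 * i)),
        file.path(demo_dir, demo_samples$file_name[i])
    )
}

# Hypothetical helper: reads one file and tags its rows with the sample ID
read_one_sample <- function(file_name, file_id, folder) {
    read_tsv(file.path(folder, file_name), show_col_types = FALSE) %>%
        select(gene = `#gene`, value = unstranded) %>%
        mutate(id = file_id)
}

# Read every file into a list, then bind the list a single time
demo_long <- map2(demo_samples$file_name, demo_samples$file_id,
                  read_one_sample, folder = demo_dir) %>%
    bind_rows()

print(demo_long)
```

For the 60 files in this notebook the difference in speed is likely small, so the loop remains a reasonable choice; the list-based form mainly pays off with larger sample counts.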

In [None]:
# Convert "long" tibble to "wide" tibble
# Remove the rows that don't have counts of genes, 
# i.e. remove any rows where the gene name does not start with "ENSG"
tb_counts_wide <- tb_counts_long %>%
    spread(id, value) %>%
    filter(str_detect(gene, "^ENSG"))

head(tb_counts_wide)
dim(tb_counts_wide)
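The long-to-wide step can be checked on a toy example. The sketch below uses made-up values and `pivot_wider()`, which the tidyr documentation recommends over the superseded `spread()`; it illustrates the reshaping and the `ENSG` filter and is not a replacement for the cell above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy "long" tibble: one row per gene per sample (made-up values).
# N_unmapped stands in for the summary rows that are not genes.
toy_long <- tibble(
    gene = rep(c("N_unmapped", "ENSG00000000003.15", "ENSG00000000005.6"),
               times = 2),
    value = c(100, 25, 4, 90, 30, 0),
    id = rep(c("sample_1", "sample_2"), each = 3)
)

# One column per sample, then keep only rows whose ID starts with "ENSG"
toy_wide <- toy_long %>%
    pivot_wider(names_from = id, values_from = value) %>%
    filter(str_detect(gene, "^ENSG"))

print(toy_wide)
```

One difference to be aware of: `spread()` orders the new columns by sorting the key values, whereas `pivot_wider()` keeps them in order of first appearance.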

## 4. Export and save the counts matrix
We upload content from our local JupyterLab instance to our project on the DNAnexus platform using the dx-toolkit CLI command `dx upload <file_name>`.

In [None]:
# Export the counts matrix to the CSV file named in counts_file
write_csv(tb_counts_wide, counts_file)
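A quick round-trip check can confirm that the exported file reads back with the expected shape before it is uploaded. This is a minimal sketch on a toy matrix with made-up values, written to a temporary path; the `toy_*` names are introduced here for illustration only.

```r
library(readr)
library(dplyr)

# Toy counts matrix (made-up values)
toy_matrix <- tibble(
    gene = c("ENSG_A", "ENSG_B"),
    sample_1 = c(25, 4),
    sample_2 = c(30, 0)
)

# Write to a temporary file and read it back
toy_path <- file.path(tempdir(), "toy_count_matrix.csv")
write_csv(toy_matrix, toy_path)
toy_reloaded <- read_csv(toy_path, show_col_types = FALSE)

# Compare dimensions and gene IDs between the two versions
same_shape <- identical(dim(toy_reloaded), dim(toy_matrix))
same_genes <- identical(toy_reloaded$gene, toy_matrix$gene)
print(c(same_shape = same_shape, same_genes = same_genes))
```

The same comparison could be applied to `tb_counts_wide` and `counts_file` in this notebook.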

In [None]:
# Upload the counts csv to the project
system(paste("dx upload", counts_file))