RNA-Seq data processing can be broadly divided into two types: 1) reference genome based (most commonly used), and 2) de novo, where no reference genome is available (uncommon, mostly used for novel species). For this course, we will focus solely on reference-based RNA-Seq data analysis. Reference genomes for many species are available through consortia and databases such as NCBI, Ensembl, etc. "Genome builds" are periodically released as more regions and genes are annotated within a genome (eg: hg19, hg38 (https://www.gencodegenes.org/human/)).

After sequencing is completed, you have a collection of sequencing reads for each sample in the form of a *fastq file*. In a reference-based RNA-Seq experiment, each sequencing read from a fastq file is *mapped* against the reference genome. There are several programs/workflows available to map reads to a genome, such as the TopHat suite (https://ccb.jhu.edu/software/tophat/index.shtml), the STAR aligner (https://github.com/alexdobin/STAR) and the StringTie suite (https://ccb.jhu.edu/software/stringtie/). RNA-Seq read aligners are *splicing-aware*, meaning they are able to map reads that span splice junctions, skipping the introns / exons removed by splicing events occurring across RNA transcripts.

Common things to consider after read mapping is completed:
1) What fraction of total reads successfully aligned to the genome? A low fraction usually indicates contamination within your sample.
2) What fraction mapped uniquely to the genome? Since each read is derived from a single RNA molecule, each read should ideally correspond to one unique location within the genome. However, a significant fraction of reads can be expected NOT to map uniquely to the genome, due to the existence of repetitive regions (eg: rRNA). If a read derives from a repetitive region that is present at multiple locations within the genome, it is difficult to trace back the region from which it was sequenced.
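The two fractions above can be computed directly from alignment records. The following is a minimal sketch, not a replacement for the aligner's own summary report. It assumes single-end reads, SAM text input, unmapped reads kept in the file (STAR does this only with `--outSAMunmapped Within`), and the `NH:i` tag that STAR writes to record how many locations a read aligned to. The function name `mapping_summary` is hypothetical.

```python
from collections import Counter


def mapping_summary(sam_lines):
    """Summarise total, mapped and uniquely mapped reads from SAM text lines.

    Assumptions: single-end reads, unmapped reads present in the input,
    and an NH:i tag (number of reported alignments) on mapped records.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:    # secondary/supplementary record:
            continue                        # the read was already counted once
        counts["total"] += 1
        if flag & 0x4:                      # read is unmapped
            continue
        counts["mapped"] += 1
        if "NH:i:1" in fields[11:]:         # exactly one reported alignment
            counts["unique"] += 1
    total = counts["total"]
    return {
        "total": total,
        "mapped_fraction": counts["mapped"] / total if total else 0.0,
        "unique_fraction": counts["unique"] / total if total else 0.0,
    }
```

In practice it is usually easier to read these numbers from the aligner's log (STAR writes them to `Log.final.out`). The sketch only shows what is being counted.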

Once reads have been mapped to their corresponding locations across the genome, we can quantify the number of reads mapping across each gene/transcript/exon. Since the number of reads generated from a gene is directly correlated with the length of the gene and also with the sequencing depth, the read counts within a sample must be normalized. There are several ways to perform read count normalization: 1) RPKM/FPKM (Reads/Fragments Per Kilobase of transcript per Million mapped reads) 2) TPM (Transcripts Per Million) 3) RPM (Reads Per Million mapped reads)
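The three normalizations listed above differ only in what they divide by. Here is a minimal sketch for a single sample, assuming `counts` maps gene name to raw read count and `lengths_bp` maps gene name to gene length in base pairs. Both names are hypothetical, and the total of the assigned counts stands in for "mapped reads".

```python
def rpm(counts):
    """Reads Per Million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads Per Kilobase per Million: corrects for depth and gene length."""
    total = sum(counts.values())
    # 1e9 = 1e6 (per million reads) * 1e3 (per kilobase)
    return {g: c * 1e9 / (total * lengths_bp[g]) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalise first, then scale to 1e6."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    denom = sum(rate.values())
    return {g: r * 1e6 / denom for g, r in rate.items()}


# Toy example with made-up numbers
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}
print(rpm(counts))
print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

Because TPM values always sum to one million within a sample, they are easier to compare across samples than RPKM values, whose sum varies from sample to sample.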

## Some important RNA-Seq terminology :
We will be using these terms throughout the course.

1) Fastq: This is a standardized file format to store sequencing reads for each sample. 

2) Reference genome : A representative set of sequences, assembled through previous studies, that best represents the organization of genes within a genome.

3) Annotation : Each gene within a reference genome is assigned coordinates; eg: chr1:10,000-12,000. Each gene is further *annotated* by defining the location of UTRs, exons and introns within the gene. Commonly used annotations are provided by GENCODE (https://www.gencodegenes.org/human/) and UCSC (https://genome.ucsc.edu/cgi-bin/hgTables).

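4) Alignment / Alignment Tool / Aligners : Alignment is the process of identifying the region within the genome from which a read was derived. This is done by matching the read sequence against the genome sequence to find the best match. Aligners tolerate a small number of mismatches, insertions and deletions, so the match does not need to be perfect.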

5) SAM / BAM files : A SAM file is a standardized file format to store alignment records for each read (https://samtools.github.io/hts-specs/SAMv1.pdf). A BAM file is a compressed binary version of a SAM file. Unlike SAM files, BAM files are not human readable.

6) Gene Expression : Broad term for the number of reads derived from a gene (could be a normalized count).

7) Differential Gene Expression (DGE) : A gene is differentially expressed if its expression is significantly higher/lower in one experimental condition than in another.
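To make the first term above concrete: a FASTQ file stores each read as four lines (an `@` header, the sequence, a `+` separator, and one quality character per base). The sketch below reads such records from an uncompressed file. It assumes the common four-line layout with no wrapped sequences and Phred+33 quality encoding, and the helper names `read_fastq` and `phred33` are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:                 # end of file
            return
        if not header.strip():         # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()              # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # The read ID is the first whitespace-delimited token after '@'
        yield header[1:].split()[0], seq, qual


def phred33(qual):
    """Convert a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in qual]


# Toy input with two made-up reads
example = io.StringIO("@read1 sample\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n")
for read_id, seq, qual in read_fastq(example):
    print(read_id, seq, phred33(qual))
```

For gzip-compressed files, the same function can be given a handle opened with `gzip.open(path, "rt")`.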

### The downloaded SRA files are next processed by following these steps:
1. `fastq-dump` in the SRA toolkit to generate .fastq files
2. `FastQC`<sup>[3](#ref3)</sup> to perform quality control and generate a QC report for the input RNA-Seq data
3. `STAR`<sup>[3](#ref3)</sup> for the read alignment
4. `featureCounts`<sup>[4](#ref4)</sup> for assigning reads to genes
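The steps above can be chained from a small driver script. The following is a sketch only: it assembles the four commands as argument lists using the flags shown in this section, and executes nothing. The function name, the accession `SRR0000000`, and all paths are hypothetical placeholders. The choices `-t exon` and `-g gene_id` for featureCounts are common settings assumed here, not values taken from the tutorial.

```python
import shlex


def build_pipeline_commands(accession, fastq, genome_dir, gtf, out_prefix,
                            qc_dir="../fastQC_output",
                            counts_file="counts.txt", threads=4):
    """Return the four pipeline commands as argument lists (nothing is run)."""
    star = [
        "STAR",
        "--genomeDir", genome_dir,
        "--sjdbGTFfile", gtf,
        "--runThreadN", str(threads),
        "--outFileNamePrefix", out_prefix,
        "--readFilesIn", fastq,
        "--outSAMtype", "BAM", "Unsorted",
        "--outSAMmode", "Full",
    ]
    if fastq.endswith(".gz"):
        # zcat is only needed when the FASTQ file is gzip-compressed
        star += ["--readFilesCommand", "zcat"]
    # STAR names unsorted BAM output "<prefix>Aligned.out.bam"
    bam = out_prefix + "Aligned.out.bam"
    return [
        ["fastq-dump", "-I", "--split-spot", accession],
        ["fastqc", "-o", qc_dir, fastq],
        star,
        ["featureCounts", "-T", str(threads), "-t", "exon", "-g", "gene_id",
         "-a", gtf, "-o", counts_file, bam],
    ]


# Hypothetical accession and paths, for illustration only
commands = build_pipeline_commands(
    accession="SRR0000000",
    fastq="SRR0000000.fastq",
    genome_dir="star_index/",
    gtf="annotation.gtf",
    out_prefix="aligned/SRR0000000_",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the placeholders are replaced with real values. Keeping the commands as lists avoids shell-quoting problems with paths that contain spaces.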

#### 1. Extract fastq reads from SRA files using the NCBI SRA toolkit (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software)

    fastq-dump -I --split-spot < SRA Accession Number >

#### 2. Perform quality control checks on fastq files using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/download.html)

    fastqc -o ../fastQC_output < fastq file >

#### 3. Perform genome alignment for fastq reads

    STAR \
        --genomeDir < path to STAR genome index > \
        --sjdbGTFfile < path to genome gtf file > \
        --runThreadN < # of processors to use > \
        --outFileNamePrefix < output filename prefix, including the path > \
        --readFilesIn < fastq filenames > \
        --readFilesCommand zcat \
        --outSAMtype BAM Unsorted \
        --outSAMmode Full

#### 4. Perform read counting on annotated genes using featureCounts (http://subread.sourceforge.net/)

    featureCounts \
        -T < # of threads > \
        -t < feature type to count reads upon eg: exon, transcript > \
        -g < attribute to group read counts by eg: gene_id, transcript_id > \
        -a < path to genome GTF file > \
        -o < output filename > \
        < list of bam files to perform read counting on >

featureCounts can be run on multiple BAM files at the same time. The results for each BAM file are written to the same output file as tab-separated columns.
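The combined featureCounts table can be loaded with a few lines of Python. This sketch assumes the standard featureCounts layout: a leading comment line starting with `#`, then a header row with `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, followed by one column per BAM file. It also assumes integer counts, so it would need adjusting for fractional counting. The function name `read_featurecounts` and the toy table are hypothetical.

```python
import csv
import io


def read_featurecounts(handle):
    """Parse a featureCounts table into (samples, lengths, counts).

    samples: list of BAM column names
    lengths: {gene: length in bp}
    counts:  {gene: [count for each sample, in column order]}
    """
    data_lines = (ln for ln in handle if not ln.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[6:]               # first six columns describe the gene
    lengths, counts = {}, {}
    for row in rows:
        if not row:                    # skip blank lines
            continue
        gene = row[0]
        lengths[gene] = int(row[5])
        counts[gene] = [int(v) for v in row[6:]]
    return samples, lengths, counts


# Toy table with made-up genes and counts
table = io.StringIO(
    "# Program:featureCounts; Command:...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ta.bam\tb.bam\n"
    "gene1\tchr1\t100\t1099\t+\t1000\t12\t30\n"
    "gene2\tchr2\t500\t2499\t-\t2000\t0\t7\n"
)
samples, lengths, counts = read_featurecounts(table)
print(samples, lengths, counts)
```

The `Length` column is what length-aware normalizations such as RPKM and TPM need, so it is kept alongside the counts.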

## Example Dataset Features : 
The dataset provides a brief tutorial with instructions to process and analyze the data. The commands and programs are also downloadable in the form of a Docker image, so you don't have to install and run each program yourself.

The first publicly available study profiling gene expression changes after Zika virus (ZIKV) infection of human cells was deposited into NCBI's Gene Expression Omnibus (GEO) in March 2016 under accession number GSE78711. The raw data are available from the Sequence Read Archive (SRA) as study SRP070895 (ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP070/SRP070895/). In this study, gene expression was measured by RNA-Seq in duplicate on two platforms: MiSeq and NextSeq [[4]](#ref4). The total number of samples is eight, with four untreated samples and four infected samples.

We first downloaded the raw sequencing files from SRA and then converted them to FASTQ files. Quality control (QC) of the RNA-Seq reads was assessed using FastQC [[5]](#ref5). The reports generated by FastQC are in HTML format and can be accessed through hyperlinks from the IPython notebook. The reads in the FASTQ files were aligned to the human genome with Spliced Transcripts Alignment to a Reference (STAR) [[6]](#ref6). STAR is a leading aligner that accomplishes the alignment step faster and more accurately than other current alternatives [[6]](#ref6). We next applied featureCounts [[7]](#ref7) to assign reads to genes, and then applied the edgeR Bioconductor package [[8]](#ref8) to compute counts per million (CPM) and reads per kilobase per million mapped reads (RPKM).

The next steps are performed in Python within the IPython notebook. We first filtered out genes that are not expressed or lowly expressed. Subsequently, we performed principal component analysis (PCA) (Fig. 1). The PCA plots show that the samples cluster by infected vs. control cells, but also by platform. Next, we visualized the 800 genes with the largest variance using an interactive hierarchical clustering (HC) plot (Fig. 2). This analysis separates the groups of genes that are differentially expressed between infected and control cells from those that differ by platform.
The visualization of the clusters is implemented with an interactive external web-based data visualization tool called clustergrammer (http://amp.pharm.mssm.edu/clustergrammer/). Clustergrammer provides interactive searching, sorting and zoom capabilities.
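The filtering and variance-ranking steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's actual code: the thresholds (`min_cpm=1.0` in at least `min_samples=2` samples) and the `log2(CPM + 1)` transform are hypothetical choices, as are the function names. Every sample is assumed to have at least one counted read.

```python
import math
from statistics import pvariance


def cpm_matrix(counts):
    """counts: {gene: [raw count per sample]} -> {gene: [CPM per sample]}."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total counted reads in each sample
    lib = [sum(v[i] for v in counts.values()) for i in range(n_samples)]
    return {g: [c * 1e6 / lib[i] for i, c in enumerate(v)]
            for g, v in counts.items()}


def filter_expressed(cpm, min_cpm=1.0, min_samples=2):
    """Keep genes reaching min_cpm in at least min_samples samples."""
    return {g: v for g, v in cpm.items()
            if sum(x >= min_cpm for x in v) >= min_samples}


def top_variance_genes(cpm, k):
    """Return the k genes with the largest variance of log2(CPM + 1)."""
    logged = {g: [math.log2(x + 1) for x in v] for g, v in cpm.items()}
    ranked = sorted(logged, key=lambda g: pvariance(logged[g]), reverse=True)
    return ranked[:k]


# Toy counts for four made-up genes across four samples
counts = {
    "g_off": [0, 0, 0, 0],
    "g_flat": [500, 500, 500, 500],
    "g_var": [100, 100, 900, 900],
    "g_big": [999400, 999400, 998600, 998600],
}
expressed = filter_expressed(cpm_matrix(counts))
print(sorted(expressed))
print(top_variance_genes(expressed, k=1))
```

The genes returned by `top_variance_genes` would then be the input to PCA or hierarchical clustering. With real data `k` would be 800, as in the text.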

The next step is to identify the differentially expressed genes (DEG) between the two conditions. This is achieved with a method we developed called the Characteristic Direction (CD) [[9]](#ref9). The CD method is a multivariate method that we have previously demonstrated to outperform other leading methods that compute differential expression between two conditions [[9]](#ref9). Once we have the ranked lists of DEG, we submit them for signature analysis using two tools: Enrichr [[10]](#ref10) and L1000CDS2 [[11]](#ref11). Enrichr queries the up and down gene sets against over 180,000 annotated gene sets belonging to 90 gene set libraries covering pathway databases, ontologies, disease databases, and more [[10]](#ref10). The results from this enrichment analysis confirm that the downregulated genes after ZIKV infection are enriched for genes involved in cell cycle-related processes (Fig. 3a). These genes are enriched for targets of the transcription factors E2F4 and FOXM1 (Fig. 3b). Both transcription factors are known to regulate cell proliferation and play a central role in many cancers. The downregulation of cell cycle genes was already reported in the original publication; nevertheless, we obtained more interesting results for the enriched terms that appeared most significant for the upregulated genes. In particular, the top two terms from the Mouse Genome Informatics (MGI) Mammalian Phenotype Level 4 library are abnormal nervous system (MP0003861) and abnormal brain morphology (MP0002152) (Table S1). This library associates gene knockouts in mice with mammalian phenotypes. These enriched terms list a short set of genes that potentially link ZIKV infection with the concerning observed microcephaly phenotype. Finally, to identify small molecules that can potentially either reverse or mimic ZIKV-induced gene expression changes, we query the ZIKV-induced signatures against the LINCS L1000 data.
For this, we utilize L1000CDS2 [[11]](#ref11), a search engine that prioritizes small molecules given a gene expression signature as input. L1000CDS2 contains 30,000 significant signatures that were processed from the LINCS L1000 data with the CD method. The results suggest small molecules that could be tested in follow-up studies in human cells for potential efficacy against ZIKV (Table S2).
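To give an intuition for what a gene set enrichment query computes, here is a minimal over-representation sketch based on the hypergeometric test. This is a generic, swapped-in technique for illustration. It is not Enrichr's actual scoring, which is not described in this section, and it is unrelated to the CD method. The function names, gene names and gene sets are all hypothetical, and every gene is assumed to belong to a background of known size.

```python
from math import comb


def hypergeom_pvalue(overlap, query_size, set_size, background_size):
    """P(X >= overlap) when query_size genes are drawn at random from a
    background in which set_size genes belong to the annotated gene set."""
    upper = min(query_size, set_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enrich(query_genes, gene_sets, background_size):
    """Rank gene sets by hypergeometric p-value for overlap with the query."""
    query = set(query_genes)
    results = []
    for name, members in gene_sets.items():
        members = set(members)
        overlap = len(query & members)
        p = hypergeom_pvalue(overlap, len(query), len(members), background_size)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Toy example: 10 background genes, a 3-gene query, two made-up gene sets
gene_sets = {"set_A": ["g1", "g2", "g7", "g8"], "set_B": ["g9", "g10"]}
for name, overlap, p in enrich(["g1", "g2", "g3"], gene_sets, background_size=10):
    print(name, overlap, round(p, 4))
```

With thousands of gene sets tested at once, the resulting p-values would also need a multiple-testing correction before being interpreted.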

To ensure the reproducibility of the computational environment used for the whole RNA-Seq pipeline, we packaged all the software components used in this tutorial, including the command line tools, R packages, and Python packages, into a Docker image. This Docker image is publicly available at https://hub.docker.com/r/maayanlab/zika/. The Docker image was created based on the specifications outlined in the official IPython Scipy Stack image (https://hub.docker.com/r/IPython/scipystack/). The additional command line tools, R scripts, and Python packages, together with their dependencies, were compiled and installed into the Docker image. The RNA-Seq pipeline Docker image was deployed onto our Mesos cluster, which allows users to run the IPython notebook interactively. The Docker image can also be downloaded and executed on local computers and servers, or deployed in the cloud if users have access to cloud provider services with Docker Toolbox installed (https://www.docker.com/products/docker-toolbox). We also provide detailed instructions on how to download and execute the Docker image (https://hub.docker.com/r/maayanlab/zika/).

The ‘Dockerization’ of the RNA-Seq pipeline facilitates reproducibility of the pipeline at the software level because the Docker image ensures that all versions of the software components are consistent and static. Dockerization also helps users to handle the complex installation of many dependencies required for the computational pipeline. Moreover, the Docker image can be executed on a single computer, clusters/servers and on the cloud. The only limitation of having a Docker image is that it prevents users from adding or altering the various steps which require additional software components and packages. However, advanced users can build their own Docker images based on our initial image to customize it for their needs.