# RNAseq in a nutshell: From FASTQ files to differential expression
## Part II

Last session we learned how to go from sequencing reads, stored in FASTQ format, to transcript and gene counts. Remember that Salmon is one of several methods for doing that. In this session, we use the DESeq2 package to perform some exploratory analysis, examining the relationship between the samples, and then calculate differential expression between sample attributes, for example treatment vs. non-treatment, or the source from which the sample was obtained.

RNAseq pipelines are composed of several steps, usually including sample normalization, filtering, variance transformation and statistical modeling to determine differential expression.

First, let’s load the data and the required packages. We have already provided you with a gene-level SummarizedExperiment object that contains the full airway dataset.

**Q1**: Run the code below to load the DESeq2 and tidyverse packages and the object `gse` into your workspace.


In [None]:
library('DESeq2')
library('tidyverse')
load('gse.RData')

As a reminder, `gse` is a SummarizedExperiment object, a special class designed to store the results of RNAseq experiments. In this case, `gse` stores the output of running the Salmon pipeline. The results of many other quantification and mapping tools (e.g. kallisto, RSEM, STAR) can also be loaded into a SummarizedExperiment object, which can then be used as input to differential expression pipelines other than DESeq2 (e.g. edgeR).

**Q2**: Take a quick look at `gse`. Use `head` to get a summary of the data stored in `gse`, and the `colData` command to look at the metadata associated with each sample.

In [None]:
colData(gse)

When running the `colData` command, you will notice that the data type for sample attributes is `factor`. This is a very convenient data type used in R to define categorical variables. Factors support a range of operations; for example, they have predefined levels that can be renamed and reordered.
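
To get a feel for how factor levels behave before touching `gse`, here is a small sketch on a toy vector. The labels and the variable names (`condition`, `renamed`, `releveled`) are made up for illustration and are not part of the airway object.

```r
# Toy sample attribute; the labels are invented for illustration.
condition <- factor(c("Untreated", "Dexamethasone", "Untreated", "Dexamethasone"),
                    levels = c("Untreated", "Dexamethasone"))

levels(condition)   # the level order is what matters when renaming
table(condition)    # number of samples per level

# Assigning to levels() replaces the labels position by position,
# so always check the current order first.
renamed <- condition
levels(renamed) <- c("untrt", "trt")
renamed

# relevel() changes which level comes first (the reference level).
releveled <- relevel(renamed, ref = "trt")
levels(releveled)
```

Because renaming is positional, printing `levels()` before assigning new labels is a cheap way to avoid silently swapping your groups.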

**Q3**: Use the code below to rename the levels of the `condition` factor to more convenient labels.

In [None]:
levels(gse$condition)
levels(gse$condition) <- c("untrt", "trt")
colData(gse)

**Normalization:**<br>
The number of read counts detected for each gene will need to be normalized for three major factors:<br>
* Library size – unequal mixing of samples for sequencing, or sequencing samples on different NGS runs, will result in differences in the total number of reads called for each sample (this is commonly referred to as sequencing depth).
* Gene length – the longer the gene, the more RNA fragments we are likely to identify from it.
* Library composition – early normalization methods (e.g. FPKM, RPKM) did not correct for this. Methods that take it into account, such as the one used by DESeq2, were introduced later. We will discuss the problem of library composition and how the DESeq2 pipeline accounts for it below.

**Q4**: First, let’s compare the total number of reads between samples. Recall that our object `gse` contains several assays: `counts` with the raw counts for each gene, `abundance` with normalized counts (TPM) and `length` containing the estimated length for each gene/transcript. Use the code below to summarize the total `counts` (in millions) across samples. Add code to summarize `abundance`.


In [None]:
colSums(assays(gse)$counts) / 1e6

The `abundance` assay contains TPM (Transcripts Per Million) values: counts normalized by gene length and library size. This is why the columns of the `abundance` assay all sum to the same amount (one million). Earlier normalization methods, like RPKM (Reads Per Kilobase of transcript, per Million mapped reads), are very similar to TPM, except that they normalize first for library size and then for gene length, so the per-sample totals differ slightly between samples. TPM is considered a slightly improved metric.
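
The difference in the order of operations is easier to see on a tiny example. The sketch below uses an invented count matrix and invented gene lengths (`counts`, `length_kb` are illustrative names, not part of `gse`) and computes both metrics by hand in base R.

```r
# Toy raw counts: 3 genes (rows) x 2 samples (columns); all numbers are invented.
counts <- matrix(c(100, 400, 500,
                   300, 900, 1800),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
length_kb <- c(geneA = 1, geneB = 2, geneC = 5)   # gene lengths in kilobases

# RPKM: scale by library size (in millions) first, then by gene length.
rpm  <- sweep(counts, 2, colSums(counts) / 1e6, "/")
rpkm <- rpm / length_kb

# TPM: scale by gene length first, then rescale each sample to sum to one million.
rate <- counts / length_kb
tpm  <- sweep(rate, 2, colSums(rate) / 1e6, "/")

colSums(tpm)    # identical across samples by construction
colSums(rpkm)   # differs between samples
```

Since every TPM column has the same total, a TPM value can be read as a share of the sample, which is not true for RPKM.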

*Library composition:*<br>
TPM does not take library composition into account. To understand this problem, assume we are comparing two samples, A and B, where the gene HSF1 is very highly expressed in sample A while completely absent from sample B. In sample A, HSF1 will “soak up” a large share of the reads, leaving fewer reads for all other genes; in sample B those same reads are distributed across the other genes. After normalizing for library size alone, the other genes will therefore falsely appear to be more highly expressed in sample B. DESeq2 addresses this issue by using *median ratio normalization*. We will not get into the details here, but just remember that the scaling factors (size factors) calculated as part of running the full DESeq2 pipeline below are a result of this normalization.
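
For the curious, the idea behind median ratio normalization can be sketched in a few lines of base R. This is a simplified illustration on invented counts that mirror the HSF1 thought experiment, not DESeq2's actual code; the variable names are our own.

```r
# Toy counts: sample B has the same composition as A at twice the depth,
# except that HSF1 is very high in A and absent from B. Numbers are invented.
counts <- cbind(A = c(gene1 = 100, gene2 = 200, gene3 = 50, gene4 = 400, HSF1 = 5000),
                B = c(200, 400, 100, 800, 0))

# 1. Per-gene log geometric mean across samples (a pseudo-reference sample).
log_geo_mean <- rowMeans(log(counts))

# 2. Genes with a zero count in any sample have a -Inf mean and are skipped.
usable <- is.finite(log_geo_mean)

# 3. Size factor = median, over usable genes, of the ratio to the reference.
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col[usable]) - log_geo_mean[usable]))
})
size_factors

# Library sizes suggest A is the deeper sample, driven entirely by HSF1.
colSums(counts)

# After dividing by the size factors, gene1..gene4 agree between A and B.
normalized <- sweep(counts, 2, size_factors, "/")
normalized
```

Because the median is taken over genes, a handful of extreme genes such as HSF1 barely move the size factor, which is what makes the method robust to library composition.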

**Q6**: To start using the DESeq2 pipeline, use the code below to construct a DESeqDataSet object from the SummarizedExperiment object. The parameter `design` specifies the comparisons that DESeq2 will perform later when we start comparing samples. Note that the initial `dds` object, generated by running the code below, has a similar structure to a SummarizedExperiment object.


In [None]:
dds <- DESeqDataSet(gse, design = ~ donor + condition)
head(dds)

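When building the DESeqDataSet object we need to define the experimental design. In the case above we have used `~ donor + condition`, which tells the algorithm to model the counts as a function of both variables, so that we can test the effect of `condition` while controlling for differences between donors.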
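
The design formula uses R's standard formula syntax, so we can peek at what it means with base R's `model.matrix()`. The sample table below is hypothetical (two donors, two conditions) and only serves to show which coefficients the formula creates.

```r
# Hypothetical sample table: 2 donors x 2 conditions (labels are illustrative).
samples <- data.frame(
  donor     = factor(c("d1", "d1", "d2", "d2")),
  condition = factor(c("untrt", "trt", "untrt", "trt"), levels = c("untrt", "trt"))
)

# Same formula syntax as the `design` argument.
mm <- model.matrix(~ donor + condition, data = samples)
mm

# One column per estimated coefficient: a baseline, a donor effect
# and a treatment effect relative to the reference level "untrt".
colnames(mm)
```

Note that the reference level of each factor (its first level) is absorbed into the intercept, which is one reason the level order of `condition` matters.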

**Filtering:**<br>
Our current dataset contains many rows (genes) that are covered by very few read counts. It is always a good idea to filter out non-informative rows and reduce the size of our dataset.

**Q7**: Use the code below to filter out all genes that have fewer than 10 reads in total across all samples. How many genes have been filtered out?


In [None]:
nrow(dds)
keep <- rowSums(counts(dds)) >= 10  # keep genes with at least 10 reads in total
dds <- dds[keep,]
nrow(dds)

**Variance stabilizing transformations:**<br>
In most RNAseq datasets we will notice a relationship between the mean and the variance: relative to their mean (i.e. on a log scale), lowly expressed genes show higher variance. This is simply a sampling problem. The higher the expression level of a gene, the more reads we will identify for it, resulting in a more accurate estimate of its expression level.
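
A quick simulation makes the sampling argument concrete. The sketch below assumes Poisson sampling, which is a simplification (real RNAseq counts are overdispersed), and the two expression levels are arbitrary.

```r
set.seed(1)   # fixed seed so the simulation is reproducible

# Simulate 1000 technical replicates of a lowly and a highly expressed gene.
# Poisson sampling is a simplifying assumption made for this illustration.
low  <- rpois(1000, lambda = 5)
high <- rpois(1000, lambda = 5000)

# On the raw scale the highly expressed gene varies more in absolute terms...
c(low = sd(low), high = sd(high))

# ...but on the log2 scale (pseudocount of 1 to avoid log2(0)) the lowly
# expressed gene is much noisier.
c(low = sd(log2(low + 1)), high = sd(log2(high + 1)))
```

This is the same pattern you should expect to see as a wide "fan" at the low end of a log-log scatter plot of two samples.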

**Q8**: To examine this problem, use the code below to plot the log2 TPM values between two samples.


In [None]:
tpm_mat <- as.data.frame(assays(dds)$abundance)
ggplot(data = tpm_mat, aes(x = SRR1039508, y = SRR1039509)) +
  geom_point(size = 0.5) +
  scale_x_continuous(trans = 'log2') +
  scale_y_continuous(trans = 'log2')

This increased variance at lower expression values will introduce two main artifacts:<br>
(1) It will be harder to identify significant differentially expressed genes at lower expression levels.<br>
(2) When comparing samples to each other using the whole transcriptome, lowly expressed genes will have a disproportionate impact on these comparisons.

DESeq2 offers two methods for variance stabilization that can be run outside of the main DESeq command (which we will run below):
* VST - variance stabilizing transformation
* rlog - regularized-logarithm transformation

Both methods will have a similar effect to a standard log2 transform for high expression values, but will have a more dramatic effect on lowly expressed genes by “shrinking” the values towards the middle.

**Q9**: Run the code below to generate two new objects containing transformed values using these two methods.


In [None]:
vsd <- vst(dds, blind = FALSE)
head(vsd)
rld <- rlog(dds, blind = FALSE)
head(rld)

We use the flag `blind = FALSE` to make sure that the sample labels (the experimental design) are taken into account when performing these transformations. Note that when looking at the `vsd` and `rld` objects, we will only see one assay containing the transformed values. Those values can be accessed simply with `assay(vsd)` or `assay(rld)`.

Now let’s use our tidyverse skills to examine the effect of these transformations!

**Q10**: Complete the code below to construct a new *tidy* data frame with values from the two transformation methods and the non-transformed TPM values.

In [None]:
count_mat <- bind_rows(as_tibble(log2(assays(dds)$abundance)) %>% mutate(transformation = "none"),
                       as_tibble(assay(vsd)) %>% mutate(transformation = "vsd"),
                       ____________________________________________________________)

**Q11**: Repeat the plot from Q8 on `count_mat`. Remove the log scaling of the axes, since all three sets of values are already on a log2 scale (`vst` and `rlog` return log2-scale values, and the TPM values were log2-transformed in Q10), and use the `transformation` column as the facet variable in ggplot.

Note how both methods reduced the inflated fold changes associated with low expression values.

**Examining the similarity between samples using dimensionality reduction:**<br>
After normalization we want to explore the relationship between samples by comparing whole transcriptome profiles. We do this to check that replicate samples are similar to each other, that there are no major batch effects (e.g. all samples processed on the same day tend to cluster together) and to identify the parameters that are responsible for the largest differences between samples. There are two general ways to do this: clustering or dimensionality reduction. We will cover dimensionality reduction in more detail during the single cell RNAseq part of this course. Here we will use Principal Component Analysis (PCA), which is a popular and simple method for dimensionality reduction.

PCA - Imagine each of our samples had only 3 genes. In this case we could make a 3D plot using each gene's expression level as an axis. In such a plot, samples with similar expression profiles should appear together, while samples with different expression profiles will be separated. Our samples contain expression information about almost 20,000 genes. As we cannot plot each gene on a separate axis, PCA uses a simple mathematical transformation to project the data onto 2-3 dimensions. In brief, PCA calculates principal components (PCs), which are new axes built as combinations of all genes, each chosen to preserve the maximum amount of the remaining variability in the data. Each PC is reported together with the percentage of the variability it captures. You can read more about PCA [online](https://en.wikipedia.org/wiki/Principal_component_analysis).
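
To see what a PCA returns, here is a minimal sketch using base R's `prcomp()` on simulated data. The matrix is random noise with an artificial group shift; `expr` and `pct_var` are illustrative names and nothing here comes from the airway dataset.

```r
set.seed(42)

# Toy "expression" matrix: 6 samples (rows) x 50 genes (columns) of random noise,
# with a shift added to the first 10 genes in samples 4-6 to mimic a treatment effect.
expr <- matrix(rnorm(6 * 50), nrow = 6,
               dimnames = list(paste0("sample", 1:6), paste0("gene", 1:50)))
expr[4:6, 1:10] <- expr[4:6, 1:10] + 5

# prcomp() expects samples in rows; each gene is centered by default.
pca <- prcomp(expr)

# Percentage of the total variance captured by each principal component.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(pct_var, 1)

# Sample coordinates on the first two PCs; the two groups separate along PC1.
pca$x[, 1:2]
```

With a matrix that has genes in rows, as returned by `assay(rld)`, you would transpose it with `t()` before passing it to `prcomp()`.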

**Q12**: Use the code below to run a PCA on your rlog-transformed samples. There are several PCA functions in R; here we will use the one that is part of the DESeq2 package. The parameter `intgroup` specifies which sample attributes to show in the legend (this PCA function uses ggplot, so `intgroup` defines additional aesthetic mappings associated with your samples).


In [None]:
plotPCA(rld, intgroup = c("condition", "donor"))

**Q13**: What is the sample attribute that is responsible for the largest variance between samples? (Hint: look at the principal component that captures the most variance.) It also looks like one of the donors is an outlier. Which one is it?

# Homework

**Running the full DESeq2 pipeline:**<br>
Now (finally) we are going to run the full DESeq2 pipeline to look for the differentially expressed genes that underlie the differences between samples that we observed in our PCA plot.

**Q14**: This is done in a single line of code, run it below:


In [None]:
dds <- DESeq(dds)

Although we have tested different data transformations above, when running the pipeline we always start from non-normalized counts. Running the code above has performed these steps:

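* Median ratio normalization (size factor estimation) to correct for library size and composition.
* Dispersion estimation, which models how the variance of each gene depends on its mean.
* Fitting a negative binomial model to each gene and testing it (a Wald test by default) to calculate gene-based p-values.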

We can look at the results for each one of the comparisons we defined above when constructing the initial DESeq object (remember `design =`?) using the code below.

**Q15**: Run the code below. How many genes pass significance when using a false discovery rate of 0.05?

In [None]:
res <- results(dds,contrast=c("condition","trt","untrt"))
res
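
If the term "false discovery rate" is new to you, the sketch below shows the idea on a handful of made-up p-values using base R's `p.adjust()`. By default, the `padj` column that DESeq2 reports is a Benjamini-Hochberg adjusted p-value of this kind; the numbers here are invented and unrelated to `res`.

```r
# Made-up raw p-values for 8 hypothetical genes.
pvals <- c(0.0001, 0.0004, 0.002, 0.01, 0.03, 0.04, 0.2, 0.7)

# Benjamini-Hochberg adjustment controls the false discovery rate.
padj <- p.adjust(pvals, method = "BH")
round(padj, 4)

# Genes below 0.05 before and after adjustment.
sum(pvals < 0.05)
# na.rm = TRUE matters for real results, where some adjusted p-values are NA.
sum(padj < 0.05, na.rm = TRUE)
```

Adjusted p-values are never smaller than the raw ones, so fewer genes pass the same threshold after correction.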

**Q16**: Now go back to the PCA plot you produced above. There was one donor that seemed like an outlier. Compare that donor to any other one (using code similar to the code above, changing only the `contrast`). How many genes pass significance when using a false discovery rate of 0.05? Make a volcano plot and mark those genes with a different color.