## RNAseq in a nutshell: From FASTQ files to differential expression

Last week we learned to quantify gene expression using arrays. This technology is still in use and has some advantages, but a much more widely used approach is to measure transcript abundance using high-throughput sequencing.

We will start this session by describing the structure of FASTQ files and the quality control measures used for them. Before you start the next two sessions, it is important that you understand how these files are generated as part of an RNAseq experimental pipeline. The pipeline is composed of two steps: library generation and high-throughput sequencing.

**Library generation:** RNA extracted from cells, a model organism, or tissue undergoes a procedure that converts the RNA into DNA while attempting to preserve the relative abundance of the different transcripts. It is important to remember that the total RNA in a cell is composed mostly of rRNA, which is usually depleted during library prep.

**Q1** Take a look at figure 1a in the paper by Martin et al., which we provided for you on Canvas. When going from step 3 to step 4, primers need to be added for the reverse transcriptase. Two types of primers are commonly used; which would you use if you wanted to make sure that you are sequencing only mRNA? (If you have never heard of cDNA before, look it up using Google.)

**High throughput sequencing:** Many technologies for massively parallel sequencing have been introduced over the years, but the most popular is the Illumina Sequencing-By-Synthesis (SBS) technology. The DNA fragments that undergo SBS are the ones depicted in step 7 of figure 1a mentioned above.

**Q2** Take a look at an introduction video by <a href="https://www.youtube.com/watch?v=fCd6B5HRaZ8&ab_channel=Illumina" target="_blank">Illumina on SBS</a>. Illumina sequencing has limitations on the size of the fragments that can be sequenced: DNA fragments that are too short or too long cannot be sequenced accurately. Based on this video, can you guess which step in the process is responsible for this limitation?

Note that sequencing can be done using either single-end or paired-end reads. Because the fragments are usually longer than the number of SBS cycles, a single read covers only one end of a fragment. Paired-end sequencing reads both ends of the fragment, and the unsequenced middle part can then be filled in using the genomic or transcriptomic reference sequence.

### Structure of the FASTQ file

Illumina sequencing reads are stored in FASTQ files. In these files, every read is stored as four lines, as follows:

* Line 1: Sequence ID – always starts with @ and is a unique ID for the read. This line is important because in many cases each sequencing cluster will have more than one read (for example paired-end reads and a sample barcode), and these will need to be matched.
* Line 2: Sequence – the actual sequence, consisting of the four nucleotides, with N in positions where the precise nucleotide could not be determined.
* Line 3: Quality ID – always starts with +, which in some cases is followed by a repeat of the sequence ID.
* Line 4: Quality score – the read quality scores encoded as ASCII characters. This line contains the same number of characters as the sequence line, and every character encodes the quality of the corresponding nucleotide in the sequence.
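
The four-line layout can be illustrated with a toy file. The sketch below writes two made-up reads to a hypothetical file `toy.fastq` (it does not touch sample.fastq) and decodes the quality line of the first read. It assumes the Phred+33 encoding used by current Illumina instruments, in which the quality score is the ASCII code of the character minus 33.

```shell
# Write a toy FASTQ file with two made-up reads (four lines per read)
printf '%s\n' \
  '@read1' 'ACGTN' '+' 'IIII#' \
  '@read2' 'GGCTA' '+' '5555I' > toy.fastq

# Show the first record: ID, sequence, separator, quality
head -n 4 toy.fastq

# Decode the quality line of the first read, assuming Phred+33
qual=$(sed -n '4p' toy.fastq)
scores=""
i=1
while [ "$i" -le "${#qual}" ]; do
  ch=$(printf '%s' "$qual" | cut -c "$i")   # i-th quality character
  code=$(printf '%d' "'$ch")                # its ASCII code
  scores="$scores $((code - 33))"           # Phred score = code - 33
  i=$((i + 1))
done
scores=${scores# }
echo "$scores"   # prints: 40 40 40 40 2
```

In this toy read the undetermined base N is paired with the lowest score (2), while the four confidently called bases have a score of 40, which corresponds to an estimated error probability of 1 in 10,000.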


**Q3** We have provided an example FASTQ file in your directory (sample.fastq). Open a terminal and use your Unix skills to look at the top lines of the file. What is the nucleotide sequence of the 10th read? How many reads does this file contain in total? (Remember that each read takes up four lines.)

### Assessing FASTQ files quality and sequence composition with FASTQC 

FASTQC is an extremely convenient tool that you should use as your first step in testing the quality of an Illumina NGS run. FASTQC will analyze the nucleotide composition and sequence quality scores of your run and will output an HTML file for each FASTQ file summarizing the results. After installing FASTQC, it can be run using the command:

    fastqc *.fastq

This will produce an HTML file for each FASTQ file in your directory.

We have provided an example FASTQC output file for you (sample_fastqc.html). Open this file (you might need to download it to your computer) and explore the different summary statistics. Take a look at the per base sequence content and note the difference in nucleotide distribution at the beginning of the reads compared to the middle of the reads. This is also reflected in the k-mer count.


**Q4** What is the 2nd most overrepresented k-mer? Optional question: any thoughts on what generates this bias? (Think about library prep.)

### Mapping of reads to the genome/transcriptome

The next step is to map the reads, which represent RNA fragments, to a reference sequence library (such as the genome) in order to convert the FASTQ files containing the sequenced fragments into gene-based values associated with expression levels. In the early days of RNAseq, mapping was done mostly against the genome using short read aligners. Note that, unlike ChIP-seq, which was covered at the beginning of the course, mapping transcript fragments to the genome is less straightforward because of splicing. A popular splicing-aware aligner is STAR (Spliced Transcripts Alignment to Reference), which is considered a state-of-the-art aligner for this task.

An alternative approach, which we will use here, is to map the reads to a database of already assembled transcripts. The three leading tools for doing this are:

Salmon - https://combine-lab.github.io/salmon/
<br> kallisto - https://pachterlab.github.io/kallisto/
<br> RSEM - http://deweylab.github.io/RSEM/

Here we will use Salmon for read mapping. Salmon takes as input FASTQ files and a reference transcriptome, and produces quantification files that end with .sf. A short tutorial on how to use Salmon can be found here: https://combine-lab.github.io/salmon/getting_started/. We will not run this program on CoCalc due to memory constraints, so we have already provided you with the Salmon output for two FASTQ files.
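
For orientation only, a Salmon run typically has two steps: building an index from the reference transcriptome and then quantifying each sample against it. The sketch below is not the command used to produce the course files. All file names, the sample name, and the thread count are hypothetical, and it assumes paired-end reads (for single-end reads, `-r` replaces `-1`/`-2`). It runs nothing unless Salmon and the input files are present.

```shell
transcripts="transcripts.fa.gz"   # hypothetical reference transcriptome (FASTA)
index_dir="salmon_index"          # hypothetical directory for the index
sample="SRR0000000"               # hypothetical sample name
reads_1="${sample}_1.fastq.gz"    # hypothetical first-mate reads
reads_2="${sample}_2.fastq.gz"    # hypothetical second-mate reads
out_dir="quants/${sample}"        # one output directory per sample

if command -v salmon >/dev/null 2>&1 \
   && [ -f "$transcripts" ] && [ -f "$reads_1" ] && [ -f "$reads_2" ]; then
  # Step 1: build the index (done once per reference transcriptome)
  salmon index -t "$transcripts" -i "$index_dir"
  # Step 2: quantify one sample; -l A lets Salmon infer the library type
  salmon quant -i "$index_dir" -l A -1 "$reads_1" -2 "$reads_2" -p 4 -o "$out_dir"
else
  echo "salmon or the input files are missing; nothing was run"
fi
```

Salmon writes its quantification table to a file named quant.sf inside the output directory, which can then be compressed with gzip.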

As an example, we will use the airway dataset, which contains RNAseq data of four human airway smooth muscle cell lines, treated and untreated with dexamethasone. Under the directory “quants” you will find the output of Salmon for two samples. The name of each subdirectory corresponds to the sample name, and the file quant.sf.gz inside it contains the mapping information that we will load into our R workspace.
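
To get a feel for what such a file holds, the sketch below builds a toy gzipped quantification table and inspects it from the terminal. The directory `toy_quants`, the sample name `sampleA`, and all numbers are made up for illustration; only the five column names follow Salmon's quant.sf format.

```shell
# Build a toy quant.sf.gz (made-up values, not real airway data)
mkdir -p toy_quants/sampleA
{
  printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'
  printf 'tx1\t1200\t1000\t250000\t40\n'
  printf 'tx2\t700\t500\t750000\t60\n'
} > toy_quants/sampleA/quant.sf
gzip -f toy_quants/sampleA/quant.sf

# Peek at the compressed table without unpacking it to disk
gzip -dc toy_quants/sampleA/quant.sf.gz | head -n 3

# Count the quantified transcripts (all lines except the header)
n_tx=$(gzip -dc toy_quants/sampleA/quant.sf.gz | tail -n +2 | wc -l | tr -d ' ')
echo "$n_tx"   # prints: 2
```

The same `gzip -dc ... | head` pattern can be pointed at one of the provided files under quants to see its first few transcripts.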

To load Salmon mappings into R we will use the library tximeta. This library extends the earlier tximport library: in addition to loading and processing transcript mapping data, it automatically adds annotation metadata for commonly used transcriptomes.

**Q5** Use the code below to load the library and a table containing the metadata associated with the two samples.


In [None]:
library(tximeta)
library(SummarizedExperiment)
coldata <- read.csv("sample_table.csv", row.names=1, stringsAsFactors=FALSE)
coldata

In order to load this data using tximeta, our coldata table has to contain two additional columns: names, with the name of each sample, and files, with the complete path to each quant.sf.gz file.

**Q6** Run the code below to add these columns. Take a minute to understand how the path to each file is generated. If needed, run the file.path command separately and take a look at the output it produces.


In [None]:
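# Use the Run column as the sample name
coldata$names <- coldata$Run
# Build the path quants/<sample name>/quant.sf.gz for each sample
coldata$files <- file.path("quants", coldata$names, "quant.sf.gz")
coldata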

**Q7** Now use the command below to load the data. Note that the annotation for the transcriptome index that was used for mapping is automatically identified and loaded.

In [None]:
se <- tximeta(coldata)

We have just constructed a SummarizedExperiment (SE) object, the basic data structure used to store RNAseq data in R. Its structure is depicted in the figure below. <br>
<img src="files/SEstructure.png" width="400" height="200">

colData contains the data about the samples, which we provided to the tximeta function. rowRanges contains information about each gene or transcript; this data was downloaded automatically because we used a standard transcriptome assembly. The assays part of the SE object contains the actual counts. There can be more than one assay, each associated with a different processing level, e.g. normalized and non-normalized counts.

**Q8** Use the function dim to look at the dimensions of your current SE object. How many transcripts are quantified? Also use head to get a summary of se.

**Q9** You will notice that there are three assays. Use the code below to list their names.

In [None]:
assayNames(se)

The three assays correspond to the output from Salmon: counts holds the non-normalized read counts per transcript, abundance holds the TPM (transcripts per million) values, and length holds the length of each transcript.
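
The three assays are related to each other. As a sketch, writing $c_i$ for the count of transcript $i$ and $\ell_i$ for its length as stored in the length assay (both symbols are introduced here for illustration), TPM is commonly defined as the length-normalized count rescaled so that the values of one sample sum to one million:

$$
\mathrm{TPM}_i = 10^6 \cdot \frac{c_i / \ell_i}{\sum_j c_j / \ell_j}
$$

For example, with two transcripts that have counts 40 and 60 and lengths 1000 and 500, the length-normalized counts are 0.04 and 0.12, which gives TPM values of 250,000 and 750,000.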

**Q10** Take a look at these assays using the command head(assays(se)$counts), and do the same for abundance and length. Note that the transcript length is not the same for the two samples! The reason is that it is harder to capture the ends of the transcripts, so the Salmon algorithm estimates an effective length for each sample separately. Also use the command rowRanges(se) to get information about each transcript.

Next, because we are interested in gene and not transcript information, we will need to summarize transcript data to genes.

**Q11** Run the code below to generate gene-level data, and then add a line that looks at the dimensions of your data using the command dim. Note that the number of rows is now much smaller because transcript isoforms have been aggregated into single genes.

In [None]:
gse <- summarizeToGene(se)

There are some additional functions that can be used to explore gene metadata.

**Q12** Run the code below to access the genomic ranges of the genes and information about the sequences (chromosomes) they are located on.

In [None]:
rowRanges(gse)
seqinfo(rowRanges(gse))

Our next step will be to normalize the data and run statistical tests to look for differentially expressed genes. This will be the focus of the next session, but as a last step, recall class 23 about multiple hypothesis testing. We need to know how the data is distributed in order to choose the right statistical test. For example, we could use a t-test if expression levels followed a normal distribution.

**Q13** Make a histogram of the counts assay matrix using a log10 scale. Does this look like a normal distribution?

This is close, but not really a normal distribution. As it turns out, gene expression data is best modeled using a negative binomial distribution. We will cover this in the next session!