This tutorial is a shortened version of an RNA-Seq course.
It is also very strongly inspired by another Galaxy transcriptomics tutorial
In the study of Guilgur et al. 2014 the authors compared the gene expression of a Drosophila melanogaster mutant line (fandango) with the gene expression of its wild-type background, using RNA-Seq.
Total RNA was isolated and used for preparing mRNA-seq libraries (Illumina Truseq, selecting for polyA tails). These libraries were sequenced in an Illumina HiSeq sequencing machine to obtain a collection of mRNA sequencing reads for each sample.
In this tutorial, we will deal with:
- Data
- QC of sequencing reads
- Estimating gene expression 2.1. Generating read count table from genome mappings 2.1.1. Mapping of reads to a Reference Genome 2.1.2. Visualizing alignment results 2.1.3. Counting reads per annotated gene 2.2. Mapping of reads to a Reference Transcriptome
- Differential gene expression analysis {: .agenda}
The original raw sequencing data for this study is available at the European Nucleotide Archive (ENA) under accession PRJEB5469.
There are 2 replicates for each condition (fandango and WT), 4 samples in total.
- Nowadays, 3 or more replicates are more commonly used. {: .comment}
To make this tutorial fast to execute, we selected just reads corresponding to a very limited number of example genes.
Create a new history for this RNA-seq exercise
Import the R1 FASTQ files (*R1.fq.gz) from Zenodo](https://doi.org/10.5281/zenodo.849901)
- Copy the link location
- Open the Galaxy Upload Manager
- Select Paste/Fetch Data
- Paste the link for each of the files into the text field
- Select fastqsanger.gz In the Type (set all) box
- Press Start
{: .tip}
{: .hands_on}
For more details on quality control and preprocessing, you can look at the NGS-QC tutorial
FastQC π§: Run FastQC on mut_lib1_R1 to control the quality of the reads
Is there anything that you find striking in the report?
Click to view answers
- You can see potential issues in %GC, sequence duplication and overrepresented sequences. These problems arise because this is a selected dataset. The overall quality of the sequences is good. In reality, this data has already been preprocessed, but the original quality was already quite good. It is more and more the case that raw data is ready to use right out of the machine, particularly when using single-end short reads. Nonetheless, it is always a good idea to check, in case some preprocessing is necessary.
{: .question}
Now that the reads are ready to use, how can we use them to estimate gene expression?
As the genome of Drosophila melanogaster is assembled, we can map the sequenced reads on this genome.
Do you want to learn more about the principles behind mapping? Follow this tutorial {: .comment}
Because in the case of a eukaryotic transcriptome, most reads originate from processed mRNAs lacking exons, they cannot be simply mapped back to the genome as we normally do for DNA data. Spliced mappers have been developed to efficiently map transcript-derived reads against genomes. TopHat was one of the first tools designed specifically to address this problem. Here, we will use HISAT2, a successor to TopHat2 that is faster with low memory requirements.
We can now map all the RNA sequences on the Drosophila melanogaster genome using HISAT2.
HISAT2 π§: Run HISAT2 with:
- "FASTQ" as "Input data format"
- "Individual unpaired reads"
- mut_lib1_R1 fastq (.gz or not) as "Reads"
- "Drosophila_BDGP6" as reference genome
- In the "Reads" choose "Multiple Datasets" and select the fastq files for all samples
- Press Execute {: .tip}
Fastq files can be big, and are therefore most often compressed as fastq.gz files. Many tools work directly with these files. {: .tip}
Efficient aligners such as Hisat2 require an initial indexing of the reference genome. This is a computationally intensive process and galaxy instances often offer a selection of genomes already indexed. Nonetheless, one needs to be aware of the exact version of the genome that is made available in the Galaxy isntance you're using. If in doubt, download your own genome as a fasta file and use it as reference from history in the alignment step. Just note that the alignment can become much slower (particularly for large genomes) because of the initial indexing step that must be performed. {: .comment}
Inspect the mapping statistics for mut_lib1_R1
- Click on "View details" (Warning icon)
- Click on "stderr" (Tool Standard Error)
How many reads were mapped 1 time?
And how many reads were mapped more than 1 time?
What is the overall alignment rate?
Click to view answer
- ....
- 180208 reads aligned exactly one time
- 17529 aligned more than one time
- The overall alignment rate is 99.79%
{: .question}
The alignment rate was so high because this was a selected dataset obtained from aligned reads. Alignment rates depend on the dataset and the organism. In bacteria and other organisms with small non-repetitive genomes have very high alignment rates, while for complex vertebrate or plant genomes this rate can go down. As a rule of thumb for higher mammals (mouse and human), alignment rates are usually over 80%. {: .comment}
{: .hands_on}
Hisat2 returns a BAM file with information about where the reads are mapped on the reference genome.
You can visualize these alignments using software such as IGV
IGV π§: Visualize the HISAT2 output BAM files of all 4 samples
What can you see when looking in the region 3L:15,039,818-15,045,581
What can you see when looking in the region X:20,693,493-20,698,320
Click to view answers
- Gene Rpn12R is diferentially expressed between the 2 conditions
- Gene run has splicing issues (less spliced in the mutant)
{: .question}
As a reference genome you can select dm6, but you should preferably download BDGP6 and load genome from file. {: .tip} {: .hands_on}
Manual inspection of alignments using IGV is important to have a broad overview of the experiment, particularly if we have a limited set of genes of interest. Nonetheless, we cannot inspect many genes this way. There are several ways to systematically assess quality of alignments. The fraction of aligned reads is one simple measure of quality, but it is very limited. Specialized tools such as qualimap and RSeqQC can produce a wide diversity of specialized plots e.g. fraction of reads mapping to exonic regions versus intronic or intergenic, positional bias of transcript coverage (potentially indicating eg. RNA degradation), etc... {: .comment}
Now that we know where reads map in the genome, we need to see to which genes they correspond, and count how many reads are mapped on each gene. For this, we need genome annotations, which indicate the positions of genes in the genome. These are usually in the form of GTF files.
We need to be carefull that the annotation being used was made on the same version of the genome used for alignment. {: .comment}
HTSeq-count is one of the most popular tool for gene quantification.
feature counts is another tool to generate these counts that can be much faster than htseq-count. {: .comment}
In principle, the counting of reads overlapping with genomic features is a fairly simple task. But there are some details that need to be decided, such how to handle multi-mapping reads. HTSeq-count offers 3 choices ("union", "intersection_strict" and "intersection_nonempty") to handle read mapping to multiple locations, reads overlapping introns, or reads that overlap more than one genomic feature. The recommended mode is "union", which counts overlaps even if a read only shares parts of its sequence with a genomic feature and disregards reads that overlap more than one feature.
HTSeq-count π§: Run HTSeq-count on the BAM files from Hisat2 with
Drosophila_melanogaster.BDGP6.85.sample.gtf
as "GFF file"- The "Union" mode
- A "Minimum alignment quality" of 10
- "Stranded": No
By default, htseq-count assumes the library to be stranded. If your data is not stranded, you will loose about 50% of your counts. If you don't know, it is probably best to start with unstranded. {: .comment}
Inspect the result files
Which information does the resulting files contain?
Which feature has the most reads mapped on it?
Click to view answers
- The result files consist of a tabular file with two columns: the gene id and the number of reads mapped on the corresponding gene
- The gene with more reads in all samples is FBgn0000042 (Act5C, an actin gene).
{: .question} {: .hands_on}
An alternative to mapping to the genome is to map directly against a reference transcriptome (a fasta of the transcripts). This in fact may be the only option available, particularly for non-model organisms for which only the transcriptome is available (usually generated with RNA-Seq using short illumina reads). Nonetheless, this mapping can be troublesome because of the presence of (potentially many) similar alternatively spliced transcripts for the same gene. Aligners such as Hisat2 or BWA usually ignore reads with multiple alignments, or randomly attribute them to one of the possible positions. These behaviors can often lead to improper estimates of transcript counts.
Salmon provides transcript-level estimates of gene expression. It is very fast (mostly because it only aligns against the transcriptome), and uses more elaborate statistical methods to handle the presence of different alternative splice forms that difficult the attribution of a read to a transcript. Salmon can also take explicitly in consideration bias related to differences in transcript length and nucleotide composition.
Salmon π§: Run Salmon with:
- "Single-end" as library type
- fastq of the different samples as input FastQ
- "Drosophila_melanogaster.BDGP6.88.sample.cdna.fa" as reference transcriptome
- In the "Reads" choose "Multiple Datasets" and select the fastq files for all samples
- Press Execute {: .tip}
Inspect the result files
Which information does the resulting files contain?
Are the read counts always integer values?
Are alignments generated?
Click to view answers
- [The result files](http://salmon.readthedocs.io/en/latest/file_formats.html) consist of a tabular file with several columns: transcript id, transcript length, effective transcript length (a normalized length calculated by salmon), transcripts per million (TPM - a normalized value of read counts), and total number of reads attributed to the transcript
- No. Each read is attributed with a certain probability to each transcript, and thus a read can be attributed to multiple transcripts with fractional values.
- No. Just counts are generated.
{: .question}
By default, salmon calculates transcript-level estimates. If one wants to have gene level estimates we can pass as parameter to salmon a tabular file with the conversion of transcripts to gene, and salmon will calculate a gene-level estimate based on the expression of all its transcripts.
{: .comment}
{: .hands_on}
In the end of the previous process, we have generated tables of counts, with the number of reads mapping to each of the genes (or transcripts). We will now identify which genes are impacted by the mutation based on those counts. We could compare directly the count files and estimate genes differentially expressed. Nonetheless, the number of sequenced reads mapped to a gene depends on several factors such as the sample's sequencing depth (total number of reads), the gene length, it's nucleotide composition, among other confounding factors.
Salmon already provides normalized values (TPM) that takes into account some of these confounding factors. Nonetheless, there is yet no consensus on whether those estimates are the most appropriate for using in differential expression analysis.
{: .comment}
Either for within or for inter-sample comparison, the gene counts need to be normalized. This normalization step is performed by tools specialized in differential analysis using count data, such as DESeq2 and edgeR. These programs take as input tables of gene counts (non-normalized).
- DESeq2 π§: Run DESeq2 with:
"Genotype" as first factor with "Mutant" and "WT" as levels and selection of count files corresponding to both levels
You can select several files by keeping the CTRL (or COMMAND) key pressed and clicking on the interesting files {: .tip}
{: .hands_on}
The first output of DESeq2 is a tabular file. The columns are:
-
Gene identifiers
-
Mean normalized counts, averaged over all samples from both conditions
-
Logarithm (to basis 2) of the fold change
The log2 fold changes are based on primary factor level 1 vs. factor level 2. The order of factor levels is then important. For example, for the factor 'Genotype', DESeq2 computes fold changes of 'Mutant' samples against 'WT' samples.
-
Standard error estimate for the log2 fold change estimate
-
Wald statistic
-
p-value for the statistical significance of this change
-
p-value adjusted for multiple testing with the Benjamini-Hochberg procedure which controls false discovery rate (FDR)
How many genes have a significant change in gene expression between these conditions?
Click to view answers
One gene, FBgn0036465 (Rpn12R). The gene with splicing issues (run, FBgn0003300) does not reach statistical significance.{: .question}
{: .hands_on}
In addition to the list of genes, DESeq2 outputs a graphical summary of the results, useful to evaluate the quality of the experiment:
-
Histogram of p-values for all tests
-
MA plot: global view of the relationship between the expression change of conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression. The genes that passed the significance threshold (adjusted p-value < 0.1) are colored in red.
-
Principal Component Analysis (PCA) with the first two axes
Each sample is plotted as an individual data point. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.
What is the first axis separating?
Click to view answers
- The first axis is separating the genotype
{: .question}
-
Heatmap of sample-to-sample distance matrix: overview over similarities and dissimilarities between samples
How are the samples grouped?
Click to view answers
- They are grouped by the genotype
{: .question}
-
Dispersion estimates: gene-wise estimates (black), the fitted values (red), and the final maximum a posteriori estimates used in testing (blue)
DESeq2 π§: Run DESeq2 with:
- "Genotype" as first factor with "Mutant" and "WT" as levels and selection of count files corresponding to both levels
Use transcript counts and select in DESeq2 the transcript to gene file instead of the GTF {: .tip}
How do you compare the results with htseq-count?
Click to view answers
- The results are not identical, but similar. On more realistic cases, results may diverge more.
{: .question} {: .hands_on}
Since this is a very small dataset, most of the plots are not very meaningfull. To have a more realistic idea of such an analysis, let use a dataset that considers all genes. In their paper, Trapnell and colleagues (Trapnell et. al, 2012) created an artificial Drosophila melanogaster dataset with 2 conditions and 3 replicates each, where 300 genes were perturbed in-silico. The original "raw" data and processed files can be found here.
DESeq2 π§: Run DESeq2 with:
- "Condition" as first factor with "C1" and "C2" as levels and selection of count files corresponding to both levels
How many differentially expressed genes do you obtain (corrected p-value (FDR) < 0.05)?
Click to view answers
- 267 with FDR<0.05.
{: .question} {: .hands_on}
EdgeR is an alternative to DESeq2. It takes a full table of counts (with headers) and a design matrix with the relevant variables and their values per sample.
edgeR π§: Create a Design Matrix for edgeR with:
- "trapnell_counts.tab" as Expression Matrix
- Condition as Factor
- C1 as the first Factor Level and select appropriate columns in the expression matrix
- Add C2 as another Factor level and select appropriate columns in the expression matrix
What does this step generatet?
Click to view answers
- It contains a table with the factors relevant for the experiment and their value for each sample.
{: .question}
edgeR π§: Perform Differential Gene Expression Analysis with edgeR:
- "trapnell_counts.tab" as Expression Matrix
- The generated design matrix as Design Matrix
- C1-C2 as contrast
How many differential genes do you obtain (FDR<0.05)?
Click to view answers
- 277
{: .question}
You can select several plots as output (similar to the plots from DESeq2) {: .tip}
{: .hands_on}
In this tutorial, we went briefly through the steps in the data analysis of differential expression using RNA sequencing.
Workflow π§: Select History -> Extract Workflow
What are the steps in the analysis of differential expression using RNA-Seq?
Click to view answers
- QC of sequencing reads
- Filtering of reads (if necessary)
- Mapping of reads to genome (if available)
- Alternatively, mapping of reads to transcriptome (implicitly generates table of counts)
- Generate table of read counts per gene or transcript
- Perform differential expression test (includes data normalization)
{: .question} {: .hands_on}