Skip to content

Latest commit

Β 

History

History
491 lines (416 loc) Β· 24 KB

ref_based_diff.md

File metadata and controls

491 lines (416 loc) Β· 24 KB

layout: tutorial_hands_on topic_name: RNA-Seq tutorial_name: ref_based_diff

This tutorial is a shortened version of an RNA-Seq course.

It is also very strongly inspired by another Galaxy transcriptomics tutorial

Introduction

In the study of Guilgur et al. 2014 the authors compared the gene expression of a Drosophila melanogaster mutant line (fandango) with the gene expression of its wild-type background, using RNA-Seq.

Total RNA was isolated and used for preparing mRNA-seq libraries (Illumina Truseq, selecting for polyA tails). These libraries were sequenced in an Illumina HiSeq sequencing machine to obtain a collection of mRNA sequencing reads for each sample.

Agenda

In this tutorial, we will deal with:

  1. Data
  2. QC of sequencing reads
  3. Estimating gene expression 2.1. Generating read count table from genome mappings 2.1.1. Mapping of reads to a Reference Genome 2.1.2. Visualizing alignment results 2.1.3. Counting reads per annotated gene 2.2. Mapping of reads to a Reference Transcriptome
  4. Differential gene expression analysis {: .agenda}

Data

The original raw sequencing data for this study is available at the European Nucleotide Archive (ENA) under accession PRJEB5469.

There are 2 replicates for each condition (fandango and WT), 4 samples in total.

πŸ”© Comments

  • Nowadays, 3 or more replicates are more commonly used. {: .comment}

To make this tutorial fast to execute, we selected just reads corresponding to a very limited number of example genes.

✏️ Hands-on: Data upload

  • Create a new history for this RNA-seq exercise

  • Import the R1 FASTQ files (*R1.fq.gz) from Zenodo](https://doi.org/10.5281/zenodo.849901)

    πŸ’‘ Tip: Importing data via links

    • Copy the link location
    • Open the Galaxy Upload Manager
    • Select Paste/Fetch Data
    • Paste the link for each of the files into the text field
    • Select fastqsanger.gz In the Type (set all) box
    • Press Start
      {: .tip}

{: .hands_on}

QC of sequencing reads

For more details on quality control and preprocessing, you can look at the NGS-QC tutorial

✏️ Hands-on: Quality control

  • FastQC πŸ”§: Run FastQC on mut_lib1_R1 to control the quality of the reads

    ❓ Questions

    • Is there anything that you find striking in the report?

      Click to view answers
      1. You can see potential issues in %GC, sequence duplication and overrepresented sequences. These problems arise because this is a selected dataset. The overall quality of the sequences is good. In reality, this data has already been preprocessed, but the original quality was already quite good. It is more and more the case that raw data is ready to use right out of the machine, particularly when using single-end short reads. Nonetheless, it is always a good idea to check, in case some preprocessing is necessary.

    {: .question}

Estimating gene expression

Now that the reads are ready to use, how can we use them to estimate gene expression?

Generating read count table from genome mappings

As the genome of Drosophila melanogaster is assembled, we can map the sequenced reads on this genome.

πŸ”© Comment

Do you want to learn more about the principles behind mapping? Follow this tutorial {: .comment}

Because in the case of a eukaryotic transcriptome, most reads originate from processed mRNAs lacking exons, they cannot be simply mapped back to the genome as we normally do for DNA data. Spliced mappers have been developed to efficiently map transcript-derived reads against genomes. TopHat was one of the first tools designed specifically to address this problem. Here, we will use HISAT2, a successor to TopHat2 that is faster with low memory requirements.

Mapping of reads to a Reference Genome

We can now map all the RNA sequences on the Drosophila melanogaster genome using HISAT2.

✏️ Hands-on: Spliced mapping

  1. HISAT2 πŸ”§: Run HISAT2 with:

    • "FASTQ" as "Input data format"
    • "Individual unpaired reads"
    • mut_lib1_R1 fastq (.gz or not) as "Reads"
    • "Drosophila_BDGP6" as reference genome

    πŸ’‘ Tip: Running with multiple samples simultaneously

    • In the "Reads" choose "Multiple Datasets" and select the fastq files for all samples
    • Press Execute {: .tip}

    πŸ’‘ Tip: Compressed fastq files

    Fastq files can be big, and are therefore most often compressed as fastq.gz files. Many tools work directly with these files. {: .tip}

    πŸ”© Comment

    Efficient aligners such as Hisat2 require an initial indexing of the reference genome. This is a computationally intensive process and galaxy instances often offer a selection of genomes already indexed. Nonetheless, one needs to be aware of the exact version of the genome that is made available in the Galaxy isntance you're using. If in doubt, download your own genome as a fasta file and use it as reference from history in the alignment step. Just note that the alignment can become much slower (particularly for large genomes) because of the initial indexing step that must be performed. {: .comment}

  2. Inspect the mapping statistics for mut_lib1_R1

    • Click on "View details" (Warning icon)
    • Click on "stderr" (Tool Standard Error)

    ❓ Question

    • How many reads were mapped 1 time?

    • And how many reads were mapped more than 1 time?

    • What is the overall alignment rate?

      Click to view answer
      1. ....
      2. 180208 reads aligned exactly one time
      3. 17529 aligned more than one time
      4. The overall alignment rate is 99.79%

    {: .question}

    πŸ”© Comment

    The alignment rate was so high because this was a selected dataset obtained from aligned reads. Alignment rates depend on the dataset and the organism. In bacteria and other organisms with small non-repetitive genomes have very high alignment rates, while for complex vertebrate or plant genomes this rate can go down. As a rule of thumb for higher mammals (mouse and human), alignment rates are usually over 80%. {: .comment}

{: .hands_on}

Visualizing alignment results

Hisat2 returns a BAM file with information about where the reads are mapped on the reference genome.

You can visualize these alignments using software such as IGV

✏️ Hands-on: Use IGV to visualize HISAT2 results

  1. IGV πŸ”§: Visualize the HISAT2 output BAM files of all 4 samples

    ❓ Question

    1. What can you see when looking in the region 3L:15,039,818-15,045,581

    2. What can you see when looking in the region X:20,693,493-20,698,320

      Click to view answers
      1. Gene Rpn12R is diferentially expressed between the 2 conditions
      2. Gene run has splicing issues (less spliced in the mutant)

    {: .question}

    πŸ’‘ Tip: Selecting the appropriate genome

    As a reference genome you can select dm6, but you should preferably download BDGP6 and load genome from file. {: .tip} {: .hands_on}

πŸ”© Comment

Manual inspection of alignments using IGV is important to have a broad overview of the experiment, particularly if we have a limited set of genes of interest. Nonetheless, we cannot inspect many genes this way. There are several ways to systematically assess quality of alignments. The fraction of aligned reads is one simple measure of quality, but it is very limited. Specialized tools such as qualimap and RSeqQC can produce a wide diversity of specialized plots e.g. fraction of reads mapping to exonic regions versus intronic or intergenic, positional bias of transcript coverage (potentially indicating eg. RNA degradation), etc... {: .comment}

Counting reads per annotated gene

Now that we know where reads map in the genome, we need to see to which genes they correspond, and count how many reads are mapped on each gene. For this, we need genome annotations, which indicate the positions of genes in the genome. These are usually in the form of GTF files.

πŸ”© Comment

We need to be carefull that the annotation being used was made on the same version of the genome used for alignment. {: .comment}

HTSeq-count is one of the most popular tool for gene quantification.

πŸ”© Comment

feature counts is another tool to generate these counts that can be much faster than htseq-count. {: .comment}

In principle, the counting of reads overlapping with genomic features is a fairly simple task. But there are some details that need to be decided, such how to handle multi-mapping reads. HTSeq-count offers 3 choices ("union", "intersection_strict" and "intersection_nonempty") to handle read mapping to multiple locations, reads overlapping introns, or reads that overlap more than one genomic feature. The recommended mode is "union", which counts overlaps even if a read only shares parts of its sequence with a genomic feature and disregards reads that overlap more than one feature.

✏️ Hands-on: Counting the number of reads per annotated gene

  1. HTSeq-count πŸ”§: Run HTSeq-count on the BAM files from Hisat2 with

    • Drosophila_melanogaster.BDGP6.85.sample.gtf as "GFF file"
    • The "Union" mode
    • A "Minimum alignment quality" of 10
    • "Stranded": No

    πŸ”© Comment

    By default, htseq-count assumes the library to be stranded. If your data is not stranded, you will loose about 50% of your counts. If you don't know, it is probably best to start with unstranded. {: .comment}

  2. Inspect the result files

    ❓ Question

    1. Which information does the resulting files contain?

    2. Which feature has the most reads mapped on it?

      Click to view answers
      1. The result files consist of a tabular file with two columns: the gene id and the number of reads mapped on the corresponding gene
      2. The gene with more reads in all samples is FBgn0000042 (Act5C, an actin gene).

    {: .question} {: .hands_on}

Mapping of reads to a Reference Transcriptome

An alternative to mapping to the genome is to map directly against a reference transcriptome (a fasta of the transcripts). This in fact may be the only option available, particularly for non-model organisms for which only the transcriptome is available (usually generated with RNA-Seq using short illumina reads). Nonetheless, this mapping can be troublesome because of the presence of (potentially many) similar alternatively spliced transcripts for the same gene. Aligners such as Hisat2 or BWA usually ignore reads with multiple alignments, or randomly attribute them to one of the possible positions. These behaviors can often lead to improper estimates of transcript counts.

Salmon provides transcript-level estimates of gene expression. It is very fast (mostly because it only aligns against the transcriptome), and uses more elaborate statistical methods to handle the presence of different alternative splice forms that difficult the attribution of a read to a transcript. Salmon can also take explicitly in consideration bias related to differences in transcript length and nucleotide composition.

✏️ Hands-on: Mapping of reads to a Reference Transcriptome using Salmon

  1. Salmon πŸ”§: Run Salmon with:

    • "Single-end" as library type
    • fastq of the different samples as input FastQ
    • "Drosophila_melanogaster.BDGP6.88.sample.cdna.fa" as reference transcriptome

    πŸ’‘ Tip: Running with multiple samples simultaneously

    • In the "Reads" choose "Multiple Datasets" and select the fastq files for all samples
    • Press Execute {: .tip}
  2. Inspect the result files

    ❓ Question

    1. Which information does the resulting files contain?

    2. Are the read counts always integer values?

    3. Are alignments generated?

      Click to view answers
      1. [The result files](http://salmon.readthedocs.io/en/latest/file_formats.html) consist of a tabular file with several columns: transcript id, transcript length, effective transcript length (a normalized length calculated by salmon), transcripts per million (TPM - a normalized value of read counts), and total number of reads attributed to the transcript
      2. No. Each read is attributed with a certain probability to each transcript, and thus a read can be attributed to multiple transcripts with fractional values.
      3. No. Just counts are generated.

    {: .question}

    πŸ”© Comment

    By default, salmon calculates transcript-level estimates. If one wants to have gene level estimates we can pass as parameter to salmon a tabular file with the conversion of transcripts to gene, and salmon will calculate a gene-level estimate based on the expression of all its transcripts.

    {: .comment}

{: .hands_on}

Differential gene expression analysis

In the end of the previous process, we have generated tables of counts, with the number of reads mapping to each of the genes (or transcripts). We will now identify which genes are impacted by the mutation based on those counts. We could compare directly the count files and estimate genes differentially expressed. Nonetheless, the number of sequenced reads mapped to a gene depends on several factors such as the sample's sequencing depth (total number of reads), the gene length, it's nucleotide composition, among other confounding factors.

πŸ”© Comment

Salmon already provides normalized values (TPM) that takes into account some of these confounding factors. Nonetheless, there is yet no consensus on whether those estimates are the most appropriate for using in differential expression analysis.

{: .comment}

Either for within or for inter-sample comparison, the gene counts need to be normalized. This normalization step is performed by tools specialized in differential analysis using count data, such as DESeq2 and edgeR. These programs take as input tables of gene counts (non-normalized).

✏️ Hands-on: Analysis of the differential gene expression (1)

  1. DESeq2 πŸ”§: Run DESeq2 with:
    • "Genotype" as first factor with "Mutant" and "WT" as levels and selection of count files corresponding to both levels

      πŸ’‘ Tip

      You can select several files by keeping the CTRL (or COMMAND) key pressed and clicking on the interesting files {: .tip}

{: .hands_on}

The first output of DESeq2 is a tabular file. The columns are:

  1. Gene identifiers

  2. Mean normalized counts, averaged over all samples from both conditions

  3. Logarithm (to basis 2) of the fold change

    The log2 fold changes are based on primary factor level 1 vs. factor level 2. The order of factor levels is then important. For example, for the factor 'Genotype', DESeq2 computes fold changes of 'Mutant' samples against 'WT' samples.

  4. Standard error estimate for the log2 fold change estimate

  5. Wald statistic

  6. p-value for the statistical significance of this change

  7. p-value adjusted for multiple testing with the Benjamini-Hochberg procedure which controls false discovery rate (FDR)

✏️ Hands-on: Inspection of DESeq2 results

❓ Question

How many genes have a significant change in gene expression between these conditions?

Click to view answers One gene, FBgn0036465 (Rpn12R). The gene with splicing issues (run, FBgn0003300) does not reach statistical significance.

{: .question}

{: .hands_on}

In addition to the list of genes, DESeq2 outputs a graphical summary of the results, useful to evaluate the quality of the experiment:

  1. Histogram of p-values for all tests

  2. MA plot: global view of the relationship between the expression change of conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression. The genes that passed the significance threshold (adjusted p-value < 0.1) are colored in red.

  3. Principal Component Analysis (PCA) with the first two axes

    Each sample is plotted as an individual data point. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.

    ❓ Question

    What is the first axis separating?

    Click to view answers
    1. The first axis is separating the genotype

    {: .question}

  4. Heatmap of sample-to-sample distance matrix: overview over similarities and dissimilarities between samples

    ❓ Question

    How are the samples grouped?

    Click to view answers
    1. They are grouped by the genotype

    {: .question}

  5. Dispersion estimates: gene-wise estimates (black), the fitted values (red), and the final maximum a posteriori estimates used in testing (blue)

✏️ Hands-on (optional): DESeq2 with Salmon results

  1. DESeq2 πŸ”§: Run DESeq2 with:

    • "Genotype" as first factor with "Mutant" and "WT" as levels and selection of count files corresponding to both levels

    πŸ’‘ Tip

    Use transcript counts and select in DESeq2 the transcript to gene file instead of the GTF {: .tip}

    ❓ Question

    How do you compare the results with htseq-count?

    Click to view answers
    1. The results are not identical, but similar. On more realistic cases, results may diverge more.

    {: .question} {: .hands_on}

Since this is a very small dataset, most of the plots are not very meaningfull. To have a more realistic idea of such an analysis, let use a dataset that considers all genes. In their paper, Trapnell and colleagues (Trapnell et. al, 2012) created an artificial Drosophila melanogaster dataset with 2 conditions and 3 replicates each, where 300 genes were perturbed in-silico. The original "raw" data and processed files can be found here.

✏️ Hands-on (optional): DESeq2 with Trapnell counts

  1. DESeq2 πŸ”§: Run DESeq2 with:

    • "Condition" as first factor with "C1" and "C2" as levels and selection of count files corresponding to both levels

    ❓ Question

    How many differentially expressed genes do you obtain (corrected p-value (FDR) < 0.05)?

    Click to view answers
    1. 267 with FDR<0.05.

    {: .question} {: .hands_on}

EdgeR is an alternative to DESeq2. It takes a full table of counts (with headers) and a design matrix with the relevant variables and their values per sample.

✏️ Hands-on (optional): edgeR with Trapnell data

  1. edgeR πŸ”§: Create a Design Matrix for edgeR with:

    • "trapnell_counts.tab" as Expression Matrix
    • Condition as Factor
    • C1 as the first Factor Level and select appropriate columns in the expression matrix
    • Add C2 as another Factor level and select appropriate columns in the expression matrix

    ❓ Question

    What does this step generatet?

    Click to view answers
    1. It contains a table with the factors relevant for the experiment and their value for each sample.

    {: .question}

  2. edgeR πŸ”§: Perform Differential Gene Expression Analysis with edgeR:

    • "trapnell_counts.tab" as Expression Matrix
    • The generated design matrix as Design Matrix
    • C1-C2 as contrast

    ❓ Question

    How many differential genes do you obtain (FDR<0.05)?

    Click to view answers
    1. 277

    {: .question}

    πŸ’‘ Tip

    You can select several plots as output (similar to the plots from DESeq2) {: .tip}

{: .hands_on}

Workflow

In this tutorial, we went briefly through the steps in the data analysis of differential expression using RNA sequencing.

✏️ Hands-on (optional): Extract workflow from History

  1. Workflow πŸ”§: Select History -> Extract Workflow

    ❓ Question

    What are the steps in the analysis of differential expression using RNA-Seq?

    Click to view answers
    1. QC of sequencing reads
    2. Filtering of reads (if necessary)
    3. Mapping of reads to genome (if available)
    4. Alternatively, mapping of reads to transcriptome (implicitly generates table of counts)
    5. Generate table of read counts per gene or transcript
    6. Perform differential expression test (includes data normalization)

    {: .question} {: .hands_on}