| layout | enable | title | zenodo_link | questions | objectives | time_estimation | key_points | contributors | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tutorial_hands_on |
false |
De novo transcriptome assembly, annotation, and differential expression analysis |
|
|
3H |
|
|
Introduction
{:.no_toc}
As a result of the development of novel sequencing technologies, the years between 2008 and 2012 saw a large drop in the cost of sequencing. Per megabase and genome, the cost dropped to 1/100,000th and 1/10,000th of the price, respectively. Prior to this, only transcriptomes of organisms that were of broad interest and utility to scientific research were sequenced; however, these developed in 2010s high-throughput sequencing (also called next-generation sequencing) technologies are both cost- and labor- effective, and the range of organisms studied via these methods is expanding.
Examining non-model organisms can provide novel insights into the mechanisms underlying the "diversity of fascinating morphological innovations" that have enabled the abundance of life on planet Earth. In animals and plants, the "innovations" that cannot be examined in common model organisms include mimicry, mutualism, parasitism, and asexual reproduction. De novo transcriptome assembly is often the preferred method to studying non-model organisms, since it is cheaper and easier than building a genome, and reference-based methods are not possible without an existing genome. The transcriptomes of these organisms can thus reveal novel proteins and their isoforms that are implicated in such unique biological phenomena.
Agenda
In this tutorial, we will cover:
- TOC {:toc}
{: .agenda}
Read cleaning (20 minutes)
Known sequencing biases:
- Unknown nucleotides (Ns)
- Bad quality nucleotides
- Hexamers biases (Illumina. Now corrected ?)
Why do we need to correct those?
- To remove a lot of sequencing errors (detrimental to the vast majority of assemblers)
- Because most de-bruijn graph based assemblers can’t handle unknown nucleotides
Get data
{% icon hands_on %} Hands-on: Data upload
Create a new history for this tutorial
{% include snippets/create_new_history.md %}
Import the 12
fq.gzinto aList of Pairscollection namedfastq_raw
Option 1: from a shared data library (ask your instructor)
Option 2: from Zenodo using the URLs given below
https://zenodo.org/record/3541678/files/A1_left.fq.gz https://zenodo.org/record/3541678/files/A1_right.fq.gz https://zenodo.org/record/3541678/files/A2_left.fq.gz https://zenodo.org/record/3541678/files/A2_right.fq.gz https://zenodo.org/record/3541678/files/A3_left.fq.gz https://zenodo.org/record/3541678/files/A3_right.fq.gz https://zenodo.org/record/3541678/files/B1_left.fq.gz https://zenodo.org/record/3541678/files/B1_right.fq.gz https://zenodo.org/record/3541678/files/B2_left.fq.gz https://zenodo.org/record/3541678/files/B2_right.fq.gz https://zenodo.org/record/3541678/files/B3_left.fq.gz https://zenodo.org/record/3541678/files/B3_right.fq.gz{% include snippets/import_via_link.md collection=true collection_type="List of Pairs" collection_name="fastq_raw" %} {% include snippets/import_from_data_library.md %}
Rename the datasets
Check that the datatype
{% include snippets/change_datatype.md datatype="datatypes" %}
Add to each database a tag corresponding to ...
{% include snippets/add_tag.md %}
{: .hands_on}
Quality control
{% icon hands_on %} Hands-on: Task description
- FastQC {% icon tool %} with the following parameters:
- "Short read data from your current history":
fastq_raw(collection)
{: .hands_on}
Read cleaning with Trimmomatic
{% icon hands_on %} Hands-on: Task description
Trimmomatic {% icon tool %} with the following parameters:
- "Single-end or paired-end reads?":
Paired-end (as collection)- "Select FASTQ dataset collection with R1/R2 pair":
fastq_raw- "Perform initial ILLUMINACLIP step?":
Yes- "Adapter sequences to use":
TruSeq3 (additional seqs) (paired-ended, for MiSeq and HiSeq)- In "Trimmomatic Operation":
- {% icon param-repeat %} "Insert Trimmomatic Operation"
- "Select Trimmomatic operation to perform":
Cut bases off end of a read, if below a threshold quality (TRAILING)- {% icon param-repeat %} "Insert Trimmomatic Operation"
- "Select Trimmomatic operation to perform":
Cut bases off start of a read, if below a threshold quality (LEADING)- {% icon param-repeat %} "Insert Trimmomatic Operation"
- "Select Trimmomatic operation to perform":
Sliding window trimming (SLIDINGWINDOW)- {% icon param-repeat %} "Insert Trimmomatic Operation"
- "Select Trimmomatic operation to perform":
Drop reads with average quality lower than a specific level (AVGQUAL)
- "Minimum length of reads to be kept":
25- {% icon param-repeat %} "Insert Trimmomatic Operation"
- "Select Trimmomatic operation to perform":
Drop reads below a specified length (MINLEN)
- "Minimum length of reads to be kept":
50- "Output trimmomatic log messages?":
YesRename the Dataset Collection
Trimmomatic on collection XX: paired->fastqc_cleaned{% icon comment %} Comment
You can check the Trimmomatic log files to get the number of read before and after the cleaning
Input Read Pairs: 10000 Both Surviving: 8804 (88.04%) Forward Only Surviving: 491 (4.91%) Reverse Only Surviving: 456 (4.56%) Dropped: 249 (2.49%){: .comment}
{% include snippets/rename_collection.md %}
{: .hands_on}
Quality control after cleaning
{% icon hands_on %} Hands-on: Task description
- FastQC {% icon tool %} with the following parameters:
- "Short read data from your current history":
fastqc_cleaned(collection)
{: .hands_on}
Assembly (120 minutes - computing)
Assembly with Trinity
{% icon hands_on %} Hands-on: Task description
Trinity {% icon tool %} with the following parameters:
- "Are you pooling sequence datasets?":
Yes
- "Paired or Single-end data?":
Paired-end collection
- "Strand specific data":
No- "Run in silico normalization of reads":
No- In "Additional Options":
- "Use the genome guided mode?":
NoRename the Trinity output
Trinity on data 52, data 51, and others: Assembled Transcripts->transcriptome_raw.fasta{% include snippets/rename_dataset.md %}
{: .hands_on}
Assembly assessment / cleaning
Checking of the assembly statistics
{% icon hands_on %} Hands-on: Task description
Trinity Statistics {% icon tool %} with the following parameters:
- "Trinity assembly":
transcriptome_raw.fasta{% icon comment %} Comment
This step, even with this toy dataset, will take around 2 hours {: .comment}
{: .hands_on}
Remapping on the raw transcriptome
{% icon hands_on %} Hands-on: Task description
Align reads and estimate abundance {% icon tool %} with the following parameters:
- "Transcripts":
transcriptome_raw.fasta- "Paired or Single-end data?":
Paired
- "Left/Forward strand reads" ->
Multiple datasets
- Click on the Folder button at the right
- Type to Search:
left- Select the 6
Trimmomatic on ..._left.fq.gz- "Right/Reverse strand reads" ->
Multiple datasets
- Click on the Folder button at the right
- Type to Search:
right- Select the 6
Trimmomatic on ..._left.fq.gz- "Strand specific data":
Yes- "Abundance estimation method":
Salmon- In "Additional Options":
- "Trinity assembly?":
YesRename the 6
* isoforms counts:(
- Check in the information panel (i icon) the lineage of your file (ex:
A1_left.fq.gz... )- Rename the datasets:
A1_raw,A2_raw,A3_raw,B1_raw,B2_raw,B3_raw.{% icon comment %} Comment
If you check at the Standard Error messages of your outputs. You can get the
Mapping rate
- Click on one dataset
- Click on the little i icon
- Click on Tool Standard Error: stderr
[2019-11-14 15:44:21.500] [jointLog] [info] Mapping rate = 44.4358%{: .comment}
{% icon comment %} Comment
At this stage, you can now delete some useless datasets
Trimmomatic on collection XX: unpairedAlign reads and estimate abundance on *: genes countsNote that the dataset are just hidden. You can delete them permanently and make some room in the history options (the little wheel icon) {: .comment}
{: .hands_on}
Merge the mapping tables and compute normalizations
{% icon hands_on %} Hands-on: Task description
- Build expression matrix {% icon tool %} with the following parameters:
- "Abundance estimates":
A1_raw,A2_raw,A3_raw,B1_raw,B2_raw,B3_raw- "Abundance estimation method":
Salmon
{: .hands_on}
{% icon question %} Questions
What are the three tables?
{% icon solution %} Solution
- `estimated RNA-Seq fragment isoform counts (raw counts)``
matrix of isoform TPM expression values (not cross-sample normalized)matrix of TMM-normalized expression values{: .solution}
{: .question}
Compute contig Ex90N50 statistic and Ex90 transcript count
{% icon hands_on %} Hands-on: Task description
- Compute contig Ex90N50 statistic and Ex90 transcript count {% icon tool %} with the following parameters:
- "Expression matrix":
Build expression matrix: matrix of TMM-normalized expression values- "Transcripts":
transcriptome_raw.fasta- Click on the visulization icon on the dataset
Compute contig Ex90N50 statistic and Ex90 transcript count: ExN50 statistics
- Scatterplot - Creates a 2D-scatterplot from tabular datapoints
- "X Column": select the Columns
1- "Y Column": select the Columns
2
{: .hands_on}
What we get
What we should get with a real dataset
Transcriptome annotation completeness
{% icon hands_on %} Hands-on: Task description
- Busco {% icon tool %} with the following parameters:
- "Sequence to analyse":
transcriptome_raw.fasta- "Mode":
transcriptome- "Lineage":
eukaryota_odb9
{: .hands_on}
Filter low expression transcripts
{% icon hands_on %} Hands-on: Task description
Filter low expression transcripts {% icon tool %} with the following parameters:
- "Trinity assembly":
transcriptome_raw.fasta- "Expression matrix":
Build expression matrix: matrix of isoform TPM expression values (not cross-sample normalized)- "Minimum expression level required across any sample":
1.0- "Isoform filtering method":
Keep all isoforms above a minimum percent of dominant expression
- "Minimum percent of dominant isoform expression":
1{% icon comment %} Comment
If you check at the Standard Error messages of your outputs. You can get the
Retainedrate
- Click on one dataset
- Click on the little i icon
- Click on Tool Standard Error: stderr
Retained 2096 / 2102 = 99.71% of total transcripts.{: .comment}
Rename the output
Filter low expression transcripts on data 42 and data 14: filtered low expression transcripts->transcriptome_filtered.fasta
{: .hands_on}
Checking of the assembly statistics after cleaning
{% icon hands_on %} Hands-on: Task description
- Trinity Statistics {% icon tool %} with the following parameters:
- "Trinity assembly":
transcriptome_filtered.fasta
{: .hands_on}
Annotation
Generate gene to transcript map
{% icon hands_on %} Hands-on: Task description
- Generate gene to transcript map {% icon tool %} with the following parameters:
- "Trinity assembly":
transcriptome_filtered.fasta
{: .hands_on}
Peptide prediction
{% icon hands_on %} Hands-on: Task description
- TransDecoder {% icon tool %} with the following parameters:
- "Transcripts":
transcriptome_filtered.fasta- In "Training Options":
- "Select the training method":
Train with the top longest ORFs
{: .hands_on}
Similarity search
{% icon hands_on %} Hands-on: Task description
Diamond {% icon tool %} with the following parameters:
- "What do you want to align?":
Align amino acid query sequences (blastp)- "Input query file in FASTA or FASTQ format":
TransDecoder on data XXX: pep- "Select a reference database":
Uniprot Swissprot- "Format of output file":
BLAST Tabular- In "Method to restrict the number of hits?":
Maximum number of target sequences
- "The maximum number of target sequence per query to report alignments for":
1Rename the Diamond output
Diamond on data XXX->Diamond (blastp)Diamond {% icon tool %} with the following parameters:
- "What do you want to align?":
Align DNA query sequences (blastx)- "Input query file in FASTA or FASTQ format":
transcriptome_filtered.fasta- "Select a reference database":
Uniprot Swissprot- "Format of output file":
BLAST Tabular- In "Method to restrict the number of hits?":
Maximum number of target sequences
- "The maximum number of target sequence per query to report alignments for":
1Rename the Diamond output
Diamond on data XXX->Diamond (blastx){% icon comment %} Comment
Note that you can both use Diamond {% icon tool %} or the NCBI BLAST+ blastp {% icon tool %} and NCBI BLAST+ blast {% icon tool %} {: .comment}
{: .hands_on}
Find signal peptides
{% icon hands_on %} Hands-on: Task description
- SignalP 3.0 {% icon tool %} with the following parameters:
- "Fasta file of protein sequences":
TransDecoder on data XXX: pep
{: .hands_on}
Find transmembrane domains
{% icon hands_on %} Hands-on: Task description
- TMHMM 2.0 {% icon tool %} with the following parameters:
- "FASTA file of protein sequences":
TransDecoder on data XXX: pep
{: .hands_on}
Search again profile database
{% icon hands_on %} Hands-on: Task description
- hmmscan {% icon tool %} with the following parameters:
- "Sequence file":
TransDecoder on data XXX: pep
{: .hands_on}
Transcriptome annotation using Trinotate
{% icon hands_on %} Hands-on: Task description
- Trinotate {% icon tool %} with the following parameters:
- "Transcripts":
transcriptome_filtered.fasta- "Peptides":
TransDecoder on data XXX: pep- "Genes to transcripts map":
Generate gene to transcript map on data XXX: Genes to transcripts map- "BLASTP: Peptides vs Uniprot.SwissProt":
Diamond (blastp)- "BLASTX: Transcripts vs Uniprot.SwissProt":
Diamond (blastx)- "HMMER hmmscan: Peptides vs PFAM":
Table of per-domain hits from HMM matches of TransDecoder on data XXX: pep against the profile database- "TMHMM on Peptides":
TMHMM results- "SignalP on Peptides":
SignalP euk results- "Let Galaxy downloading the Trinotate Pre-generated Resource SQLite database":
Yes
{: .hands_on}
Differential Expression (DE) Analysis
Remapping on the filtered transcriptome using
{% icon hands_on %} Hands-on: Task description
Align reads and estimate abundance {% icon tool %} with the following parameters:
- "Transcripts":
transcriptome_filtered.fasta- "Paired or Single-end data?":
Paired
- "Left/Forward strand reads" ->
Multiple datasets
- Click on the Folder button at the right
- Type to Search:
left- Select the 6
Trimmomatic on ..._left.fq.gz- "Right/Reverse strand reads" ->
Multiple datasets
- Click on the Folder button at the right
- Type to Search:
right- Select the 6
Trimmomatic on ..._left.fq.gz- "Strand specific data":
Yes- "Abundance estimation method":
Salmon- In "Additional Options":
- "Trinity assembly?":
YesRename the 6
* isoforms counts:(
- Check in the information panel (i icon) the lineage of your file (ex:
A1_left.fq.gz... )- Rename the datasets:
A1,A2,A3,B1,B2,B3.{% icon comment %} Comment
If you check at the Standard Error messages of your outputs. You can get the
Mapping rate
- Click on one dataset
- Click on the little i icon
- Click on Tool Standard Error: stderr
[2019-11-14 15:44:21.500] [jointLog] [info] Mapping rate = 44.4358%{: .comment}
{% icon comment %} Comment
At this stage, you can now delete some useless datasets
Align reads and estimate abundance on *: genes countsNote that the dataset are just hidden. You can delete them permanently and make some room in the history options (the little wheel icon) {: .comment}
{: .hands_on}
Merge the mapping tables and compute a TMM normalization
{% icon hands_on %} Hands-on: Task description
- Build expression matrix {% icon tool %} with the following parameters:
- "Abundance estimates":
A1,A2,A3,B1,B2,B3- "Abundance estimation method":
Salmon- Describe samples and replicates {% icon tool %} with the following parameters:
- "Samples"
- "1: Samples":
- "Full sample name":
A1- "Condition":
A- "2: Samples":
- "Full sample name":
A2- "Condition":
A- ...:
- "6: Samples":
- "Full sample name":
B3- "Condition":
B
{: .hands_on}
RNASeq samples quality check
{% icon hands_on %} Hands-on: Task description
- RNASeq samples quality check {% icon tool %} with the following parameters:
- "Expression matrix":
Build expression matrix: estimated RNA-Seq fragment isoform counts (raw counts)- "Samples description":
Describe samples
{: .hands_on}
Differential expression analysis
{% icon hands_on %} Hands-on: Task description
- Differential expression analysis {% icon tool %} with the following parameters:
- "Expression matrix":
Build expression matrix: estimated RNA-Seq fragment isoform counts (raw counts)- "Sample description":
Describe samples(the last one)- "Differential analysis method":
DESeq2
{: .hands_on}
Extract and cluster differentially expressed transcripts
{% icon hands_on %} Hands-on: Task description
Extract and cluster differentially expressed transcripts {% icon tool %} with the following parameters:
- In "Additional Options":
- "Expression matrix":
Build expression matrix: estimated RNA-Seq fragment isoform counts (raw counts)- "Sample description":
Describe samples- "Differential expression results":
Differential expression results on data XXX and data XXX- "p-value cutoff for FDR":
1- "Run GO enrichment analysis":
No{% icon comment %} Comment
"p-value cutoff for FDR":
1Don't do this at home! It's because we have a Toy Dataset. The cutoff should be around0.001{: .comment}
{: .hands_on}
Partition genes into expression clusters
{% icon hands_on %} Hands-on: Task description
- Partition genes into expression clusters {% icon tool %} with the following parameters:
- "RData file":
Extract and cluster differentially expressed transcripts: RData file- "Method for partitioning genes into clusters":
Cut tree based on x percent of max(height) of tree
{: .hands_on}
Conclusion
{:.no_toc}
Sum up the tutorial and the key takeaways here. We encourage adding an overview image of the pipeline used.


