topics/microbiome/tutorials/metatranscriptomics/slides.html

---
layout: tutorial_slides
logo: "GTN"

title: "Introduction to metatranscriptomics"
questions:
- "How to analyze metatranscriptomics data?"
- "What information can be extracted of metatranscriptomics data?"
- "How to assign taxa and function to the identified sequences?"
contributors:
  - subinamehta
  - pratikdjagtap
  - shiltemann

redirect_from:
- /topics/metagenomics/tutorials/metatranscriptomics/slides
---

## Why study the microbiome?

.pull-left[

- Health care research
 - Humans are full of microorganisms
 - Skin, gut, oral cavity, nasal cavity, eyes, ..
 - Affects health, drug efficacy, etc

]
.pull-right[

.image-100[ ![Image of a human with various pie charts pointing to various regions of the body where microbe populations live](../../images/human_microbiome.png) ]

]

- Sometimes referred to as your **second genome**
 - ~10 times more cells than you
 - ~100 times more genes than you
 - ~1000s different species


---

## Why study the microbiome?

- Environmental studies
  - Microbes in the soil affect plants and animals
  - Improve agriculture

.image-75[ ![Rhizodeposition: image of a tree converting sun and co2 into fixed carbon used as food for soil microbes.](../../images/environmentalstudies.png) ]


---

## Meta- Omics

![meta-momics diagram](../../images/metatranscriptomics/meta-omics.png)


---

## This Tutorial: ASaiM pipeline

.pull-left[

- Quality Control
  - Assess Quality
  - Trim and Filter raw reads
  - Filter ribosomal RNA (rRNA)

- Community profiling (Who?)
  - Determine composition of sample
  - Visualisation

- Functional Analysis (What?)

]

.pull-right[
.image-90[![ASiaiM diagram](../../images/asaim-wf.png)]
]


.footnote[Batut et al Gigascience. 2018 7(6) doi: 10.1093/gigascience/giy057]


???

For this short tutorial, while the workflow is running, these slides can be useful
to explain the tools that are being run in that section. After explaining the tools,
the workflows should be far enough along to start showing the results


---

## Input: Cellulose Degradation in a Biogas Reactor

![Workflow graph showing biogas reactor extract being transferred to cellulose and incubated. Time series samples are taken and run through a mass spectrometer and genomic sequencer.](../../images/metatranscriptomics/experimental_setup.png)


???

A 100 µl aliquot of an enriched community from a biogas reactor was transferred to 27 anaerobic bottles containing a rich medium and 10g/L of cellulose as sole carbon source and incubated at 65 °C.

Three bottles were collected at 9 different time points (0, 8, 13, 18, 23, 28, 33, 38 and 43 h) and processed in triplicates. Metatranscriptomic analysis was performed on all time points. Metaproteomics analysis on 4 data points.

---


## Input Format: FastQ Files


- Four lines per read

![Image of a fastq file with label on the first line, sequence on the second, + on the third, and quality scores on the fourth as ascii chars. A callout shows that Base=T, quality=colon, and that means a score of 25.](../../../sequence-analysis/images/quality-control/fastq_fig.jpg)

???

- Four lines per read
  - `@` + identifier on first line, just like fasta
  - sequence
  - `+`
  - quality score characters

Segue: so what do the quality chars mean?

---

## FastQ: Quality score

- Each character denotes a different Phred score

  ![Encoding of the quality score with ASCII characters for different Phred encoding. The ascii code sequence is shown at the top with symbols for 33 to 64, upper case letters, more symbols, and then lowercase letters. Sanger maps from 33 to 73 while solexa is shifted, starting at 59 and going to 104. Illumina 1.3 starts at 54 and goes to 104, Illumina 1.5 is shifted three scores to the right but still ends at 104. Illumina 1.8+ goes back to the Sanger except one single score wider. Illumina](../../../sequence-analysis/images/quality_score_encoding.png)

- Phred scores are logarithmic

.small[
Phred Quality Score | Probability of incorrect base call | Base call accuracy
--- | --- | ---
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
...
]

???

- Logarithmic scale
- Different flavours of encoding exist

---

## Preprocessing

In this tutorial we start with some preprocessing steps

![preprocessing workflow](../../images/metatranscriptomics/workflow_qc.png)

---

## Preprocessing: Tools

In this tutorial we start with some preprocessing steps


![preprocessing workflow](../../images/metatranscriptomics/workflow_qc_tools.png)


???
| Step | Tools |
|:------|-------|
|Quality Control reports &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | FastQC {% icon tool %} and MultiQC {% icon tool %} |
|Trimming and Filtering | Cutadapt {% icon tool %} |
|Filter ribosomal RNA | SortMeRNA {% icon tool %} |
|Interlace FastQ files | FastQ interlacer {% icon tool %} |

---

## Quality Reports: FastQC

- Generate a web report with quality metrics of your FastQ file

![Screenshot of FastQC report, showing the table of contents with green checks on nearly every result, and the base statistics and per-base sequence quality graphs shown.](../../../sequence-analysis/images/fastqc-report.png)

.footnote[see also our [dedicated QC tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %})  ]

---

## Quality Reports: FastQC

- Many different QC plots
- Example: Per-base sequence Quality plot

![Fastqc quality score plot, most results are in the green region but the box portion of the box and whisker plot start to dip into the yellow, medium quality (less than 30) region near 34+ base position in read. The whiskers begin extending to the red region (less than 20) by base 31 and get progressively worse.](../../../sequence-analysis/images/per_base_sequence_quality_good.png)


.footnote[explanation of different plots: [dedicated QC tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %})  ]

???
- in the per-base sequence quality plot, a boxplot of the base quality (y-axis) per position in the read (x-axis) is drawn
- often you might observe a drop in quality towards the end of the reads, and may consider trimming ends
- this example is very good

---

## Quality Reports: FastQC

- Many more plots
- See [QC tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}) for more information

![Montage of several different fastq reports showing sequence quality graphs, and a numb er of other line graphs.](../../../sequence-analysis/images/quality-control/all_plots.png)


---

## Quality Reports: MultiQC

.pull-left[
- Combine multiple FastQC reports into one report

- Also for outputs of other tools

- Great when sequencing large numbers of samples
]

.pull-right[
![Multiqc's report showing an aggregation of multiple samples. An overview at the top provides context for the 4 samples, and a sequence quality histogram shows 4 samples with similarly behaving quality scores](../../images/metatranscriptomics/multiqc.png)
]

---

## Read Trimming and Filtering: Cutadapt

- Trim low-quality bases from reads
- Filter reads based on length, mean quality score, ..
- Remove adapters/primers

  .image-75[![adapter trimming](../../../sequence-analysis/images/quality-control/trimming.png)]

- Many tools: CutAdapt {% icon tool %}, TrimGalore {% icon tool %}, Trimmomatic {% icon tool%} ..

???
These are some examples of ways to trim and filter data, but many more are possible and depend on your experiment what is necessary

---

## SortMeRNA

.pull-left[

- Most RNA sequences will be ribosomal RNA (rRNA)

- Great for taxonomical assignment (who is there?)

- Not informative for functional analysis (What are they doing?)

- Filter out rRNA before doing functional analysis
]

.pull-right[
.image-90[![SortMeRNA](../../images/metatranscriptomics/sortmerna.png)]
]

---

## FastQ interlacer

- Paired-end data often comes in two separate FastQ files
- One file with *forward* reads, one with *reverse* reads

<br>

![paired end deinterlaced file](../../../sequence-analysis/images/mapping/pairedend_fastq.png)

---

## FastQ interlacer

- Some tools require a single **interlaced** FastQ file
- Galaxy has tools for **interlacing** and **deinterlacing** FastQ files

<br>

![paired end interlaced file](../../../sequence-analysis/images/mapping/pairedend_interlaced.png)


???
forward and reverse files are 'zipped' together into a single file

---

## Community Profile

- We want to identify which organisms are present in our sample, and their relative abundances

![Cartoon of several differently coloured and shaped microbes in a circle.](../../images/microbial-community.png)

- **MetaPhlan2** {% icon tool %} for identification
- **Krona** {% icon tool %} and **Graphlan** {% icon tool %} for visualisation

---

## MetaPhlan2 {% icon tool %}

- Estimates the presence and relative abundance of microbial cells
- Maps reads against a set of marker sequences
- Caveat: this tool is designed for DNA-seq
  - Be careful interpreting abundances when using this tool with transcriptomics data

.footnote[Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.]

???

About the caveat: The theoretical problem is that we quantify species abundance by averaging the coverage of marker genes. Marker genes are supposed to be at the same coverage as they are single copy genes from the same genome, but this is not true for their transcripts. So MetaPhlAn2 on metatranscriptomics gives an idea about the average transcriptional rate of a given species. So it _can_ be used with caution...


---

## Krona {% icon tool %}

- Visualization of community composition, interactive plot

<iframe id="krona" src="krona.html" frameBorder="0" width="80%" height="600px"> ![Krona at bacteria level](../../images/metatranscriptomics/krona_bacteria.png) </iframe>

---

## Graphlan {% icon tool %}

- *Cladogram* visualisation

![Colourful cladogram which begins from the center and expands outward with the lineage of the samples. Each sector of the chart is coloured differently for each group of genus and spieces. E.g. streptococcus streptococcaceae has three different leafs of the cladogram tree.](../../images/metatranscriptomics/graphlan_general.png)

---

## Genus Abundance

- Tutorial: one timepoint
- Over multiple timepoints:

![stacked bar chart with timepoints along the x axis and genus abundance as a percentage along the y axis. Each of the 7 time samples consists mostly of Coprothermobacter and Clostridium.](../../images/metatranscriptomics/genus_abundance.png)

---

## Functional Analysis

- Pathways
- Gene Ontology
  - Biological process
  - Molecular function
  - Cellular component
- Gene Family

---

## Workflow

![functional analysis workflow schematic](../../images/metatranscriptomics/workflow_functional.png)

???
HUMAnN2
- next generation
- HMP Unified Metabolic Analysis Network
- developed by Huttenhower lab
- itself a workflow/pipeline
- basically answering the question about what the microbial community is capable of?

---

## HUMAnN2 {% icon tool %}

- Profiles presence/absence and abundance of microbial community
- Efficiently characterizes microbial metabolic pathways


- **Input**
  - Interlaced non-rRNA reads
  - Taxonomic profile (MetaPhlAn2 {% icon tool %})

- **Output**
  - Gene families and their abundance
  - Pathways and their coverage
  - Pathways and their abundance

???
- contains 5 parts -> non rRNA reads, MetaPhlAn taxonomy, NCBI nucleotide db, Uniref 50/90 protein db, MetaCyc/Unipathway.
- Show the Galaxy wrapper

---

## HUMAnN2 Tiered Search


---
class: top

.left-column70[

<br>

- Meta-omic sequences (DNA/RNA)

]

.right-column30[
.image-60[![Cartoon of several reads coloured into four groups, Species 1, Species 2, Unclassified, Novel.](../../images/metatranscriptomics/humann2_tiered1.png)]
]

--


.left-column70[

<br>

- Initial screen through **MetaPhlAn2** {% icon tool %}: known microbial species
  - Database: merging pangenomes of identified species

]

.right-column30[
.image-60[![Four bins labelled 1 (red), 2 (blue), 3, 4 with reads from the top cartoon show piles of 1 an 2 with NO signs over 3 and 4.](../../images/metatranscriptomics/humann2_tiered2.png)]
]

--

.left-column70[

<br>
- Nucleotide-level mapping against database

]

.right-column30[
.image-60[![Regions x 1 y in red and x 2 y in blue are shown, the pangenomes of each of the red and blue species are shown. Reads map to most segments of the pangenome.](../../images/metatranscriptomics/humann2_tiered3.png)]
]

--

.left-column70[
<br>
- Unaligned reads searched against proteinDB (Uniref) through accelerated translated search

]

.right-column30[
.image-75[![Reads are shown matching against portions of protein sequences of X, Y, Z](../../images/metatranscriptomics/humann2_tiered4.png)]
]

???
Takes non rRNA reads + MetaPhlAn2 gives list of abundant organism, then it does Nucleotide level pangenome mapping with Bowtie and uses CHocophlAN db
giving unmapped and organims specific gene hits, the unmapped reads are further searched against accelerated translated protein database
the protein hits are tehn combined with gene hits and metacyc to give the output.

---


**Result:** Gene family and pathway abundances

.image-40[![A table with two columns, Feature on left and RPK on right. GeneX has an RPK of 8. GeneX|Species1 has an RPK of 2, species2 and unclassified are listed with an RPK of 3.](../../images/metatranscriptomics/humann2_tiered5.png)]

---

## Gene Families Abundances

![Screenshot of a table in Galaxy with Gene Family on left and humann2 abundance RPKs on right.](../../images/metatranscriptomics/gene_family_abundance.png)

RPK (reads per kilobase) = sum of alignment scores

???
Gene families: groups of evolutionary related protein that perform similar function
Pathway: sum over genes catalyzing the reaction
Pathway coverage: presence/absence
RPK relative gene copy number : is computed as the sum of all alignments scores over a particular gene family
UNMAPPED: total number of reads that remained unmapped even after both alignment steps
UNINTEGRATED: no pathway detected.

---

## Gene Families to Functional Annotation

![Humann2 regroup table is the left node in a flow chart with UniRef50. Multiple lines are drawn to an unlabelled right node that lists metacyc, kegg, pfam, EC, GO, informative GO, slim GO.](../../images/metatranscriptomics/results_functional_annotation.png)

???
Gene familes are too large depending on the complexity thus to simplify users can regroup gene families using grouping tool, can download mapping files.
HUMAnN2 regroups Uniref 50/90 values to Go terms to get a broad overview.

---

## Group Abundances

![humann2 regroup table, lines from uniref50 to GO. group humann2 to GO slim terms shows a similar graphic, lines from uniref50 to slim GO.](../../images/metatranscriptomics/results_group_abundances.png)


???
Group abundances converts GO terms to Go slim (subset of GO terms) into Mol function, biological process and cellular components.

---

## Gene Families to Functional Annotation


![Table from Galaxy shown with gene family and RPK](../../images/metatranscriptomics/results_gene_family.png)

![group Human2 to GO slim terms with lines from uniref50 to slim GO and boxes of Molecular Function, biological process, and cellular component below Slim GO](../../images/metatranscriptomics/results_gene_family_function.png)

![Another galaxy table screenshot with GO id, GO name, and abundance.](../../images/metatranscriptomics/results_go.png)

---
class: top

## Output

.left-column30[
<br>
Molecular Function

]

.right-column70[
.image-90[![Table in Galaxy with GO ID, name, abundance.](../../images/metatranscriptomics/results_molecular_function.png)]
]
--

.left-column70[
.image-90[![Basically the same table as above.](../../images/metatranscriptomics/results_biological_process.png)]
]

.right-column30[
<br>
Biological Process

]


--


.right-column70[
.image-90[![Again the same columns in a table. None of the specific data is legible or important.](../../images/metatranscriptomics/results_cellular_component.png)]
]

.left-column30[

<br><br>
Cellular Component

]

???
g is genus
s is species level
---

## Unpack pathway abundances to show genes included

- Renormalize the gene and pathway abundances in copies per million or relative abundance
- This tool unpacks the pathways abundance by including gene families

![output file unpack pathway tool](../../images/metatranscriptomics/results_unpack.png)

---

## Function: Cellulose Degradation

- Quantitative analysis of gene family outputs from HUMAnN2 shows upregulation of cellulase

.image-75[![line chart shown cellulase abundance decreasing from 80 copies per million to 40 as time goes from 13 to 43. Cellulose 1,4 beta cellobiosidase starts at 140 cpm and dps at hour 23 to 120 before increasing to 200 by the end of the graph](../../images/metatranscriptomics/results_cellulose_degradation.png)]

???
explain about datasets first
cellulose 1,4 beta-cellulobiosidase responsible for hydrolysis of cellulose
Gene encoding for the cellulose-binding domain
protein shows an initial decrease and subsequent
increase during cellulose degradation.
---

## Functions associated with a selected taxon

.image-75[![Stacked bar chart with a lot of organisms as left axis (abundance, copies per million) and time on the bottom. It is labelled Coprothermobacter: Functional Pathways](../../images/metatranscriptomics/results_functions_taxon.png)]

???
In gene abundance, Coprothermobacter and Clostridium were observed to be the
most abundant.
In this figure we are looking at Coprothermobacter only->Glycolysis is observed to be the most abundant functional pathway across time points in
Coprothermobacter

---

## Taxa associated with a selected function

.image-75[![Bar chart titled Adenosine ribonucleotides de novo biosynthesis with time in hours as x axis, and Genus abundance (copies per million). Coprothermobacter and Clostridium decrease from ~2000 combined copies per million to ~800, in approximately equal amounts.](../../images/metatranscriptomics/results_taxa_function.png)]

???
This figure shows the contribution of genera to adenosine ribonucleotides denovo biosynthesis across time points.
it shows during ATP synthesis, we see clostridium and coprothermobacter in abundance.
---

# Tabular Outputs from ASaIM Workflow

- Taxonomy (Who?)
  - Kingdom, phylum, class, order, family, genus, species, strain

- Function (What?)
  - Pathways
  - Gene Ontology
    - Biological Process
    - Molecular Function
    - Cellular Component
  - Gene Family