---
layout: tutorial_slides
logo: "GTN"
title: "Introduction to metatranscriptomics"
questions:
- "How to analyze metatranscriptomics data?"
- "What information can be extracted from metatranscriptomics data?"
- "How to assign taxa and function to the identified sequences?"
contributors:
- subinamehta
- pratikdjagtap
- shiltemann
redirect_from:
- /topics/metagenomics/tutorials/metatranscriptomics/slides
---
## Why study the microbiome?
.pull-left[
- Health care research
- Humans are full of microorganisms
- Skin, gut, oral cavity, nasal cavity, eyes, ...
- Affects health, drug efficacy, etc.
]
.pull-right[
.image-100[ ![Image of a human with various pie charts pointing to various regions of the body where microbe populations live](../../images/human_microbiome.png) ]
]
- Sometimes referred to as your **second genome**
- Often quoted as ~10 times more cells than you (more recent estimates are closer to 1:1)
- ~100 times more genes than you
- 1000s of different species
---
## Why study the microbiome?
- Environmental studies
- Microbes in the soil affect plants and animals
- Improve agriculture
.image-75[ ![Rhizodeposition: image of a tree converting sun and co2 into fixed carbon used as food for soil microbes.](../../images/environmentalstudies.png) ]
---
## Meta-omics
![meta-omics diagram](../../images/metatranscriptomics/meta-omics.png)
---
## This Tutorial: ASaiM pipeline
.pull-left[
- Quality Control
- Assess Quality
- Trim and Filter raw reads
- Filter ribosomal RNA (rRNA)
- Community profiling (Who?)
- Determine composition of sample
- Visualisation
- Functional Analysis (What?)
]
.pull-right[
.image-90[![ASaiM diagram](../../images/asaim-wf.png)]
]
.footnote[Batut et al Gigascience. 2018 7(6) doi: 10.1093/gigascience/giy057]
???
For this short tutorial, these slides can be used while the workflow is running
to explain the tools that are being run in each section. After the tools have been explained,
the workflow should be far enough along to start showing the results.
---
## Input: Cellulose Degradation in a Biogas Reactor
![Workflow graph showing biogas reactor extract being transferred to cellulose and incubated. Time series samples are taken and run through a mass spectrometer and genomic sequencer.](../../images/metatranscriptomics/experimental_setup.png)
???
A 100 µl aliquot of an enriched community from a biogas reactor was transferred to 27 anaerobic bottles containing a rich medium and 10 g/L of cellulose as the sole carbon source, and incubated at 65 °C.
Three bottles were collected at each of 9 time points (0, 8, 13, 18, 23, 28, 33, 38 and 43 h) and processed in triplicate. Metatranscriptomic analysis was performed on all time points, and metaproteomic analysis on 4 time points.
---
## Input Format: FastQ Files
- Four lines per read
![Image of a fastq file with label on the first line, sequence on the second, + on the third, and quality scores on the fourth as ascii chars. A callout shows that Base=T, quality=colon, and that means a score of 25.](../../../sequence-analysis/images/quality-control/fastq_fig.jpg)
???
- Four lines per read
- `@` + identifier on first line, just like fasta
- sequence
- `+`
- quality score characters
Segue: so what do the quality chars mean?
---
## FastQ: Quality score
- Each character denotes a different Phred score
![Encoding of the quality score with ASCII characters for different Phred encoding. The ASCII code sequence is shown at the top with symbols for 33 to 64, upper case letters, more symbols, and then lowercase letters. Sanger maps from 33 to 73 while Solexa is shifted, starting at 59 and going to 104. Illumina 1.3 starts at 64 and goes to 104, Illumina 1.5 is shifted three scores to the right but still ends at 104. Illumina 1.8+ goes back to the Sanger except one single score wider.](../../../sequence-analysis/images/quality_score_encoding.png)
- Phred scores are logarithmic
.small[
Phred Quality Score | Probability of incorrect base call | Base call accuracy
--- | --- | ---
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
...
]
???
- Logarithmic scale
- Different flavours of encoding exist
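
The relationship between quality characters, Phred scores and error probabilities can be sketched in a few lines of Python. This is a minimal illustration, assuming the Sanger / Illumina 1.8+ encoding (ASCII offset 33); the function names are hypothetical and not part of any tool used in the tutorial.

```python
def phred_from_char(char, offset=33):
    """Convert one FastQ quality character to a Phred score.

    offset=33 assumes Sanger / Illumina 1.8+ encoding; older Illumina
    encodings use an offset of 64.
    """
    return ord(char) - offset


def error_probability(phred):
    """Phred scores are logarithmic: Q = -10 * log10(P)."""
    return 10 ** (-phred / 10)


# The ':' character from the FastQ figure corresponds to a score of 25
score = phred_from_char(":")
print(score, error_probability(score))
```

With this encoding a score of 20 corresponds to a 1 in 100 chance of an incorrect base call, matching the table on the slide.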
---
## Preprocessing
In this tutorial we start with some preprocessing steps
![preprocessing workflow](../../images/metatranscriptomics/workflow_qc.png)
---
## Preprocessing: Tools
In this tutorial we start with some preprocessing steps
![preprocessing workflow](../../images/metatranscriptomics/workflow_qc_tools.png)
???
| Step | Tools |
|:------|-------|
|Quality Control reports | FastQC {% icon tool %} and MultiQC {% icon tool %} |
|Trimming and Filtering | Cutadapt {% icon tool %} |
|Filter ribosomal RNA | SortMeRNA {% icon tool %} |
|Interlace FastQ files | FastQ interlacer {% icon tool %} |
---
## Quality Reports: FastQC
- Generate a web report with quality metrics of your FastQ file
![Screenshot of FastQC report, showing the table of contents with green checks on nearly every result, and the base statistics and per-base sequence quality graphs shown.](../../../sequence-analysis/images/fastqc-report.png)
.footnote[see also our [dedicated QC tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}) ]
---
## Quality Reports: FastQC
- Many different QC plots
- Example: Per-base sequence Quality plot
![Fastqc quality score plot, most results are in the green region but the box portion of the box and whisker plot start to dip into the yellow, medium quality (less than 30) region near 34+ base position in read. The whiskers begin extending to the red region (less than 20) by base 31 and get progressively worse.](../../../sequence-analysis/images/per_base_sequence_quality_good.png)
.footnote[explanation of different plots: [dedicated QC tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}) ]
???
- in the per-base sequence quality plot, a boxplot of the base quality (y-axis) per position in the read (x-axis) is drawn
- often you might observe a drop in quality towards the end of the reads, and may consider trimming ends
- this example is very good
---
## Quality Reports: FastQC
- Many more plots
- See [QC tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}) for more information
![Montage of several different FastQC reports showing sequence quality graphs, and a number of other line graphs.](../../../sequence-analysis/images/quality-control/all_plots.png)
---
## Quality Reports: MultiQC
.pull-left[
- Combine multiple FastQC reports into one report
- Also for outputs of other tools
- Great when sequencing large numbers of samples
]
.pull-right[
![Multiqc's report showing an aggregation of multiple samples. An overview at the top provides context for the 4 samples, and a sequence quality histogram shows 4 samples with similarly behaving quality scores](../../images/metatranscriptomics/multiqc.png)
]
---
## Read Trimming and Filtering: Cutadapt
- Trim low-quality bases from reads
- Filter reads based on length, mean quality score, ...
- Remove adapters/primers
.image-75[![adapter trimming](../../../sequence-analysis/images/quality-control/trimming.png)]
- Many tools: Cutadapt {% icon tool %}, TrimGalore {% icon tool %}, Trimmomatic {% icon tool %}, ...
???
These are some examples of ways to trim and filter data. Many more are possible, and what is necessary depends on your experiment.
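
The idea behind quality trimming and length filtering can be illustrated with a deliberately naive Python sketch. It is not Cutadapt's actual algorithm (Cutadapt uses a running-sum approach); the function names and default thresholds below are assumptions chosen for illustration only.

```python
def trim_3prime(seq, quals, min_quality=20):
    """Remove bases from the 3' end while their Phred score is below min_quality.

    Naive illustration only: stops at the first base from the end that
    passes the threshold.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def passes_length_filter(seq, min_length=20):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_length


seq, quals = trim_3prime("ACGTAC", [30, 30, 30, 30, 10, 5])
print(seq, quals, passes_length_filter(seq, min_length=4))
```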
---
## SortMeRNA
.pull-left[
- Most RNA sequences will be ribosomal RNA (rRNA)
- Great for taxonomic assignment (Who is there?)
- Not informative for functional analysis (What are they doing?)
- Filter out rRNA before doing functional analysis
]
.pull-right[
.image-90[![SortMeRNA](../../images/metatranscriptomics/sortmerna.png)]
]
---
## FastQ interlacer
- Paired-end data often comes in two separate FastQ files
- One file with *forward* reads, one with *reverse* reads
<br>
![paired end deinterlaced file](../../../sequence-analysis/images/mapping/pairedend_fastq.png)
---
## FastQ interlacer
- Some tools require a single **interlaced** FastQ file
- Galaxy has tools for **interlacing** and **deinterlacing** FastQ files
<br>
![paired end interlaced file](../../../sequence-analysis/images/mapping/pairedend_interlaced.png)
???
forward and reverse files are 'zipped' together into a single file
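
A minimal Python sketch of this 'zipping', assuming each read occupies exactly four lines and that both files list the pairs in the same order. The function names are hypothetical; this is not the implementation of the Galaxy FastQ interlacer tool.

```python
from itertools import islice


def read_records(lines):
    """Yield FastQ records as lists of four lines (label, sequence, +, quality)."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("incomplete FastQ record")
        yield record


def interlace(forward_lines, reverse_lines):
    """Alternate forward and reverse records, keeping mates adjacent.

    zip() stops at the shorter input, so unpaired trailing reads are dropped.
    """
    for fwd, rev in zip(read_records(forward_lines), read_records(reverse_lines)):
        yield from fwd
        yield from rev


forward = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "TTTT", "+", "IIII"]
reverse = ["@r1/2", "TGCA", "+", "IIII", "@r2/2", "AAAA", "+", "IIII"]
interlaced = list(interlace(forward, reverse))
print(interlaced[0], interlaced[4])
```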
---
## Community Profile
- We want to identify which organisms are present in our sample, and their relative abundances
![Cartoon of several differently coloured and shaped microbes in a circle.](../../images/microbial-community.png)
- **MetaPhlAn2** {% icon tool %} for identification
- **Krona** {% icon tool %} and **GraPhlAn** {% icon tool %} for visualisation
---
## MetaPhlAn2 {% icon tool %}
- Estimates the presence and relative abundance of microbial cells
- Maps reads against a set of marker sequences
- Caveat: this tool is designed for DNA-seq
- Be careful interpreting abundances when using this tool with transcriptomics data
.footnote[Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.]
???
About the caveat: The theoretical problem is that we quantify species abundance by averaging the coverage of marker genes. Marker genes are supposed to be at the same coverage as they are single copy genes from the same genome, but this is not true for their transcripts. So MetaPhlAn2 on metatranscriptomics gives an idea about the average transcriptional rate of a given species. So it _can_ be used with caution...
---
## Krona {% icon tool %}
- Visualization of community composition, interactive plot
<iframe id="krona" src="krona.html" frameBorder="0" width="80%" height="600px"> ![Krona at bacteria level](../../images/metatranscriptomics/krona_bacteria.png) </iframe>
---
## GraPhlAn {% icon tool %}
- *Cladogram* visualisation
![Colourful cladogram which begins from the center and expands outward with the lineage of the samples. Each sector of the chart is coloured differently for each group of genus and species. E.g. streptococcus streptococcaceae has three different leaves of the cladogram tree.](../../images/metatranscriptomics/graphlan_general.png)
---
## Genus Abundance
- Tutorial: one timepoint
- Over multiple timepoints:
![stacked bar chart with timepoints along the x axis and genus abundance as a percentage along the y axis. Each of the 7 time samples consists mostly of Coprothermobacter and Clostridium.](../../images/metatranscriptomics/genus_abundance.png)
---
## Functional Analysis
- Pathways
- Gene Ontology
- Biological process
- Molecular function
- Cellular component
- Gene Family
---
## Workflow
![functional analysis workflow schematic](../../images/metatranscriptomics/workflow_functional.png)
???
HUMAnN2
- next generation
- HMP Unified Metabolic Analysis Network
- developed by Huttenhower lab
- itself a workflow/pipeline
- basically answers the question: what is the microbial community capable of?
---
## HUMAnN2 {% icon tool %}
- Profiles the presence/absence and abundance of microbial pathways in a community
- Efficiently characterizes microbial metabolic pathways
- **Input**
- Interlaced non-rRNA reads
- Taxonomic profile (MetaPhlAn2 {% icon tool %})
- **Output**
- Gene families and their abundance
- Pathways and their coverage
- Pathways and their abundance
???
- uses 5 parts: non-rRNA reads, MetaPhlAn2 taxonomy, NCBI nucleotide database, UniRef50/90 protein database, MetaCyc/UniPathway.
- Show the Galaxy wrapper
---
## HUMAnN2 Tiered Search
---
class: top
.left-column70[
<br>
- Meta-omic sequences (DNA/RNA)
]
.right-column30[
.image-60[![Cartoon of several reads coloured into four groups, Species 1, Species 2, Unclassified, Novel.](../../images/metatranscriptomics/humann2_tiered1.png)]
]
--
.left-column70[
<br>
- Initial screen through **MetaPhlAn2** {% icon tool %}: known microbial species
- Database: merging pangenomes of identified species
]
.right-column30[
.image-60[![Four bins labelled 1 (red), 2 (blue), 3, 4 with reads from the top cartoon show piles of 1 and 2 with NO signs over 3 and 4.](../../images/metatranscriptomics/humann2_tiered2.png)]
]
--
.left-column70[
<br>
- Nucleotide-level mapping against database
]
.right-column30[
.image-60[![Regions x 1 y in red and x 2 y in blue are shown, the pangenomes of each of the red and blue species are shown. Reads map to most segments of the pangenome.](../../images/metatranscriptomics/humann2_tiered3.png)]
]
--
.left-column70[
<br>
- Unaligned reads are searched against a protein database (UniRef) through accelerated translated search
]
.right-column30[
.image-75[![Reads are shown matching against portions of protein sequences of X, Y, Z](../../images/metatranscriptomics/humann2_tiered4.png)]
]
???
HUMAnN2 takes the non-rRNA reads, and MetaPhlAn2 provides a list of abundant organisms.
It then performs nucleotide-level pangenome mapping with Bowtie2 against the ChocoPhlAn database, giving organism-specific gene hits and unmapped reads.
The unmapped reads are further searched against a protein database using accelerated translated search.
The protein hits are then combined with the gene hits and MetaCyc to give the output.
---
**Result:** Gene family and pathway abundances
.image-40[![A table with two columns, Feature on left and RPK on right. GeneX has an RPK of 8. GeneX|Species1 has an RPK of 2, species2 and unclassified are listed with an RPK of 3.](../../images/metatranscriptomics/humann2_tiered5.png)]
---
## Gene Families Abundances
![Screenshot of a table in Galaxy with Gene Family on left and humann2 abundance RPKs on right.](../../images/metatranscriptomics/gene_family_abundance.png)
RPK (reads per kilobase) = sum of alignment scores
???
Gene families: groups of evolutionarily related proteins that perform similar functions
Pathway abundance: sum over the genes catalyzing the reactions in the pathway
Pathway coverage: presence/absence
RPK (relative gene copy number): computed as the sum of all alignment scores over a particular gene family
UNMAPPED: total number of reads that remained unmapped even after both alignment steps
UNINTEGRATED: no pathway detected.
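
In the gene family table, a community total row is followed by per-species rows whose feature name contains a `|` separator. The Python sketch below separates the two kinds of rows; the values are the ones from the "Result" slide (GeneX = 8; Species1 = 2, Species2 = 3, unclassified = 3) and the function name is hypothetical.

```python
from collections import defaultdict


def split_stratified(rows):
    """Split (feature, RPK) rows into community totals and per-species values."""
    totals = {}
    by_species = defaultdict(dict)
    for feature, rpk in rows:
        if "|" in feature:
            gene, species = feature.split("|", 1)
            by_species[gene][species] = rpk
        else:
            totals[feature] = rpk
    return totals, dict(by_species)


rows = [
    ("GeneX", 8.0),
    ("GeneX|Species1", 2.0),
    ("GeneX|Species2", 3.0),
    ("GeneX|unclassified", 3.0),
]
totals, by_species = split_stratified(rows)
print(totals["GeneX"], sum(by_species["GeneX"].values()))
```

In this example the per-species values add up to the community total.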
---
## Gene Families to Functional Annotation
![Humann2 regroup table is the left node in a flow chart with UniRef50. Multiple lines are drawn to an unlabelled right node that lists metacyc, kegg, pfam, EC, GO, informative GO, slim GO.](../../images/metatranscriptomics/results_functional_annotation.png)
???
Depending on the complexity of the community, the gene family table can become too large to interpret. To simplify it, users can regroup gene families with the regrouping tool, and can download the mapping files it needs.
HUMAnN2 regroups UniRef50/90 values to GO terms to get a broad overview.
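
Regrouping can be thought of as summing gene family abundances per functional group. The Python sketch below shows that idea only; it assumes summation as the grouping function and an `UNGROUPED` bucket for families without a mapping, and all identifiers and values are made up for illustration. It is not the HUMAnN2 implementation.

```python
from collections import defaultdict


def regroup(abundances, mapping):
    """Sum gene family abundances into the groups they map to.

    abundances: {gene_family: abundance}
    mapping:    {gene_family: [group, ...]} (a family may belong to several groups)
    """
    grouped = defaultdict(float)
    for family, value in abundances.items():
        groups = mapping.get(family)
        if not groups:
            grouped["UNGROUPED"] += value
            continue
        for group in groups:
            grouped[group] += value
    return dict(grouped)


# Hypothetical identifiers and abundances
abundances = {"UniRef50_A": 4.0, "UniRef50_B": 6.0, "UniRef50_C": 1.0}
mapping = {"UniRef50_A": ["GO:1"], "UniRef50_B": ["GO:1", "GO:2"]}
print(regroup(abundances, mapping))
```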
---
## Group Abundances
![humann2 regroup table, lines from uniref50 to GO. group humann2 to GO slim terms shows a similar graphic, lines from uniref50 to slim GO.](../../images/metatranscriptomics/results_group_abundances.png)
???
Group abundances converts GO terms to GO slim terms (a subset of GO terms), split into molecular function, biological process and cellular component.
---
## Gene Families to Functional Annotation
![Table from Galaxy shown with gene family and RPK](../../images/metatranscriptomics/results_gene_family.png)
![Group HUMAnN2 to GO slim terms with lines from uniref50 to slim GO and boxes of Molecular Function, biological process, and cellular component below Slim GO](../../images/metatranscriptomics/results_gene_family_function.png)
![Another galaxy table screenshot with GO id, GO name, and abundance.](../../images/metatranscriptomics/results_go.png)
---
class: top
## Output
.left-column30[
<br>
Molecular Function
]
.right-column70[
.image-90[![Table in Galaxy with GO ID, name, abundance.](../../images/metatranscriptomics/results_molecular_function.png)]
]
--
.left-column70[
.image-90[![Basically the same table as above.](../../images/metatranscriptomics/results_biological_process.png)]
]
.right-column30[
<br>
Biological Process
]
--
.right-column70[
.image-90[![Again the same columns in a table. None of the specific data is legible or important.](../../images/metatranscriptomics/results_cellular_component.png)]
]
.left-column30[
<br><br>
Cellular Component
]
???
g is genus
s is species level
---
## Unpack pathway abundances to show genes included
- Renormalize the gene and pathway abundances in copies per million or relative abundance
- This tool unpacks the pathways abundance by including gene families
![output file unpack pathway tool](../../images/metatranscriptomics/results_unpack.png)
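???
Renormalization rescales the abundances so that samples with different sequencing depth become comparable. A minimal Python sketch, assuming every value is simply divided by the column total and then scaled; the function name and example values are hypothetical, and this is not the HUMAnN2 implementation.

```python
def renormalize(rpk, units="cpm"):
    """Rescale RPK values to copies per million ("cpm") or relative abundance ("relab")."""
    if units not in ("cpm", "relab"):
        raise ValueError("units must be 'cpm' or 'relab'")
    total = sum(rpk.values())
    if total == 0:
        raise ValueError("cannot renormalize an all-zero table")
    scale = 1_000_000 if units == "cpm" else 1.0
    return {feature: value / total * scale for feature, value in rpk.items()}


# Hypothetical gene family abundances in RPK
rpk = {"GeneA": 1.0, "GeneB": 3.0}
print(renormalize(rpk, "cpm"))
print(renormalize(rpk, "relab"))
```

After renormalization, copies per million sum to one million and relative abundances sum to 1.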
---
## Function: Cellulose Degradation
- Quantitative analysis of gene family outputs from HUMAnN2 shows upregulation of cellulase
.image-75[![line chart showing cellulase abundance decreasing from 80 copies per million to 40 as time goes from 13 to 43. Cellulose 1,4 beta cellobiosidase starts at 140 cpm and dips at hour 23 to 120 before increasing to 200 by the end of the graph](../../images/metatranscriptomics/results_cellulose_degradation.png)]
???
explain about datasets first
Cellulose 1,4-beta-cellobiosidase is responsible for the hydrolysis of cellulose.
The gene encoding the cellulose-binding domain protein shows an initial decrease and a subsequent increase during cellulose degradation.
---
## Functions associated with a selected taxon
.image-75[![Stacked bar chart with a lot of organisms as left axis (abundance, copies per million) and time on the bottom. It is labelled Coprothermobacter: Functional Pathways](../../images/metatranscriptomics/results_functions_taxon.png)]
???
In the genus abundance plot, Coprothermobacter and Clostridium were observed to be the most abundant.
In this figure we are looking at Coprothermobacter only. Glycolysis is observed to be the most abundant functional pathway across time points in Coprothermobacter.
---
## Taxa associated with a selected function
.image-75[![Bar chart titled Adenosine ribonucleotides de novo biosynthesis with time in hours as x axis, and Genus abundance (copies per million). Coprothermobacter and Clostridium decrease from ~2000 combined copies per million to ~800, in approximately equal amounts.](../../images/metatranscriptomics/results_taxa_function.png)]
???
This figure shows the contribution of each genus to adenosine ribonucleotides de novo biosynthesis across time points.
It shows that Clostridium and Coprothermobacter are the abundant genera during ATP synthesis.
---
# Tabular Outputs from ASaiM Workflow
- Taxonomy (Who?)
- Kingdom, phylum, class, order, family, genus, species, strain
- Function (What?)
- Pathways
- Gene Ontology
- Biological Process
- Molecular Function
- Cellular Component
- Gene Family