-
Notifications
You must be signed in to change notification settings - Fork 2
/
metagenomics.Rmd
1057 lines (709 loc) · 45.1 KB
/
metagenomics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Visualization of Metagenomic and Metatranscriptomic Data in R"
author: |
| David Levy-Booth
| Edited by Ameena Hashimi, Stephan Koenig and Florent Mazel
| The University of British Columbia
date: "April 7 - 9, 2020"
output:
html_document:
toc: yes
toc_float:
collapsed: no
pdf_document:
fig_height: 6
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
Metagenomics and metatranscriptomics offer unparalleled insight into the taxonomic and functional diversity present across a multitude of environments. Yet the vast amounts of data these tools produce is challenging to effectively analyze and visualize.
We will analyze a unique thermal swamp metagenome and metatranscriptome dataset to develop strategies for effective visual communication.
![Thermal Swamp Metagenomics](images/metagenomics.header.png)
This workshop is designed around the application of whole-genome phylogenetic analysis to organize the diverse data types that we can generate from metagenomics and metatranscriptomics analysis.
You will learn:
1. Manipulating and visualizing whole-genome phylogenies with <i>ggtree</i>
2. Mapping genome assembly statistics to phylogenetic trees with <i>ggplot2</i>
3. Processing count data from <i>Meiji</i> and mapping to phylogenetic trees
4. Genomic functional annotation and pathway analysis with <i>pathview</i>
5. Mapping metatranscriptomic count data and coverage to circular genome maps with <i>circlize</i>
This workshop assumes some prior experience with R (such as that in our [Introduction to R workshop](https://github.com/EDUCE-UBC/workshops_R/tree/master/intro_R_2hr)). All code presented in this workshop is contained in the "metagenomics.R" R script file.
You can follow along with the workshop by selecting the relevant line(s) of code and pressing ctrl+enter on Windows, cmd+enter on MacOS, or using the "Run" button in the upper right of your script window to execute it.
Note that the majority of this workshop uses base R functions. No experience with Tidyverse is required.
Please call over the instructor or a TA if you have any problems with executing the code or if you need any help with the exercises.
# Prior to the workshop
## Setup Instructions
Please come to the workshop with your laptop setup with the required software and data files as described in our setup instructions.
## Making an RStudio Project
Projects allow you to divide your work into self-contained contexts.
Let's create a project to work in.
In the top-right corner of your RStudio window, click the "Project: (None)" button to show the projects
drop-down menu. Select "New Project." > "New Directory" > "New Project." Under directory name, input
"metagenomics" and choose a parent directory to contain this project on your computer.
## Installing Packages
Many R packages can be downloaded and installed from The Comprehensive R Archive Network (CRAN).
You can install each package with the following code:
```{r eval = FALSE}
install.packages("BiocManager",
"devtools",
"RColorBrewer",
"ape",
"tidyverse",
"ggplot2",
"phytools",
"stringr",
"reshape2",
"cowplot",
"grid",
"circlize",
"dplyr",
"aplot")
```
To make things a bit more difficult, there are other source of R packages, such as Bioconductor, which provides tools for the analysis and comprehension of high-throughput genomic data.
```{r eval = FALSE}
BiocManager::install("pathview")
```
##Install ggtree
And finally, we have to install the developmental version of _ggtree_ from GitHub [https://github.com/YuLab-SMU/ggtree](https://github.com/YuLab-SMU/ggtree). **This is very important to ensure colored ranges are applied correctly** (the Bioconductor version does not apply colors to trees correctly)
```{r eval = FALSE}
library(devtools)
install_github("YuLab-SMU/ggtree")
```
##Loading Packages
Next, we should check if all packages can be loaded properly. If a package does not load because of a missing package, please install the package (as shown above) and try again.
```{r, message=FALSE, warning=FALSE}
# Suite of packages for data manipulation and visualization
library(tidyverse)
# Working with trees
library(ape)
library(phytools)
# Plotting Grammar of Graphics
library(ggplot2)
# Tree visualization
library(ggtree)
# String manipulation
library(stringr)
# Data structure manipulation
library(reshape2)
# Multi-panel figure layouts
library(cowplot)
library(grid)
# Color maniputation
library(RColorBrewer)
# Pathway analysis with KEGG Orthologs
library(pathview)
#plot circular graphs and contig maps
library(circlize)
```
##Downloading Workshop Material
The material for this workshop is hosted on GitHub [https://github.com/davidlevybooth/Metagenomics_Workshop](https://github.com/davidlevybooth/Metagenomics_Workshop).
Follow the above link and click the green "Clone or download"" button to download all files as a .zip archive. Unzip the archive to access files.
# Loading custom functions
I have written some helpful functions to replace some time-consuming workflows. Please load them from the source R script like this:
```{r}
#Custom functions for metagenomics and metatranscriptomics
source("scripts/metagenome.utilities.R")
```
Please ensure all packages are installed and up to date, and that you have all data downloaded prior to the start of the workshop. If you require any assistance, please contact David Levy-Booth (dlevyboo@mail.ubc.ca).
We will also spend a few minutes prior to the workshop ensuring that all attendees have completed setup and installation.
# Sanity Check
Because _ggtree_ is the main package that we will work with throughout the workshop, please ensure that you can reproduce a simple example plot using this package.
Please enter the following code in RStudio and ensure your output matches the plot provided:
```{r message=FALSE, warning=FALSE, fig.height = 5, fig.align = "center"}
library(ape) #in case you did not load the package earlier.
library(ggtree)
set.seed(2020)
example.tree <- rtree(4)
ggtree(example.tree) + geom_balance(node=6)
```
If your plot **does not** look like this, you may have to re-install the _ggtree_ package from GitHub and any dependent packages. If it does, congratulations, we are ready to begin!
## Background: Field site and shotgun sequencing methods
![Liard River Hot Springs](images/map.png)
Thermal swamp complexes are rare ecosystems containing unknown diversity of biogeochemical cycling and organic biomass-recycling microbial consortia. By applying metagenomic and metatranscriptomic techniques across thermally-distinct environments we can better understand the organisms responsible for nutrient transformations in these ecosystems, and how their distributions are shaped by thermal properties. Characterizing the thermal microbiome can also open up new biotechnological possibilities including lignin bio-refinement.
Liard River Hot Springs [(59.431, 126.1)](https://goo.gl/maps/hYndRPAKR9hLu8jo9) is a complex of carbonate-hosted hot springs that feed thermally-warmed meteoric waters to extensive swampland underlayed by sandstone and shale. The pools are surrounded by unique thermal meadow vegetation and mixed forest (_Populus tremuloides, Betula papyrifera, Picea glauca_). The two main hot spring samples we will analyze in this workshop are Beta (30-35^o^C) and Epsilon (40-45^o^C) pools. Water cools as it traverses the marshland via the Alpha river outflow to about 17^o^C at our final sampling location (Cool).
One ng DNA from three replicates of selected sediment samples (Cool, Beta, Epsilon) was prepared for shotgun sequencing on one NextSeq550 (Illumina) run in High Output mode using Nextera XT library preparation (Illumina).
As the warmest undisturbed pool, 2 g sediment from Epsilon was incubated with 10 mg of vanillate, Eucalyptus milled-wood lignin (EMWL), Eucalyptus Kraft lignin (EKL), Coniferyl-alcohol dehydrogenase-polymer (DHP). Control incubations were conducted with no exogenous carbon. RNA was extracted from approx. 0.5 g sediment. RNA integrity was measured using an Agilent 2100 Bioanalyzer (Agilent Technologies) before and after rRNA removal with RiboMinus Transcriptome Isolation Kit, bacteria (ThermoFisher). Sequencing libraries were prepared using SuperScript Double-Stranded cDNA synthesis (ThermoFisher) and NexteraXT (Illumina) for sequencing using NextSeq550 (Illumina) High Output.
Genomes were extracted from metagenome assemblies using [MetaBat2](https://github.com/Ecogenomics/GTDBTk) and manually refined.
## Background: Genome Taxonomy Database
The Genome Taxonomy Database (GTDB) [https://gtdb.ecogenomic.org/](https://gtdb.ecogenomic.org/) is a standardized bacterial and archaeal taxonomy based on genome phylogeny. It contains about 24,000 species-level genomes with revised taxonomy for the genomic era. We will leverage the GTDB and its associated classification tool, GTDB-Tk [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk) to classify and place the whole-genome phylogeny of our assembled genomes.
For each genome our job will be to answer the following biological questions:
1. What are they? (Taxonomic classification)
2. Can we trust them? (Assembly statistics)
3. Where are they? (Abundance calculations)
4. What can they do? (Function/Pathway analysis)
5. What are they doing? (Metatranscriptomics)
## Introduction to phylogenomics
Please see the slide deck for an introduction to, and motivations for, phylogenetic analysis of microbial genomes.
Note that due to time and memory constraints, the genomes that we will use in this workshop have already undergone classification with GTDB-Tk.
# Section 1. Basic Tree Plotting
Load the GTDB-Tk tree for Thermal Swamp Metagenome-Assembled Genomes (MAGs).
We use the analyses of phylogenetics and evolution (ape) package, which provides basic functionality for loading and manipulating phylogenetic data. Here, we will use the read.tree() function to open the GTDB-Tk classification tree for the genomes that we will use in throughout these exercises.
Note that GTDB-Tk will separately analyze and place archaeal and bacterial genomes using 122 or 120 single copy genes (SCGs). We then used the bind.tree() function to combine both trees at their default roots. This combined tree is what is provided for you.
```{r}
tree.path <- "data/gtdb.classification.tree"
#Use read.tree(<tree.path>) to load tree file from path.
#Important: Make sure you use the "ape" version of the read.tree() function
tree <- ape::read.tree(tree.path)
#Inspect tree object
tree
```
Let's also load the taxonomy paths for the GTDB Genomes. The taxonomy of the GTDB is based on a revised, standardized taxonomic-rank normalization according to relative evolutionary divergence (RED). [(See Park et al., 2018. Nat. Biotech 4229)](https://www.nature.com/articles/nbt.4229)
```{r}
GTDB.taxonomy.path <- "data/gtdb.taxonomy.tsv"
#Load data
GTDB.taxonomy <- read.csv(GTDB.taxonomy.path, sep="\t", stringsAsFactors = FALSE)
```
What does the taxonomy data look like?
```{r}
head(GTDB.taxonomy)
```
Isolate tip labels. Each tip is a genomic placement.
```{r}
tips <- tree$tip.label
head(tips)
```
> Question: What other data are associated with tree objects? How can you access it? (tip: look at help page of function ape::read.tree, type "names(tree)"", or "tree"" followed by the dollar sign to see what is autofilled: tree$)
If we try to plot all 23K GTDB bacterial genomes with MAG placements the figure would be a mess, and my computer will probably explode. Let's subset the number of tips to a reasonable number first using a custom function.
```{r}
#Let's set "seed" to ensure that we all end up with the same list of tips to keep
set.seed(69420)
tips.keep <- sample_gtdb_tips(tips, 1000)
```
Use keep.tips() to subset tree.
```{r}
tree.subset <- keep.tip(tree, tips.keep)
#if you are having difficulty with you subset tree, please uncomment to load from the following file:
#tree.subset <- read.tree("results/tree.subset")
```
> Question: How else could we have accomplished this? (tip: look at ape::drop.tip function)
Plot the reduced tree with ggtree.
Note that we set xlim() to make room for longer labels (ggtree generally does a poor job of managing plot space)
```{r fig.height = 3, fig.align = "center"}
tree.plot <- ggtree(tree.subset) + xlim(0,4)
tree.plot
```
We can take advantage of several layouts.
For example, while we will focus on plotting data to linear trees, let's take a look at plotting a circular tree, just for kicks.
```{r fig.height = 3, fig.align = "center"}
ggtree(tree.subset, layout="circular")
```
Plot the reduced tree with ggtree -- add tip labels with geom_tiplab()
```{r fig.height = 3, fig.align = "center"}
tree.plot + geom_tiplab(size = 2)
```
Hmm, not very readable. Let's add phyla labels with geom_cladelabel() instead.
First, let's use the taxonomy table to select a random phyla from the GTDB. I'm using Firmicutes_A, but we could select any number of other targets.
Note that we are trying to isolate one specific group of genomes by GenomeID that all share the same Phylum. To do this we subset the GTDB taxonomy table using the which() function, _which_ creates a Boolean vector of rows that match our condition: in this case if the Phylum column contains the string "p__Firmicutes_A"
Let's review some base R fundamentals before we proceed:
A data.frame in R is made of rows of observations and columns of variables. For metagenome or microbiome count data, rows are typically genome, OTUs, etc., which columns containing counts in each sample. Note that this organization is transposed for some ecological statistics packages (e.g., in _vegan_) that expect rows of samples with columns of species/OTUs/genomes.
Regardless, the rows and columns within a data.frame is represented by positions between square brackets, e.g., [rows, columns]. So if we want to take the first 4 rows we can subset the data.frame like this:
```{r eval=FALSE}
data.frame[1:4, ]
# This is an example, don't actually try to run this"
```
This means that we can use index (index) vectors OR Boolean (True, False) vectors to subset data with a host of functions (which, match, etc.). Below we use which to subset the data.frame, which has analogous functionality to filter() in tidyverse.
To demonstrate this we will subset the GTDB taxonomy table to only our clade of interest. We then only take the GenomeIDs for these subset genomes. We must also ensure that the resulting list is a character vector because a factor will do screwy things to the findMRCA() function.
```{r}
#Assign our phylogenetic clade of interest to a variable called "Clade"
Clade <- "p__Firmicutes_A"
#Here we combine a few base R functions to subset the GTDB taxonomy table to only:
GTDB.taxonomy.clade <- GTDB.taxonomy[which(GTDB.taxonomy[,"Phylum"] == Clade),]
Firmicutes_A_tips <- GTDB.taxonomy.clade$GenomeID
head(Firmicutes_A_tips)
```
> Question: Can you obtain the same using the tidyverse (tip= use subset)
```{r}
Firmicutes_A_tips_tidyverse = GTDB.taxonomy %>%
subset(Phylum==Clade) %>%
pull(GenomeID) %>%
as.character()
head(Firmicutes_A_tips_tidyverse)
```
Let's find the most recent common ancestor (MRCA) node for the Firmicute_A phylum.
To find the MRCA, we have to learn to use the findMRCA() function in the phytools package. This allows us to find the common ancestor nodes for a list of tree tips.
```{r}
Firmicutes_A_node <- findMRCA(tree, Firmicutes_A_tips, type="node")
Firmicutes_A_node
```
Next, I created the collapse_nodes() helper function (in metagenome.utilities.R) to apply findMRCA() over the whole tree to collect nodes for all phyla. It works by iterating through each clade at a specified taxonomic level (in this case "Phylum"), and calculates the ancestor node, how many tips belong to each clade, and the maximum branch length of the clade (useful for plotting collapsed clades).
```{r echo = TRUE, results='hide'}
Phyla.nodes <- collapse_nodes("Phylum", tree.subset, GTDB.taxonomy)
```
Almost done, next we can highlight clades of interest.
We use the which() function again as above.
> Question: How would we filter the table using tidyverse notation?
```{r fig.height = 3, fig.align = "center"}
#In tidyverse
Phyla.nodes %>%
filter(Group == "p__Firmicutes_A") %>%
.$Node
#In base R
Firmicutes_A_node <- Phyla.nodes[which(Phyla.nodes$Group == "p__Firmicutes_A"), ]$Node
Firmicutes_A_node
tree.plot <- ggtree(tree.subset) + xlim(0,4)
tree.plot + geom_balance(node=Firmicutes_A_node,
fill='darkgreen',
color='white',
alpha=0.6,
extend=1) +
geom_cladelabel(node = Firmicutes_A_node,
"Firmicutes A",
offset = 1,
fontsize = 2)
```
> ADVANCED: Plot highlights for other clades. How do you highlight more than one clade? (hint: add additional geom\_balance() and geom\_cladelabel() functions)
What about for ALL major clades?
We can run geom_balance() to produce colored ranges for all phyla in our dataframe in a "FOR" loop.
```{r fig.height = 5, fig.align = "center"}
colors <- some_color(Phyla.nodes)
#For loops are a basic tool used in scripting languages.
for(i in 1:nrow(Phyla.nodes)) {
if(Phyla.nodes$TipCount[i] > 10) {
tree.plot <- tree.plot +
geom_balance(node=Phyla.nodes$Node[i], fill=colors[i], color='white', alpha=0.6, extend=0.6) +
geom_cladelabel(node = Phyla.nodes$Node[i], Phyla.nodes$Group[i], offset = 0.6, fontsize = 2)
}
}
tree.plot
```
Can we do this for a circular plot?
To assist with adding colored ranges to the tree you are provided with a new custom function called clade_colors() that simply runs the loop above. It can be used with rectangular or circular trees.
```{r fig.height = 3, fig.align = "center"}
tree.plot <- ggtree(tree.subset, layout="circular")
tree.plot <- clade_colors(tree.plot, Phyla.nodes, colors)
tree.plot
```
(Looks pretty badass to me!)
One more thing: adding bootstrap values to trees
```{r}
tree.subset$node.label <- parse_bootstraps(tree.subset, method = "parse")
```
ggtree is sometimes counterintuitive. Case in point geom\_nodepoint() counts tips as nodes. To make sure we have enough elements in our final list of bootstraps, we have to build a vector of bootstrap values that also includes the number of tips. We can do this with another helper function called "parse\_bootstraps()."
```{r}
bs_count <- parse_bootstraps(tree.subset, method = "count")
tail(bs_count)
```
Plot numeric bootstrap values with the ggtree::geom_text() implementation:
```{r fig.height = 3, fig.align = "center"}
ggtree(tree.subset) + xlim(0,3) + geom_text(aes(label = bs_count), size = 2)
```
Simple graphical representation of Bootstrap values
```{r fig.height = 3, fig.align = "center"}
tree.plot <- ggtree(tree.subset) +
xlim(0,4) +
geom_nodepoint(aes(subset= bs_count >= 90),
fill = "cadetblue",
size=1.5,
alpha = 0.5,
shape = 21)
tree.plot
```
And add phyla labels:
```{r fig.height = 3, fig.align = "center"}
tree.plot <- clade_labels(tree.plot, Phyla.nodes, tiplimit = 5, fontsize = 2)
tree.plot
```
Finally, we must add a scale representing the substitution distance
```{r fig.height = 5, fig.align = "center"}
tree.plot + geom_treescale(x=0, y=length(tree.subset$tip.label)-50, width=0.2, offset = 10)
```
> Exercise: Using the above functions, find and highlight the phylum of your genome in a tree with bootstrap values and a scale
# Section 2. Plotting MAG phylogenies
Set path for MAG taxonomy table and import.
Note that we use the read.csv() function here, but set the delimiter to tabs.
> Question: What are the differences between read.csv(), read.table() and read.delim()?
```{r}
mag.taxonomy.path <- "data/gtdb.classification.tsv"
mag.taxonomy <- read.csv(mag.taxonomy.path, sep="\t")
```
Subset tree to show only MAG placements using keep.tips().
```{r}
tree.mags <- keep.tip(tree, as.character(mag.taxonomy$GenomeID))
tree.mags
```
Plot MAG tree with ggtree()
```{r fig.height = 3, fig.align = "center"}
tree.plot <- ggtree(tree.mags, ladderize=F) + geom_tiplab(size = 2) + xlim(0,2.6)
tree.plot
```
> Question: Why is 'ladderize=F' so important here? Hint: it has to do with the order of the genomes vertically.
Add arbitrary color ranges with the predefined some_color() function.
> Advanced: Check out the some_color() function in the metagenome.utilities.R file. how would you change this function to change how colored ranges are applied to taxonomic clades?
```{r echo = TRUE, results='hide'}
Phyla.nodes <- collapse_nodes("Phylum", tree.mags, mag.taxonomy)
colors <- some_color(Phyla.nodes)
```
Create tree plot with custom clade colors:
```{r}
tree.plot <- clade_colors(tree.plot, Phyla.nodes, colors, extend = 0.6)
tree.plot
```
Add bootstraps with the parse_bootstraps() function:
```{r}
tree.mags$node.label <- parse_bootstraps(tree.mags, method = "parse")
bs_count <- parse_bootstraps(tree.mags,method = "count")
```
> Question: What's the deal with the raw node values in the GTDB tree? How do we even find this information? (hint look at tree.mags$node.label before running parse_bootstraps())
Finally: Add it all together for a great looking phylogeny of our MAGs:
```{r fig.height = 5, fig.align = "center"}
tree.plot <- tree.plot +
geom_nodepoint(aes(subset= bs_count >= 75), fill = "cadetblue", size=2, alpha = 0.5, shape = 21) +
geom_treescale(x=0, y=1, width=0.1, offset = 0.5)
tree.plot
```
> ADVANCED: How would you go about highlighting only your genome tip of interest? (hint: remember how we used geom_balance() in the first exercise?)
# Section 3. MAG statstic plots
We can use checkM data to map genome binning metrics to MAG tree. CheckM uses single copy genes (SCGs) to determine the levels of completeness and contamination ("redundancy") in a genome bin. We also use BBMap to calculate coverage.
```{r}
MAG.checkM.path <- "data/checkm.qa.tsv"
MAG.checkM <- read.csv(MAG.checkM.path, sep="\t")
head(MAG.checkM)
```
To plot data we must organize data in the 'long' format using reshape2::melt()
> Question: What is the difference between 'wide' and 'long' data formats? When would each be used?
```{r}
MAG.checkM.long <- melt(MAG.checkM, id.vars = "GenomeID")
```
"Wide" data:
```{r}
head(MAG.checkM)
```
"Long" data:
```{r}
head(MAG.checkM.long)
```
Remember to set GenomeID as an ordered factor to match tree tip order. Factors are important in R for organizing data without changing the order of the underlaying data.
```{r}
MAG.checkM.long$GenomeID <- factor(MAG.checkM.long$GenomeID,
levels = tree.mags$tip.label)
MAG.checkM.long$variable <- factor(MAG.checkM.long$variable,
levels = rev(unique(MAG.checkM.long$variable)))
```
Create separate 'completeness' data.frame (still in "long" format).
We will use the imported 'factorize()' function to set a variable to be a factor where the factor levels are simply the current order of the variable in the data.frame.
But first we get to learn a new function!
> Question: What is grepl()? How does it differ from grep()?
> Answer: GREP stands for Global Regular Expression Print. We can match strings (characters), including with a wide variety of "wildcards" to subset a data.frame() by creating a numeric index vector for matching entries.
> grepl(), (grep logical) in contrast, produces a Boolean vector. This is really useful for certain functions like ifelse() that apply different outcomes based on Booleans.
Here, we could use grep() or grepl() to catch rows containing both the "Completeness" variable as well as the completeness residual "Completeness_AR" variable.
```{r}
#Subset to get only the completness data
MAG.checkM.long.completeness <- MAG.checkM.long[grepl("Completeness", MAG.checkM.long$variable), ]
#R gets some kooky ideas how we should order our data.
#Here we explicitly tell R to keep the order of your genomes (variables)
#using the factorize() helper function.
MAG.checkM.long.completeness$variable <- factorize(MAG.checkM.long.completeness$variable, rev = TRUE)
```
> Question: Why do you think we used rev = TRUE here?
> Answer: When we plot data in ggplot2 with flipped x and y axes using coord_flip(), the order of the "y-axis" can be reversed from what we expect. The factorize() function is designed to account for this.
This is how we would create the completeness plot with ggplot2.
Within ggplot function we first specify the data.frame that contains the data.
We then set the plotting aesthetics with aes() -- including the "x" and "y" variables. Here we also indicate that factors will be differentiated using "fill".
> Question: How does "fill" differ from "color"?
We could just plot the bin completeness with a single bar, but I would like to have more control over the aesthetics of the unplotted region. To do this we have calculated the "residual" (100-completeness) and can assign a different fill color for this region manually. The residuals are plotted with completeness values as a stacked bar plot.
```{r eval=FALSE}
ggplot(MAG.checkM.long.completeness,
aes(x = GenomeID, y = value, fill = variable)) +
geom_bar(stat = "identity", color = "black") + #The "black" color provides the border.
scale_fill_manual(values = c("cadetblue", "gold")) + #Manually assign fill for residual and completeness
coord_flip() + #Switches the x and y axis to produce vertical stacked bar plot.
theme(legend.position = "none") + #We don't need a legend for these data
ylab("Completeness") #Add label
```
That will work, but it would be cool to create a function that lets us apply extra formatting to all similar plot types. tree_bar_plot() replicates and extends the above ggplot plotting instructions. All we need to supply is the data and the x-axis label.
```{r fig.height = 3, fig.align = "center"}
MAG.completeness <- tree_bar_plot(MAG.checkM.long.completeness, "Completeness")
MAG.completeness
```
> Question: Why did we not include y-axis text? How can we change the above plot to double check that we got the y-axis order correct?
> ADVANCED: Create similar plots for genome Contamination and Coverage
```{r}
MAG.checkM.long.contamination <- MAG.checkM.long[grepl("Contamination", MAG.checkM.long$variable), ]
MAG.checkM.long.contamination$variable <- factorize(MAG.checkM.long.contamination$variable, rev = TRUE)
MAG.checkM.long.coverage <- MAG.checkM.long[grepl("Coverage", MAG.checkM.long$variable), ]
MAG.checkM.long.coverage$variable <- factorize(MAG.checkM.long.coverage$variable, rev = TRUE)
MAG.contamination <- tree_bar_plot(MAG.checkM.long.contamination, "Contamination",
colors = c("cadetblue", "firebrick"))
MAG.coverage <- tree_bar_plot(MAG.checkM.long.coverage, "Coverage",
colors = c("cadetblue", "darkblue"))
```
> Question: Is there an even more efficient way to create multiple plots using ggplot2?
Put it all together to plot genome statistics with the tree we created earlier:
```{r}
tree.plot <- tree.plot + scale_y_tree() #got a warning here '`expand_scale()` is deprecated; use `expansion()` instead. '
```
> Question: Wait, why do we need this scale_y_tree() function?
> Answer: scale_y_tree() is critical for aligning ggtree plots with ggplots.
> Change values in scale_y_tree() function in data/metagenome.utilities.R if you need to change the vertical alignment with a ggplot. Warning: setting the y-scale is only for advanced users.
Align all plots together with cowplot::plot_grid()
> Notice that we should be able to horizontally align the plots at the bottom axis with align = "h" and axis = "b"
```{r fig.height = 5, fig.align = "center"}
MAG.title <- plot_title("Liard River Phylogeny and Statistics")
MAG.plots <- plot_grid(tree.plot,
MAG.completeness,
MAG.contamination,
MAG.coverage,
align = "h",
axis = "b",
ncol = 4,
rel_widths = c(1.25, 0.25, 0.25, 0.25))
#Nesting the plot_grid() function allows us to easily add titles to our groups of plots!
plot_grid(MAG.title,
MAG.plots,
ncol = 1,
rel_heights = c(0.1, 1))
```
Looks great!
# Section 4. Aligning abundance plots with phylogenetic trees.
Now we will use a heatmap to explore MAG environmental abundance and temperature range.
Load count data. Meiji was used to align shotgun metagenome reads to MAGs and select GTDB genomes.
```{r}
counts.path <- "data/metagenome.counts.tsv"
counts <- read.table(counts.path, sep="\t", header = TRUE)
```
We should also parse the lowest level of taxonomic classification so we can tell what the heck each genome is without having to plot the whole taxonomy string.
```{r}
#tax_class() is helper function from metagenome.utilities.R
counts$Classification <- tax_class(counts$Taxonomy)
```
Optionally, we can calculate relative abundance for each sample:
```{r}
#relative_abundance() is helper function from metagenome.utilities.R
counts.ra <- relative_abundance(counts)
```
Or normalize across taxa:
```{r}
#normalize() is helper function from metagenome.utilities.R
counts.norm <- normalize(counts)
```
Now, match the count data to the order of the tree (remember why we set 'ladderize=F'?), then format for plotting.
We use the match() function a few more times to ensure these data are organized correctly.
* _%in%_ subsets a data.frame by creating Boolean vector of presence or absence in a list.
* _match_ creates an index vector to re-order data.frame rows using a character vector.
* _order()_ re-orders data.frame rows based on alphanumeric variables.
* _factor()_ re-orders hierarchy of data.frame row entries without changing the data itself
> _match_ is a really useful function for keeping data organized across data.frames!
> Question: Can you tell what the tax_class() function is doing by looking at the function in the "metagenome.utilities.R" script?
```{r}
#We once again match the counts.norm and mags.taxonomy.norm data.frames with the order of tips in the tree.
#It can be confusing that the first vector in the match function is actually
#what you're trying to match to, so be careful with the order.
counts.norm <- counts.norm[match(tree.mags$tip.label, counts.norm$GenomeID), ]
mag.taxonomy.norm <- mag.taxonomy[match(tree.mags$tip.label, mag.taxonomy$GenomeID), ]
#Now that the tree tip, counts.norm and mag.taxonomy.norm all match we can consolidat the variables we want to retain in the counts.norm data.frame.
counts.norm$GenomeID <- tree.mags$tip.label
counts.norm$Taxonomy <- mag.taxonomy.norm$Taxonomy
#Finally we use the tax_class() function to pull out the lowest level of GTDB classification for each genome in the counts.norm data.frame.
counts.norm$Classification <- tax_class(counts.norm$Taxonomy)
```
Melt data.frame for plotting and do some more formatting:
```{r}
#Transform the count data to the "long" format for plotting.
#reshape2::melt() creates three types of columns: IDs, variables and values.
#We want to retain "GenomeID", "Taxonomy", and "Classification" as IDs.
#And we keep the actual counts as the "value". Making each sample a "variable"
counts.long <- melt(counts.norm,
id.vars = c("GenomeID", "Taxonomy", "Classification"),
value.name = "norm.count")
#Use factorize() again to lock in the current variable order when plotting
counts.long$variable <- factorize(counts.long$variable)
#We use the paste0 function to create a new label by combining the genome name and
#the classification.
counts.long$Label <- paste0(counts.long$Genome, " (", counts.long$Classification, ")")
counts.long$Label <- factorize(counts.long$Label)
#Finally, we use cut_last(), which is a wrapper for the substr() function.
#This allows us to trim characters from the end of our sample variables
#removing the replicates and keeping the pool IDs.
counts.long$Pool <- cut_last(counts.long$variable, 3)
counts.long$Pool <- factorize(counts.long$Pool)
#Let's take a look:
head(counts.long)
```
Plot Gene Count Heatmap with ggplot2 using geom_tile()
> Question: Why are we using facets to divide the data from different pools into different subplots?
```{r fig.height = 5, fig.align = "center"}
counts.heatmap <- ggplot(data = counts.long, mapping = aes(x = variable, y = Label, fill = norm.count)) +
geom_tile(color = "grey50") +
scale_fill_gradient(low = "#FFFFFF", high = "#012345", na.value = "#FFFFFF") +
scale_y_discrete(drop=FALSE) + #keep even "empty" genomes to preserve all elements in the tree
theme(axis.text = element_text(size = 8, vjust = 0.4),
axis.text.x = element_blank(),
axis.title = element_blank(),
strip.background = element_blank()) +
facet_grid(.~Pool, scales = "free_x")
#Let's take a look at the plot!
counts.heatmap
```
Finally, combine tree and heatmap plot with cowplot::plot_grid()
```{r fig.height = 5, fig.align = "center"}
counts.long$Genome <- factor(counts.long$GenomeID, levels = tree.mags$tip.label)
new.count.heatmaps <- heatmap_plot(counts.long) + theme(axis.text.y = element_blank())
#Put them together with cowplot::plot_grid()
MAG.title <- plot_title("Liard River MAG Phylogeny and Abundance")
MAG.plots <- plot_grid(tree.plot,
new.count.heatmaps,
align = "h",
axis = "bt",
ncol = 2,
rel_widths = c(1, 1))
plot_grid(MAG.title,
MAG.plots,
ncol = 1,
rel_heights = c(0.1, 1))
```
> Question: What is the temperature range of your genome? In what pool is your genome most abundant?
# Section 5. Functional Annotation
A note on functional annotation:
To understand the ecological role(s) and functional potential of any discrete shotgun assembly (genome, metagenome, transcriptome), we must *annotate* their genes with confidence. To do this we can first predict open reading frames (ORFs) using Prodigal, then use KOFAMSCAN to apply profile hidden Markov Models to annotate KEGG orthologs and pathways.
These steps have already been run for you and the results tables for all genomes are concatenated. This organization works well for ggplot2, in which data is primarily supplied to in melted (long) format.
We also have a table of KEGG Orthologs (KOs) to help us organize data into different pathways (KO.pathways). These data were organized manually, which was a real pain in the butt.
Anyways, let's load our data:
```{r}
# Load KEGG pathways of interest
KO.pathways <- read.table("data/KO.pathways.tsv", sep = "\t", header = TRUE)
KO.pathways$Function <- factorize(KO.pathways$Function)
#This is important to keep the pathway data organized for later!
#Load MAG kofamscan annotations
KO.mag <- read.delim("data/KO.genomes.tsv", sep = "\t", header = TRUE)
head(KO.mag)
```
> Question: How were the pathways and genes annotated?
Select only genomes KOs in pathways of interest:
* N-cycling
* S-cycling
* CH4-cycling
* Aromatic degradation
* Photosynthesis
* Etc.
Look, we're using match again to ensure that the MAG and Pathway data are in the same order.
```{r}
KO.mag <- KO.mag[which(KO.mag$KO %in% KO.pathways$KO), ]
KO.mag$Function <- KO.pathways[match(KO.mag$KO, KO.pathways$KO), ]$Function
KO.mag$Pathway <- KO.pathways[match(KO.mag$KO, KO.pathways$KO), ]$Pathway
```
> Question: Can you subset using the tidyverse and the subset function?
Select pathways to plot:
```{r}
keep.pathways <- c("Nitrogen Metabolism", "Sulphur Metabolism", "Photosynthesis", "Hydrogen Metabolism", "Carbon Metabolism")
#Isolate selected pathways in data.frame
KO.mag <- KO.mag[which(KO.mag$Pathway %in% keep.pathways), ]
```
Order the genomes for plotting based on order of tips in tree using a factor:
```{r}
KO.mag$Genome <- factor(KO.mag$GenomeID, levels = as.character(tree.mags$tip.label))
```
Generate pathway colors and plot functional annotations. Plots use geom_tile() again but this time add color from the pallet by pathway function.
We also apply facetting to divide the individual functions.
```{r fig.height = 5, fig.align = "center"}
pathway.colors <- set_path_colors()
#Plot functional annotations with ggplot2
function.plot <- ggplot(KO.mag, aes(x = KO, y = Genome, fill = Function)) +
geom_tile(color = "black") + #geom_tile creates each square for an annotated gene
scale_fill_manual(values = pathway.colors) + #I use a named vector to apply fills to specific conditions
theme_bw() + # this is just making it pretty
theme(axis.title = element_blank(),
#axis.text.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 0, size = 6),
legend.position = "none",
strip.text.x = element_text(size = 8, angle = 90, hjust = 0, vjust = 0.6),
strip.background = element_blank(),
strip.placement = "outside") +
scale_x_discrete(position = "top") +
facet_grid(.~Function, scales="free_x", space="free_x") #facetting by Function
function.plot
```
> Advanced: plot the aromatic degradation pathway
> Hint: Find the label for the aromatic degradation genes in KO.mag$Pathway, then add label to the "keep.pathways" list.
# Section 6. Pathway Analysis
Optionally, we can use the _pathview_ package to create Pathway Analysis maps. We won't spend a ton of time on this, but it can be useful to manually inspect your pathway data.
We will look at the benzoate degradation map for the Chloroflexi MAG L.E.CH.5.
* Benzoate degradation is KEGG pathway 00362 (Find more here: https://www.genome.jp/kegg/pathway.html)
* Pathview creates XML for KEGG pathway annotation.
* It writes the final file to your working directory using the pathway # and suffix string.
* Warning: if XML data is missing, pathview will attempt to download it (ensure internet connection)
```{R eval=FALSE}
#Create labelled benzoate degradation pathway map for genome: CH.5
CH5 <- KO.mag[which(KO.mag$GenomeID == "L.E.CH.5"), ]
pathview(gene.data = as.character(CH5$KO),
pathway.id = "00362",
species = "ko",
gene.idtype = "KEGG",
kegg.native = TRUE,
out.suffix = "CH5")
```
![Roseiflexus sp. CH5 benzoate degradation](images/ko00362.CH5.png)
We can also supply quantitative data. Advanced uses are welcome to implement this on their own time.
# Section 7. Mapping Metatranscriptomic Data
Metatranscriptome analysis can be difficult to analyze effectively. In a traditional transcriptome study, we target cells generally in the same growth phase to obtain our desired signal. In an environmental sample, or even in an incubation, members of the microbiome may be in wildly different growth phases.
The following data comes from incubating Epsilon sediment at 45^o^C with the following substrates:
* 0.1% Kraft Lignin (EKL)
* 0.1% Milled Wood Lignin (EMWL)
* 0.1% DHP Lignin (DHP)
* 0.1% Vanillate (VAN)
* No exogenous carbon controls (NC)
To understand the broad changes in the Liard hot spring transcriptome during incubations with different lignin substrates, we will first analyze reads that have been mapped with Meiji to our refined genome bins from above, but also less-complete genomic and transcriptomic bins.
To avoid extraneous data parsing, let's simply upload our data file already in the "long" format, factorize the labels to maintain order on the y-axis, then plot as a heatmap to view whole-transcriptome abundance across samples:
```{r}
metatrans.long <- read.table("data/metatranscriptome.counts.long.tsv", sep = "\t")
metatrans.long$Label <- factorize(metatrans.long$Label)
head(metatrans.long)
```
Next we use the sample heatmap plot as above to see if we can find bins that are particularly active in our various lignin compound incubations:
```{r fig.height = 5, fig.align = "center"}
trans.heatmap <- ggplot(data = metatrans.long, mapping = aes(x = variable, y = Label, fill = value)) +
geom_tile(color = "grey50") +
scale_fill_gradient(low = "#FFFFFF", high = "#012345", na.value = "#FFFFFF") +
scale_y_discrete(drop=FALSE) +
theme(axis.text = element_text(size = 8, vjust = 0.4),
axis.text.x = element_blank(),
axis.title = element_blank(),
strip.background = element_blank()) +
facet_grid(.~Substrate, scales = "free_x")
trans.heatmap
```
# Section 8. Circular genome plots with gene expression data
One more interesting way we can plot the metatranscriptomics data is by looking at gene expression in a single genome bin.
We will start by looking at expression patterns of the Chloroflexi MAG L.E.CH.5 grown on Kraft Lignin.
First, prepare the data:
```{r}
#Load the coverage table generated with BBMap
CH5.EKL.COV <- read.table("data/LE.CH5.EKL.cov", sep = "\t", header = TRUE)
cmax <- max(CH5.EKL.COV$cov)