-
Notifications
You must be signed in to change notification settings - Fork 81
BAQLaVa
BAQLaVa, Bioinformatic Application of Quantification and Labeling of Viral Taxonomy, is a tool designed to perform viral community profiling on metagenomic, metatranscriptomic, or metavirome sequencing data, especially data from complex microbial communities. BAQLaVa can identify all viruses recognized by the International Committee on Taxonomy of Viruses (ICTV), as well as the many novel viruses as captured and released in the Gut Virome Database, RNA Viruses in Metatranscriptomes (RVMT) database, Viral Sequence Clusters and Benler et al 2021.
- Overview
- Running BAQLaVa
- Module 1: Metagenomic Sample
- Module 2: Metatranscriptomic Sample
- Module 3: Community Analysis
BAQLaVa v0.5 intakes short sequencing reads and runs parallel reference-based searches against a set of Viral Genome Bins (VGBs). Each search is done to a set of material specific to each VGB - genetic marker regions in nucleotide search and VGB-specific ORFs in translated search. The predictions from these two searches are reconciled into one output: a BAQLaVa viral profile.
BAQLaVa simply requires a single input fasta or fastq file and a directory location to save outputs. Input reads should have already been QC'd with a tool such as KneadData. If using paired sequencing samples, reads should be in s single file. This can be done by using cat to join forward and reverse reads together into one input file. This ensures a singular output of viral profiling per sample.
The locations of the data mentioned below are specific to VMs hosted by the Huttenhower Lab, but if you are navigating through this tutorial independently, all of the input data products are distributed with BAQLaVa here, or the inputs and their 'prebaked' outputs can be found and downloaded here. You can also download the items directly:
Demo Input File: https://github.com/biobakery/baqlava/blob/master/examples/BAQLaVa_demo.fq
Nucleotide Demo DB: https://github.com/biobakery/baqlava/tree/master/examples/BAQLaVa.V0.5.nucleotide
Protein Demo DB: https://github.com/biobakery/baqlava/tree/master/examples/BAQLaVa.V0.5.protein
HMP2 MGX sample: https://github.com/biobakery/baqlava/blob/master/examples/HMP2_MGX.fa
HMP2 MTX sample: https://github.com/biobakery/baqlava/blob/master/examples/HMP2_MTX.fa
Merged multi-sample HMP2 table: https://github.com/biobakery/baqlava/blob/master/examples/HMP2_merged.tsv
For the first BAQLaVa run (this will also test the installation) we can run:
cd ~/Tutorials/baqlava
baqlava -i input/demo.fq -o output/ --bypass-bacterial-depletion --nucdb input/BAQLaVa.V0.5.nucleotide --protdb input/BAQLaVa.V0.5.protein
Normally, we do not need to specify the nucleotide or translated databases, but we are using smaller databases here so we do.
You should see the following output (only Completed step is shown below but you will see Ready, Started, and Completed):
[ 1/10 - 10.00%] **Completed** Task 0: Running HUMAnN to depete bacterial reads from file
[ 2/10 - 20.00%] **Completed** Task 2: Formatting bacterially depleted FASTA file
[ 3/10 - 30.00%] **Completed** Task 3: Running BAQLaVa Nucleotide Search
[ 4/10 - 40.00%] **Completed** Task 7: Running BAQLaVa Translated Search
[ 5/10 - 50.00%] **Completed** Task 4: mv
[ 6/10 - 60.00%] **Completed** Task 5: Calculating Marker Coverage & Abundance
[ 7/10 - 70.00%] **Completed** Task 6: mv
[ 8/10 - 80.00%] **Completed** Task 8: Making BAQLaVa Viral Profile
[ 9/10 - 90.00%] **Completed** Task 9: rm
[10/10 - 100.00%] **Completed** Task 10: rm
Run Finished
Let's look at what outputs we get:
ls output/
BAQLaVa produces four output files:*
demo_BAQLaVa_profile.txt
demo_bacterial_depeled.fq
demo_tempfile_markers.txt
demo_tempfile_proteins.txt
Let's look at BAQLaVa_demo_BAQLaVa_profile.tsv. If you want to look at a tabular formatted view you can always pass these commands through column -t -s $'\t' first:
less -S output/demo_BAQLaVa_profile.txt
column -t -s $'\t' output/demo_BAQLaVa_profile.txt | less -S
Some of the lines are quite long:
BAQLaVa VGB demo Reference Species Taxonomy Other ICTV Genomes in VGB
VGB_1593 43.5046709396 VGB_1593
r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclassified;g__crassvi
VGB_1593|nucleotide 13.184010221779396 VGB_1593
r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclass
VGB_1593|translated 43.5046709396 VGB_1593
r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclassified;g_
VGB_49585 18.416016893941936 VGB_49585
r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__caudoviricetes_unclassified;f__caudoviricete
VGB_49585|nucleotide 13.30486294831733 VGB_49585
r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__caudoviricetes_unclassified;f__caudo
VGB_49585|translated 18.416016893941936 VGB_49585
r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__caudoviricetes_unclassified;f__caudo
VGB_6438 39.447859546400004 VGB_6438
r__viruses_unclassified;k__viruses_unclassified;p__viruses_unclassified;c__viruses_unclassified;o__viruses_unclassi
VGB_6438|translated 39.447859546400004 VGB_6438
r__viruses_unclassified;k__viruses_unclassified;p__viruses_unclassified;c__viruses_unclassified;o__viruses_
Let's look at all columns but only the first couple short rows to get a better look at the file type:
head -n 4 output/demo_BAQLaVa_profile.txt | less -S
column -t -s $'\t' output/demo_BAQLaVa_profile.txt | head -n 4| less -S
BAQLaVa VGB demo Reference Species Taxonomy Other ICTV Genomes in VGB
VGB_1593 43.5046709396 VGB_1593 r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclassified;g__crassvirales_unclassified
VGB_1593|nucleotide 13.1840102217 VGB_1593 r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclassified;g__crassvirales_unclassified
VGB_1593|translated 43.5046709396 VGB_1593 r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclassified;g__crassvirales_unclassified
We see that the first VGB reported, VGB_1593, was observed in both nucleotide search and translated search, with different abundances detected at each point. The first line represents the total abundance detected, which is the maximum across the two individual searches. The following lines represent the breakdown of the abundance detected in individual search, when it is found.
Let's just look at the first column of VGB abundances (total & stratified):
cut -f1 output/demo_BAQLaVa_profile.txt
The following VGBs are reported:
BAQLaVa VGB
VGB_1593
VGB_1593|nucleotide
VGB_1593|translated
VGB_49585
VGB_49585|nucleotide
VGB_49585|translated
VGB_6438
VGB_6438|translated
Checkpoint Questions:
How many VGBs were found?
Were all VGBs found in both nucleotide & translated search, or were some found only in one of the searches?
Now let's look at the taxonomy for the VGBs we found. We will just look at the total abundance lines to minimize the text we are looking at:
grep -v "|" output/demo_BAQLaVa_profile.txt | column -t -s $'\t' | less -S
We see a full taxonomic lineage for each VGB:
BAQLaVa VGB demo Reference Species Taxonomy Other ICTV Genomes in VGB
VGB_1593 43.5046709396 VGB_1593 r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__crassvirales_unclassified;g__crassvirales_unclassified
VGB_49585 18.416016893941936 VGB_49585 r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__caudoviricetes_unclassified;f__caudoviricetes_unclassified;g__caudoviricetes_unclassified
VGB_6438 39.447859546400004 VGB_6438 r__viruses_unclassified;k__viruses_unclassified;p__viruses_unclassified;c__viruses_unclassified;o__viruses_unclassified;f__viruses_unclassified;g__viruses_unclassified
Checkpoint Questions:
What is the lowest taxonomic rank known for each VGB?
What does it mean when we observe an unclassified VGBs?
Finally, let's talk about the two remaining columns, Reference Species and Other ICTV Genomes in VGB. ICTV is the International Committee on the Taxonomy of Viruses, a group which officially recognizes and classifies viruses. This sample was made from viral Metagenome Assembled Genome (MAGs), which are not ICTV species. However, when BAQLaVa finds an ICTV species, that will be reported in these two columns. A singular species is named in Reference Species column, and any additional ICTV species that also cluster into the same VGB will be named in the Other ICTV Genomes in VGB column.
Next, let's look at some real data. For the examples we will be looking it, we have data from the iHMP Project. This project gathered human microbiota samples to study a variety of conditions, but we will be focusing on the Inflammatory Bowel Diseases subset and using the fecal samples. We have samples from individuals with IBD and healthy controls, and for samples from all individuals, we have paired metagenomic and metatranscriptomic data prepared from the same origin sample. Looking first at a single sample - this input is a subsampled metagenomic sequencing sample from a human fecal sample, and the FASTA file has already been enriched for non-bacterial reads to speed up the profiling for this tutorial.
We can use the --bypass-bacterial-depletion flag to skip the first step of BAQLaVa which removes the bacterial reads before proceeding to viral profiling. The bacterial depletion step is aimed at reducing false positive predictions. Today we are skipping it for speed, but other reasons you may want to skip it are samples with particularly low abundance or metatranscriptomic samples. (Samples running through bacterial depletion but finding no bacterial reads will automatically fail but can be rescued with --bypass-bacterial-depletion).
We can get the viral profile with BAQLaVa:
baqlava -i input/HMP2_MGX.fa -o MGX_output/ --bypass-bacterial-depletion
Since that takes a while to run, let's look at the BAQLaVa output that has been pre-computed:
column -t -s $'\t' prebaked_outputs/MGX_output/HMP2_MGX_BAQLaVa_profile.txt | less -S
It's quite long - even with this small sample, we get a lot of viruses. How many unique viruses did we identify? (The code below will show us all VGBs, we can count these by adding on wc -l)
cut -f1 prebaked_outputs/MGX_output/HMP2_MGX_BAQLaVa_profile.txt | grep -v "|"
It looks like there were 20 VGBs found in column 1, so we observed 20 Viral Genome Bins. What type of viruses do we observe reported out by taxonomic lineage?
Let's look at the Viral Genome Bins we observed that contain an ICTV species. The code below is selecting the total abundance line from column 1, and then excluding VGBs from column 3 that do not have an ICTV species name:
grep -v "|" prebaked_outputs/MGX_output/HMP2_MGX_BAQLaVa_profile.txt | cut -f3-4 | grep -v "VGB"
Which gives:
Reference Species Taxonomy
Canhaevirus hiberniae r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__suoliviridae;g__suoliviridae_unclassified
Buchavirus coli r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__suoliviridae;g__buchavirus
Blohavirus americanus r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__suoliviridae;g__suoliviridae_unclassified
We see that all three of the ICTV-recognized VGBs identified are CrAssphage (Order Crassvirales). Do these viral species seem reasonable to observe in the gut virome?
Next, let's look at the viral species we observe by the different components of reference-based search by running each of the commands below:
grep "nucleotide" prebaked_outputs/MGX_output/HMP2_MGX_BAQLaVa_profile.txt | wc -l
grep "translated" prebaked_outputs/MGX_output/HMP2_MGX_BAQLaVa_profile.txt | wc -l
We observed 10 Viral Genome Bins via nucleotide search, and 14 by translated search. Protein space is less specific than nucleotide space, so we were able to identify more species that were present in the sample by utilizing translated search in addition to nucleotide search.
Checkpoint questions:
How many samples were identified in both the nucleotide and translated search steps? Does this seem like the expected behavior?
Why do we identify so many Viral Genome Bins of unclassified viral species?
Do the classified viral species we observed seem reasonable to be present in a gut virome?
Why do we observe more VGBs with translated search than nucleotide search?
As we learned above, iHMP project samples had paired metagenomic and metatranscriptomic sequencing data. Let's look at what a metatranscriptomic viral sample looks like when profiled with BAQLaVa. Once again, we have subsampled this file and taken care of bacterial read depletion already (when working with paired data, we suggest using the bacteria identified in the MGX sample to deplete bacterial reads from the MTX sample.)
baqlava -i input/HMP2_MTX.fa -o MTX_output/ --bypass-bacterial-depletion
How many species do we see in the paired metatranscriptomic data?
grep -v "|" prebaked_outputs/MTX_output/HMP2_MTX_BAQLaVa_profile.txt | less -S
We observed 3 VGBs. How many of these are ICTV classified and how many contain novel viral genomes? What taxonomy do we see?
grep -v "|" prebaked_outputs/MTX_output/HMP2_MTX_BAQLaVa_profile.txt | cut -f3-4 | less -S
Reference Species Taxonomy
VGB_1580 r__monodnaviria;k__sangervirae;p__phixviricota;c__malgrandaviricetes;o__petitvirales;f__microviridae;g__microviridae_unclassified
VGB_2374 r__riboviria;k__orthornavirae;p__pisuviricota;c__duplopiviricetes;o__durnavirales;f__picobirnaviridae;g__picobirnaviridae_unclassified
Blohavirus americanus r__duplodnaviria;k__heunggongvirae;p__uroviricota;c__caudoviricetes;o__crassvirales;f__suoliviridae;g__suoliviridae_unclassified
We observed Blohavirus americanus in both the MGX and MTX data for this sample - this species is a dsDNA phage. What does it indicate about these phages that we observe them in both the MGX & MTX data?
We did not observe VGB_2374 or VGB_1580 in the MGX viral profile. What are two potential explanations for observing a virus in MTX data but not in MGX data?
VGB_1580 is a ssDNA virus and VGB_2374 is a dsRNA virus - Are the two potential explanations above sufficient to explain both of their observations in the MTX sample?
Checkpoint questions:
When a virus with a DNA backbone is observed in metatranscriptomic data, what can we learn about it's role in the microbial community at the time of sequencing?
What are two reasons we might observe a virus in paired MTX data but not in MGX data? Do these explanations encompass the different nucleic acid types of viruses we may observe?
Today we only ran samples from one individual, but if we had time to run more FASTA files, we could get a table with multiple viral profiles. A table like this has samples across the columns and the species identified in rows. Luckily, we ran 10 metagenomic samples from the iHMP Project and made a table we can work with. Let's take a look at it now:
column -t -s $'\t' prebaked_outputs/HMP2_merged.tsv | less -S
Here, we only see a single VGB abundance, not multiple lines for stratified abundances. To merge individual viral profiles, we just used the total viral abundance identified for each VGB.
This file has the 100 most prevalent VGBs observed across the ten random metagenomic samples. Let's visualize these samples with a heatmap:
Go to https://maayanlab.cloud/clustergrammer/ and upload the HMP2_merged.tsv file.
Chose 'Correlation' and 'Average', then submit.
Move the Opacity slider all the way to the right.
Play around with the sum sliders to look at the top 100, 50, and 25 viral species by summed RPK.
Checkpoint questions:
The HMP2 project has samples from individuals with IBD and healthy controls. Based on the heatmap, do you observe clustering that indicates a viral profile can differentiate groups of individuals?
Are there viral species that are present across all virome samples?
Refer to the table below. Do you observe any trends in the heatmap by IBD diagnosis or dysbiosis status?
Sample IBD Status Dysbiosis Status
HSM5MD4O CD FALSE
CSM9X22U UC FALSE
MSM79HBB nonIBD FALSE
HSMA33RT UC FALSE
CSM5MCYW CD TRUE
MSM9VZOQ nonIBD TRUE
CSM79HH4 CD FALSE
MSM79HD8_P UC FALSE
MSM5LLFO_P UC FALSE
MSM6J2JF_P nonIBD FALSE
Based on this small dataset, what properties do viromes look to share with bacterial microbiomes? Which appear to be different?
- HUMAnN 2.0
- HUMAnN 3.0
- HUMAnN 4.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- BAQLaVa
- Assembly workflow
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MaAsLin 3.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- MTX model 3
- PARATHAA