-
Notifications
You must be signed in to change notification settings - Fork 72
BAQLaVa Tutorial
BAQLaVa, Bioinformatic Application of Quantification and Labeling of Viral Taxonomy, is a tool designed to perform viral community profiling on metagenomic, metatranscriptomic, or metavirome sequencing data, especially data from microbiome experiments. BAQLaVa can identify all viruses recognized by the International Committee on Taxonomy of Viruses (ICTV), as well as the many novel viruses as captured and released in the Gut Virome Database, RNA Viruses in Metatranscriptomes (RVMT) database, and Benler et al 2021.
- Overview
- Running BAQLaVa
- Module 1: Metagenomic Sample
- Module 2: Metatranscriptomic Sample
- Module 3: Community Analysis
This tutorial is short, as BAQLaVa has minimal argument options at this time. Please let us know what features you would find most useful to add next!
BAQLaVa is designed with two workflow arms. A tiered reference-based search arm intakes short sequencing reads and searches them against a viral marker set in both nucleotide and protein space. Separately, short reads are assembled into contigs, and BAQLaVa uses deep learning to predict viral taxonomy for these contigs at Genus level taxonomy. The predictions from these two arms are reconciled into one output: a BAQLaVa viral profile.
BAQLaVa v0.2 carries out tiered reference-based search arm only. An input FASTA or FASTQ file is first searched against the marker set in nucleotide space. Reads that do not align in nucleotide search are next searched against the marker set in protein space. The output from both steps are reconciled and produced in a single BAQLaVa viral profile.
BAQLaVa simply requires a single input fasta or fastq file and a directory location to save outputs. If using paired sequencing samples, cat forward and reverse reads together into one input file for BAQLaVa to ensure a single output viral profile per sample.
Input reads should have already been QC'd with a tool such as KneadData. It is optional, though highly recommended, to additionally enrich the file for viral reads by mapping to and removing reads that align to a bacterial database, or through a similar method. This will ensure lower false positive predictions in tiered reference search, and produce better contigs for the deep learning arm.
To remove bacterial contamination, the sample can upstream of BAQLaVa be run through HUMAnN or MetaPhlan. Reads that never align to the bacterial marker set will be captured in a temporary file called <sample>_humann_temp/<sample>_bowtie2_unaligned.fa
. Prepare this file for use with BAQLaVa by running the BAQLaVa accessory script found in the utility_scripts
directory:
python remove_lengths_humann_bacterial_depletion.py <sample>_humann_temp/<sample>_bowtie2_unaligned.fa
The resulting file will be named <sample>_bowtie2_unaligned_lengthremoved.fa
. This FATSA file is depleted of bacterial reads and compatible with input to BAQLaVa.
For the first BAQLaVa run (this will also test the installation) we can run:
baqlava -i input/BAQLaVa.V0.2.demo.fq -o demo_output/
You should see the output:
[0/2 - 0.00%] **Ready ** Task 0: humann
[0/2 - 0.00%] **Started ** Task 0: humann
[1/2 - 50.00%] **Completed** Task 0: humann
[1/2 - 50.00%] **Ready ** Task 2: python3
[1/2 - 50.00%] **Started ** Task 2: python3
[2/2 - 100.00%] **Completed** Task 2: python3
Run Finished
Let's look at what outputs we get:
ls demo_output/
BAQLaVa is built with the functionality of HUMAnN, so we see HUMAnN outputs BAQLaVa.V0.2.demo_genefamilies.tsv, BAQLaVa.V0.2.demo_pathabundance.tsv, and BAQLaVa.V0.2.demo_pathcoverage.tsv. The BAQLaVa output we care about is BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv.
anadama.log
BAQLaVa.V0.2.demo_genefamilies.tsv
BAQLaVa.V0.2.demo_pathabundance.tsv
BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv
BAQLaVa.V0.2.demo_humann_temp
BAQLaVa.V0.2.demo_pathcoverage.tsv
Let's look at BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv. If you want to look at a tabular formatted view you can always pass these commands through column -t -s $'\t'
first:
less -S demo_output/BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv
Some of the species names are quite long:
Species
EMESVIRUS_ZINDERI.VGB_34413
EMESVIRUS_ZINDERI.VGB_34413
PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618
PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618
PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618
SINSHEIMERVIRUS_PHIX174.VGB_23180
SINSHEIMERVIRUS_PHIX174.VGB_23180
TEQUATROVIRUS_LUTTER_TEQUATROVIRUS_SLUR03_TEQUATROVIRUS_PST_ESCHERICHIA_VIRUS_CF2_TEQUATROVIRUS_OE5505_TEQUATROVIRUS_PE37...
TEQUATROVIRUS_LUTTER_TEQUATROVIRUS_SLUR03_TEQUATROVIRUS_PST_ESCHERICHIA_VIRUS_CF2_TEQUATROVIRUS_OE5505_TEQUATROVIRUS_PE37...
TEQUATROVIRUS_LUTTER_TEQUATROVIRUS_SLUR03_TEQUATROVIRUS_PST_ESCHERICHIA_VIRUS_CF2_TEQUATROVIRUS_OE5505_TEQUATROVIRUS_PE37...
TUNAVIRUS_SHFL1_TUNAVIRUS_SH6_TUNAVIRUS_ADB2_TUNAVIRUS_PSF2_TUNAVIRUS_SH2_TUNAVIRUS_JMPW2_TUNAVIRUS_IME18_TUNAVIRUS_ISF002...
TUNAVIRUS_SHFL1_TUNAVIRUS_SH6_TUNAVIRUS_ADB2_TUNAVIRUS_PSF2_TUNAVIRUS_SH2_TUNAVIRUS_JMPW2_TUNAVIRUS_IME18_TUNAVIRUS_ISF002...
TUNAVIRUS_SHFL1_TUNAVIRUS_SH6_TUNAVIRUS_ADB2_TUNAVIRUS_PSF2_TUNAVIRUS_SH2_TUNAVIRUS_JMPW2_TUNAVIRUS_IME18_TUNAVIRUS_ISF002...
UNCLASSIFIED_SPECIES.VGB_39523
UNCLASSIFIED_SPECIES.VGB_39523
Let's look at all columns but only the first couple short rows to get a better look at the file type:
head -n 8 demo_output/BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv | less -S
Species BAQLaVa.V0.2.demo_Abundance-RPKs Database
EMESVIRUS_ZINDERI.VGB_34413 92.38451935083009 TOTAL
EMESVIRUS_ZINDERI.VGB_34413 92.38451935083009 nucleotide
PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618 613.5383356949833 TOTAL
PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618 553.4027289001592 nucleotide
PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618 60.135606794824035 translated
SINSHEIMERVIRUS_PHIX174.VGB_23180 78.82953243780814 TOTAL
SINSHEIMERVIRUS_PHIX174.VGB_23180 78.82953243780814 nucleotide
We can see that some of the species were observed through nucleotide search, some through translated search, and some through both.
Checkpoint Questions:
Why are some of the species names long? What does this mean?
What does it mean when we observe an unclassified species VGB?
If only known species were used to make the synthetic demo data, should we expect to see an unclassified species? What can we learn aboat BAQLaVa that we see one?
Next, let's look at some real data. For the examples we will be looking it, we have data from the iHMP Project. This project gathered human microbiota samples to study a variety of conditions, but we will be focusing on the Inflammatory Bowel Diseases subset and using the fecal samples. We have samples from individuals with IBD and healthy controls, and for samples from all individuals, we have paired metagenomic and metatranscriptomic data prepared from the same origin sample. Looking first at a single sample - this input is the metagenomic sequencing of a human fecal sample, and the FASTA file has been enriched for non-bacterial reads (using upstream HUMAnN bacterial depletion). This sample was sequenced at a lower depth than iHMP samples on average (~3M reads), so we will get a result quite quickly with BAQLaVa. Let's get the viral profile with BAQLaVa:
baqlava -i input/BAQLaVa.V0.2.tutorial.MGX.fa -o MGX_output/
When that has completed, let's look at the BAQLaVa output:
less -S MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv
It's quite long - even with this undersequenced sample, we get a lot of viruses. How many unique viruses did we identify?
cut -f1 MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | uniq | grep -v "Species" | wc -l
It looks like there were 52 unique entries in column 1, so we observed 52 Viral Genome Bins. From looking at the species profile with less
, it looks like many are unclassified. What does it mean that we identified a Viral Genome Bin without a classified virus?
Let's look at the Viral Genome Bins we observed that contain an ICTV species:
cut -f1 MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | uniq | grep "UNCLASSIFIED" -v | grep -v "Species"
Which gives:
BEHUNAVIRUS_BH1_ASHDUOVIRUS_A2_COLUNAVIRUS_ILP1308_COLUNAVIRUS_CL2_COLUNAVIRUS_CL1_JUNAVIRUS_J1_PLEETREVIRUS_PLE3_JUNAVIRUS_PL1_PLEETREVIRUS_ILP84_LARMUNAVIRUS_LRM1_JUNAVIRUS_LJ.VGB_4696
CULOIVIRUS_AMERICANUS_COHCOVIRUS_SPLANCHNICUS_CULOIVIRUS_FAECALIS_COHCOVIRUS_HIBERNIAE_CANHAEVIRUS_HIBERNIAE_CANHAEVIRUS_FAECALIS_CULOIVIRUS_INTESTINALIS.VGB_566
We see that there are two identified VGBs that contain ICTV-recognized viruses in our MGX viral profile. Do these viral species seem reasonable to observe in the gut virome?
Next, let's look at the viral species we observe by the different components of reference-based search by running each of the commands below:
grep "nucleotide" MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | wc -l
grep "translated" MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | wc -l
We observed 13 Viral Genome Bins via nucleotide search, and 47 by translated search. Protein space is less specific than nucleotide space, so we were able to identify more species that were present in the sample by utilizing translated search in addition to nucleotide search.
Checkpoint questions:
Why do we identify so many Viral Genome Bins of unclassified viral species?
Do the classified viral species we observed seem reasonable to be present in a gut virome?
Why do we observe more VGBs with translated search than protein search?
As we learned above, iHMP project samples had paired metagenomic and metatranscriptomic sequencing data. Let's look at what the metatranscriptomic viral sample looks like when profiled with BAQLaVa:
baqlava -i input/BAQLaVa.V0.2.tutorial.MTX.fa -o MTX_output/
How many species do we see in the paired metatranscriptomic data?
cut -f1 MTX_output/BAQLaVa.V0.2.tutorial.MTX_BAQLaVa_profile.tsv | uniq | wc -l
We observed 25 unique species. How many of these are novel unclassified Viral Genome Bins?
cut -f1 MTX_output/BAQLaVa.V0.2.tutorial.MTX_BAQLaVa_profile.tsv | uniq | grep -v "Species" | grep "UNCLASSIFIED" | wc -l
21 observed viruses were novel. Let's look at the Viral Genome Bins we observed that contain an ICTV species:
cut -f1 MTX_output/BAQLaVa.V0.2.tutorial.MTX_BAQLaVa_profile.tsv | uniq | grep "UNCLASSIFIED" -v | grep -v "Species"
Which gives:
BEHUNAVIRUS_BH1_ASHDUOVIRUS_A2_COLUNAVIRUS_ILP1308_COLUNAVIRUS_CL2_COLUNAVIRUS_CL1_JUNAVIRUS_J1_PLEETREVIRUS_PLE3_JUNAVIRUS_PL1_PLEETREVIRUS_ILP84_LARMUNAVIRUS_LRM1_JUNAVIRUS_LJ.VGB_4696
CANDIDA_ALBICANS_TCA2_VIRUS.VGB_66696
CULOIVIRUS_AMERICANUS_COHCOVIRUS_SPLANCHNICUS_CULOIVIRUS_FAECALIS_COHCOVIRUS_HIBERNIAE_CANHAEVIRUS_HIBERNIAE_CANHAEVIRUS_FAECALIS_CULOIVIRUS_INTESTINALIS.VGB_566
YONGLOOLINVIRUS_CDMH1_YONGLOOLINVIRUS_C2_YONGLOOLINVIRUS_MMP03_YONGLOOLINVIRUS_MMP01.VGB_19001
We observed VGB_4696 and VGB_566 in both the MGX and MTX data for this sample. The species contained in these VGBs are dsDNA phages - what does it indicate about these phages that we observe them in both the MGX & MTX data?
We did not observe VGB_66696 or VGB_19001 in the MGX viral profile. What are two potential explanations for observing a virus in MTX data but not in MGX data?
VGB_66696 is a ssRNA virus and VGB_19001 is a dsDNA virus - are the two potential explanations above sufficient to explain both of their observations in the MTX sample?
Checkpoint questions:
When a virus with a DNA backbone is observed in metatranscriptomic data, what can we learn about it's role in the microbial community at the time of sequencing?
What are two reasons we might observe a virus in paired MTX data but not in MGX data? Do these explanations encompass the different nucleic acid viral types?
Today we only ran samples from one individual, but if we had time to run more FASTA files, we could get a table with multiple viral profiles. A table like this has samples across the columns and the species identified in rows. Luckily, we ran 10 metagenomic samples from the iHMP Project and made a table we can work with. Let's take a look at it now:
less -S prebaked_outputs/BAQLaVa.HMP2.merged.3.txt
As always, that Species line is long. Let's look just at the samples:
cut -f2-12 prebaked_outputs/BAQLaVa.HMP2.merged.3.txt | less -S
This file has the 500 most prevalent VGBs observed across ten random metagenomic samples from the HMP2 project. Let's visualize these samples with a heatmap:
Go to https://maayanlab.cloud/clustergrammer/ and upload the output/BAQLaVa.HMP2.merged.3.txt file.
Chose 'Correlation' and 'Average', then submit.
Move the Opacity slider all the way to the right.
Play around with the sum sliders to look at the top 100, 50, and 25 viral species by summed RPK.
Checkpoint questions:
The HMP2 project has samples from individuals with IBD and healthy controls. Based on the heatmap, do you observe clustering that indicates a viral profile can differentiate groups of individuals?
Are there viral species that are present across all virome samples?
Refer to the table below. Do you observe any trends in the heatmap by IBD diagnosis or dysbiosis status?
Sample IBD Status Dysbiosis Status
PSM7J17V nonIBD TRUE
HSM7J4LD nonIBD TRUE
HSM5MD43 UC FALSE
HSM7J4IS CD FALSE
HSM5MD7Q CD TRUE
HSM67VFD CD TRUE
MSMAPC6A CD FALSE
CSM79HI7 CD TRUE
MSM6J2MH CD FALSE
CSM79HH4 CD FALSE
Based on this small dataset, what properties do viromes look to share with bacterial microbiomes? Which appear to be different?
- HUMAnN 2.0
- HUMAnN 3.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- PARATHAA