Skip to content

BAQLaVa Tutorial

jjensen44 edited this page Sep 28, 2023 · 28 revisions

BAQLaVa Tutorial

BAQLaVa, Bioinformatic Application of Quantification and Labeling of Viral Taxonomy, is a tool designed to perform viral community profiling on metagenomic, metatranscriptomic, or metavirome sequencing data, especially data from microbiome experiments. BAQLaVa can identify all viruses recognized by the International Committee on Taxonomy of Viruses (ICTV), as well as the many novel viruses as captured and released in the Gut Virome Database, RNA Viruses in Metatranscriptomes (RVMT) database, and Benler et al 2021.

NOTE: This tutorial corresponds to BAQLaVa v0.2


Table of contents

This tutorial is short, as BAQLaVa has minimal argument options at this time. Please let us know what features you would find most useful to add next!

Overview

BAQLaVa is designed with two workflow arms. A tiered reference-based search arm intakes short sequencing reads and searches them against a viral marker set in both nucleotide and protein space. Separately, short reads are assembled into contigs, and BAQLaVa uses deep learning to predict viral taxonomy for these contigs at Genus level taxonomy. The predictions from these two arms are reconciled into one output: a BAQLaVa viral profile.

baqlava v0 2 overviewfig

BAQLaVa v0.2 carries out tiered reference-based search arm only. An input FASTA or FASTQ file is first searched against the marker set in nucleotide space. Reads that do not align in nucleotide search are next searched against the marker set in protein space. The output from both steps are reconciled and produced in a single BAQLaVa viral profile.

Preparing samples for use with BAQLaVa

BAQLaVa simply requires a single input fasta or fastq file and a directory location to save outputs. If using paired sequencing samples, cat forward and reverse reads together into one input file for BAQLaVa to ensure a single output viral profile per sample.

Input reads should have already been QC'd with a tool such as KneadData. It is optional, though highly recommended, to additionally enrich the file for viral reads by mapping to and removing reads that align to a bacterial database, or through a similar method. This will ensure lower false positive predictions in tiered reference search, and produce better contigs for the deep learning arm.

To remove bacterial contamination, the sample can upstream of BAQLaVa be run through HUMAnN or MetaPhlan. Reads that never align to the bacterial marker set will be captured in a temporary file called <sample>_humann_temp/<sample>_bowtie2_unaligned.fa. Prepare this file for use with BAQLaVa by running the BAQLaVa accessory script found in the utility_scripts directory:

python remove_lengths_humann_bacterial_depletion.py <sample>_humann_temp/<sample>_bowtie2_unaligned.fa

The resulting file will be named <sample>_bowtie2_unaligned_lengthremoved.fa. This FATSA file is depleted of bacterial reads and compatible with input to BAQLaVa.

Running BAQLaVa

For the first BAQLaVa run (this will also test the installation) we can run:

 baqlava -i input/BAQLaVa.V0.2.demo.fq -o demo_output/

You should see the output:

 [0/2 -   0.00%] **Ready    ** Task 0: humann
 [0/2 -   0.00%] **Started  ** Task 0: humann
 [1/2 -  50.00%] **Completed** Task 0: humann
 [1/2 -  50.00%] **Ready    ** Task 2: python3
 [1/2 -  50.00%] **Started  ** Task 2: python3
 [2/2 - 100.00%] **Completed** Task 2: python3
 Run Finished

Let's look at what outputs we get:

 ls demo_output/

BAQLaVa is built with the functionality of HUMAnN, so we see HUMAnN outputs BAQLaVa.V0.2.demo_genefamilies.tsv, BAQLaVa.V0.2.demo_pathabundance.tsv, and BAQLaVa.V0.2.demo_pathcoverage.tsv. The BAQLaVa output we care about is BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv.

 anadama.log			       
 BAQLaVa.V0.2.demo_genefamilies.tsv  
 BAQLaVa.V0.2.demo_pathabundance.tsv
 BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv  
 BAQLaVa.V0.2.demo_humann_temp	   
 BAQLaVa.V0.2.demo_pathcoverage.tsv

Let's look at BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv. If you want to look at a tabular formatted view you can always pass these commands through column -t -s $'\t' first:

less -S demo_output/BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv

Some of the species names are quite long:

 Species                                                                                                                                                        
 EMESVIRUS_ZINDERI.VGB_34413                                                                                                                                    
 EMESVIRUS_ZINDERI.VGB_34413                                                                                                                                    
 PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618                                                                                                            
 PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618                                                                                                             
 PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618                                                                                                            
 SINSHEIMERVIRUS_PHIX174.VGB_23180                                                                                                                              
 SINSHEIMERVIRUS_PHIX174.VGB_23180                                                                                                                              
 TEQUATROVIRUS_LUTTER_TEQUATROVIRUS_SLUR03_TEQUATROVIRUS_PST_ESCHERICHIA_VIRUS_CF2_TEQUATROVIRUS_OE5505_TEQUATROVIRUS_PE37...
 TEQUATROVIRUS_LUTTER_TEQUATROVIRUS_SLUR03_TEQUATROVIRUS_PST_ESCHERICHIA_VIRUS_CF2_TEQUATROVIRUS_OE5505_TEQUATROVIRUS_PE37...
 TEQUATROVIRUS_LUTTER_TEQUATROVIRUS_SLUR03_TEQUATROVIRUS_PST_ESCHERICHIA_VIRUS_CF2_TEQUATROVIRUS_OE5505_TEQUATROVIRUS_PE37...
 TUNAVIRUS_SHFL1_TUNAVIRUS_SH6_TUNAVIRUS_ADB2_TUNAVIRUS_PSF2_TUNAVIRUS_SH2_TUNAVIRUS_JMPW2_TUNAVIRUS_IME18_TUNAVIRUS_ISF002...
 TUNAVIRUS_SHFL1_TUNAVIRUS_SH6_TUNAVIRUS_ADB2_TUNAVIRUS_PSF2_TUNAVIRUS_SH2_TUNAVIRUS_JMPW2_TUNAVIRUS_IME18_TUNAVIRUS_ISF002...
 TUNAVIRUS_SHFL1_TUNAVIRUS_SH6_TUNAVIRUS_ADB2_TUNAVIRUS_PSF2_TUNAVIRUS_SH2_TUNAVIRUS_JMPW2_TUNAVIRUS_IME18_TUNAVIRUS_ISF002...
 UNCLASSIFIED_SPECIES.VGB_39523                                                                                                                                 
 UNCLASSIFIED_SPECIES.VGB_39523

Let's look at all columns but only the first couple short rows to get a better look at the file type:

head -n 8 demo_output/BAQLaVa.V0.2.demo_BAQLaVa_profile.tsv | less -S
 Species                                              BAQLaVa.V0.2.demo_Abundance-RPKs  Database
 EMESVIRUS_ZINDERI.VGB_34413                          92.38451935083009                 TOTAL
 EMESVIRUS_ZINDERI.VGB_34413                          92.38451935083009                 nucleotide
 PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618  613.5383356949833                 TOTAL
 PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618  553.4027289001592                 nucleotide
 PUNAVIRUS_P1_PUNAVIRUS_SJ46_PUNAVIRUS_RCS47.VGB_618  60.135606794824035                translated
 SINSHEIMERVIRUS_PHIX174.VGB_23180                    78.82953243780814                 TOTAL
 SINSHEIMERVIRUS_PHIX174.VGB_23180                    78.82953243780814                 nucleotide

We can see that some of the species were observed through nucleotide search, some through translated search, and some through both.

Checkpoint Questions:

Why are some of the species names long? What does this mean?

What does it mean when we observe an unclassified species VGB?

If only known species were used to make the synthetic demo data, should we expect to see an unclassified species? What can we learn aboat BAQLaVa that we see one?


Module 1: Metagenomic Sample

Next, let's look at some real data. For the examples we will be looking it, we have data from the iHMP Project. This project gathered human microbiota samples to study a variety of conditions, but we will be focusing on the Inflammatory Bowel Diseases subset and using the fecal samples. We have samples from individuals with IBD and healthy controls, and for samples from all individuals, we have paired metagenomic and metatranscriptomic data prepared from the same origin sample. Looking first at a single sample - this input is the metagenomic sequencing of a human fecal sample, and the FASTA file has been enriched for non-bacterial reads (using upstream HUMAnN bacterial depletion). This sample was sequenced at a lower depth than iHMP samples on average (~3M reads), so we will get a result quite quickly with BAQLaVa. Let's get the viral profile with BAQLaVa:

 baqlava -i input/BAQLaVa.V0.2.tutorial.MGX.fa -o MGX_output/

When that has completed, let's look at the BAQLaVa output:

 less -S MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv

It's quite long - even with this undersequenced sample, we get a lot of viruses. How many unique viruses did we identify?

 cut -f1 MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | uniq | grep -v "Species" | wc -l

It looks like there were 52 unique entries in column 1, so we observed 52 Viral Genome Bins. From looking at the species profile with less, it looks like many are unclassified. What does it mean that we identified a Viral Genome Bin without a classified virus?

Let's look at the Viral Genome Bins we observed that contain an ICTV species:

 cut -f1 MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | uniq | grep "UNCLASSIFIED" -v | grep -v "Species" 

Which gives:

BEHUNAVIRUS_BH1_ASHDUOVIRUS_A2_COLUNAVIRUS_ILP1308_COLUNAVIRUS_CL2_COLUNAVIRUS_CL1_JUNAVIRUS_J1_PLEETREVIRUS_PLE3_JUNAVIRUS_PL1_PLEETREVIRUS_ILP84_LARMUNAVIRUS_LRM1_JUNAVIRUS_LJ.VGB_4696
CULOIVIRUS_AMERICANUS_COHCOVIRUS_SPLANCHNICUS_CULOIVIRUS_FAECALIS_COHCOVIRUS_HIBERNIAE_CANHAEVIRUS_HIBERNIAE_CANHAEVIRUS_FAECALIS_CULOIVIRUS_INTESTINALIS.VGB_566

We see that there are two identified VGBs that contain ICTV-recognized viruses in our MGX viral profile. Do these viral species seem reasonable to observe in the gut virome?

Next, let's look at the viral species we observe by the different components of reference-based search by running each of the commands below:

grep "nucleotide" MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | wc -l
grep "translated" MGX_output/BAQLaVa.V0.2.tutorial.MGX_BAQLaVa_profile.tsv | wc -l

We observed 13 Viral Genome Bins via nucleotide search, and 47 by translated search. Protein space is less specific than nucleotide space, so we were able to identify more species that were present in the sample by utilizing translated search in addition to nucleotide search.

Checkpoint questions:

Why do we identify so many Viral Genome Bins of unclassified viral species?

Do the classified viral species we observed seem reasonable to be present in a gut virome?

Why do we observe more VGBs with translated search than protein search?


Module 2: Metatranscriptomic Sample

As we learned above, iHMP project samples had paired metagenomic and metatranscriptomic sequencing data. Let's look at what the metatranscriptomic viral sample looks like when profiled with BAQLaVa:

 baqlava -i input/BAQLaVa.V0.2.tutorial.MTX.fa -o MTX_output/

How many species do we see in the paired metatranscriptomic data?

 cut -f1 MTX_output/BAQLaVa.V0.2.tutorial.MTX_BAQLaVa_profile.tsv | uniq | wc -l     

We observed 25 unique species. How many of these are novel unclassified Viral Genome Bins?

 cut -f1 MTX_output/BAQLaVa.V0.2.tutorial.MTX_BAQLaVa_profile.tsv | uniq | grep -v "Species" | grep "UNCLASSIFIED" | wc -l

21 observed viruses were novel. Let's look at the Viral Genome Bins we observed that contain an ICTV species:

 cut -f1 MTX_output/BAQLaVa.V0.2.tutorial.MTX_BAQLaVa_profile.tsv | uniq | grep "UNCLASSIFIED" -v | grep -v "Species" 

Which gives:

BEHUNAVIRUS_BH1_ASHDUOVIRUS_A2_COLUNAVIRUS_ILP1308_COLUNAVIRUS_CL2_COLUNAVIRUS_CL1_JUNAVIRUS_J1_PLEETREVIRUS_PLE3_JUNAVIRUS_PL1_PLEETREVIRUS_ILP84_LARMUNAVIRUS_LRM1_JUNAVIRUS_LJ.VGB_4696
CANDIDA_ALBICANS_TCA2_VIRUS.VGB_66696
CULOIVIRUS_AMERICANUS_COHCOVIRUS_SPLANCHNICUS_CULOIVIRUS_FAECALIS_COHCOVIRUS_HIBERNIAE_CANHAEVIRUS_HIBERNIAE_CANHAEVIRUS_FAECALIS_CULOIVIRUS_INTESTINALIS.VGB_566
YONGLOOLINVIRUS_CDMH1_YONGLOOLINVIRUS_C2_YONGLOOLINVIRUS_MMP03_YONGLOOLINVIRUS_MMP01.VGB_19001

We observed VGB_4696 and VGB_566 in both the MGX and MTX data for this sample. The species contained in these VGBs are dsDNA phages - what does it indicate about these phages that we observe them in both the MGX & MTX data?

We did not observe VGB_66696 or VGB_19001 in the MGX viral profile. What are two potential explanations for observing a virus in MTX data but not in MGX data?

VGB_66696 is a ssRNA virus and VGB_19001 is a dsDNA virus - are the two potential explanations above sufficient to explain both of their observations in the MTX sample?

Checkpoint questions:

When a virus with a DNA backbone is observed in metatranscriptomic data, what can we learn about it's role in the microbial community at the time of sequencing?

What are two reasons we might observe a virus in paired MTX data but not in MGX data? Do these explanations encompass the different nucleic acid viral types?


Module 3: Community Analysis

Today we only ran samples from one individual, but if we had time to run more FASTA files, we could get a table with multiple viral profiles. A table like this has samples across the columns and the species identified in rows. Luckily, we ran 10 metagenomic samples from the iHMP Project and made a table we can work with. Let's take a look at it now:

less -S prebaked_outputs/BAQLaVa.HMP2.merged.3.txt

As always, that Species line is long. Let's look just at the samples:

cut -f2-12 prebaked_outputs/BAQLaVa.HMP2.merged.3.txt | less -S

This file has the 500 most prevalent VGBs observed across ten random metagenomic samples from the HMP2 project. Let's visualize these samples with a heatmap:

Go to https://maayanlab.cloud/clustergrammer/ and upload the output/BAQLaVa.HMP2.merged.3.txt file.

Chose 'Correlation' and 'Average', then submit.

Move the Opacity slider all the way to the right.

Play around with the sum sliders to look at the top 100, 50, and 25 viral species by summed RPK.

Checkpoint questions:

The HMP2 project has samples from individuals with IBD and healthy controls. Based on the heatmap, do you observe clustering that indicates a viral profile can differentiate groups of individuals?

Are there viral species that are present across all virome samples?

Refer to the table below. Do you observe any trends in the heatmap by IBD diagnosis or dysbiosis status?

Sample       IBD Status    Dysbiosis Status
PSM7J17V     nonIBD        TRUE
HSM7J4LD     nonIBD        TRUE
HSM5MD43     UC            FALSE
HSM7J4IS     CD            FALSE
HSM5MD7Q     CD            TRUE
HSM67VFD     CD            TRUE
MSMAPC6A     CD            FALSE
CSM79HI7     CD            TRUE
MSM6J2MH     CD            FALSE
CSM79HH4     CD            FALSE

Based on this small dataset, what properties do viromes look to share with bacterial microbiomes? Which appear to be different?

Clone this wiki locally