Skip to content

biobakery_workflows

Thomas Kuntz edited this page Jun 17, 2022 · 80 revisions
Clone this wiki locally

bioBakery Workflows Tutorial

bioBakery workflows is a collection of workflows made up of tasks for executing common microbial community analyses using standardized, validated tools and parameters. These tasks are typically bioBakery tools reproducibly and efficiently connected via the AnADAMA2 framework.

The workflows generally fall into two categories: profiling, which take raw data to usable data products, and analysis, which process data products into plots and statistics.


Table of Contents


Overview


image


1. Setup


1.1 Requirements

All of the core requirements (including AnADAMA2) are automatically installed when installing bioBakery workflows. This includes the following bioBakery tools:

  1. KneadData (v0.6.1+)
  2. MetaPhlAn
  3. HUMAnN
  4. StrainPhlAn

1.2 Installation

bioBakery workflows can be installed with Conda, Docker, or pip.

Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install bioBakery workflows because the tool and its dependencies are already installed.

1.2.1 Install with Conda

bioBakery workflows can be installed using Conda. This will install almost all of the dependencies for all workflows (ie KneadData, MetaPhlan, etc.) excluding those dependencies that have licenses.

To install with conda:

conda install -c biobakery workflows
  • Before installing the workflows with conda, make sure to configure your channels so biobakery is at the top of the list.

1.2.2 Install with Docker

bioBakery workflows can be run from a pre-built Docker image. This option is best if you are running in a computing environment where a container platform like Docker is installed or if you have the elevated permissions required to install Docker or a similar tool. The image will include all dependencies for all workflows (ie KneadData, MetaPhlan, etc.) excluding those dependencies that have licenses.

To install and run with Docker:

docker run -it biobakery/workflows bash

1.2.3 Install with pip

bioBakery workflows can be installed with pip. This will only install the core software and dependencies. It will not install the dependencies for the workflows. Refer to the bioBakery workflows user manual on how to install the dependencies for each workflow.

To install with pip:

pip install biobakery_workflows

To install the core wmgx dependencies for the tutorial (minus strainphlan), first please ensure that python3, python3-dev, gcc, and gcc-c++ are installed.

To install the core wmgx dependencies running biobakery workflows for version 3:

# installing kneaddata and humann from source to also install dependencies
pip install kneaddata --no-binary :all:
pip install humann --no-binary :all:
# install metaphlan and databases (set nproc to total cores to use to build database)
pip install metaphlan
metaphlan --install --nproc 4

If running biobakery workflows with version 2 of HUMAnN and MetaPhlAn, replace humann with humann2 in the command above and manually install MetaPhlAn from source following the instructions in the tools readme as MetaPhlAn is only available from PyPI starting with version 3.

1.2.4 Install dependencies with licenses

The 16S workflow optionally depends on usearch which requires a license. Obtain a license for usearch and install it manually if you will be running the 16S workflow.

1.2.5 Install databases

All of the data processing workflows require large databases. Due to their size, they are not installed with the Conda, Docker, or VM images. A demo set of databases will be used with this tutorial to reduce the run time.

To install the demo shotgun databases:

 biobakery_workflows_databases --install wmgx_demo 

To install the full shotgun databases:

 biobakery_workflows_databases --install wmgx

To install the full 16S databases:

 biobakery_workflows_databases --install 16S

2. Metagenome profiling


Raw metagenome shotgun sequencing reads can be processed through the workflow with a single command. The workflow will run quality control, taxonomic profiling, functional profiling, strain profiling, and optionally assembly for the full set of samples.

image

The raw files can be single or paired-end and can be formatted as FASTA or FASTQ. They can also be compressed with gzip. By default fastq.gz files are expected. If paired-end, the default pair identifier is ".R1". To change the default for these parameters use the options "--pair-identifier <R1>" and "--input-extension <fasta>".

2.1 Input files

For this tutorial, we will be using a set of subsampled files to reduce run time. These files were generated using reads from a set of three healthy and three diseased samples in a liver cirrhosis study: Alterations of the human gut microbiome in liver cirrhosis.

NOTE: These files are located in the Tutorials/workflows/input directory of the bioBakery image. To work with them there, from the home directory, type cd Tutorials/workflows.

2.2 Run the workflow

Run the following command to start the workflow, substituting your input and output folder names as needed:

biobakery_workflows wmgx --input input --output output_data

For demo purposes, we will bypass strain profiling with --bypass-strain-profiling and run 2 tasks at once with --local-jobs 2 to substantially reduce the run time.

If you have more time to spend or cores to use, see the sections that follow for different options.


  • What other steps can you bypass? Hint: bioBakery workflows can also accept the --help argument.

2.2.1 Scaling up

By default, the workflow will run a single task at a time and each task will be allowed a single core. If you have multiple cores or a grid computing environment, add the following options to your workflow to speed up the run:

  • Add the option --local-jobs 6 to run at most six tasks at once locally.
  • Add the option --threads 2 to give each task two cores.
  • If you are running in an environment where grid computing is available (using a SLURM or SGE system), add the option --grid-jobs 6 --threads 8 to run at most six grid jobs at once each with eight cores.

2.2.2 Expected run time

The run time will vary depending on the number of input files and the workflow options. All run time estimates are based on the use of a single core.

  • Run a single sample

    • 5 minutes: Run bypassing the strain profiling steps (option --bypass-strain-profiling)
    • 15 minutes: Run the full workflow.
  • Run all six samples

    • 30 minutes: Run bypassing the strain profiling steps (option --bypass-strain-profiling)
    • 60 minutes: Run the full workflow.

When running the subsampled files with strain profiling it is recommended to add the option --strain-profiling-options="--sample_with_n_markers 1". This will allow for strains to be identified with lower numbers of markers than the default.


  • Why do we need to lower the number of required markers here?

2.2.3 Standard output

Running the workflow on a single sample, the output to your terminal will look something like the following:

(Jul 18 13:53:24) [ 0/21 -   0.00%] **Ready    ** Task  0: kneaddata____HD42R4_subsample
(Jul 18 13:53:24) [ 0/21 -   0.00%] **Started  ** Task  0: kneaddata____HD42R4_subsample
(Jul 18 13:53:26) [ 1/21 -   4.76%] **Completed** Task  0: kneaddata____HD42R4_subsample
(Jul 18 13:53:26) [ 1/21 -   4.76%] **Ready    ** Task  3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 1/21 -   4.76%] **Started  ** Task  3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 2/21 -   9.52%] **Completed** Task  3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 2/21 -   9.52%] **Ready    ** Task  4: metaphlan____HD42R4_subsample
...

The output contains the following information:

  • Column 1: (Jul 18 13:53:26) : The time stamp.
  • Column 2: [ 2/21 - 9.52%] : The status of the workflow in tasks and percent of tasks completed.
  • Column 3: **Ready ** : The current status for the task. Tasks that are ready have all dependencies available and are waiting for computing resources. Tasks that have started are beginning to run locally. Tasks that are completed have finished executing.
  • Column 4: Task 4 : The task number.
  • Column 5: metaphlan____HD42R4_subsample : The name of the task. Tasks that are specific to a sample will include the sample name at the end of the task name.

A line will be written to standard output anytime a task changes status. When running on the grid, task status will be polled at minute intervals to track job progress in standard output.

All text written to standard output will also be written to the log.


  • The output above was from a single core job. How might this look different running on multiple processors?

2.3 Output files

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you would like to move forward without finishing the previous run, the results can be downloaded directly:

NOTE: Results of the demo run are located in the Tutorials/workflows/backup_outputs/output_data directory of the bioBakery image if you would like to skip forward in that environment.

Check the contents of the output folder:

cd output_data/
ls
anadama.log
humann
metaphlan
kneaddata
strainphlan

A subfolder was created for the outputs of each tool run. Therefore, if the --bypass-strain-profiling option was used, as it is in this demo, the StrainPhlAn results will not be present.

The following sections describe the basic structure and content of these folders. For a detailed breakdown of the final data products and how they can be used, see the visualization and statistics workflow portions of this tutorial as well as the user manuals and tutorials for included each tool.

2.3.1 Quality control data

KneadData is used for quality control. Two folders are output, main which contains the per-sample results, and merged which contains across sample summaries.

Let's look at all of the outputs generated for a specific sample:

ls kneaddata/main | grep HD32R1
HD32R1_subsample.fastq.gz
HD32R1_subsample.log
HD32R1_subsample.repeats.removed.fastq
HD32R1_subsample.trimmed.fastq
HD32R1_subsample_Homo_sapiens_demo_bowtie2_contam.fastq

These are, in order:

  1. The final reads passing QC
  2. The log file with parameters and debugging info
  3. The reads with tandem repeats removed
  4. The reads trimmed for poor quality bases and reads
  5. The contaminant sequences aligned to the KneadData human database

Now look at the across sample summaries:

ls kneaddata/merged
kneaddata_read_counts_table.tsv

This is a table of the read counts at each step for every sample, which we can use to examine the overall QC for the experiment.

2.3.2 Taxonomic profiling data

MetaPhlAn is used for taxonomic profiling and similarly to KneadData, generates main and merged folders as its output.

Again, let's look at the types of files output for each input sample:

ls metaphlan/main | grep HD32R1
HD32R1_subsample_bowtie2.sam
HD32R1_subsample_taxonomic_profile.tsv

Each sample will generate mapping results to unique sequence markers and a single sample taxonomic profile.

And let's look at the merged outputs:

ls metaphlan/merged
metaphlan_species_counts_table.tsv
metaphlan_taxonomic_profiles.tsv

These are the counts of species identified in each sample and a merged table of all of the sample's taxonomic profiles, respectively.

2.3.3 Functional profiling data

HUMAnN is used for functional profiling. There is significantly more output than KneadData or MetaPhlAn, but the main and merged folders are similar.

Each input file will generate a file of gene family abundances, pathway abundances, pathway coverage, a log, and a folder of intermediate output files into the main folder.

Additionally, these files are regrouped into enzyme commission numbers (ecs) and normalized to relative abundances in the regrouped and relab subfolders, respectively.

To see this all for a single sample (excluding intermediate files):

find humann/ | grep HD32R1 | grep -v temp
humann/regrouped/HD32R1_subsample_ecs.tsv
humann/relab/genes/HD32R1_subsample_genefamilies_relab.tsv
humann/relab/pathways/HD32R1_subsample_pathabundance_relab.tsv
humann/relab/ecs/HD32R1_subsample_ecs_relab.tsv
humann/main/HD32R1_subsample_genefamilies.tsv
humann/main/HD32R1_subsample_pathcoverage.tsv
humann/main/HD32R1_subsample_pathabundance.tsv
humann/main/HD32R1_subsample.log

The merged folder contains the gene families, ecs, and pathways files for all samples merged into single files for each output type.

ls humann/merged
ecs.tsv
ecs_relab.tsv
genefamilies.tsv
genefamilies_relab.tsv
pathabundance.tsv
pathabundance_relab.tsv

Note that the raw and normalized to relative abundance (relab) versions are both included.

The counts folder contains feature counts of gene families, ecs, and pathways for all samples. It also contains a file of total species identified after filtering and total reads aligning (for nucleotide and translated search) for each sample:

ls humann/counts
humann_ecs_relab_counts.tsv
humann_feature_counts.tsv
humann_genefamilies_relab_counts.tsv
humann_pathabundance_relab_counts.tsv
humann_read_and_species_count_table.tsv

2.3.4 Strain profiling data

StrainPhlAn is used for strain profiling. For more information on StrainPhlAn, see the StrainPhlAn tutorial.

The strainphlan folder contains all of the core outputs from running the StrainPhlAn software. For more information on the core output files, see the StrainPhlAn user manual. By default, at most ten species identified in the set of samples are profiled. The RAxML files contain the trees generated for each species. If the markers are not present in sufficient quantities for these ten species, there might not be any trees included in this folder.

2.3.5 Log file

The workflow log contains all of the details of that workflow's execution including software versions, commands run, and task benchmarking information. This appends when writing, so if you were to run the workflow twice, you would see two log entries, one for each workflow run.

The start of a log file includes the workflow settings, for example:

2017-07-18 15:30:41,946 LoggerReporter  started INFO: Beginning AnADAMA run with 37 tasks.
2017-07-18 15:30:41,946 LoggerReporter  started INFO: Workflow configuration options
2017-07-18 15:30:41,946 LoggerReporter  started INFO: strain_profiling_options =
2017-07-18 15:30:41,946 LoggerReporter  started INFO: grid_benchmark = off
2017-07-18 15:30:41,946 LoggerReporter  started INFO: grid_partition =
2017-07-18 15:30:41,946 LoggerReporter  started INFO: pair_identifier = .R1
...

Later lines include task information, for example:

2017-07-18 15:30:42,988 LoggerReporter  task_command    INFO: Tracked executable version:  kneaddata v0.6.1
2017-07-18 15:30:42,988 LoggerReporter  task_command    INFO: Executing with shell:  kneaddata --input /code/demo/HD42R4_subsample.fastq.gz --output /code/demo/output/kneaddata/main --threads 1 --output-prefix HD42R4_subsample   --reference-db /code/demo/kneaddata_demo_db/Homo_sapiens_demo/

Here, the first line captures the version of software used by the workflow and the second prints a command run including command line parameters.


  • Other than diagnosing errors, why might you want to keep this log file?

3. 16S profiling


16S sequencing reads can also be processed through the workflow with a single command. The workflow will run quality control, taxonomic profiling, and functional prediction for the full set of samples.

There are two methods to select from for 16S data sets: UPARSE (VSEARCH or USEARCH) and DADA2. They both require the same type of input files and generate similar output files. You can select which method to run by adding the --method <vsearch/usearch/dada2> option to your command.

3.1 Input files

For the 16S workflow the raw files can be single or paired-end and be multiplexed or de-multiplexed. The raw files can be fastq or fastq.gz formatted. By default fastq.gz files are expected. If paired-end the default pair identifier is _R1_001. To change the default for the pair identifiers use the option --pair-identifier <R1>. If the files are multiplexed, a barcode file must be included in the input file set by adding the option --barcode-file <file>. The barcode file should be tab-delimited with the first column the sample names and the second the barcode sequences. An index file is also required. To change the default index file identifier add the option --index-identifier <_I1_001>.

For this tutorial, we will be using a set of sub-sampled data to reduce run time. These files are from the data set generated for the following study, Melanie Schirmer, Umer Z. Ijaz, Rosalinda D'Amore, Neil Hall, William T. Sloan, Christopher Quince; Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform, Nucleic Acids Research, Volume 43, Issue 6, 31 March 2015, Pages e37, https://doi.org/10.1093/nar/gku1341.

Download all of the files to run the full tutorial. If you have limited time, only download one or two pairs of files. Place these files in a folder named input in your current working directory.

3.2 Run the workflow

Run the following command to start the workflow:

 biobakery_workflows 16s --input input --output output_data --method usearch

All input files are located in the folder input and all output files will be written to the folder output_data.

By default the UPARSE method will be run. To run the DADA2 method, add the option --method dada2. By default the DADA2 method will use the SILVA databases. To select a different database add the option –dada-db <silva/gg/rdp>. If you have more time or cores to use to run the tutorial, see the prior section in this tutorial about scaling up.

See the prior section on standard output for the details on the output written to standard out for the bioBakery workflows.

3.2.1 Expected run time

The run time will vary depending on the number of input files and the workflow options. All run time estimates are based on the use of a single core.

  • Run a single sample

    • 5 minutes: UPARSE method
    • 5 minutes: DADA2 method with green genes or rdp databases
  • Run all samples

    • 10 minutes: UPARSE method
    • 10 minutes: DADA2 method with green genes or rdp databases - 60 minutes: DADA2 method with silva database

3.3 UPARSE output files

The UPARSE method follows the UPARSE pipeline plus functional profiling with PICRUSt and phylogenetic tree generation with FastTree.

image

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you used the same command as included above, the output folder will be named output_data.

If you used the same command as included above, the output folder will be named output_data. Run the command to list the contents of the output folder to see one sub-folder along with the workflow log.

ls output_data
merged_renamed
all_samples_categorize_by_function.biom
all_samples_categorize_by_function.tsv
all_samples_closed_reference.fasta
all_samples_clustalo_aligned_closed_reference.fasta
all_samples_clustalo_aligned_nonchimera.fasta
all_samples_concatenated_discarded.fasta
all_samples_concatenated_filtered.fasta
all_samples_concatenated_truncated.fasta
all_samples_concatenated.fasta
all_samples_concatenated.fastq
all_samples_denovo_otu_table.tsv
all_samples_dereplicated.fasta
all_samples_eestats2.txt
all_samples_green_genes_mapping_results.tsv
all_samples_green_genes_mapping_results.uc
all_samples_normalize_by_copy_number.biom
all_samples_open_reference.fasta
all_samples_otu_mapping_results.tsv
all_samples_otu_mapping_results.uc
all_samples_otus_nonchimeras.fasta
all_samples_predict_metagenomes.biom
all_samples_predict_metagenomes.tsv
all_samples_read_counts.tsv
all_samples_sorted.fasta
all_samples_taxonomy_closed_reference.biom
all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_open_reference.tsv
all_samples_uparse_otus.txt
anadama.log
closed_reference.tre

The merged_renamed folder contains merged pairs with sequences renamed to sample names.

3.3.1 Quality control data

Quality of the reads are summarized in all_samples_eestats2.txt. Read counts of original, mapped, and unclassified reads are included in all_samples_read_counts.tsv.

For quality control all reads from all samples are filtered and truncated. All reads with sequence ids corresponding to sample name are in all_samples_concatenated.fastq.

all_samples_concatenated_discarded.fasta
all_samples_concatenated_filtered.fasta
all_samples_concatenated_truncated.fasta
all_samples_concatenated.fasta
all_samples_concatenated.fastq
all_samples_dereplicated.fasta
all_samples_sorted.fasta

3.3.2 OTU mapping data

Then reads are clustered, chimeras are removed, and an OTU table is created. Details of mapping of reads to OTUs can be found in following files.

all_samples_denovo_otu_table.tsv
all_samples_otu_mapping_results.tsv
all_samples_otu_mapping_results.uc
all_samples_otus_nonchimeras.fasta
all_samples_uparse_otus.txt

3.3.3 Taxonomic profiling data

The following files are closed and open reference tables after assigning taxonomy using the green genes database. The closed reference table is converted to biom format as well for further processing by PICRUSt.

all_samples_closed_reference.fasta
all_samples_green_genes_mapping_results.tsv
all_samples_green_genes_mapping_results.uc
all_samples_open_reference.fasta
all_samples_taxonomy_closed_reference.biom
all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_open_reference.tsv

3.3.4 Phylogenetic tree

Then Clustal Omega’s multiple sequence alignment algorithm is used to construct msa files and FastTree to construct phylogenetic tree.

all_samples_clustalo_aligned_closed_reference.fasta
all_samples_clustalo_aligned_nonchimera.fasta
closed_reference.tre

3.3.5 Functional profiling data

Metagenome functional predictions can be found in these files generated by PICRUSt’s predict metagenomes and categorize by function modules.

all_samples_predict_metagenomes.biom
all_samples_predict_metagenomes.tsv
all_samples_categorize_by_function.biom
all_samples_categorize_by_function.tsv

3.4 DADA2 output files

The DADA2 method follows the DADA2 pipeline plus phylogenetic tree generation with FastTree.

image

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you used the same command as included above, the output folder will be named output_data.

If you used the same command as included above, the output folder will be named output_data. Run the following command to list the contents of the output folder to see four sub-folders along with the workflow log.

ls output_data
Error_rates_per_sample_FWD.png          
Error_rates_per_sample_REV.png              
FWD_read_plot.png                   
REV_read_plot.png                   
Read_counts_after_filtering.tsv         
Read_counts_at_each_step.tsv                
Read_counts_filt.rds                
all_samples_SV-counts.tsv               
all_samples_clustalo_aligned_nonchimera.fasta   
all_samples_taxonomy_closed_reference.tsv       
all_samples_taxonomy_closed_reference_taxcolumns.tsv
all_samples_taxonomy_closed_reference_withseqs.tsv
anadama.log
closed_reference.tre
error_ratesFWD.rds
error_ratesREV.rds
filtered_input
mergers.rds
seqtab_final.rds
sequences.fasta

The filtered_input folder contains filtered and trimmed fastq.gz files, anadama.log file contains workflow run log information.

3.4.1 Quality control data

Following plots describe quality of reads for each sample. They are learned error rates per sample for forward and reverse reads, and quality of reads per sample for forward and reverse reads.

Error_rates_per_sample_FWD.png          
Error_rates_per_sample_REV.png              
FWD_read_plot.png                   
REV_read_plot.png           

The workflow tracks read counts on each step of the processing. The counts are in following files

Read_counts_after_filtering.tsv         
Read_counts_at_each_step.tsv                

3.4.2 Taxonomic profiling data

The Amplicon Sequence Variant (ASV) table generated by DADA2 workflow uses sequences as ids. It is an equivalent of a high resolution OTU table.

all_samples_SV-counts.tsv

The 16S DADA2 workflow gives the option to assign taxonomy including species level using database of silva, green genes, or rdp. The following files are closed reference tables after assigning taxonomy in 3 different formats. DADA2 uses sequences as ids, and has separate column for each taxonomic level. For users convenience and consistence with other workflows, we generate a version of the table with all taxonomic levels combined in one column, and a version where ids are not sequences, but short strings ASV1, ASV2,… similar to OTU1, OTU2,….

all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_closed_reference_taxcolumns.tsv
all_samples_taxonomy_closed_reference_withseqs.tsv

3.4.3 Phylogenetic tree

All sequences found in samples after merging pairs, de-replicating and removing chimeras are saved in fasta format.

sequences.fasta

Then Clustal Omega’s multiple sequence alignment algorithm is used to construct a msa file and FastTree to construct phylogenetic tree.

all_samples_clustalo_aligned_nonchimera.fasta
closed_reference.tre

3.4.4 RDS files

The following R native format RDS files are also available in the output folder:

Read_counts_filt.rds       read counts after filtering
error_ratesFWD.rds         error rates by sample, forward reads
error_ratesREV.rds         error rates by sample, reverse reads
mergers.rds                merged pairs
seqtab_final.rds           ASV table after merging, de-replicating and removing chimeras

4. Analysis workflows


The AnADAMA2 framework can be additionally be utilized for building downstream analysis pipelines. The bioBakery workflows installation provides two pre-built workflows of this type, named vis and stats. These are compatible with the data products from both the bioBakery metagenome and 16S workflows (USEARCH and DADA2), though any correctly formatted and named files can be processed. These visualization and statistics workflows can use quality control, taxonomic profiles, and functional profiles plus sample metadata to generate initial exploratory tables and figures as well as a summary pdf report.

4.1 Input files

For the visualization and statistics demos, we will use data generated from running metagenomic samples from the HMP2 project through the bioBakery wmgx profiling workflow, plus a subset metadata file.


  • What would happen if you instead used the data processed previously in this tutorial?

To begin, download the demo files into an input folder:

NOTE: These files are located in the Tutorials/workflows/input_vis_stats directory of the bioBakery VM image.

Now let's examine these inputs, e.g.:

less -S input_vis_stats/kneaddata_read_count_table.tsv

  • How is each file formatted, and what does it contain? Refer back to the help pages for each bioBakery tool if uncertain.
Answer:
  • kneaddata_read_count_table.tsv : A tab-delimited file with samples as rows and read counts at each step of the quality control filtering process as columns.
  • metaphlan_taxonomic_profiles.tsv : A tab-delimited file with samples as columns and rows with taxonomic abundance information.
  • pathabundance_relab.tsv : A tab-delimited file with samples as columns and rows with pathway abundances.
  • HMP2_metadata_subset.tsv : A file of information about the sources of the samples with matching sample IDs to the microbiome measurements. In this case, they are IBD patients and healthy controls from the Integrative Human Microbiome Project.
  • anadama.log : The log file from the profiling workflow that generated these data products. It captures the command line options, the versions of the software, and the exact commands run.

Again, these files were subset from the original study for tutorial purposes so that they take less time to view and process, while still showing some interesting behavior and associations. If you are interested in performing analyses on the full data, you can obtain it from the public repository.

4.2 Vis workflow

The vis workflow is designed to take as input the outputs from the metagenomics or 16S workflows and produce summary visualizations that can be used to quickly assess the quality and content of a processed microbiome dataset, with minimal work necessary from the user. Towards that end, it will identify files based on the default names generated by those workflows for merged tables (see the previous sections on "Output files" for details) and generate a number of common visualizations from the available inputs. Metadata can be included or not, but must be identified via a command line parameter rather than the file name.

The output of the help command, in the input argument section, describes all of the files the workflow can utilize:

biobakery_workflows vis --help

  • What files are required when running the workflow to process metagenomics data?
  • What files are required when running the workflow to process 16S data?

If you would like to use this workflow with data files that have been generated through a different pipeline, you can give them the file names in the help message, and they will be processed. However, keep in mind that if the formatting varies from the bioBakery standard, you may encounter errors.

4.2.1 Run the workflow

Once the desired input files are in one folder (subfolders will also be searched), only a single command is required to get sensible output from most studies:

biobakery_workflows vis --input input_vis_stats --output HMP2_vis --project-name HMP2 --author-name YOUR_NAME

Metadata can be included to generate additional relevant outputs:

biobakery_workflows vis --input input_vis_stats --output HMP2_vis --project-name HMP2 --author-name YOUR_NAME 
--input-metadata input_vis_stats/HMP2_metadata_subset.tsv

This command will take either about 4 minutes or 10 minutes, if the --input-metadata parameter is included, to run. You can also download the final report directly, though this does not include a number of important auxiliary files: wmgx_report.pdf.

NOTE: This file is in the Tutorials/workflows/backup_outputs/output_vis_stats directory of the bioBakery image.

4.2.2 Output files

The vis workflow outputs a single folder with two subfolders, as well as a zip archived copy:

ls HMP2_vis
anadama.log  data  figures  wmgx_report.pdf

The primary output is a pdf that compiles and summarizes the input files, automatically adjusting in size and content based on the types of data products supplied.

Additionally, a log file named anadama.log, like the one generated for the data profiling workflows, captures the command line options, software versions for any command line executed tools, plus any command line executed tasks.

The folder named data contains all of the files that were used to generate the figures in the report. The folder named figures contains high resolution images of all of the figures contained in the report. Thus individual aspects of the pdf report can be pulled out as input to other software and presentations.

4.2.3 Vis report

A description of a typical report and some sample images follows. To view the full output, run or download the demo data from the previous section.

The report begins with a header including an optional image if provided, the project name, author and date. Next, a table of contents is displayed, including clickable links to each section.

The first section in the report contains an introduction. By default, a diagram of the workflow corresponding to the type of data products is included in this section along with a short description. The user can modify this section to include a custom introduction by using the option --introduction-text $TEXT, replacing $TEXT with the custom introduction.

The next section includes the quality control data, if provided. This shows the number of reads filtered at each QC step, starting from the number of raw (unprocessed) reads. If paired end, the results are broken down by paired and orphan (unpaired) reads. Note that if any table is too large, a partial table is included in the document with a link to the full table as a text file in the data folder.

The third section details the taxonomic composition of the samples in several ways. This is always included as a minimum in the report. Species and genus level information is compiled for metagenomic data. For 16S data, terminal taxa, OTUs or ASVs collapsed to highest known taxonomic order, are used in place of species, while genera are filtered for only those which have assignments at that level.

The section begins with a table listing the total number of taxa after standard filtering for abundance and prevalence. Then, if metadata is provided, alpha diversity plots are generated using the Inverse Simpson index, with box plots for categorical variables and scatter plots for continuous variables. Note that the alpha diversity is calculated on the unfiltered taxonomic table, but all subsequent visualizations and statistics are performed on filtered data.

Next, beta diversity is qualitatively assessed via principal coordinate analysis (PCoA) performed on Bray-Curtis dissimilarities, visualized by plotting the top two axes. If metadata is included, these plots are replicated, one for each metadata variable, with continuous metadata values colored with a gradient and categorical colored individually.

Finally, sample level taxonomic summary figures are displayed. Heatmaps containing the top 25 and stacked barplots containing the top 15 taxa by average relative abundance are generated. These plots are annotated with metadata when available. Note that both heatmaps and barplots can be set to show a different number of top taxa through command line arguments.

The fourth section focuses on functional profiling and plots will be generated if pathway or Enzyme Commission (EC) data are provided. This data, like taxonomic abundance data, is first filtered by abundance and prevalence. Heatmaps are included, similar to the taxonomy section, for the top 25 pathways and ECs. Additionally, the total number of reads aligned at each of the HUMAnN processing steps can be plotted. The final part of this section includes a list of the top 25 pathways, with links to descriptions in MetaCyc for the top three.

Finally, if a log file is provided, such as those generated with the metagenomic or 16S data processing workflows, a section including details about the logged tasks and environment is included. This part includes two sections, with the first listing all of the software run by the workflow plus their versions, and the second listing all of the commands run by the workflow to generate the data files.


  • What can you glean about the HMP2 cohort from these visualizations?
  • Can you tell that these are subset data?

4.3 Stats workflow

To complement the visualization workflow, a stats workflow is provided which generates a number of standard summary statistics and plots suitable for a preliminary assessment of a dataset. It is designed primarily for use with data products from the bioBakery profiling workflows.

However, it will work to identify and process any tab-delimited file that includes sample and feature names in the input folder. This allows for inclusion of a larger variety of data types, for example metabolomics measurements, with that caveat that the user must remove extraneous files from the input folder. The stats workflow, unlike the vis, always requires metadata, which should be tab-delimited and can be in the format of samples as rows or columns.

The metadata can describe a single point or repeat measures study, with the latter being run if random effects are specified with the --random-effects flag. These are differentiated because within-subject correlations are often strong enough that different statistical approaches are required. For repeat measures studies, a column or row with the name subject in the metadata is required, denoting which samples belong to the same set (have the same subject ID).

For the purpose of this tutorial, we will be starting with longitudinal data from the HMP2. We will additionally subset the data to the first time point from each subject for comparison purposes.

4.3.1 Run the workflow

Running the stats workflow by default is similar to running the vis workflow. More specifically the user is only required to provide the input folder, metadata file, output folder, project name, and author name. However, many components of the stats can take additional arguments, owing to the larger variety of input files possible.

To see what options exist, check the help message:

biobakery_workflows stats --help

For the tutorial data, we need to add a few optional parameters to specify how the statistical models should best handle this study design. Generally, repeat measures experiments and non-standard data types necessitate more command line specification.

biobakery_workflows stats --input input_vis_stats/ --input-metadata input_vis_stats/HMP2_metadata_subset.tsv --fixed-effects="diagnosis,antibiotics,age,dysbiosis" --random-effects="site,subject" --output HMP2_stats --static-covariates="diagnosis,age,site --permutations 10 --maaslin-options="reference='diagnosis,nonIBD'" --project-name HMP2 --author-name AUTHOR_NAME

The tutorial input folder, including the metadata within, remains the same as for the vis workflow. The output folder is changed so as to not overwrite files from the vis output. Let's look at each of the additional new options in detail:

First, we set the fixed effects and the random effects to indicate how these metadata variables will be treated in MaAsLin 2 models:

--fixed-effects="diagnosis,antibiotics,age,dysbiosis" and --random-effects="site,subject"

Differentiating these are recommended in a study design with repeated, correlated measurements (ie, longitudinal, blocked, etc.).


  • What is a fixed effect? How about a random effect?
  • Why are we setting "site" (sampling location) and "subject" as random effects?
Answer:

A fixed effect is typically a fixed parameter estimate for each level of interest, so that we can investigate the differences in responses between levels. A random effect is used for a variable whose levels are regarded as a sample from some larger population of levels (e.g.: individuals in the longitudinal study or neighborhoods in a city) where we are not interested in difference between responses, but rather so that we can model the variance of the population that gave rise to the observed levels.

Here we are not interested in directly comparing the microbiomes of individual subjects, which were sampled from a large population of possible subjects, so we treat this as a random effect. We however include it in the model because not accounting for these differences substantially obscures the fixed effects we are interested in here, such as the effect of diagnosis. Note that being treated as a fixed or random effect does not generally determine a variable’s relevance to the study and that the same variable can sometimes be modeled as either a fixed or a random effect. We could chose to specify site as a fixed effect if we were interested in the differences between sampling locations, but right now we aren't.


Next, we set what are referred to here as static covariates. These are variables which do not change for subjects during the study. This is required for the repeat measures PERMANOVA, because it uses a permutation scheme which shuffles within subjects for variables that have more than one value per subject and between subjects in blocks for "static" values. This doesn't necessarily align with random vs. fixed effects, and subject is a privileged metadata variable name that is required for this to run correctly (i.e. define the blocking structure for permutations).

--static-covariates="diagnosis,age,site"

In our demo data, diagnosis and sampling site do not change within any subjects. Age does not either, because it was recorded at the start of the study. Dysbiosis score and antibiotic usage, however, do vary across the study at the individual level.

As mentioned, PERMANOVA is a permutation based technique, so for the purposes of speeding up demo calculations, we will set this very low:

--permutations 10

In practice you would likely want to use the default number of permutations, which is 4999, or at minimum 999.

Finally, we set the reference level for the diagnosis variable to nonIBD:

--maaslin-options="reference='diagnosis,nonIBD'"

This is required for MaAsLin 2 factors with more than 2 levels, since comparisons are done with respect to a single reference level, rather than pairwise. Reported results are changes relative to this level and may vary if another is chosen.

This run should take approximately 4 minutes with 4 cores. You can also download the generated report directly: stats_report.pdf.

NOTE: This file is in the Tutorials/workflows/backup_outputs/output_vis_stats directory of the bioBakery image.

Now let's modify the data set to remove the multiple time points. First download the list of samples that correspond to the first time point for each subject: HMP2_sample_subset.txt.

NOTE: These files are located in the Tutorials/workflows/input_vis_stats_subset directory of the bioBakery VM image.

Next, create a new folder and then generate a subset metadata file, taxonomic profile file, and pathway abundance file.

We will use a workflows utility script named subset_file_by_sample_id.py to subset the files. The script will print out the bash command that it is running to generate the subset file.

First let's create a subset metadata file, excluding column 5 which is the subject id. Then let's create subset taxonomy and pathways files.

subset_file_by_sample_id.py --input input_vis_stats/HMP2_metadata_subset.tsv --input-ids HMP2_samples_first_timepoint.txt --output input_vis_stats_subset/HMP2_metadata_subset_multivariable.tsv --exclude-row-name subject
subset_file_by_sample_id.py --input input_vis_stats/metaphlan_taxonomic_profiles.tsv --input-ids HMP2_samples_first_timepoint.txt --output input_vis_stats_subset/metaphlan_taxonomic_profiles.tsv 
subset_file_by_sample_id.py --input input_vis_stats/pathabundance_relab.tsv --input-ids HMP2_samples_first_timepoint.txt --output input_vis_stats_subset/pathabundance_relab.tsv

Let’s walk through the bash command the script uses to see how the utility script created the new files:

COLUMNS=$(head -n 1 input_vis_stats/metaphlan_taxonomic_profiles.tsv | tr '\t' '\n' | cat -n | grep -f HMP2_samples_first_timepoint.txt | awk -F' ' '{ print $1}' | tr '\t' ',' | sed 's/,$//' ) && cut -f '1,'$COLUMNS  input_vis_stats/metaphlan_taxonomic_profiles.tsv > input_vis_stats_subset/metaphlan_taxonomic_profiles.tsv

The first command takes the first line of the taxonomic profile with head -n 1 and then replaces all tabs with newlines with tr '\t' '\n' and then adds row numbers with cat -n. Next, it excludes any rows that do not include the sample ids of interest with grep -f and then it captures just the row numbers remaining with awk -F" " '{ print $1}' and then it replaces all newlines with commas with tr '\n' ',' and then removes the final trailing comma with sed 's/,$//'. The comma-delimited results are captured in a single environment variable named COLUMNS. For the second command the column numbers, uses $COLUMNS` plus the first column for the row names to cut from the original taxonomic profile to create a new taxonomic profile that just includes the subset of samples of interest.

If you would prefer to download the resulting files directly you can do so from these links:

Now run the stats workflow on this subset of data, which is no longer a repeat measures study because it only has a single time point:

biobakery_workflows stats --input input_vis_stats_subset/ --input-metadata input_vis_stats_subset/HMP2_metadata_subset_multivariable.tsv --fixed-effects="diagnosis,antibiotics,age,dysbiosis" --output HMP2_stats_subset_output --maaslin-options="reference='diagnosis,nonIBD'" --project-name HMP2 --author-name AUTHOR_NAME

This should take about a minute to run with a single core. You can download the resulting report: stats_report.pdf.


  • What options differ for this run and why?

4.3.2 Output files

The output folder structure is similar to the vis output with a figures and data subfolder plus a pdf report. Additionally all the output from this workflow is compressed into a single zip archive.

Let's check out the outputs in detail:

ls HMP2_stats
anadama.log  data  features  halla_taxonomy_pathways  maaslin2_pathways  maaslin2_taxa  mantel_test  permanova  stats_report.pdf  stratified_pathways

The stats_report.pdf contains a title page, like the vis report, with an optional image, title, author name, date, and dynamic table of contents. The table of contents varies based on the data provided by the user and the workflow run time settings. The next page includes the introduction which describes the four sections of the report. By default the workflow is broken into four sections with the contents varying based on the study type.

Section 1: All against all: This section includes correlations between pairs of data sets computed using a Mantel test with Bray-Curtis dissimilarities. Because the Mantel test is a permutational method which addresses the dependence of distances (unlike a simple correlation coefficient) and does not make assumptions about the statistical distributions, it is appropriate for a wide variety of data that can be supplied by the user, so long as the matrices are the same shape (e.g. taxa and pathways). This section is included for all study types with a plot and corresponding data table, included as a text file, generated for the mantel test.

Section 2: One against all: The second section includes associations of each metadata variable with each full data set. PERMANOVA is used as a robust method to associate metadata under a variety of experimental designs and metadata distributions with all outcome features simultaneously. This, like ANOVA, compares the between- and within-group variance, but does so without assuming normality and on underlying measures, factors crucial for microbiome data analysis. These results can often be seen qualitatively on the PCoA plots from the vis workflow, however, note that PERMANOVA is performed not on the ordination, but the underlying data. Univariable and multivariable (all input variables) tests are run on single point data, whereas subject-aware "univariable" tests are run on repeat measures data.

Section 3: Each metadata against all features individually: The third section computes associations for each metadata variable against each individual feature in the data sets. A general linear model approach is utilized through the MaAsLin 2 package which appropriately treats microbial meta’omic data to be tested against a wide range of metadata and experimental designs, with widely customizable parameters with defaults determined by extensive benchmarking. Compared to PERMANOVA, which outputs a single test statistic per metadata variable, MaAsLin 2 runs every feature against every variable as specified in the input model and multiple hypothesis corrects to account for the large number of tests. Note that a significant PERMANOVA result does not ensure significant MaAsLin 2 results and vice versa, due to both the differing natures of the tests and how some associations are more apparent at the community or at the feature level.

The report contains for each input meta’omic data matrix a heatmap of all significant results relative to the reference level for each variable and up to the top 9 significant results (by corrected p-value) represented by bar or scatter plots as appropriate. Additional figures and full text output can be found in the same directory as the report. For more information on these see the MaAsLin 2 documentation. Additionally, for HUMAnN generated stratified pathways, if provided, the most significant associations, for categorical features only, are plotted stratified by species, using the humann_barplot script provided in that package. Details on these plots are found in the HUMAnN documentation.

Section 4: Feature and metadata all-against-all testing semi-individually: The final section computes correlations between all pairs of features and metadata for each input dataset. HAllA is used to discover associations in high-dimensional datasets by a hierarchical hypothesis testing approach which avoids the pitfalls of traditional correlation techniques regarding meta’omic data while retaining high sensitivity. This software reports a correlation for each possible comparison as well as generating ranked blocks of associated features, which improves power and aids in interpretability. It can handle heterogeneous datasets (e.g. clinical metadata) through a wide variety of supported correlation coefficients and clustering algorithms, and unlike the Mantel test, it is not reliant on a shared distance metric nor equal dimensionality of inputs.

Output from HAllA is visualized as a hallagram which resembles a typical pairwise correlation heatmap, with rows of features from the first input matrix and columns of features from the second, but with these features arranged by hierarchical clustering and densely associated blocks of features numbered to match output text files. For more information on HAllA and the interpretation of hallagrams, see the HAllA documentation.

There are a few additional folders which contain the full results from running external bioBakery statistics tools. First is the halla_taxonomy_pathways folder which contains the results from running the taxonomy against pathways with HAllA. For each pair of input data files there will be a HAllA run evaluating the two feature sets. HAllA is a computationally intensive step so to speed up processing time users can run with the option to bypass these tasks with the setting --bypass-halla. These folders contain all of the data products output from the HAllA run while the report just includes the heatmap. Additionally the anadama.log in the output folder will contain the exact command run to generate these HAllA results along with the exact commands run to generate any other results like the MaAsLiN 2 runs.

Next are the folders named maaslin2_pathways and maaslin2_taxa which contain the results from the MaAsLiN 2 runs. A MaAsLiN 2 run will be completed on each data set unless the user runs with the option --bypass-maaslin. The report contains the heatmap for each run plus a subset of the plots. The report is limited to showing the top nine plots for each of the metadata variables for each data set. If both taxonomy and pathways data are provided the top N, with N a variable set by default to 3 that can be changed on the command line, then pathways are plotted stratified by species taxonomy. High resolution images of these plots are stored in the stratified_pathways folder.

The next folder named permanova contains the high resolution images of the plots for the PERMANOVA results with repeated measures plus the univariable. This folder also contains tab-delimited files of the data points for each of these plots. The final folder named mantel_test also contains a tab-delimited file of the data points output from the mantel test plus the high resolution plot.