anadama2
AnADAMA2 is the next generation of AnADAMA (Another Automated Data Analysis Management Application). AnADAMA is a tool to create reproducible workflows and execute them efficiently. Tasks can be run locally or in a grid computing environment to increase efficiency. Essential information from all tasks is printed to the screen and logged to ensure reproducibility. An auto-doc feature allows workflows to generate documentation automatically to further ensure reproducibility.
- For additional information, see the AnADAMA2 User Manual.
- For questions about the software, please reach out to the bioBakery Support Forum.
Table of Contents
- Prerequisites
- Installation
- How to write a workflow
- Collect tasks
- Write a workflow
- Run a workflow
- Make your workflow generic
- Run a generic workflow
- Allow each task to use multiple cores
- Run a workflow using multiple cores
- Allow each task to use the grid
- Run a workflow on the grid
- Add an auto-generated document
- Run a workflow to auto-generate a document
- Workflow examples
Prerequisites
- Python (version >= 2.7)
- Pweave (only required for auto-doc workflows, automatically installed)
- Pandoc (only required for auto-doc workflows)
- matplotlib (only required for auto-doc workflows)
- LaTeX (only required for auto-doc workflows)
- hclust2 (only required for auto-doc workflows with hclust2 heatmaps)
Installation
Install AnADAMA2 and dependencies with the following command:
$ pip install anadama2
Add the option --user to the install command if you do not have root permissions.
This tutorial also requires MetaPhlAn2 to be installed. For information on how to install MetaPhlAn2, see the MetaPhlAn2 tutorial. The auto-doc section of this tutorial requires bioBakery workflows to be installed as it uses some of the utility functions from this software package. See the bioBakery Workflows User Manual for information on how to install.
The first step before writing a workflow is to collect the set of tasks the workflow will run. In this tutorial we will use a set of three commands that are run directly on the command line. These commands run MetaPhlAn2 on a set of two fastq input files:
$ metaphlan2.py sample1.fastq --input_type fastq --no_map > sample1_profile.txt
$ metaphlan2.py sample2.fastq --input_type fastq --no_map > sample2_profile.txt
$ merge_metaphlan_tables.py sample1_profile.txt sample2_profile.txt > all_profiles.txt
Next write your initial AnADAMA2 workflow. Open a file named myworkflow.py in any text editor and add the following lines of python code (lines starting with "#" are comments in python code):
# import anadama2 and create a workflow instance, removing the options input/output
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow", remove_options=["input","output"])
# add tasks to your workflow (there are three tasks corresponding to the three commands)
workflow.add_task(
"metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
depends=["sample1.fastq"], targets=["sample1_profile.txt"])
workflow.add_task(
"metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
depends=["sample2.fastq"], targets=["sample2_profile.txt"])
workflow.add_task(
"merge_metaphlan_tables.py [depends[0]] [depends[1]] > [targets[0]]",
depends=["sample1_profile.txt","sample2_profile.txt"], targets=["all_profiles.txt"])
# call the go function to indicate that all tasks have been added to the workflow
workflow.go()
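Inside each command string, AnADAMA2 replaces placeholders such as [depends[0]] and [targets[0]] with the corresponding file names before running the task. A minimal sketch of that kind of substitution (the render helper here is hypothetical, not AnADAMA2's own implementation):

```python
import re

def render(command, depends, targets, args=None):
    """Replace [depends[i]], [targets[i]], and [args[i]] placeholders
    with the matching list entries (a simplified illustration)."""
    lists = {"depends": depends, "targets": targets, "args": args or []}
    def substitute(match):
        name, index = match.group(1), int(match.group(2))
        return str(lists[name][index])
    return re.sub(r"\[(depends|targets|args)\[(\d+)\]\]", substitute, command)

cmd = render(
    "metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
    depends=["sample1.fastq"], targets=["sample1_profile.txt"])
print(cmd)
# metaphlan2.py sample1.fastq --input_type fastq --no_map > sample1_profile.txt
```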
Download the following demo files to your current working directory prior to running the workflow:
Now you can run the workflow by running the python file you just created. Running the same command a second time, after all tasks have completed in the first run, will cause the workflow to skip the tasks because the files have already been created and are up-to-date. If you want to run again without skipping any tasks, add the option --skip-nothing:
# all tasks are run
$ python myworkflow.py
# all tasks are skipped (because they were just run and everything is up-to-date)
$ python myworkflow.py
# all tasks are run (because of the flag applied)
$ python myworkflow.py --skip-nothing
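The skip behavior works like make-style freshness checking: a task needs to run when a target is missing or older than one of its dependencies. A simplified, timestamp-only sketch of the idea (not AnADAMA2's actual implementation):

```python
import os

def needs_rerun(depends, targets):
    """Rerun when any target is missing, or when the newest dependency
    is newer than the oldest target (a timestamp-only illustration)."""
    if not all(os.path.exists(target) for target in targets):
        return True
    newest_dep = max(os.path.getmtime(dep) for dep in depends)
    oldest_target = min(os.path.getmtime(target) for target in targets)
    return newest_dep > oldest_target
```

On the second run above, every profile file exists and is newer than its fastq input, so a check like this reports that nothing needs to rerun.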
Once you have a workflow running, you might want to make it more generic so the input file names do not have to match those in the workflow. You can modify the workflow to search an input folder for file names with a specific extension and then run all of these files through the tasks in your workflow. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:
# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")
# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
args = workflow.parse_args()
# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")
# add tasks to your workflow
workflow.add_task_group(
"metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
depends=input_fastq_files, targets=output_profiles)
workflow.add_task(
"merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
depends=output_profiles, targets=output_merged, args=args.output)
# call the go function to indicate that all tasks have been added to the workflow
workflow.go()
Notice how the workflow changed from the original to make the tasks independent of the input file names. The Workflow instance is no longer created with the option to remove the default arguments input and output, so the workflow can now be run with "--input" and "--output" on the command line. The add_argument call adds an argument to specify the extension of the input files, and parse_args reads the command line arguments provided by the user when running the workflow. The get_input_files call finds all of the files in the input folder with the extension provided by the user (or the default extension if none was provided), and name_output_files names the output files based on the names of the input files so they keep the same identifiers. Finally, the first two add_task calls have been replaced by a single add_task_group, which applies the same command to every task in the group.
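The output naming step maps each input file to a matching name in the output folder. A sketch of the kind of tag-and-extension renaming performed here, using a hypothetical helper (not AnADAMA2's API):

```python
import os

def name_outputs(input_files, output_folder, tag, extension):
    """Build output names like 'sample1_profile.txt' from 'sample1.fastq'
    (an illustration of tag/extension renaming, not AnADAMA2's own code)."""
    names = []
    for path in input_files:
        base = os.path.splitext(os.path.basename(path))[0]
        names.append(output_folder + "/" + base + "_" + tag + "." + extension)
    return names

print(name_outputs(["fastq_folder/sample1.fastq"], "output_folder", "profile", "txt"))
# ['output_folder/sample1_profile.txt']
```

Keeping the input identifier in the output name makes it easy to trace each profile back to its sample.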
Now try running your workflow again, using the new input and output command line options. First you might want to create a new folder and move the fastq files into it. Alternatively, you can just provide the folder the input files are currently located in as the input folder:
$ python myworkflow.py --input fastq_folder --output output_folder
After running, check out the output folder to see the profile files it now contains.
MetaPhlAn2 has an option to speed up the run by using multiple threads. If we make a small change to the workflow, we can have the user specify how many threads each MetaPhlAn2 task should use. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:
# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")
# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
workflow.add_argument("threads", desc="the number of threads for each task", default=1)
args = workflow.parse_args()
# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")
# add tasks to your workflow
workflow.add_task_group(
"metaphlan2.py [depends[0]] --nproc [args[0]] --input_type fastq --no_map > [targets[0]]",
depends=input_fastq_files, targets=output_profiles, args=[args.threads])
workflow.add_task(
"merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
depends=output_profiles, targets=output_merged, args=args.output)
# call the go function to indicate that all tasks have been added to the workflow
workflow.go()
The workflow has changed by adding a new add_argument, which defaults the threads argument to one. Also, the add_task_group command now includes the option "--nproc", which passes the number of threads to each MetaPhlAn2 task.
Run the workflow again, this time allowing each MetaPhlAn2 task to use three cores (one for each thread). Add the option to skip nothing, since you have already run the workflow once and the output folder will be full of up-to-date output files but you would like to run all tasks again. This command will run one task at a time:
$ python myworkflow.py --input fastq_folder --output output_folder --threads 3 --skip-nothing
Next run again, this time adding a new option that will run two tasks at once on your local machine, with each task using three cores:
$ python myworkflow.py --input fastq_folder --output output_folder --threads 3 --local-jobs 2 --skip-nothing
Now you can make a small final adjustment to your workflow so that the MetaPhlAn2 tasks will be run on your grid computing environment instead of your local machine. This is convenient if you have hundreds of fastq files to process through this workflow. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:
# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")
# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
workflow.add_argument("threads", desc="the number of threads for each task", default=1)
args = workflow.parse_args()
# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")
# add tasks to your workflow
workflow.add_task_group_gridable(
"metaphlan2.py [depends[0]] --nproc [args[0]] --input_type fastq --no_map > [targets[0]]",
depends=input_fastq_files, targets=output_profiles, args=[args.threads],
time=3*60, mem=12*1024, cores=args.threads)
workflow.add_task(
"merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
depends=output_profiles, targets=output_merged, args=args.output)
# call the go function to indicate that all tasks have been added to the workflow
workflow.go()
These modifications change the add_task_group function to add_task_group_gridable and add an estimate of the time (in minutes) and memory (in MB), plus the requested cores, for each task to be run on the grid. In this example, we request 3 hours and 12 GB of memory from the grid for each MetaPhlAn2 task.
Running this workflow with the default options will run all tasks on your local machine. If you add the grid jobs option, the workflow will launch at most that number of grid jobs at a time; for example, if you request 10 grid jobs, then at most 10 grid jobs will be in the queue at once. By default the grid system used is the one detected on your local machine. If you need to specify the grid, use the option --grid {slurm|sge}. If you need to specify the partition, use the option --grid-partition {general}. To run the workflow with 10 grid jobs, each with 8 cores, run the following command:
$ python myworkflow.py --input fastq_folder --output output_folder --threads 8 --grid-jobs 10 --skip-nothing
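Capping the number of in-flight grid jobs is a standard bounded-concurrency pattern. A minimal sketch of the idea behind --grid-jobs using a semaphore (an illustration only, not AnADAMA2's scheduler):

```python
import threading

def run_with_cap(tasks, max_jobs):
    """Run callables concurrently, but allow at most max_jobs to be
    active at any moment (a sketch of a bounded job queue)."""
    gate = threading.Semaphore(max_jobs)
    def worker(task):
        with gate:  # blocks until one of the max_jobs slots is free
            task()
    threads = [threading.Thread(target=worker, args=(task,)) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

With max_jobs=10, all tasks are queued up front but no more than ten run at once, mirroring how --grid-jobs 10 limits the number of jobs in the grid queue.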
Next, you can add an auto-generated pdf document that shows a plot and a heatmap of the data in the merged taxonomic profile. This report can even have a custom project name. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:
# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")
# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
workflow.add_argument("threads", desc="the number of threads for each task", default=1)
workflow.add_argument("project-name", desc="the name of the project", default="Demo project")
args = workflow.parse_args()
# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")
# add tasks to your workflow
workflow.add_task_group_gridable(
"metaphlan2.py [depends[0]] --nproc [args[0]] --input_type fastq --no_map > [targets[0]]",
depends=input_fastq_files, targets=output_profiles, args=[args.threads],
time=3*60, mem=12*1024, cores=args.threads)
workflow.add_task(
"merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
depends=output_profiles, targets=output_merged, args=args.output)
# add a document task
pdf_output = workflow.name_output_files("taxonomy_profile.pdf")
workflow.add_document(
templates="taxonomy_template.py",
depends=output_merged,
targets=pdf_output,
vars={"taxonomic_profile":output_merged,
"project":args.project_name})
# call the go function to indicate that all tasks have been added to the workflow
workflow.go()
These modifications add one more command line argument, the project name, which will be included in the report. They also add an add_document task to the workflow to generate the document. The document is generated from a python file that includes comments in markdown format. To create the template file, open a file named taxonomy_template.py in any text editor and add the following lines of code. Save this file in the same folder as the workflow file:
#' % <% from anadama2 import PweaveDocument; document=PweaveDocument(); print("Taxonomy Report") %>
#' % Project: <% vars = document.get_vars(); print(vars["project"]) %>
#' % Date: <% import time; print(time.strftime("%m/%d/%Y")) %>
#+ echo=False
from biobakery_workflows import utilities
samples, taxonomy, data = document.read_table(vars["taxonomic_profile"])
species_taxonomy, species_data = utilities.filter_species(taxonomy,data)
top_taxonomy, top_data = utilities.top_rows(species_taxonomy, species_data, 5, "average")
document.plot_stacked_barchart(top_data, top_taxonomy, samples,
title="Top 5 species by average abundance", ylabel="% Predicted composition",
legend_title="Species")
#' This project analyzed <% print(len(samples)) %> samples in total.
#' For these samples, the species with the highest predicted abundance was <% print(top_taxonomy[0]) %>.
#+ echo=False
document.show_hclust2(samples, top_taxonomy, top_data, title="Top 5 species by average abundance")
The template file is a python script with markdown formatting in the comments. It uses the anadama2 document class to get the variables provided by the workflow. It reads the taxonomic profile table that was created by the workflow and extracts the data for just the species. It uses a biobakery workflows utility function to simplify the filtering and extract the top rows. Next it adds a stacked barchart and a heatmap generated from hclust2.
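The top-rows filtering keeps the species with the highest average abundance across samples. A sketch of that computation in plain Python (an illustration of the idea, not the biobakery_workflows utility itself):

```python
def top_rows(labels, data, n):
    """Return the n rows with the highest mean value, highest first
    (an illustration of top-rows-by-average filtering)."""
    averages = [sum(row) / len(row) for row in data]
    order = sorted(range(len(labels)), key=lambda i: averages[i], reverse=True)[:n]
    return [labels[i] for i in order], [data[i] for i in order]

species = ["s__A", "s__B", "s__C"]
abundances = [[10, 20], [40, 60], [5, 5]]
print(top_rows(species, abundances, 2))
# (['s__B', 's__A'], [[40, 60], [10, 20]])
```

In the template, the same kind of selection reduces the full species table to the five most abundant species before plotting.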
Running this workflow with the default options will just generate the new pdf document since all of the other tasks have already been run. The new pdf document will have a project name of "Demo project" as this is the default project name if a project name is not provided on the command line. To run the workflow to generate the new document, run the following command:
$ python myworkflow.py --input fastq_folder --output output_folder
An example document running with two small demo samples is shown below.
Workflow examples
For examples of workflows written with AnADAMA2, see the bioBakery Workflows software.