
anadama2

Vic_ks edited this page May 20, 2020 · 5 revisions

AnADAMA2 tutorial

AnADAMA2 is the next generation of AnADAMA (Another Automated Data Analysis Management Application). AnADAMA is a tool to create reproducible workflows and execute them efficiently. Tasks can be run locally or in a grid computing environment to increase efficiency. Essential information from all tasks is printed to the screen and logged to ensure reproducibility. An auto-doc feature allows workflows to generate documentation automatically to further ensure reproducibility.




Prerequisites

  • Python (version >= 2.7)
  • Pweave (only required for auto-doc workflows, automatically installed)
  • Pandoc (only required for auto-doc workflows)
  • matplotlib (only required for auto-doc workflows)
  • LaTeX (only required for auto-doc workflows)
  • hclust2 (only required for auto-doc workflows with hclust2 heatmaps)

Installation

Install AnADAMA2 and dependencies with the following command:

$ pip install anadama2

Add the option --user to the install command if you do not have root permissions.

This tutorial also requires MetaPhlAn2 to be installed. For information on how to install MetaPhlAn2, see the MetaPhlAn2 tutorial. The auto-doc section of this tutorial requires bioBakery workflows to be installed as it uses some of the utility functions from this software package. See the bioBakery Workflows User Manual for information on how to install.


How to write a workflow

Collect tasks

The first step before writing a workflow is to collect the set of tasks the workflow will run. In this tutorial we will use three commands that are run directly on the command line. These commands run MetaPhlAn2 on two fastq input files:

$ metaphlan2.py sample1.fastq --input_type fastq --no_map > sample1_profile.txt
$ metaphlan2.py sample2.fastq --input_type fastq --no_map > sample2_profile.txt
$ merge_metaphlan_tables.py sample1_profile.txt sample2_profile.txt > all_profiles.txt

Write a workflow

Next, write your initial AnADAMA2 workflow. Open a file named myworkflow.py in any text editor and add the following lines of python code (lines starting with "#" are comments):

# import anadama2 and create a workflow instance, removing the options input/output
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow", remove_options=["input","output"])

# add tasks to your workflow (there are three tasks corresponding to the three commands)
workflow.add_task(
    "metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
    depends=["sample1.fastq"], targets=["sample1_profile.txt"])
workflow.add_task(
    "metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
    depends=["sample2.fastq"], targets=["sample2_profile.txt"])
workflow.add_task(
    "merge_metaphlan_tables.py [depends[0]] [depends[1]] > [targets[0]]",
    depends=["sample1_profile.txt","sample2_profile.txt"], targets=["all_profiles.txt"])

# call the go function to indicate that all tasks have been added to the workflow
workflow.go()
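The bracketed placeholders in the task commands ([depends[0]], [targets[0]]) are expanded by AnADAMA2 at run time into the corresponding file names. Conceptually, the substitution works like this minimal sketch (fill_template is a hypothetical helper written for illustration, not the AnADAMA2 API):

```python
import re

def fill_template(cmd, depends=(), targets=(), args=()):
    # sketch: replace [depends[i]], [targets[i]], and [args[i]] in the command
    # string with the i-th entry of the corresponding list
    mapping = {"depends": list(depends), "targets": list(targets), "args": list(args)}
    def repl(match):
        return str(mapping[match.group(1)][int(match.group(2))])
    return re.sub(r"\[(depends|targets|args)\[(\d+)\]\]", repl, cmd)

print(fill_template(
    "metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
    depends=["sample1.fastq"], targets=["sample1_profile.txt"]))
# metaphlan2.py sample1.fastq --input_type fastq --no_map > sample1_profile.txt
```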

Download the following demo files to your current working directory prior to running the workflow:

Run a workflow

Now you can run the workflow by running the python file you just created. Running the same command a second time, after all tasks have completed in the first run, will cause the workflow to skip the tasks because their output files already exist and are up-to-date. If you want to run again without skipping any tasks, add the option --skip-nothing:

# all tasks are run
$ python myworkflow.py

# all tasks are skipped (because they were just run and everything is up-to-date)
$ python myworkflow.py

# all tasks are run (because of the flag applied)
$ python myworkflow.py --skip-nothing
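The skipping behavior works because AnADAMA2 tracks each task's dependencies and targets and only reruns a task whose targets are missing or out of date. A simplified sketch of that check, assuming timestamp-based tracking only (the real tracker records more state than this):

```python
import os
import tempfile
import time

def is_up_to_date(targets, depends):
    # sketch: a task can be skipped when every target exists and every target
    # is at least as new as the newest dependency
    if not all(os.path.exists(t) for t in targets):
        return False
    newest_dep = max(os.path.getmtime(d) for d in depends)
    return all(os.path.getmtime(t) >= newest_dep for t in targets)

# toy demonstration with temporary files standing in for fastq/profile files
with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "sample1.fastq")
    target = os.path.join(tmp, "sample1_profile.txt")
    open(dep, "w").close()
    print(is_up_to_date([target], [dep]))   # False: target does not exist yet
    time.sleep(0.01)
    open(target, "w").close()
    print(is_up_to_date([target], [dep]))   # True: target is newer than its dependency
```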

Make your workflow generic

Once you have a workflow running, you might want to make it more generic so the input file names do not have to match those in the workflow. You can modify the workflow to search an input folder for files with a specific extension and then run all of those files through the tasks in your workflow. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:

# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")

# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
args = workflow.parse_args()

# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")

# add tasks to your workflow
workflow.add_task_group(
    "metaphlan2.py [depends[0]] --input_type fastq --no_map > [targets[0]]",
    depends=input_fastq_files, targets=output_profiles)
workflow.add_task(
    "merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
    depends=output_profiles, targets=output_merged, args=args.output)

# call the go function to indicate that all tasks have been added to the workflow
workflow.go()

Compare this workflow with the original to see what makes the tasks independent of the input file names. Code line 2 no longer removes the default arguments input and output, so the workflow can now be run with "--input" and "--output" on the command line. Code line 3 adds an argument to specify the extension of the input files, and line 4 parses the command line arguments provided by the user when running the workflow. Code line 5 finds all of the files in the input folder with the extension provided by the user (or uses the default extension from line 3 if none was provided). Code line 6 names the output files based on the names of the input files, so they keep the same identifiers. Finally, the first two add_task calls have been replaced by a single add_task_group, which applies the same command to every task in the group.
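To see what name_output_files produces for a given input, here is a minimal sketch of the naming pattern used above (an illustration of the behavior, not the actual AnADAMA2 implementation):

```python
import os

def name_output_files(input_files, output_folder, tag, extension):
    # hypothetical sketch: keep each input's base name, append the tag,
    # and swap the extension, placing results in the output folder
    named = []
    for path in input_files:
        base = os.path.splitext(os.path.basename(path))[0]
        named.append(os.path.join(output_folder, base + "_" + tag + "." + extension))
    return named

print(name_output_files(["fastq_folder/sample1.fastq"], "output_folder", "profile", "txt"))
# ['output_folder/sample1_profile.txt']
```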

Run a generic workflow

Now try running your workflow again, using the new input and output command line options. First you might want to create a new folder and move the fastq files into it. Alternatively, you can provide the folder the input files are currently located in as the input folder:

$ python myworkflow.py --input fastq_folder --output output_folder

After running, check out the output folder to see the profile files it now contains.

Allow each task to use multiple cores

MetaPhlAn2 has an option to speed up the run by using multiple threads. With a small change to the workflow, the user can specify how many threads each MetaPhlAn2 task should use. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:

# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")

# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
workflow.add_argument("threads", desc="the number of threads for each task", default=1)
args = workflow.parse_args()

# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")

# add tasks to your workflow
workflow.add_task_group(
    "metaphlan2.py [depends[0]] --nproc [args[0]] --input_type fastq --no_map > [targets[0]]",
    depends=input_fastq_files, targets=output_profiles, args=[args.threads])
workflow.add_task(
    "merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
    depends=output_profiles, targets=output_merged, args=args.output)

# call the go function to indicate that all tasks have been added to the workflow
workflow.go()

The workflow now includes a second add_argument, which defaults the threads argument to one. Also, the add_task_group command now includes the option "--nproc", which passes the thread count to each MetaPhlAn2 task.

Run a workflow using multiple cores

Run the workflow again, this time allowing each MetaPhlAn2 task three cores (one for each thread). Add the option --skip-nothing since you have already run the workflow once and the output folder is full of up-to-date output files, but you would like to run all tasks again. This command will run one task at a time:

$ python myworkflow.py --input fastq_folder --output output_folder --threads 3 --skip-nothing

Next, run again, this time adding a new option that will run two tasks at once on your local machine, each task using three cores:

$ python myworkflow.py --input fastq_folder --output output_folder --threads 3 --local-jobs 2 --skip-nothing

Allow each task to use the grid

Now you can make a small final adjustment to your workflow so that the MetaPhlAn2 tasks will be run on your grid computing environment instead of your local machine. This is convenient if you have hundreds of fastq files to process through this workflow. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:

# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")

# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
workflow.add_argument("threads", desc="the number of threads for each task", default=1)
args = workflow.parse_args()

# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")

# add tasks to your workflow 
workflow.add_task_group_gridable(
    "metaphlan2.py [depends[0]] --nproc [args[0]] --input_type fastq --no_map > [targets[0]]",
    depends=input_fastq_files, targets=output_profiles, args=[args.threads],
    time=3*60, mem=12*1024, cores=args.threads)
workflow.add_task(
    "merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
    depends=output_profiles, targets=output_merged, args=args.output)

# call the go function to indicate that all tasks have been added to the workflow
workflow.go()

These modifications change the add_task_group function to add_task_group_gridable and add estimates of the time, memory, and cores each task needs on the grid. In this example, we request 3 hours and 12GB of memory from the grid for each MetaPhlAn2 task.
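The time and mem arguments are plain integers, so the expressions in the workflow are simple unit conversions; since the text equates them to 3 hours and 12GB, time is evidently given in minutes and mem in megabytes:

```python
# grid resource requests as used in the workflow above:
# time is in minutes, mem is in megabytes
time_minutes = 3 * 60    # 3 hours requested per MetaPhlAn2 task
mem_mb = 12 * 1024       # 12 GB requested per MetaPhlAn2 task
print(time_minutes, mem_mb)
# 180 12288
```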

Run a workflow on the grid

Running this workflow with the default options will run all tasks on your local machine. If you add the grid jobs option, it will launch at most the number of jobs you specify at one time. For example, if you request 10 grid jobs, then at most 10 grid jobs will be in the queue at any time. By default, the grid system is auto-detected on your local machine; to specify it explicitly, use the option --grid {slurm|sge}. To specify the partition, use the option --grid-partition {general}. To run the workflow with 10 grid jobs, each with 8 cores, run the following command:

$ python myworkflow.py --input fastq_folder --output output_folder --threads 8 --grid-jobs 10 --skip-nothing

Add an auto-generated document

Next, you can add an auto-generated pdf document that shows a plot and a heatmap of the data in the merged taxonomic profile. This report can even have a custom project name. Open the file named myworkflow.py in any text editor and change the python code to include the following lines:

# import anadama2 and create a workflow instance
from anadama2 import Workflow
workflow = Workflow(version="0.1", description="MetaPhlAn2 workflow")

# add custom arguments and parse arguments from the command line
workflow.add_argument("input-extension", desc="the extensions of the input files", default="fastq")
workflow.add_argument("threads", desc="the number of threads for each task", default=1)
workflow.add_argument("project-name",  desc="the name of the project", default="Demo project")
args = workflow.parse_args()

# get input/output file names in the input/output folders provided on the command line
input_fastq_files = workflow.get_input_files(extension=args.input_extension)
output_profiles = workflow.name_output_files(name=input_fastq_files, tag="profile", extension="txt")
output_merged = workflow.name_output_files("all_profiles.txt")

# add tasks to your workflow 
workflow.add_task_group_gridable(
    "metaphlan2.py [depends[0]] --nproc [args[0]] --input_type fastq --no_map > [targets[0]]",
    depends=input_fastq_files, targets=output_profiles, args=[args.threads],
    time=3*60, mem=12*1024, cores=args.threads)
workflow.add_task(
    "merge_metaphlan_tables.py [args[0]]/*profile.txt > [targets[0]]",
    depends=output_profiles, targets=output_merged, args=args.output)

# add a document task
pdf_output = workflow.name_output_files("taxonomy_profile.pdf")

workflow.add_document(
    templates="taxonomy_template.py",
    depends=output_merged,
    targets=pdf_output,
    vars={"taxonomic_profile":output_merged,
        "project":args.project_name})    

# call the go function to indicate that all tasks have been added to the workflow
workflow.go()

These modifications add one more command line argument, the project name, which will be included in the report. They also add an add_document task to the workflow to generate the document. The document is generated from a python file that includes comments in markdown format. To create the template file, open a file named taxonomy_template.py in any text editor and add the following lines of code. Save this file in the same folder as the workflow file:

#' % <% from anadama2 import PweaveDocument; document=PweaveDocument(); print("Taxonomy Report") %>
#' % Project: <% vars = document.get_vars(); print(vars["project"]) %>
#' % Date: <% import time; print(time.strftime("%m/%d/%Y")) %>
#+ echo=False
from biobakery_workflows import utilities

samples, taxonomy, data = document.read_table(vars["taxonomic_profile"])
species_taxonomy, species_data = utilities.filter_species(taxonomy,data)

top_taxonomy, top_data = utilities.top_rows(species_taxonomy, species_data, 5, "average")
document.plot_stacked_barchart(top_data, top_taxonomy, samples,
    title="Top 5 species by average abundance", ylabel="% Predicted composition",
    legend_title="Species")

#' This project analyzed <% print(len(samples)) %> samples in total.
#' For these samples, the species with the highest predicted abundance was <% print(top_taxonomy[0]) %>.

#+ echo=False
document.show_hclust2(samples, top_taxonomy, top_data, title="Top 5 species by average abundance")

The template file is a python script with markdown formatting in the comments. It uses the AnADAMA2 document class to get the variables provided by the workflow. It reads the taxonomic profile table created by the workflow and extracts the data for just the species. It uses a bioBakery workflows utility function to simplify the filtering and to extract the top rows. Finally, it adds a stacked barchart and a heatmap generated with hclust2.
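The top-rows selection can be pictured with a small sketch (a hypothetical stand-in for the biobakery_workflows utility, whose exact behavior may differ): rank species by mean abundance across samples and keep the n highest.

```python
def top_rows(row_names, data, n, method="average"):
    # illustrative sketch: rank rows by mean abundance and keep the top n
    # (the real biobakery_workflows utility may behave differently)
    averages = [sum(row) / len(row) for row in data]
    order = sorted(range(len(row_names)), key=lambda i: averages[i], reverse=True)[:n]
    return [row_names[i] for i in order], [data[i] for i in order]

# toy taxonomic profile: three species across two samples
names = ["s__A", "s__B", "s__C"]
abundances = [[1.0, 2.0], [10.0, 12.0], [5.0, 4.0]]
print(top_rows(names, abundances, 2))
# (['s__B', 's__C'], [[10.0, 12.0], [5.0, 4.0]])
```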

Run a workflow to auto-generate a document

Running this workflow with the default options will just generate the new pdf document, since all of the other tasks have already been run. The new pdf document will have a project name of "Demo project", the default used when no project name is provided on the command line. To run the workflow to generate the new document, run the following command:

$ python myworkflow.py --input fastq_folder --output output_folder

An example document generated from two small demo samples includes the stacked barchart of the top species followed by the hclust2 heatmap.

Workflow examples

For examples of workflows written with AnADAMA2, see the bioBakery Workflows software.
