### AnADAMA2 Example: A workflow to download files in parallel

[AnADAMA2](http://huttenhower.sph.harvard.edu/anadama2) is the next generation of AnADAMA (Another Automated Data Analysis Management Application). AnADAMA is a tool to create reproducible workflows and execute them efficiently. Tasks can be run locally or in a grid computing environment to increase efficiency. Essential information from all tasks is recorded by the default logger and command line reporters to ensure reproducibility. An auto-doc feature lets workflows generate their documentation automatically, which further supports reproducibility by capturing the latest essential workflow information. AnADAMA2 is designed to be modular: users can customize the application by subclassing the base grid meta-schedulers, reporters, and tracked objects (e.g., files and executables).

* For additional information, see the [AnADAMA2 User Manual](https://bitbucket.org/biobakery/anadama2) or the [AnADAMA2 Tutorial](https://bitbucket.org/biobakery/biobakery/wiki/anadama2).
* For more example workflows, download the AnADAMA2 software source and demos ([anadama2.tar.gz](https://pypi.python.org/pypi/anadama2)).
* Please direct questions to the [AnADAMA Google Group](https://groups.google.com/forum/#!forum/anadama-users).
                                                        
**This example shows how to write a simple AnADAMA2 workflow to download three files.**


**Step 1:** Import the ``Workflow`` class from ``anadama2``.

In [1]:
from anadama2 import Workflow

**Step 2:** Create a workflow instance.
Since we are using Jupyter, we need to turn off the command line interface for the workflow by setting ``cli=False``.
The command line interface is helpful when executing a workflow directly from the command line.
It allows the user to provide options like input/output folders at run-time.


In [2]:
workflow = Workflow(cli=False)
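As a rough analogy for what a command line interface provides, the sketch below uses only the standard library ``argparse`` module to read input/output folders at run-time. The ``parse_options`` helper and its option names are assumptions made for illustration; this is not AnADAMA2's own parser. In a notebook there is no user-supplied command line to parse, so the options are passed in explicitly.

```python
import argparse

def parse_options(argv):
    """Parse input/output folder options (hypothetical stand-in for a workflow CLI)."""
    parser = argparse.ArgumentParser(description="Download files workflow")
    parser.add_argument("-i", "--input", default=".", help="folder with input files")
    parser.add_argument("-o", "--output", default=".", help="folder for output files")
    return parser.parse_args(argv)

# in a script this would be parse_options(sys.argv[1:]);
# in a notebook the options are passed explicitly instead
options = parse_options(["--output", "downloads"])
print(options.input, options.output)
```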

**Step 3:** Add tasks to the workflow. In this example a task is added for each file that needs to be downloaded.
We also track the executable (``wget``) used to download the files. This causes the tasks to rerun if the version of the
executable changes, and it logs the version of the executable when the tasks are run.

In [3]:
# import the TrackedExecutable class
from anadama2.tracked import TrackedExecutable

# set the list of urls to download
downloads = [
    "ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011175.fsa.gz",
    "ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011273.fsa.gz",
    "ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011180.fsa.gz",
]

# add a task to the workflow to download each url
for link in downloads:
    workflow.add_task(
        "wget -O [targets[0]] [args[0]]",
        depends=TrackedExecutable("wget"),
        targets=link.split("/")[-1],
        args=link) 
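To make the placeholders concrete, the sketch below mimics how a command template such as ``wget -O [targets[0]] [args[0]]`` could be expanded for one URL. The ``render_command`` helper is hypothetical and written only for illustration; it is not AnADAMA2's implementation, which performs the substitution itself when the task is run.

```python
import re

def render_command(template, targets, args):
    """Expand [targets[i]] and [args[i]] placeholders in a command template.

    Hypothetical helper for illustration only; not part of AnADAMA2.
    """
    values = {"targets": targets, "args": args}

    def substitute(match):
        # group 1 is the list name, group 2 is the index into that list
        name, index = match.group(1), int(match.group(2))
        return values[name][index]

    return re.sub(r"\[(targets|args)\[(\d+)\]\]", substitute, template)


link = "ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011175.fsa.gz"
# the target file name is the last component of the url
target = link.split("/")[-1]

command = render_command("wget -O [targets[0]] [args[0]]", [target], [link])
print(command)
```

Deriving the target from the URL keeps each task's output file name unique, so the three download tasks do not overwrite each other.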

**Step 4:** Now let's list the current working directory to check whether any of the files have already been downloaded.
We don't expect to see any of them yet: the tasks have only been added to the workflow.
The tasks have not yet been run.

In [4]:
# check the current working directory to see we do not have any of the files downloaded yet
import os
os.listdir(".")

['.ipynb_checkpoints', 'AnADAMA2_download_files_example.ipynb']

**Step 4 (continued):** Run the workflow. Calling ``go`` runs the tasks in the workflow. Setting ``dry_run=True`` performs a dry run,
which only shows the tasks that would be run instead of actually running them.

In [5]:
workflow.go(dry_run=True)

0 - Task0
  Dependencies (1)
  - /usr/bin/wget (Executable)
  Targets (1)
  - /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011175.fsa.gz (Big File)
  Actions (1)
  - wget -O /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011175.fsa.gz ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011175.fsa.gz (command)
------------------
2 - Task2
  Dependencies (1)
  - /usr/bin/wget (Executable)
  Targets (1)
  - /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011273.fsa.gz (Big File)
  Actions (1)
  - wget -O /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011273.fsa.gz ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011273.fsa.gz (command)
------------------
3 - Task3
  Dependencies (1)
  - /usr/bin/wget (Executable)
  Targets (1)
  - /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011180.fsa.gz (Big File)
  Actions (1)
  - wget -O /work/code/anadama/anadama2/jupyter_not

**Step 5:** Run the workflow again, this time without dry run mode, to execute the tasks and download the files.

In [6]:
workflow.go()

(Jun 06 11:50:19) [0/3 -   0.00%] **Ready    ** Task 2: wget
(Jun 06 11:50:19) [0/3 -   0.00%] **Started  ** Task 2: wget
(Jun 06 11:50:21) [1/3 -  33.33%] **Completed** Task 2: wget
(Jun 06 11:50:21) [1/3 -  33.33%] **Ready    ** Task 3: wget
(Jun 06 11:50:21) [1/3 -  33.33%] **Started  ** Task 3: wget
(Jun 06 11:50:24) [2/3 -  66.67%] **Completed** Task 3: wget
(Jun 06 11:50:24) [2/3 -  66.67%] **Ready    ** Task 0: wget
(Jun 06 11:50:24) [2/3 -  66.67%] **Started  ** Task 0: wget
(Jun 06 11:50:27) [3/3 - 100.00%] **Completed** Task 0: wget
Run Finished


**Step 6:** Check the current working directory to see the files have been downloaded.

In [7]:
os.listdir(".")

['SRS011180.fsa.gz',
 '.ipynb_checkpoints',
 'anadama.log',
 'SRS011273.fsa.gz',
 'SRS011175.fsa.gz',
 'AnADAMA2_download_files_example.ipynb']

**Step 7:** Run the workflow again to see that all tasks are skipped because the files are already downloaded.

In [8]:
workflow.go()

(Jun 06 11:50:35) [1/3 -  33.33%] **Skipped  ** Task 0: wget
(Jun 06 11:50:35) [2/3 -  66.67%] **Skipped  ** Task 3: wget
(Jun 06 11:50:35) [3/3 - 100.00%] **Skipped  ** Task 2: wget
Run Finished
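The skipping behavior can be approximated with a simple existence check. The sketch below is a deliberate simplification: as described in Step 3, AnADAMA2 also reacts to changes in tracked dependencies such as the ``wget`` executable, whereas the hypothetical ``needs_run`` helper only checks whether the target files exist.

```python
import os
import tempfile

def needs_run(targets):
    """Return True if any target file is missing.

    Hypothetical helper; a simplified stand-in for a workflow's skip check.
    """
    return any(not os.path.exists(path) for path in targets)


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "SRS011180.fsa.gz")

    first = needs_run([target])    # target missing, so the task would run
    open(target, "w").close()      # simulate a finished download
    second = needs_run([target])   # target present, so the task would be skipped
    os.remove(target)              # simulate deleting the download
    third = needs_run([target])    # target missing again, so the task would rerun

print(first, second, third)
```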


**Step 8:** Delete one of the downloads and run in dry run mode to see that only the file that was deleted
will be downloaded if the workflow is run again.

In [9]:
# delete one of the files
os.remove("SRS011180.fsa.gz")
# then execute a dry run to see what will be run
workflow.go(dry_run=True)

(Jun 06 11:50:46) [1/3 -  33.33%] **Skipped  ** Task 0: wget
(Jun 06 11:50:46) [2/3 -  66.67%] **Skipped  ** Task 2: wget
3 - Task3
  Dependencies (1)
  - /usr/bin/wget (Executable)
  Targets (1)
  - /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011180.fsa.gz (Big File)
  Actions (1)
  - wget -O /work/code/anadama/anadama2/jupyter_notebooks/download_files_example/SRS011180.fsa.gz ftp://public-ftp.hmpdacc.org/HM16STR/by_sample/SRS011180.fsa.gz (command)
------------------
Run Finished
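The dry run above reports Task 3 without executing it. A minimal sketch of that pattern, assuming a hypothetical ``execute`` helper that is unrelated to AnADAMA2's internals, is a flag that switches between printing a command and running it. A harmless Python one-liner stands in for the ``wget`` command so that nothing is downloaded.

```python
import subprocess
import sys

def execute(commands, dry_run=False):
    """Run each command, or only report it when dry_run is True.

    Hypothetical helper for illustration; returns the commands actually executed.
    """
    executed = []
    for command in commands:
        if dry_run:
            print("Would run:", " ".join(command))
            continue
        subprocess.run(command, check=True)
        executed.append(command)
    return executed


# stand-in for a download command: a Python one-liner that does nothing
commands = [[sys.executable, "-c", "pass"]]

dry = execute(commands, dry_run=True)   # prints the command, runs nothing
real = execute(commands)                # runs the command
print(len(dry), len(real))
```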


**Step 9:** Run the workflow to download the single file that we just deleted.

In [10]:
workflow.go()

(Jun 06 11:50:53) [1/3 -  33.33%] **Skipped  ** Task 0: wget
(Jun 06 11:50:53) [2/3 -  66.67%] **Skipped  ** Task 2: wget
(Jun 06 11:50:53) [2/3 -  66.67%] **Ready    ** Task 3: wget
(Jun 06 11:50:53) [2/3 -  66.67%] **Started  ** Task 3: wget
(Jun 06 11:50:55) [3/3 - 100.00%] **Completed** Task 3: wget
Run Finished


**Step 10:** Import the AnADAMA2 reporter that logs information when workflows run, and use it to print the commands that were
run for this workflow.

In [11]:
# get the commands run in the workflow from the log
from anadama2.reporters import LoggerReporter
LoggerReporter.read_log("anadama.log","commands")

['wget -O SRS011273.fsa.gz SRS011273.fsa.gz',
 'wget -O SRS011180.fsa.gz SRS011180.fsa.gz',
 'wget -O SRS011175.fsa.gz SRS011175.fsa.gz']

**Step 11:** Print the versions for the tracked executables for this workflow.

In [12]:
LoggerReporter.read_log("anadama.log","versions")

['GNU Wget 1.18 built on linux-gnu.']
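Recording an executable's version generally amounts to running it with a version flag and keeping the output. The sketch below illustrates that general idea under stated assumptions; it is not AnADAMA2's code. The hypothetical ``first_version_line`` helper is demonstrated on the Python interpreter rather than ``wget`` so that it runs on any machine.

```python
import subprocess
import sys

def first_version_line(executable, flag="--version"):
    """Return the first line an executable prints for its version flag.

    Hypothetical helper for illustration; not part of AnADAMA2.
    """
    result = subprocess.run(
        [executable, flag],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print the version to stderr
        universal_newlines=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


version = first_version_line(sys.executable)
print(version)
```

Comparing such a string between runs is one simple way to detect that an executable has changed.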

**Step 12:** Now rerun the full workflow with up to three tasks running at once (``jobs=3``), and with ``skip_nothing=True`` so that no tasks are skipped even though the files already exist.

In [13]:
# rerun all commands in the workflow, this time executing all three downloads at once
workflow.go(jobs=3, skip_nothing=True)

(Jun 06 11:51:16) [0/3 -   0.00%] **Started  ** Task 2: wget
(Jun 06 11:51:16) [0/3 -   0.00%] **Started  ** Task 3: wget
(Jun 06 11:51:16) [0/3 -   0.00%] **Started  ** Task 0: wget
(Jun 06 11:51:16) [0/3 -   0.00%] **Ready    ** Task 2: wget
(Jun 06 11:51:16) [0/3 -   0.00%] **Ready    ** Task 3: wget
(Jun 06 11:51:16) [0/3 -   0.00%] **Ready    ** Task 0: wget
(Jun 06 11:51:18) [1/3 -  33.33%] **Completed** Task 3: wget
(Jun 06 11:51:19) [2/3 -  66.67%] **Completed** Task 2: wget
(Jun 06 11:51:19) [3/3 - 100.00%] **Completed** Task 0: wget
Run Finished
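Running several independent tasks at once follows the familiar worker-pool pattern. The sketch below uses Python's ``concurrent.futures`` with three workers as an analogy for ``jobs=3``; it does not show how AnADAMA2 schedules tasks internally, and the hypothetical ``fake_download`` function sleeps instead of contacting the FTP server.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(name, seconds=0.2):
    """Pretend to download a file by sleeping (hypothetical stand-in for wget)."""
    time.sleep(seconds)
    return name


names = ["SRS011175.fsa.gz", "SRS011273.fsa.gz", "SRS011180.fsa.gz"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    # map returns results in input order, even if tasks finish out of order
    finished = list(pool.map(fake_download, names))
elapsed = time.perf_counter() - start

print(finished)
print("elapsed: %.2f seconds" % elapsed)
```

Because the three tasks do not depend on each other, the pool can work on them simultaneously, which is why the parallel run in the log above starts all three tasks at the same timestamp.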
