
Improvements for Conda tutorials.

jmchilton committed May 23, 2018
1 parent e6efb0f commit 2002b49c39b307de182954a44994c56401e6bc50
@@ -0,0 +1,22 @@
.. note:: *Why Conda?*

Many different package managers could potentially be targeted here, but we focus on Conda_
for a few key reasons.

* No compilation at install time - packages ship as binaries with their dependencies and libraries
* Support for all operating systems
* Easy to manage multiple versions of the same recipe
* HPC-ready: no root privileges needed
* Easy-to-write YAML recipes
* Vibrant communities

.. note:: **Conda Terminology**

.. figure:: http://galaxyproject.github.io/training-material/topics/dev/images/miniconda_vs_anaconda.png
:alt: Diagram describing the relationship between Conda, Miniconda, and Anaconda.

Conda *recipes* build *packages* that are published to *channels*.

Planemo is set up to target a few channels by default: ``iuc``, ``bioconda``,
``conda-forge``, and ``defaults``. The whole dependency management scheme outlined here works a lot
better if packages can be found in one of these "best practice" channels.
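
To make the role of channels concrete, here is a minimal sketch that builds (but does not run) a
``conda search`` command restricted to the channels listed above. The helper name
``conda_search_command`` is hypothetical, and this is not necessarily how Planemo invokes Conda
internally; it only illustrates what "targeting" a fixed set of channels can look like.

```python
# Channel names follow the list above, using the ``conda-forge`` spelling Conda itself uses.
BEST_PRACTICE_CHANNELS = ["iuc", "bioconda", "conda-forge", "defaults"]


def conda_search_command(package, version=None, channels=BEST_PRACTICE_CHANNELS):
    """Build (but do not run) a ``conda search`` command limited to ``channels``.

    Hypothetical helper for illustration only.
    """
    spec = package if version is None else "{}={}".format(package, version)
    # --override-channels makes Conda ignore channels configured elsewhere.
    command = ["conda", "search", "--override-channels"]
    for channel in channels:
        command.extend(["--channel", channel])
    command.append(spec)
    return command


print(" ".join(conda_search_command("seqtk", "1.2")))
```

If the printed command finds the package when run in a shell, the requirement is available in
one of the best practice channels.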
@@ -41,6 +41,8 @@ will attempt to install Conda, check for referenced packages (such as
perspective but other dependency resolution techniques are covered in
the `Galaxy docs <https://docs.galaxyproject.org/en/latest/admin/dependency_resolvers.html>`__.

.. include:: _writing_conda_overview.rst

We can check if the requirements on a tool are available in best practice
Conda channels using an extended form of the ``planemo lint`` command. Passing
the ``--conda_requirements`` flag will ensure all listed requirements are found.
@@ -11,25 +11,25 @@ Specifying and Using `Software Requirements`_

.. note:: Why not just use containers?

Containers are great, use containers (be it Docker, Singularity, etc.) whenever possible to
increase reproducibility and portability - but building ad hoc containers to support CWL
tools has some limitations that this document describes a process for addressing.
Containers are great. Use containers (be it Docker_, Singularity_, etc.) whenever possible to
increase the reproducibility and portability of your tools and workflows. Building ad hoc containers
to support CWL tools (e.g. custom ``Dockerfile`` definitions) has serious limitations, however. In the next
tutorial on containers we will argue that using BioContainers_ built or discovered
from your tool's `Software Requirements`_ is a superior approach.

There are technical reasons to describe `Software Requirements`_ in addition or in lieu
of just using ad hoc containers - it will allow your tool to be used in environments without
container runtimes available and the containers built from Conda software requirements are very
likely to be "best practice" (e.g. smaller than ad hoc containers). Perhaps the most important
reasons are less technical however such as reducing the opaqueness of traditional Docker
containers.
Besides leading to better containers, there are other reasons to describe
`Software Requirements`_: doing so allows your tool to be used in environments without
container runtimes available and provides valuable and actionable metadata about the computation
described by the tool.

Read more about this in our preprint `Practical computational reproducibility in the life sciences
<https://www.biorxiv.org/content/early/2017/10/10/200683>`__
Read more about this whole dependency stack in our preprint `Practical computational reproducibility
in the life sciences <https://www.biorxiv.org/content/early/2017/10/10/200683>`__.

The Common Workflow Language specification loosely describes
The `Common Workflow Language`_ specification loosely describes
`Software Requirements`_ - a way to map CWL hints to packages, environment
modules, or any other mechanism to describe dependencies for running a tool
outside of a container. The large and active Galaxy tool development community
has built a library and set of best practices for describing dependencies
has built an open source library and set of best practices for describing dependencies
for Galaxy that should work just as well for CWL. The library has been integrated
with cwltool_ and Toil_ to enable CWL tool authors and users to leverage the
power and flexibility of the Galaxy dependency management and best practices.
@@ -49,9 +49,12 @@ a ``SoftwareRequirement`` in the form the following the YAML fragment::
version:
- "1.2"

Planemo (and cwltool_ and Toil_) can interpret these ``SoftwareRequirement`` in varoius ways
including as Conda packages and install Conda packages referenced this way (such as ``seqtk``),
and install them as needed for tool testing.
Planemo (and cwltool_ and Toil_) can interpret these ``SoftwareRequirement`` annotations in various ways,
including as Conda packages. When interpreting them as Conda packages,
these runtimes can set up isolated, reproducible Conda environments for tool execution with the correct
packages installed (e.g. ``seqtk`` in the above example).
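
As a rough illustration of that interpretation step, the sketch below turns a ``SoftwareRequirement``
hint into ``conda create`` commands shaped like the one Planemo prints for ``seqtk`` (an environment
named ``__seqtk@1.2``). The function names are hypothetical, the tool is represented as an already
parsed dictionary with ``hints`` in map form, and every package is assumed to list at least one version.
This is not the actual code used by Planemo, cwltool_, or Toil_.

```python
def conda_targets(tool):
    """Yield (package, version) pairs from a tool's SoftwareRequirement hint.

    Hypothetical helper; assumes ``hints`` is a mapping and each package has a version.
    """
    requirement = tool.get("hints", {}).get("SoftwareRequirement", {})
    for entry in requirement.get("packages", []):
        # For simplicity this sketch only uses the first listed version.
        yield entry["package"], entry["version"][0]


def conda_create_command(package, version):
    """Build (but do not run) a ``conda create`` command for one target."""
    env_name = "__{}@{}".format(package, version)
    spec = "{}={}".format(package, version)
    return ["conda", "create", "-y", "--name", env_name, spec]


# Parsed form of the YAML fragment shown above.
tool = {
    "hints": {
        "SoftwareRequirement": {
            "packages": [{"package": "seqtk", "version": ["1.2"]}]
        }
    }
}

for package, version in conda_targets(tool):
    print(" ".join(conda_create_command(package, version)))
# prints: conda create -y --name __seqtk@1.2 seqtk=1.2
```

Giving each package and version pair its own named environment is what keeps different versions of the
same package from interfering with one another.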

.. include:: _writing_conda_overview.rst

We can check if the requirements on a tool are available in best practice
Conda channels using an extended form of the ``planemo lint`` command (``planemo lint`` was
@@ -81,7 +84,7 @@ cwltool_ and Toil_).

$ planemo conda_install seqtk_seq.cwl
Install conda target CondaTarget[seqtk,version=1.2]
/home/john/miniconda2/bin/conda create -y --name __seqtk@1.2 seqtk=1.2
/home/john/miniconda3/bin/conda create -y --name __seqtk@1.2 seqtk=1.2
Fetching package metadata ...............
Solving package specifications: ..........

@@ -178,6 +181,184 @@ demonstrating using this tool.
Since ``seqtk`` isn't on the path and we did not use a container, we can see the SoftwareRequirement
resolution was successful and it found the environment we previously installed with ``conda_install``.

This can be used outside of Planemo testing as well. The following invocation shows running a job
with cwltool_ using an environment like the one created above:

::

$ cwltool --no-container --beta-conda-dependencies seqtk_seq.cwl seqtk_seq_job.yml
/Users/john/workspace/planemo/.venv/bin/cwltool 1.0.20180508202931
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
No handlers could be found for logger "rdflib.term"
[job seqtk_seq.cwl] /private/tmp/docker_tmpDQYeqC$ seqtk \
seq \
-a \
/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpQwBqPo/stg8cf2282a-d807-4f90-b94d-feeda004cacd/2.fastq > /private/tmp/docker_tmpDQYeqC/out
PREFIX=/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/cwltool_deps/_conda
installing: python-3.6.3-h47c878a_7 ...
Python 3.6.3 :: Anaconda, Inc.
installing: ca-certificates-2017.08.26-ha1e5d58_0 ...
installing: conda-env-2.6.0-h36134e3_0 ...
installing: libcxxabi-4.0.1-hebd6815_0 ...
installing: tk-8.6.7-h35a86e2_3 ...
installing: xz-5.2.3-h0278029_2 ...
installing: yaml-0.1.7-hc338f04_2 ...
installing: zlib-1.2.11-hf3cbc9b_2 ...
installing: libcxx-4.0.1-h579ed51_0 ...
installing: openssl-1.0.2n-hdbc3d79_0 ...
installing: libffi-3.2.1-h475c297_4 ...
installing: ncurses-6.0-hd04f020_2 ...
installing: libedit-3.1-hb4e282d_0 ...
installing: readline-7.0-hc1231fa_4 ...
installing: sqlite-3.20.1-h7e4c145_2 ...
installing: asn1crypto-0.23.0-py36h782d450_0 ...
installing: certifi-2017.11.5-py36ha569be9_0 ...
installing: chardet-3.0.4-py36h96c241c_1 ...
installing: idna-2.6-py36h8628d0a_1 ...
installing: pycosat-0.6.3-py36hee92d8f_0 ...
installing: pycparser-2.18-py36h724b2fc_1 ...
installing: pysocks-1.6.7-py36hfa33cec_1 ...
installing: python.app-2-py36h54569d5_7 ...
installing: ruamel_yaml-0.11.14-py36h9d7ade0_2 ...
installing: six-1.11.0-py36h0e22d5e_1 ...
installing: cffi-1.11.2-py36hd3e6348_0 ...
installing: setuptools-36.5.0-py36h2134326_0 ...
installing: cryptography-2.1.4-py36h842514c_0 ...
installing: wheel-0.30.0-py36h5eb2c71_1 ...
installing: pip-9.0.1-py36h1555ced_4 ...
installing: pyopenssl-17.5.0-py36h51e4350_0 ...
installing: urllib3-1.22-py36h68b9469_0 ...
installing: requests-2.18.4-py36h4516966_1 ...
installing: conda-4.3.31-py36_0 ...
installation finished.
Fetching package metadata .................
Solving package specifications: .

Package plan for installation in environment /Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/cwltool_deps/_conda:

The following packages will be UPDATED:

conda: 4.3.31-py36_0 --> 4.3.33-py36_0 conda-forge

conda-4.3.33-p 100% |#################################################################| Time: 0:00:00 1.13 MB/s


Package plan for installation in environment /Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/cwltool_deps/_conda/envs/__seqtk@1.2:

The following NEW packages will be INSTALLED:

seqtk: 1.2-1 bioconda
zlib: 1.2.11-0 conda-forge


[job seqtk_seq.cwl] completed success
{
"output1": {
"checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
"basename": "out",
"location": "file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
"path": "/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
"class": "File",
"size": 150
}
}
Final process status is success

This demonstrates that cwltool will install the needed packages on the first run. If we rerun cwltool, it will
reuse the previously created environment.

::

$ cwltool --no-container --beta-conda-dependencies seqtk_seq.cwl seqtk_seq_job.yml
/Users/john/workspace/planemo/.venv/bin/cwltool 1.0.20180508202931
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
No handlers could be found for logger "rdflib.term"
[job seqtk_seq.cwl] /private/tmp/docker_tmp4vvE_i$ seqtk \
seq \
-a \
/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpcvQ3Ph/stg2ef3a21c-9fb0-4099-88c2-36e24719901d/2.fastq > /private/tmp/docker_tmp4vvE_i/out
[job seqtk_seq.cwl] completed success
{
"output1": {
"checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
"basename": "out",
"location": "file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
"path": "/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
"class": "File",
"size": 150
}
}
Final process status is success
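
The install-once, reuse-afterwards behavior seen in the two runs above can be sketched as a simple
existence check on the environment directory. This is only an illustration under stated assumptions:
``ensure_environment`` and ``fake_install`` are hypothetical names, a temporary directory stands in for
the ``envs`` directory, and cwltool_'s real logic may differ.

```python
import os
import tempfile


def ensure_environment(envs_dir, name, install):
    """Return the environment path, calling ``install`` only if it is missing."""
    path = os.path.join(envs_dir, name)
    if not os.path.isdir(path):
        install(path)
    return path


calls = []


def fake_install(path):
    # Stand-in for a real ``conda create``: record the call and create the directory.
    calls.append(path)
    os.makedirs(path)


with tempfile.TemporaryDirectory() as envs_dir:
    first = ensure_environment(envs_dir, "__seqtk@1.2", fake_install)
    second = ensure_environment(envs_dir, "__seqtk@1.2", fake_install)

# The installer ran only for the first call; the second call reused the directory.
print(len(calls))
```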

And the same thing is possible with Toil_.

::

$ cwltoil --no-container --beta-conda-dependencies seqtk_seq.cwl seqtk_seq_job.yml
jlaptop17.local 2018-05-23 15:27:25,754 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
jlaptop17.local 2018-05-23 15:27:25,785 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: '92328fb2-33b7-44cd-879f-41d8cbf94555'
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
jlaptop17.local 2018-05-23 15:27:25,787 MainThread INFO cwltool: Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
jlaptop17.local 2018-05-23 15:27:27,002 MainThread WARNING rdflib.term: http://schema.org/docs/!DOCTYPE html does not look like a valid URI, trying to serialize this will break.
jlaptop17.local 2018-05-23 15:27:27,396 MainThread INFO rdflib.plugins.parsers.pyRdfa: Current options:
preserve space : True
output processor graph : True
output default graph : True
host language : RDFa Core
accept embedded RDF : False
check rdfa lite : False
cache vocabulary graphs : False

jlaptop17.local 2018-05-23 15:27:29,797 MainThread INFO toil.common: Using the single machine batch system
jlaptop17.local 2018-05-23 15:27:29,798 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxCores to CPU count of system (8).
jlaptop17.local 2018-05-23 15:27:29,798 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxMemory to physically available memory (17179869184).
jlaptop17.local 2018-05-23 15:27:29,808 MainThread INFO toil.common: Created the workflow directory at /var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/toil-92328fb2-33b7-44cd-879f-41d8cbf94555-132281828025877
jlaptop17.local 2018-05-23 15:27:29,808 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxDisk to physically available disk (202669449216).
jlaptop17.local 2018-05-23 15:27:29,815 MainThread INFO toil.common: User script ModuleDescriptor(dirPath='/Users/john/workspace/planemo/.venv/lib/python2.7/site-packages', name='toil.cwl.cwltoil', fromVirtualEnv=True) belongs to Toil. No need to auto-deploy it.
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: No user script to auto-deploy.
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: Written the environment for the jobs to the environment file
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: Caching all jobs in job store
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: 0 jobs downloaded.
jlaptop17.local 2018-05-23 15:27:29,911 MainThread INFO toil: Running Toil version 3.15.0-0e3a87e738f5e0e7cff64bfdad337d592bd92704.
jlaptop17.local 2018-05-23 15:27:29,911 MainThread INFO toil.realtimeLogger: Real-time logging disabled
jlaptop17.local 2018-05-23 15:27:29,937 MainThread INFO toil.toilState: (Re)building internal scheduler state
2018-05-23 15:27:29,937 - toil.toilState - INFO - (Re)building internal scheduler state
jlaptop17.local 2018-05-23 15:27:29,938 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run
2018-05-23 15:27:29,938 - toil.leader - INFO - Found 1 jobs to start and 0 jobs with successors to run
jlaptop17.local 2018-05-23 15:27:29,938 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs
2018-05-23 15:27:29,938 - toil.leader - INFO - Checked batch system has no running jobs and no updated jobs
jlaptop17.local 2018-05-23 15:27:29,938 MainThread INFO toil.leader: Starting the main loop
2018-05-23 15:27:29,938 - toil.leader - INFO - Starting the main loop
jlaptop17.local 2018-05-23 15:27:29,939 MainThread INFO toil.leader: Issued job 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
2018-05-23 15:27:29,939 - toil.leader - INFO - Issued job 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
jlaptop17.local 2018-05-23 15:27:31,409 MainThread INFO toil.leader: Job ended successfully: 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU
2018-05-23 15:27:31,409 - toil.leader - INFO - Job ended successfully: 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU
jlaptop17.local 2018-05-23 15:27:31,411 MainThread INFO toil.leader: Finished the main loop: no jobs left to run
2018-05-23 15:27:31,411 - toil.leader - INFO - Finished the main loop: no jobs left to run
jlaptop17.local 2018-05-23 15:27:31,411 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...
2018-05-23 15:27:31,411 - toil.serviceManager - INFO - Waiting for service manager thread to finish ...
jlaptop17.local 2018-05-23 15:27:31,946 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 0.535056114197 seconds
2018-05-23 15:27:31,946 - toil.serviceManager - INFO - ... finished shutting down the service manager. Took 0.535056114197 seconds
jlaptop17.local 2018-05-23 15:27:31,947 MainThread INFO toil.statsAndLogging: Waiting for stats and logging collator thread to finish ...
2018-05-23 15:27:31,947 - toil.statsAndLogging - INFO - Waiting for stats and logging collator thread to finish ...
jlaptop17.local 2018-05-23 15:27:31,960 MainThread INFO toil.statsAndLogging: ... finished collating stats and logs. Took 0.0131621360779 seconds
2018-05-23 15:27:31,960 - toil.statsAndLogging - INFO - ... finished collating stats and logs. Took 0.0131621360779 seconds
jlaptop17.local 2018-05-23 15:27:31,961 MainThread INFO toil.leader: Finished toil run successfully
2018-05-23 15:27:31,961 - toil.leader - INFO - Finished toil run successfully
{
"output1": {
"checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
"basename": "out",
"nameext": "",
"nameroot": "out",
"http://commonwl.org/cwltool#generation": 0,
"location": "file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
"class": "File",
"size": 150
}
jlaptop17.local 2018-05-23 15:27:31,972 MainThread INFO toil.common: Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x10554d490>
}2018-05-23 15:27:31,972 - toil.common - INFO - Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x10554d490>

.. include:: _writing_conda_search.rst

----------------------------------------------------------------
@@ -236,6 +417,10 @@ not work properly without modification.
.. include:: _writing_conda_recipe_complete.rst

.. _Software Requirements: https://www.commonwl.org/v1.0/CommandLineTool.html#SoftwareRequirement
.. _BioContainers: http://biocontainers.pro/
.. _Docker: https://www.docker.com/
.. _Singularity: https://singularity.lbl.gov/
.. _Common Workflow Language: https://www.commonwl.org/
.. _seqtk: https://github.com/lh3/seqtk
.. _fleeqtk: https://github.com/jmchilton/fleeqtk
.. _Bioconda: https://github.com/bioconda/bioconda-recipes
@@ -0,0 +1,4 @@

input1:
class: File
path: test-data/2.fastq
@@ -1,9 +1,6 @@

- doc: test generated from example command
job:
input1:
class: File
path: test-data/2.fastq
job: seqtk_seq_job.json
outputs:
output1:
path: test-data/2.fasta
