- Bio-y
- (pronounced "bio-ee") The adjective form of the noun "Bio"
- Noah Hoffman
- Chris Rosenthal
- Tyler Land
- A unix-like system; tested primarily on Ubuntu 12.04
- Python 2.7.x
- setuptools
Some functions require the following python packages:
- numpy
- pandas
- biopython
And others require external programs, including:
- ssearch36 (http://faculty.virginia.edu/wrpearson/fasta/fasta36/)
- usearch6 (http://www.drive5.com/usearch/download.html)
- muscle (http://www.drive5.com/muscle/)
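Because these packages and programs are only needed by some functions, it can help to check what is present before running an action. The following is a minimal sketch, not part of bioy: it is written for Python 3 (`shutil.which` is not available in Python 2.7), it assumes the external programs are on the `PATH` under the names listed above, and it uses `Bio`, the import name of biopython.

```python
import importlib.util
import shutil

# Import names of the optional Python packages (biopython imports as "Bio").
OPTIONAL_MODULES = ["numpy", "pandas", "Bio"]

# Assumed executable names, taken from the list above; adjust if your
# installation uses different names.
EXTERNAL_PROGRAMS = ["ssearch36", "usearch6", "muscle"]


def missing_modules(names):
    """Return the module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_programs(names):
    """Return the program names that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def report():
    """Summarize which optional dependencies are missing."""
    return {
        "modules": missing_modules(OPTIONAL_MODULES),
        "programs": missing_programs(EXTERNAL_PROGRAMS),
    }


if __name__ == "__main__":
    print(report())
```

An empty list under either key means nothing in that group is missing.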
To install bioy and python dependencies, run setup.py or pip from the project directory:
% cd bioy
% python setup.py install
# or
% pip install -U .
If you don't want to install the dependencies (numpy and pandas take a while to compile), use:
% pip install --no-deps -U .
Numpy and pandas require many dependencies to compile (and you'll likely need to compile them because versions in package managers are typically out of date). Fortunately, these can pretty easily be installed on Ubuntu 12.04 by running:
% sudo apt-get build-dep python-numpy python-pandas
A virtualenv containing a complete python execution environment can be created using dev/bootstrap.sh:
% dev/bootstrap.sh -h
Create a virtualenv and install all pipeline dependencies
Options:
--venv            - path of virtualenv [bioy-env]
--python          - path to the python interpreter [/usr/local/bin/python]
--wheelstreet     - path to directory containing python wheels; wheel files will be
                    in a subdirectory named according to the python interpreter
                    version; uses WHEELSTREET if defined.
                    (a suggested location is ~/wheelstreet) []
--requirements    - a file listing python packages to install [requirements.txt]
The bioy script provides the user interface, and uses standard
UNIX command line syntax. Note that for development, it is convenient
to run bioy from within the project directory by specifying the
relative path to the script:
% ./bioy

Commands are constructed as follows. Every command starts with the
name of the script, followed by an "action" followed by a series of
required or optional "arguments". The name of the script, the action,
and options and their arguments are entered on the command line
separated by spaces. Help text is available for both the ``bioy``
script and individual actions using the ``-h`` or ``--help`` options::

    usage: bioy [-h] [-V] [-v] [-q]
                {help,align_clusters,all_pairwise,blast,children,classifier,classify,cmscores,consensus,csv2fasta,csv2hdf5,csvmod,dedup,denoise,errors,fasta,fasta2csv,fastq_stats,gb2fa,index,map_clusters,primer_trim,pull_reads,repl,reshape,reverse_complement,rldecode,rlencode,split_barcodes,split_reads,ssearch,ssearch2csv,ssearch_count,tree_edit,tsv2csv,usearch}
                ...

    Tools for microbial sequence analysis and classification.

    positional arguments:
      {help,align_clusters,all_pairwise,blast,children,classifier,classify,cmscores,consensus,csv2fasta,csv2hdf5,csvmod,dedup,denoise,errors,fasta,fasta2csv,fastq_stats,gb2fa,index,map_clusters,primer_trim,pull_reads,repl,reshape,reverse_complement,rldecode,rlencode,split_barcodes,split_reads,ssearch,ssearch2csv,ssearch_count,tree_edit,tsv2csv,usearch}
        help                Detailed help for actions using `help <action>`
        align_clusters      Align reads contributing to a denoised cluster.
        all_pairwise        Calculate all Smith-Waterman pairwise distances among
                            sequences.
        blast               Run blastn and produce classify friendly output
        children            Return the children of a taxtable given a list of taxids
        classifier          Classify sequences by grouping blast output by matching
                            taxonomic names
        classify            Classify sequences by grouping blast output by matching
                            taxonomic names
        cmscores            Convert raw cmalign alignment scores to csv format.
        consensus           Calculate the consensus for a multiple aignment
        csv2fasta           Turn a csv file into a fasta file specifying two columns
        csv2hdf5            Convert a csv file to HDF5
        csvmod              Add or rename columns in a csv file.
        dedup               Fast deduplicate sequences by coalescing identical
                            substrings
        denoise             Denoise a fasta file of clustered sequences
        errors              Tally and classify errors given ./ion rlaligns reference
                            and query sequences
        fasta               Run the fasta pairwise aligment tool and output in csv
                            format.
        fasta2csv           Turn a fasta file into a csv
        fastq_stats         Describe distributions of sequencing quality scores
        gb2fa               Outputs a standard Genbank Record File into fasta file
                            format and optional seqinfo file in format
                            ['seqname', 'tax_id','accession','description','length','ambig_count','is_type','rdp_lineage']
        index               Add simple indices to an sqlite database
        map_clusters        Create a readmap and specimenmap and/or weights file
                            from a
        ncbi_fetch          Fetch sequences from NCBI's nucleotide database using
                            sequence identifiers (gi or gb)
        primer_trim         Parse region between primers from fasta file
        pull_reads          Parse barcode, primer, and read from a fastq file
        repl                Replace strings in one or more files.
        reshape             convert a tsv file to a csv with an optional split/add
                            columns feature
        reverse_complement  reverse complement rle and non-rle sequences
        rldecode            Run-length decode a fasta file
        rlencode            Run-length encode a fasta file
        split_barcodes      Partition reads in a fastq file by barcode and write an
                            annotated fasta file
        split_reads         Parse reads from a fasta file by read to specimen csv
                            map file
        ssearch             Run the ssearch (Smith-Waterman) pairwise aligment tool
                            and output in csv format.
        ssearch2csv         Parse ssearch36 -m10 output and print specified contents
        ssearch_count       Tally ssearch base count by position
        tree_edit           Tree leaf name editor that wraps BioPython.
        tsv2csv             convert a tsv file to a csv with an optional split/add
                            columns feature
        usearch             Run usearch global and produce classify friendly output

    optional arguments:
      -h, --help            show this help message and exit
      -V, --version         Print the version number and exit
      -v, --verbose         Increase verbosity of screen output (eg, -v is verbose,
                            -vv more so)
      -q, --quiet           Suppress output
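The "script, action, arguments" structure described above can be illustrated with argparse subparsers. This is a minimal Python 3 sketch of the pattern, not bioy's actual implementation: the program name `bioy-sketch`, the `revcomp` action, and its `sequence` argument are hypothetical and exist only for this example.

```python
import argparse


def revcomp(args):
    """Hypothetical action: reverse complement a DNA string."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return args.sequence.translate(complement)[::-1]


def build_parser():
    """Build a top-level parser with one subparser per action."""
    parser = argparse.ArgumentParser(
        prog="bioy-sketch",
        description="Toy command line with an action and its arguments.")
    # Options that belong to the script itself come before the action.
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="increase verbosity (-v, -vv, ...)")

    subparsers = parser.add_subparsers(dest="action")

    # Each action gets its own parser, help text, and arguments.
    action = subparsers.add_parser(
        "revcomp", help="reverse complement a sequence")
    action.add_argument("sequence", help="DNA sequence, e.g. AACG")
    action.set_defaults(func=revcomp)
    return parser


def main(argv):
    """Parse argv and dispatch to the function registered for the action."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if getattr(args, "func", None) is None:
        # No action was given: show the top-level help instead.
        parser.print_help()
        return None
    return args.func(args)


if __name__ == "__main__":
    print(main(["revcomp", "AACG"]))
```

With this layout, `-h` after the script name prints the list of actions, and `-h` after an action prints that action's own arguments.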
We use abbreviated git sha hashes to identify the software version:
% ./bioy --version
0128.9790c13
The version information is saved in bioy_pkg/data when setup.py
is run (on installation, or even by executing python setup.py -h).
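As an illustration of the version string shown above, the sketch below builds and splits a string of the same shape. It rests on assumptions: the text only says that abbreviated git sha hashes are used, so treating the numeric prefix as a zero-padded commit count is a guess, and the function names are hypothetical rather than taken from setup.py.

```python
def format_version(commit_count, sha, width=4, sha_length=7):
    """Join a zero-padded counter and an abbreviated sha with a dot.

    Assumption: the numeric prefix is a commit count; the README does
    not state what it represents.
    """
    return "{:0{w}d}.{}".format(commit_count, sha[:sha_length], w=width)


def split_version(version):
    """Split a version string into its numeric prefix and abbreviated sha."""
    prefix, sha = version.split(".", 1)
    return int(prefix), sha


if __name__ == "__main__":
    print(format_version(128, "9790c13"))
    print(split_version("0128.9790c13"))
```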
Unit tests are implemented using the unittest module in the Python
standard library. The tests subdirectory is itself a Python
package that imports the local version (i.e., the version in the project
directory, not the version installed to the system) of the
package. All unit tests can be run like this:
% ./testall
...........
----------------------------------------------------------------------
Ran 11 tests in 0.059s

OK
A single unit test can be run by referring to a specific module,
class, or method within the tests package using dot notation:
% ./testone -v tests.test_utils
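The dot notation maps onto how the unittest loader resolves names. The Python 3 sketch below shows that mechanism with a stand-in module; `demo_tests` and `UtilsTest` are hypothetical names created for this example, and nothing here is claimed about how the `testone` script is written.

```python
import sys
import types
import unittest


class UtilsTest(unittest.TestCase):
    """Hypothetical test case standing in for one in the tests package."""

    def test_upper(self):
        self.assertEqual("acgt".upper(), "ACGT")

    def test_reverse(self):
        self.assertEqual("ACGT"[::-1], "TGCA")


# Register a stand-in module so that dotted names can be resolved by import,
# the same way "tests.test_utils" would be for a real package.
demo = types.ModuleType("demo_tests")
demo.UtilsTest = UtilsTest
sys.modules["demo_tests"] = demo


def run_by_name(dotted_name):
    """Load tests named by a dotted path (module, class, or method) and run them."""
    suite = unittest.defaultTestLoader.loadTestsFromName(dotted_name)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_by_name("demo_tests")                       # every test in the module
    run_by_name("demo_tests.UtilsTest")             # one class
    run_by_name("demo_tests.UtilsTest.test_upper")  # one method
```

The more specific the dotted name, the fewer tests are collected: a module name selects all of its test cases, while a method name selects exactly one test.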
To build the Sphinx docs:
(cd docs && make html)
And to publish to GitHub pages:
ghp-import -p docs/_build/html
(ghp-import and Sphinx are both included in the requirements.txt)
Copyright (c) 2012 Noah Hoffman
Released under the GPLv3 License