16S rRNA gene sequence curation and phylogenetic reference set creation
The Easy Way
- confirm availability of the libraries needed to compile dependencies, e.g.
  sudo apt-get install gfortran libopenblas-dev liblapack-dev
- Install Python 2.7
- run bin/bootstrap.sh
- run source deenurp-env/bin/activate
The deenurp executable should now be on your $PATH.
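One way to confirm that activation worked is to ask the shell where it finds the executable. The following is a minimal sketch, not part of deenurp itself; the helper names on_path and report_deenurp are made up for illustration.

```shell
# Return success when the named command can be found on $PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Print a one-line status for the deenurp executable.
report_deenurp() {
    if on_path deenurp; then
        echo "deenurp: $(command -v deenurp)"
    else
        echo "deenurp: not on PATH (was deenurp-env/bin/activate sourced?)"
    fi
}

report_deenurp
```

If the second message appears, re-run the source step in the same shell session, since activation only affects the shell in which it was sourced.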
The Hard Way
See required system libraries above.
First, install binary dependencies:
- pip (http://www.pip-installer.org/), for installing python dependencies
- pip install PACKAGE for every PACKAGE listed in requirements.txt, e.g.
  cat requirements.txt | xargs -n 1 pip install
- Infernal version 1.1 (http://infernal.janelia.org/)
- pplacer suite (http://matsen.fhcrc.org/pplacer)
- FastTree 2 (http://www.microbesonline.org/fasttree/#Install)
- muscle (http://www.drive5.com/muscle/)
Then install deenurp itself:
  python setup.py install
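Before running the install step it can help to check that the binary dependencies are visible on $PATH. The sketch below is illustrative only. The executable names are assumptions (cmalign for Infernal, pplacer for the pplacer suite, FastTree and muscle for the remaining two), and missing_tools is a hypothetical helper; adjust the names to match your installation.

```shell
# Print each name from the argument list that is not found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed executable names: cmalign (Infernal), pplacer, FastTree, muscle.
missing=$(missing_tools pip cmalign pplacer FastTree muscle)
if [ -n "$missing" ]; then
    echo "missing from PATH:"
    echo "$missing"
else
    echo "all binary dependencies found"
fi
```

The check only confirms that a command of that name exists; it does not verify versions such as Infernal 1.1.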
- De-novo reference set creation
- Similarity-search based reference sequence selection
The deenurp package under the current directory provides subcommands,
accessed via the script deenurp.py, or the command deenurp if installed.
Subcommands fall into two general categories:
- Building a set of reference sequences for use in refpkg building
- Selecting sequences for a specific reference package
Creating a sequence set for refpkg building
Removes outlier sequences from a reference database
Expands poorly-represented names in a sequence file by similarity search
Clusters reference sequences, first by tax-id at a specified rank
(default: species), then by similarity for unnamed sequences or
sequences not classified to the desired rank. Serves as input to
Selecting sequences for a reference package
Builds a set of hierarchical reference packages.
Searches a set of sequences against a FASTA file containing possible reference sequences.
This subcommand searches sequences against a reference FASTA
file, saving the results and some metadata to a SQLite database for
Given the output of
attempts to find a good set of reference sequences.
For each reference cluster (see cluster-refs) that has at least a
minimum number of sequences with best hits to the cluster, selects a
set number of sequences to serve as references.
Taxa that are the sole descendant of their parent can complicate taxonomic classification.
The fill-lonely subcommand finds some company for these lonely taxa.
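To make the notion of a "lonely" taxon concrete, the sketch below lists taxa whose parent has exactly one child. It is not how fill-lonely is implemented. The input format (one tax_id,parent_id pair per line) and the function name lonely_taxa are assumptions made for illustration.

```shell
# Print tax_ids whose parent has exactly one child.
# Input: lines of "tax_id,parent_id" on stdin (hypothetical format).
lonely_taxa() {
    awk -F, '
        { parent[$1] = $2; children[$2]++ }
        END { for (t in parent) if (children[parent[t]] == 1) print t }
    ' | sort
}

# A has two children (A1, A2); B has only B1, so B1 is reported.
printf '%s\n' 'A1,A' 'A2,A' 'B1,B' | lonely_taxa
```

Here A1 and A2 keep each other company under A, while B1 is the only descendant of B and is the kind of taxon the subcommand targets.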
Fetches sequences from a sequence file which match the taxtable for a reference set at a given rank. Useful for adding type strains.
tax2tree program on a reference package, updating the
Sequences whose lineage changes are relabeled. The prior
added to the
seq_info file in the reference package.