GItaxidIsVert

This set of scripts parses a BLAST m8 file (BLAST tabular output file) and reports the hits whose subject sequences come from vertebrates.
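As a rough illustration of the input, the sketch below reads one m8 record and pulls a GI number out of the subject ID. It assumes the standard 12-column tabular layout (e-value in column 11, bit score in column 12) and a subject ID of the form `gi|<number>|...`. The names `M8Record`, `parse_m8_line` and `gi_from_subject_id` are hypothetical and are not taken from GItaxidIsVert.py.

```python
from typing import NamedTuple, Optional


class M8Record(NamedTuple):
    """The handful of m8 columns this sketch cares about."""
    query_id: str
    subject_id: str
    evalue: float
    bitscore: float


def parse_m8_line(line: str) -> M8Record:
    """Split one tab-delimited m8 line (12 standard columns assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated columns")
    return M8Record(fields[0], fields[1], float(fields[10]), float(fields[11]))


def gi_from_subject_id(subject_id: str) -> Optional[int]:
    """Return the number following a 'gi' tag, or None if there is none."""
    parts = subject_id.split("|")
    for tag, value in zip(parts, parts[1:]):
        if tag == "gi" and value.isdigit():
            return int(value)
    return None


# Made-up record, for illustration only.
example = "query1\tgi|123456|gb|AB000001.1|\t98.50\t200\t3\t0\t1\t200\t101\t300\t1e-50\t350"
record = parse_m8_line(example)
gi = gi_from_subject_id(record.subject_id)
print(record.query_id, gi, record.evalue)
```

Subject IDs without a `gi|` prefix return `None`, which is one reason these tools stop being useful once GI numbers disappear from BLAST output.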

Contents

This tool includes the following scripts:
recursive_binary_search.py: used as a module in GItaxidIsVert.py
nodes_to_vertebrate_ids.py: used to process the nodes.dmp file and output a list of all of the NCBI taxonomy IDs that are vertebrates
GItaxidIsVert.py: primary script
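To give a feel for why a binary search module is useful here, the sketch below looks up one GI in a large two-column file without loading it into memory. It assumes the file holds `GI<tab>taxid` lines sorted by ascending GI with no blank lines, which is how gi_taxid_nucl.dmp is understood to be laid out. The functions `find_taxid` and `lookup_taxid` are hypothetical and are not the code in recursive_binary_search.py.

```python
import os
import tempfile
from typing import BinaryIO, Optional


def find_taxid(handle: BinaryIO, gi: int, lo: int, hi: int) -> Optional[int]:
    """Recursively search the lines that start in the byte range [lo, hi)."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    if mid == 0:
        handle.seek(0)
    else:
        # Skip the rest of the line containing byte mid - 1, so the next
        # read begins at the first line starting at or after mid.
        handle.seek(mid - 1)
        handle.readline()
    start = handle.tell()
    line = handle.readline()
    if not line or start >= hi:
        # No line starts in [mid, hi); only the lower half can hold the GI.
        return find_taxid(handle, gi, lo, mid)
    key, taxid = line.decode("ascii").split()[:2]
    if int(key) == gi:
        return int(taxid)
    if int(key) < gi:
        return find_taxid(handle, gi, handle.tell(), hi)
    return find_taxid(handle, gi, lo, mid)


def lookup_taxid(path: str, gi: int) -> Optional[int]:
    """Open a sorted GI-to-taxid file and return the taxid for gi, if any."""
    with open(path, "rb") as handle:
        return find_taxid(handle, gi, 0, os.path.getsize(path))


# Tiny stand-in for gi_taxid_nucl.dmp (made-up GI numbers).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "gi_taxid_demo.dmp")
    with open(demo_path, "w") as out:
        out.write("2\t9913\n5\t9606\n11\t7955\n1000\t562\n")
    found = lookup_taxid(demo_path, 11)
    missing = lookup_taxid(demo_path, 7)

print(found, missing)
```

Each step halves the byte range, so even a multi-gigabyte file needs only a few dozen seeks per lookup.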

Usage

# Please note that these tools, which rely on NCBI GenBank GI numbers, will not be useful once GI numbers are phased out: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/

$ git clone https://github.com/calacademy-research/GItaxidIsVert.git

# You can choose to follow all or parts of this tutorial.

# If you want to run a BLAST search against a local copy of NCBI's nt database,
# first find the Perl script "update_blastdb.pl" in the bin directory of your installed BLAST+ package.
# Then do the following in an empty directory:
$ perl update_blastdb.pl --timeout 300 --force --verbose nt
# Uncompress the downloaded files with your favorite method
$ for f in *.tar.gz; do tar xvfz "$f"; done

# Perform a BLAST search with tabular (m8-style) output
$ blastn -db </path/to/downloaded/NCBI/database/nt> -query <fasta_query_file> -outfmt 6 -out <BLAST_output_file.m8>

# The next sections are more relevant specifically to GItaxidIsVert.py

# Obtain the following and uncompress them
$ wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
$ wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
$ tar xvfz taxdump.tar.gz
$ gunzip gi_taxid_nucl.dmp.gz

# Make a list of all of the vertebrate taxonomy IDs and save it in the same directory as the files above
# Run this in the directory that holds the .dmp files, or provide the full path to the nodes.dmp file
$ nodes_to_vertebrate_ids.py nodes.dmp >vert_ids.dmp

# You will need to supply the path to the directory where your .dmp files are located
# Get the full path to the directory where the above .dmp files are found
$ pwd

$ cd GItaxidIsVert
$ python GItaxidIsVert.py <BLAST_output_file.m8> -dmpDir </path/to/dmp/directory/>
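For readers curious how a vertebrate ID list such as vert_ids.dmp could be derived from nodes.dmp, here is a minimal sketch. It assumes the usual taxdump layout, in which each line begins with `tax_id | parent tax_id | rank | ...`, and it uses NCBI taxonomy ID 7742 for Vertebrata. The function names and the toy tree are hypothetical, and nodes_to_vertebrate_ids.py may work differently.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set

VERTEBRATA_TAXID = 7742  # NCBI taxonomy ID for Vertebrata


def read_children(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Map each parent tax ID to its child tax IDs from nodes.dmp-style lines."""
    children: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        child, parent = int(fields[0].strip()), int(fields[1].strip())
        if child != parent:  # the root node lists itself as its parent
            children[parent].append(child)
    return children


def descendants(children: Dict[int, List[int]], root: int) -> Set[int]:
    """Collect root and every tax ID below it with a breadth-first walk."""
    found = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Toy tree, heavily simplified; real lineages have many intermediate ranks.
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tsuperkingdom\t|\n",
    "562\t|\t2\t|\tspecies\t|\n",
    "7742\t|\t1\t|\tno rank\t|\n",
    "9606\t|\t7742\t|\tspecies\t|\n",
    "9913\t|\t7742\t|\tspecies\t|\n",
]
vert_ids = sorted(descendants(read_children(toy_nodes), VERTEBRATA_TAXID))
print(vert_ids)
```

With the real file, `read_children(open("nodes.dmp"))` would replace the toy list, and the resulting IDs could be written out one per line.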

GItaxidIsVert.py options

$ python GItaxidIsVert.py

Usage: GItaxidIsVert.py <blast_m8_fmt_file> -dmpDir </path/to/dmp/directory/> [-e <eval_filter#>] [-t|-a] [-n] [-c]
       Writes out m8 records that are vertebrates.

       -e eval_filter # sets an eVal number the record must be <= to to be output. Default is 1e-12
          Don't forget a number before e (usually 1) and to use a minus sign after e.
       -dmpDir /path/to/dmp/directory/ # provide full path to directory holding the following dmp files:
            gi_taxid_nucl.dmp
            vert_ids.dmp
          Don't forget to use a '/' at the end of the path.
       -a all hits of suitable eValue for each query
       -t only top hit for each query (default)
       -n reverses meaning so you get nonvertebrates.
       -c writes out a comment line with gi number,tax_id and type instead of m8 record.
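The interaction of `-e`, `-t`, `-a` and `-n` can be sketched as below. This is one reading of the option descriptions, not the script's actual logic. In particular, it assumes "top hit" means the first record listed for each query, and that the e-value sits in column 11. `filter_hits` and the `is_vertebrate` callback are hypothetical stand-ins for the GI-to-taxid lookup done against the .dmp files.

```python
from typing import Callable, Iterable, Iterator, Set


def filter_hits(
    lines: Iterable[str],
    is_vertebrate: Callable[[str], bool],
    max_evalue: float = 1e-12,   # mirrors the documented -e default
    top_only: bool = True,       # True behaves like -t, False like -a
    invert: bool = False,        # True behaves like -n
) -> Iterator[str]:
    """Yield m8 lines that pass the e-value cutoff and the vertebrate test."""
    seen_queries: Set[str] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a full m8 record
        query_id, subject_id, evalue = fields[0], fields[1], float(fields[10])
        if top_only:
            # Assumption: only the first record per query counts as the top hit.
            if query_id in seen_queries:
                continue
            seen_queries.add(query_id)
        if evalue > max_evalue:
            continue
        if is_vertebrate(subject_id) != invert:
            yield line


def make_record(query: str, subject: str, evalue: str) -> str:
    """Build a made-up 12-column m8 line for the demo."""
    return "\t".join([query, subject, "99.0", "100", "1", "0",
                      "1", "100", "1", "100", evalue, "200"])


records = [
    make_record("q1", "gi|1|", "1e-50"),
    make_record("q1", "gi|2|", "1e-30"),
    make_record("q2", "gi|3|", "1e-5"),
    make_record("q3", "gi|4|", "1e-40"),
]
toy_vertebrates = {"gi|1|", "gi|2|"}  # pretend lookup result


def is_vert(subject_id: str) -> bool:
    return subject_id in toy_vertebrates


top_hits = list(filter_hits(records, is_vert))
all_hits = list(filter_hits(records, is_vert, top_only=False))
non_vertebrates = list(filter_hits(records, is_vert, invert=True))
print(len(top_hits), len(all_hits), len(non_vertebrates))
```

Passing the vertebrate test in as a callback keeps the filtering rules separate from the file lookups, which makes each part easier to check on its own.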

Citing

NCBI taxonomy database: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite

Authorship

recursive_binary_search.py authors: Joe Koberg (http://stackoverflow.com/questions/744256/reading-huge-file-in-python), Zena Ng
nodes_to_vertebrate_ids.py authors: James B. Henderson, jhenderson@calacademy.org; Zachary R. Hanna
GItaxidIsVert.py authors: James B. Henderson, jhenderson@calacademy.org; Zachary R. Hanna
README.md authors: Zachary R. Hanna, James B. Henderson

Version 1.0.0
