#iTree: phylogenomic pipeline

##Phylogenomics

Phylogenomics, conventionally defined as the intersection of phylogenetics and genomics, has become a key instrument in a wide spectrum of biological studies, including resolution of complex evolutionary relationships, assignment of taxonomic affiliation, prediction of protein molecular functions, and tracing of horizontal gene transfer events. iTree automates the execution of phylogenetic analyses in multithreaded or grid-computing environments, providing a scalable high-throughput platform for genome-wide evolutionary analyses.

##Databases

A key step in a phylogenetic analysis is collecting sequences homologous to the query of interest. This step is typically done through a BLAST search against a database. The content of the database has a direct impact on the taxon sampling and, in turn, on the phylogeny to be inferred. To maximize the sampling, iTree uses the results of BLAST searches against NCBI RefSeq for protein phylogenies and SILVA for ribosomal RNA (rRNA) phylogenies. In both cases, there is a BLAST-formatted database (built with `formatdb` from a Fasta file) and a corresponding relational database.

###RefSeq

To make the tree more readable in terms of taxonomic information, the sequences in RefSeq are renamed in the iTree version. The adopted naming convention is `domain.group.genus_species-txid_gi`, where:

| Token | Description |
| --- | --- |
| domain | A: Archaea, B: Bacteria, E: Eukarya, V: Vira (Viruses) |
| group | Major taxonomic group or clade |
| genus | Genus name |
| species | Species (or strain) name |
| txid | NCBI taxon identifier |
| gi | NCBI gi number |

Although this naming convention produces fairly long names (the average name length is 66 characters), it makes it much easier to recognize the taxonomic classification and the relationships between lineages in a phylogenetic tree, even for non-taxonomists.
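As a minimal sketch of how a name in this convention could be split back into its tokens, the shell function below uses only parameter expansion. It assumes that `domain` and `group` contain no dots and that `txid` and `gi` contain no hyphens or underscores; the function name `parse_itree_name` and the example name are hypothetical and are not taken from iTree itself.

```shell
#!/usr/bin/env bash
# Split an iTree sequence name (domain.group.genus_species-txid_gi)
# into six tab-separated tokens.
# Assumptions: domain and group contain no dots; txid and gi contain
# no hyphens or underscores.
parse_itree_name() {
    local name="$1"
    local domain="${name%%.*}"   # text before the first dot
    local rest="${name#*.}"      # drop "domain."
    local group="${rest%%.*}"    # text before the next dot
    rest="${rest#*.}"            # genus_species-txid_gi
    local gi="${rest##*_}"       # text after the last underscore
    rest="${rest%_*}"            # genus_species-txid
    local txid="${rest##*-}"     # text after the last hyphen
    rest="${rest%-*}"            # genus_species
    local genus="${rest%%_*}"    # text before the first underscore
    local species="${rest#*_}"   # remainder (may itself contain "_" or "-")
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
        "$domain" "$group" "$genus" "$species" "$txid" "$gi"
}

# Illustrative name only; the gi number here is made up.
parse_itree_name "B.Proteobacteria.Escherichia_coli-562_123456"
```

Peeling `gi` and `txid` off from the right-hand side first means that strain names containing underscores or hyphens still end up whole in the `species` field.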

The renamed RefSeq protein sequences are stored both as a Fasta file (for BLAST) and in MySQL (at least for now) for fast access and retrieval.

Because GitHub limits the size of files that can be pushed to repositories (for more information, see Working with large files and What is my disk quota?), the iTree databases are hosted on SourceForge.

The current versions, based on RefSeq Release 61 (September 2013), can be downloaded from here.

| Database File | Description | Size |
| --- | --- | --- |
| itree_refseq_61.fas.bz2 | Fasta sequences | 6.5 GB |
| itree_refseq_61.sql.bz2 | MySQL dump | 6.8 GB |

To load the MySQL dump:

$ bzip2 -d itree_refseq_61.sql.bz2
$ mysqladmin -u root -p create itreedb
$ mysql -u root -p itreedb < itree_refseq_61.sql

Given the large size of the dump (> 20 GB uncompressed), the last step takes quite some time, varying according to the power of the host machine. For example, on an Amazon EC2 medium instance, doing nothing else, it takes about 12 hours!
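If disk space is tight, one possible alternative is to stream the compressed dump directly into MySQL so that the uncompressed SQL file is never written to disk. The sketch below assumes the `itreedb` database has already been created with `mysqladmin`; `load_itree_dump` is a hypothetical helper name, not part of iTree.

```shell
#!/usr/bin/env bash
# Stream a bzip2-compressed SQL dump into an existing MySQL database.
# load_itree_dump is a hypothetical helper, not part of iTree.
load_itree_dump() {
    local dump="$1"   # e.g. itree_refseq_61.sql.bz2
    local db="$2"     # e.g. itreedb (must already exist)
    # bzip2 -dc decompresses to standard output and leaves the .bz2 file in place
    bzip2 -dc "$dump" | mysql -u root -p "$db"
}

# Usage, after `mysqladmin -u root -p create itreedb`:
#   load_itree_dump itree_refseq_61.sql.bz2 itreedb
```

This only saves the intermediate file; the import itself should be expected to take about as long as before.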

To format the Fasta database (to make it ready for BLAST):

$ bzip2 -d itree_refseq_61.fas.bz2
$ ln -s itree_refseq_61.fas itreedb
$ formatdb -i itreedb
$ rm itreedb

These databases (Fasta and MySQL) can also be used independently of iTree. They can be plugged into other phylogenomic pipelines or used for other general purposes.
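As one illustration of using the Fasta file on its own, the sketch below tallies sequences per domain letter. It assumes each header line consists of `>` followed directly by the renamed identifier, so that the domain is the text before the first dot; `count_by_domain` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count sequences per domain letter (A, B, E, V) in an iTree Fasta file.
# Assumption: every header line is ">" followed directly by a name of the
# form domain.group.genus_species-txid_gi.
count_by_domain() {
    awk -F'.' '/^>/ { n[substr($1, 2)]++ }
               END  { for (d in n) print d "\t" n[d] }' "$1" | sort
}

# Usage:
#   count_by_domain itree_refseq_61.fas
```

The output is one line per domain, with the domain letter and its sequence count separated by a tab.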

###SILVA

Coming soon...

##Citation

Moustafa, A., Bhattacharya, D., and Allen, A.E. (2010). iTree: A high-throughput phylogenomic pipeline. Biomedical Engineering Conference (CIBEC), 2010 5th Cairo International, pp. 103–107.

DOI: 10.1109/CIBEC.2010.5716071