This page goes a little more in depth on the software and its goals.
Several different factors motivated dammit's development. The first of these was the sea lamprey transcriptome project, which had annotation as a primary goal. Many of dammit's core features were already implemented there, and it seemed a shame not to share that work with others in a usable format. Related to this was a lack of workable and easy-to-use existing solutions; in particular, most are meant to be used as protocols and haven't been packaged in an automated format. Licensing was also a big concern -- software used for science should be open source, easily accessible, remixable, and free.
Implicit in these motivations is some idea of what a good annotator should look like, in the author's opinion:
- It should be easy to install and upgrade
- It should only use Free software
- It should make use of standard databases
- It should output in reasonable formats
- It should be relatively fast
- It should try to be correct, insofar as any computational approach can be "correct"
- It should give the user some measure of confidence for its results

dammit builds on the following software:

- TransDecoder
- BUSCO
- HMMER
- Infernal
- LAST
- crb-blast (for now)
- pydoit (under the hood)
All of these are Free Software, as in freedom and beer.

It annotates against the following databases:

- Pfam-A
- Rfam
- OrthoDB
- BUSCO databases
- UniRef90
- User-supplied protein databases
The last one is important, and sometimes ignored.
Building off Richard and co's work on Conditional Reciprocal Best BLAST, I've implemented a new version with Python and LAST -- CRBL. The original lives here: https://github.com/cboursnell/crb-blast
Why??
- BLAST is too slooooooow
- Ruby is yet another dependency to have users install
- With Python and scikit-learn, I have freedom to toy with models (and learn stuff)
And, of course, some of these databases are BIG. Doing blastx and tblastn between a reasonably sized transcriptome and UniRef90 is not an experience you want to have.
In other words, practical concerns.
- Reciprocal Best Hits (RBH) is a standard method for ortholog detection
- Transcriptomes have multiple transcript isoforms, which confounds RBH
- CRBB uses machine learning to get at this problem
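
To make the isoform problem concrete, here is a minimal sketch of plain RBH in Python. It is not dammit's or crb-blast's code: the function names, the `(query, subject, evalue)` tuple layout, and the toy hit tables are all assumptions made for illustration.

```python
def best_hits(hits):
    """Map each query to its lowest-e-value subject.

    `hits` is an iterable of (query, subject, evalue) tuples
    (a hypothetical layout, not a real aligner's output format).
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Toy data: two isoforms of one gene both hit the same protein.
forward = [
    ("tx1.iso1", "protA", 1e-50),
    ("tx1.iso2", "protA", 1e-40),
    ("tx2", "protB", 1e-30),
]
reverse = [
    ("protA", "tx1.iso1", 1e-50),
    ("protA", "tx1.iso2", 1e-40),
    ("protB", "tx2", 1e-30),
]

rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```

In this toy case `tx1.iso2` has a strong hit to `protA`, but `protA`'s single best hit is `tx1.iso1`, so the second isoform is left without an annotation.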
CRBB attempts to associate those isoforms with appropriate annotations by learning an appropriate e-value cutoff for different transcript lengths.
CRBB, from http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5
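
The idea of a length-dependent cutoff can be sketched roughly as follows. This is an illustration only, not the CRBB algorithm as published nor dammit's code: it assumes the RBHs are used as a training set of `(length, evalue)` pairs, fits an ordinary least-squares line to `-log10(evalue)` against length, and accepts any non-reciprocal hit that scores at or above that line. The function names and the toy numbers are hypothetical.

```python
import math


def neg_log_evalue(evalue, floor=1e-300):
    """Convert an e-value to -log10 scale, guarding against evalue == 0."""
    return -math.log10(max(evalue, floor))


def fit_cutoff(training):
    """Least-squares line through (length, -log10 e-value) points.

    `training` is a list of (length, evalue) pairs taken from RBHs.
    Returns (slope, intercept).
    """
    xs = [length for length, _ in training]
    ys = [neg_log_evalue(evalue) for _, evalue in training]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def passes(length, evalue, slope, intercept):
    """True if a hit is at least as significant as the fitted cutoff."""
    return neg_log_evalue(evalue) >= slope * length + intercept


# Hypothetical training set drawn from RBHs: (length, evalue).
rbh_training = [(100, 1e-10), (200, 1e-20), (300, 1e-30), (400, 1e-40)]
slope, intercept = fit_cutoff(rbh_training)

# Hypothetical non-reciprocal hits: (name, length, evalue).
candidates = [("tx1.iso2", 250, 1e-35), ("tx9", 250, 1e-5)]
rescued = [name for name, length, evalue in candidates
           if passes(length, evalue, slope, intercept)]
print(rescued)
```

Here the strong isoform hit clears the cutoff for its length and is kept, while the weak hit of the same length is rejected.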
For CRBL, instead of fitting a linear model, we train a model:
- SVM
- Naive Bayes
One limitation is that LAST has no equivalent to tblastn. So, we find the RBHs using the TransDecoder ORFs, and then use the model on the translated transcriptome versus database hits.