The `default` pipeline is suitable for most purposes. However, dammit has several alternative workflows that either reduce the number of databases and tools run (the `quick` pipeline) or annotate with larger databases, such as UniRef90 (the `full` pipeline).

Note: To use these pipelines, first run `dammit run databases` to make sure the relevant databases are installed. Then you can proceed with `dammit run annotate`.

By default, dammit runs the following:
- default:
    - `busco` quality assessment
    - `transdecoder` ORF prediction
    - `shmlast` to any user databases
    - `hmmscan` to Pfam-A
    - `cmscan` to Rfam
    - `LAST` mapping to OrthoDB and Swiss-Prot
The databases used for this pipeline require approximately 18GB of storage space, plus a few hundred MB per busco database. We recommend running this pipeline with at least 16GB of RAM.
Code to run this pipeline:

```
dammit run databases --install
dammit run annotate
```

If specifying a custom location for your databases, add `--databases-dir /path/to/dbs`.
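As a sketch, the commands above combined with a custom database location might look like the following (the directory path is a placeholder, not a real default):

```shell
# Hypothetical database location -- replace with your own path.
# Pass --databases-dir to both subcommands so that installation and
# annotation use the same directory.
DB_DIR="/path/to/dbs"

dammit run databases --install --databases-dir "$DB_DIR"
dammit run annotate --databases-dir "$DB_DIR"
```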
- quick (`--pipeline quick`):
    - `busco` quality assessment
    - `transdecoder` ORF prediction
    - `shmlast` to any user databases
The `quick` pipeline can be used for a minimal annotation run: BUSCO quality assessment, ORF prediction with transdecoder, and shmlast to map to any user databases. While this pipeline may require less database space, we still recommend running with 16GB of RAM, especially if mapping to a user-provided protein database.
Code to run this pipeline:

```
dammit run databases --install --pipeline quick
dammit run annotate --pipeline quick
```

If specifying a custom location for your databases, add `--databases-dir /path/to/dbs`.
Warning: time and resource intensive!

The `full` pipeline starts from the `default` pipeline and adds a mapping database, UniRef90. UniRef90 is a set of UniProt sequences clustered at >=90% sequence identity. UniRef allows searching against a larger set of sequence records while hiding redundant sequences. See the UniRef documentation for more.
- full (`--pipeline full`):
    - `busco` quality assessment
    - `transdecoder` ORF prediction
    - `shmlast` to any user databases
    - `hmmscan` to Pfam-A
    - `cmscan` to Rfam
    - `LAST` mapping to OrthoDB, Swiss-Prot, and UniRef90
As of fall 2020, the UniRef90 fasta is 26G (gzipped).
Code to run this pipeline:

```
dammit run databases --install --pipeline full
dammit run annotate --pipeline full
```

If specifying a custom location for your databases, add `--databases-dir /path/to/dbs`.
Warning: REALLY time and resource intensive!

nr is a very large database consisting of both non-curated and curated database entries. While the name stands for "non-redundant", this database is no longer non-redundant. Given the time and memory requirements, nr is only a good choice for species and/or sequences you're unable to confidently annotate via other databases.
- nr (`--pipeline nr`):
    - `busco` quality assessment
    - `transdecoder` ORF prediction
    - `shmlast` to any user databases
    - `hmmscan` to Pfam-A
    - `cmscan` to Rfam
    - `LAST` mapping to OrthoDB, Swiss-Prot, and nr
As of fall 2020, the nr fasta is 81G (gzipped).
Code to run this pipeline:

```
dammit run databases --install --pipeline nr
dammit run annotate --pipeline nr
```

If specifying a custom location for your databases, add `--databases-dir /path/to/dbs`.
Note: Since all of these pipelines use a core set of tools, and since dammit uses snakemake to keep track of the files that have been run, dammit will not rerun the core tools if you decide to alter the pipeline you're running. For example, you could start with a `quick` run, and later run `default` if desired. In that case, dammit would run only the new annotation steps and reintegrate the relevant outputs into new dammit gff3 and annotated fasta files.
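As a concrete sketch of this reuse, a minimal run can be upgraded later to the default pipeline; the command sequence is assembled from the commands shown above:

```shell
# First pass: minimal annotation with the quick pipeline.
dammit run databases --install --pipeline quick
dammit run annotate --pipeline quick

# Later: switch to the default pipeline. Because snakemake tracks
# completed outputs, the shared steps (busco, transdecoder, shmlast)
# are not rerun; only the added steps (hmmscan to Pfam-A, cmscan to
# Rfam, LAST mapping) execute, and the final gff3 and annotated fasta
# files are regenerated with the new results.
dammit run databases --install
dammit run annotate
```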