Kodoja: A workflow for virus detection in plants using k-mer analysis of RNA-sequencing data

Kodoja takes raw sequencing data (either fasta or fastq) and uses Kraken, a k-mer-based tool, and Kaiju, which uses the Burrows–Wheeler transform, to detect viral sequences in RNA-seq or sRNA-seq data.


There are three main scripts:

  • - classify RNA-seq data.
  • - download viral/host genomes and create new Kraken and Kaiju databases.
  • - pull out sequences of interest from results file.

The remaining Python files contain the functions called by the main scripts, and are not intended for public use.

The .sh files are example scripts for submission to an SGE cluster.

For examples of how to run the code, please see the wiki page:

Additionally, for those of you using the Galaxy web-platform for running bioinformatics analysis from your web-browser, we have provided a Galaxy Wrapper for Kodoja available to install from the Galaxy Tool Shed.


Please cite the following manuscript for Kodoja:

Amanda Baizan-Edge et al. (2019), Kodoja: A workflow for virus detection in plants using k-mer analysis of RNA-sequencing data, Journal of General Virology


Kodoja is released under the MIT licence, see file LICENSE.txt for details.


The versions listed below are the lowest ones used in the initial development and/or local testing of Kodoja. Later releases will likely work unless the tool makes a backward-incompatible change.

  • FastQC v0.11.5
  • Trimmomatic v0.36
  • Kraken v1.0
  • Kaiju v1.5.0

Python packages:

  • numpy v1.9
  • biopython v1.67
  • pandas v0.14
  • ncbi-genome-download v0.2.6

You can use Python 2.7 or Python 3; Kodoja has been tested specifically on Python 3.6.


A conda package has been prepared on the BioConda channel which will install Kodoja and the dependencies, all with just:

$ conda install -c bioconda kodoja

For manual installation, you must install all the dependencies by hand and then add the main scripts folder to your $PATH so that you can run the scripts at the command line.

Pre-built Databases

You can build your own databases with the database build script, or download the pre-built database as described here.

The kodojaDB v1.0 was released in September 2018 under the CC-BY 4.0 license. It can be downloaded from, and cited via, its Zenodo record (where the metadata describes how it was made). We suggest you install it as follows:

$ cd /mnt/shared/data/
$ mkdir kodojaDB_v1.0
$ cd kodojaDB_v1.0
$ wget
$ tar -zxvf kodojaDB_v1.0.tar.gz

You would then use this with the search script as follows:

$ --kraken_db /mnt/shared/data/kodojaDB_v1.0/krakenDB \
                   --kaiju_db /mnt/shared/data/kodojaDB_v1.0/kaijuDB \
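
Before running a search it can help to confirm that the archive unpacked as expected. The following is a minimal sketch, assuming the krakenDB and kaijuDB sub-directory names shown in the example paths above. The helper name `missing_db_dirs` is hypothetical and is not part of Kodoja.

```python
import os


def missing_db_dirs(db_root, expected=("krakenDB", "kaijuDB")):
    """Return the expected sub-directories that are absent under db_root.

    The names in `expected` come from the example paths in this README;
    this helper is illustrative only and is not part of Kodoja.
    """
    return [name for name in expected
            if not os.path.isdir(os.path.join(db_root, name))]


if __name__ == "__main__":
    root = "/mnt/shared/data/kodojaDB_v1.0"
    missing = missing_db_dirs(root)
    if missing:
        print("Missing under %s: %s" % (root, ", ".join(missing)))
    else:
        print("Both database directories found under %s" % root)
```

An empty list means both directories are present and their paths can be passed to --kraken_db and --kaiju_db.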


IMPORTANT: do not put original data in the output directory when executing kodoja_search!

kodoja_search parameters:


  • --read1 - path to the single-end or first paired-end file (required)
  • --read2 - path to second paired-end file (default=False)
  • --data_format - specify the file type of the read files ("fasta" or "fastq" - default='fastq')
  • --output_dir - path to the results folder (required)
  • --threads - number of threads on cluster (default=1)
  • --host_subset - tax id of host. Use this if a host genome was added to the databases and you do not wish to see the number of reads classified to this group in the final table
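
The warning about keeping original data out of the output directory can be checked before launching a run. This is a minimal sketch under stated assumptions: the function names `is_inside` and `check_inputs_outside_output` are hypothetical helpers, not part of Kodoja, and the check only compares resolved paths.

```python
import os


def is_inside(path, directory):
    """Return True if `path` is `directory` itself or lies within it."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    return path == directory or path.startswith(directory + os.sep)


def check_inputs_outside_output(read_paths, output_dir):
    """Raise ValueError if any input read file sits inside output_dir.

    Entries that are None or False (for example an unused --read2)
    are skipped. Hypothetical helper, not part of Kodoja.
    """
    inside = [p for p in read_paths if p and is_inside(p, output_dir)]
    if inside:
        raise ValueError(
            "Input files inside output directory: %s" % ", ".join(inside))
```

Calling `check_inputs_outside_output([read1, read2], output_dir)` before the search gives an early error instead of risking the original data.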


  • --kraken_db - path to kraken database (required)
  • --kraken_quick - Quick operation mode of Kraken: instead of querying every k-mer in a read against the database, it stops at the nth k-mer hit (default=False)


  • --kaiju_db - path to kaiju database, nodes.dmp and names.dmp files (required)
  • --kaiju_minlen - minimum required fragment length (default=15)
  • --kaiju_mismatch - number of mismatches allowed by kaiju (default=1)
  • --kaiju_score - minimum required match score when mismatches are allowed (default=85)

Set parameter for kaiju: -x, which enables filtering of query sequences containing low-complexity regions using the SEG algorithm from the blast+ package. Enabling this option is recommended in order to avoid false positives caused by spurious matches to simple repeat patterns or other sequencing noise.


  • --trim_minlen - minimum length read after trimming (default=50)
  • --trim_adapt - fasta file with Illumina adaptor sequences to allow trimming (default=False)
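
The options above can be assembled into a command line programmatically. The following is a minimal sketch, assuming the search script is installed on your $PATH as `kodoja_search.py` (an assumed executable name; adjust it to your installation). The function `build_search_command` is hypothetical and only uses option names documented in this README.

```python
def build_search_command(read1, output_dir, kraken_db, kaiju_db,
                         read2=None, data_format="fastq", threads=1,
                         script="kodoja_search.py"):
    """Assemble an argument list from the options documented above.

    `script` is an assumed executable name, not confirmed by this README.
    """
    cmd = [script,
           "--read1", read1,
           "--output_dir", output_dir,
           "--kraken_db", kraken_db,
           "--kaiju_db", kaiju_db,
           "--data_format", data_format,
           "--threads", str(threads)]
    if read2:
        # Only add the second read file for paired-end data.
        cmd += ["--read2", read2]
    return cmd


if __name__ == "__main__":
    command = build_search_command(
        "reads_1.fastq", "results",
        "/mnt/shared/data/kodojaDB_v1.0/krakenDB",
        "/mnt/shared/data/kodojaDB_v1.0/kaijuDB",
        read2="reads_2.fastq", threads=4)
    print(" ".join(command))
```

Returning a list rather than a single string means the result can be handed directly to `subprocess.call` without shell quoting problems.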

Set parameters for trimmomatic:

  • ILLUMINACLIP 2:30:10 (seedMismatches:palindromeClipThreshold:simpleClipThreshold) - seedMismatches specifies the maximum mismatch count which will still allow a full match to be performed; palindromeClipThreshold specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment; simpleClipThreshold specifies how accurate the match between any adapter sequence and a read must be.
  • LEADING:20 - specifies the minimum quality required to keep a base at the start of a read
  • TRAILING:20 - specifies the minimum quality required to keep a base at the end of a read

Database build parameters:

General parameters:

  • --output_dir - Output directory path where kraken and kaiju databases will be written (required)
  • --threads - number of threads on cluster (default=1)
  • --host - NCBI tax id for the host genome to be downloaded from refseq and added to the databases (default=False)
  • --extra_files - List of paths to extra files to add to the databases (default=False)
  • --extra_taxids - List of tax ids corresponding to extra files (default=False)
  • --all_viruses - Build databases with viruses from all hosts
  • --db_tag - Suffix for databases (default=none)
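
Since --extra_taxids is described as corresponding to --extra_files, the two lists presumably need to line up one to one. The sketch below illustrates that pairing; whether Kodoja itself enforces it is an assumption, and `pair_extra_files` is a hypothetical helper, not part of Kodoja. The file names and tax ids in the usage lines are placeholders.

```python
def pair_extra_files(extra_files, extra_taxids):
    """Pair each extra file with its tax id, checking the lists line up.

    Both arguments may be False or None (the documented default),
    in which case an empty list is returned.
    """
    extra_files = list(extra_files or [])
    extra_taxids = list(extra_taxids or [])
    if len(extra_files) != len(extra_taxids):
        raise ValueError(
            "Got %d extra files but %d tax ids"
            % (len(extra_files), len(extra_taxids)))
    return list(zip(extra_files, extra_taxids))


if __name__ == "__main__":
    # Placeholder file names and tax ids, for illustration only.
    for path, taxid in pair_extra_files(["a.fasta", "b.fasta"], [111, 222]):
        print("%s -> %s" % (path, taxid))
```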

Kraken database:

  • --kraken_kmer - Kraken k-mer size, an integer (default=31)
  • --kraken_minimizer - Kraken minimizer size (default=15)


  • --download_parallel - number of genomes to download in parallel (default=4)
  • --no_download - Genomes have already been downloaded and are in the output folder (default=False)

Sequence retrieval parameters:

  • --file_dir - Path to directory of kodoja_search results (required)
  • --user_format - Sequence data format (default=fastq)
  • --read1 - Path to read 1 file (required)
  • --read2 - Path to read 2 file
  • --taxID - Virus tax ID for subsetting (default: All viral sequences)
  • --genus - Include sequences classified at the genus level in subset file
  • --stringent - Only subset sequences identified to same virus by both tools
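
The effect of --stringent can be illustrated with plain sets. This is a sketch under stated assumptions, not Kodoja's implementation: the per-tool classifications are represented as hypothetical dictionaries mapping read ID to tax ID, and the non-stringent case is assumed to accept a call from either tool.

```python
def select_reads(kraken_calls, kaiju_calls, taxid, stringent=False):
    """Return sorted read IDs assigned to `taxid`.

    kraken_calls / kaiju_calls: dicts mapping read ID -> tax ID
    (a hypothetical in-memory form of each tool's classifications).
    stringent=True keeps only reads where both tools report `taxid`;
    otherwise a call from either tool is assumed to be enough.
    """
    kraken_hits = {r for r, t in kraken_calls.items() if t == taxid}
    kaiju_hits = {r for r, t in kaiju_calls.items() if t == taxid}
    if stringent:
        hits = kraken_hits & kaiju_hits
    else:
        hits = kraken_hits | kaiju_hits
    return sorted(hits)


if __name__ == "__main__":
    # Placeholder read IDs and tax IDs, for illustration only.
    kraken = {"read1": 111, "read2": 111, "read3": 222}
    kaiju = {"read1": 111, "read2": 222, "read4": 111}
    print(select_reads(kraken, kaiju, 111))
    print(select_reads(kraken, kaiju, 111, stringent=True))
```

The stringent set is always a subset of the default one, which is why it trades sensitivity for confidence.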

Release History

Version  Date        Notes
0.0.10   2018-08-16  - Link to the online manual from command line help
                     - Support Kaiju v1.7.0 (mkbwt now has a prefix)
0.0.9    2018-10-16  - Fix v0.0.8 regression in
0.0.8    2018-09-14  - Output read ID not title in kraken_VRL.txt
                     - Omit /1 and /2 suffixes in kraken_VRL.txt
0.0.7    2018-09-07  - Document installing prebuilt database from Zenodo
                     - Optimise sorting of pandas dataframes
                     - Zero not blank in cols 6 and 7 of virus_table.txt
                     - Automated testing of pinned & latest dependencies
0.0.6    2018-09-04  - Python 3 fix for
                     - Automated testing of
                     - Also test paired reads without /1 and /2 suffixes
0.0.5    2018-08-29  - Refactor logging in
                     - Top level error handling, with logging in search
                     - dictionary changed size during iteration bug
0.0.4    2018-08-22  - Code style updates (no functional changes)
                     - Provide cut-down NCBI taxonomy for test cases
                     - Additional database build testing
                     - Downloads virus files with HTTPS rather than FTP
0.0.3    2018-02-22  - Include genus level counts in search results
                     - Simplify internal renaming of sequencing reads
0.0.2    2018-01-22  - Now tested under Python 3.6 as well as Python 2.7
0.0.1    2018-01-15  - Initial release for BioConda packaging


Kodoja is on GitHub, and has automated testing running on TravisCI; see the special file .travis.yml and the TravisCI webpage for details.

The release process includes:

  1. Update version in diagnosticTool_scripts/
  2. Update release history in this file.
  3. Commit changes.
  4. Tag the commit with git tag kodoja-vX.Y.Z
  5. Push commits and tags to github with git push origin master --tags
  6. Submit a pull request to BioConda to update the package, which usually just means bumping the version and updating the checksum in meta.yaml:
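
For the checksum in step 6, the value can be computed locally from the downloaded release archive. This is a minimal sketch, assuming the recipe records a SHA-256 checksum; the function name `sha256_of_file` and the archive file name in the usage lines are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys
    # Usage: python checksum.py kodoja-vX.Y.Z.tar.gz (placeholder name)
    if len(sys.argv) > 1:
        print(sha256_of_file(sys.argv[1]))
```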