Skip to content

de novo targeted gene assembly

License

GPL-3.0, Unknown licenses found

Licenses found

GPL-3.0
LICENSE
Unknown
LICENSE.txt
Notifications You must be signed in to change notification settings

bcgsc/kollector

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

74 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kollector

de novo Targeted Gene Assembly

Kollector is a targeted assembly tool using BioBloom Tools and ABySS.

Installation

Kollector is implemented as a bash script and can be run directly from the downloaded github repo.

Dependencies

Kollector requires the following tools:

The binaries for the above tools are needed to be added to your path. The easiest way to install these tools is to install Linuxbrew and add linuxbrew binary to your PATH:

export PATH="$HOME/.linuxbrew/bin:$PATH"

Installing any package will be as easy as:

brew install (package name)

For example to install ABySS run:

brew install abyss

To see an simple example for running Kollector please see the Example section below.

MacOS Support

Running Kollector on MacOS requires some additional dependencies. In particular, it requires the GNU versions of awk, du, and time.

Installing these dependencies can be done with homebrew:

brew install coreutils gawk gnu-time

Running Kollector

Kollector consists of three bash scripts:

  • kollector-extract.sh
  • kollector-recruit.sh
  • kollector.sh

kollector.sh is the main script.It calls kollector-recruit.sh to recruit PET reads,invokes ABySS to perform contig assembly and calls kollector-extract.sh to extract assembled targets and finally aligns input transcripts to the assembly with GMAP.

To run Kollector simply run :

./kollector.sh <params> <seed.fa> <read1.fa> <read2.fa>

The parameters options are as following:

    -h        show this help message
    -j N      threads [1]
    -r N      min match length for tagging reads. Decimal value are
              the proportion of the valid k-mers and integer values
              will require that minimum number of bases to match [0.7]
    -s N      min match length for recruiting reads [0.50]
    -k N      k-mer size for ABySS contig assembly [32]
    -K N      k-mer size for read overlap detection [25]
    -n N      max k-mers to recruit in total [10000]
    -o FILE   output file prefix ['kollector']
    -p FILE   Bloom filter containing repeat k-mers for
              exclusion from scoring calculations; must match
              k-mer size selected with -K opt [disabled]
    -B N      pass bloom filter size to abyss 2.0.2 
              (B option, to be written: ex - 100M, optional)

<seed.fa> is the input transcript sequence in a form of FASTA file to recruit reads.

<read1.fa> and <read2.fa> are the PET sequencing reads and could be in a form of FASTA/FASTQ files. All the input files can be gzipped.

The -r and -s parameters may be integer values if exact match lengths of the reads are desired rather than k-mer proportions during read tagging or recruitment.

Running Kollector Iteratively

kollector-multiple.sh is a wrapper script for running Kollector iteratively with a large number of targets. After each iteration targets that are successfully assembled are removed from the input, while the failed ones are re-tried in then next iteration with a lower r value. kollector-multiple.sh is run with same arguments as kollector.sh, with two additional parameters:

    -max_iterations N        number of iterations to be performed [5]
    -decrement N             decrement of the r parameter in each iteration [0.1]
  

By using the output file prefix_assembledtargets.fa is the primary output and the output file prefix_hitlist.txt can tell you which successfully assembled targets assocate to each bait sequence.

Example : Testing Kollector with C.elegans dataset

The test folder contains a Makefile that runs kollector on C.elegans data set. The test target C. elegans transcript C17E4.10 FASTA file (Acession NM_060106.6, RefSeq mRNA sequences longer than 1 kb) is provided in data folder. To run Kollector on C.elegans data set simply run make in the test folder. Make sure linuxbrew is on your PATH as mentioned in the in Dependancies section.

The Makefile installs samtools, biobloomtools,abyss,gmap-gsnap and bwa via linuxbrew. Then,it downloads WGS read pairs FASTQ.gz files (SRA Accession: DRR008444,read length:110pb, total number of base pairs:7.5G and 75x raw coverage) to the data folder. Finally, it runs kollector pipeline with the default parameters mentioned above. The output of kollector is test-kollector_assembledtargets.fa file.

Troubleshooting

Kollector.sh is expected to be run with its relative path to the other scripts and it should autodetect your path and add it during runtime. If you get errors related to missing scripts you can add the kollector bin directory directly to be added to your PATH:

export PATH="$HOME/kollector/bin:$PATH"

Citing Kollector

If you find Kollector useful in your work please cite: Erdi Kucuk, Justin Chu, Benjamin P. Vandervalk, S. Austin Hammond, René L. Warren, Inanc Birol; Kollector: transcript-informed, targeted de novo assembly of gene loci. Bioinformatics 2017 btx078. doi: 10.1093/bioinformatics/btx078