ntEdit

Scalable genome assembly polishing

2/2019

email: rwarren [at] bcgsc [dot] ca

Description


ntEdit is a scalable genomics application for polishing genome assembly drafts. ntEdit simplifies polishing and "haploidization" of gene and genome sequences with its re-usable Bloom filter design. We expect ntEdit to have additional applications in fast mapping of simple nucleotide variations between any two individuals or species’ genomes.

Implementation and requirements


ntEdit v1.1.0 is written in C++.

Install


Clone this repository and enter the directory.

git clone https://github.com/bcgsc/ntEdit.git
cd ntEdit

Compile ntEdit.

make ntedit

Dependencies


  1. ntHits (https://github.com/bcgsc/nthits)
  2. BloomFilter utilities (provided in ./lib)
  3. kseq (provided in ./lib)

Documentation


Refer to the README.md file for how to run ntEdit, and to our manuscript for information about the software and its performance. Questions or comments? We would love to hear from you! rwarren@bcgsc.ca

Citing ntEdit


René L Warren, Lauren Coombe, Hamid Mohamadi, Jessica Zhang, Barry Jaquish, Nathalie Isabel, Steven JM Jones, Jean Bousquet, Joerg Bohlmann and Inanç Birol
ntEdit: scalable genome assembly polishing
TBD

The experimental data described in our paper can be downloaded here: http://www.bcgsc.ca/downloads/btl/ntedit/ Thank you for using, developing and promoting this free software.

Credits


ntEdit: Rene Warren

nthits / nthash / BloomFilter.pm: Hamid Mohamadi (recursive/ntHash)

C++ implementation: Jessica Zhang

How to run in a pipeline


  1. Running nthits (please see nthits documentation)

     nthits -c -k -j reads

     e.g. ./nthits -c 1 --outbloom --solid -k 25 -j 48 Sim_100_300_1.fq Sim_100_300_2.fq
     or   ./nthits -c 1 --outbloom --solid -k 25 -j 48 @reads.in

     where @reads.in is a file listing the path to all fastq files

  2. Running ntEdit (see complete usage below)

     ./ntedit -f -k -r -b -t

     e.g. ./ntedit -f ecoliWithMismatches001Indels0001.fa -r solidBF_k25.bf -k 25 -b ntEditEcolik25 -t 48
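
When the two steps are scripted, the command lines can be assembled programmatically. The following is a minimal Python sketch and is not part of ntEdit. The helper names (write_reads_list, nthits_command, ntedit_command) are hypothetical, the one-path-per-line layout of reads.in is an assumption, and the flags and file names are copied from the examples above. It only builds argument lists; nothing is executed.

```python
from pathlib import Path


def write_reads_list(fastq_paths, list_file="reads.in"):
    """Write one FASTQ path per line (assumed layout of the @reads.in file)."""
    Path(list_file).write_text("".join(f"{p}\n" for p in fastq_paths))
    return list_file


def nthits_command(reads_list, k=25, threads=48, min_cov=1):
    """Argument list mirroring the README's nthits example."""
    return ["./nthits", "-c", str(min_cov), "--outbloom", "--solid",
            "-k", str(k), "-j", str(threads), f"@{reads_list}"]


def ntedit_command(draft, bloom, k=25, prefix="ntEditEcolik25", threads=48):
    """Argument list mirroring the README's ntedit example."""
    return ["./ntedit", "-f", draft, "-r", bloom, "-k", str(k),
            "-b", prefix, "-t", str(threads)]


if __name__ == "__main__":
    # The Bloom filter name is taken from the README example; that nthits
    # writes exactly this file name is an assumption.
    print(" ".join(nthits_command("reads.in")))
    print(" ".join(ntedit_command("ecoliWithMismatches001Indels0001.fa",
                                  "solidBF_k25.bf")))
```

Each list can be passed to subprocess.run(cmd, check=True) once both binaries are built. The same k must be used in both commands.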

Running ntEdit


e.g. ./ntedit -f ecoliWithMismatches001Indels0001.fa -r solidBF_k25.bf -k 25 -b ntEditEcolik25

 Options:
	-t,	number of threads [default=1]
	-f,	Draft genome assembly (FASTA, Multi-FASTA, and/or gzipped compatible), REQUIRED
	-r,	Bloom filter file (generated from ntHits), REQUIRED
	-b,	output file prefix, OPTIONAL
	-k,	kmer size, REQUIRED
	-z,	minimum contig length [default=100]
	-i,	maximum number of insertion bases to try, range 0-5, [default=4]
	-d,	maximum number of deletion bases to try, range 0-5, [default=5]
	-x,	k/x ratio for the number of kmers that should be missing, [default=5.000]
	-y,	k/y ratio for the number of edited kmers that should be present, [default=9.000]
	-v,	verbose mode (-v 1 = yes, default = 0, no)

	--help,		display this message and exit 
	--version,	output version information and exit
Report bugs to rwarren@bcgsc.ca
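
To get a feel for what the -x and -y ratios mean at a given k, the arithmetic can be worked through in a few lines of Python. This is an illustrative sketch only. The function name kmer_thresholds is hypothetical, and counting the sampled kmers as the number of offsets visited when stepping 3 bases at a time across k bases is an assumption about how the subset is formed (see "How it works").

```python
def kmer_thresholds(k, x=5.0, y=9.0, step=3):
    """Return (sampled kmers, min absent to trigger editing, min present to accept)."""
    sampled = len(range(0, k, step))   # assumed size of the sampled kmer subset
    min_absent = k / x                 # -x: absent kmers needed before editing
    min_present = k / y                # -y: present kmers needed to accept an edit
    return sampled, min_absent, min_present


sampled, min_absent, min_present = kmer_thresholds(25)
print(f"{sampled} sampled, >= {min_absent:.2f} absent to edit, "
      f">= {min_present:.2f} present to accept")
# prints: 9 sampled, >= 5.00 absent to edit, >= 2.78 present to accept
```

With the defaults and k=25, at least 5 of the sampled kmers must be missing before an edit is attempted, and at least 3 kmers (the first integer at or above 2.78) must support the edited sequence.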

Test data


Go to ./demo

(cd demo)

run:

./runme.sh (../ntedit -f ecoliWithMismatches001Indels0001.fa.gz -r solidBF_k25.bf -k 25 -b ntEditEcolik25)

ntEdit will polish an E. coli genome sequence with a substitution error rate of ~0.001 and an indel rate of ~0.0001, using a pre-made ntHits Bloom filter.

Expected output files: ntEditEcolik25.log, ntEditEcolik25_changes.tsv and ntEditEcolik25_edited.fa

How it works


Sequence reads are first shredded into kmers using ntHits, keeping track of kmer multiplicity. The kmers that pass coverage thresholds (the ntHits -c option builds a filter with kmers having a coverage higher than c) are used to construct a Bloom filter (BF). The draft assembly is supplied to ntEdit (-f option, FASTA file) along with the BF (-r option), and sequences are read sequentially.

Sequence strings are shredded into words of length k (kmers), with k (-k option) matching the value used to build the BF, and each kmer, from 5’ to 3’, is queried against the BF for presence/absence (step 1). When a kmer is not found in the filter, a subset (Sk) of the k overlapping kmers containing its 3’-end base (defined by k over three, k/3) is queried for absence (step 2). The subset Sk, a subsampling of the k kmers obtained by sliding 3 bases at a time over k bases, is chosen to minimize the number of checks against the Bloom filter. When the number of absent (-) kmers in this subset matches or exceeds the threshold Sk- >= k/x (-x option), representing the majority of kmers in Sk, editing takes place (step 3 and beyond); otherwise step 1 resumes.

When editing, the 3’-end base is permuted to one of the three alternate bases (step 3), and the subset (Sk_alt) containing the change is assessed for Bloom filter presence (+). When that number matches or exceeds the threshold Sk_alt+ >= k/y (-y option), the base substitution qualifies: it is tracked along with the number of supporting kmers, and the remaining alternate 3’-end base substitutions are also assessed (i.e. step 3 resumes until all bases have been inspected). If the edit does not qualify, a cycle of base insertion(s) and deletion(s) of up to -i and -d bases begins (step 4, -i option, and step 5, -d option, respectively). As with substitutions, a subset of the k kmers containing the indel change is tested for presence.

If there are no qualifying changes, the next alternate 3’-end base is inspected as above; otherwise the change is applied to the sequence string and the next assembly kmer is inspected (step 1). The process is repeated until a qualifying change is found or until no suitable edits are found. In the latter case, we go back to step 1. When a change is made, the position on the new sequence is tracked, along with an alternate base with lesser or equal kmer subset support, when applicable. Currently, ntEdit only tracks cases where edits are made (steps 3-5) and does not flag unedited, missing draft kmers (steps 1-2).
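
Steps 1 to 3 can be illustrated with a toy Python sketch. This is not ntEdit's implementation: it handles substitutions only (no insertions or deletions, no alternate-base tracking), it uses a plain Python set in place of the Bloom filter, and the function names (kmers_covering, polish_substitutions) and the short test sequence are invented for illustration. The thresholds k/x and k/y follow the description above.

```python
def kmers_covering(seq, pos, k, step=3):
    """Kmers that contain seq[pos], sampled every `step` start positions."""
    starts = range(pos - k + 1, pos + 1, step)
    return [seq[s:s + k] for s in starts if s >= 0 and s + k <= len(seq)]


def polish_substitutions(draft, solid, k, x=5.0, y=9.0):
    """Substitution-only toy version of steps 1-3; `solid` is a set of kmers."""
    seq, changes, i = draft, [], 0
    while i + k <= len(seq):
        if seq[i:i + k] not in solid:                       # step 1: kmer missing
            pos = i + k - 1                                 # its 3'-end base
            absent = sum(km not in solid
                         for km in kmers_covering(seq, pos, k))
            if absent >= k / x:                             # step 2: enough absent
                best = None
                for base in "ACGT":                         # step 3: try alternates
                    if base == seq[pos]:
                        continue
                    alt = seq[:pos] + base + seq[pos + 1:]
                    support = sum(km in solid
                                  for km in kmers_covering(alt, pos, k))
                    if support >= k / y and (best is None or support > best[1]):
                        best = (base, support)
                if best is not None:
                    # Record 1-based position, original base, new base, support.
                    changes.append((pos + 1, seq[pos], best[0], best[1]))
                    seq = seq[:pos] + best[0] + seq[pos + 1:]
        i += 1
    return seq, changes


truth = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACCGTGA"
k = 9
solid = {truth[s:s + k] for s in range(len(truth) - k + 1)}  # stand-in for the BF
draft = truth[:20] + "C" + truth[21:]      # substitution error at 1-based position 21
polished, changes = polish_substitutions(draft, solid, k)
print(polished == truth, changes)          # True [(21, 'C', 'A', 3)]
```

A set never reports false positives, whereas a Bloom filter can, so this sketch is more conservative than a real filter-backed run would be.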

OUTPUT FILES


| Output files | Description |
|---|---|
| _changes.tsv | tab-separated file; columns: ID, bpPosition+1, OriginalBase, NewBase, Support 25-mers (out of k/3), AlternateNewBase, Alt.Support k-mers; e.g. U00096.3_MG1655_k12 117 A T 9 |
| _edited.fa | fasta file; contains the polished genome assembly |
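
For downstream analysis, the _changes.tsv file can be loaded with a few lines of Python. The sketch below is an assumption-laden example, not a documented interface. The Change type and read_changes function are hypothetical names, the column order is taken from the description above, the last two columns are treated as optional because the example row has only five fields, and any row whose second field is not an integer (such as a possible header line) is skipped.

```python
import csv
import io
from typing import NamedTuple, Optional


class Change(NamedTuple):
    seq_id: str
    position: int                      # bpPosition+1 column
    original: str
    new: str
    support: int                       # supporting kmers for the new base
    alt_base: Optional[str] = None     # AlternateNewBase, when present
    alt_support: Optional[int] = None  # Alt.Support, when present


def read_changes(handle):
    """Parse tab-separated change records from an open text handle."""
    changes = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5 or not row[1].isdigit():
            continue                   # blank line or (assumed) header row
        alt_base = row[5] if len(row) > 5 and row[5] else None
        alt_support = int(row[6]) if len(row) > 6 and row[6].isdigit() else None
        changes.append(Change(row[0], int(row[1]), row[2], row[3],
                              int(row[4]), alt_base, alt_support))
    return changes


example = "U00096.3_MG1655_k12\t117\tA\tT\t9\n"
for change in read_changes(io.StringIO(example)):
    print(change.seq_id, change.position, change.original, "->", change.new)
```

For a real run, pass an open file instead, for example open("ntEditEcolik25_changes.tsv").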

License


ntEdit Copyright (c) 2018-2019 British Columbia Cancer Agency Branch. All rights reserved.

ntEdit is released under the GNU General Public License v3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca
