An automated protocol for finishing long-read genome assemblies using short reads. ntEdit polishes the draft assembly and flags erroneous regions, then Sealer fills assembly gaps and erroneous sequence regions flagged by ntEdit. The protocol is implemented as a Makefile pipeline.
The ntEdit+Sealer dependencies are available from Conda:
conda install -c bioconda nthits ntedit abyss
All dependencies are also available from Homebrew:
brew install brewsci/bio/nthits ntedit abyss
This repository, containing the Makefile pipeline and additional scripts, can be cloned from GitHub:
git clone https://github.com/bcgsc/ntedit_sealer_protocol.git
To run the protocol, ensure that all dependencies are on your PATH.
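A quick way to confirm the PATH requirement is to probe for each executable before starting a run. The sketch below is only illustrative: `check_deps` is a hypothetical helper, and the executable names in the example call are assumptions based on the packages listed above, so adjust them to match your installation.

```shell
# Report every tool that cannot be found on PATH.
# Returns 0 if all tools were found, 1 otherwise.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing from PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call; the executable names are assumed, not taken from the protocol.
check_deps make ntedit-sealer ntedit nthits abyss-bloom abyss-sealer \
    || echo "install the missing tools or extend PATH before running"
```

If the cloned repository is not yet on PATH, one option is to add it in the current shell with `export PATH="$PWD/ntedit_sealer_protocol:$PATH"`, assuming the clone sits in the current directory.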
For example, to run the pipeline on a draft long-read assembly draft-assembly.fa with short read files reads_1.fq.gz and reads_2.fq.gz, k-mer lengths k=80, k=65 and k=50, specifying the ABySS-Bloom Bloom filter size to be 5G:
ntedit-sealer finish seqs=draft-assembly.fa reads='reads_1.fq.gz reads_2.fq.gz' k='80 65 50' b=5G
The corrected, finished assembly can be found with the suffix .ntedit_edited.prepd.sealer_scaffold.fa.
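After a run, the finished assembly can be located by that suffix. The following is a minimal sketch: `find_finished` is a hypothetical helper, not part of the protocol, and it assumes the output is written to the directory you pass in (the current directory by default).

```shell
# List finished assemblies in a directory together with their sequence counts.
find_finished() {
    dir="${1:-.}"
    for f in "$dir"/*.ntedit_edited.prepd.sealer_scaffold.fa; do
        # An unmatched glob stays literal, so skip names that do not exist.
        if [ -e "$f" ]; then
            # Count FASTA header lines; grep exits non-zero when the count is 0.
            n=$(grep -c '^>' "$f" || true)
            echo "$f: $n sequences"
        fi
    done
    return 0
}

find_finished .
```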
Usage: ntedit-sealer finish [OPTION=VALUE]

General options:
seqs     Draft assembly name [seqs]. File must have .fa extension
reads    Read file(s). All files must have .fq.gz extension. Must be separated by spaces and surrounded by quotes
k        K-mer sizes. List must be descending, separated by spaces and surrounded by quotes
t        Number of threads [8]
time     If True, will log the time for each step [False]

ntEdit options:
X        Ratio of number of kmers in the k subset that should be missing in order to attempt fix (higher=stringent) [0.5]
Y        Ratio of number of kmers in the k subset that should be present to accept an edit (higher=stringent) [0.5]

ABySS-bloom options:
b        Bloom filter size (e.g. 100M)

Sealer options:
L        Length of flanks to be used as pseudoreads [100]
P        Maximum alternate paths to merge; use 'nolimit' for no limit [10]

Notes:
- Pass all parameter list values (reads, k) as space-separated values surrounded by quotation marks, e.g. k='80 65 50'
- Ensure that all input files are in the current working directory, making soft-links if needed
- K-mer lengths will be used in the order they are provided. Ensure that they are sorted in descending order (largest to smallest)
Running ntedit-sealer help prints the help documentation.
Thank you for using, developing and promoting this free software!
If you use ntEdit-Sealer, ntEdit or Sealer in your research, please cite:
ntEdit+Sealer: Efficient targeted error resolution and automated finishing of long-read genome assemblies. Li JX, Coombe L, Wong J, Birol I, Warren RL. Curr. Protocols. 2022. 2:e442
ntEdit: Scalable genome sequence polishing. Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I. Bioinformatics. 2019. Nov 1;35(21):4430-4432. doi: 10.1093/bioinformatics/btz400.
Sealer: A scalable gap-closing application for finishing draft genomes. Paulino D*, Warren RL*, Vandervalk BP, Raymond A, Jackman SD, Birol I. BMC Bioinformatics. 2015. 16:230
ntEdit-Sealer Copyright (c) 2015-2022 British Columbia Cancer Agency Branch. All rights reserved.
ntEdit and Sealer are released under the GNU General Public License v3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
For commercial licensing options, please contact Patrick Rebstein (prebstein@bccancer.bc.ca).