Efficient targeted error resolution and automated finishing of long-read genome assemblies

bcgsc/ntedit_sealer_protocol

ntEdit+Sealer Assembly Finishing Protocol

An automated protocol for finishing long-read genome assemblies using short reads. ntEdit polishes the draft assembly and flags erroneous regions, then Sealer fills assembly gaps and erroneous sequence regions flagged by ntEdit. The protocol is implemented as a Makefile pipeline.

ntEdit+Sealer protocol flowchart

Dependencies

  • GNU Make
  • Python 3
  • ntHits v0.0.1+
  • ntEdit v1.3.5+
  • ABySS v2.3.2+ (includes Sealer and ABySS-Bloom)

Installation

The ntEdit+Sealer dependencies are available from Conda:

conda install -c bioconda nthits ntedit abyss

All dependencies are also available from Homebrew:

brew install brewsci/bio/nthits ntedit abyss

This repository, containing the Makefile pipeline and additional scripts, can be cloned from GitHub:

git clone https://github.com/bcgsc/ntedit_sealer_protocol.git

To run the protocol, ensure that all dependencies are on your PATH.
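
As a quick pre-flight check, the sketch below reports which tools are not visible on your PATH. The executable names (nthits, ntedit, abyss-sealer, abyss-bloom) are assumptions based on what the packages usually install, and check_deps is a hypothetical helper, not part of this repository; adjust the list to match your installation.

```shell
#!/bin/sh
# check_deps: hypothetical helper, not shipped with the protocol.
# Prints the name of every command that cannot be found on PATH
# and returns non-zero if at least one is missing.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; edit them to match your installation.
if check_deps make python3 nthits ntedit abyss-sealer abyss-bloom; then
    echo "All dependencies found on PATH"
else
    echo "Install the tools listed above or add them to PATH" >&2
fi
```

If the cloned repository directory is not on your PATH, the ntedit-sealer script itself will also need to be added or called by its full path.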

Example Command

For example, to run the pipeline on a draft long-read assembly draft-assembly.fa with short-read files reads_1.fq.gz and reads_2.fq.gz, using k-mer lengths k=80, k=65 and k=50 and an ABySS-Bloom Bloom filter size of 5G:

ntedit-sealer finish seqs=draft-assembly.fa reads='reads_1.fq.gz reads_2.fq.gz' k='80 65 50' b=5G

The corrected, finished assembly is written to a file with the suffix .ntedit_edited.prepd.sealer_scaffold.fa.
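
Because only the suffix of the output file is fixed, one way to locate the finished assembly after a run is to match on that suffix. This is a minimal sketch; find_finished is a hypothetical helper name, and it assumes the output is written to the directory you pass in (the current directory by default).

```shell
#!/bin/sh
# find_finished: hypothetical helper, not shipped with the protocol.
# Prints every file in directory $1 (default: current directory) whose
# name ends with the finished-assembly suffix.
find_finished() {
    dir=${1:-.}
    for f in "$dir"/*.ntedit_edited.prepd.sealer_scaffold.fa; do
        # When nothing matches, the glob is left unexpanded; skip it.
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# List finished assemblies in the current working directory, if any.
find_finished .
```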

Help Page

Usage: ntedit-sealer finish [OPTION=VALUE]

General options:
seqs			Draft assembly name [seqs]. File must have .fa extension
reads			Read file(s). All files must have .fq.gz extension. Must be separated by spaces and surrounded by quotes
k			K-mer sizes. List must be descending, separated by spaces and surrounded by quotes
t			Number of threads [8]
time			If True, will log the time for each step [False]

ntEdit options:
X			Ratio of number of kmers in the k subset that should be missing in order to attempt fix (higher=stringent) [0.5]
Y			Ratio of number of kmers in the k subset that should be present to accept an edit (higher=stringent) [0.5]

ABySS-bloom options:
b			Bloom filter size (e.g. 100M)

Sealer options:
L			Length of flanks to be used as pseudoreads [100]
P			Maximum alternate paths to merge; use 'nolimit' for no limit [10]

Notes:
 - Pass all parameter list values (reads, k) as space-separated values surrounded by quotation marks, e.g. k='80 65 50'
 - Ensure that all input files are in the current working directory, making soft-links if needed
 - K-mer lengths will be used in the order they are provided. Ensure that they are sorted in descending order (largest to smallest)

Running ntedit-sealer help prints the help documentation.
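
The notes in the help page can be checked before launching a run. The sketch below is one possible way to do so; is_descending and inputs_ok are hypothetical helpers, not part of the pipeline. It assumes that "descending" means strictly descending (no repeated k values) and that input files are given as bare names in the current working directory.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not shipped with the protocol.

# Return 0 if $1 is a non-empty, space-separated list of strictly
# descending non-negative integers (e.g. "80 65 50").
is_descending() {
    prev=""
    for k in $1; do
        case "$k" in
            *[!0-9]*) return 1 ;;   # reject anything that is not an integer
        esac
        if [ -n "$prev" ] && [ "$k" -ge "$prev" ]; then
            return 1
        fi
        prev=$k
    done
    [ -n "$prev" ]                  # an empty list is invalid
}

# Return 0 if every name in the space-separated list $1 ends with
# suffix $2, contains no directory part, and exists in the current
# working directory (soft-links with an existing target count).
inputs_ok() {
    n=0
    for f in $1; do
        case "$f" in
            */*) return 1 ;;        # must be a bare name in the cwd
            *"$2") ;;               # extension matches
            *) return 1 ;;
        esac
        [ -e "$f" ] || return 1
        n=$((n + 1))
    done
    [ "$n" -gt 0 ]                  # at least one file is required
}

# Values from the example command above.
is_descending "80 65 50" && echo "k list OK"
```

Inputs stored elsewhere can be soft-linked into the working directory first with ln -s, after which inputs_ok 'reads_1.fq.gz reads_2.fq.gz' .fq.gz and inputs_ok draft-assembly.fa .fa should both succeed.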

Citing ntEdit-Sealer, ntEdit and Sealer


Thank you for your Stars and for using, developing and promoting this free software!

If you use ntEdit-Sealer, ntEdit or Sealer in your research, please cite:

ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies

ntEdit+Sealer: Efficient targeted error resolution and automated finishing of long-read genome assemblies.
Li JX, Coombe L, Wong J, Birol I, Warren RL. 
Curr. Protocols. 2022. 2:e442 

ntEdit: Scalable Genome Sequence Polishing

ntEdit: Scalable genome sequence polishing.
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I.
Bioinformatics. 2019. Nov 1;35(21):4430-4432. doi: 10.1093/bioinformatics/btz400.

Sealer: A Scalable Gap-closing Application for Finishing Draft Genomes

Sealer: A scalable gap-closing application for finishing draft genomes. 
Paulino D*, Warren RL*, Vandervalk BP, Raymond A, Jackman SD, Birol I. 
BMC Bioinformatics. 2015. 16:230

License


ntEdit-Sealer Copyright (c) 2015-2022 British Columbia Cancer Agency Branch. All rights reserved.

ntEdit and Sealer are released under the GNU General Public License v3.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca
