This is a maintenance release with small bugfixes and improvements in speed and memory usage. Fixed a bug that caused the final gapclose step to fail on some assemblies. Some scripts have been converted from Perl to C++ for performance. There is a new script:
evaluate_consensus_error_rate.sh (run with -h to get usage)
This script follows the guidelines and procedure for consensus quality evaluation described in https://www.nature.com/articles/nbt.4060 . It uses bwa to map Illumina data to the assembly, and then freebayes to call the variants. Any consensus variant (e.g. a SNP or an indel) where the consensus is not supported by any Illumina reads, but an alternative is supported by at least 3 reads, is called an error in the consensus. The script thus estimates the total number of errors in the genome assembly consensus and computes the sequence quality. The output is a <>.report file, where <> is the name of the input assembly fasta file.
MaSuRCA assemblies have very high consensus quality; in my experiments, a 30x PacBio + 100x Illumina assembly of A. thaliana had 99.9972% quality (about 3 errors per 100,000 bases), and a human NA12878 assembly from 37x Nanopore + 100x Illumina data (Nanopore data from https://www.nature.com/articles/nbt.4060 and Illumina data from the GIAB project) had 99.9913% quality (less than 1 error per 10,000 bases).
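The error-calling rule and the quality figures above can be illustrated with a minimal sketch. This is not the code of evaluate_consensus_error_rate.sh; it assumes that quality is computed as 100 * (1 - errors / assembly length), which is consistent with the numbers quoted above. The function names and the per-site read counts are hypothetical.

```python
def is_consensus_error(consensus_support: int, alt_support: int,
                       min_alt_reads: int = 3) -> bool:
    """Rule from the text: no Illumina read supports the consensus,
    and at least 3 reads support an alternative."""
    return consensus_support == 0 and alt_support >= min_alt_reads


def consensus_quality(num_errors: int, assembly_length: int) -> float:
    """Percent quality, ASSUMED to be 100 * (1 - errors / bases)."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return 100.0 * (1.0 - num_errors / assembly_length)


# Hypothetical variant sites: (reads supporting consensus, reads supporting alternative)
sites = [(0, 5), (12, 4), (0, 2), (0, 3)]
num_errors = sum(is_consensus_error(c, a) for c, a in sites)
print(num_errors)  # only the first and last sites satisfy the rule

# 28 errors per 1,000,000 bases is roughly 3 errors per 100,000 bases
print(round(consensus_quality(28, 1_000_000), 4))
```

Under this assumption, 99.9972% quality corresponds to 2.8 errors per 100,000 bases, and 99.9913% to 0.87 errors per 10,000 bases.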
This release has major assembly quality improvements and some cosmetic changes.
-- reworked the joining algorithm for incorporating long high-error read sequence into the corrected reads where the sequence could not be corrected by Illumina data
-- cleaned up the code and the output/error messages
-- added a final gapclosing step for scaffold gaps spanned by long high-error reads
-- bugfixes, such as an error in executing do_consensus.sh on some systems
-- re-enabled overlap-based trimming in the CABOG assembler and reduced the default coverage input for correction to 25x; if you have more than 25x coverage, the assembler will use only the 25x coverage contained in the longest reads
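The 25x coverage cap in the last item can be pictured with a short sketch. The release notes do not describe how the assembler selects the reads, so this is only one plausible reading: sort the reads by length and keep the longest ones until they add up to the target coverage. The function name and the genome size estimate are assumptions made for illustration.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage=25.0):
    """Keep the longest reads until they reach target_coverage * genome_size bases.

    Hypothetical illustration only; not MaSuRCA's actual selection code.
    """
    budget = target_coverage * genome_size
    if sum(read_lengths) <= budget:
        # At or below the cap: all reads are used
        return list(read_lengths)
    selected = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 2-base "genome" and a 40x cap give an 80-base budget
print(select_longest_reads([10, 50, 30, 20, 40], genome_size=2, target_coverage=40))
```

In the toy example the two longest reads (50 and 40 bases) already exceed the 80-base budget, so the shorter reads are dropped.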
These changes had a significant impact on the contiguity and correctness of large mammalian and plant genome assemblies; for some of my test assemblies, the contig N50 increased from ~300Kbp to ~950Kbp on a 20x PacBio + 100x Illumina data set. The run time has increased by about 10% over the 3.2.7 release, but it is still faster than the 3.2.6 release.
This version has two significant updates over the 3.2.6 version.
-- consensus is done on the segments of long reads that are not corrected by Illumina reads before assembly, resulting in better contiguity and higher consensus accuracy
-- implemented a new, efficient trimming module for mega-reads that eliminates the need to run the overlapper twice in the CABOG assembler, resulting in assembly run time improvements (~50% faster in CABOG)
Version 3.2.6. This is a maintenance release: no new features, only bugfixes and performance improvements.
List of major improvements:
- occasional failure on overlapcorrection workaround
- Illumina-only assembly unitig consensus failure workaround
- running mega-reads on SGE grid improvements in performance and stability
- cleaned up the code and improved re-starting assemblies with Illumina-only data
- Updated version of MUMmer4 included
- Improved compilation and install script on platforms where @ is present in the PWD
- fixed bugs and improved performance of the assembly polishing code
- speed and stability improvements to the Oxford Nanopore correction code
- fixed a bug that resulted in gap filling running in an endless loop