Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

Getting Started

# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make
# Assembly
./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz

Introduction

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio Hifi reads. Unlike most existing assemblers, hifiasm starts from an uncollapsed genome, so it can preserve as much haplotype information as possible. Its input is PacBio Hifi reads in FASTA/FASTQ format, and its outputs consist of:

  1. Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information, including somatic mutations and recurrent sequencing errors.
  2. Haplotype-resolved processed unitig graph without small bubbles (prefix.p_utg.gfa). This is usually the preferred output for highly heterozygous genomes.
  3. Primary assembly contig graph (prefix.p_ctg.gfa). This is the preferred output for inbred strains or human. For highly heterozygous genomes, this graph may represent multiple haplotypes. We plan to change this to represent one set of haplotypes.
  4. Alternate assembly contig graph (prefix.a_ctg.gfa).
  5. Haplotype-aware error-corrected reads in FASTA format (prefix.ec.fa).
  6. All-to-all overlaps in the PAF format (prefix.ovlp.paf).
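
Many downstream tools expect FASTA rather than GFA. The sketch below shows one way to pull the sequences out of any of the graph files listed above. It assumes the sequences are stored inline on the segment (`S`) lines, as the GFA 1 format allows; the segment names in the demo are hypothetical, and the helper functions are illustrative rather than part of hifiasm.

```python
import io


def gfa_segments(gfa_lines):
    """Yield (name, sequence) for every segment (S) line of a GFA 1 stream."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip every record type other than segments
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # sequence is not stored inline, nothing to export
        yield name, seq


def write_fasta(records, out, width=60):
    """Write (name, sequence) pairs to `out` as FASTA, wrapped at `width`."""
    for name, seq in records:
        out.write(f">{name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


# Tiny in-memory demo with made-up segment names.
demo = [
    "S\tutg1\tACGTACGT\n",
    "L\tutg1\t+\tutg2\t-\t0M\n",
    "S\tutg2\tTTGACA\n",
]
buffer = io.StringIO()
write_fasta(gfa_segments(demo), buffer)
print(buffer.getvalue(), end="")
```

For a real file, pass an open file handle (for example `open("prefix.p_ctg.gfa")`) in place of `demo` and an output file in place of `buffer`.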

Hifiasm is still at an early stage of development; output of phased, chromosome-level, high-quality assemblies is planned for the near future. In addition, hifiasm writes three binary files that store all overlap information (hifiasm.asm.ovlp, hifiasm.asm.ovlp.source and hifiasm.asm.ovlp.reverse by default). With these files, hifiasm can skip the time-consuming all-to-all overlap calculation and proceed directly to assembly. This is helpful when you want to optimize an assembly over multiple rounds of experiments with different parameters.

Hifiasm is a standalone, lightweight assembler that needs no external libraries other than zlib. For large genomes, it can generate a high-quality assembly in a few hours. Hifiasm has been tested on the following datasets:

| Dataset | Genome size | Coverage | Assembly options | CPU time | Wall time | RAM | Unitig/contig N50[1] |
|---|---|---|---|---|---|---|---|
| Human NA12878 | 3Gb | x28 | -k 40 -t 42 -r 2 | 200h | 5h32m | 114G | 93.5Kb/21.5Mb |
| Human HG002 | 3Gb | x43 | -k 40 -t 42 -r 2 | 405h10m | 12h7m | 146G | 320Kb/31.9Mb |
| Human CHM13 | 3Gb | x27 | -k 40 -t 42 -r 2 | 157h28m | 5h10m | 85.8G | NA[2]/39.8Mb |
| Butterfly | 358Mb | x35 | -k 40 -t 42 -r 2 -z 20 | 17h6m | 36m | 16G | 7.5Mb/NA[3] |

[1] Unitig N50 is the N50 of the assembly graph with haplotype information (i.e., with bubbles), while contig N50 is the N50 of the haplotype-collapsed assembly (i.e., without bubbles).

[2] CHM13 is a homozygous sample, so unitig N50 is not meaningful.

[3] The butterfly genome is highly heterozygous, so most chromosomes have been fully separated into two haplotypes. In this case, contig N50 is not meaningful.
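
For readers unfamiliar with the metric, the following sketch computes N50 using its standard definition: the largest length L such that sequences of length at least L together cover at least half of the total assembled length. It is an illustration only, not hifiasm's own code, and the input lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This sequence pushes the cumulative length past the halfway mark.
            return length
    raise ValueError("n50() needs at least one sequence length")


# Made-up unitig/contig lengths, for illustration only.
print(n50([2, 3, 4, 5, 6]))
```

The lengths can be taken from the sequences of any of the output graphs, so the same function applies to both the unitig and the contig N50 reported in the table.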

Note that different species call for different assembly graphs. For homozygous genomes (e.g., Human CHM13), the primary assembly contig graph is the best choice. For species with a high heterozygosity rate (e.g., Butterfly), the haplotypes can be fully separated. Here it is important to remove small bubbles from the haplotype-resolved unitig graph, because some of them are caused by somatic mutations or noise in the data rather than by real haplotype differences; the haplotype-resolved processed unitig graph without small bubbles is therefore the better choice. For an ordinary human genome (e.g., Human NA12878 and HG002), the haplotypes cannot be fully separated because of the low heterozygosity rate. Many small bubbles carry haplotype information and cannot simply be removed, so the haplotype-resolved raw unitig graph should be used. Hifiasm will generate a universal haplotype-resolved contig graph for all species in the near future.

Usage

To assemble Hifi reads, a typical command line looks like:

./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz

where NA12878.fq.gz is the input reads and -o specifies the prefix of the output files. In this example, all output files are written to NA12878.asm.*. The options -k, -t and -r specify the k-mer length, the number of CPU threads, and the number of correction rounds, respectively. On its first run, hifiasm saves all overlaps to disk so that later runs can skip the time-consuming all-to-all overlap calculation: when overlaps from a previous run are available, hifiasm loads them from disk and proceeds directly to assembly. To ignore the pre-computed overlaps, specify -i.
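
When scripting several runs, it can help to build the command line programmatically. The sketch below is a hypothetical helper, not part of hifiasm: it only assembles an argument list from the options described above (-o, -t, -k, -r and -i) and prints it. The function name, its parameters, and the assumption that the binary is at `./hifiasm` are illustrative.

```python
import shlex


def hifiasm_command(reads, prefix, threads, kmer=None, rounds=None,
                    ignore_saved_overlaps=False, exe="./hifiasm"):
    """Build a hifiasm argument list from the options described above."""
    cmd = [exe, "-o", prefix, "-t", str(threads)]
    if kmer is not None:
        cmd += ["-k", str(kmer)]    # k-mer length
    if rounds is not None:
        cmd += ["-r", str(rounds)]  # number of correction rounds
    if ignore_saved_overlaps:
        cmd.append("-i")            # ignore overlaps saved by a previous run
    cmd.append(reads)
    return cmd


# First run: overlaps are computed and saved to disk.
first = hifiasm_command("NA12878.fq.gz", "NA12878.asm", threads=32)
# A run that ignores the saved overlaps and sets -k and -r explicitly.
fresh = hifiasm_command("NA12878.fq.gz", "NA12878.asm", threads=32,
                        kmer=40, rounds=2, ignore_saved_overlaps=True)

print(shlex.join(first))
print(shlex.join(fresh))
```

To actually launch a run, the returned list can be passed to `subprocess.run(cmd, check=True)`.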

Please note that some old Hifi reads may contain short adapters. To improve the assembly quality, adapters should be removed with -z as follows:

./hifiasm -o butterfly.asm -t 42 -z 20 butterfly.fq.gz

In this example, hifiasm will remove 20 bases from both ends of each read.
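
To make the effect of the trimming concrete, the sketch below removes a fixed number of bases from both ends of a read. It illustrates the operation only and is not hifiasm's implementation. In particular, returning an empty string for reads that are too short to trim is an assumption of this sketch, since the handling of such reads is not described here.

```python
def trim_ends(seq, n=20):
    """Return `seq` without its first and last `n` characters."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if len(seq) <= 2 * n:
        # Assumption of this sketch: nothing is left of a read this short.
        return ""
    return seq[n:len(seq) - n]


def trim_record(seq, qual, n=20):
    """Trim a FASTQ record, keeping sequence and quality string in sync."""
    return trim_ends(seq, n), trim_ends(qual, n)


# Made-up read with n=2 so the effect is easy to see.
print(trim_ends("AACCGGTT", n=2))
print(trim_record("AACCGGTT", "IIIIHHHH", n=2))
```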

Getting Help

For a detailed description of the options, please see man ./hifiasm.1. The -h option of hifiasm also prints a brief description of the options. If you have further questions, please raise an issue on the issue page.

Limitations and future work

  1. For genomes with a low heterozygosity rate, hifiasm only outputs a haplotype-resolved assembly graph rather than a phased chromosome-level assembly (support for such output is planned).

  2. Hifiasm outputs different assembly graphs for different species, which makes them hard to use. A universal haplotype-resolved contig graph for all species is planned.

  3. The running time and memory usage should be further reduced.

  4. The N50 should be further improved.
