misassembly introduced in hifiasm v0.5 #10

Closed
HenrivdGeest opened this issue Apr 17, 2020 · 16 comments
Comments

@HenrivdGeest

I am following hifiasm for our plant genome assembly, and noticed that after updating from hifiasm v0.3 to v0.5 the assembly N50 increased, but two apparent misassemblies in my genome got my attention.
I highlight one in the images below. We have a tetraploid plant of ~400 Mb haploid size, and we have >100x HiFi coverage of the haploid genome, meaning ~25x per haplotype.
I already assembled this genome with HiCanu, and noticed there too that changing the bogart (assembly) parameters too much easily introduces misassemblies. As far as I know, I can't alter the settings for hifiasm.
In this image you see the alignment of the contigs to a related public reference, with the assembly error:
image
If I look at the same positions (zoomed in) but for the unitigs, I see that the unitigs do not contain this error:
image
Are there any ways to make the contigging more stringent? (Version 0.3 did not have this error yet.)
I have to say that I am not 100% sure this is not a true biological case, but I am confident that this is an assembly error, also from what I've seen with HiCanu. If you need more info, let me know.
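
The coverage figures in this report (>100x over the haploid genome, about 25x per haplotype for a tetraploid) follow from a simple division. The sketch below spells that arithmetic out and lists where coverage peaks would be expected for sequence shared by 1 to 4 haplotypes. It assumes reads are spread evenly over all haplotypes and works in base coverage, not k-mer coverage; the function names are illustrative only.

```python
def per_haplotype_coverage(total_coverage, ploidy):
    """Coverage of a single haplotype, assuming reads are spread evenly."""
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return total_coverage / ploidy


def expected_peaks(total_coverage, ploidy):
    """Expected coverage for sequence present in 1..ploidy haplotypes."""
    unit = per_haplotype_coverage(total_coverage, ploidy)
    return [unit * copies for copies in range(1, ploidy + 1)]


# Figures from the report: 100x over the haploid genome, tetraploid.
print(per_haplotype_coverage(100, 4))  # 25.0
print(expected_peaks(100, 4))          # [25.0, 50.0, 75.0, 100.0]
```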

@chhylp123
Owner

Thanks so much for pointing that out. It seems that the misassemblies are caused by our new purge_dups step. I will expose its parameters to users today, and let you know when it is available.

@lh3
Collaborator

lh3 commented Apr 17, 2020

v0.5 fixes one misassembly on our data, but generally hifiasm does produce large misassemblies occasionally. @chhylp123 will expose some purge_dups parameters, which may help. We are also thinking about the possibility of integrating some bogart heuristics in the future.

@chhylp123
Owner

I have exposed three purge_dups parameters: '-l', '-s', '-O'. I guess they may help fix the misassemblies. Please use the latest commit with version 0.5-dirty-r247 (hash: 7f6725e). Looking forward to your results, @HenrivdGeest.

@HenrivdGeest
Author

Thanks, I will start a sweep asap. Is there a way to reuse the error-corrected reads? For parameter testing, it would be faster to run that step only once.

@lh3
Collaborator

lh3 commented Apr 19, 2020

If you specify the same "-o prefix" option, hifiasm will reuse "prefix.*.bin" files and skip error correction and overlapping. Note that older assembly graphs will be overwritten if you do this.

By the way, what is the heterozygosity of your genome? v0.5 outputs multiple k-mer histograms in stderr. Is it possible to show us the first histogram? Thanks.
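
Because rerunning with the same "-o prefix" overwrites older assembly graphs, it can help to copy the graphs aside before each rerun. The following is a minimal sketch of that idea, not part of hifiasm. It assumes the graphs are written as `<prefix>*.gfa` files next to the `.bin` files; the helper name `archive_graphs` and the directory layout are hypothetical.

```python
import glob
import os
import shutil


def archive_graphs(prefix, dest_dir):
    """Copy every existing `<prefix>*.gfa` file into dest_dir.

    The `.bin` files are left in place so the next run with the same
    prefix can still reuse them. Returns the list of copied paths.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    # glob.escape protects prefixes that contain characters such as '[' or '*'.
    for path in sorted(glob.glob(glob.escape(prefix) + "*.gfa")):
        target = os.path.join(dest_dir, os.path.basename(path))
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

For example, calling `archive_graphs("asm", "runs/s0.75_O1")` after one run and before the next would keep that run's graphs under their own directory.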

@HenrivdGeest
Author

I am running the loop now. The first kmer plot:
image
I also ran GenomeScope earlier with k=21 on the hifiasm error-corrected reads (on more data, so the coverage peaks are at different numbers):
image
Here you clearly see 4 peaks. The hifiasm histogram shows just 2, or barely 3, but its k-mer size is 51, so perhaps 2 haplotypes are already completely separated at that size?

@lh3
Collaborator

lh3 commented Apr 20, 2020

Thanks a lot! Is this public data? We have been mostly using animal data for testing, where the heterozygosity is much lower than in some plant genomes. While redwood heterozygosity is high (higher than your genome) and its data is public, it takes several days to assemble, making testing very difficult. If your data is public or can be shared with us privately, it would be ideal for development purposes. Some hifiasm parameters are tuned for low heterozygosity, so we may have a lot of room for improvement.

Also, what is the preferred output for your genome? Is it one primary assembly representing one haplotype?

@HenrivdGeest
Author

I made 18 assemblies, covering all combinations of -l (0, 1, 2), -s (0.75, 0.90) and -O (1, 5, 10), but all of them show exactly 2 misassemblies. I evaluated my v0.3 assembly the same way, and that one has none. The N50s of v0.3 and the "v0.5 -l0,-s0.75 -l1" version are almost equal:
image
@lh3: My haploid genome size is around 400-600 Mb. I was never able to collapse my assemblies to lower than 800 Mb with purge_haplotigs, so I think some haplotypes are very heterozygous and some are not, and the latter can be easily collapsed (hence the reason why I also never get a 2 Gb genome FASTA). Our plant is called a 'segmental allopolyploid' (so the other parts are autopolyploid).
Regarding what I want as output: in this case I want to have the unitigs, separating the haplotypes as much as possible. (I know all my previous remarks are about the contigs.) The unitigs are up to 1.6 Gb in size, but still contain collapsed haplotypes, probably because there are no SNPs to be found between alleles. This data is confidential and can therefore, unfortunately, not be shared.
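
The 18-assembly sweep described in this comment is the full grid of 3 x 2 x 3 parameter values. A minimal sketch of how such a grid could be generated is shown below. It only builds the command lines and does not run them. The flags `-o`, `-l`, `-s` and `-O` are the ones named in this thread; the prefix `asm` and the input file name `reads.fq.gz` are placeholders.

```python
import itertools
import shlex

# Values taken from the sweep described in the thread.
L_VALUES = (0, 1, 2)
S_VALUES = (0.75, 0.90)
O_VALUES = (1, 5, 10)


def sweep_commands(prefix, reads):
    """Return one hifiasm argument list per (-l, -s, -O) combination."""
    commands = []
    for l_val, s_val, o_val in itertools.product(L_VALUES, S_VALUES, O_VALUES):
        commands.append([
            "hifiasm", "-o", prefix,
            "-l", str(l_val), "-s", str(s_val), "-O", str(o_val),
            reads,
        ])
    return commands


# Print the commands for inspection; the file names are placeholders.
for cmd in sweep_commands("asm", "reads.fq.gz"):
    print(shlex.join(cmd))
```

Every command here shares one prefix so that the `.bin` files can be reused, which also means each run overwrites the graphs of the previous one; the outputs need to be saved between runs.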

@chhylp123
Owner

I see. Thanks a lot. Let me check what's the difference between v0.3 and v0.5...
BTW, could you please run v0.5 on the bin files generated by v0.3?

@chhylp123
Owner

chhylp123 commented Apr 20, 2020

I think you can try to further increase -s and -O. I guess even '-s 0.99' is fine for hifiasm. Actually, '-s 1' doesn't mean hifiasm only purges exactly identical haplotigs; it still allows differences.

@HenrivdGeest
Author

I tried increasing -s to 0.999 and -O to 75, but that did not make any difference in the output regarding the misassembly. I tried running v0.5 on the v0.3 bin files, but for v0.3 I only have a *reverse.bin, and that alone doesn't seem to work; I think it starts assembling from scratch.

@chhylp123
Owner

Thanks a lot, I will expose another option to users. I believe that can avoid these two mis-assemblies.

@HenrivdGeest
Author

By any chance, any luck with these new options?

@chhylp123
Owner

Oh, I'm so sorry, I forgot that : ( . I will expose it today.

@chhylp123
Owner

chhylp123 commented May 26, 2020

Please give me one more day, I will fix it soon. Thanks a lot : )

@chhylp123
Owner

I have exposed '-u' to disable post-joining (0.7-dirty-r256). Hope it is helpful. I'm so sorry for the delay : (.
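
As a rough illustration of how the '-u' switch mentioned above could be added to a run, the sketch below builds a hifiasm command line with or without it. It relies only on what this thread states about '-u' in 0.7-dirty-r256; the flag's meaning in other versions should be checked against that version's help output. The helper name, the prefix and the reads file name are placeholders.

```python
def hifiasm_command(prefix, reads, disable_post_join=False):
    """Build a hifiasm argument list, optionally adding '-u'.

    '-u' is the option described in this thread as disabling post-joining.
    """
    cmd = ["hifiasm", "-o", prefix]
    if disable_post_join:
        cmd.append("-u")
    cmd.append(reads)
    return cmd


# Placeholder names; pass the list to subprocess.run to actually execute it.
print(" ".join(hifiasm_command("asm", "reads.fq.gz", disable_post_join=True)))
```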


3 participants