large number of 1bp or small contigs #115

Closed
osilander opened this issue Jun 22, 2023 · 4 comments
Labels
question Further information is requested Stale

Comments

@osilander

I'm trying to assemble a ~2.3 Gbp genome from ~60X ONT reads (latest chemistry and Dorado basecalls). I get a very large number of contigs (40K), but many are 1bp or otherwise quite short, although contig N50 is relatively good (6Mbp) and contig N80 is ~3Mbp. The distribution falls off right after that, with N95 around 35Kbp compared to 300-500Kbp N95 for other assemblers that otherwise behave similarly at the top end (raven and nextdenovo).
If I filter out contigs shorter than 2Kbp there are still 25K contigs. For the most part these contigs should not even exist, as the read set itself has no reads below 2Kbp.
Thanks, trying to get a handle on this output, seems like a very good assembler.

@lcoombe
Member

lcoombe commented Jun 22, 2023

Hi @osilander,

Thanks for reaching out! Those shorter contigs are likely primarily due to the tigmint-long step, which detects and cuts the 'goldtigs' (golden path reads, pre-scaffolding) at putative misassemblies/chimeric regions. Depending on where those cuts are made you can end up with these very short sequences, which can be safely filtered out of the assembly.
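
As an illustration of the filtering step mentioned above, here is a minimal sketch in plain Python. It is not part of GoldRush; the function names `read_fasta` and `filter_short_contigs` and the `min_len` threshold are hypothetical and chosen only for this example.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short_contigs(lines: Iterable[str], min_len: int = 1000) -> List[Tuple[str, str]]:
    """Keep only records whose sequence is at least min_len bases long.

    min_len is an illustrative cutoff, not a GoldRush default.
    """
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_len]
```

To run it on an assembly, pass an open file handle, for example `filter_short_contigs(open("assembly.fa"), min_len=2000)`, where the file name is a placeholder.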

It is also possible to have sequences shorter than the read lengths because the initial GoldPath stage performs some trimming on reads while generating the goldtigs/golden path (~1X representation of the underlying genome).

I hope that makes sense - just let us know if you have any other questions!

Thank you for your interest in GoldRush!
Lauren

@lcoombe added the question (Further information is requested) label Jun 22, 2023
@osilander
Author

Thanks for the explanation.
I was looking a little more into this and found that the contig length distribution seems quite odd. There are many contigs that are exactly at (or very close to) specific round numbers: 2,000bp, 3,000bp, etc.

This becomes very apparent when you look at the histogram or cumulative curves (see below). For example, I have 40,014 total contigs. 2,059 are between 1,001bp and 1,999bp in length, but 2,673 are exactly 2,000bp in length. Similarly, 3,025 are between 2,001bp and 2,999bp in length; 138 are exactly 3,000bp, and 316 are between 2,999bp and 3,001bp.

My read length distributions are very continuous (ONT 10.4.1, dorado basecalls). This contig length pattern continues up to approximately 20,000bp - there are unexpected bumps in contig lengths at 4,000 5,000 6,000 7,000 etc.
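
For anyone wanting to check their own assembly for the same pattern, a rough sketch of the tally is below. It assumes the contig lengths have already been collected into a list; the function name `tally_round_lengths` and the `step` parameter are hypothetical names for this example.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_round_lengths(lengths: Iterable[int], step: int = 1000) -> Tuple[Counter, Counter]:
    """Separate contigs whose length is an exact multiple of step from the rest.

    Returns (exact, between):
      exact[n]   = number of contigs with length exactly n, where n % step == 0
      between[b] = number of contigs with b < length < b + step (b is the lower bin edge)
    """
    exact, between = Counter(), Counter()
    for n in lengths:
        if n % step == 0:
            exact[n] += 1
        else:
            between[(n // step) * step] += 1
    return exact, between
```

Comparing `exact[2000]` with `between[1000]` and `between[2000]` gives the same kind of comparison as the counts quoted above.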

There is also a strange drop-off in contigs that are greater than 1,000bp compared to less than 1,000bp (attached).
goldrush-hist.pdf

Is this possibly something specific to my install? I'm on Ubuntu 20.04.5 with goldrush v1.0.1, and I get no errors/warnings during assembly. Have you ever seen this before?

@jwcodee
Member

jwcodee commented Jun 23, 2023

Hello. You see a lot of contigs at those specific lengths because the GoldPath module within GoldRush evaluates each read as a series of non-overlapping tiles, which are 1,000 bp long by default, with the exception of the last tile. Part of the GoldPath module involves trimming reads based on overlap, and since GoldPath evaluates reads as a collection of tiles, trimming is done by removing tiles. The trimmed read will therefore be of length either (x remaining tiles * 1,000 bp) or ((x - 1) remaining tiles * 1,000 bp + length of the last tile).
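
The arithmetic can be illustrated with a small sketch. This is not GoldRush code: the function names are hypothetical, and it assumes that trimming removes whole tiles from either end of a read so that the remaining tiles stay contiguous, and that any remainder shorter than the tile size forms its own last tile.

```python
from typing import List


def tile_lengths(read_len: int, tile: int = 1000) -> List[int]:
    """Split a read into non-overlapping tiles; the last tile holds the remainder."""
    full, rem = divmod(read_len, tile)
    return [tile] * full + ([rem] if rem else [])


def possible_trimmed_lengths(read_len: int, tile: int = 1000) -> List[int]:
    """Lengths reachable by dropping whole tiles from either end of the read.

    Assumption for illustration: at least one tile is kept and the kept
    tiles are contiguous.
    """
    tiles = tile_lengths(read_len, tile)
    lengths = set()
    for start in range(len(tiles)):
        for end in range(start + 1, len(tiles) + 1):
            lengths.add(sum(tiles[start:end]))
    return sorted(lengths)
```

Under these assumptions, a 3,400 bp read can only end up as 400, 1,000, 1,400, 2,000, 2,400, 3,000 or 3,400 bp. Any trimmed read that loses its last tile lands on an exact multiple of 1,000 bp, which matches the spikes in the histogram.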

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your interest in GoldRush!
