Choosing the best genome assembly #220
Both the k=144 and k=160 assemblies look good. k=144 is more contiguous. k=160 has a larger assembly by 50 Mbp (5%), which is pretty significant.
KAT is a great tool to compare k-mer spectra and assess whether the genome is over-collapsed or over-expanded. It may help you pick the better assembly. https://github.com/TGAC/KAT#readme
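For reference, a minimal sketch of how such a KAT comparison might be run; the file names, thread count, and exact flags are assumptions to adapt to your data:
# Compare the read k-mer spectrum against each candidate assembly.
kat comp -t 16 -o k144_vs_reads 'reads_R1.fq.gz reads_R2.fq.gz' abyss_k144-scaffolds.fa
kat comp -t 16 -o k160_vs_reads 'reads_R1.fq.gz reads_R2.fq.gz' abyss_k160-scaffolds.fa
The resulting spectra-cn plots show whether read k-mers are missing from an assembly (over-collapsed) or present in too many copies (over-expanded).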
Does Pilon take 1,200 wall hours (50 days!) for a 1 Gbp genome? That seems high to me.
So it appears you agree k128 < k144 < k160, right? Regarding Pilon, um, yes, it really runs on all 104 cores available. Java grew to those 2.8 TB after two days and does not ask for more (good), and since Pilon spits out a few lines to the log file once in a while about regions processed, it appears those 45k regions are about a quarter of all my 190k gaps in scaffolds. For example:
When I count the
Do you have 2.8 TB of memory installed on that machine? If not, I'd recommend reducing the number of cores so that the job fits in available memory. It'll take forever when swapping to disk.
I agree that both k=144 and k=160 are better than k=128. I don't see a clear winner between k=144 and k=160. k=144 is more contiguous. k=160 appears more complete. All things being equal, I usually go for the more contiguous assembly. I'd prefer k=144 here. The results of KAT may help assess whether k=160 truly is more complete, or just over-expanded. At this point I'd suggest running BUSCO on both k=144 and k=160, and picking whichever assembly is more complete.
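A minimal sketch of that BUSCO comparison, assuming BUSCO v3; the lineage dataset, thread count, and file names are placeholders:
# Assess gene-space completeness of each assembly with the same lineage set.
run_BUSCO.py -i abyss_k144-scaffolds.fa -l metazoa_odb9 -m genome -c 16 -o busco_k144
run_BUSCO.py -i abyss_k160-scaffolds.fa -l metazoa_odb9 -m genome -c 16 -o busco_k160
Comparing the Complete and Duplicated percentages helps tell genuine extra completeness apart from over-expansion.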
The machine has 3.2 TB of RAM and is a NUMA architecture with 112 CPU cores.
What's the structure of the NUMA memory? How much memory is available in a single uniform-access unit? I'm not familiar with NUMA terminology, so I'm not sure what the typical name is for that unit.
Tough question for a molecular biologist. It is an SGI UV200 with 14 x Intel Xeon E5-4627v2, 3.3 GHz, 8 cores per "node"/"chunk". Does the below answer your question better?
Maybe the short answer could be: the whole 3.2 TB of RAM appears as local memory.
Neat. I haven't yet used a NUMA system. As an experiment you may want to try Pilon with 8 threads and 248 GB of RAM, ensuring that the 8 threads are bound to the same socket.
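A minimal sketch of that experiment, assuming numactl is available; the node index, heap size, and input files are placeholders:
# Pin both the 8 Pilon threads and the Java heap to a single NUMA node.
numactl --cpunodebind=0 --membind=0 \
  java -Xmx240G -jar pilon-1.22.jar --genome assembly.fa --frags frags.bam \
    --threads 8 --output pilon_node0
(numactl --hardware reports how the memory is distributed across the nodes.)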
But Pilon seems to require at least 1.2 TB for my data; I ran out of memory a few times in earlier tests. Maybe Pilon is just unpacking my BAM files all the time only to realize there is not enough coverage? Working with SAM files would probably be much faster; I have again forgotten the difference in their indexing approaches.
1.2 TB seems high to me for polishing. Reducing the number of threads may also reduce the memory requirement. The authors of Pilon may have more suggestions. You could split your assembly up into 12 equal-sized chunks, and polish each chunk separately.
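A rough sketch of the chunking idea; the chunk count, heap size, and file names are placeholders, and Pilon's --targets option is assumed here to accept a comma-separated list of sequence names:
# Split the scaffold names into 12 roughly equal lists and polish each list separately.
samtools faidx assembly.fa
cut -f1 assembly.fa.fai > scaffold_names.txt
split -n l/12 scaffold_names.txt chunk_
for c in chunk_*; do
  java -Xmx200G -jar pilon-1.22.jar --genome assembly.fa --frags frags.bam \
    --targets "$(paste -sd, "$c")" --threads 8 --fix all,amb --output "pilon_$c"
done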
And some intermediate results, if I am parsing/interpreting the log file properly. There are 195687 scaffolds involved... remember the line from
I cancelled the Pilon process. I think that for every gap region it was re-aligning not only the reads flanking the region but also all unaligned reads from the BAM files. Given that the long mate-pair datasets are a mixture of FR and RF reads (and badly separated into sub-groups by any tool), I think it was spending unnecessarily long on this.
PE2016 vs. PE2017
If bubble popping were perfect (which it usually is not), the black and red components at 0.5x copy number (in your case ~20x depth) would be roughly equal sized for a diploid organism. So it looks somewhat over-expanded, but not badly so.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hi,
I am moving my posts from a thread which I somewhat hijacked in the past (#178) for easier reading. I will add the comments of @sjackman here too.
Interesting, the k=128 assembly went better IMO than the k=64 assembly. These are the contigs, but I think it is correct to check those well-assembled segments before scaffolding happens. What am I missing here? Do you think the counts are larger because the k=128 assembly contains more redundant contigs? Was our discussion only about sealing gaps?
And when I compare the final scaffolds:
@sjackman commented on Feb 8
k=64 is the more contiguous assembly. Try k=80, 96, 112. I'd suggest using ntCard and GenomeScope first, before assembling with ABySS, to get a sense of the coverage and k-mer distribution of your data.
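A minimal sketch of that pre-assembly check; the k values, thread count, and file names are placeholders:
# Count k-mer frequency histograms for a few candidate k values.
ntcard -k 64,96,128 -t 16 -p freq reads_1.fq.gz reads_2.fq.gz
The freq_k*.hist histograms can then be fed to GenomeScope (http://qb.cshl.edu/genomescope/) to estimate genome size, heterozygosity, and coverage; the leading F0/F1 summary lines may need to be stripped first.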
@mmokrejs commented on Feb 8
Those numbers were from step -3. In the final scaffolds (the -8 files) there is:
Those KB are certainly not kilobytes; it is just weird.
@sjackman commented on Feb 8
Perhaps it's a bug in BBmap-37.90/stats.sh. I'm guessing it should be 2,736. k=64 is more contiguous for both unitigs and scaffolds.
@mmokrejs commented on Feb 8
I asked Brian Bushnell for the meaning of the MB and KB in the output; here is the answer.
Indeed, those are Kbp and Mbp; I should change them. You can add the flag "format=2" to print whole numbers of bp with no commas or decimals, instead of Kbp or Mbp. Also, note that "L50" for stats.sh means a Length and "N50" means a Number of contigs, so the L50 (length of 50th percentile contig) increased from 232 to 2736 in that case, when K increased, meaning continuity improved. There are some programs that have L50 and N50 reversed.
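For reference, applying the flag Brian mentions would look roughly like this (the input file name is a placeholder):
# Print lengths as plain base pairs instead of Kbp/Mbp.
BBmap-37.90/stats.sh in=abyss_k128-scaffolds.fa format=2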
@sjackman commented on Feb 9
I'm afraid Brian has the definition of N50 and L50 reversed. It's a common misunderstanding due to an unfortunate nomenclature. N50 is the length of the contig, and L50 is the number of the contigs whose size is N50 or larger. Yes it's weird, but that's the way it is.
See https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics
and http://quast.bioinf.spbau.ru/manual.html#sec3.1.1
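To make the distinction concrete, here is a small worked example with made-up contig lengths (any POSIX shell with awk should do):
# Five contigs of 80, 50, 40, 30 and 20 kbp: total 220, half of that is 110.
# The running sum of sorted lengths first reaches 110 at the 2nd contig (80+50=130),
# so N50 = 50 (a length) and L50 = 2 (a count of contigs).
printf '%s\n' 80 50 40 30 20 |
  sort -rn |
  awk '{ len[NR] = $1; total += $1 }
       END { half = total / 2
             for (i = 1; i <= NR; i++) { sum += len[i]
               if (sum >= half) { print "N50=" len[i], "L50=" i; exit } } }'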
@mmokrejs commented on Feb 11
You asked for the numbers from abyss-fac so that we do not have to wonder what Brian may have swapped. The mapper used was abyss-map (I have stuck to the default so far). Based on uncorrected input reads:
@sjackman commented on Feb 13
k=128 looks like the better of these two assemblies to me. I'd suggest trying larger values of k. Note that the assembly size differs quite a bit between the two assemblies; that's something to keep in mind when interpreting these numbers. I'd suggest adding a -G parameter to abyss-fac, set to your best estimate of the genome size, to calculate NG50, which is generally better for comparing the contiguity of two assemblies. Note that abyss-fac can accept multiple FASTA files on its command line.
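A minimal sketch of that suggestion; the 1 Gbp estimate and the file names are placeholders:
# Report NG50 (via -G, the genome-size estimate) for several assemblies at once.
abyss-fac -G 1000000000 abyss_k64-scaffolds.fa abyss_k128-scaffolds.fa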
@mmokrejs commented on Feb 13
I am glad that now you agree the k128 result is better than k64. Pity I confused you with the output from stats.sh. The k112 and k156 computations are ongoing; the k156 one will take 2 extra days on top of the roughly 35 hours of computation, due to cluster maintenance.
You could try some ABySS 2 Bloom-filter dBG assemblies in the meantime if you like, which require about a tenth of the RAM. Perhaps that'd be easier for you to get scheduled on your cluster.
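A minimal sketch of such a run, assuming ABySS 2.x; the Bloom filter memory budget (B), k, and read file names are placeholders to tune for your data:
# Bloom-filter de Bruijn graph mode is enabled by setting a Bloom filter memory budget (B).
abyss-pe name=asm_k144_bloom k=144 B=100G in='reads_1.fq.gz reads_2.fq.gz'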
@mmokrejs commented on Mar 19
So, finally I have the assemblies with k160 and k192. I updated the tables above to reflect the new values.
@sjackman commented on Mar 19
k=160 has the peak NG50. k=128 has the peak E-size, N50, and largest scaffold. Either of those assemblies looks good to me.
@mmokrejs commented on Mar 20
I had the impression I should try k144 too. The input for k160 and k192 were error-corrected reads; for assemblies <=k128 I used uncorrected reads. I will use the error-corrected reads for k144 again. It decreases the memory usage and increases the average coverage by a tiny bit, but maybe it still helps the assembly overall.
Based on uncorrected input reads:
Based on error-corrected input reads (using tadpole.sh and k=64; see the sketch at the end of this comment):
@sjackman would you please comment on the k144 and k160 assemblies. I think k144 is the best of the k128-k192 range (it has a very high actually assembled genome size and still a very high max scaffold, its L50 is even better than in k128, and its NG50 is somewhat in between). Given that the gap-closing steps are so time-consuming, isn't it however a better approach to opt for the k160 assembly? It provides an almost complete genome (unless the actually assembled genome size is inflated due to redundant contigs) and is maybe a better base for future work.
I am running
java -Xmx2800G -jar pilon-1.22.jar --output abyss_ecc_k128_ecc_N10_pilon --diploid --changes --vcf --fix all,amb --threads 104 ...
but it will last for about 1,200 wall-clock hours. As we discussed elsewhere, ABySS' Sealer runs multi-threaded in the very beginning but then switches to a single-threaded job. Moreover, I have the impression it would just make a perfect consensus of my sequences inside "gaps", whereas Pilon scans the BAM files and works with the spanning read pairs (so it could pick the right set of two alleles). Unfortunately, it works only with proper pairs, and way too many read pairs I obtained from BBMap have a "wrong" orientation or insert size (note the mix in the Nextera mate-pair libs), so it works with far fewer reads. What is your strategy? I shall try the BAMs created by abyss-map too, if it cares to set the proper read flags. BWA doesn't, I think, which is why I did not try it.
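For reference, a minimal sketch of the tadpole.sh error-correction step mentioned above; the file names are placeholders:
# Error-correct the paired reads with k=64 before reassembly.
tadpole.sh in=reads_1.fq.gz in2=reads_2.fq.gz out=ecc_1.fq.gz out2=ecc_2.fq.gz mode=correct k=64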