Question on assembly seeds. #296
Comments
Thank you for the nice words. The variability is caused by the use of dynamic load balancing in the parallel phases of the computation. See this Wikipedia article for a detailed explanation, but in a few words it goes like this. Say I have [...]

This process is not deterministic because the time to execute each task is different and can vary slightly from one execution to another. So if you run the process twice, the [...]

Clearly, this process is faster (possibly much faster) than assigning in advance each of the [...]

In Shasta, I made a design decision to favor performance over detailed repeatability, and therefore the code makes no attempt to achieve detailed reproducibility. Clearly, there are pros and cons to this design decision. So you can look at the lack of detailed reproducibility as a price to be paid for the high assembly performance achieved by Shasta.
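To make the mechanism concrete, here is a minimal sketch in Python. It is not Shasta's code, and the names `run_dynamic` and `run_static` are hypothetical. Worker threads pull tasks from a shared queue as soon as they are free, so which worker handles which task, and the order in which results are appended, can differ between runs. A fixed, up-front assignment is shown for contrast (computed serially here for simplicity).

```python
import queue
import threading


def run_dynamic(tasks, n_workers):
    """Dynamic load balancing: each worker takes the next pending task when free."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    completed = []  # (worker_id, task, result), in completion order
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # no work left
            result = task * task  # stand-in for real work
            with lock:
                completed.append((worker_id, task, result))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return completed


def run_static(tasks, n_workers):
    """Static assignment: task i always goes to worker i % n_workers."""
    return [(i % n_workers, task, task * task) for i, task in enumerate(tasks)]


if __name__ == "__main__":
    tasks = list(range(20))
    first = run_dynamic(tasks, 4)
    second = run_dynamic(tasks, 4)
    # The set of results is the same, but the order and the
    # worker-to-task mapping are not guaranteed to match.
    print("same order and assignment:", first == second)
    print("same results:", sorted(r for _, _, r in first) == sorted(r for _, _, r in second))
```

Any later step that depends on the completion order or on which thread handled a task can therefore produce slightly different output from run to run, even though every individual task is computed correctly.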
Hi @paoloczi, thanks for the thorough and very informative answer. I really appreciate the time taken to answer it. I can say that I now understand the behaviour of the tool in this context, and it all makes sense. I also think that, compared to other tools, it was very clever to opt for speed in this trade-off so the tool can stand its ground as a competitor. Thanks again for the time taken and for this awesome work.
Hi,
Recently we have been working on including the tool as a module in the nf-core community.
nf-core/modules#1615 (review)
However, during test runs, we saw that the assembly sequences vary when the same dataset is run twice.
Is this somewhat expected based on how the tool works? Is there any use of random seeds or something similar that we can set to keep the assembly from changing across executions? Or are we doing something wrong?
Thanks for this awesome tool :)