
Question on assembly seeds. #296

Closed

fmalmeida opened this issue Sep 29, 2022 · 2 comments

Comments

@fmalmeida

fmalmeida commented Sep 29, 2022

Hi,

recently we have been working on including the tool as a module in the nf-core community.

nf-core/modules#1615 (review)

however, during test executions we saw that the assembled sequences vary when the same dataset is run twice.

Is this somewhat expected based on how the tool works? Does the tool use random seeds, or something similar that we can set, to keep the assembly from changing across executions? Or are we doing something wrong?

Thanks for this awesome tool :)

@paoloczi
Contributor

Thank you for the nice words. The variability is caused by the use of dynamic load balancing in the parallel phases of computation. See this Wikipedia article for a detailed explanation, but in a few words it goes like this. Say I have N tasks to execute, and I don't know in advance the time each task will take. I use M threads in parallel, and for simplicity in this description let's assume that M < N. In dynamic load balancing, each of the M threads starts running one of the N tasks. When each thread finishes, it starts running one of the N tasks that has not already been started. The process finishes when all threads are done running their assigned tasks, and all N tasks have been run.
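For illustration only, here is a minimal C++ sketch of that scheme (not Shasta's actual code; the task count N and thread count M are hypothetical): worker threads claim task indices from a shared atomic counter, so which thread runs which task depends on timing.

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const uint64_t N = 100;   // number of tasks (hypothetical)
    const unsigned M = 4;     // number of worker threads (hypothetical)

    std::atomic<uint64_t> nextTask{0};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < M; t++) {
        workers.emplace_back([&nextTask, N] {
            while (true) {
                // Claim the next unstarted task; stop when none are left.
                const uint64_t i = nextTask.fetch_add(1);
                if (i >= N) break;
                // ... run task i here ...
            }
        });
    }
    for (auto& w : workers) w.join();
    std::cout << "All " << N << " tasks completed.\n";
}
```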

This process is not deterministic because the time to execute each task is different and can vary slightly from one execution to another. So if you run the process twice, the N tasks can be executed by different threads, and when the results are combined this can cause differences in the numbering or ordering of objects, which in turn cause the slight differences in results that you observe. It is often possible, in principle, to take coding steps after the fact that reduce or eliminate these small differences, but doing so would generally carry an additional performance cost, on top of development and maintenance costs, because it would add additional constraints to the code.
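Again purely as an illustration (not Shasta's code), the sketch below shows how completion-order output becomes non-deterministic: each task appends its id to a shared vector when it finishes, so the resulting order typically differs between runs. Sorting afterwards would restore determinism, but at the extra cost described above.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

int main() {
    const int N = 8;                   // number of tasks (hypothetical)
    std::vector<int> completionOrder;  // results collected as tasks finish
    std::mutex m;
    std::vector<std::thread> workers;

    for (int i = 0; i < N; i++) {
        workers.emplace_back([i, &completionOrder, &m] {
            // Simulate a task whose running time varies between executions.
            std::mt19937 rng(std::random_device{}());
            std::this_thread::sleep_for(std::chrono::milliseconds(rng() % 20));
            std::lock_guard<std::mutex> lock(m);
            completionOrder.push_back(i);  // order depends on finish times
        });
    }
    for (auto& w : workers) w.join();

    for (int id : completionOrder) std::cout << id << ' ';
    std::cout << '\n';  // ordering typically differs from run to run
}
```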

Clearly, this process is faster (possibly much faster) than assigning each of the N tasks to one of the M threads in advance in a predetermined way (static load balancing), particularly if some of the N tasks run much slower than others, and/or the time to run each task cannot be predicted accurately in advance.

In Shasta, I made a design decision to favor performance over detailed repeatability, and therefore the code makes no attempt to achieve detailed reproducibility. Clearly, there are pros and cons to this design decision. So, you can look at the lack of detailed reproducibility as a price to be paid for the high assembly performance achieved by Shasta.

@fmalmeida
Author

fmalmeida commented Sep 29, 2022

Hi @paoloczi ,

Thanks for the thorough and very informative answer. Really appreciate the time taken to answer it.

I can say that I now understand the behaviour of the tool in this context, and it all makes sense. I also think that, compared to other tools, it was very clever to opt for speed in this trade-off so the tool can stand its ground as a competitor.

Thanks again for the time taken and for this awesome work.
