
Question on assembly seeds. #296

Closed

fmalmeida opened this issue Sep 29, 2022 · 2 comments

Comments

@fmalmeida

fmalmeida commented Sep 29, 2022

Hi,

recently we have been working on including the tool as a module in the nf-core community.

nf-core/modules#1615 (review)

however, during test executions we saw that the assembled sequences vary when the same dataset is run twice.

Is this somewhat expected based on how the tool works? Does the tool use random seeds, or something similar that we can set, to keep the assembly from changing across executions? Or are we doing something wrong?

Thanks for this awesome tool :)

@paoloczi
Contributor

Thank you for the nice words. The variability is caused by the use of dynamic load balancing in the parallel phases of computation. See this Wikipedia article for a detailed explanation, but in a few words it goes like this. Say I have N tasks to execute, and I don't know in advance the time each task will take. I use M threads in parallel, and for simplicity in this description let's assume that M < N. In dynamic load balancing, each of the M threads starts running one of the N tasks. When each thread finishes, it starts running one of the N tasks that has not already been started. The process finishes when all threads are done running their assigned tasks, and all N tasks have been run.
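For illustration only, here is a minimal C++ sketch of that scheme (not Shasta's actual code; the task count N and thread count M are hypothetical): worker threads claim task indices from a shared atomic counter, so which thread runs which task depends on timing.

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const uint64_t N = 100;   // number of tasks (hypothetical)
    const unsigned M = 4;     // number of worker threads (hypothetical)

    std::atomic<uint64_t> nextTask{0};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < M; t++) {
        workers.emplace_back([&nextTask, N] {
            while (true) {
                // Claim the next unstarted task; stop when none are left.
                const uint64_t i = nextTask.fetch_add(1);
                if (i >= N) break;
                // ... run task i here ...
            }
        });
    }
    for (auto& w : workers) w.join();
    std::cout << "All " << N << " tasks completed.\n";
}
```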

This process is not deterministic because the time to execute each task is different and can vary slightly from one execution to another. So if you run the process twice, the N tasks can be executed by different threads, and when the results are combined this can cause differences in the numbering or ordering of objects, which in turn cause the slight differences in results that you observe. It is often possible, in principle, to take coding steps after the fact that reduce or eliminate these small differences, but doing so would generally carry an additional performance cost, on top of development and maintenance costs, because it would add additional constraints to the code.
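Again purely as an illustration (not Shasta's code), the sketch below shows how completion-order output becomes non-deterministic: each task appends its id to a shared vector when it finishes, so the resulting order typically differs between runs. Sorting afterwards would restore determinism, but at the extra cost described above.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

int main() {
    const int N = 8;                   // number of tasks (hypothetical)
    std::vector<int> completionOrder;  // results collected as tasks finish
    std::mutex m;
    std::vector<std::thread> workers;

    for (int i = 0; i < N; i++) {
        workers.emplace_back([i, &completionOrder, &m] {
            // Simulate a task whose running time varies between executions.
            std::mt19937 rng(std::random_device{}());
            std::this_thread::sleep_for(std::chrono::milliseconds(rng() % 20));
            std::lock_guard<std::mutex> lock(m);
            completionOrder.push_back(i);  // order depends on finish times
        });
    }
    for (auto& w : workers) w.join();

    for (int id : completionOrder) std::cout << id << ' ';
    std::cout << '\n';  // ordering typically differs from run to run
}
```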

Clearly, this process is faster (possibly much faster) than assigning each of the N tasks to one of the M threads in advance in a predetermined way (static load balancing), particularly if some of the N tasks run much slower than others, and/or the time to run each task cannot be predicted accurately in advance.

In Shasta, I made a design decision to favor performance over detailed repeatability, and therefore the code makes no attempt to achieve detailed reproducibility. Clearly, there are pros and cons to this design decision. So, you can look at the lack of detailed reproducibility as a price to be paid for the high assembly performance achieved by Shasta.

@fmalmeida
Author

fmalmeida commented Sep 29, 2022

Hi @paoloczi ,

Thanks for the thorough and very informative answer. Really appreciate the time taken to answer it.

I can say that I now understand the behaviour of the tool in this context, and it all makes sense. I also think that, compared to other tools, it was very clever to opt for speed in this trade-off so the tool can stand its ground as a competitor.

Thanks again for the time taken and for this awesome work.
