datasets do not appear in the history for hours #3816
Comments
This is 17.01 I assume. Are the jobs created / finished? How complex are
the workflows (some scheduling issues?)
…On Sun, Mar 26, 2017, 14:54, Björn Grüning wrote:
We are getting more and more reports that workflows do not put datasets into
the history for hours; a few users have encountered a 24h delay.
We do have this setting enabled history_local_serial_workflow_scheduling
<https://github.com/galaxyproject/galaxy/blob/dev/config/galaxy.ini.sample#L1094>
as our users need to have the order of datasets preserved in a history.
Has anyone seen this as well?
15 minutes later: ok, looks like more are queued. That was just very, very unsettling! I imagine @bgruening has this problem much worse than I do, based on the workflows I see him pictured with on Twitter.
@natefoo @jmchilton ping
I'm glad I'm not the only one that sees this. This can also be related to the usage of the cluster, so that the visual feedback is lacking even more, I suppose.
Locally, if I run a workflow with one tool (htseq-count), it is considerably slower than running this one tool 10 times without a workflow. I tried this on usegalaxy.org. It is faster than locally, but I was able to produce the following, which is also bad: multiple identical HIDs in one history on usegalaxy.org.
Duplicate HIDs seem unrelated IMO, so I have created a separate issue in #3818.
Locally, there is some overhead associated with backgrounding the workflows and waiting for a job handler to pick them up, versus just scheduling 10 jobs right in a web thread. There are a lot of optimizations that have been applied to the tool execution path that the workflow scheduling thread cannot leverage as architected, because it processes workflows one at a time. Creating 10 jobs from a tool submission is not 10x slower than creating 1 job, but scheduling 10 workflows is 10x slower than scheduling 1 workflow. So yes: running tools is faster than running workflows with single tools, and this would be magnified by the serial scheduling setting.

There are some things we can do to improve this. I also need to write up some documentation on separating workflow and job scheduling threads - I think performance problems and debugging would be much clearer if these were separate processes. Right now, I/O problems and such in job schedulers can potentially slow down workflow scheduling in more ways than they should be able to. It also means that if a job runner process dies (which we have observed to happen for SLURM and PBS), workflow scheduling dies.

I also think #3659 would mean Galaxy would clean up older workflows that may have errored out in a way I don't understand yet (possible ideas in #3555, which sort of stalled waiting on the merge of #3619). I'll open a PR for the two optimizations.

Update: I believe the above shows a good-faith effort to address the problems caused by invoking many flat workflows with individual datasets, but I do want to point out that in addition to better history UX organization, I believe Galaxy will schedule a single workflow invocation with collections much faster, since it can leverage the tool execution optimizations aimed at creating homogeneous jobs together.
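The cost asymmetry described above (one tool submission creating N jobs versus the scheduler processing N invocations one at a time) can be sketched with a toy model. All constants and function names here are made up for illustration; this is not Galaxy's actual code:

```python
# Hypothetical cost model (made-up units, not measurements) for why
# scheduling N one-step workflows costs ~N times one workflow, while a
# single tool submission creating N jobs pays its overhead only once.

FIXED_OVERHEAD = 10   # per-invocation cost: load workflow, open session, etc.
PER_JOB_COST = 1      # per-job cost: build command line, persist job row


def cost_tool_submission(n_jobs: int) -> int:
    """One submission pays the fixed overhead once, then a small per-job cost."""
    return FIXED_OVERHEAD + PER_JOB_COST * n_jobs


def cost_workflow_scheduling(n_invocations: int, jobs_per_invocation: int = 1) -> int:
    """The scheduler handles invocations one at a time, so the fixed
    overhead is paid again for every invocation."""
    return n_invocations * (FIXED_OVERHEAD + PER_JOB_COST * jobs_per_invocation)


print(cost_tool_submission(10))      # 20: nowhere near 10x the cost of one job
print(cost_workflow_scheduling(10))  # 110: exactly 10x one invocation's cost
```

Under this model, batching 10 jobs into one submission costs 20 units, while 10 separate one-job workflow invocations cost 110, which matches the "scheduling 10 workflows is 10x slower" observation.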
…dom handler. Let's revisit the problem that background scheduling of workflows (the default UI behavior as of 16.10) makes it easier for histories to contain datasets interleaved from different workflow invocations under certain reasonable conditions (galaxyproject#3474). Considering only a four-year-old workflow and tool feature set (no collection operations, no dynamic dataset discovery, only tool and input workflow modules), all workflows can and will fully schedule on the first scheduling iteration. Under those circumstances, this solution is functionally equivalent to history_local_serial_workflow_scheduling, introduced in galaxyproject#3520, but should be more performant, because all such workflows fully schedule in the first iteration and the double loop introduced here https://github.com/galaxyproject/galaxy/pull/3520/files#diff-d7e80a366f3965777de95cb0f5b13a4e is avoided for each workflow invocation on each iteration. This addresses both concerns I outlined [here](galaxyproject#3816 (comment)).

For workflows that use certain classes of newer tools or newer workflow features, I'd argue this approach will not degrade as harshly as enabling history_local_serial_workflow_scheduling. For instance, imagine a workflow with a dynamic dataset collection output step halfway through (such as used by the IUC tools DESeq2, Trinity, Stacks, and various Mothur tools) that takes 24 hours of queue time to reach. Now imagine a user running 5 such workflows at once.

- Without this and without history_local_serial_workflow_scheduling, the 5 workflows will each run as fast as possible and the UI will show as much of each workflow as can be scheduled, but the order of the datasets may be shuffled. The workflows will be complete for the users in 48 hours.
- With history_local_serial_workflow_scheduling enabled, only 1 workflow will be scheduled, and only halfway, for the first 24 hours, and the user will be given no visual indication for a day of why the other workflows are not running. The final workflow output will take nearly a week to be complete for the users.
- With this enabled - the new default in this commit - each workflow will be scheduled in two chunks, but these chunks will be contiguous and it should be fairly clear to the user which tool caused the discontinuity of the datasets in the history. So things are still mostly ordered, but the drawbacks of history_local_serial_workflow_scheduling are avoided entirely. Namely, the other four workflows aren't hidden from the user without a UI indication, and the workflows will still take only 48 hours to complete with outputs ready for the user.

The only drawback of this new default behavior is that you could potentially see some performance improvement by scheduling multiple workflow invocations within one history in parallel - but this was never a design goal in my mind when implementing background scheduling, and under typical Galaxy use cases I don't think it would be worth the UI problems. So the older behavior can be re-enabled by setting parallelize_workflow_scheduling_within_histories to True in galaxy.ini, but it won't be on by default or really recommended if the Galaxy UI is being used.
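The timings claimed in the scenario above can be checked with back-of-the-envelope arithmetic. The numbers (24h to reach the dynamic step, 24h of compute after it, 5 workflows) come from the comment; the serialization model is my reading of it, not measured behavior:

```python
# Back-of-the-envelope check of the 48h vs. "nearly a week" claim above.
# Assumed scenario: 5 workflows, each hitting a dynamic-collection step
# after 24 hours of queue time, then needing 24 more hours to finish.

QUEUE_TO_DYNAMIC_STEP = 24  # hours until the dynamic step can schedule
REMAINING_COMPUTE = 24      # hours from the dynamic step to final outputs
N_WORKFLOWS = 5

# Background (per-invocation chunk) scheduling: all invocations progress
# concurrently, so total wall time is a single workflow's span.
parallel_total = QUEUE_TO_DYNAMIC_STEP + REMAINING_COMPUTE

# history_local_serial_workflow_scheduling: workflow k cannot continue
# scheduling until workflow k-1 fully schedules, i.e. until k-1 clears its
# 24-hour dynamic step, so those waits stack up.
serial_total = (N_WORKFLOWS - 1) * QUEUE_TO_DYNAMIC_STEP + parallel_total

print(parallel_total, "hours")  # 48 hours (2 days)
print(serial_total, "hours")    # 144 hours (6 days, "nearly a week")
```

So under these assumptions the concurrent modes finish in 48 hours, while full per-history serialization pushes the last output to 144 hours, which is where "nearly a week" comes from.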
Yes, on my side yes. @erasche has probably not enabled this.
Believe me, we are trying to convince people to use collections more and more. One more thing I have seen, not sure how this relates: I have a history with 6 BAM files, and a workflow with one step (rmdup). Running this workflow on all 6 BAM files will show me 5 new datasets in around 20s. The last one shows up only when the other 5 are finished computing - this can take minutes to hours.
I have not enabled it, no.
Oh, I imagine they have it much, much worse. At my org, for now / the foreseeable future I'm the only person launching 20 histories with 50 steps in each. I don't mind knowing that it can take some time, I can document this for others here. Can also provide timing data / etc.
Us too! But @jmchilton, I tried collections, and they look like they'll be great, but I cannot recommend that my users use them until #740 is solved. Please don't get me wrong: I'm excited for how much they'll simplify this sort of data processing.
@bgruening How many of your immediate problems have been solved by #3820 and #3830 and by disabling history_local_serial_workflow_scheduling?
I think you solved this; at least my testing so far looks very good. I would need both merged, and a few days to test this with more and bigger workflows. But so far everything looks good. Thanks so much!
We are getting more and more reports that workflows do not put datasets into the history for hours; a few users have encountered a 24h delay.
We do have the setting
history_local_serial_workflow_scheduling
enabled, as our users need to have the order of datasets preserved in a history. Has anyone seen this as well?
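For reference, the options discussed in this thread are set in galaxy.ini. A sketch of the relevant section follows; the option names are taken from the comments above, but the section layout and defaults may differ between Galaxy releases, so treat this as illustrative rather than authoritative:

```ini
; Sketch of galaxy.ini options discussed in this thread (illustrative only).

[app:main]
; Schedule workflow invocations within a history strictly one at a time,
; preserving dataset order at a significant throughput cost (see #3520).
history_local_serial_workflow_scheduling = False

; Re-enable interleaved scheduling of multiple invocations within one
; history (the pre-change behavior described in the commit message above).
parallelize_workflow_scheduling_within_histories = False
```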