datasets do not appear in the history for hours #3816

Closed
bgruening opened this issue Mar 26, 2017 · 14 comments

Comments

@bgruening
Member

We are getting more and more reports that workflows do not put datasets into the history for hours; a few users have encountered a 24h delay.

We do have the setting history_local_serial_workflow_scheduling enabled, as our users need to have the order of datasets preserved in a history.
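For reference, this is roughly how the flag is set in our galaxy.ini (illustrative excerpt; only the flag itself is the relevant part):

```ini
# galaxy.ini (illustrative excerpt)
[app:main]
# Schedule workflow invocations within one history strictly one after another
# so that the order of datasets in the history is preserved.
history_local_serial_workflow_scheduling = True
```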

Has anyone seen this as well?

@hexylena
Member

hexylena commented Mar 26, 2017

This is happening to me as well on 17.01. I have a Jenkins job which launches workflows. I just hit the button, and maybe 50% of them only showed three jobs when 50+ should have been scheduled.

[screenshot: utvalg_093]
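For context, the Jenkins job drives Galaxy through the API; a rough BioBlend sketch of that kind of launcher is below. The URL, API key, workflow ID, and dataset IDs are placeholders, not my actual setup, and the call details are best-effort rather than copied from the job.

```python
# Rough BioBlend sketch of an API-driven workflow launcher; every ID, URL and
# key below is a placeholder, not the real Jenkins configuration.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

workflow_id = "WORKFLOW_ID"             # placeholder
dataset_ids = ["HDA_ID_1", "HDA_ID_2"]  # placeholders, one per sample

for i, hda_id in enumerate(dataset_ids):
    # Each call creates one background invocation that a workflow handler on
    # the server still has to pick up and schedule - which is where the delay
    # described in this issue shows up.
    gi.workflows.invoke_workflow(
        workflow_id,
        inputs={"0": {"src": "hda", "id": hda_id}},  # step 0 = the workflow's input step
        history_name="run %d" % i,
    )
```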

@martenson
Member

martenson commented Mar 26, 2017 via email

@hexylena
Member

[screenshot: utvalg_094]

As for me, I have no idea if the jobs will be created. I killed the histories and re-started with a delay. This time even fewer jobs are scheduled.

[screenshot: utvalg_095]

@hexylena
Member

[screenshot: utvalg_097]

Will be watching over the course of the day.

@hexylena
Member

15 minutes later: OK, it looks like more are queued. That was just very, very unsettling! I imagine @bgruening has this problem much worse than I do, based on the workflows I see him pictured with on Twitter.

[screenshot: utvalg_098]

@nekrut
Contributor

nekrut commented Mar 26, 2017

@natefoo @jmchilton ping

@bgruening
Member Author

I'm glad I'm not the only one who sees this. This can also be related to the load on the cluster, so that the visual feedback is lacking even more, I suppose.

@bgruening
Member Author

Locally, if I run a workflow with one tool (htseq-count), it is considerably slower than running this one tool 10 times without a workflow. I tried this on usegalaxy.org. It is faster than locally, but I was able to produce the following, which is also bad: multiple datasets with the same ID in one history on usegalaxy.org.

[screenshot]

https://usegalaxy.org/u/bgruening/h/hiv-coverage

@jmchilton
Member

Duplicate HIDs seem unrelated IMO, so I have created a separate issue for them in #3818.

@jmchilton
Member

jmchilton commented Mar 26, 2017

Locally, if I run a workflow with one tool (htseq-count), it is considerably slower than running this one tool 10 times without a workflow.

Locally with history_local_serial_workflow_scheduling enabled?

There is some overhead associated with backgrounding the workflows and waiting for a job handler to pick them up, versus just scheduling 10 jobs right in a web thread. There are a lot of optimizations that have been applied to the tool execution thread that the workflow scheduling thread cannot leverage as architected, because it processes workflows one at a time. Creating 10 jobs from one tool submission is not 10x slower than creating 1 job - but scheduling 10 workflows is 10x slower than scheduling 1 workflow.
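To make that asymmetry concrete, here is a purely illustrative back-of-the-envelope sketch - the cost numbers are made up and are not measurements of Galaxy internals:

```python
# Purely illustrative: made-up cost units, not Galaxy measurements.
SUBMISSION_OVERHEAD = 1.0  # per submission: parse/validate tool state, checks, flush, ...
PER_JOB_COST = 0.1         # per job row actually created

def tool_batch_cost(n_jobs):
    """One tool submission that fans out into n_jobs jobs pays the overhead once."""
    return SUBMISSION_OVERHEAD + n_jobs * PER_JOB_COST

def one_step_workflows_cost(n_workflows):
    """n one-step workflow invocations each pay the full overhead again."""
    return n_workflows * (SUBMISSION_OVERHEAD + PER_JOB_COST)

print(tool_batch_cost(10))          # 2.0  -> far from 10x the cost of a single job
print(one_step_workflows_cost(10))  # 11.0 -> essentially 10x one workflow's cost
```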

So yes - running tools is faster than running workflows with single tools - and this would be magnified by history_local_serial_workflow_scheduling. With history_local_serial_workflow_scheduling enabled, you may have to walk through everyone's open/ready workflows on the server before the first one of yours schedules, and then again for the second - and so on. The behavior @erasche is seeing above - 15 minutes to schedule potentially thousands of datasets - is concerning and we need to optimize (and ask the Canadians - we've done a lot to optimize this already, and I still have two WIP threads pursuing more optimizations) - but all of this slowdown would be magnified for individual users with history_local_serial_workflow_scheduling if they have many workflows ready to go in the same history.

There are some things we can do to improve the setting history_local_serial_workflow_scheduling - we can load the workflows in order so that the oldest one for each history is loaded first. That would probably improve the turnaround time for these users, since you should only have to walk all the open workflows once to get these things scheduled. We could also move the double-nested loop that determines if "this" workflow is the correct workflow to schedule into SQL, to speed up every check on every loop.
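A rough sketch of that second idea - not the code from any eventual PR, just the shape of pushing the per-history "is this the oldest still-active invocation?" check into a single query, written against Galaxy's WorkflowInvocation model as I understand it (the set of "active" states is an assumption):

```python
# Hypothetical sketch only - not the actual Galaxy change.
from sqlalchemy import and_, func

from galaxy import model


def is_next_in_history(sa_session, invocation):
    """True if no older, still-unscheduled invocation exists in the same history."""
    older_active = sa_session.query(func.count(model.WorkflowInvocation.id)).filter(and_(
        model.WorkflowInvocation.history_id == invocation.history_id,
        model.WorkflowInvocation.state.in_(["new", "ready"]),  # assumed "active" states
        model.WorkflowInvocation.create_time < invocation.create_time,
    )).scalar()
    return older_active == 0
```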

I also need to write up some documentation on separating workflow and job scheduling threads - I think performance problems and debugging would be much clearer if these were separate processes. Right now, I/O problems and the like in job schedulers can potentially slow down workflow scheduling in more ways than they should be able to. It also means that if a job runner process dies (which we have observed to happen for SLURM and PBS), workflow scheduling dies. I also think #3659 would mean Galaxy would clean up older workflows that may have errored out in a way I don't understand yet (possible ideas in #3555, which sort of stalled waiting on the merge of #3619).

I'll open a PR for the two optimizations to history_local_serial_workflow_scheduling on Monday, as well as try to figure out the failed test case to get #3659 into 17.01 - I think these are the best things we can do right away to improve the Freiburg situation. Later in the week I'll open a PR with instructions on splitting up workflow scheduler threads, and I'll continue working on the test cases in #3555 to see if I can find workflow bugs that might lead to stuck workflow invocations that would slow down scheduling. If I get through all of that this week, I will also finish up work on #1957, which could speed up workflow scheduling, and see if I can make improvements to tool submission bursting in an optimized way within workflow threads - that could also really speed up workflow invocations.

Update: I believe the above shows a good-faith effort to address the problems caused by invoking many flat workflows with individual datasets - but I do want to point out that, in addition to better history UX organization, I believe Galaxy will schedule a single workflow invocation over collections much faster, since it can leverage the tool execution optimizations aimed at creating homogeneous jobs together.

jmchilton added a commit to jmchilton/galaxy that referenced this issue Mar 27, 2017
…dom handler.

Let's revisit the problem that background scheduling workflows (as is the default UI behavior as of 16.10) makes it easier for histories to contain datasets interleaved from different workflow invocations under certain reasonable conditions (galaxyproject#3474).

Considering only a four-year-old workflow and tool feature set (no collection operations, no dynamic dataset discovery, only tool and input workflow modules), all workflows can and will fully schedule on the first scheduling iteration. Under those circumstances, this solution is functionally equivalent to history_local_serial_workflow_scheduling introduced in galaxyproject#3520 - but it should be more performant, because all such workflows fully schedule in the first iteration and the double loop introduced here https://github.com/galaxyproject/galaxy/pull/3520/files#diff-d7e80a366f3965777de95cb0f5b13a4e is avoided for each workflow invocation on each iteration. This addresses both concerns I outlined [here](galaxyproject#3816 (comment)).

For workflows that use certain classes of newer tools or newer workflow features - I'd argue this approach will not degrade as harshly as enabling history_local_serial_workflow_scheduling.

For instance, imagine a workflow with a dynamic dataset collection output step (such as used by the IUC tools Deseq2, Trinity, Stacks, and various Mothur tools) halfway through that takes 24 hours of queue time to reach. Now imagine a user running 5 such workflows at once.

- Without this and without history_local_serial_workflow_scheduling, the 5 workflows will each run as fast as possible and the UI will show as much of each workflow as can be scheduled, but the order of the datasets may be shuffled. The workflows will be complete for the users in 48 hours.
- With history_local_serial_workflow_scheduling enabled, only 1 workflow will be scheduled, and only halfway, for the first 24 hours, and the user will be given no visual indication of why the other workflows are not running for 1 day. The final workflow output will take nearly a week to be complete for the users.
- With this enabled - the new default in this commit - each workflow will be scheduled in two chunks, but these chunks will be contiguous and it should be fairly clear to the user which tool caused the discontinuity of the datasets in the history. So things are still mostly ordered, but the drawbacks of history_local_serial_workflow_scheduling are avoided entirely. Namely, the other four workflows aren't hidden from the user without a UI indication, and the workflows will still only take 48 hours to be complete with outputs ready for the user.

The only drawback of this new default behavior is that you could potentially see some performance improvements by scheduling multiple workflow invocations within one history in parallel - but this was never a design goal in my mind when implementing background scheduling, and under typical Galaxy use cases I don't think it would be worth the UI problems. So the older behavior can be re-enabled by setting parallelize_workflow_scheduling_within_histories to True in galaxy.ini, but it won't be on by default or really recommended if the Galaxy UI is being used.
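For operators following along, opting back into the older parallel behavior described above would look roughly like this in galaxy.ini (illustrative snippet; the setting name and value are taken from the commit message):

```ini
# galaxy.ini (illustrative) - re-enable parallel scheduling of multiple
# workflow invocations within one history, trading dataset ordering for speed.
parallelize_workflow_scheduling_within_histories = True
```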
@bgruening
Member Author

Locally with history_local_serial_workflow_scheduling enabled?

Yes, on my side it is enabled. @erasche has probably not enabled this.

Update: I believe the above shows a good-faith effort to address the problems caused by invoking many flat workflows with individual datasets - but I do want to point out that, in addition to better history UX organization, I believe Galaxy will schedule a single workflow invocation over collections much faster, since it can leverage the tool execution optimizations aimed at creating homogeneous jobs together.

Believe me, we are trying to convince people to use collections more and more.

One more thing I have seen, not sure how this relates: I have a history with 6 BAM files and a workflow with one step (rmdup). Running this workflow on all 6 BAM files will show me 5 new datasets in around 20s. The last one only shows up when the other 5 have finished computing - this can take minutes to hours.
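As for the collections recommendation above, the pattern being pushed looks roughly like the BioBlend sketch below: build one list collection from the BAM files and invoke the workflow a single time over it. The history ID, dataset IDs, and workflow ID are placeholders, and the call details are best-effort rather than verified against this exact scenario.

```python
# Hedged sketch: one list collection over the 6 BAM files, one invocation that
# maps the single rmdup step over every element. All IDs are placeholders.
from bioblend.galaxy import GalaxyInstance
from bioblend.galaxy.dataset_collections import CollectionDescription, HistoryDatasetElement

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

history_id = "HISTORY_ID"
bam_ids = ["BAM_1", "BAM_2", "BAM_3", "BAM_4", "BAM_5", "BAM_6"]

collection = gi.histories.create_dataset_collection(
    history_id,
    CollectionDescription(
        name="bam inputs",
        type="list",
        elements=[HistoryDatasetElement(name="bam_%d" % i, id=bam_id)
                  for i, bam_id in enumerate(bam_ids)],
    ),
)

# One invocation instead of six; assumes the workflow's input step accepts
# (or maps over) a collection.
gi.workflows.invoke_workflow(
    "WORKFLOW_ID",
    inputs={"0": {"src": "hdca", "id": collection["id"]}},
    history_id=history_id,
)
```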

jmchilton added a commit to jmchilton/galaxy that referenced this issue Mar 27, 2017
…dom handler.

@hexylena
Member

I have not enabled it, no.

The behavior @erasche is seeing above - 15 minutes to schedule potentially thousands of datasets - is concerning and we need to optimize (and ask the Canadians - we've done a lot to optimize this already, and I still have two WIP threads pursuing more optimizations) - but all of this slowdown would be magnified for individual users with history_local_serial_workflow_scheduling if they have many workflows ready to go in the same history.

Oh, I imagine they have it much, much worse. At my org, for now / for the foreseeable future, I'm the only person launching 20 histories with 50 steps in each. I don't mind knowing that it can take some time; I can document this for others here. I can also provide timing data, etc.

Believe me, we are trying to convince people to use collections more and more.

Us too! But @jmchilton, I tried collections, and they look like they'll be great, but I cannot recommend that my users use them until #740 is solved. Please don't get me wrong - I'm excited about how much they'll simplify this sort of data processing.

@jmchilton
Member

@bgruening How many of your immediate problems have been solved by #3820 and #3830 and by disabling history_local_serial_workflow_scheduling? Have we at least gotten the delay down from hours to minutes?

@bgruening
Member Author

I think you solved this; at least my testing so far looks very good. I would need both of these merged and a few days to test with more and bigger workflows, but so far everything looks good.

Thanks so much!

jmchilton added a commit to jmchilton/galaxy that referenced this issue Mar 30, 2017
…dom handler.
