
[17.01] Restrict workflow scheduling within a history to a fixed, random handler. #3820

Merged · 2 commits merged into galaxyproject:release_17.01 from jmchilton:fixed_history_handler on Mar 31, 2017

Conversation

@jmchilton (Member) commented Mar 27, 2017:

xref #3816

Let's revisit the problem that background-scheduling workflows (the default UI behavior as of 16.10) makes it easier for histories to contain datasets interleaved from different workflow invocations under certain reasonable conditions (#3474).

Considering only a four-year-old workflow and tool feature set (no collection operations, no dynamic dataset discovery, only tool and input workflow modules), all workflows can and will fully schedule on the first scheduling iteration. Under those circumstances, this solution is functionally equivalent to history_local_serial_workflow_scheduling introduced in #3520 - but it should be more performant, because all such workflows fully schedule in the first iteration and the double loop introduced in https://github.com/galaxyproject/galaxy/pull/3520/files#diff-d7e80a366f3965777de95cb0f5b13a4e is avoided for each workflow invocation on each iteration. This addresses both concerns I outlined in #3816.

For workflows that use certain classes of newer tools or newer workflow features, I'd argue this approach will not degrade as harshly as enabling history_local_serial_workflow_scheduling.

For instance, imagine a workflow with a dynamic dataset collection output step (such as used by the IUC tools DESeq2, Trinity, Stacks, and various Mothur tools) halfway through that takes 24 hours of queue time to reach. Now imagine a user running 5 such workflows at once.

  • Without this and without history_local_serial_workflow_scheduling, the 5 workflows will each run as fast as possible and the UI will show as much of each workflow as can be scheduled, but the order of the datasets may be shuffled. The workflows will be complete for the user in 48 hours.
  • With history_local_serial_workflow_scheduling enabled, only one workflow will be scheduled - and only halfway - for the first 24 hours, and the user will be given no visual indication of why the other workflows are not running for a day. The final workflow output will take nearly a week to complete.
  • With this enabled - the new default in this commit - each workflow will be scheduled in two chunks, but these chunks will be contiguous and it should be fairly clear to the user which tool caused the discontinuity of the datasets in the history. So things are still mostly ordered, but the drawbacks of history_local_serial_workflow_scheduling are avoided entirely: the other four workflows aren't hidden from the user without a UI indication, and the workflows will still take only 48 hours to complete with outputs ready for the user.

The only drawback of this new default behavior versus the previous default is that you could potentially have seen some performance improvement by scheduling multiple workflow invocations within one history in parallel - but that was never a design goal in my mind when implementing background scheduling, and under typical Galaxy use cases I don't think it would be worth the UI problems. So, the older behavior can be re-enabled by setting parallelize_workflow_scheduling_within_histories to True in galaxy.ini, but it won't be on by default and isn't really recommended if the Galaxy UI is being used.

Since it leverages the testing enhancements therein, this is built on #3659.

This may (probably does) incidentally fix #3818.
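For illustration only, here is a minimal standalone sketch of the idea - not the actual Galaxy code, and the names are made up. The point is that the "random" handler pick is seeded with something stable per history, so every invocation in a history lands on the same handler:

import random

# Hypothetical sketch: choose a handler pseudo-randomly but deterministically per history.
def handler_for_history(history_id, handler_ids):
    # Seeding the RNG with the history ID makes the pick vary across histories
    # but stay fixed for any single history, so its invocations are never
    # spread across handlers.
    rng = random.Random(history_id)
    return handler_ids[rng.randrange(len(handler_ids))]

handler_ids = ["handler0", "handler1", "handler2"]
assert handler_for_history(42, handler_ids) == handler_for_history(42, handler_ids)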


# This is called by Tool.get_job_handler()
-def get_handler(self, id_or_tag):
+def get_handler(self, id_or_tag, random_index=None):
    """Given a handler ID or tag, return the provided ID or an ID matching the provided tag

@nsoranzo (Member) commented Mar 27, 2017:

"""Given a handler ID or tag, return a handler matching it

@@ -608,27 +608,31 @@ def get_job_tool_configurations(self, ids):
rval.append(self.default_job_tool_configuration)
return rval

-def __get_single_item(self, collection):
+def __get_single_item(self, collection, random_index=None):

@nsoranzo (Member) commented Mar 27, 2017:

Can we not call it random_index but just index? I think it's misleading.

@jmchilton (Member, Author) commented Mar 27, 2017:

But index sounds like we are choosing an intentional handler index - it is meant to be sort of a random seed - so I think that name is just as misleading. I'll change it anyway, though, because I also hate random_index. Clearly we're bumping into one of the two hard problems.
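To make the naming discussion concrete, here is a rough standalone sketch of the kind of selection involved (illustrative only; the real method is a private helper on the job configuration object):

import random

def get_single_item(collection, index=None):
    # When a seed-like value is supplied the pick is deterministic: the same
    # value always maps to the same element of the collection. Without it,
    # fall back to a plain random choice.
    if index is not None:
        return collection[index % len(collection)]
    return random.choice(collection)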

@bgruening (Member) commented Mar 27, 2017:

Thanks John for putting this together. I will try to find more reproducible examples to test this.

@jmchilton jmchilton force-pushed the jmchilton:fixed_history_handler branch from 3ad0305 to 48d131c Mar 27, 2017

@jmchilton (Member, Author) commented Mar 27, 2017:

@nsoranzo I rebased with your suggestions - thanks as always for the detailed review.

@bgruening Great - I'll pull it out of WIP when I get your +1 or when I get more testing done myself. I have a test case in there that verifies the handler setting is correct, and I've reviewed the code a few times; I don't think a single handler would ever process two workflows at once, so I think this should all work - but I haven't observed an actual concrete problem and verified that this fixes it yet.

# This is the maximum amount of time a workflow invocation may stay in an active
# scheduling state in seconds. Set to -1 to disable this maximum and allow any workflow
# invocation to schedule indefinitely. The default corresponds to 1 month.
#maximum_workflow_invocation_duration = 2678400

@bgruening (Member) commented Mar 27, 2017:
Shouldn't parallelize_workflow_scheduling_within_histories appear here too?

@jmchilton (Member, Author) commented Mar 28, 2017:
Fixed with 64b696f. Thanks.
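The added sample entry presumably looks something like the following - the wording here is illustrative rather than copied from 64b696f, but the option name and its default come from the description above:

# Set the following to True to restore the pre-17.01 behavior of scheduling
# multiple workflow invocations within a single history in parallel. The
# default keeps each history's invocations on a single fixed handler so that
# their output datasets stay contiguous in the history.
#parallelize_workflow_scheduling_within_histories = False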

@bgruening (Member) commented Mar 28, 2017:

We've been running this for a day now and it does not break anything further, but it still does not solve the non-persistent ordering of inputs within a single history.

We have 8 BAM files in a history and one workflow with one tool (rmdup). We execute this workflow on all inputs and get 8 ordered outputs (run on dataset 1-8). So far so good. During the execution of these 8 rmdup runs, we start 16 new workflow invocations (8 on the initial BAMs, 8 on the outputs still in the yellow, running state). The result is run on data 2, run on data 4, run on data 6, run on data 1.
I thought this was because a few datasets were in the running state, but executing the same workflow on the initial 8 BAMs again also yields a mixed order now - so maybe the first time was just luck.

@bgruening (Member) commented Mar 28, 2017:

It could also be a UI bug. If I select multiple files from my history, are they passed to the invocation step in an ordered way?

@jmchilton (Member, Author) commented Mar 28, 2017:

@bgruening I thought the problem was ordering within a workflow invocation - not ordering of the workflow invocations themselves. I can make sure the workflows get processed in the handlers by ascending ID if this is a problem. Though I don't know if they are sent to the invocation endpoint in a fixed, ordered way. I can add some test cases for this stuff though.

@jmchilton (Member, Author) commented Mar 28, 2017:

@bgruening Pairing this with #3830 should fix the order of the invocations to match your expectations. This PR just makes the datasets within simple invocations contiguous.

@bgruening (Member) commented Mar 28, 2017:

There were two problems: one is inside a workflow and the other is across multiple workflows. We spent the day testing larger workflows and running them multiple times in one history, and it seems this is working fine :)
We will now test the other patch and keep an eye on the sorting of workflows. Given that these are two different problems, 👍 from me.

Thanks John!!!

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Mar 28, 2017

[17.01] Order processing evaluation of workflow invocations when scheduling.

xref galaxyproject#3820 (comment)

Also optimize the query to just pull out workflow invocation IDs per the suggestion of @nsoranzo in galaxyproject#3830 (review).
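A hypothetical sketch of that kind of query change (the model and attribute names here are assumptions, not taken from the patch): fetch only the invocation IDs, in ascending order, rather than loading full ORM objects, and then process the invocations in that fixed order.

def active_invocation_ids(sa_session, model):
    # Pull just the IDs, ordered ascending, so processing order matches
    # submission order and full objects are not loaded up front.
    query = (
        sa_session.query(model.WorkflowInvocation.id)
        .filter(model.WorkflowInvocation.state == "new")
        .order_by(model.WorkflowInvocation.id.asc())
    )
    return [row[0] for row in query]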

@jmchilton jmchilton changed the title [WIP][17.01] Restrict workflow scheduling within a history to a fixed, random handler. [17.01] Restrict workflow scheduling within a history to a fixed, random handler. Mar 28, 2017

@jmchilton jmchilton added status/review and removed status/WIP labels Mar 28, 2017

jmchilton added some commits Feb 23, 2017

[17.01] By default, do not allow workflow invocations to schedule indefinitely.

Give up after a month but allow admins to reduce this amount as well.
[17.01] Restrict workflow scheduling within a history to a fixed, random handler.


@jmchilton jmchilton force-pushed the jmchilton:fixed_history_handler branch from 64b696f to aba31bc Mar 30, 2017

@dannon dannon merged commit 07887be into galaxyproject:release_17.01 Mar 31, 2017

5 checks passed:

  • api test - Build finished. 257 tests run, 0 skipped, 0 failed.
  • continuous-integration/travis-ci/pr - The Travis CI build passed.
  • framework test - Build finished. 136 tests run, 0 skipped, 0 failed.
  • integration test - Build finished. 27 tests run, 0 skipped, 0 failed.
  • toolshed test - Build finished. 580 tests run, 0 skipped, 0 failed.
@bgruening (Member) commented Mar 31, 2017:

Awesome! Thanks for the merge!
John - I owe you a few brownies!
