Only return a subset of jobs ready to run per user when querying for jobs ready to run #11358
Conversation
Sounds very related to my issue from the other day, #11130
```diff
@@ -385,6 +385,15 @@
     (db-skip-locked, db-transaction-isolation) and the value is an integer > 0. Default is to grab as many
     jobs ready to run as possible.

+    - `ready_window_size` - Handlers query for and dispatch jobs ready to run in a loop. If the number of
```
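For orientation, a sketch of how this option might be set in `job_conf.xml`; the `assign_with` value and handler id here are illustrative assumptions, not taken from this diff:

```xml
<!-- Illustrative sketch only: assign_with and the handler id are assumptions;
     ready_window_size is the new setting under discussion in this PR. -->
<handlers assign_with="db-skip-locked" ready_window_size="100">
    <handler id="handler0"/>
</handlers>
```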
xref #11228 (comment)
We should stop updating this file and update the YAML version instead.
While I agree, we need to update all of the documentation and move the YAML over to the sample dir, update the Ansible role, etc. AFAIK I am the only person using it so it seems a bit premature to dump the XML without at least having the docs to help people transition.
Let me rephrase then - please update both sources - I put a lot of effort into migrating this documentation to YAML. If you are only going to document it in one place - I would prefer it to be the YAML because that will be the source of truth at some point (you promised to do the cutover for me like two years ago 😉).
I approved the PR though, feel free to ignore me.
Oh ok - I'll do that.
The updating both thing, not the ignoring. ;D
Do you want to target 21.01 for the smorgasbord event? It's not like we can really test this against test ...
No, I don't feel comfortable getting this into 21.01, I'll cherry-pick to the …
@hexylena I'll also say that because handler assignment is random and usegalaxy.eu runs what, like 20 handlers? So this is not going to work exactly as you might be hoping as a form of concurrency limiting. Each handler will still be able to dispatch up to `ready_window_size` of a user's jobs per iteration.
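Back-of-the-envelope arithmetic for that point (the handler count is the rough figure from this comment; the window size is this PR's default):

```python
# Rough numbers from this thread, not measurements.
handlers = 20            # approximate usegalaxy.eu handler count
ready_window_size = 100  # this PR's default per-user window, per handler

# With random handler assignment a user's jobs land on every handler,
# so the per-handler window is not a global per-user cap:
print(handlers * ready_window_size)  # up to 2000 of one user's jobs per sweep
```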
The output of the timed-out test won't load for me, did I break anything here?
If you click on the extended menu you can download the raw logs:
Thanks @jmchilton @mvdbeek! Templating that query and manually running it on a Galaxy sqlite database fails, since window functions are only supported in newer versions of SQLite. To work around this I am thinking I will just stick a branch in there to use the un-windowed query when the engine is SQLite: even if the windowed query may work with newer SQLite versions, we'd never need this functionality on SQLite anyway.
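A minimal sketch of that branch, with hypothetical `windowed_query`/`unwindowed_query` stand-ins rather than the actual query builders from this PR; the dialect check itself is ordinary SQLAlchemy:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Hypothetical stand-ins for the real query builders in this PR.
def windowed_query(session, ready_window_size):
    return f"windowed (top {ready_window_size} per user)"

def unwindowed_query(session):
    return "un-windowed (all ready jobs)"

def build_ready_jobs_query(session, ready_window_size):
    # Window functions need SQLite >= 3.25, and SQLite deployments are
    # small enough that the un-windowed query is fine anyway.
    if session.bind.dialect.name == "sqlite":
        return unwindowed_query(session)
    return windowed_query(session, ready_window_size)

with Session(create_engine("sqlite://")) as session:
    print(build_ready_jobs_query(session, 100))  # un-windowed on SQLite
```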
Force-pushed from e6e4744 to 103d2f0 (Compare)
Observing a problem we currently have on usegalaxy.org, where there are 21,000 jobs in the `new` state, but very few of them are dispatched on each iteration of the handler monitor thread, because those jobs are mostly owned by a small number of users and the concurrency limits are preventing their new jobs from dispatching. On every iteration of the loop, the monitor thread will resolve destinations (i.e. run them through your dynamic rules...), check limits, etc. for all 21k of these jobs and then defer the overwhelming majority of them until the next loop. All of that takes a lot of time. In the meantime, any new jobs submitted by other users will not even be dispatched to a runner plugin until the end of the next iteration of the loop.

So this change uses windowing/partitioning to only return the top `ready_window_size` (default: 100) jobs per user on each iteration of the monitor thread.
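For illustration, a minimal sketch of the windowed query using an invented one-table schema (not Galaxy's actual model or the code in this change):

```python
# Minimal sketch only: this job table is invented, not Galaxy's schema.
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, select)

metadata = MetaData()
job = Table(
    "job", metadata,
    Column("id", Integer, primary_key=True),
    Column("user_id", Integer),
    Column("state", String),
)

# Rank each user's ready jobs oldest-first...
ready_window_size = 100
ranked = (
    select(
        job.c.id,
        func.rank()
        .over(partition_by=job.c.user_id, order_by=job.c.id)
        .label("user_rank"),
    )
    .where(job.c.state == "new")
    .subquery()
)
# ...then keep only the top ready_window_size per user instead of all of them.
windowed = select(ranked.c.id).where(ranked.c.user_rank <= ready_window_size)

engine = create_engine("sqlite://")  # needs SQLite >= 3.25 for window functions
metadata.create_all(engine)
with engine.connect() as conn:
    print(conn.execute(windowed).fetchall())
```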
A few things to be aware of:

- Anonymous users are treated as a single user by this algorithm. This is probably fixable by having a separate query for anonymous jobs if this becomes a problem (we could even do the rank/window by session), since one extra query is unlikely to be problematic from a performance standpoint.
- This applies to jobs whose inputs are terminal - jobs whose inputs are not terminal are already filtered out of the query.
- If you are using per-destination (or per-destination-tag) limits, small window sizes could prevent a user's jobs on destinations where they are under the limit from running, if their oldest jobs are on destinations where they are over the limit. For example, say you have set the window size to 20, you have configured 2 job destinations, `multicore` and `singlecore`, each with a limit of 10 concurrent jobs per user, and you have 10 jobs running on the `multicore` destination but 0 jobs running on `singlecore`. You also have 100 independent jobs in the `new` state whose inputs are all ready, the first 40 of which are destined for `multicore` and the last 60 of which are destined for `singlecore`. Your `singlecore` jobs will have to wait until there are only 19 `multicore` jobs waiting to be dispatched before any of the `singlecore` jobs start being dispatched, even though you are under your limits there. You can mitigate this by not setting `ready_window_size` too low, or avoid it entirely by having separate job handlers (see the `handler` attrib for `<tool>` tags in the job conf) for these destinations; a sketch follows below.

Although you can use this as a form of concurrency limiting, I think it's better to use it as a safety measure so one person having 10k pending jobs doesn't destroy that handler for everyone else. If you need concurrency limits, Galaxy already does that in a much more fine-grained manner.
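A sketch of that mitigation with illustrative tool and handler ids (an assumed job conf XML shape; consult the job configuration docs for the exact syntax in your version):

```xml
<!-- Illustrative ids only. Pin tools to dedicated handlers so one
     destination's backlog can't crowd the other out of the window. -->
<handlers>
    <handler id="handler_multicore"/>
    <handler id="handler_singlecore"/>
</handlers>
<tools>
    <tool id="a_multicore_tool" handler="handler_multicore"/>
    <tool id="a_singlecore_tool" handler="handler_singlecore"/>
</tools>
```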
I thought about making this configurable such that the old non-windowed query would be the default, but this seems to work fine even with SQLite.
As for performance: the regular job readiness query for one handler on usegalaxy.org currently returns 6902 rows; with this query and `ready_window_size` set to 20, it returns 374 rows, with no hit on query performance (both run in ~320ms).