wreck: rework job.submit, job.create, wrexecd.run handlers with continuations #1472
This is a hopefully somewhat better structured version of PR #1471.
This one doesn't add a hash of active jobs and doesn't limit the job module's ability to ingest jobs on ranks other than 0.
It adds a
It also satisfies a long standing feature request for an ability to synchronously publish events; that is, get a response indicating that the event has received a sequence number, as discussed in #337 and #342. That was necessary to remove the final synchronous step in job.create and job.submit, where the handlers block listening for an event to be published before sending the response to the user.
FWIW I reran a little script inspired by @trws that submits 50K jobs (with the sched module check disabled), running 200 submits in parallel:
```bash
#!/bin/bash
ulimit 65536
seq 1 50000 | xargs -n 1 -P 200 \
    flux submit -n 1 sleep 0
```
Before this PR I got 134 jobs per second. After, I get 442.
I haven't tried to probe the effect on job launch, but it should improve latency under load. For example, if there are many small jobs being launched, the overhead of checking a new job not targeting the local rank should not slow down (as much) the launch of one that does target the local rank, since new events can start to be handled before the last one has completed. E.g. there could be many fetches of
Sorry the "rework" commits are basically unreadable. They each effectively rewrite a handler and its helper functions as a chain of continuations. It might be better to look at the final result.
```
@@           Coverage Diff            @@
##           master    #1472    +/-  ##
========================================
+ Coverage   78.71%   78.72%   +<.01%
========================================
  Files         163      164       +1
  Lines       30264    30328      +64
========================================
+ Hits        23823    23875      +52
- Misses       6441     6453      +12
```
Very nice result @garlick! Thanks for thinking of doing that test. I wonder if we'd get even better throughput with a bulk submit tool that can issue
I'll just review the final result, but I'm sure everything is much improved.
Thanks also for renaming
Just a thought: On commit descriptions, I've gone to using
Nice improvement on the event publishing too, I'm sure that helped throughput as well!
Thanks - I'll fix the commit messages, and look at renaming
The test above does run
@garlick, this is awesome and will make certain concurrent programming much more intuitive without incurring a performance hit!
One question so that I understand its semantics: is the sync eventing suitable when you have a single receiver who needs to be synced with the sender? Would we have to use a different mechanism if there are multiple subscribers?
No, event delivery is still "open loop" or "fire and forget". This RPC returns when the event has been accepted on rank 0, meaning it has a place in the event sequence. The issue it was intended to solve, as I understand it, is event ordering in the following scenario:
Say A and B both implement a service that publishes an event upon request.
If C first sends an RPC to A generating E1, then sends an RPC to B, generating E2, what order do subscribers see the events? We don't know, it could be E1 then E2, or E2 then E1. This is because events are "fire and forget" from A and B's perspective, and maybe A's broker is busier than B's, so its messages reach the rank 0 broker after B's.
This new thing just provides an optional way to publish events with a response from the rank 0 broker indicating that the event has been accepted and assigned a sequence number. So if A and B wait for the response to the new RPC before responding to C, then C knows the events will be sent in the order requested.
Before in the job module, we were listening for events to loop back to the sender, which accomplished the same thing, but was a little more awkward to incorporate into the chain of continuations pattern in use in this rework.
It's kind of a niche use case I think, so don't get too excited :-)