
job-manager / job-info: get job data changes from job-manager to job-info so job listings will have up to date information #3208

Merged · 15 commits · Sep 28, 2020

Conversation

@chu11 (Member) commented Sep 14, 2020

Per the discussion in #3052, the job-manager had no way to inform the job-info module that a job's priority had changed, leading to invalid job listings.

Several issues were discussed in #3052; this PR addresses only the notification of priority changes between job-manager & job-info.

For prototyping purposes, I implemented the most straightforward thing I could think of: a new job-updates event that the job-manager publishes. The event carries a change record of jobid, timestamp, the field that changed, and its new value; for example, JOBID, TIMESTAMP, "priority", 14. I transfer the value as a JSON object. In the future, I believe a JSON null can serve as a "cache invalidate" marker when the changed data isn't known to the job-manager.
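For illustration, publishing such an event might look like the sketch below. The "job-updates" topic string and payload keys are hypothetical placeholders for the prototype described above, not a settled protocol:

    /* Hypothetical sketch of the prototype job-updates event; topic
     * and payload keys are illustrative only. */
    #include <flux/core.h>

    static int publish_priority_change (flux_t *h, flux_jobid_t id,
                                        double timestamp, int priority)
    {
        flux_future_t *f;
        /* The value is a JSON member; a JSON null here could later
         * serve as the "cache invalidate" marker mentioned above. */
        if (!(f = flux_event_publish_pack (h, "job-updates", 0,
                                           "{s:I s:f s:s s:i}",
                                           "id", (json_int_t)id,
                                           "timestamp", timestamp,
                                           "key", "priority",
                                           "value", priority)))
            return -1;
        flux_future_destroy (f);  /* fire and forget for this sketch */
        return 0;
    }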

In the job-info module, we have to track the timestamp of when the priority field was last set, to avoid races with out-of-order updates (a rough sketch of the guard follows the list below).

  • b/c the priority field is (currently) the only job data that can change, I think this implementation is ok. But as more data becomes mutable, this approach will become cumbersome: we would be managing timestamps for when every piece of job data was last updated. So a refactoring would eventually be necessary.

  • The job-updates and job-annotations events could be combined into one event. If we go with this approach, I think that would be a commit to add to the top of this series.
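A rough sketch of the timestamp guard mentioned above, using a hypothetical job-info job struct (field names are illustrative):

    /* Illustrative only: drop a priority update whose timestamp is
     * older than the last one applied, so out-of-order delivery
     * cannot clobber newer data. */
    #include <flux/core.h>

    struct info_job {
        flux_jobid_t id;
        int priority;
        double priority_ts;   /* timestamp of last applied priority */
    };

    static void update_priority (struct info_job *job,
                                 int priority, double timestamp)
    {
        if (timestamp <= job->priority_ts)
            return;           /* stale update, ignore */
        job->priority = priority;
        job->priority_ts = timestamp;
    }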

Any thoughts / comments?

@garlick (Member) commented Sep 14, 2020

Just read through #3052 again.

The job-updates and job-annotations events could be combined into one event. If we go with this approach, I think that would be a commit to add to the top of this series.

That makes a lot of sense.

The one thought that occurs to me regarding a combined protocol for "job updates" is that we did invest some design effort already in job eventlogs. Maybe rather than creating a new update protocol, we should just publish eventlog entries (adding jobid to context)?

For that matter, would it be better to just "cc" all job eventlog updates to an in-memory circular buffer in the job manager that job-info could consume via a streaming RPC? Call it the job manager's "journal". Published events are easy to work with but the big downside is that they are broadcast across the instance, for naught if there is only the one consumer.

For the annotations and upcoming fair share priority changes that probably would not be appropriate to land in the job's eventlog, could we still add an eventlog entry to the in-memory journal?

Edit: a generic API for consuming eventlogs could potentially be used by job-info to consume this "eventlog" too.

@chu11 (Member Author) commented Sep 14, 2020

The one thought that occurs to me regarding a combined protocol for "job updates" is that we did invest some design effort already in job eventlogs. Maybe rather than creating a new update protocol, we should just publish eventlog entries (adding jobid to context)?

I like this idea. The eventlog entry is most of what is needed for the update anyway, so it makes sense.

For that matter, would it be better to just "cc" all job eventlog updates to an in-memory circular buffer in the job manager that job-info could consume via a streaming RPC? Call it the job manager's "journal". Published events are easy to work with but the big downside is that they are broadcast across the instance, for naught if there is only the one consumer.

At one point in time I did something similar to this for the annotations. IIRC it worked, but I disliked it b/c, unlike most streaming RPCs, this streaming RPC has to "last forever", i.e. job-info has to re-connect to the job-manager whenever anything goes wrong. It was a bit annoying to deal with, but now that I think of it, I may have overthought my implementation. A timer reactor that re-connects to the job-manager whenever the stream is down is probably more than enough.
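A minimal sketch of that reconnect-forever pattern; the topic matches the RPC name that emerges later in this thread, the response payload key ("events") is an assumption, and a real module would store the future in its context and add backoff:

    #include <flux/core.h>

    static flux_future_t *subscribe_events (flux_t *h);

    /* Continuation for the streaming RPC: on error, resubscribe. */
    static void events_cb (flux_future_t *f, void *arg)
    {
        flux_t *h = arg;
        json_t *events;

        if (flux_rpc_get_unpack (f, "{s:o}", "events", &events) < 0) {
            /* Stream broke (e.g. job-manager reloaded): drop the old
             * future and try again.  A timer could add backoff here. */
            flux_future_destroy (f);
            if (!subscribe_events (h))
                flux_log_error (h, "resubscribe failed");
            return;
        }
        /* ... apply events to the job listing ... */
        flux_future_reset (f);    /* arm for the next response */
    }

    static flux_future_t *subscribe_events (flux_t *h)
    {
        flux_future_t *f;
        if (!(f = flux_rpc (h, "job-manager.events", NULL,
                            FLUX_NODEID_ANY, FLUX_RPC_STREAMING)))
            return NULL;
        if (flux_future_then (f, -1., events_cb, h) < 0) {
            flux_future_destroy (f);
            return NULL;
        }
        return f;
    }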

For the annotations and upcoming fair share priority changes that probably would not be appropriate to land in the job's eventlog, could we still add an eventlog entry to the in-memory journal?

Sounds like a decent idea.

@garlick (Member) commented Sep 14, 2020

OK, let me open an issue on that idea.

@chu11 chu11 changed the title [WIP] job-manger / job-info: publish job data changes so job listings will have up to date information [WIP] job-manager / job-info: get job data changes fro job-manager to job-info so job listings will have up to date information Sep 17, 2020
@chu11 (Member Author) commented Sep 17, 2020

Re-pushed with a rough implementation (needs to be cleaned up) where the job-info module gets information about job priority changes via a new job-manager.updates streaming RPC. The "record" of changes is an eventlog entry, per discussion in #3209.
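For illustration, consuming one such eventlog entry might look like the sketch below; the context key for the priority event is assumed here, not quoted from an RFC:

    #include <string.h>
    #include <jansson.h>

    /* Sketch: pick the priority out of one streamed eventlog entry. */
    static int handle_entry (json_t *entry)
    {
        double timestamp;
        const char *name;
        json_t *context;

        if (json_unpack (entry, "{s:f s:s s:o}",
                         "timestamp", &timestamp,
                         "name", &name,
                         "context", &context) < 0)
            return -1;
        if (!strcmp (name, "priority")) {
            int priority;
            if (json_unpack (context, "{s:i}",
                             "priority", &priority) < 0)
                return -1;
            /* update the job listing, guarding on timestamp as above */
        }
        return 0;
    }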

Currently the streaming RPC sends out one eventlog entry every time there is a priority change. This can be bad / slow b/c each priority change received can trigger a re-sort of the pending job queue in the job-info module.

I was going to update the job-manager to "batch" priority changes, not so dissimilar from the job-manager's current batching of job state transitions. BUT then I realized this same problem exists in the job-manager: changing a job's priority is a single RPC (job-manager.priority), and each one can result in a re-sort of the job-manager's alloc queue.

As discussed in #3052 there are multiple issues in play: how to update many priorities in the job-manager at once (RPC wise and internal data structure wise), how to send that data to the job-info module (both correctly and efficiently), and how to re-sort in the job-info module (data structure wise).

Many of these issues / solutions are still TBD. So I think the current non-optimal implementation may be a good round 1 that at least solves the initial problem brought up in #3052, but we should split out new issues for the performance problems on multiple fronts. I suspect figuring out how to update a large number of job priorities at once in the job-manager will lead us to how to do the same in job-info.

@garlick (Member) commented Sep 17, 2020

If the priority of one job changes, reordering the alloc queue does not change the priorities of other jobs, so that's still one event, right?

@chu11 (Member Author) commented Sep 17, 2020

@garlick correct. Sorry, perhaps I didn't explain my thoughts clearly. The issue I was thinking about is how the priority of multiple jobs is changed right now (obviously in command line form, but underneath it's just the RPCs):

for id in 1 2 3 4 5
do
    flux job priority $id 31
done

this will lead to 5 re-sorts of the alloc queue, which is probably not good.

And in my PR the above will lead to 5 events sent to the job-info module, and thus 5 re-sorts of the job listing pending queue. Also probably not good.

@garlick (Member) commented Sep 17, 2020

The list is stored sorted, so it's just a remove and insert, at least in the job manager case.

We might be generating many eventlog entries in this "journal" thing if we change a lot of priorities at once, e.g. via bank scheduling, but I'm not super worried about, say, thousands of short messages sent point to point on the same node. Maybe I'm not envisioning the full horror though (like a million jobs in the queue, with the banks updated every 5 minutes?).

As you say batching can help...

@garlick (Member) commented Sep 17, 2020

p.s. I was picturing that we would just add code to send all events that go to job eventlogs to this journal, so that we have a generic method of tracking all changes to all jobs rather than a one-off for priority. Maybe that's where you're going, but wouldn't it be easier to just attach to the current batch mechanism and send out all events now?

@chu11 (Member Author) commented Sep 17, 2020

@garlick ohhh, you call zlistx_reorder () instead of zlistx_sort (), which is much better. Didn't know about the reorder function. I'll do that as well.
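For illustration, the reorder approach might look like this sketch, assuming the job struct caches the handle returned by zlistx_insert() (field names are illustrative):

    #include <czmq.h>

    struct job {
        int priority;
        void *list_handle;   /* handle returned by zlistx_insert() */
    };

    /* Sketch: after a priority change, move just this job to its new
     * sorted position instead of re-sorting the whole queue. */
    static void requeue_job (zlistx_t *queue, struct job *job,
                             int new_priority)
    {
        if (job->priority == new_priority)
            return;                 /* nothing to do */
        job->priority = new_priority;
        /* see czmq docs for the low_value search-direction hint */
        zlistx_reorder (queue, job->list_handle, false);
    }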

@chu11 (Member Author) commented Sep 17, 2020

p.s. I was picturing that we would just add code to send all events that go to job eventlogs to this journal, so that we have a generic method of tracking all changes to all jobs rather than a one-off for priority. Maybe that's where you're going, but wouldn't it be easier to just attach to the current batch mechanism and send out all events now?

hmmm, I didn't think about that, b/c that's a lot of events given that the majority aren't needed by job-info. Perhaps if there were a filtering mechanism, so that only certain eventlog entries are sent out on the "updates" requests?

@garlick (Member) commented Sep 17, 2020

Great idea. A filter option in the request (like allow/deny list) makes a lot of sense!

@chu11 chu11 force-pushed the issue3052_job_info_sync branch 2 times, most recently from f10c7c8 to 54d3a8d on September 18, 2020 20:09
@chu11 (Member Author) commented Sep 18, 2020

Just re-pushed. This version renames the job-manager.updates RPC to job-manager.events, and ties job-manager.events into the job-manager batch infrastructure: when the batch timer fires, an array of events (instead of just 1 event as before) is sent to each listener. Each listener can pass a filters option to limit which events it wants to receive.
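From the listener's side, the subscription might look something like this sketch; the exact payload shape of the filters option is an assumption:

    #include <flux/core.h>

    /* Sketch: subscribe to the event stream, asking only for
     * "priority" events via the filters option described above. */
    static flux_future_t *subscribe_priority_events (flux_t *h)
    {
        return flux_rpc_pack (h, "job-manager.events",
                              FLUX_NODEID_ANY, FLUX_RPC_STREAMING,
                              "{s:[s]}", "filters", "priority");
    }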

I think this infrastructure works out really well with the circular buffer in #3209 being considered for the future.

  • Because each listener has its own filters, each "listener" keeps its own json events array containing the events it will send on the next batch timer. This is in contrast to keeping everything in one list / buffer and parsing / filtering everything when it's time to send.

  • B/c of the above, the "circular buffer" doesn't have to be so complex, and a bunch of "where was I in the buffer" pointer management is avoided. A simple zlist or json array can serve as the "buffer" of the most recent N events. Will try to prototype this afternoon.

  • As a follow up to this PR, I think we can merge annotations into this infrastructure and eliminate the job-annotations event.

@garlick (Member) commented Sep 19, 2020

each "listener" has their own json events array containing the events it will send on the next batch timer send.

That sounds great. I was picturing using the actual newline separated, RFC-defined eventlog format in basically a cbuf, but I'm on board with this approach if it works out better. Hopefully the events in the array are objects not string-encoded JSON? (Double encoding JSON should probably be an official anti-pattern by now :-)

@chu11 (Member Author) commented Sep 19, 2020

That sounds great. I was picturing using the actual newline separated, RFC-defined eventlog format in basically a cbuf, but I'm on board with this approach if it works out better. Hopefully the events in the array are objects not string-encoded JSON? (Double encoding JSON should probably be an official anti-pattern by now :-)

Yup, arrays of objects, and super cheap to just json_incref() the objects into the arrays.

The one inefficiency is the regular looping over the "filters" for comparison:

    /* For each event, scan this listener's filter list and append the
     * eventlog entry to the listener's batch if the event name matches. */
    json_array_foreach (el->filters, index, value) {
        const char *str = json_string_value (value);
        if (!strcmp (str, name)) {
            if (json_array_append (el->events, entry) < 0)
                goto nomem;
            event->batch->events_to_send = true;
            break;  /* avoid duplicate appends if a filter repeats */
        }
    }

I could do some hashing tricks in the future to eliminate the regular looping, but figured this was fine for now.
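One possible shape for that future hashing trick, sketched with czmq's zhashx (illustrative, not part of this PR):

    #include <stdbool.h>
    #include <czmq.h>
    #include <jansson.h>

    /* Build once, when a listener registers its filters. */
    static zhashx_t *filter_set_create (json_t *filters)
    {
        zhashx_t *set = zhashx_new ();   /* string keys by default */
        size_t index;
        json_t *value;
        if (!set)
            return NULL;
        json_array_foreach (filters, index, value)
            zhashx_insert (set, json_string_value (value), (void *)1);
        return set;
    }

    /* Then the per-event filter check is a single lookup. */
    static bool event_wanted (zhashx_t *set, const char *name)
    {
        return zhashx_lookup (set, name) != NULL;
    }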

@chu11 chu11 changed the title [WIP] job-manager / job-info: get job data changes fro job-manager to job-info so job listings will have up to date information [WIP] job-manager / job-info: get job data changes from job-manager to job-info so job listings will have up to date information Sep 21, 2020
@chu11 chu11 changed the title [WIP] job-manager / job-info: get job data changes from job-manager to job-info so job listings will have up to date information job-manager / job-info: get job data changes from job-manager to job-info so job listings will have up to date information Sep 21, 2020
@chu11 (Member Author) commented Sep 21, 2020

Just re-pushed with some cleanup + fixes + more tests. I removed WIP b/c we're approaching something that I think will be our final solution.

@chu11 chu11 force-pushed the issue3052_job_info_sync branch 2 times, most recently from e4369ab to b5d22ae on September 21, 2020 13:48
@chu11 (Member Author) commented Sep 21, 2020

Re-pushed, adding 1 more test for "reconnection" of the job-info module when the job-manager is reloaded. Perhaps this additional test was overkill.

@garlick (Member) commented Sep 21, 2020

LMK when you're ready for a review @chu11.

Call flux_log() instead of flux_log_error() on functions that do not
set errno.
Add forgotten unsubscribe of 'job-annotations' during teardown.
Do not re-order the alloc queue if a priority change request does not
actually change the job's priority.
Move the job-manager.disconnect event from wait.c to job-manager.c
in preparation for additional disconnect needs in the future.
When first loading job data for listings, in addition to parsing
the "submit" event, also look for "priority" events, in case the
job's priority has been updated.
Support a new job-manager.events RPC, which streams job eventlog
entries as they occur.  Streamed events can be configured via a
filters option in the RPC.

Support an associated job-manager.events-cancel RPC as well.
Add job manager stats callback to output the number of listeners
currently active in the job-manager.
Store the timestamp of when a job's priority was changed.
@garlick (Member) commented Sep 28, 2020

the big one ... consider refactoring to use only job-manager.events and not job-state. I have to think about this more to decide whether it's a wise idea.

I hope so, otherwise this mechanism wasn't the good idea I thought it was...

@chu11 chu11 force-pushed the issue3052_job_info_sync branch 2 times, most recently from 5f0ad64 to 30d15a8 on September 28, 2020 18:28
Request and listen to job-manager events from the job-manager.events
streaming RPC.  Filter out all events except for the "priority" event.

Handle changes to job priorities and re-order the pending queue as needed.

Fixes flux-framework#3052
With the job-info module now dependent on the job-manager.events
RPC stream, the job-info module needs to be loaded after the job-manager
module is loaded.

To avoid unnecessary errors, the job-info module is now unloaded before
the job-manager as well.
With the job-info module now dependent on the job-manager, update
tests to reflect this.
Add new tests to test the job-manager.events and job-manager.events-cancel
handlers.

Add new testing tool t/job-manager/event_stream to stream events from the
job-manager.
Add new tests to ensure that priority changes made to jobs in the
job-manager eventually reach the job-info module.
@chu11 (Member Author) commented Sep 28, 2020

Re-pushed with all of the fixes discussed above, squashed.

@chu11 (Member Author) commented Sep 28, 2020

I hope so, otherwise this mechanism wasn't the good idea I thought it was...

It minimally solves "update job-info with info about changes in the job-manager", which is the important thing. I think it'll work out; we're just 1 or 2 steps ahead and it's hard to see the end at the moment.

@codecov-commenter

Codecov Report

Merging #3208 into master will increase coverage by 0.00%.
The diff coverage is 71.48%.

@@           Coverage Diff            @@
##           master    #3208    +/-   ##
========================================
  Coverage   81.40%   81.40%            
========================================
  Files         292      292            
  Lines       44398    44632   +234     
========================================
+ Hits        36141    36334   +193     
- Misses       8257     8298    +41     
Impacted Files Coverage Δ
src/modules/job-info/job_state.c 71.52% <50.60%> (-0.62%) ⬇️
src/modules/job-manager/job-manager.c 55.55% <63.63%> (+1.26%) ⬆️
src/modules/job-manager/event.c 78.87% <81.28%> (+1.50%) ⬆️
src/modules/job-manager/priority.c 94.59% <100.00%> (+0.30%) ⬆️
src/modules/job-manager/submit.c 79.04% <100.00%> (+0.20%) ⬆️
src/modules/job-manager/wait.c 79.31% <100.00%> (ø)
src/common/libutil/dirwalk.c 93.70% <0.00%> (-0.70%) ⬇️
src/broker/broker.c 74.02% <0.00%> (+0.11%) ⬆️
src/common/libflux/message.c 82.98% <0.00%> (+0.12%) ⬆️
... and 5 more

@garlick (Member) commented Sep 28, 2020

If you're ready, I'm fine with MWP!

@chu11 (Member Author) commented Sep 28, 2020

@garlick thanks
