rfc: add notification service design doc #414

Open: wihobbs wants to merge 1 commit into master from flanrfc

Conversation

@wihobbs (Member) commented May 13, 2024

This is an RFC-style design document being put up to document the design we have for the flux job notification service.

I'm looking to get feedback on the clarity of this document and make sure we have as many edge cases as possible covered. If this isn't an appropriate RFC for the whole project, it can be moved somewhere else, but the GitHub PR interface provides a nice way for us to engage in a feedback loop.

I wanted to add a flow chart detailing some of the interactions but am still working on that. I think there's enough here for a first pass review, though.

@wihobbs force-pushed the flanrfc branch 2 times, most recently from a56d23d to 60db4cf on May 13, 2024 at 15:50
@cmoussa1 (Member) left a comment

Great start @wihobbs! I've left some comments mostly just looking at grammar; feel free to take them or leave them. Sorry in advance if I am misunderstanding some of the RFC language here!

spec_44.rst Outdated
- Support notification after any event of the job, where events are defined in
:doc:`spec_21`.
- Support email notifications, as well as a driver capable of sending POST
requests to any chat service, provided they have an API capable of accepting
Member

Feel free to ignore this suggestion, but I would perhaps be cognizant here of putting "any" in front of "chat services". :-) IMHO I would stick with the chat services you previously mentioned, i.e. Slack & Mattermost to start, and build up from there!

"Support email notifications, as well as a driver capable of sending POST requests to chat services, provided they have an API capable of accepting such requests, including but not limited to Mattermost and Slack."

Member Author

Do you think it's necessary to have the definition of chat services in the Terminology section? On reflection, it seems redundant to have a definition of chat services as Mattermost and Slack and then reiterate that later in the doc too. I think I can drop the "Chat services" definition in terminology...? Thoughts?

@grondo (Contributor) commented May 13, 2024

Part of the reason the design of the notification service is split into a jobtap plugin and Python service is to make it easier to make the notification types extensible via plugins (I assume that is why the term "adaptable" appears in the name here).

It might be clearer to make the (perhaps eventual) extensibility of the notification service a first class design goal in this document, and mention this is done "in order to allow easy support of alternate notification platforms, such as Slack or Mattermost, via Python plugins" and leave it at that.

Then, make the only requirement that the notification service SHALL support notification via email.

Member Author

As an aside, "adaptable" in the name is also meant to support the goal of making the notifications flexible to user demands, i.e., a user can specify what job metadata and formatting would be relevant and useful to them, and the service will include those fields/comply with that formatting in the notification. (This is the feature described, albeit briefly, in "user interface/advanced use cases" for possible support in v2.)

I like the idea of only having to support email as the requirement for v1.
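
For the v1 email requirement, here is a minimal sketch of what the Python driver's delivery step could look like, using only the standard library. The function name, the From address, and the assumption that a local SMTP relay is reachable from the management node are placeholders rather than settled design; the real driver would pull the recipient from LDAP and the job metadata from Flux.

import smtplib
from email.message import EmailMessage

def send_job_notification(to_addr, jobid, event, smtp_host="localhost"):
    """Send a plain-text notification that `jobid` reached `event`."""
    msg = EmailMessage()
    msg["Subject"] = f"flux job {jobid}: {event}"
    msg["From"] = "flux-notify@example.com"   # placeholder sender
    msg["To"] = to_addr
    msg.set_content(f"Your job {jobid} has reached the {event} event.")

    # Assumes an SMTP relay is listening on smtp_host.
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)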

Member

such as Slack or Mattermost, via Python plugins" and leave it at that.

+1 to providing these as "examples of such plugins" and not making any promises. Indeed, a notification service at the level of a user is going to have a different level of permission than one that goes to an entire channel, or a single email. And the design you choose for each is going to be subtly different. I've made bots for Twitter, Slack, Mastodon, Discord, GitHub, Reddit, email, and niche ones like UserVoice or other help desks, and they are all special snowflakes. They change their APIs too (I'm looking at you, Twitter)!

@garlick (Member) left a comment

Excellent!

I had a few minor suggestions about the flow of the document. Feel free to ignore if it doesn't make sense to you!

index.rst Outdated
Comment on lines 289 to 290
The Flux Library for Adaptable Notifications (FLAN) provides a connection to
external notification services (such as email) for steps in a batch job.
Member

Excellent name 👍

Did you mean "job state transitions" rather than "steps"? (Technically, we don't have steps in flux.)

Member Author

Ah, this is good feedback. I think I meant to say events actually. Is that appropriate here?

Member Author

And glad you like the name! I will have to learn how to make one when the work is finished. 🍰

Contributor

BTW, thought of a good tagline for this project:

flux-flan: sweet job notifications for flux

Member

Love me those sweet, sweet notifications! 🍰

spec_44.rst Outdated
Comment on lines 70 to 73
The jobtap plugin
A shared library based on the API defined in
`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html#jobtap-plugin-names>`_
which streams the jobids of notification-enabled jobs to the python driver.
Member

Again, maybe it's too early to define the implementation in terms of jobtap plugin and python driver since we're not to the requirements yet?

@wihobbs (Member Author) commented May 13, 2024

I liked having definitions for these, since "jobtap plugin" etc. has a broader meaning outside this RFC, so I moved these to below "Implementation" and kept the existing definitions. On a second pass let me know if that formatting change is helpful/harmful to the flow.

@chu11 (Member) left a comment

Just a side question, is the python driver typically going to be associated with the system instance? So we'd run this python driver as a side daemon.

Or is the python driver something the user would start with their subinstance? How would we expect them to generally start the process? Like add a python myscript.py & in their batch file?

spec_44.rst Outdated
Comment on lines 171 to 180
The default behavior SHALL be to send a notification to the users' primary email
address, as provided by an LDAP query, when the job reaches the START and FINISH
events.
@chu11 (Member) commented May 13, 2024

I initially read this and thought this implementation was way too specific. But then I thought maybe you're just giving a use case example. Perhaps the words "The default behavior" threw me off. Maybe "For example, the default behavior could be ..."

Member Author

The default behavior is supposed to be detailing what happens when a user says flux batch --setattr=system.notify=default. The eventual goal is to support doing crazy things like --setattr=system.notify.service=slack --setattr=system.notify.events=validate,epilog-start --setattr=system.notify.include="{id.f58} {R} {eventlog}", etc., so I wanted to give the behavior in the general case (which is probably all we'll actually support in v1).

I do think that maybe specifying LDAP is a tad specific. Though I've generally tried to be as specific as possible in this RFC since it's more of a design document for a service than a traditional RFC...

Contributor

We may want to think through the ergonomics of requesting notifications before codifying them into an RFC. Requiring a user to specify multiple --setattr options on the command line might not go over that well...

Contributor

On second thought, it would be fine to have something in here; it can always be updated in the future. However, just keep in the back of your mind that users would appreciate a single --setattr, so perhaps comma-delimited options or other ways to provide options in a single go should be considered.

Member

For the longer term, maybe a design that is readable and simple, something like --notify.service=slack that I can read from left to right.

@wihobbs (Member Author) commented May 13, 2024

Just a side question, is the python driver typically going to be associated with system instance? So we'd run this python driver as a side daemon.

The idea was that a sysadmin would load the jobtap plugin (maybe just include it in the config?) and then start the Python driver under systemd on the management node.

Still TBD is how a user would be able to start this service themselves to get notifications for batch jobs in sub-instances (which is currently how some DATs are organized).
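
For reference, flux-core already allows jobtap plugins to be loaded from the broker configuration via the job-manager plugins table, so the "include it in the config" idea could look like the snippet below. The notify.so filename and the path in the comment are placeholders; only the [job-manager] plugins table syntax is existing flux-core configuration.

# e.g. a TOML file in the system instance's config directory (path illustrative)
[job-manager]
plugins = [
    { load = "notify.so" },
]

The Python driver would then be a separate systemd service on the management node that connects to the local broker.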

@wihobbs force-pushed the flanrfc branch 2 times, most recently from e227539 to a350340 on May 13, 2024 at 22:36
@grondo (Contributor) left a comment

Ok, I just took a quick first pass 👍

spec_44.rst Outdated

The ``notify.enable`` request has no payload.

At initialization the python driver SHALL create a kvs subdirectory, ``notify``.
Contributor

What if the directory already exists? :-)

Member Author

The Python driver can traverse the existing directory for jobid keys.

If the jobtap plugin could insert and remove jobid keys itself, this eliminates the need for the record of jobids in the jobtap plugin ("hash table"). The jobtap plugin could add the key in a DEPEND callback and remove it in an INACTIVE callback. Currently the KVS is only being used to ensure uniqueness of the notification being sent.

Thoughts on this design?
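
To make the "directory already exists" handling concrete, here is a sketch of idempotent initialization in the Python driver. The kvs calls (exists, mkdir, commit, get_dir) reflect my reading of the flux.kvs bindings and should be double-checked; the notify key layout is the one proposed in the spec, and resuming from leftover jobid keys is only one possible policy.

import flux
import flux.kvs as kvs

def init_notify_dir(h):
    """Create the 'notify' KVS directory if needed; return any jobid
    keys already present (e.g. left over from a previous run)."""
    if not kvs.exists(h, "notify"):
        kvs.mkdir(h, "notify")
        kvs.commit(h)
        return []
    # Directory already exists: traverse it for jobid keys so the driver
    # can resume watching those jobs instead of starting from scratch.
    return list(kvs.get_dir(h, "notify"))

if __name__ == "__main__":
    pending = init_notify_dir(flux.Flux())
    print(f"resuming {len(pending)} notification-enabled jobs")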

index.rst Outdated
:doc:`spec_44`
~~~~~~~~~~~~~~

The Flux Library for Adaptable Notifications (FLAN) provides a connection to
Member

This is an excellent name. Although I do not like the texture of flan, +1 from me!

spec_44.rst Outdated

- By default in a system-instance, do not notify a user of any job events.
Allow the user to override this default with a jobspec attribute,
system.notify.
Member

Why does it go under system?

Member Author

There's a longer explanation in RFC 14/Canonical Job Specification about system vs. user jobspec attributes. The TLDR as I understand it is that attributes Flux needs go under system (such as changing your bank in flux-accounting, etc.) while things that the user wants to propagate to their application go under user.

A benefit of having jobspec attributes under system is that's the default if nothing is specified in setattr, i.e. --setattr=notify=default expands to --setattr=system.notify=default. Another is that under system we can make notify a reserved word which isn't possible under user.

I had a longer conversation with Mark about this that got eaten by the Slack monster :( where he had a more detailed explanation for why the notification service attributes would go under system.

Contributor

That's a pretty good summary @wihobbs.

spec_44.rst Outdated
- Support notification after any event of the job, where events are defined in
:doc:`spec_21`.
- Support email as the primary end user notification delivery.
- Build a driver capable of sending POST requests to chat services for
Member

Where do you see tokens or credentials for these posts going? I assume it's on the level of the user? So the user needs to set up their own Slack app to make it work? But what about an organization or group - how would that be handled, permissions-wise, for the credentials?

Member Author

Actually, I had envisioned credentials/tokens being stored in a config file on the management node. The only thing a user would need to pass to the jobspec would be their handle, i.e. hobbs17 on the LLNL slack. The slack admins would set up a bot to handle delivering the message to the user, and the sysadmins would store the webhooks for this bot in a toml file on the management node.

An edge case I haven't considered yet is "what if we need to support multiple 'teams' on Slack," i.e. LLNL and llnl-performance.

I should probably go read the Slack/Mattermost API docs and make sure this approach will actually work. I've built very little with MM and nothing with Slack...
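
On the Flux side, the chat-service driver mostly reduces to reading a webhook URL from the admin-maintained TOML file and POSTing a JSON payload to it. The sketch below uses only the standard library; the config section/key names are placeholders, and the payload shape is the minimal "text" field that Slack and Mattermost incoming webhooks both happen to accept, not a full treatment of either API.

import json
import tomllib  # Python 3.11+; use tomli on older interpreters
import urllib.request

def post_chat_notification(config_path, service, message):
    """POST `message` to the webhook configured for `service`.

    Expects a TOML file like (names illustrative):
        [slack]
        webhook = "https://hooks.example.com/..."
    """
    with open(config_path, "rb") as f:
        config = tomllib.load(f)
    url = config[service]["webhook"]

    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status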

Contributor

In any event I would probably place the configuration location of third-party tokens, etc. outside the scope of this document, which should focus on how the service monitors jobs for events of interest to turn them into generic notifications (i.e., the protocol between the Python service and the jobtap plugin, how restarts are handled, etc.).


.. note::
This design is intended to ensure that no double-notifications are sent upon
the restart of the Python script, the jobtap plugin, or the job-manager.
Member

Unless the directory deletion is missed?

If a disconnect request is received by the jobtap plugin, this indicates the
python driver has exited. The jobtap plugin SHALL continue to add notification-
enabled jobs to its hash table as they enter the DEPEND state. When the python
driver reconnects, the jobtap plugin SHALL respond to its initial ``notify.enable``
Member

What if it doesn't reconnect?

Member Author

"reconnect" is probably the wrong word. The Python driver if disconnected from the jobtap plugin will exit immediately and log an error. Eventually, systemd or an admin will start a new Python driver which sends a new initial notify.enable RPC request. No notifications can be sent while there's not an active Python driver process.

Member

My question then is - what if the admin doesn't start a new Python driver? The hash table keeps filling up and explodes?

Member Author

This is a really good point. I hadn't thought of this. Maybe the jobtap plugin should remove jobids when they reach an inactive state... the notifications would be missed, though. Hmm. 🤔

There might be a way to log a warning reminding the user/admins to start the Python driver, but I'll have to look into that more.

Contributor

Let's add this to the list of things to investigate. It shouldn't be an issue to keep a hash/list of all jobids with notifications, but we should not assume there is not a better way at this point in the design.

Member

Given this, my question above is becoming a bit more urgent. Is there a reason to keep these things separate? If so, is there a reason the jobtap plugin shouldn't manage the lifetime of the driver as a subprocess or systemd unit?

spec_44.rst Outdated
-------------------------

The webhooks and other secrets required to connect to chat services SHALL be included
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN
Member

I would either place this in a directory that has a meaningful path (for the developer to see) or give it a name that more specifically describes what it is. And I am wondering who has ownership of this file - a single user, or does a center need to take responsibility, for example, to provide a single credential for all users to use (is that a good idea)?

Member

Likely both use cases could be wanted (the "I am a user with my own custom plugin and private credential" and the "I am a center providing this to my users").

Member Author

Oh, maybe for the "center providing for users" case this needs an /etc/flux/notify.d/ or similar. Probably an implementation detail outside this RFC, but I had not thought about standardizing the config file path. Good idea!

In the event the job-manager crashes or is shut down the python driver SHALL exit
immediately and log an error.

Flux does not currently support restarting with running jobs. However, on a system
Member

Trying to restore state sounds messy - is there any reason to not recreate notifications on restart so there aren't any mismatches between state brought up and state known to the plugin?

Member Author

Notifications would be recreated internally in the Python driver, since losing its connection to Flux is a fatal error (and that would happen in a system-instance restart).

Restoring state is messy...but I want to try for it anyway :). Here's an idea for how:

  1. Python driver crashes because job-manager is shut down.
  2. Job-manager is restarted (and so is the KVS). KVS comes back; jobtap plugin is loaded on start. All job eventlogs are replayed.
  3. All of the jobtap plugin depend callbacks are hit again. Jobtap plugin records these jobids.
  4. A "new" Python driver starts and gets a bunch of jobids sent to it. It watches the events [that may be somewhat stale.] (This was @chu11's point in a message above.) It does not add jobid folders to the KVS because they are already there.
  5. Before it sends a notification, it checks the KVS notify.<jobid> folder to see if __ (event) key is there. If the key is not there, that means a notification has already been sent. (On reflection, maybe it should be flipped -- i.e. if the key is there, that means a notification has already been sent.)
  6. Stop! Do not pass go, do not collect $200! Because we know the job is being watched, but a notification has already been sent, keep chugging and watching for others that may not have been sent.
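
A minimal sketch of the step-5 uniqueness check, assuming the flipped convention (key present means the notification was already sent), a notify.<jobid>.<event> key layout, and the flux.kvs exists/put/commit functions; none of this is settled design, just an illustration:

import flux
import flux.kvs as kvs

def should_notify(h, jobid, event):
    """Return True if no notification for (jobid, event) was sent yet,
    and record the send so a restarted driver will not repeat it."""
    key = f"notify.{jobid}.{event}"
    if kvs.exists(h, key):
        # Key already present: this notification went out before the restart.
        return False
    kvs.put(h, key, 1)   # record the send before actually sending
    kvs.commit(h)
    return True

Recording the key before sending favors "no duplicates" over "no misses" if the driver dies between the commit and the actual send; the opposite ordering trades the other way, which is exactly the uniqueness question flagged later in this thread.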

Contributor

It is a bit unclear here what is meant by the "depend callbacks are hit again".

Note that there are two jobtap callbacks related to dependencies and the DEPEND state:

  • job.state.depend: One of the generic job.state.* callbacks, called when the job reaches the DEPEND state.
  • job.dependency.SCHEME: Called for each dependency SCHEME in jobspec of a submitted job

I don't think the job.state.* callbacks are called on reload. You can test this by writing a jobtap plugin that subscribes to all these states and prints what it sees on a restart.

The job.dependency.SCHEME callbacks are specifically issued on a restart for jobs with dependencies defined in jobspec.

I know we talked about this before, but I don't think I properly understood you were talking about one of these callback types and not the other.

Probably the best way for plugins to be notified of new jobs is to use the job.new callback, which is called for new jobs after they've been validated, on plugin reload, and on job manager restart.

Member Author

For lack of a better place to document this (tagging @garlick as well):

It appears the job.state.priority callback is re-triggered when flux start is run in recovery mode with pending jobs, but no other job.state.* callbacks.

#include <flux/core.h>
#include <flux/jobtap.h>
#include <stdio.h>
#include <jansson.h>
#include <stdlib.h>

static int print_cb (flux_plugin_t *p,
                     const char *topic,
                     flux_plugin_arg_t *args,
                     void *arg)
{
    flux_jobid_t id;
    int state;
    flux_t *h = flux_jobtap_get_flux (p);
    json_t *event;

    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:I s:i s:o}",
                                "id", &id,
                                "state", &state,
                                "entry", &event) < 0)
        return -1;

    if (id) {
        /* Log the jobid, the numeric state, and the raw eventlog entry */
        char *json_string = json_dumps (event, JSON_ENCODE_ANY);
        flux_log (h, 0, "jobid %lu has entered state %d %s",
                  (unsigned long) id, state, json_string);
        free (json_string);
    }

    return 0;
}

static const struct flux_plugin_handler tab[] = {
    {"job.state.*", print_cb, NULL},
    {0},
};

int flux_plugin_init (flux_plugin_t *p) {
    if (flux_plugin_register(p, "jobtap-print", tab) < 0) {
        return -1;
    }
    return 0;
}

And on shutdown/restart:

  auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml
(s=1,d=0)  auk108 ~/scripts/jobtap-test $ flux submit --urgency=0 hostname
May 14 15:19:03.251059 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009410 {"timestamp": 1715725143.2404392, "name": "validate"}
May 14 15:19:03.251083 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009412 {"timestamp": 1715725143.2510674, "name": "depend"}
May 14 15:19:03.251104 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009416 {"timestamp": 1715725143.251091, "name": "priority", "context": {"priority": 0}}
ƒ9h7knoh
(s=1,d=0)  auk108 ~/scripts/jobtap-test $ flux shutdown --dump=$(pwd)/dump.tar
  auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml --recovery=$(pwd)/dump.tar
May 14 15:22:46.751219 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517508 {"timestamp": 1715725366.7511699, "name": "flux-restart"}
May 14 15:22:46.751270 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517512 {"timestamp": 1715725366.7512314, "name": "priority", "context": {"priority": 0}}
+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
|  reduced functionality until manually corrected.
|
| broker.rc1_path    /home/hobbs17/flux-core/../bin/etc/flux/rc1
| broker.rc3_path    /home/hobbs17/flux-core/../bin/etc/flux/rc3
| config.path        /home/hobbs17/scripts/jobtap-test/config.toml
| statedir           changes will not be preserved
|
| Exit this shell when finished.
+-----------------------------------------------------

I accept the suggestion to tie the streaming RPC response to job.new, which is replayed on restart (tweaked the test above and verified).

We will have to make the ability to restart the service dependent on a requirement that this plugin be included in the job-manager's plugins table in the config.

Contributor

job.new is also called for every active job when a plugin is loaded (on the assumption that this is the method that "introduces" jobs to jobtap plugins). This should allow the plugin to be reloaded (e.g. to get a new version) and get back to the same place it was before it was unloaded.

Contributor

Nice job on the test plugin, btw. You could tweak it to register callbacks for * if you want to see everything.

@wihobbs (Member Author) commented May 14, 2024

Note that ^ push is mostly grammar/clarity fixes and I'm still working through the [excellent, but involved :) ] design feedback.

@vsoch (Member) commented May 15, 2024

@wihobbs you've probably already exceeded most of us in knowledge of jobtap plugins, but in case you are interested we are having a little (hopefully fun) hackathon this week on Thursday - grandmaster @trws is going to show us the ropes for making plugins, and ❗ I think there might be rust involved! If you are interested, even for the fun or sharing your expertise, I can send along the invite.

@wihobbs force-pushed the flanrfc branch 2 times, most recently from b6366cc to 2883259 on May 15, 2024 at 17:59
@wihobbs (Member Author) commented May 15, 2024

Tons of great feedback on this. Thanks to everybody for participating and sharing your expertise.

I tried to close or consolidate duplicate issues. I apologize in advance if something was inadvertently closed -- please reopen if so.

I'm making a high-level summary of the outstanding design todos:

  • Recovery of state on a crash. Note that there are three things that can crash, and we should possibly have a different strategy for handling each:
    • Crash of the job-manager (+Python driver+jobtap plugin) -- essentially restoring the state of everything.
    • Crash of the Python driver, but jobtap plugin and job-manager persist.
    • Crash of the jobtap plugin, but not the job-manager (+?Python driver)
  • Ensuring uniqueness of events before sending notifications (related: no double-or-worse notifications, and no missed notifications)
  • Rate limiting: why do we need it and how will we implement it?
  • Extensibility of the service via plugins (for external services such as Slack). Implementation of individual services is not a requirement.
  • Ergonomics of the "advanced use case," where we allow a user to specify services other than email, non-standard things to include, and states other than START and FINISH in which to trigger notifications.

`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html>`_
that streams the jobids of notification-enabled jobs to the Python driver.

The Python driver
Member

Is this documenting a thing that already exists? If not, is there a reason these two should be separate? If it's a language issue, a python-based jobtap plugin shouldn't be an issue.

Contributor

It's largely because the jobtap plugin resides in the job manager and we want it to do as little as possible since it would be in the critical path.

The jobtap plugin (at the time) seemed like the most efficient way to avoid having to fetch and examine the jobspec of every job since it can just peek at the jobspec that the job manager already has in memory and send the jobid along to the Python process only when the user has requested notification.

However, I'm glad you asked because I wonder if there's a better way to do this using the job-manager journal or some other more widely applicable method. It sure would be nice if it was easy for people to develop services like this and not have to install yet another jobtap plugin to identify jobs of interest.

Something to consider...

Member

Ah, that's a good point. Blocking the job manager would be unfortunate.

The journal is a good thought, or maybe even a pub-sub event from the plugin so anyone could subscribe to it? Not sure, there are definitely tradeoffs here, but it sounds like the current approach has some potential to cause lost or repeated notifications as well, especially races between event delivery, KVS removal, and jobtap replay if flux is restarted.

Problem: no design currently exists for the Flux
email service as noted in flux-framework/flux-core#4435.

Add an RFC-style document detailing this.