rfc: add notification service design doc #414

Open: wihobbs wants to merge 1 commit into master from flanrfc

Conversation

@wihobbs (Member) commented May 13, 2024

This is an RFC-style design document being put up to document the design we have for the flux job notification service.

I'm looking to get feedback on the clarity of this document and make sure we have as many edge cases as possible covered. If this isn't an appropriate RFC for the whole project, it can be moved somewhere else, but the GitHub PR interface provides a nice way for us to engage in a feedback loop.

I wanted to add a flow chart detailing some of the interactions but am still working on that. I think there's enough here for a first pass review, though.

@wihobbs force-pushed the flanrfc branch 2 times, most recently from a56d23d to 60db4cf on May 13, 2024 at 15:50
@cmoussa1 (Member) left a comment

Great start @wihobbs! I've left some comments mostly just looking at grammar; feel free to take them or leave them. Sorry in advance if I am misunderstanding some of the RFC language here!

spec_44.rst Outdated
- Support notification after any event of the job, where events are defined in
:doc:`spec_21`.
- Support email notifications, as well as a driver capable of sending POST
requests to any chat service, provided they have an API capable of accepting
Member

Feel free to ignore this suggestion, but I would perhaps be cognizant here of putting "any" in front of "chat services". :-) IMHO I would stick with the chat services you previously mentioned, i.e. Slack & Mattermost to start, and build up from there!

"Support email notifications, as well as a driver capable of sending POST requests to chat services, provided they have an API capable of accepting such requests, including but not limited to Mattermost and Slack."

Member Author

Do you think it's necessary to have the definition of chat services in the Terminology section? On reflection, it seems redundant to have a definition of chat services as Mattermost and Slack and then reiterate that later in the doc too. I think I can drop the "Chat services" definition in terminology...? Thoughts?

@grondo (Contributor) commented May 13, 2024

Part of the reason the design of the notification service is split into a jobtap plugin and Python service is to make it easier to make the notification types extensible via plugins (I assume that is why the term "adaptable" appears in the name here).

It might be clearer to make the (perhaps eventual) extensibility of the notification service a first class design goal in this document, and mention this is done "in order to allow easy support of alternate notification platforms, such as Slack or Mattermost, via Python plugins" and leave it at that.

Then, make the only requirement that the notification service SHALL support notification via email.

Member Author

As an aside, "adaptable" in the name is also meant to support the goal of making the notifications flexible to user demands, i.e., a user can specify what job metadata and formatting would be relevant and useful to them, and the service will include those fields/comply with that formatting in the notification. (This is the feature described, albeit briefly, in "user interface/advanced use cases" for possible support in v2.)

I like the idea of only having to support email as the requirement for v1.
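
For the v1 email requirement, here is a minimal sketch of what the Python driver's delivery step could look like, using only the standard library. The function name, the From address, and the assumption that a local SMTP relay is reachable from the management node are placeholders rather than settled design; the real driver would pull the recipient from LDAP and the job metadata from Flux.

import smtplib
from email.message import EmailMessage

def send_job_notification(to_addr, jobid, event, smtp_host="localhost"):
    """Send a plain-text notification that `jobid` reached `event`."""
    msg = EmailMessage()
    msg["Subject"] = f"flux job {jobid}: {event}"
    msg["From"] = "flux-notify@example.com"   # placeholder sender
    msg["To"] = to_addr
    msg.set_content(f"Your job {jobid} has reached the {event} event.")

    # Assumes an SMTP relay is listening on smtp_host.
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)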

Member

such as Slack or Mattermost, via Python plugins" and leave it at that.

+1 to providing these as "examples of such plugins" and not making any promises. Indeed, a notification service at the level of a user is going to have a different level of permission than one that goes to an entire channel, or a single email. And the design you choose for each is going to be subtly different. I've made bots for Twitter, Slack, Mastodon, Discord, GitHub, Reddit, email, and niche ones like UserVoice or other help desks, and they are all special snowflakes. They change their APIs too (I'm looking at you, Twitter)!

@garlick (Member) left a comment

Excellent!

I had a few minor suggestions about the flow of the document. Feel free to ignore if it doesn't make sense to you!

index.rst Outdated
Comment on lines 289 to 290
The Flux Library for Adaptable Notifications (FLAN) provides a connection to
external notification services (such as email) for steps in a batch job.
Member

Excellent name 👍

Did you mean "job state transitions" rather than "steps"? (Technically, we don't have steps in flux.)

Member Author

Ah, this is good feedback. I think I meant to say events actually. Is that appropriate here?

Member Author

And glad you like the name! I will have to learn how to make one when the work is finished. 🍰

Contributor

BTW, thought of a good tagline for this project:

flux-flan: sweet job notifications for flux

Member

Love me those sweet, sweet notifications! 🍰

spec_44.rst Outdated
Comment on lines 70 to 73
The jobtap plugin
A shared library based on the API defined in
`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html#jobtap-plugin-names>`_
which streams the jobids of notification-enabled jobs to the python driver.
Member

Again, maybe it's too early to define the implementation in terms of jobtap plugin and python driver since we're not to the requirements yet?

@wihobbs (Member Author) commented May 13, 2024

I liked having definitions for these, since "jobtap plugin" etc. has a broader meaning outside this RFC, so I moved these to below "Implementation" and kept the existing definitions. On a second pass let me know if that formatting change is helpful/harmful to the flow.

@chu11 (Member) left a comment

Just a side question, is the python driver typically going to be associated with the system instance? So we'd run this python driver as a side daemon.

Or is the python driver something the user would start with their subinstance? How would we expect them to generally start the process? Like add a python myscript.py & in their batch file?

spec_44.rst Outdated
Comment on lines 171 to 180
The default behavior SHALL be to send a notification to the users' primary email
address, as provided by an LDAP query, when the job reaches the START and FINISH
events.
@chu11 (Member) commented May 13, 2024

I initially read this and thought this implementation was way too specific. But then I thought maybe you're just giving a use case example. Perhaps the words "The default behavior" threw me off. Maybe "For example, the default behavior could be ..."

Member Author

The default behavior is supposed to be detailing what happens when a user says flux batch --setattr=system.notify=default. The eventual goal is to support doing crazy things like --setattr=system.notify.service=slack --setattr=system.notify.events=validate,epilog-start --setattr=system.notify.include="{id.f58} {R} {eventlog}", etc., so I wanted to give the behavior in the general case (which is probably all we'll actually support in v1).

I do think that maybe specifying LDAP is a tad specific. Though I've generally tried to be as specific as possible in this RFC since it's more of a design document for a service than a traditional RFC...

Contributor

We may want to think through the ergonomics of requesting notifications before codifying them into an RFC. Requiring a user to specify multiple --setattr options on the command line might not go over that well...

Contributor

On second thought, it would be fine to have something in here; it can always be updated in the future. However, just keep in the back of your mind that users would appreciate a single --setattr, so perhaps comma-delimited options or other ways to provide options in a single go should be considered.

Member

For the longer term, maybe a design that is readable and simple, something like --notify.service=slack that I can read from left to right.

@wihobbs (Member Author) commented May 13, 2024

Just a side question, is the python driver typically going to be associated with system instance? So we'd run this python driver as a side daemon.

The idea was that a sysadmin would load the jobtap plugin (maybe just include it in the config?) and then start the Python driver under systemd on the management node.

Still TBD is how a user would be able to start this service themselves to get notifications for batch jobs in sub-instances (which is currently how some DATs are organized).
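
For reference, flux-core already allows jobtap plugins to be loaded from the broker configuration via the job-manager plugins table, so the "include it in the config" idea could look like the snippet below. The notify.so filename and the path in the comment are placeholders; only the [job-manager] plugins table syntax is existing flux-core configuration.

# e.g. a TOML file in the system instance's config directory (path illustrative)
[job-manager]
plugins = [
    { load = "notify.so" },
]

The Python driver would then be a separate systemd service on the management node that connects to the local broker.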

@wihobbs force-pushed the flanrfc branch 2 times, most recently from e227539 to a350340 on May 13, 2024 at 22:36
@grondo (Contributor) left a comment

Ok, I just took a quick first pass 👍

spec_44.rst Outdated

The ``notify.enable`` request has no payload.

At initialization the python driver SHALL create a kvs subdirectory, ``notify``.
Contributor

What if the directory already exists? :-)

Member Author

The Python driver can traverse the existing directory for jobid keys.

If the jobtap plugin could insert and remove jobid keys itself, this eliminates the need for the record of jobids in the jobtap plugin ("hash table"). The jobtap plugin could add the key in a DEPEND callback and remove it in an INACTIVE callback. Currently the KVS is only being used to ensure uniqueness of the notification being sent.

Thoughts on this design?
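
To make the "directory already exists" handling concrete, here is a sketch of idempotent initialization in the Python driver. The kvs calls (exists, mkdir, commit, get_dir) reflect my reading of the flux.kvs bindings and should be double-checked; the notify key layout is the one proposed in the spec, and resuming from leftover jobid keys is only one possible policy.

import flux
import flux.kvs as kvs

def init_notify_dir(h):
    """Create the 'notify' KVS directory if needed; return any jobid
    keys already present (e.g. left over from a previous run)."""
    if not kvs.exists(h, "notify"):
        kvs.mkdir(h, "notify")
        kvs.commit(h)
        return []
    # Directory already exists: traverse it for jobid keys so the driver
    # can resume watching those jobs instead of starting from scratch.
    return list(kvs.get_dir(h, "notify"))

if __name__ == "__main__":
    pending = init_notify_dir(flux.Flux())
    print(f"resuming {len(pending)} notification-enabled jobs")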

index.rst Outdated
:doc:`spec_44`
~~~~~~~~~~~~~~

The Flux Library for Adaptable Notifications (FLAN) provides a connection to
Member

This is an excellent name. Although I do not like the texture of flan, +1 from me!

spec_44.rst Outdated

- By default in a system-instance, do not notify a user of any job events.
Allow the user to override this default with a jobspec attribute,
system.notify.
Member

Why does it go under system?

Member Author

There's a longer explanation in RFC 14/Canonical Job Specification about system vs. user jobspec attributes. The TLDR as I understand it is that attributes Flux needs go under system (such as changing your bank in flux-accounting, etc.) while things that the user wants to propagate to their application go under user.

A benefit of having jobspec attributes under system is that's the default if nothing is specified in setattr, i.e. --setattr=notify=default expands to --setattr=system.notify=default. Another is that under system we can make notify a reserved word which isn't possible under user.

I had a longer conversation with Mark about this that got eaten by the Slack monster :( where he had a more detailed explanation for why the notification service attributes would go under system.

Contributor

That's a pretty good summary @wihobbs.

spec_44.rst Outdated
- Support notification after any event of the job, where events are defined in
:doc:`spec_21`.
- Support email as the primary end user notification delivery.
- Build a driver capable of sending POST requests to chat services for
Member

Where do you see tokens or credentials for these posts going? I assume it's on the level of the user? So the user needs to set up their own Slack app to make it work? But what about an organization or group - how would that be handled, permissions-wise, for the credentials?

Member Author

Actually, I had envisioned credentials/tokens being stored in a config file on the management node. The only thing a user would need to pass to the jobspec would be their handle, i.e. hobbs17 on the LLNL slack. The slack admins would set up a bot to handle delivering the message to the user, and the sysadmins would store the webhooks for this bot in a toml file on the management node.

An edge case I haven't considered yet is "what if we need to support multiple 'teams' on Slack," i.e. LLNL and llnl-performance.

I should probably go read the Slack/Mattermost API docs and make sure this approach will actually work. I've built very little with MM and nothing with Slack...
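
On the Flux side, the chat-service driver mostly reduces to reading a webhook URL from the admin-maintained TOML file and POSTing a JSON payload to it. The sketch below uses only the standard library; the config section/key names are placeholders, and the payload shape is the minimal "text" field that Slack and Mattermost incoming webhooks both happen to accept, not a full treatment of either API.

import json
import tomllib  # Python 3.11+; use tomli on older interpreters
import urllib.request

def post_chat_notification(config_path, service, message):
    """POST `message` to the webhook configured for `service`.

    Expects a TOML file like (names illustrative):
        [slack]
        webhook = "https://hooks.example.com/..."
    """
    with open(config_path, "rb") as f:
        config = tomllib.load(f)
    url = config[service]["webhook"]

    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status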

Contributor

In any event I would probably place the configuration location of third-party tokens, etc. outside the scope of this document, which should focus on how the service monitors jobs for events of interest to turn them into generic notifications (i.e., the protocol between the Python service and the jobtap plugin, how restarts are handled, etc.).


.. note::
This design is intended to ensure that no double-notifications are sent upon
the restart of the Python script, the jobtap plugin, or the job-manager.
Member

Unless the directory deletion is missed?

If a disconnect request is received by the jobtap plugin, this indicates the
python driver has exited. The jobtap plugin SHALL continue to add notification-
enabled jobs to its hash table as they enter the DEPEND state. When the python
driver reconnects, the jobtap plugin SHALL respond to its initial ``notify.enable``
Member

What if it doesn't reconnect?

Member Author

"reconnect" is probably the wrong word. The Python driver if disconnected from the jobtap plugin will exit immediately and log an error. Eventually, systemd or an admin will start a new Python driver which sends a new initial notify.enable RPC request. No notifications can be sent while there's not an active Python driver process.

Member

My question then is - what if the admin doesn't start a new Python driver? The hash table keeps filling up and explodes?

Member Author

This is a really good point. I hadn't thought of this. Maybe the jobtap plugin should remove jobids when they reach an inactive state... the notifications would be missed, though. Hmm. 🤔

There might be a way to log a warning reminding the user/admins to start the Python driver, but I'll have to look into that more.

Contributor

Let's add this to the list of things to investigate. It shouldn't be an issue to keep a hash/list of all jobids with notifications, but we should not assume there is not a better way at this point in the design.

Member

Given this, my question above is becoming a bit more urgent. Is there a reason to keep these things separate? If so, is there a reason the jobtap plugin shouldn't manage the lifetime of the driver as a subprocess or systemd unit?

spec_44.rst Outdated
-------------------------

The webhooks and other secrets required to connect to chat services SHALL be included
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN
Member

I would either place this in a directory that has a meaningful path (for the developer to see) or give it a name that more specifically describes what it is. And I am wondering who has ownership of this file - a single user, or does a center need to take responsibility, for example, to provide a single credential for all users to use (is that a good idea)?

Member

Likely both use cases could be wanted (the "I am a user with my own custom plugin and private credential" and the "I am a center providing this to my users").

Member Author

Oh, maybe for the "center providing for users" case this needs an /etc/flux/notify.d/ or similar. Probably an implementation detail outside this RFC, but I had not thought about standardizing the config file path. Good idea!

In the event the job-manager crashes or is shut down the python driver SHALL exit
immediately and log an error.

Flux does not currently support restarting with running jobs. However, on a system
Member

Trying to restore state sounds messy - is there any reason to not recreate notifications on restart so there aren't any mismatches between state brought up and state known to the plugin?

Member Author

Notifications would be recreated internally in the Python driver, since losing its connection to Flux is a fatal error (and that would happen in a system-instance restart).

Restoring state is messy...but I want to try for it anyway :). Here's an idea for how:

  1. Python driver crashes because job-manager is shut down.
  2. Job-manager is restarted (and so is the KVS). KVS comes back; jobtap plugin is loaded on start. All job eventlogs are replayed.
  3. All of the jobtap plugin depend callbacks are hit again. Jobtap plugin records these jobids.
  4. A "new" Python driver starts and gets a bunch of jobids sent to it. It watches the events [that may be somewhat stale.] (This was @chu11's point in a message above.) It does not add jobid folders to the KVS because they are already there.
  5. Before it sends a notification, it checks the KVS notify.<jobid> folder to see if __ (event) key is there. If the key is not there, that means a notification has already been sent. (On reflection, maybe it should be flipped -- i.e. if the key is there, that means a notification has already been sent.)
  6. Stop! Do not pass go, do not collect $200! Because we know the job is being watched, but a notification has already been sent, keep chugging and watching for others that may not have been sent.
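
A minimal sketch of the step-5 uniqueness check, assuming the flipped convention (key present means the notification was already sent), a notify.<jobid>.<event> key layout, and the flux.kvs exists/put/commit functions; none of this is settled design, just an illustration:

import flux
import flux.kvs as kvs

def should_notify(h, jobid, event):
    """Return True if no notification for (jobid, event) was sent yet,
    and record the send so a restarted driver will not repeat it."""
    key = f"notify.{jobid}.{event}"
    if kvs.exists(h, key):
        # Key already present: this notification went out before the restart.
        return False
    kvs.put(h, key, 1)   # record the send before actually sending
    kvs.commit(h)
    return True

Recording the key before sending favors "no duplicates" over "no misses" if the driver dies between the commit and the actual send; the opposite ordering trades the other way, which is exactly the uniqueness question flagged later in this thread.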

Contributor

It is a bit unclear here what is meant by the "depend callbacks are hit again".

Note that there are two jobtap callbacks related to dependencies and the DEPEND state:

  • job.state.depend: One of the generic job.state.* callbacks, called when the job reaches the DEPEND state.
  • job.dependency.SCHEME: Called for each dependency SCHEME in jobspec of a submitted job

I don't think the job.state.* callbacks are called on reload. You can test this by writing a jobtap plugin that subscribes to all these states and prints what it sees on a restart.

The job.dependency.SCHEME callbacks are specifically issued on a restart for jobs with dependencies defined in jobspec.

I know we talked about this before, but I don't think I properly understood you were talking about one of these callback types and not the other.

Probably the best way for plugins to be notified of new jobs is to use the job.new callback, which is called for new jobs after they've been validated, on plugin reload, and on job manager restart.

Member Author

For lack of a better place to document this (tagging @garlick as well):

It appears the job.state.priority callback is re-triggered when flux start is run in recovery mode with pending jobs, but no other job.state.* callbacks.

#include <flux/core.h>
#include <flux/jobtap.h>
#include <stdio.h>
#include <jansson.h>
#include <stdlib.h>

static int print_cb (flux_plugin_t *p,
                     const char *topic,
                     flux_plugin_arg_t *args,
                     void *arg)
{
    flux_jobid_t id;
    int state;
    flux_t *h = flux_jobtap_get_flux (p);
    json_t *event;

    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:I s:i s:o}",
                                "id", &id,
                                "state", &state,
                                "entry", &event) < 0)
        return -1;

    if (id) {
        /* Log the jobid, the numeric state, and the raw eventlog entry */
        char *json_string = json_dumps (event, JSON_ENCODE_ANY);
        flux_log (h, 0, "jobid %lu has entered state %d %s",
                  (unsigned long) id, state, json_string);
        free (json_string);
    }

    return 0;
}

static const struct flux_plugin_handler tab[] = {
    {"job.state.*", print_cb, NULL},
    {0},
};

int flux_plugin_init (flux_plugin_t *p) {
    if (flux_plugin_register(p, "jobtap-print", tab) < 0) {
        return -1;
    }
    return 0;
}

And on shutdown/restart:

  auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml
(s=1,d=0)  auk108 ~/scripts/jobtap-test $ flux submit --urgency=0 hostname
May 14 15:19:03.251059 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009410 {"timestamp": 1715725143.2404392, "name": "validate"}
May 14 15:19:03.251083 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009412 {"timestamp": 1715725143.2510674, "name": "depend"}
May 14 15:19:03.251104 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009416 {"timestamp": 1715725143.251091, "name": "priority", "context": {"priority": 0}}
ƒ9h7knoh
(s=1,d=0)  auk108 ~/scripts/jobtap-test $ flux shutdown --dump=$(pwd)/dump.tar
  auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml --recovery=$(pwd)/dump.tar
May 14 15:22:46.751219 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517508 {"timestamp": 1715725366.7511699, "name": "flux-restart"}
May 14 15:22:46.751270 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517512 {"timestamp": 1715725366.7512314, "name": "priority", "context": {"priority": 0}}
+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
|  reduced functionality until manually corrected.
|
| broker.rc1_path    /home/hobbs17/flux-core/../bin/etc/flux/rc1
| broker.rc3_path    /home/hobbs17/flux-core/../bin/etc/flux/rc3
| config.path        /home/hobbs17/scripts/jobtap-test/config.toml
| statedir           changes will not be preserved
|
| Exit this shell when finished.
+-----------------------------------------------------

I accept the suggestion to tie the streaming RPC response to job.new, which is replayed on restart (tweaked the test above and verified).

We will have to make the ability to restart the service dependent on a requirement that this plugin be included in the job-manager's plugins table in the config.

Contributor

job.new is also called for every active job when a plugin is loaded (on the assumption that this is the method that "introduces" jobs to jobtap plugins). This should allow the plugin to be reloaded (e.g. to get a new version) and get back to the same place it was before it was unloaded.

Contributor

Nice job on the test plugin, btw. You could tweak it to register callbacks for * if you want to see everything.

@wihobbs (Member Author) commented May 14, 2024

Note that ^ push is mostly grammar/clarity fixes and I'm still working through the [excellent, but involved :) ] design feedback.

@vsoch (Member) commented May 15, 2024

@wihobbs you've probably already exceeded most of us in knowledge of jobtap plugins, but in case you are interested we are having a little (hopefully fun) hackathon this week on Thursday - grandmaster @trws is going to show us the ropes for making plugins, and ❗ I think there might be rust involved! If you are interested, even for the fun or sharing your expertise, I can send along the invite.

@wihobbs force-pushed the flanrfc branch 2 times, most recently from b6366cc to 2883259 on May 15, 2024 at 17:59
@wihobbs (Member Author) commented May 15, 2024

Tons of great feedback on this. Thanks to everybody for participating and sharing your expertise.

I tried to close or consolidate duplicate issues. I apologize in advance if something was inadvertently closed -- please reopen if so.

I'm making a high-level summary of the outstanding design todos:

  • Recovery of state on a crash. Note that there are three things that can crash, and we should possibly have a different strategy for handling each:
    • Crash of the job-manager (+Python driver+jobtap plugin) -- essentially restoring the state of everything.
    • Crash of the Python driver, but jobtap plugin and job-manager persist.
    • Crash of the jobtap plugin, but not the job-manager (+?Python driver)
  • Ensuring uniqueness of events before sending notifications (related: no double-or-worse notifications, and no missed notifications)
  • Rate limiting: why do we need it and how will we implement it?
  • Extensibility of the service via plugins (for external services such as Slack). Implementation of individual services is not a requirement.
  • Ergonomics of the "advanced use case," where we allow a user to specify services other than email, non-standard things to include, and states other than START and FINISH in which to trigger notifications.

`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html>`_
that streams the jobids of notification-enabled jobs to the Python driver.

The Python driver
Member

Is this documenting a thing that already exists? If not, is there a reason these two should be separate? If it's a language issue, a python-based jobtap plugin shouldn't be an issue.

Contributor

It's largely because the jobtap plugin resides in the job manager and we want it to do as little as possible since it would be in the critical path.

The jobtap plugin (at the time) seemed like the most efficient way to avoid having to fetch and examine the jobspec of every job since it can just peek at the jobspec that the job manager already has in memory and send the jobid along to the Python process only when the user has requested notification.

However, I'm glad you asked because I wonder if there's a better way to do this using the job-manager journal or some other more widely applicable method. It sure would be nice if it was easy for people to develop services like this and not have to install yet another jobtap plugin to identify jobs of interest.

Something to consider...

Member

Ah, that's a good point. Blocking the job manager would be unfortunate.

The journal is a good thought, or maybe even a pub-sub event from the plugin so anyone could subscribe to it? Not sure, there are definitely tradeoffs here, but it sounds like the current approach has some potential to cause lost or repeated notifications as well, especially races between event delivery, KVS removal, and jobtap replay if flux is restarted.

Problem: no design currently exists for the Flux
email service as noted in flux-framework/flux-core#4435.

Add an RFC-style document detailing this.