jobs remain in CLEANUP state after fatal scheduler-restart exception #5579
Oh, there is an epilog-start event but no epilog-finish.
Confirmed, all the jobs in this state have an epilog-start but no epilog-finish. Also note that the time between the epilog start and the exception is about 2h. Maybe the epilog was stuck. The system was supposedly having NFS issues.
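For anyone wanting to check a saved eventlog for this condition, here is a minimal sketch in C. It assumes the eventlog is available as text with one compact JSON object per line (so an event name appears literally as `"name":"epilog-start"`), and it uses plain substring matching rather than a JSON parser. The function names `count_occurrences` and `pending_epilogs` are hypothetical helpers, not part of flux-core.

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in text. */
static size_t count_occurrences (const char *text, const char *needle)
{
    size_t count = 0;
    size_t len = strlen (needle);
    const char *p = text;

    if (len == 0)
        return 0;
    while ((p = strstr (p, needle)) != NULL) {
        count++;
        p += len;
    }
    return count;
}

/* Return the number of epilog-start events minus the number of
 * epilog-finish events in the eventlog text.  A positive result
 * means at least one epilog was started but never finished.
 * ASSUMPTION: compact JSON encoding, so the name key and value are
 * adjacent with no whitespace.  A robust tool should parse each line.
 */
long pending_epilogs (const char *eventlog)
{
    size_t starts = count_occurrences (eventlog,
                                       "\"name\":\"epilog-start\"");
    size_t finishes = count_occurrences (eventlog,
                                         "\"name\":\"epilog-finish\"");
    return (long) starts - (long) finishes;
}
```

A job whose eventlog gives a positive result here would match what we are seeing: an epilog was started and nothing ever reported it finished.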
Side note: I'm not sure how to get rid of this job. This doesn't work:
It would be handy if we could force it somehow.
I think this is due to known issue #4108. There is no A solution for now would be to load a temporary plugin that emits the epilog-finish event for the stuck jobs. Sorry we don't have anything better to offer at this time.
Here's a jobtap plugin that might work to post the missing epilog-finish events. Load it with:
$ flux jobtap load /path/to/plugin.so jobs="[ID1, ID2, ...]"
Where ID1, ID2, ... are the IDs of the stuck jobs, given as integers. The plugin should then be manually removed. Of course, I could not test this since I don't have any stuck jobs handy...
#include <errno.h>
#include <stdint.h>
#include <jansson.h>
#include <flux/jobtap.h>

/* Post an epilog-finish event for every jobid in the jobs array. */
static int post_events (flux_t *h,
                        flux_plugin_t *p,
                        json_t *jobs,
                        const char *description)
{
    size_t index;
    json_t *entry;

    json_array_foreach (jobs, index, entry) {
        flux_jobid_t id;

        /* Each entry must be an integer jobid */
        if (!json_is_integer (entry)) {
            errno = EINVAL;
            flux_log_error (h, "jobs[%zu] is not an integer jobid", index);
            return -1;
        }
        id = json_integer_value (entry);
        /* Log a failure but keep going so other jobs are still handled */
        if (flux_jobtap_epilog_finish (p, id, description, 0) < 0)
            flux_log_error (h,
                            "failed to post epilog-finish event for %ju",
                            (uintmax_t) id);
    }
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    json_t *jobs;
    /* Default epilog description, overridable via description=NAME */
    const char *description = "job-manager.epilog";
    flux_t *h = flux_jobtap_get_flux (p);

    if (flux_plugin_conf_unpack (p,
                                 "{s?s s:o}",
                                 "description", &description,
                                 "jobs", &jobs) < 0) {
        flux_log_error (h, "no jobids provided");
        return -1;
    }
    if (!json_is_array (jobs)) {
        flux_log_error (h, "jobs conf value must be array");
        return -1;
    }
    return post_events (h, p, jobs, description);
}
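If there are many stuck jobs, typing the `jobs=` argument by hand gets tedious. Below is a small sketch of a helper that formats an array of integer job IDs into the `jobs=[ID1, ID2, ...]` form the plugin expects. `format_jobs_arg` is a hypothetical name, not anything from flux-core, and the sketch assumes you already have the IDs in integer form.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format ids[0..count-1] into buf as "jobs=[ID1, ID2, ...]".
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 * Hypothetical helper for illustration only.
 */
int format_jobs_arg (char *buf,
                     size_t size,
                     const uint64_t *ids,
                     size_t count)
{
    size_t used = 0;
    int n;

    n = snprintf (buf, size, "jobs=[");
    if (n < 0 || (size_t) n >= size)
        return -1;
    used = (size_t) n;

    for (size_t i = 0; i < count; i++) {
        /* Separate entries with ", " after the first one */
        n = snprintf (buf + used,
                      size - used,
                      "%s%" PRIu64,
                      i > 0 ? ", " : "",
                      ids[i]);
        if (n < 0 || (size_t) n >= size - used)
            return -1;
        used += (size_t) n;
    }

    n = snprintf (buf + used, size - used, "]");
    if (n < 0 || (size_t) n >= size - used)
        return -1;
    return 0;
}
```

The resulting string can be passed as a single argument after the plugin path. The shell quotes in the command above are only needed when typing it interactively, to keep the spaces inside one argument.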
Nice! I'll run that on corona in a bit when I'm back online.
FYI, I just ran the above jobtap plugin against all CLEANUP jobs on Corona and they're all inactive now, e.g.:
Closed by #5848?
Probably not, that was just the addition of a utility to fix this state. We can consider this resolved when we can recapture or restart any pending epilog processes/procedures and ensure they've completed, or restart them as necessary.
Problem: the corona management node was shut down abruptly. Afterwards, flux jobs shows several jobs in C state. For example:
Job eventlog says