jobs remain in CLEANUP state after fatal scheduler-restart exception #5579

Open
garlick opened this issue Nov 20, 2023 · 9 comments


garlick commented Nov 20, 2023

Problem: the corona management node was shut down abruptly. Afterwards, flux jobs shows several jobs in C state:

For example:

f2cx7pwPP9CK pdebug   hobbs17  flux        C      2      2   2.621h corona[171,173]
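
(Side note: one way to list all such jobs across users, assuming the usual flux-jobs filter options are available, is something like:

$ flux jobs -u all --filter=cleanup

which should show every job still stuck in CLEANUP.)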

The job eventlog shows:

$ flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
$ flux job list -u hobbs17 | jq
{
  "id": 697830805111572480,
  "userid": 60943,
  "urgency": 16,
  "priority": 24028,
  "t_submit": 1700501129.0455658,
  "t_depend": 1700501129.0637124,
  "t_run": 1700501129.1128068,
  "t_cleanup": 1700510565.8155658,
  "state": 32,
  "name": "flux",
  "cwd": "/g/g0/hobbs17",
  "queue": "pdebug",
  "ntasks": 2,
  "ncores": 96,
  "duration": 1800.0,
  "nnodes": 2,
  "ranks": "[4,6]",
  "nodelist": "corona[171,173]",
  "expiration": 1700502929.0,
  "annotations": {
    "user": {
      "uri": "ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
    }
  },
  "waitstatus": 0
}

garlick commented Nov 20, 2023

Oh, there is an epilog-start event but no epilog-finish.


garlick commented Nov 20, 2023

Confirmed, all the jobs in this state have an epilog-start but no epilog-finish.

Also note that the time between the epilog-start and the exception is about 2.6 hours. Maybe the epilog was stuck; the system was supposedly having NFS issues.
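
(A rough, untested sketch for finding jobs in this state, assuming the flux-jobs filter and format options used here are available in the installed version, is to look for an eventlog with an epilog-start but no epilog-finish:

$ for id in $(flux jobs -u all --filter=cleanup -n -o '{id.f58}'); do
>     flux job eventlog $id | grep -q epilog-start \
>         && ! flux job eventlog $id | grep -q epilog-finish \
>         && echo $id
> done

The grep checks are only a crude heuristic, not a definitive test.)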


garlick commented Nov 20, 2023

Side note: I'm not sure how to get rid of this job. This doesn't work:

$ flux job purge -f f2cnGN8MBAu5
flux-job: purge: cannot purge active job

It would be handy if we could force it somehow.


grondo commented Nov 21, 2023

I think this is due to known issue #4108. There is no epilog-finish event because the perilog plugin lost track of the epilog process after the restart.

A solution for now would be to load a temporary plugin that emits the epilog-finish event for the stuck jobs. Sorry we don't have anything better to offer at this time.


grondo commented Nov 21, 2023

Here's a jobtap plugin that might work to post the missing epilog-finish events for jobs in this state. It should be loaded like:

$ flux jobtap load /path/to/plugin.so jobs="[ID1, ID2, ...]"

Where ID1, ID2, etc. are the integer representations of the jobids to target (one way to obtain these is sketched below).
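
(A hedged example, assuming flux job id supports decimal output via --to=dec:

$ flux job id --to=dec f2cx7pwPP9CK
697830805111572480

The value should match the "id" field shown in the flux job list output above.)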

The plugin should then be manually removed with flux jobtap remove plugin.so once the events have been posted.

Of course, I could not test this since I don't have any stuck jobs handy...

#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

/* Post an epilog-finish event (status=0) for each jobid in 'jobs'. */
static int post_events (flux_t *h,
                        flux_plugin_t *p,
                        json_t *jobs,
                        const char *description)
{
    size_t index;
    json_t *entry;

    json_array_foreach (jobs, index, entry) {
        flux_jobid_t id;
        if (!json_is_integer (entry)) {
            flux_log_error (h,
                            "invalid jobid at index %zu (expected integer)",
                            index);
            return -1;
        }
        id = json_integer_value (entry);
        if (flux_jobtap_epilog_finish (p, id, description, 0) < 0)
            flux_log_error (h,
                            "failed to post epilog-finish event for %ju",
                            (uintmax_t) id);
    }
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    json_t *jobs = NULL;
    const char *description = "job-manager.epilog";
    flux_t *h = flux_jobtap_get_flux (p);

    /* Config: optional "description" override plus a required "jobs" array,
     * e.g. flux jobtap load plugin.so jobs="[ID1, ID2, ...]"
     */
    if (flux_plugin_conf_unpack (p,
                                 "{s?s s:o}",
                                 "description", &description,
                                 "jobs", &jobs) < 0) {
        flux_log_error (h, "no jobids provided");
        return -1;
    }
    if (!json_is_array (jobs)) {
        flux_log_error (h, "jobs conf value must be an array");
        return -1;
    }
    return post_events (h, p, jobs, description);
}
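
For reference, a hedged end-to-end sequence might look like the following; the file name epilog-finish.c is a placeholder, and the build line assumes pkg-config metadata for flux-core and jansson is installed:

$ cc -shared -fPIC -o epilog-finish.so epilog-finish.c \
      $(pkg-config --cflags --libs flux-core jansson)
$ flux jobtap load ./epilog-finish.so jobs="[697830805111572480]"
$ flux job eventlog f2cx7pwPP9CK    # check that epilog-finish and clean were posted
$ flux jobtap remove epilog-finish.so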


garlick commented Nov 21, 2023

Nice! I'll run that on corona in a bit when I'm back online.


grondo commented Nov 27, 2023

FYI, I just ran the above jobtap plugin against all CLEANUP jobs on Corona and they're all inactive now, e.g.:

# flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
1701101824.652934 epilog-finish description="job-manager.epilog" status=0
1701101824.653028 clean


garlick commented Apr 2, 2024

Closed by #5848?


grondo commented Apr 2, 2024

Probably not; that was just the addition of a utility to fix up this state.

We can consider this resolved when we can recapture any pending epilog processes/procedures across a restart and either ensure they've completed or restart them as necessary.
