jobs remain in CLEANUP state after fatal scheduler-restart exception #5579

Open
garlick opened this issue Nov 20, 2023 · 9 comments


garlick commented Nov 20, 2023

Problem: the corona management node was shut down abruptly. Afterwards, flux jobs shows several jobs in C state:

For example:

f2cx7pwPP9CK pdebug   hobbs17  flux        C      2      2   2.621h corona[171,173]
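
(Side note: one way to list all such jobs across users, assuming the usual flux-jobs filter options are available, is something like:

$ flux jobs -u all --filter=cleanup

which should show every job still stuck in CLEANUP.)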

The job eventlog shows:

$ flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
$ flux job list -u hobbs17 | jq
{
  "id": 697830805111572480,
  "userid": 60943,
  "urgency": 16,
  "priority": 24028,
  "t_submit": 1700501129.0455658,
  "t_depend": 1700501129.0637124,
  "t_run": 1700501129.1128068,
  "t_cleanup": 1700510565.8155658,
  "state": 32,
  "name": "flux",
  "cwd": "/g/g0/hobbs17",
  "queue": "pdebug",
  "ntasks": 2,
  "ncores": 96,
  "duration": 1800.0,
  "nnodes": 2,
  "ranks": "[4,6]",
  "nodelist": "corona[171,173]",
  "expiration": 1700502929.0,
  "annotations": {
    "user": {
      "uri": "ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
    }
  },
  "waitstatus": 0
}

garlick commented Nov 20, 2023

Oh, there is an epilog-start event but no epilog-finish.


garlick commented Nov 20, 2023

Confirmed, all the jobs in this state have an epilog-start but no epilog-finish.

Also note that the time between the epilog-start and the exception is about 2.6 hours. Maybe the epilog was stuck; the system was supposedly having NFS issues.
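
(A rough, untested sketch for finding jobs in this state, assuming the flux-jobs filter and format options used here are available in the installed version, is to look for an eventlog with an epilog-start but no epilog-finish:

$ for id in $(flux jobs -u all --filter=cleanup -n -o '{id.f58}'); do
>     flux job eventlog $id | grep -q epilog-start \
>         && ! flux job eventlog $id | grep -q epilog-finish \
>         && echo $id
> done

The grep checks are only a crude heuristic, not a definitive test.)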


garlick commented Nov 20, 2023

Side note: I'm not sure how to get rid of this job. This doesn't work:

$ flux job purge -f f2cnGN8MBAu5
flux-job: purge: cannot purge active job

It would be handy if we could force it somehow.


grondo commented Nov 21, 2023

I think this is due to known issue #4108. There is no epilog-finish event because the perilog plugin lost track of the epilog process after the restart.

A solution for now would be to load a temporary plugin that emits the epilog-finish event for the stuck jobs. Sorry we don't have anything better to offer at this time.


grondo commented Nov 21, 2023

Here's a jobtap plugin that might work to post the missing epilog-finish events for jobs in this state. It should be loaded like:

$ flux jobtap load /path/to/plugin.so jobs="[ID1, ID2, ...]"

Where ID1, ID2, etc. are the integer representations of the jobids to target (one way to obtain these is sketched below).
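
(A hedged example, assuming flux job id supports decimal output via --to=dec:

$ flux job id --to=dec f2cx7pwPP9CK
697830805111572480

The value should match the "id" field shown in the flux job list output above.)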

The plugin should then be manually removed with flux jobtap remove plugin.so once the events have been posted.

Of course, I could not test this since I don't have any stuck jobs handy...

#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

/* Post an epilog-finish event (status=0) for each jobid in 'jobs'. */
static int post_events (flux_t *h,
                        flux_plugin_t *p,
                        json_t *jobs,
                        const char *description)
{
    size_t index;
    json_t *entry;

    json_array_foreach (jobs, index, entry) {
        flux_jobid_t id;
        if (!json_is_integer (entry)) {
            flux_log_error (h,
                            "invalid jobid at index %zu (expected integer)",
                            index);
            return -1;
        }
        id = json_integer_value (entry);
        if (flux_jobtap_epilog_finish (p, id, description, 0) < 0)
            flux_log_error (h,
                            "failed to post epilog-finish event for %ju",
                            (uintmax_t) id);
    }
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    json_t *jobs = NULL;
    const char *description = "job-manager.epilog";
    flux_t *h = flux_jobtap_get_flux (p);

    /* Config: optional "description" override plus a required "jobs" array,
     * e.g. flux jobtap load plugin.so jobs="[ID1, ID2, ...]"
     */
    if (flux_plugin_conf_unpack (p,
                                 "{s?s s:o}",
                                 "description", &description,
                                 "jobs", &jobs) < 0) {
        flux_log_error (h, "no jobids provided");
        return -1;
    }
    if (!json_is_array (jobs)) {
        flux_log_error (h, "jobs conf value must be an array");
        return -1;
    }
    return post_events (h, p, jobs, description);
}
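
For reference, a hedged end-to-end sequence might look like the following; the file name epilog-finish.c is a placeholder, and the build line assumes pkg-config metadata for flux-core and jansson is installed:

$ cc -shared -fPIC -o epilog-finish.so epilog-finish.c \
      $(pkg-config --cflags --libs flux-core jansson)
$ flux jobtap load ./epilog-finish.so jobs="[697830805111572480]"
$ flux job eventlog f2cx7pwPP9CK    # check that epilog-finish and clean were posted
$ flux jobtap remove epilog-finish.so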


garlick commented Nov 21, 2023

Nice! I'll run that on corona in a bit when I'm back online.


grondo commented Nov 27, 2023

FYI, I just ran the above jobtap plugin against all CLEANUP jobs on Corona and they're all inactive now, e.g.:

# flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
1701101824.652934 epilog-finish description="job-manager.epilog" status=0
1701101824.653028 clean


garlick commented Apr 2, 2024

Closed by #5848?


grondo commented Apr 2, 2024

Probably not; that was just the addition of a utility to fix up this state.

We can consider this resolved when we can recapture any pending epilog processes/procedures across a restart and either ensure they've completed or restart them as necessary.
