
job-info: Support reading / waiting on guest event logs #2307

Merged
merged 13 commits into flux-framework:master from chu11:issue2105 on Aug 22, 2019

Conversation

@chu11 (Contributor) commented Aug 14, 2019

As discussed in meetings and #2105, this PR will allow users to read from and watch guest event logs.

There is a new guest_watch.c file in the job-info module that handles the new job-info.guest-eventlog-watch callback. This callback handles a number of tricky scenarios as discussed in #2105. Effectively, job-info.guest-eventlog-watch is a wrapper around job-info.lookup and job-info.eventlog-watch RPCs.

Interesting notes on development:

  • Supporting an alternate eventlog path in job-info.lookup is straightforward: it already takes a path argument, so passing in something like guest.exec.eventlog just works.

  • The original job-info.eventlog-watch now takes an optional path to an eventlog and a flag indicating whether the eventlog should be read from the guest namespace, i.e. whether we're reading job.0000.0000.0000.0000.<path> in the main KVS namespace or <path> in the guest namespace.

  • flux_job_event_watch() chooses which callback to send to (i.e. main eventlog watch or guest eventlog watch) based on the path passed in to it: if the path is prefixed with guest., the request goes to the guest callback (see the sketch after this list). I considered several other options (flags, etc.), but this seemed the cleanest; it's probably the one part of this PR where the best choice is most debatable.

  • In guest_watch.c, you'll see a guest_msg_pack() function. Since we're sending requests to job-info.lookup and job-info.eventlog-watch that will do access checks, we need to create messages that carry the same rolemask and userid as the original message passed in. That's what this function does.

  • Added a bunch of tests, but more can probably be added. Just wanted to throw this up for now.
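As a rough illustration of the dispatch point above, here is a minimal sketch (not the PR's actual code) of how the guest.-prefix check in flux_job_event_watch() could look. The topic strings are the ones named in this description; the RPC payload keys ("id", "path", "flags") are assumptions for illustration only.

    #include <string.h>
    #include <jansson.h>
    #include <flux/core.h>

    /* Sketch only: route to the guest callback when the eventlog path is
     * prefixed with "guest.", otherwise use the original watch callback. */
    static flux_future_t *event_watch_sketch (flux_t *h, flux_jobid_t id,
                                              const char *path, int flags)
    {
        const char *topic = "job-info.eventlog-watch";

        if (path && !strncmp (path, "guest.", 6))
            topic = "job-info.guest-eventlog-watch";

        /* The payload keys below are illustrative, not the module's protocol. */
        return flux_rpc_pack (h, topic, FLUX_NODEID_ANY, FLUX_RPC_STREAMING,
                              "{s:I s:s s:i}",
                              "id", (json_int_t)id,
                              "path", path,
                              "flags", flags);
    }

And a sketch of the idea behind guest_msg_pack(): build a request that carries the original request's userid and rolemask, so the downstream job-info.lookup / job-info.eventlog-watch access checks behave as if the original caller sent it. The helper name and payload handling here are hypothetical; only the intent (copying userid and rolemask) comes from the description above.

    /* Hypothetical helper illustrating the credential-copying idea behind
     * guest_msg_pack(). */
    static flux_msg_t *clone_credentials (const flux_msg_t *orig,
                                          const char *topic,
                                          const char *payload)
    {
        flux_msg_t *msg;
        uint32_t userid, rolemask;

        if (flux_msg_get_userid (orig, &userid) < 0
            || flux_msg_get_rolemask (orig, &rolemask) < 0)
            return NULL;
        if (!(msg = flux_request_encode (topic, payload)))
            return NULL;
        if (flux_msg_set_userid (msg, userid) < 0
            || flux_msg_set_rolemask (msg, rolemask) < 0) {
            flux_msg_destroy (msg);
            return NULL;
        }
        return msg;
    }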

@chu11 (Contributor Author) commented Aug 14, 2019

Hmmm, hit a bunch of

flux-start: 1 (pid 15183) Killed
flux-start: 0 (pid 15182) Killed

1..57

ERROR: t2204-job-info.t - exited with status 137 (terminated by signal 9?)

and similar errors on t2205-job-info-security.t. Wondering if I clean up the new guest-watch code really slowly or something. I saw the issue once or twice on LC clusters, but just assumed it was some NFS slowness or something like that.

@grondo (Contributor) commented Aug 14, 2019

I hit something similar in PR #2309, but with the job-attach test. Maybe something going on in Travis?

@chu11 (Contributor Author) commented Aug 14, 2019

I can hit this on LC systems sometimes; I always chalked it up to slow NFS, but I seem to hit this one more regularly. Guessing I'm not cleaning up well.

@chu11 chu11 force-pushed the chu11:issue2105 branch from 278182e to f2cc459 Aug 14, 2019
@chu11 (Contributor Author) commented Aug 14, 2019

I think I figured it out. I changed tests from running hostname to running sleep inf, because I was hitting some raciness with jobs completing before I could cancel them. But some other tests didn't cancel the job.

So I think the job was still sitting around running, and thus the modules were not exiting properly.

Edit: Also cleaned up a few tiny things in tests.

@chu11 chu11 force-pushed the chu11:issue2105 branch from f2cc459 to 00de572 Aug 14, 2019
@chu11 (Contributor Author) commented Aug 14, 2019

Re-pushed; fixed a bug where I wasn't cleaning up guest watchers correctly. Hoping that's why these tests are failing on Travis.

@chu11 chu11 force-pushed the chu11:issue2105 branch from 00de572 to c8c94db Aug 14, 2019
@chu11 (Contributor Author) commented Aug 14, 2019

Well I'm only getting

          t2204-job-info.t:  PASS: N=57  PASS=57  FAIL=0 SKIP=0 XPASS=0 XFAIL=0
ERROR: t2204-job-info.t - exited with status 137 (terminated by signal 9?)

now, so that's at least half the battle.

@chu11 chu11 force-pushed the chu11:issue2105 branch from c8c94db to 59ceef3 Aug 15, 2019
@chu11 (Contributor Author) commented Aug 15, 2019

Re-pushed; I found one more place where I wasn't canceling a job, and it was still running after all tests had completed.

@chu11 (Contributor Author) commented Aug 15, 2019

Only one of the builders failed, but with the same

          t2204-job-info.t:  PASS: N=57  PASS=57  FAIL=0 SKIP=0 XPASS=0 XFAIL=0
ERROR: t2204-job-info.t - exited with status 137 (terminated by signal 9?)

I guess there is some subtle race that I can't figure out. Time for debugging in Travis, ugh.

@chu11 (Contributor Author) commented Aug 15, 2019

Hmmm, saw:

expecting success: 
        jobid=$(submit_job_live)
        flux job wait-event -p "guest.exec.eventlog" $jobid done > wait_event_path3.out &
        waitpid=$! &&
        wait_watchers_nonzero "watchers" &&
        wait_watchers_nonzero "guest_watchers" &&
        guestns=$(flux job id -f dec -t hex $jobid) &&
        wait_guest_watcherscount_nonzero $guestns &&
        flux job cancel $jobid &&
        wait $waitpid &&
        grep done wait_event_path3.out

flux-job: flux_job_event_watch_get: Function not implemented

interesting broker log:

2019-08-15T16:12:42.839059Z connector-local.debug[0]: unsubscribe shell-141029277696.
2019-08-15T16:12:42.865271Z sched-simple.debug[0]: free: rank0/core0
2019-08-15T16:12:47.438617Z job-ingest.debug[0]: validate-schema.py: inactivity timeout
2019-08-15T16:12:47.441351Z job-ingest.debug[0]: validate-schema.py: ready
2019-08-15T16:12:47.441374Z job-ingest.debug[0]: validate-schema.py: exiting
2019-08-15T16:12:47.450443Z job-ingest.debug[0]: validate-schema.py: exited normally
2019-08-15T16:17:34.172164Z broker.err[0]: runlevel 2 timeout, sending SIGTERM
2019-08-15T16:17:34.172955Z broker.err[0]: Run level 2 Terminated (rc=143) 300.0s
2019-08-15T16:17:34.173154Z broker.info[0]: Run level 3 starting
2019-08-15T16:17:34.178499Z broker.info[0]: shutdown in 1.000s: run level 2 Terminated

So the "Function not implemented" is likely just that the wait-event hung for a long time until the broker shut down. But why the runlevel 2 timeout and SIGTERM? Is it possible validate-schema.py could hang?

@chu11 chu11 force-pushed the chu11:issue2105 branch 6 times, most recently from ed6deb4 to 3fc8bee Aug 15, 2019
@chu11 (Contributor Author) commented Aug 16, 2019

Well, this entire afternoon of running in Travis, I haven't seen any more

ERROR: t2204-job-info.t - exited with status 137 (terminated by signal 9?)
ERROR: t2205-job-info-security.t - exited with status 137 (terminated by signal 9?)

occurrences, which is the good news. But I have seen

ERROR: t2500-job-attach.t - exited with status 137 (terminated by signal 9?)

and

ERROR: t2005-hwloc-basic.t - exited with status 137 (terminated by signal 9?)

t2500 calls job eventlog once, and t2005 doesn't even load the job modules.

So while I suspect the earlier appearances of this issue were due to leftover watchers, jobs still running, etc., I suspect the once or twice it was seen later was likely due to general Travis slowness. Ultimately, the addition of more module code may slightly increase the probability of a timeout compared to earlier builds.

One issue that showed up once or twice was a mysterious hang. The interesting broker log output is:

2019-08-15T16:12:42.839059Z connector-local.debug[0]: unsubscribe shell-141029277696.
2019-08-15T16:12:42.865271Z sched-simple.debug[0]: free: rank0/core0
2019-08-15T16:12:47.438617Z job-ingest.debug[0]: validate-schema.py: inactivity timeout
2019-08-15T16:12:47.441351Z job-ingest.debug[0]: validate-schema.py: ready
2019-08-15T16:12:47.441374Z job-ingest.debug[0]: validate-schema.py: exiting
2019-08-15T16:12:47.450443Z job-ingest.debug[0]: validate-schema.py: exited normally
2019-08-15T16:17:34.172164Z broker.err[0]: runlevel 2 timeout, sending SIGTERM
2019-08-15T16:17:34.172955Z broker.err[0]: Run level 2 Terminated (rc=143) 300.0s
2019-08-15T16:17:34.173154Z broker.info[0]: Run level 3 starting
2019-08-15T16:17:34.178499Z broker.info[0]: shutdown in 1.000s: run level 2 Terminated

I suspect the hang is due to the worker thread running validate-schema.py timing out, so a job-ingest never occurs, and some tests thus hang/never complete. Looking through the job-ingest code, it appears a worker thread will get an inactivity timeout if it doesn't output anything for 5 seconds.

Although unlikely, I can see this being a general timeout due to Travis slowness. The fact that the log time for "validate-schema.py: ready" is AFTER "validate-schema.py: inactivity timeout" suggests that it just took a really, really long time for this worker subprocess to get going.

So I think we can finally get to reviewing this PR :-)

@grondo (Contributor) commented Aug 16, 2019

Although unlikely, I can see this being a general timeout due to Travis slowness. The fact that the log time for "validate-schema.py: ready" is AFTER "validate-schema.py: inactivity timeout" suggests that it just took a really, really long time for this worker subprocess to get going.

Yeah, I was thinking at least one worker should be kept resident at all times, or at least allow that to be optional behavior. There seems to be a large startup overhead for the Python interpreter, so the first job submission after inactivity takes significantly longer. Maybe I'll open an issue.

@chu11 chu11 force-pushed the chu11:issue2105 branch 2 times, most recently from 2f9da17 to a8c7c3c Aug 16, 2019
@chu11 (Contributor Author) commented Aug 16, 2019

Rebased and re-pushed; fixed up things based on changes in #2315.

@chu11 chu11 force-pushed the chu11:issue2105 branch from a8c7c3c to b2af778 Aug 17, 2019
@chu11 (Contributor Author) commented Aug 17, 2019

Re-pushed; saw I had left some leftover debugging in kvs-helper.sh.

@garlick (Member) commented Aug 17, 2019

When you're ready for review LMK.

@chu11 (Contributor Author) commented Aug 17, 2019

@garlick it's ready for review

@grondo (Contributor) commented Aug 17, 2019

Sorry, I had merged @garlick's PR, so this branch needs another rebase.

Also, there may be some races in t2204-job-info.t. I tried running that test with FLUX_TEST_VALGRIND set and got one failure, and finally a hang in the test after

ok 37 - flux job wait-event -p works

Let me know if it isn't reproducible and I'll look into it further on my system.

@chu11 (Contributor Author) commented Aug 18, 2019

@grondo I was able to reproduce on catalyst; taking a look.

chu11 added 3 commits Aug 11, 2019:

  • Support --path option to wait-event, allowing user to specify an alternate eventlog path suffix.

  • Add tests for job wait-event to read alternate path eventlogs.

  • Support ability to inform eventlog watch that the eventlog should be read from the guest namespace.
@chu11 chu11 force-pushed the chu11:issue2105 branch from b007f71 to 0d8e613 Aug 21, 2019
@chu11 (Contributor Author) commented Aug 21, 2019

Re-pushed, adding a large block of comments at the top of guest_watch.c. Also made the path argument in the watch & guest_watch callbacks a requirement.

@garlick (Member) commented Aug 22, 2019

Excellent explanatory text! Thank you!

One question: why watch in the primary namespace after the guest ns has been destroyed, when we know it's read-only at that point?

@chu11 (Contributor Author) commented Aug 22, 2019

One question: why watch in the primary namespace after the guest ns has been destroyed, when we know it's read-only at that point?

The reason is that we watch an eventlog forever until a user cancels. So, for example, with flux job wait-event, if the user inputs an invalid event, the flux job wait-event command will hang forever because that event never happens.

So to maintain this "infinite watch" with a guest namespace, I switch from watching the guest namespace to watching the primary namespace.

You may recall my issue flux-framework/rfc#193 about having a defined "end" event in every eventlog. It was partly meant to deal with this case.
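For reference, a minimal client-side sketch of that "infinite watch" model, assuming the post-PR flux_job_event_watch() signature that accepts an eventlog path; the crude string match stands in for whatever condition the caller (e.g. wait-event) is actually looking for.

    #include <string.h>
    #include <flux/core.h>

    /* Sketch: watch a guest eventlog until the wanted event appears, then
     * cancel.  Without a cancel (or a defined "end" event, per
     * flux-framework/rfc#193), the watch would continue indefinitely. */
    static int watch_until (flux_t *h, flux_jobid_t id, const char *wanted)
    {
        flux_future_t *f;
        const char *entry;
        int rc = -1;

        if (!(f = flux_job_event_watch (h, id, "guest.exec.eventlog", 0)))
            return -1;
        while (flux_job_event_watch_get (f, &entry) == 0) {
            if (strstr (entry, wanted)) {   /* crude match for illustration */
                rc = 0;
                break;
            }
            flux_future_reset (f);          /* re-arm for the next event */
        }
        flux_job_event_watch_cancel (f);    /* tell job-info we are done */
        flux_future_destroy (f);
        return rc;
    }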

@garlick (Member) commented Aug 22, 2019

@chu11 (Contributor Author) commented Aug 22, 2019

If the watch request will never get any responses, why set it up?

In practice this is true, but strictly speaking someone could do a flux kvs eventlog append on the eventlog.

Why not just let the request remain active until it is canceled? Does it simplify the code?

Hmmm. Ya know, I didn't think about this. I get an ENOTSUP indicating that the namespace disappeared, so I felt like I had to watch something for real. But you're right. We could just not do anything and respond ENODATA after we get a cancel request.

The only case to consider is the super corner case I list above. Is it something we should just not consider?
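A minimal sketch of that "do nothing, then respond ENODATA on cancel" option, assuming the original watch request message was saved when the watch was registered; the helper name and surrounding state are hypothetical.

    #include <errno.h>
    #include <flux/core.h>

    /* Hypothetical cancel handler: no KVS watch was ever set up; we simply
     * terminate the streaming RPC by responding ENODATA when the client
     * sends a cancel. */
    static void end_watch_on_cancel (flux_t *h, const flux_msg_t *watch_request)
    {
        if (flux_respond_error (h, watch_request, ENODATA, NULL) < 0)
            flux_log_error (h, "end_watch_on_cancel: flux_respond_error");
    }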

@garlick (Member) commented Aug 22, 2019

@chu11 (Contributor Author) commented Aug 22, 2019

I don't think that case need be considered as it's not allowed.

Ok, let me adjust.

@chu11 chu11 force-pushed the chu11:issue2105 branch from 0d8e613 to 076e34f Aug 22, 2019
@chu11 (Contributor Author) commented Aug 22, 2019

Re-pushed with the discussed tweak: only fall through from the guest namespace to the main namespace if the namespace was removed before any events were read from the eventlog.

Note that technically, it's hard to differentiate between:

  1. The racy situation where we know the guest namespace was created, but it was removed before we could start watching it.

  2. The guest namespace was removed before any events were put into the eventlog (i.e. no events ever populated it).

I assumed the latter case is rare, and that falling through to the main namespace in that case is no big deal.
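An illustrative outline of that rule (placeholder names, not the actual guest_watch.c code):

    #include <stdbool.h>

    enum next_step { WATCH_MAIN_NAMESPACE, WAIT_FOR_CANCEL };

    /* Hypothetical sketch of the fallthrough decision described above. */
    static enum next_step on_guest_ns_removed (bool events_already_read)
    {
        if (!events_already_read) {
            /* Case 1 or case 2 above; we cannot tell them apart, so fall
             * through to the eventlog relocated under the main KVS
             * namespace. */
            return WATCH_MAIN_NAMESPACE;
        }
        /* Events were already read from the guest namespace; do nothing
         * more and answer the eventual cancel with ENODATA. */
        return WAIT_FOR_CANCEL;
    }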

chu11 added 4 commits Aug 11, 2019:

  • Support new guest watch callbacks, which allow users to read eventlogs from guest job namespaces. Handles all the tricky cases and corner cases that normally involve guest namespaces. The new callback takes care of:

    - Checking the main job eventlog to see if the guest namespace has not been created, has been created, or has been moved into the main KVS namespace
    - If it has not been created, wait for its creation
    - If it has been created, watch the guest namespace
      - Move watching to the main KVS namespace if the guest namespace is removed before we can read from it
    - If it has been moved to the main KVS namespace, watch from there

  • Check the path provided to flux_job_event_watch() to see if it belongs to a guest namespace. If so, forward to the new guest namespace job-info callback.

  • Add tests for job wait-event to read guest namespace eventlogs. Most notably, add tests on jobs that are still running and have not completed, to ensure that the guest eventlog is being read from the guest namespace.

  • In t2204-job-info and t2205-job-info-security, wait for the "start" event before canceling the job. This avoids potential races in which the job is canceled before starting.
@chu11 chu11 force-pushed the chu11:issue2105 branch from 076e34f to b191da3 Aug 22, 2019
@chu11 (Contributor Author) commented Aug 22, 2019

re-pushed, updated 1 comment :-)

@chu11 (Contributor Author) commented Aug 22, 2019

hmmm, haven't seen this before.

2019-08-22T05:18:48.035660Z job-info.err[0]: watch_continuation: flux_kvs_lookup_get: Is a directory
2019-08-22T05:18:48.035924Z job-info.err[0]: main_namespace_watch_continuation: flux_rpc_get: Is a directory

This occurs in a test where the guest namespace doesn't exist when a watch is started. I'm completely perplexed how there could be an EISDIR error.

@chu11 (Contributor Author) commented Aug 22, 2019

Ok, the "Is a directory" errors are coming from a different test, and I don't think they're relevant here.

In the broker log I saw:

2019-08-22T05:18:54.898995Z job-ingest.debug[0]: validate-schema.py: inactivity timeout
2019-08-22T05:18:54.899456Z job-ingest.debug[0]: validate-schema.py: exiting
2019-08-22T05:18:54.908686Z job-ingest.debug[0]: validate-schema.py: exited normally
2019-08-22T05:23:37.065215Z broker.err[0]: runlevel 2 timeout, sending SIGTERM
2019-08-22T05:23:37.065841Z broker.err[0]: Run level 2 Terminated (rc=143) 300.0s
2019-08-22T05:23:37.066102Z broker.info[0]: Run level 3 starting
2019-08-22T05:23:37.073415Z broker.info[0]: shutdown in 1.000s: run level 2 Terminated
2019-08-22T05:23:37.066072Z content-sqlite.debug[0]: broker shutdown in progress

It seems validate-schema.py never got going, and then eventually the rc2 timeout hit. I'm assuming super slow Travis: the validator never got going, the job was not submitted, and the test hung. So restarting the builder.

@garlick (Member) commented Aug 22, 2019

Looks like the large block comment still talks about switching to watching the main namespace. I assume that's now out of date?

@chu11 (Contributor Author) commented Aug 22, 2019

Looks like the large block comment still talks about switching to watching the main namespace. I assume that's now out of date?

There is still the racy scenario of switching to the main namespace when the guest namespace is removed before you begin watching it. So I left those comments in there.

@garlick (Member) commented Aug 22, 2019

OK, thanks. I'll go ahead and merge. Nice work!

@garlick garlick merged commit cb1bbde into flux-framework:master Aug 22, 2019
2 checks passed (continuous-integration/travis-ci/pr: The Travis CI build passed)
@garlick garlick referenced this pull request Sep 30, 2019