Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG.debug causes a failure of 16-timeout test in test-battery #2929

Closed
MartinRyan opened this issue Jan 17, 2019 · 33 comments
Closed

LOG.debug causes a failure of 16-timeout test in test-battery #2929

MartinRyan opened this issue Jan 17, 2019 · 33 comments
Assignees
Labels
Milestone

Comments

@MartinRyan
Copy link
Contributor

MartinRyan commented Jan 17, 2019

I would like to add debugging information as to the command and arguments as well as the caller information to the subprocess popen wrapper.

If I add LOG.debug("test debug message") to the sprocess.py pcylc function the tests/job-submission/16-timeout.t test fails.

Without the debug log the egrep comparison in 16-timeout check correctly finds:-

ERROR - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T010015Z/job-submission/16-timeout/log/job 1/foo/01
	[jobs-submit ret_code] -9
	[jobs-submit err] killed on timeout (PT10S)

When I add a debug statement the equivalent lines are:-

DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T011404Z/job-submission/16-timeout/log/job 1/foo/01
	[jobs-submit ret_code] 0
	[jobs-submit out] [TASK JOB SUMMARY]2019-01-17T12:14:08+11:00|1/foo/01|0|None

The return code is 0 and not -9

Test output with no debug statement in the wrapper look like this.

cylc test-battery -v tests/job-submission/16-timeout.t

ok 1 - 16-timeout-validate
ok 2 - 16-timeout-suite-run
ok 3 - log-cmp-ok
ok 4 - suite-state.log-contains-ok
ok
All tests successful.
Files=1, Tests=4, 21 wallclock secs ( 0.02 usr  0.00 sys +  2.05 cusr  0.77 csys =  2.84 CPU)
Result: PASS

Test output with one or more LOG.debug statements in the wrapper look like this.

cylc test-battery -v tests/job-submission/16-timeout.t

ok 1 - 16-timeout-validate
ok 2 - 16-timeout-suite-run
--- -	2019-01-17 15:41:51.354548153 +1100
+++ log	2019-01-17 15:41:51.063998311 +1100
@@ -1,3 +0,0 @@
-ERROR - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T044115Z/job-submission/16-timeout/log/job 1/foo/01
-	[jobs-submit ret_code] -9
-	[jobs-submit err] killed on timeout (PT10S)
not ok 3 - log-cmp-ok
ok 4 - suite-state.log-contains-ok

    stdout and stderr stored in: /tmp/mryan/cylctb-20190117T044115Z/job-submission/16-timeout
Failed 1/4 subtests 

Test Summary Report
-------------------
tests/job-submission/16-timeout.t (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
Files=1, Tests=4, 35 wallclock secs ( 0.03 usr  0.00 sys +  1.77 cusr  0.69 csys =  2.49 CPU)
Result: FAIL
@kinow
Copy link
Member

kinow commented Jan 17, 2019

Funny. No solution/explanation, but can confirm it happens to me too (using your branch with bandit changes). Tried replacing the cmp_ok by two contains_ok and increasing the number of tests by one, but it would still miss the output about the job being killed.

Hacked a bit the test to output to a /tmp/loglog file, then saved the good and the bad (after including the simple LOG.debug('test debug message').

image

Only difference I noted was the two entries for test debug message (I believe the first one is when popen is used to execute the command to get Cylc's version). And also the missing output after the housekeeping watcher kills the job due to its timeout.

Can't explain why though. Maybe @matthewrmshin, @hjoliver or someone with more experience with the tests and with the way these commands are executed & killed. Sorry.

@hjoliver
Copy link
Member

hjoliver commented Jan 17, 2019

I've also reproduced this. Strange - the test suite seems to function correctly, and it correctly identifies that foo.1 failed job submission, but the job submit command output only contains the new debug message and reports success status (0).

@hjoliver
Copy link
Member

tests/job-submission/16-timeout.t really tests process pool kill-on-timeout, not "job submission timeout" as the name might suggest. It changes the at job submission template to sleep 30 (i.e. it doesn't actually submit a job) and sets (via global config) a low process pool timeout so the job submission process gets killed after only 10 seconds. Then cylc is supposed to log that it killed a managed subprocess (in this case a job submission process), and react accordingly (in this case, set the associated task to the submit-failed state).

@hjoliver
Copy link
Member

hjoliver commented Jan 17, 2019

With no LOG.debug() in the Popen wrapper, the test works as just described. Immediately after the "job submission", a bunch of main loop "Performing suite health check" lines appear in the log before the process is killed after 10 seconds.

With a LOG.debug() in the wrapper, the "job submission" process does not get killed - it runs for the full 30 seconds, which presumably explains the 0 success status. Moreover, there are no "Performing suite health check" lines logged by the main loop during the 30 seconds - it seems like the LOG call causes the main Cylc process to block and wait on the subprocess??!!

So that's interesting, and hopefully it provides a clue as to what's happening.

Bed time here now though. Maybe @matthewrmshin or @oliver-sanders can follow-up...

(BTW this same test is also failing on master in my environment, but for a less nefarious reason: the job submit command ERROR is logged twice for some reason).

@matthewrmshin
Copy link
Contributor

I am trying to repeat this one on master at the moment.

@matthewrmshin
Copy link
Contributor

I now suspect that we are still getting a blocking pipe from the logic of #2876.

@MartinRyan
Copy link
Contributor Author

fyi I did have to patch the 16-timeout test so that egrep worked correctly

@hjoliver
Copy link
Member

hjoliver commented Jan 23, 2019

[UPDATE, (HJO) - the theory espoused in this comment, and followed up in subsequent comments, is completely wrong!]

Here's the reason why adding logging statements to @MartinRyan 's subprocess wrapper can screw up tests/job-submission/16-timeout.t (this is distinct from the blocking pipe problem, I think).

Jobs are submitted via the cylc jobs-submit command in a managed subprocesses (by which I mean, with our own subprocess machinery). That command in turn executes the actual jobs in managed subprocesses, using the same cylc library code.

We therefore have two different processes (the main process, and the cylc jobs-submit process) making the same log calls and writing asynchronously to the suite log , which can result in inter-mixed log lines (I've seen it). And this can mess with cmp_ok and contains_ok tests on the suite log.

@kinow
Copy link
Member

kinow commented Jan 23, 2019

Ooohh, that sounds more like what could be happening. While testing yesterday I did a os.getpid to check the process, and noticed different PID's... but after that I couldn't really understand why this could be a problem. I thought the logging module worked fine with threads and processes. Could it be because we have a custom file handler?

@hjoliver
Copy link
Member

I'm not if there are any other log calls from code used in the scheduler and in cylc jobs-submit. If there are, this could be another reason for test flakiness.

Not sure what to do about it.

  • ensure there are NO log calls in that common code?
  • ensure that only the main process writes to the log?
  • (or other processes write to a different log?)

@hjoliver
Copy link
Member

https://docs.python.org/2/library/multiprocessing.html#logging

the logging package does not use process shared locks so it is possible (depending on the handler type) for messages from different processes to get mixed up.

@kinow
Copy link
Member

kinow commented Jan 23, 2019

Sounds like fun! Though I suspect there should be something in logging, or another module that abstracts that away, and we don't have to implement any kind of special handling for multiprocessing in Cylc?

@kinow
Copy link
Member

kinow commented Jan 23, 2019

Yup, surprised it didn't happen before. multiprocess has been around in Cylc for a while... so maybe it happened now as we are adding more logging to the code?

@hjoliver
Copy link
Member

hjoliver commented Jan 23, 2019

We probably got away with it because our subprocess module is quite minimal (ish) and the commands we execute in subprocesses (main job submit, poll, and kill) do not use much library code beyond that.

If that's the case (not much code involved) the easiest solution might be just to have the CLI tools like cylc jobs-submit pass an argument to tell subprocess code not to write to the main log (and we could centralise that distinction in our logging module). ??

@matthewrmshin
Copy link
Contributor

matthewrmshin commented Jan 23, 2019

If I remember correctly, logging in cylc jobs-submit is now handled using the StreamHandler writing to STDERR, which is unbuffered. (We did not use logging before in these commands.) This means that when we poll the stderr pipe of the command's subprocess.Popen instance, we get a positive result, triggering the bug 🐛.

@hjoliver
Copy link
Member

hjoliver commented Jan 23, 2019

... triggering the bug.

You're mean the blocking bug here, right, not the newly-identified asynchronous logging bug?

@matthewrmshin
Copy link
Contributor

The job submission command should not write to the main suite log. Has this branch changed that?

@hjoliver
Copy link
Member

If I remember correctly, logging in cylc jobs-submit is now handled using the StreamHandler writing to STDERR, which is unbuffered...

Oh yeah, I forgot about that (I guess @matthewrmshin actually had this sussed all along @kinow!)

So perhaps @MartinRyan's branch change that?

@kinow
Copy link
Member

kinow commented Jan 23, 2019

Oh yeah, I forgot about that (I guess @matthewrmshin actually had this sussed all along @kinow!)

Oh, interesting!

Has this branch changed that?

I think so... from what I remember while debugging with the IDE, sprocess is used in this branch as the wrapper for the Popen. Adding the LOG.debug statement there, effectively meant logging for every command that needed to use Popen. So I think we would have to make sure to avoid that, or prevent using the cylc.LOG module with sprocess.py? Or even re-think sprocess?

@hjoliver
Copy link
Member

Here's my proof of inter-mixed log lines:

2019-01-23T22:12:28+13:00 DEBUG - Performing suite health check
2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cat /home/oliverh/cylc-run/knob/log/job/1/foo/01/job | cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob
/log/job 1/foo/01
        [jobs-submit ret_code] -9
        [jobs-submit err]
        2019-01-23T22:12:19+13:00 DEBUG - XXXXX:sleep 30

        killed on timeout (PT10S)
2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob/log/job 1/foo/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2019-01-23T22:12:29+13:00|1/foo/01|1
2019-01-23T22:12:29+13:00 INFO - [foo.1] -(current:ready) submission failed at 2019-01-23T22:12:29+13:00

Notice DEBUG - XXXXX:sleep 30 appears in the middle of the command err log line.

That comes from LOG.debug("XXXXX:" + str(cmd)) just before Popen() in Martin's sprocess:pcylc() wrapper function.

@hjoliver
Copy link
Member

(And sleep 30 is what 16-timeout.t tests replaces the at command with. )

@hjoliver
Copy link
Member

hjoliver commented Jan 23, 2019

@matthewrmshin - is it just that we deliberately write to stderr in cylc jobs-submit (and the library code that supports it) instead of loggin, NOT that we altered logging config to redirect all log calls to stderr, in that context? If so, Martin didn't change anything, he just added a logging call in that code. It would be nice to do the "redirect" thing, so that the library code can just call the logger without caring if it lives in the main process or a subprocess...

@hjoliver
Copy link
Member

(Doesn't look like anything important changed on #2872

@kinow
Copy link
Member

kinow commented Jan 23, 2019

(Doesn't look like anything important changed on #2872

I think the wrapper around the Popen contributed to this issue. If it replaces all occurrences of Popen in the code, it could be significant change I think.

I agree that it would be nice to do the redirect, or something else to prevent other devs (like myself) from accidentally using a logger where it's not supposed to be used.

@hjoliver
Copy link
Member

I think the wrapper around the Popen contributed to this issue. ...

Yes, sorry, I meant "other than the wrapper", which (unlike the previously separate Popen calls) is shared between main and subprocesses.

@hjoliver
Copy link
Member

I suppose the quick fix is just to add a "no_log" arg to Martin's wrapper, and use that in all calls from lib/cylc/batch_sys_manager.py - which is the library code that underlies the sub-process commands.

And comment in batch_sys_manager exactly why it is that we do not make log calls there.

Code shared by main and subprocess commands:

  • the new wrapper
  • lib/cylc/subprocpool.py (better check the few log calls in there are OK?)
  • any other? (probably not much, if any).

@hjoliver
Copy link
Member

(Gotta go address CISE article feedback from @sadielbartholomew now...)

@matthewrmshin
Copy link
Contributor

matthewrmshin commented Jan 23, 2019

I still don't understand the issue here. From how I understand the (original) system, the only thing that writes to the suite log should be the suite server program. The sequence of event should be like this:

  1. Suite server program writes to suite log.
  2. Suite server program invokes `cylc jobs-submit``.
  3. cylc jobs-submit being an external command that can run on localhost or remote host does not write to suite log. Instead it communicates with suite server program only with STDOUT and STDERR. STDOUT is chosen for the normal output for job submission statuses, etc. STDERR is chosen for any extra diagnostics message. Therefore, its normal logger is only attached to a StreamHandler writing to STDERR.
  4. cylc jobs-submit writes stuff to STDOUT and STDERR.
  5. Suite server program captures these output either while cylc jobs-submit is running or when it completes.
  6. STDOUT and STDERR (including log messages) from cylc jobs-submit is written to the suite log by the suite server program when the command completes.
  7. Note: the logic handling task submission failure may write something extra to the log. Is that what is also adding to the confusion?

@hjoliver
Copy link
Member

hjoliver commented Jan 23, 2019

I agree (now that you've reminded me!) that that's what's supposed to happen. Except I'm hazy on the Streamhandler bit. Is that a general "redirect" so that any call to log.debug("blah") in the other process will actually go to stderr, or not?

  1. Note: the logic handling task submission failure may write something extra to the log. Is that what is also adding to the confusion?

I don't think my transcript above conforms to your narrative!

#2929 (comment)

... it certainly looks like my log.debug("XXXXX" + cmd) is printing to the log in the middle of a main process log string.

@matthewrmshin
Copy link
Contributor

matthewrmshin commented Jan 23, 2019

Yes. Logic here:

https://github.com/cylc/cylc/blob/1c55f5b4b066775b6f261556605f867a78488607/lib/cylc/option_parsers.py#L288-L295

Logging for all CLI commands goes to STDERR by default. The suite server program overrides this here when starting a daemon:

https://github.com/cylc/cylc/blob/1c55f5b4b066775b6f261556605f867a78488607/lib/cylc/scheduler.py#L343-L351

OK, let's analyse your log.

The following was logged as the command completed (return code -9):

2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cat /home/oliverh/cylc-run/knob/log/job/1/foo/01/job | cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob
/log/job 1/foo/01
        [jobs-submit ret_code] -9
        [jobs-submit err]
        2019-01-23T22:12:19+13:00 DEBUG - XXXXX:sleep 30

        killed on timeout (PT10S)

The following was logged by the logic that handled the task submission failure. Note the made-up return code 1 to represent the submission failure of the individual task. (The cylc jobs-submit command may have good/bad submits for many tasks.)

2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob/log/job 1/foo/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2019-01-23T22:12:29+13:00|1/foo/01|1
2019-01-23T22:12:29+13:00 INFO - [foo.1] -(current:ready) submission failed at 2019-01-23T22:12:29+13:00

@hjoliver
Copy link
Member

Arggh, that is pretty convincing - thanks for the excellent explanation.

I didn't find the log redirect logic earlier (before you mentioned StreamHandler, anyway) because option_parsers.py did not occur to me as a place to look!

In that case, I suppose we just have to make sure that any suite log comparison tests are careful to ignore any "extra" err output - which unfortunately can run over multiple lines as above.

@matthewrmshin
Copy link
Contributor

Should we have multiple loggings instead of a single multi-line logging for external commands and its outputs, etc?

@hjoliver
Copy link
Member

That may be a good idea, to avoid messing with the tests anyway. If we don't need to worry about intentionally multi-line messages...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants