LOG.debug causes a failure of 16-timeout test in test-battery #2929

MartinRyan · 2019-01-17T05:08:57Z

I would like to add debugging information as to the command and arguments as well as the caller information to the subprocess popen wrapper.

If I add LOG.debug("test debug message") to the sprocess.py pcylc function the tests/job-submission/16-timeout.t test fails.

Without the debug log the egrep comparison in 16-timeout check correctly finds:-

ERROR - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T010015Z/job-submission/16-timeout/log/job 1/foo/01
	[jobs-submit ret_code] -9
	[jobs-submit err] killed on timeout (PT10S)

When I add a debug statement the equivalent lines are:-

DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T011404Z/job-submission/16-timeout/log/job 1/foo/01
	[jobs-submit ret_code] 0
	[jobs-submit out] [TASK JOB SUMMARY]2019-01-17T12:14:08+11:00|1/foo/01|0|None

The return code is 0 and not -9

Test output with no debug statement in the wrapper look like this.

cylc test-battery -v tests/job-submission/16-timeout.t

ok 1 - 16-timeout-validate
ok 2 - 16-timeout-suite-run
ok 3 - log-cmp-ok
ok 4 - suite-state.log-contains-ok
ok
All tests successful.
Files=1, Tests=4, 21 wallclock secs ( 0.02 usr  0.00 sys +  2.05 cusr  0.77 csys =  2.84 CPU)
Result: PASS

Test output with one or more LOG.debug statements in the wrapper look like this.

cylc test-battery -v tests/job-submission/16-timeout.t

ok 1 - 16-timeout-validate
ok 2 - 16-timeout-suite-run
--- -	2019-01-17 15:41:51.354548153 +1100
+++ log	2019-01-17 15:41:51.063998311 +1100
@@ -1,3 +0,0 @@
-ERROR - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T044115Z/job-submission/16-timeout/log/job 1/foo/01
-	[jobs-submit ret_code] -9
-	[jobs-submit err] killed on timeout (PT10S)
not ok 3 - log-cmp-ok
ok 4 - suite-state.log-contains-ok

    stdout and stderr stored in: /tmp/mryan/cylctb-20190117T044115Z/job-submission/16-timeout
Failed 1/4 subtests 

Test Summary Report
-------------------
tests/job-submission/16-timeout.t (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
Files=1, Tests=4, 35 wallclock secs ( 0.03 usr  0.00 sys +  1.77 cusr  0.69 csys =  2.49 CPU)
Result: FAIL

The text was updated successfully, but these errors were encountered:

kinow · 2019-01-17T06:24:10Z

Funny. No solution/explanation, but can confirm it happens to me too (using your branch with bandit changes). Tried replacing the cmp_ok by two contains_ok and increasing the number of tests by one, but it would still miss the output about the job being killed.

Hacked a bit the test to output to a /tmp/loglog file, then saved the good and the bad (after including the simple LOG.debug('test debug message').

Only difference I noted was the two entries for test debug message (I believe the first one is when popen is used to execute the command to get Cylc's version). And also the missing output after the housekeeping watcher kills the job due to its timeout.

Can't explain why though. Maybe @matthewrmshin, @hjoliver or someone with more experience with the tests and with the way these commands are executed & killed. Sorry.

hjoliver · 2019-01-17T08:32:42Z

I've also reproduced this. Strange - ~~the test suite seems to function correctly~~, and it correctly identifies that foo.1 failed job submission, but the job submit command output only contains the new debug message and reports success status (0).

hjoliver · 2019-01-17T11:52:06Z

tests/job-submission/16-timeout.t really tests process pool kill-on-timeout, not "job submission timeout" as the name might suggest. It changes the at job submission template to sleep 30 (i.e. it doesn't actually submit a job) and sets (via global config) a low process pool timeout so the job submission process gets killed after only 10 seconds. Then cylc is supposed to log that it killed a managed subprocess (in this case a job submission process), and react accordingly (in this case, set the associated task to the submit-failed state).

hjoliver · 2019-01-17T12:05:00Z

With no LOG.debug() in the Popen wrapper, the test works as just described. Immediately after the "job submission", a bunch of main loop "Performing suite health check" lines appear in the log before the process is killed after 10 seconds.

With a LOG.debug() in the wrapper, the "job submission" process does not get killed - it runs for the full 30 seconds, which presumably explains the 0 success status. Moreover, there are no "Performing suite health check" lines logged by the main loop during the 30 seconds - it seems like the LOG call causes the main Cylc process to block and wait on the subprocess??!!

So that's interesting, and hopefully it provides a clue as to what's happening.

Bed time here now though. Maybe @matthewrmshin or @oliver-sanders can follow-up...

(BTW this same test is also failing on master in my environment, but for a less nefarious reason: the job submit command ERROR is logged twice for some reason).

matthewrmshin · 2019-01-17T12:25:21Z

I am trying to repeat this one on master at the moment.

matthewrmshin · 2019-01-17T15:09:06Z

I now suspect that we are still getting a blocking pipe from the logic of #2876.

MartinRyan · 2019-01-18T00:07:34Z

fyi I did have to patch the 16-timeout test so that egrep worked correctly

hjoliver · 2019-01-23T09:22:11Z

[UPDATE, (HJO) - the theory espoused in this comment, and followed up in subsequent comments, is completely wrong!]

Here's the reason why adding logging statements to @MartinRyan 's subprocess wrapper can screw up tests/job-submission/16-timeout.t (this is distinct from the blocking pipe problem, I think).

Jobs are submitted via the cylc jobs-submit command in a managed subprocesses (by which I mean, with our own subprocess machinery). That command in turn executes the actual jobs in managed subprocesses, using the same cylc library code.

We therefore have two different processes (the main process, and the cylc jobs-submit process) making the same log calls and writing asynchronously to the suite log , which can result in inter-mixed log lines (I've seen it). And this can mess with cmp_ok and contains_ok tests on the suite log.

kinow · 2019-01-23T09:25:21Z

Ooohh, that sounds more like what could be happening. While testing yesterday I did a os.getpid to check the process, and noticed different PID's... but after that I couldn't really understand why this could be a problem. I thought the logging module worked fine with threads and processes. Could it be because we have a custom file handler?

hjoliver · 2019-01-23T09:25:44Z

I'm not if there are any other log calls from code used in the scheduler and in cylc jobs-submit. If there are, this could be another reason for test flakiness.

Not sure what to do about it.

ensure there are NO log calls in that common code?
ensure that only the main process writes to the log?
(or other processes write to a different log?)

hjoliver · 2019-01-23T09:30:59Z

https://docs.python.org/2/library/multiprocessing.html#logging

the logging package does not use process shared locks so it is possible (depending on the handler type) for messages from different processes to get mixed up.

kinow · 2019-01-23T09:31:08Z

Sounds like fun! Though I suspect there should be something in logging, or another module that abstracts that away, and we don't have to implement any kind of special handling for multiprocessing in Cylc?

kinow · 2019-01-23T09:31:43Z

Yup, surprised it didn't happen before. multiprocess has been around in Cylc for a while... so maybe it happened now as we are adding more logging to the code?

hjoliver · 2019-01-23T09:38:47Z

We probably got away with it because our subprocess module is quite minimal (ish) and the commands we execute in subprocesses (main job submit, poll, and kill) do not use much library code beyond that.

If that's the case (not much code involved) the easiest solution might be just to have the CLI tools like cylc jobs-submit pass an argument to tell subprocess code not to write to the main log (and we could centralise that distinction in our logging module). ??

matthewrmshin · 2019-01-23T09:39:24Z

If I remember correctly, logging in cylc jobs-submit is now handled using the StreamHandler writing to STDERR, which is unbuffered. (We did not use logging before in these commands.) This means that when we poll the stderr pipe of the command's subprocess.Popen instance, we get a positive result, triggering the bug 🐛.

hjoliver · 2019-01-23T09:41:20Z

... triggering the bug.

You're mean the blocking bug here, right, not the newly-identified asynchronous logging bug?

matthewrmshin · 2019-01-23T09:42:12Z

The job submission command should not write to the main suite log. Has this branch changed that?

hjoliver · 2019-01-23T09:44:03Z

If I remember correctly, logging in cylc jobs-submit is now handled using the StreamHandler writing to STDERR, which is unbuffered...

Oh yeah, I forgot about that (I guess @matthewrmshin actually had this sussed all along @kinow!)

So perhaps @MartinRyan's branch change that?

kinow · 2019-01-23T09:46:33Z

Oh yeah, I forgot about that (I guess @matthewrmshin actually had this sussed all along @kinow!)

Oh, interesting!

Has this branch changed that?

I think so... from what I remember while debugging with the IDE, sprocess is used in this branch as the wrapper for the Popen. Adding the LOG.debug statement there, effectively meant logging for every command that needed to use Popen. So I think we would have to make sure to avoid that, or prevent using the cylc.LOG module with sprocess.py? Or even re-think sprocess?

hjoliver · 2019-01-23T09:49:15Z

Here's my proof of inter-mixed log lines:

2019-01-23T22:12:28+13:00 DEBUG - Performing suite health check
2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cat /home/oliverh/cylc-run/knob/log/job/1/foo/01/job | cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob
/log/job 1/foo/01
        [jobs-submit ret_code] -9
        [jobs-submit err]
        2019-01-23T22:12:19+13:00 DEBUG - XXXXX:sleep 30

        killed on timeout (PT10S)
2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob/log/job 1/foo/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2019-01-23T22:12:29+13:00|1/foo/01|1
2019-01-23T22:12:29+13:00 INFO - [foo.1] -(current:ready) submission failed at 2019-01-23T22:12:29+13:00

Notice DEBUG - XXXXX:sleep 30 appears in the middle of the command err log line.

That comes from LOG.debug("XXXXX:" + str(cmd)) just before Popen() in Martin's sprocess:pcylc() wrapper function.

hjoliver · 2019-01-23T09:50:23Z

(And sleep 30 is what 16-timeout.t tests replaces the at command with. )

hjoliver · 2019-01-23T10:02:10Z

@matthewrmshin - is it just that we deliberately write to stderr in cylc jobs-submit (and the library code that supports it) instead of loggin, NOT that we altered logging config to redirect all log calls to stderr, in that context? If so, Martin didn't change anything, he just added a logging call in that code. It would be nice to do the "redirect" thing, so that the library code can just call the logger without caring if it lives in the main process or a subprocess...

hjoliver · 2019-01-23T10:17:10Z

(Doesn't look like anything important changed on #2872

kinow · 2019-01-23T10:21:38Z

(Doesn't look like anything important changed on #2872

I think the wrapper around the Popen contributed to this issue. If it replaces all occurrences of Popen in the code, it could be significant change I think.

I agree that it would be nice to do the redirect, or something else to prevent other devs (like myself) from accidentally using a logger where it's not supposed to be used.

hjoliver · 2019-01-23T10:24:42Z

I think the wrapper around the Popen contributed to this issue. ...

Yes, sorry, I meant "other than the wrapper", which (unlike the previously separate Popen calls) is shared between main and subprocesses.

hjoliver · 2019-01-23T10:33:24Z

I suppose the quick fix is just to add a "no_log" arg to Martin's wrapper, and use that in all calls from lib/cylc/batch_sys_manager.py - which is the library code that underlies the sub-process commands.

And comment in batch_sys_manager exactly why it is that we do not make log calls there.

Code shared by main and subprocess commands:

the new wrapper
lib/cylc/subprocpool.py (better check the few log calls in there are OK?)
any other? (probably not much, if any).

hjoliver · 2019-01-23T10:34:38Z

(Gotta go address CISE article feedback from @sadielbartholomew now...)

matthewrmshin · 2019-01-23T11:18:13Z

I still don't understand the issue here. From how I understand the (original) system, the only thing that writes to the suite log should be the suite server program. The sequence of event should be like this:

Suite server program writes to suite log.
Suite server program invokes `cylc jobs-submit``.
cylc jobs-submit being an external command that can run on localhost or remote host does not write to suite log. Instead it communicates with suite server program only with STDOUT and STDERR. STDOUT is chosen for the normal output for job submission statuses, etc. STDERR is chosen for any extra diagnostics message. Therefore, its normal logger is only attached to a StreamHandler writing to STDERR.
cylc jobs-submit writes stuff to STDOUT and STDERR.
Suite server program captures these output either while cylc jobs-submit is running or when it completes.
STDOUT and STDERR (including log messages) from cylc jobs-submit is written to the suite log by the suite server program when the command completes.
Note: the logic handling task submission failure may write something extra to the log. Is that what is also adding to the confusion?

hjoliver · 2019-01-23T11:32:09Z

I agree (now that you've reminded me!) that that's what's supposed to happen. Except I'm hazy on the Streamhandler bit. Is that a general "redirect" so that any call to log.debug("blah") in the other process will actually go to stderr, or not?

Note: the logic handling task submission failure may write something extra to the log. Is that what is also adding to the confusion?

I don't think my transcript above conforms to your narrative!

#2929 (comment)

... it certainly looks like my log.debug("XXXXX" + cmd) is printing to the log in the middle of a main process log string.

matthewrmshin · 2019-01-23T11:45:12Z

Yes. Logic here:

https://github.com/cylc/cylc/blob/1c55f5b4b066775b6f261556605f867a78488607/lib/cylc/option_parsers.py#L288-L295

Logging for all CLI commands goes to STDERR by default. The suite server program overrides this here when starting a daemon:

https://github.com/cylc/cylc/blob/1c55f5b4b066775b6f261556605f867a78488607/lib/cylc/scheduler.py#L343-L351

OK, let's analyse your log.

The following was logged as the command completed (return code -9):

2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cat /home/oliverh/cylc-run/knob/log/job/1/foo/01/job | cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob
/log/job 1/foo/01
        [jobs-submit ret_code] -9
        [jobs-submit err]
        2019-01-23T22:12:19+13:00 DEBUG - XXXXX:sleep 30

        killed on timeout (PT10S)

The following was logged by the logic that handled the task submission failure. Note the made-up return code 1 to represent the submission failure of the individual task. (The cylc jobs-submit command may have good/bad submits for many tasks.)

2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob/log/job 1/foo/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2019-01-23T22:12:29+13:00|1/foo/01|1
2019-01-23T22:12:29+13:00 INFO - [foo.1] -(current:ready) submission failed at 2019-01-23T22:12:29+13:00

hjoliver · 2019-01-23T12:04:23Z

Arggh, that is pretty convincing - thanks for the excellent explanation.

I didn't find the log redirect logic earlier (before you mentioned StreamHandler, anyway) because option_parsers.py did not occur to me as a place to look!

In that case, I suppose we just have to make sure that any suite log comparison tests are careful to ignore any "extra" err output - which unfortunately can run over multiple lines as above.

matthewrmshin · 2019-01-23T12:11:57Z

Should we have multiple loggings instead of a single multi-line logging for external commands and its outputs, etc?

hjoliver · 2019-01-23T12:14:17Z

That may be a good idea, to avoid messing with the tests anyway. If we don't need to worry about intentionally multi-line messages...

matthewrmshin mentioned this issue Jan 17, 2019

test battery: flaky tests #2894

Closed

matthewrmshin added the bug label Jan 17, 2019

matthewrmshin added this to the soon milestone Jan 17, 2019

hjoliver mentioned this issue Jan 17, 2019

Remove timer reset #2930

Merged

matthewrmshin mentioned this issue Jan 17, 2019

Fix subprocpool pipe polling #2932

Merged

matthewrmshin modified the milestones: soon, next-release Jan 17, 2019

matthewrmshin self-assigned this Jan 17, 2019

hjoliver closed this as completed in #2932 Jan 24, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LOG.debug causes a failure of 16-timeout test in test-battery #2929

LOG.debug causes a failure of 16-timeout test in test-battery #2929

MartinRyan commented Jan 17, 2019 •

edited by matthewrmshin

kinow commented Jan 17, 2019

hjoliver commented Jan 17, 2019 •

edited

hjoliver commented Jan 17, 2019

hjoliver commented Jan 17, 2019 •

edited

matthewrmshin commented Jan 17, 2019

matthewrmshin commented Jan 17, 2019

MartinRyan commented Jan 18, 2019

hjoliver commented Jan 23, 2019 •

edited

kinow commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

kinow commented Jan 23, 2019

kinow commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

matthewrmshin commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

matthewrmshin commented Jan 23, 2019

hjoliver commented Jan 23, 2019

kinow commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019

kinow commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

matthewrmshin commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

matthewrmshin commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019

matthewrmshin commented Jan 23, 2019

hjoliver commented Jan 23, 2019

LOG.debug causes a failure of 16-timeout test in test-battery #2929

LOG.debug causes a failure of 16-timeout test in test-battery #2929

Comments

MartinRyan commented Jan 17, 2019 • edited by matthewrmshin

kinow commented Jan 17, 2019

hjoliver commented Jan 17, 2019 • edited

hjoliver commented Jan 17, 2019

hjoliver commented Jan 17, 2019 • edited

matthewrmshin commented Jan 17, 2019

matthewrmshin commented Jan 17, 2019

MartinRyan commented Jan 18, 2019

hjoliver commented Jan 23, 2019 • edited

kinow commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

kinow commented Jan 23, 2019

kinow commented Jan 23, 2019 • edited

hjoliver commented Jan 23, 2019 • edited

matthewrmshin commented Jan 23, 2019 • edited

hjoliver commented Jan 23, 2019 • edited

matthewrmshin commented Jan 23, 2019

hjoliver commented Jan 23, 2019

kinow commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019 • edited

hjoliver commented Jan 23, 2019

kinow commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

hjoliver commented Jan 23, 2019

matthewrmshin commented Jan 23, 2019 • edited

hjoliver commented Jan 23, 2019 • edited

matthewrmshin commented Jan 23, 2019 • edited

hjoliver commented Jan 23, 2019

matthewrmshin commented Jan 23, 2019

hjoliver commented Jan 23, 2019

MartinRyan commented Jan 17, 2019 •

edited by matthewrmshin

hjoliver commented Jan 17, 2019 •

edited

hjoliver commented Jan 17, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

kinow commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

matthewrmshin commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

matthewrmshin commented Jan 23, 2019 •

edited

hjoliver commented Jan 23, 2019 •

edited

matthewrmshin commented Jan 23, 2019 •

edited