
document suite runtime interface #3004

Merged
merged 7 commits into cylc:master from oliver-sanders:zmq-document-interface on Apr 4, 2019

Conversation

@oliver-sanders (Member) commented Mar 14, 2019

Follow on from #2966
Closes #3048

Write up the network stuff whilst it's still fresh in my head:

  • Add new api section to the user guide
    • We can migrate other reference material there when we auto-document it.
  • Document the SuiteRuntimeServer endpoints (see the sketch below).
  • Document the suite privilege levels.
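
For illustration, here is a minimal sketch of the request/reply pattern such an interface uses; the endpoint name, message shape and port are assumptions made for the example, not the documented Cylc protocol:

import zmq

# Connect a REQ socket to a (hypothetical) suite server address.
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:43001")  # illustrative port only

# Call a hypothetical endpoint by name, then block until the server replies.
socket.send_json({"command": "ping_suite", "args": {}})
reply = socket.recv_json()
print(reply)

socket.close()
context.term()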

@kinow (Member) left a comment

Thanks a ton for clarifying which methods are public and which are private in the API (i.e. renaming some methods to have the _ prefix), and for adding TODO markers so we remember to revisit parts of the code later.

Generated the documentation locally with no issues. Used a notebook to help me test the code too, and it works (a few hiccups, but it works nicely).

Found some typos, but I think no blockers, so happy to approve once updated 👍

Resolved review threads: lib/cylc/network/server.py (two threads), bin/cylc-make-docs, doc/src/api/zmq.rst, lib/cylc/network/__init__.py
@@ -235,14 +242,42 @@ def _authorise(self, *args, user='?', meta=None, **kwargs):
            LOG.info(
                '[client-command] %s %s@%s:%s', fcn.__name__, user, host, prog)
            return fcn(self, *args, **kwargs)
        _authorise.__doc__ += (  # add auth level to docstring
            'Authentication:\n%s:py:obj:`cylc.network.%s`\n' % (
                ' ' * 12, req_priv_level))
@kinow (Member) Mar 19, 2019

Maybe now we start using the (supposedly faster, but maybe simpler) f-strings? They were introduced in Py3.6, which I think is our minimum version in Travis-CI?

_authorise.__doc__ += f"Authentication:\n{' ' * 12}:py:obj:cylc.network.{req_priv_level}\n" should work I believe.

(though that could be done later, incrementally, etc, feel free to resolve this conversation if you prefer 👍 )

Member

Might as well use the latest and greatest.

@oliver-sanders (Member Author)

but maybe simpler

Can do. I have mixed feelings about f-strings; Python has created an entire templating engine. From playing about with them so far, they have irritating limitations.

supposedly faster

Wow, really? That's surprising!

@oliver-sanders (Member Author)

_authorise.__doc__ += f"Authentication:\n{' ' * 12}:py:obj:cylc.network.{req_priv_level}\n" should work I believe.

SyntaxError: f-string: expecting '}'

@kinow (Member) Mar 20, 2019

I think you are using Py34, but minimum requirement for f-strings is Py36?

Py37: [screenshot: f-string-py37]

Py36: [screenshot: f-string-py36]

Py34: [screenshot: f-string-py34]

Member

However, when running sphinx-build, it complained about the interpolated value:

...
writing output... [100%] index                                                                                                                                                                                     
/home/kinow/Development/python/workspace/cylc/lib/cylc/network/server.py:docstring of cylc.network.server.SuiteRuntimeServer.api:15: WARNING: py:obj reference target not found: cylc.network.1
/home/kinow/Development/python/workspace/cylc/lib/cylc/network/server.py:docstring of cylc.network.server.SuiteRuntimeServer.clear_broadcast:31: WARNING: py:obj reference target not found: cylc.network.6
...

Had to change to

_authorise.__doc__ += (  # add auth level to docstring
            f"Authentication:\n{' ' * 12}:py:obj:`cylc.network.Priv.{req_priv_level.name}`\n")

To get it working OK.
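
For anyone puzzled by those warnings: the switch of formatting style changes what gets interpolated for an enum member. A minimal sketch, assuming Priv is an IntEnum (the member names and values below are placeholders, not the real cylc.network.Priv definitions):

from enum import IntEnum

class Priv(IntEnum):  # stand-in for cylc.network.Priv
    IDENTITY = 1
    CONTROL = 6

level = Priv.CONTROL

# %-formatting goes through str(), which (on Python 3.6/3.7) gives the member name:
print('%s' % level)          # Priv.CONTROL
# f-strings go through format(), which for an IntEnum falls back to int formatting:
print(f'{level}')            # 6  -> a Sphinx target like cylc.network.6 (not found)
# Interpolating .name explicitly produces a resolvable cross-reference target:
print(f'Priv.{level.name}')  # Priv.CONTROL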

Resolved review threads: lib/cylc/network/__init__.py, lib/cylc/network/server.py (two threads)
@kinow (Member) left a comment

One comment about f-strings, but it's fine if we don't start using them right now. Nothing else to comment on; the change looks good to me 👍

@oliver-sanders (Member Author) commented Mar 20, 2019

FYI: I've fixed the trigger tests but still need to work on the two restart failures.
Should be good now.

@oliver-sanders (Member Author)
FYI: The functional tests are now intermittently timing out.

@kinow (Member) commented Mar 21, 2019

FYI: The functional tests are now intermittently timing out.

Kicked both builds that failed, should be green in a moment 👍

@kinow (Member) commented Mar 22, 2019

Hmmm, there is one build that is simply refusing to work. Kicked it four times today already.

===( 564;1133 1/3 1/2 1/3 )======================================
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

I wonder if something changed in Travis configuration. I think we had a workaround for Travis's default 10-minute no-output timeout?

@hjoliver (Member)
Travis CI is pretty frustrating at the moment. I don't recall if we had a solution for the timeout issue before though ... maybe not.

@oliver-sanders (Member Author)
The test which hangs seems to change around a bit. Is it Cylc hanging, a problem in my branch, or a Travis issue? I have no idea. We shouldn't have tests hanging.

@hjoliver (Member) commented Mar 22, 2019

No, we shouldn't! (Or, yes, we shouldn't ... i.e. I'm agreeing!)

@oliver-sanders (Member Author)
Hmmm, there is one build that is simply refusing to work. Kicked it four times today already.

By checking against the chunking algorithm the three tests hanging in that job are:

  • tests/special/00-sequential.t
  • tests/suite-state/05-message.t
  • tests/suite-state/07-message2.t

In my environment each passes individually and they all pass when run together in parallel.

I wonder if something changed in Travis configuration.

I guess it could be, I can't work out what in this PR could cause it.

@kinow (Member) commented Mar 26, 2019

@oliver-sanders so there was actually some discussion around tests being killed by Travis after an inactivity period, but it was in isodatetime (https://github.com/metomi/isodatetime/blob/5a5ab113412d9b43e880e0303963780f13e4527d/.travis.yml#L42).

I tried to apply the same fix here, to see if that would fix it. My hunch was that perhaps the improvements in logging and exceptions, or maybe something in py3, could have caused our tests to be more silent. But that did not work.

See Travis result here: https://travis-ci.org/kinow/cylc/builds/511280533

Builds were killed after 50 minutes of inactivity (which is a lot more than our normal 20-30 minutes). Looks like the tests are simply getting stuck somewhere. It should be reproducible locally. I'll try to run one of the failed test chunks on my computer to see whether it passes there too 👍

@kinow (Member) commented Mar 26, 2019

Maybe this could be helpful:

[screenshot: Screenshot_2019-03-26_16-29-34]

Looks like.... sleep crashed? 🙄

@kinow (Member) commented Mar 26, 2019

Tried to find what was hanging, but all the Cylc-related processes I could find with ps aux | grep python were:

kinow     1240  0.0  0.5 254588 30724 ?        Sl   16:21   0:00 python3 /home/kinow/Development/python/workspace/cylc/bin/cylc-run cylctb-20190326T031802Z/graph-equivalence/03-multiline_and1 --hold
kinow    30458  0.1  0.6 255060 36880 pts/3    Sl+  16:20   0:00 python3 /home/kinow/Development/python/workspace/cylc/bin/cylc-run --set=RELEASE_MATCH=STUFF --reference-test --debug --no-detach cylctb-20190326T031802Z/hold-release/03-release-family-exact

I think whatever crashed sleep had already been killed, so I looked at /var/log/apport.log:

ERROR: apport (pid 18309) Tue Mar 26 16:18:56 2019: called for pid 17978, signal 24, core limit 0, dump mode 1
ERROR: apport (pid 18309) Tue Mar 26 16:18:56 2019: executable: /bin/sleep (command line "sleep 10")
ERROR: apport (pid 18309) Tue Mar 26 16:18:56 2019: gdbus call error: Error: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.gnome.SessionManager was not provided by any .service files

ERROR: apport (pid 18309) Tue Mar 26 16:18:56 2019: debug: session gdbus call: 
ERROR: apport (pid 18309) Tue Mar 26 16:18:56 2019: wrote report /var/crash/_bin_sleep.1000.crash
ERROR: apport (pid 23441) Tue Mar 26 16:19:49 2019: called for pid 23253, signal 24, core limit 0, dump mode 1
ERROR: apport (pid 23441) Tue Mar 26 16:19:49 2019: executable: /bin/sleep (command line "sleep 10")
ERROR: apport (pid 23441) Tue Mar 26 16:19:49 2019: gdbus call error: Error: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.gnome.SessionManager was not provided by any .service files

ERROR: apport (pid 23441) Tue Mar 26 16:19:49 2019: debug: session gdbus call: 
ERROR: apport (pid 23441) Tue Mar 26 16:19:49 2019: apport: report /var/crash/_bin_sleep.1000.crash already exists and unseen, doing nothing to avoid disk usage DoS
ERROR: apport (pid 24883) Tue Mar 26 16:20:05 2019: called for pid 24503, signal 24, core limit 0, dump mode 1
ERROR: apport (pid 24883) Tue Mar 26 16:20:05 2019: executable: /bin/sleep (command line "sleep 30")
ERROR: apport (pid 24883) Tue Mar 26 16:20:05 2019: gdbus call error: Error: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.gnome.SessionManager was not provided by any .service files

ERROR: apport (pid 24883) Tue Mar 26 16:20:05 2019: debug: session gdbus call: 
ERROR: apport (pid 24883) Tue Mar 26 16:20:05 2019: apport: report /var/crash/_bin_sleep.1000.crash already exists and unseen, doing nothing to avoid disk usage DoS

And /var/crash/_bin_sleep.1000.crash (only first lines):

ProblemType: Crash
Architecture: amd64
CrashCounter: 1
CurrentDesktop: XFCE
Date: Tue Mar 26 16:18:56 2019
DistroRelease: Ubuntu 18.04
ExecutablePath: /bin/sleep
ExecutableTimestamp: 1516268629
ProcCmdline: sleep 10
ProcCwd: /home/kinow/cylc-run/cylctb-20190326T031802Z/execution-time-limit/00-background/work/1/foo
ProcEnviron:
 LANG=en_NZ.UTF-8
 TERM=xterm-256color
 SHELL=/bin/bash
 LANGUAGE=en_NZ:en
 XDG_RUNTIME_DIR=<set>
 PATH=(custom, user)
ProcMaps:
(...)

I tried running all the tests mentioned here, and all of them passed when executed in isolation. My guess now is that there could be a problem when they are executed in parallel, or maybe with our test or coverage tools.

Sorry if this is not really helpful 😞

@hjoliver (Member)
That is interesting/weird. I'm pretty sure I've seen several "sleep crashed" dialogs on my Ubuntu 18.04 VM lately too.

@kinow (Member) commented Mar 26, 2019

I had noticed it once a little while ago, while running some command in Cylc, but I simply ignored it and carried on with whatever I was doing at that moment. I wonder why this pull request in particular is failing.

@hjoliver (Member)
Looks like one of these occurred recently on my system:

$ head /var/crash/_bin_sleep.1001.crash    

ProblemType: Crash
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Thu Mar 21 12:54:14 2019
DistroRelease: Ubuntu 18.04
ExecutablePath: /bin/sleep
ExecutableTimestamp: 1516268629
ProcCmdline: sleep 10
ProcCwd: /home/oliverh/cylc-run/cylctb-20190320T235405Z/execution-time-limit/00-background/work/1/foo
...

@dwsutherland (Member)
Yeah I had a couple of Ubuntu sleep crashes while working on that flask-graphql branch a while ago..

@hjoliver (Member) commented Mar 26, 2019

The test job mentioned above (execution-time-limit/00-background/work/1/foo) executes sleep 10 in the job script, which is killed at PT5S by the execution timeout (which, for background jobs, means timeout 5 job).

Doing this manually doesn't cause /var/crash/_bin_sleep... to appear though. Might be a red herring...
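
For reference, that kill behaviour can be reproduced outside Cylc with GNU coreutils timeout (a sketch, not the Cylc job script itself):

import subprocess

# `timeout` sends SIGTERM once the limit expires and exits with status 124.
proc = subprocess.run(['timeout', '5', 'sleep', '10'])
print(proc.returncode)  # 124 -- sleep was killed after 5 seconds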

@kinow (Member) commented Mar 26, 2019

Yeah I had a couple of Ubuntu sleep crashes while working on that flask-graphql branch a while ago..

That may be a really good lead, @dwsutherland. Does that mean you had this issue before the Python 3 changes?

Could you check tomorrow if you also have one of these log files? I assume you were using Linux? If not, Windows Event Viewer may have something useful (though I have no idea if Cylc runs on Windows/cygwin/subsystem/etc).

@hjoliver (Member)
I have no idea if Cylc runs on Windows...

We definitely don't run on Windows itself. Cygwin or Linux on "windows subsystem for Linux" probably, but I haven't tried yet (I think @dpmatthews said he did that - successfully - not so long ago?).

@dpmatthews (Contributor)
We definitely don't run on Windows itself. Cygwin or Linux on "windows subsystem for Linux" probably, but I haven't tried yet (I think @dpmatthews said he did that - successfully - not so long ago?).

Nope - not me

@hjoliver (Member)
Nope - not me

OK, god knows why I thought that.

@matthewrmshin (Contributor)
I tried Cylc on WSL some time ago with limited success. It worked with a very basic suite, but had issues as soon as the suite was a bit more involved. I'm sure things have improved recently. Cylc works perfectly on the Chromebook, however.

@kinow (Member) commented Mar 27, 2019

Executed the same CHUNK=4/4 locally today, after restarting my virtual machine. It started around 10AM, and has been running since then (13:07 now).

It is stuck, without a crash.

./tests/runahead/03-check-default-future.t ........................ ok  
./tests/validate/08-whitespace.t .................................. ok  
./tests/special/08-clock-triggered-0.t ............................ ok  
./tests/tutorial/cycling/01-tut.four.t ............................ ok  
===(     581;1085  1/2  1/6  0/3  1/2 )=================================

I ran ps -aux | grep python before lunch, and again just now. The same Cylc Python processes are still running.

kinow     5463  0.0  0.3  39972 18408 pts/0    S+   10:30   0:00 /home/kinow/anaconda3/bin/python /home/kinow/anaconda3/bin/coverage run .travis/cover.py
kinow     7110  0.0  0.6 260128 39808 pts/0    Sl+  10:43   0:07 python3 /home/kinow/Development/python/workspace/cylc/bin/cylc-run --reference-test --debug --no-detach cylctb-20190326T213057Z/triggering/09-fail
kinow    11111  0.0  0.6 260136 39692 pts/0    Sl+  10:32   0:08 python3 /home/kinow/Development/python/workspace/cylc/bin/cylc-run --reference-test --no-detach --debug cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit
kinow    11807  0.0  0.6 259860 39660 pts/0    Sl+  10:32   0:08 python3 /home/kinow/Development/python/workspace/cylc/bin/cylc-run --no-detach cylctb-20190326T213057Z/shutdown/08-now1
kinow    16079  0.1  0.6 260108 40012 pts/0    Sl+  10:33   0:10 python3 /home/kinow/Development/python/workspace/cylc/bin/cylc-run --reference-test --debug --no-detach cylctb-20190326T213057Z/suite-state/04-template_ref

Will poke around a bit with strace, lsof, etc, to see if I can find anything about the running processes. Maybe it's something that happens when we have multiple processes? I think we have threads in pyzmq... could it be a deadlock somewhere?
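
For context on the "threads in pyzmq" remark, here is a minimal sketch of a REP socket serviced from a background thread (illustrative only, not the actual SuiteRuntimeServer code); the port number is borrowed from the lsof output further down:

import threading
import zmq

def serve(port):
    context = zmq.Context.instance()
    socket = context.socket(zmq.REP)
    socket.bind(f"tcp://*:{port}")
    while True:
        message = socket.recv_json()          # blocks until a request arrives
        socket.send_json({"reply": message})  # REP must answer before the next recv

server = threading.Thread(target=serve, args=(43002,), daemon=True)
server.start()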

@kinow (Member) commented Mar 27, 2019

What the parent process is up to:

# strace -p 5463
strace: Process 5463 attached
wait4(5464, ^Cstrace: Process 5463 detached
 <detached ...>

wait4

wait3, wait4 - wait for process to change state, BSD style

The 5464 pid is kinow 5464 0.0 0.0 4628 920 pts/0 S+ 10:30 0:00 /bin/sh -c xvfb-run -a cylc test-battery --chunk $CHUNK --state=save -j 5.

# strace -p 5464
strace: Process 5464 attached
wait4(-1, ^Cstrace: Process 5464 detached
 <detached ...>

Which is also waiting, but on any child process (the -1 argument).
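
In Python terms, those two flavours of wait look roughly like this (a sketch; the real parents here are the coverage process and /bin/sh, not this code):

import os
import subprocess

# Wait on one specific child, roughly what pid 5463 is doing:
child = subprocess.Popen(['sleep', '2'])
pid, status, rusage = os.wait4(child.pid, 0)   # blocks until that particular child exits

# Wait on any child at all, the -1 seen in the strace of pid 5464:
child = subprocess.Popen(['sleep', '1'])
pid, status, rusage = os.wait4(-1, 0)          # blocks until some child exits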

And the Cylc suites:

# strace -p 7110
strace: Process 7110 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=225875}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999337}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999198}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999345}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999399}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999377}^Cstrace: Process 7110 detached
 <detached ...>
# strace -p 11111
strace: Process 11111 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=702635}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999250}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999249}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999329}^Cstrace: Process 11111 detached
 <detached ...>
# strace -p 11807
strace: Process 11807 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=687200}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999282}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=998944}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999199}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999337}^Cstrace: Process 11807 detached
 <detached ...>
# strace -p 16079
strace: Process 16079 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=268371}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999037}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=999135}^Cstrace: Process 16079 detached
 <detached ...>

Looks like they are all sleeping in timed select() calls. I think the four child processes are suffering from the same illness, so I will diagnose only one (11111, because it is easier to type).
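
For reference, a select() call with no file descriptors and only a timeout is effectively a sleep, so the traces above show the suites idling in their main loop rather than blocking on I/O. The equivalent call from Python (POSIX only):

import select

# With empty fd lists, select() simply waits out the timeout -- the same
# pattern as the select(0, NULL, NULL, NULL, {timeout}) lines in the strace above.
select.select([], [], [], 1.0)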

# lsof -p 11111 | grep -v anaconda3 | grep -v /lib/x86_64-linux-gnu/
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
COMMAND   PID  USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
python3 11111 kinow  cwd       DIR                8,1     4096  659129 /tmp/tmp.pfbdb4xtp8/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit
python3 11111 kinow  rtd       DIR                8,1     4096       2 /
python3 11111 kinow  mem       REG                8,1  3004096  394613 /usr/lib/locale/locale-archive
python3 11111 kinow    0u      CHR              136,0      0t0       3 /dev/pts/0
python3 11111 kinow    1w      REG                8,1      568  694790 /tmp/tmp.pfbdb4xtp8/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit/05-tut.inherit-run.stdout
python3 11111 kinow    2w      REG                8,1     8600  694791 /tmp/tmp.pfbdb4xtp8/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit/05-tut.inherit-run.stderr
python3 11111 kinow    3r      CHR                1,9      0t0      11 /dev/urandom
python3 11111 kinow    4w      REG                8,1     7259  791526 /home/kinow/cylc-run/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit/log/suite/log.20190327T103214+13
python3 11111 kinow    5u     unix 0xffff90c67e2b6000      0t0  137823 type=STREAM
python3 11111 kinow    6u     unix 0xffff90c67e2b5000      0t0  137824 type=STREAM
python3 11111 kinow    7r      CHR                1,9      0t0      11 /dev/urandom
python3 11111 kinow    8u     unix 0xffff90c67e2b5400      0t0  137825 type=STREAM
python3 11111 kinow    9u     unix 0xffff90c67e2b4000      0t0  137826 type=STREAM
python3 11111 kinow   10u  a_inode               0,13        0   10594 [eventpoll]
python3 11111 kinow   11u     unix 0xffff90c67e2b6800      0t0  137827 type=STREAM
python3 11111 kinow   12u     unix 0xffff90c67e2b6c00      0t0  137828 type=STREAM
python3 11111 kinow   13u  a_inode               0,13        0   10594 [eventpoll]
python3 11111 kinow   14u     unix 0xffff90c67e2b6400      0t0  137829 type=STREAM
python3 11111 kinow   15u     unix 0xffff90c67e2b7800      0t0  137830 type=STREAM
python3 11111 kinow   16u     IPv4             137832      0t0     TCP *:43002 (LISTEN)
python3 11111 kinow   19u      REG                8,1   139264  791543 /home/kinow/cylc-run/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit/log/db

Before doing any more digging, I looked at the log files of the job; 05-tut.inherit-run.stderr contains (complete log output here):

2019-03-27T10:32:14+13:00 DEBUG - Loading site/user global config files
2019-03-27T10:32:14+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T10:32:14+13:00 DEBUG - Generated /home/kinow/cylc-run/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit/.service/passphrase
2019-03-27T10:32:14+13:00 DEBUG - Loading site/user global config files
2019-03-27T10:32:14+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T10:32:14+13:00 DEBUG - creating suite run directory: /home/kinow/cylc-run/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit
...
...
2019-03-27T12:32:21+13:00 DEBUG - Loading site/user global config files
2019-03-27T12:32:21+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T12:42:22+13:00 DEBUG - Performing suite health check
2019-03-27T12:42:22+13:00 DEBUG - Loading site/user global config files
2019-03-27T12:42:22+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T12:52:22+13:00 DEBUG - Performing suite health check
2019-03-27T12:52:22+13:00 DEBUG - Loading site/user global config files
2019-03-27T12:52:22+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T13:02:22+13:00 DEBUG - Performing suite health check
2019-03-27T13:02:22+13:00 DEBUG - Loading site/user global config files
2019-03-27T13:02:22+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T13:12:23+13:00 DEBUG - Performing suite health check
2019-03-27T13:12:23+13:00 DEBUG - Loading site/user global config files
2019-03-27T13:12:23+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T13:22:23+13:00 DEBUG - Performing suite health check
2019-03-27T13:22:23+13:00 DEBUG - Loading site/user global config files
2019-03-27T13:22:23+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc
2019-03-27T13:32:24+13:00 DEBUG - Performing suite health check
2019-03-27T13:32:24+13:00 DEBUG - Loading site/user global config files
2019-03-27T13:32:24+13:00 DEBUG - Reading file /tmp/tmp.pfbdb4xtp8/etc/global.rc

So it looks like it is still running. I waited until 13:42 to confirm, and indeed another line appeared. I think now I need to learn how to debug/trace Cylc suites, and understand what they are waiting for...

EDIT: but my first attempt started without much luck...

$ cylc scan
cylctb-20190326T213057Z/shutdown/08-now1 kinow-VirtualBox 43003 TIMEOUT
cylctb-20190326T213057Z/triggering/09-fail kinow-VirtualBox 43004 TIMEOUT
cylctb-20190326T213057Z/suite-state/04-template_ref kinow-VirtualBox 43001 TIMEOUT
cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit kinow-VirtualBox 43002 TIMEOUT

@kinow (Member) commented Mar 27, 2019

The best I could find was that this suite

~/cylc-run/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit$ cat suite.rc.processed 
[meta]
    title = "Simple runtime inheritance example"
[scheduling]
    [[dependencies]]
        graph = "hello => goodbye"
[runtime]
    [[root]]
        script = "sleep 0; echo $GREETING World!"
    [[hello]]
        [[[environment]]]
            GREETING = Hello
    [[goodbye]]
        [[[environment]]]
            GREETING = Goodbye
[runtime]
    [[root]]
        script = /bin/true

has the following in its sqlite DB:

[screenshot: task states from the suite's sqlite database]

Not sure why the goodbye task would stay so long in waiting.
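
Since the database screenshot is an image, here is a sketch of pulling the same task-state information from the suite's run database with plain sqlite3; the task_states table and its column names are assumptions based on the Cylc run-database schema:

import sqlite3

# Path taken from the lsof output above.
db = "/home/kinow/cylc-run/cylctb-20190326T213057Z/tutorial/oneoff/05-tut.inherit/log/db"

conn = sqlite3.connect(db)
for name, cycle, status in conn.execute(
        "SELECT name, cycle, status FROM task_states"):
    print(name, cycle, status)  # e.g. hello ... submitted, goodbye ... waiting
conn.close()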

@hjoliver (Member)
Not sure why the goodbye task would stay so long in waiting.

Because the "hello" task never finished (or even started, by the look of it - it's in the "submitted" state ... so the first question is, why is that?)

@hjoliver (Member)
Shall we move the test battery investigation to #3046? (It's a bit off-topic here.)

@kinow (Member) commented Mar 28, 2019

All checks pass (thanks @oliver-sanders for looking into the issue on Travis). Keeping my +1 for merging it. 🎉

Resolved review thread: bin/cylc-insert
@hjoliver hjoliver merged commit f6bfb3f into cylc:master Apr 4, 2019
@sadielbartholomew sadielbartholomew removed their request for review October 25, 2019 14:45
@oliver-sanders oliver-sanders deleted the zmq-document-interface branch May 27, 2020 11:27