Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hold_swap => is_held #3230

Merged
merged 19 commits into from
Aug 15, 2019
Merged

hold_swap => is_held #3230

merged 19 commits into from
Aug 15, 2019

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Jul 23, 2019

Addresses the isHeld point of #3223

~Quick POC ~branch for replacing the hold_swap logic for a is_held flag to see how much work is involved.

Logical Change:

  • All task states can be held, so you can have a held succeeded task.
  • Holding a task is equivalent to telling Cylc not to issue any more submissions (will need to document as such).

Broken tests:

  • tests/restart/21-task-elapsed.t
  • tests/hold-release/20-reset-waiting-output.t
  • tests/hold-release/13-ready-restart.t

TODO:

  • Add tests
  • Fix tests
  • DB upgrader

Questions:

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Includes an appropriate entry in the release change log CHANGES.md.
  • No documentation update required for this change.

@oliver-sanders oliver-sanders added the POC Proof of Concept label Jul 23, 2019
@oliver-sanders oliver-sanders self-assigned this Jul 23, 2019
@matthewrmshin matthewrmshin added this to the cylc-8.0a2 milestone Jul 23, 2019
@cylc cylc deleted a comment Jul 25, 2019
@oliver-sanders
Copy link
Member Author

oliver-sanders commented Jul 26, 2019

The remaining two test failures seem to be genuine and quite difficult to shake off:

  • tests/hold-release/20-reset-waiting-output.t (fixed in a54ab0a)
  • tests/hold-release/13-ready-restart.t (fixed in 91feb83)
  • tests/hold-release/14-hold-kill.t (seems to have become flaky)

@cylc cylc deleted a comment Jul 26, 2019
@oliver-sanders oliver-sanders mentioned this pull request Aug 1, 2019
4 tasks
@cylc cylc deleted a comment Aug 2, 2019
@oliver-sanders
Copy link
Member Author

At the moment this change has broken the logic which keeps the prerequisite brokering system in-line with manual resets.

  • hold workflow
  • reset X to succeeded
  • reset X back to waiting
  • release workflow
  • downstream task Y now runs (despite X being waiting)

When a task is reset it's upstream prerequisites are reset, I'm having a hard time hunting down the logic which is supposed to do this for downstream prerequisites (postrequisites).

  • Task status is being reset as intended
  • Task outputs are being reset as intended

The status of the offending prerequisite object after X has been reset to waiting:

Prerequisite(satisfied={('X', '1', 'succeeded'): 'satisfied naturally'}, _all_satisfied=True, target_point_strings=['1'], start_point=1, pre_initial_messages=[], conditional_expression=None, point=1)

(this is unchanged from before the reset [when X was succeeded])

@cylc cylc deleted a comment Aug 5, 2019
@cylc cylc deleted a comment Aug 5, 2019
@cylc cylc deleted a comment Aug 5, 2019
@hjoliver
Copy link
Member

hjoliver commented Aug 5, 2019

@oliver-sanders -

When a task is reset it's upstream prerequisites are reset, I'm having a hard time hunting down the logic which is supposed to do this for downstream prerequisites (postrequisites).

We discussed this on Riot chat (and I just added to that conversation). The way I'd put it is, for foo => bar:

  1. reset foo to waiting, and its own prerequistes get set unsatisfied
  2. reset foo to succeeded, and its own outputs get set to completed
  3. reset foo back to waiting, and its own outputs get set back to unsatisfied

The prerequisites of bar do not get modified in any of these cases case, but:

  • in case 2. they get satisfied by the normal dependency matching mechanism
  • and in case 3. they remain satisfied because there is no mechanism (except manual reset of bar) to reverse already-satisfied prerequisites.

And finally, in the test hold-release/20 the downstream task gets reset back to waiting only because all the manual state manipulation is done with the suite held, and on release we set held tasks back to their pre-hold state.

@hjoliver
Copy link
Member

hjoliver commented Aug 5, 2019

Hopefully that helps (at least to explain what's going on ... not entirely sure of the implications for this PR).

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Aug 6, 2019

Thanks for your explanation.

the downstream task gets reset back to waiting only because all the manual state manipulation is done with the suite held

not entirely sure of the implications for this PR

If this change breaks the logic It's possible there may be some convolution of the task-held and suite-held statuses in the task pool.

@oliver-sanders
Copy link
Member Author

not entirely sure of the implications for this PR

Fixed in a54ab0a!

@cylc cylc deleted a comment Aug 6, 2019
@matthewrmshin
Copy link
Contributor

No Travis CI for the latest change for some reason?

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Aug 6, 2019

No Travis CI for the latest change for some reason?

I'll mark the PR as ready for review (not quite there yet) I think that'll get Travis going.

-> Nope, the branch is in conflict.

@oliver-sanders oliver-sanders marked this pull request as ready for review August 6, 2019 14:23
Copy link
Contributor

@matthewrmshin matthewrmshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick comment on test with in-memory database.

cylc/flow/tests/test_rundb.py Show resolved Hide resolved
Copy link
Contributor

@matthewrmshin matthewrmshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you.

@oliver-sanders oliver-sanders removed the POC Proof of Concept label Aug 7, 2019
@cylc cylc deleted a comment Aug 7, 2019
Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. We will need to document the new meaning of "held" (for suite and tasks). It was a suite status, with hold-swap logic that reverted tasks to pre-suite-hold status on suite release, even if their states had been manipulated while the suite was held; and a task status that only waiting tasks would go to. Now the hold swap logic is gone (which is good) and a task with any status can be held (which means: the task will not be submitted to run even if it goes to the waiting status and has prerequisites satisfied). But I'll open another issue for the documentation since it might take some thought.

LGTM 👍

@hjoliver
Copy link
Member

hjoliver commented Aug 7, 2019

Kicking Travis for tests/hold-release/14-hold-kill.t. Test passes locally. On Travis its another file not found issue: cylc.flow.exceptions.SuiteServiceFileError: Couldn't get contact which I think we see quite a lot of. I'm not sure if the specific tests are flaky or something else weird is happening on Travis CI.

@hjoliver
Copy link
Member

hjoliver commented Aug 7, 2019

tests/hold-release/14-hold-kill.t

failed again 🤔

@kinow
Copy link
Member

kinow commented Aug 8, 2019

cylc.flow.exceptions.ClientError: Contact info not found for suite "cylctb-20190807T234059Z/hold-release/14-hold-kill", suite not running?

This is the same kind of error that I started seeing in our functional tests. Not sure if it was due to any change in our code or Travis, but it's recent-ish I think.

@hjoliver
Copy link
Member

hjoliver commented Aug 8, 2019

And failed again, same test.

Direct link to T-CI after-fail.sh output: https://travis-ci.org/cylc/cylc-flow/jobs/568894700

First error is actually this:

==== /home/travis/cylc-run/cylctb-20190807T234059Z/hold-release/14-hold-kill/log/job/1/sleeper 02/job.err ====
Sending DEBUG MODE xtrace to job.xtrace
Fatal Python error: initsite: Failed to import the site module

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site.py", line 703, in <module>
    main()
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site.py", line 694, in main
    execsitecustomize()
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site.py", line 548, in execsitecustomize
    import sitecustomize
  File "/home/travis/build/cylc/cylc-flow/.travis/sitecustomize.py", line 24, in <module>
    coverage.process_startup()
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/coverage/control.py", line 1289, n process_startup
    cov.start()
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/coverage/control.py", line 690, in start
    self._init()
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/coverage/control.py", line 266, in _init
    concurrency=concurrency,
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/coverage/collector.py", line 107, in __init__
    self.origin = short_stack()
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/coverage/debug.py", line 143, in short_stack
    stack = inspect.stack()[limit:skip:-1]
  File "/opt/python/3.7.1/lib/python3.7/inspect.py", line 1511, in stack
    return getouterframes(sys._getframe(1), context)
  File "/opt/python/3.7.1/lib/python3.7/inspect.py", line 1488, in getouterframes
    frameinfo = (frame,) + getframeinfo(frame, context)
  File "/opt/python/3.7.1/lib/python3.7/inspect.py", line 1458, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/opt/python/3.7.1/lib/python3.7/inspect.py", line 696, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/opt/python/3.7.1/lib/python3.7/inspect.py", line 725, in getmodule
    file = getabsfile(object, _filename)
  File "/opt/python/3.7.1/lib/python3.7/inspect.py", line 709, in getabsfile
    return os.path.normcase(os.path.abspath(_filename))
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/posixpath.py", line 376, in abspath
    cwd = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

@kinow
Copy link
Member

kinow commented Aug 8, 2019

Hmm, maybe the coverage script that uses the site-customization trick in Python is trying to use an invalid directory (deleted, incorrect, etc)

@oliver-sanders
Copy link
Member Author

Kicking Travis for tests/hold-release/14-hold-kill.t. Test passes locally.

I have managed to get it to fail locally with a busy system (and only after several attempts), simple solution, dump it in flakytests?

@cylc cylc deleted a comment Aug 8, 2019
@matthewrmshin
Copy link
Contributor

(Just kicked Travis CI again, noting failures of hold-release/12 and cli/02. I can see why the hold-release tests are so sensitive to time. I'll have a look at them soon.)

@oliver-sanders
Copy link
Member Author

flakytests/hold-release/14-hold-kill.t is now failing reliably on Travis but passing reliably on my system even when put under load:

for i in $(seq 1 10); do
    sleep 1  # due to ctb test naming convention
    etc/bin/run-functional-tests.sh tests/hold-release/14-hold-kill.t &
done

Will see if bc84e23 helps.

@cylc cylc deleted a comment Aug 13, 2019
@oliver-sanders
Copy link
Member Author

Will see if bc84e23 helps.

It did!

Tests passing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants