Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cylc monitor rewrite #3463

Merged
merged 40 commits into from Mar 30, 2020
Merged

cylc monitor rewrite #3463

merged 40 commits into from Mar 30, 2020

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Dec 12, 2019

Closes #1685

cylc-monitor-2

Jet-lag washup - re-implement cylc monitor using urwid.

  • A direct replacement for the old cylc monitor.
    • Gets data direct from the suite.
      • Uses GraphQL so we can add the option to go via the UI Server later.
      • One second (non-adaptive) polling.
    • Performs global updating (there is scope for partial updating).
  • Follows the new Cylc UI - TreeView interface which is easily ported to a terminal interface.
  • Uses unicode to replicate task icons.
    • Note font size can be changed with ctrl +.

What's supported:

  • The full Workflow-Cycle-Family-Task-Job tree.
  • A predefined set of job information.
  • Task and job icons (including 25,50,75% progress indicators).
  • Family grouping with consolidated task icons.

What's not supported:

  • Multiple workflows (to be supported via GScan-esque panel at a future date).
  • Cycle consolidated task icons (computationally expensive).
  • Mutations (attempt after/with Suite API - On-The-Fly #2123).
  • ctrl c exit, use q or ctrl d.

TODO

  • Fix tests which depend on the old cylc monitor.
  • Unit test the new code where possible quite difficult.
  • Write visual test via HTML translation.
  • Add suite state summary to toolbar.
  • Add task state filtering.
  • Investigate performance with larger suites. attempt after delta updates

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • (master branch) I have opened a documentation PR at cylc/cylc-doc/pull/XXXX.

@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Dec 12, 2019
@oliver-sanders oliver-sanders self-assigned this Dec 12, 2019
This was referenced Dec 12, 2019
@hjoliver
Copy link
Member

Need to add urwid to setup.py.

@hjoliver
Copy link
Member

Found a couple of issues in my environment already, but it's a WIP; generally: god damn it, that is cool. Well done. 🎉

@oliver-sanders
Copy link
Member Author

The test I added for the new cylc monitor is failing on Travis for some reason, I'll take a look at it and hopefully simplify a bit. The test uses a date-time cycling suite which isn't using UTC mode = True so hopefully that covers the time zone issue.

Have added a TODO for performance, there might be some easy wins. If nothing else, we can lower the refresh rate for heavy suites.

@hjoliver
Copy link
Member

@oliver-sanders - I'm keen to get this in, and we can enhance it further (e.g. status summaries at top) in future PRs.

However, (CylcCon 2020) @trwhitcomb wanted to keep the old cylc monitor for "quick overview" use cases, at least until he's satisfied that this really is better for that. To make that easy, we need to rename this new app anyway as it is likely to grow command funcitionality - at which point it will no longer be merely a "monitor".

What about cylc treeview?

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Feb 17, 2020

and we can enhance it further (e.g. status summaries at top) in future PRs.

Things I'll try and do in this PR:

  • State summary at the top
  • Filtering
  • Copyable text

Things I'll try and punt to future PRs:

  • Async updater (so that refreshing data doesn't lockup the UI)
  • Configurable / semi-intelligent update frequency.

I think once filtering/state-totals/copyable text are done the use case for the old cylc monitor is expended?

What about cylc treeview?

With the addition of a gscan column this program becomes a CLI equivalent to the Cylc WUI. Perhaps:

  • cylc cui (as opposed to WUI)
  • cylc cli (more intuitive version of the above)
  • cylc term[inal] (meh)
  • cylc console (more fun)
  • cylc command-centre (woohoo)

@hjoliver
Copy link
Member

hjoliver commented Feb 17, 2020

Things I'll try and do in this PR:

Great.

Things I'll try and punt to future PRs:

Yep, good.

I think once filtering/state-totals/copyable text are done the use case for the old cylc monitor is expended?

I tend to agree, but we can treat that as a separate issue since the two utils won't have the same name anymore.

With the addition of a gscan column this program becomes a CLI equivalent to the Cylc WUI.

Ah, good point. The name might require some thought. cylc cui makes the most sense, but it ain't pretty.

@kinow
Copy link
Member

kinow commented Mar 26, 2020

Two approvals, if Travis passes with no issues I think it should be ready to be merged. Thanks @oliver-sanders !

@kinow
Copy link
Member

kinow commented Mar 26, 2020

One failure, but doesn't look related. I don't seem to have access to restart the build (button not displaying, maybe a Travis glitch). Doing some manual testing...

@kinow
Copy link
Member

kinow commented Mar 26, 2020

Found another weird case with five, but we can merge and move it to another issue if you prefer @oliver-sanders .

image

The terminal at the left top corner has the workflow five. The one below has the families2.

I did a cylc hold five, followed by cylc release five after some seconds. Then did the same for families2.

However, five's tui terminal never came back? Actually, I cannot use h, or the arrow keys. Simply nothing is working.

Did a quick strace on both cylc tui instances, not sure if helpful.

image

Then after hitting CTRL + C, it printed an exception and the process started working again.

image

@oliver-sanders
Copy link
Member Author

Hard to reproduce, if you get it again could you see if the suite is still responsive, try a graphql request. I'm going to keep Tui open on a suite in the background and see if it freezes at any point.

@kinow
Copy link
Member

kinow commented Mar 26, 2020

if you get it again could you see if the suite is still responsive

It was running, at least the logs were showing the tasks/jobs and new cyclepoints as it normally does. After clicking CTRL + C on the cylc tui terminal, that terminal started showing the updated data too (I didn't need to touch the suite).

Will merge & create an issue for that tomorrow morning so we can track the issue 👍 thanks!

@oliver-sanders
Copy link
Member Author

Would rather fix now but I'm finding it very difficult to reproduce and harder to debug. (I'm getting other issues on my platform at the moment which are hard to separate).

@oliver-sanders
Copy link
Member Author

Managed to reproduce, the traceback is coming from zmq.sugar.context.Context.__del__.

I think what's going on is that something has gone wrong with the connection so Tui has attempted to create a new connection. This has caused zmqto try and remove/recreate the context which, it would seem, causes traceback. Perhaps a bug in zmq. If so this problem is bigger than Tui.

Investigating...

@kinow
Copy link
Member

kinow commented Mar 26, 2020

This has caused zmqto try and remove/recreate the context which, it would seem, causes traceback. Perhaps a bug in zmq. If so this problem is bigger than Tui.

Not sure if related, but these are the issues in flow/uiserver I created when working on Cylc UI and that have some ZMQ stuff in the tracebacks:

@hjoliver
Copy link
Member

Investigating...

If you confirm that as the cause, maybe for the moment we should not attempt to automatically reconnect? (Or is it possible to do it in a way that is more like starting the application again from scratch)?

@dwsutherland
Copy link
Member

@oliver-sanders appears to be some consistent failures in Travis chunk3:

./tests/shutdown/18-client-on-dead-suite.t                     (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3, 5
./tests/shutdown/10-no-port-file.t                             (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  4

@oliver-sanders
Copy link
Member Author

Should be fixed now.

@hjoliver
Copy link
Member

Tests pass, two approvals, and this is quite self-contained ... going for the merge 🎉

@hjoliver hjoliver merged commit b940027 into cylc:master Mar 30, 2020
@kinow
Copy link
Member

kinow commented Mar 30, 2020

Synced local branch, then tested again with five and families2, both running with no-detach and with cylc tui attached to both.

Randomly held then released workflows. The first two times, it worked fine for five. Except the " that it displayed. Or is that a symbol for paused @oliver-sanders ?

image

Then on the third time the workflow exited, but that's a known issue. Started it again with cylc run --no-detach five, and now the workflow is running fine, no errors, and cylc tui five that was already running is now showing the tasks again. All good.

cylc hold five, and now the workflow is still happy, no errors in the logs. But cylc tui is now showing five as running. However, it might be something wrong with Cylc Flow. I odn't see it moving the logs and showing that health check line which I normally see when I hold five. But I can still ping/poll it. 🤔

Better to merge it and move it to a separate issue if others have a similar problem (might be something I am doing).

image

@hjoliver
Copy link
Member

Must admit I didn't test hold/release extensively.

" is definitely the hold indicator, but @kinow seems to be right that it doesn't get applied and removed consistently.

@hjoliver
Copy link
Member

Testing with a simpler suite:

[scheduling]
   cycling mode = integer
   initial cycle point = 1
   final cycle point = 10
   [[dependencies]]
        [[[P1]]]
            graph = "foo[-P1] => foo => bar & baz"
[runtime]
   [[root]]
       script = sleep 10

It appears that the hold indicator does not get applied and/or removed to complete cycles of finished tasks. (For which it is kind of meaningless anyway ... but probably easy for @oliver-sanders to fix).

@oliver-sanders
Copy link
Member Author

Yep, confirmed.

The tasks which appear to be out-of-sync should not be visible and are somehow left-over from the previous update.

If you run watch cylc dump <suite> against cylc tui <suite> the out-of-sync tasks should appear in Tui but not in Dump.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Mar 31, 2020

@kinow @hjoliver

After a bit of digging the issue is not here (you can replicate the same in WUI), Tui is simply displaying the data it was provided.

This is an artifact of the two task pools we currently have in Cylc, the SOS TaskPool and the data store. The data store contains tasks which aren't present in the TaskPool, when you hold the suite these tasks aren't held because they aren't present in the data store task pool.

@oliver-sanders oliver-sanders deleted the cylc-monitor branch March 31, 2020 14:11
@kinow
Copy link
Member

kinow commented Mar 31, 2020

Thanks for investigating it @oliver-sanders . I guess that might be something being fixed. @dwsutherland mentioned he has been finding small issues in the data store while working on the UIS deltas (if I understand that correctly).

@hjoliver
Copy link
Member

This is an artifact of the two task pools we currently have in Cylc, the SOS TaskPool and the data store...

Ah yes, makes sense, I should have realized that. I talked to @dwsutherland about this a while ago Not his fault, there's some ambiguity about what exactly to provide to the UI for the default "view" until one or other of the following is settled on and solved:

  • SOS: UI shows pure SOS task pool as now
  • SOS: UI shows "fake" n-distance window centered on "active tasks" (we noted in the workshop there are several problems with this)
  • SOD: UI shows n-distance window centered on the (n=0) SOD scheduler task pool

For work on SOD I really need to see just the scheduler task pool (so was using the old monitor which still feed off of the old scheduler data store).

@oliver-sanders
Copy link
Member Author

This isn't a bug, it's absolutely correct, when you hold a suite you are only holding active tasks, not archived ones.

  • In the WUI we can display all n>1 tasks with opacity 0.5 which will make it clear that they are not "live".
  • In the TUI that isn't really an option, we will have to do something more clever.

@hjoliver
Copy link
Member

hjoliver commented Apr 1, 2020

What’s not a bug? Showing the hold indicator on “archived” tasks, or displaying those archived tasks at all in the current UI? I would say the former is a bug (as you say, you can’t hold an archived task), and the latter isn’t a bug but is sort of wrong in that it doesn’t conform to our current SOS task pool or any of the future options as I listed above.

@oliver-sanders
Copy link
Member Author

@kinow seems to be right that it doesn't get applied and removed consistently

I guess that might be something being fixed

This isn't a bug, it's absolutely correct

What’s not a bug?

When you hold a suite only the "active" tasks actually get held and so show up that way in the GUI.

Technically this is the correct behaviour, the "archived" tasks are not held, the database backs this up.

@hjoliver
Copy link
Member

hjoliver commented Apr 7, 2020

Technically this is the correct behaviour, the "archived" tasks are not held, the database backs this up.

I agree with that, but I thought (maybe I was wrong?) that the hold indicator was being applied to archived tasks in cylc tui (or at least not removed from them once they left the task pool).

@oliver-sanders
Copy link
Member Author

Looks like tasks keep the status they had when they were in the task pool. Subsequent holds, etc do not effect "archived" tasks.

@dpmatthews dpmatthews modified the milestones: cylc-8.0.0, cylc-8.0a2 May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better "cylc monitor" - rewrite for ncurses?
5 participants