Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite flux-resource status #4997

Merged
merged 14 commits into from
Mar 20, 2023
Merged

rewrite flux-resource status #4997

merged 14 commits into from
Mar 20, 2023

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Mar 16, 2023

This PR is a WIP. Still need to write a few more tests and documentation. However, since this makes some fundamental changes to the flux resource status command, it would be good to get feedback before finishing up those items.

This is pretty much a full rewrite of flux resource status. It is broken up into a few parts, but the final rewrite is one big change, I wasn't sure how else to accomplish that.

  • First some groundwork is laid to make a reusable flux.util.Deduplicator class, which combines items that would format to the same string given an output formatter, a list of fields that should be excluded from duplicate detection, and a method to combine similar items.
  • Then, some reusable code for fetching status from both the resource and scheduler module is added to flux.resource.status
  • Finally, flux resource status is rewritten using these new pieces.

The major changes in flux resource status include:

  • Move offline and online from "states" to a new field "status". (results can still be "filtered" by e.g. --states=online)
  • Rename the {state} header from STATUS to STATE
  • all is no longer a "state". Instead --states=all displays all states even if they have zero resources
  • Ensure a rank only appears in one state at a time. The states currently are exclude, drained, draining or avail (i.e. not excluded or drained/draining).
  • The default {state} output field appends an asterisk * to the state to indicate the rank is offline.
  • A new output field {statex} is available to get the state without the *
  • The Deduplicator class is used to combine output lines, which fixes flux resource drain does not collate down to a single line due to timing #4985 among other things

(Note this all applies to flux resource drain (no args) as well since that command is just a special case of flux resource status)

Fixes #4985
Fixes #4910

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

Some examples of changed output:

Offline nodes now reported separately for each "state":
Before:

 grondo@corona211:~/git/flux-core.git$ flux resource status
    STATUS NNODES NODELIST
     avail    114 corona[171,173-194,196-201,203-207,213-247,250-259,261-283,285-296]
   offline      7 corona[172,195,202,248-249,260,284]
   exclude      4 corona[81-82,211-212]
  drained*      1 corona284

After

 grondo@corona211:~/git/flux-core.git$ src/cmd/flux resource status
     STATE NNODES NODELIST
     avail    114 corona[171,173-194,196-201,203-207,213-247,250-259,261-283,285-296]
    avail*      6 corona[172,195,202,248-249,260]
   exclude      4 corona[81-82,211-212]
  drained*      1 corona284

flux resource drain combines similar lines based on actual format, not timestamp to millisecond
Before

 grondo@fluke108:~/git/flux-core.git$ flux resource drain
TIME         STATUS   REASON                         NODELIST
Mar15 18:08  drained  broker was unresponsive        fluke10
Mar15 20:12  drained  broker was unresponsive        fluke66
Mar15 18:00  drained  broker was unresponsive        fluke70
Mar15 18:00  drained  broker was unresponsive        fluke74
Mar15 18:00  drained  broker was unresponsive        fluke75
Mar15 18:00  drained  broker was unresponsive        fluke77
Mar15 18:00  drained  broker was unresponsive        fluke87
Mar15 18:05  drained  broker was unresponsive        fluke88

After

 grondo@fluke108:~/git/flux-core.git$ src/cmd/flux resource drain
TIME         STATE    REASON                         NODELIST
Mar15 18:08  drained  broker was unresponsive        fluke10
Mar15 20:12  drained  broker was unresponsive        fluke66
Mar15 18:00  drained  broker was unresponsive        fluke[70,74-75,77,87]
Mar15 18:05  drained  broker was unresponsive        fluke88

No more "all" state, but "all" selects all states even if empty:

Before:

grondo@fluke108:~/git/flux-core.git$ flux resource status -s all
    STATUS NNODES NODELIST
       all    101 fluke[1,3,108,6-103]

After:

 grondo@fluke108:~/git/flux-core.git$ src/cmd/flux resource status -s all
     STATE NNODES NODELIST
     avail     84 fluke[6-9,11-16,18-23,25-60,62-65,67-69,71-73,76,78,80-86,89-91,93-101,103]
    avail*      6 fluke[17,24,61,79,92,102]
   exclude      3 fluke[1,3,108]
  draining      0 
   drained      8 fluke[10,66,70,74-75,77,87-88]

online/offline act as filters, not a state in and of themselves:

 grondo@fluke108:~/git/flux-core.git$ flux resource status -s offline
    STATUS NNODES NODELIST
   offline      6 fluke[17,24,61,79,92,102]
 grondo@fluke108:~/git/flux-core.git$ src/cmd/flux resource status -s offline
     STATE NNODES NODELIST
    avail*      6 fluke[17,24,61,79,92,102]
  exclude*      0 

Note you can get similar output with a format that displays the status instead of state:

 grondo@fluke108:~/git/flux-core.git$ src/cmd/flux resource status -o "{status:>12} {nnodes:>8} {nodelist}" -s offline
      STATUS   NNODES NODELIST
     offline        6 fluke[17,24,61,79,92,102]
 grondo@fluke108:~/git/flux-core.git$ src/cmd/flux resource status -o "{status:>12} {nnodes:>8} {nodelist}" 
      STATUS   NNODES NODELIST
      online       95 fluke[1,3,6-16,18-23,25-60,62-78,80-91,93-101,103,108]
     offline        6 fluke[17,24,61,79,92,102]

@garlick
Copy link
Member

garlick commented Mar 16, 2023

Great!

The one thing that gives me pause is listing nodes that are turned off as avail*. I wonder if the default format should include both {statex} and {status}.

picl[4-6] are turned off:

Before

$ flux resource status
    STATUS NNODES NODELIST
     avail      2 picl[1-2]
   offline      3 picl[4-6]
   exclude      1 picl0
   drained      2 picl[3,7]

After

$ src/cmd/flux resource status
     STATE NNODES NODELIST
     avail      2 picl[1-2]
    avail*      3 picl[4-6]
   exclude      1 picl0
   drained      2 picl[3,7]

Maybe slightly more informative?

$ src/cmd/flux resource status  -o "{state:>12} {status:>8} {nnodes:>8} {nodelist}"
       STATE   STATUS   NNODES NODELIST
       avail   online        2 picl[1-2]
      avail*  offline        3 picl[4-6]
     exclude   online        1 picl0
     drained   online        2 picl[3,7]

Aside, I couldn't get {statex} to work:

$ src/cmd/flux resource status  -o "{statex:>12} {status:>8} {nnodes:>8} {nodelist}"
flux-resource: ERROR: Unknown format field: statex

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

Ah, sorry, I forgot to add a header for statex. I thought I had it but must have dropped it in a rebase.

I also don't like avail*. The only problem with including the STATUS column is that it wastes a lot of space - I wonder if we could come up with a 2 character wide column alternative?

@garlick
Copy link
Member

garlick commented Mar 16, 2023

How about UP with y/n values?

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

Ok, I added statex to the headings and forced a push.

BTW, I would have renamed state to the pure state name (without the asterisk) and have statex be the one with the asterisk, but that might break existing user formats so I didn't go that way.

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

Hm, what about something like this?

$ flux resource status -o "{state:>8} {st:>2} {nnodes:>4} {nodelist}"
   STATE ON NNODES NODELIST
   avail  ✓    2 asp,asp
  avail*  ✗    1 asp
 drained  ✓    1 asp

@garlick
Copy link
Member

garlick commented Mar 16, 2023

I like it!

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

How about UP with y/n values?

Sorry I missed that comment initially. That seems fine to me.

@garlick
Copy link
Member

garlick commented Mar 16, 2023

I think I like yours more but either way works for me.

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

I like it!

UP is probably better though? Also I wonder if we're going to have problems with terminals displaying mojibake in devolved hpc environments. Maybe better to use your suggestion of y/n or similar.

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

I'll add a commit that makes this the new default?

$ flux resource status -o "{statex:>8} {up:>2} {nnodes:>6} {nodelist}"
   STATE UP NNODES NODELIST
   avail  y      2 asp,asp
   avail  n      1 asp
 drained  y      1 asp

Edit: fixed the format

@garlick
Copy link
Member

garlick commented Mar 16, 2023

Sounds good to me.

@grondo
Copy link
Contributor Author

grondo commented Mar 16, 2023

Ok I pushed the updated default format but I'll still have to fix all the tests :-(

Comment on lines +553 to +559
sys.stdout = open(
sys.stdout.fileno(), "w", encoding="utf8", errors="surrogateescape"
)

Check warning

Code scanning / CodeQL

File is not always closed

File is opened but is not closed.
src/cmd/flux-resource.py Fixed Show fixed Hide fixed
Problem: The ability to easily copy an instance of the OutputFormat
class, possibly with some fields removed, would be useful when
utilities wish to attempt deduplication of similar lines - but this
functionality does not exist.

Add a copy method to the OutputFormat class with an optional
`except_fields` parameter. This allows copying of format strings with
some fields removed.
Problem: It is probably not useful to print a blank line in the
OutputFormat print_items() method.

Skip blank lines in flux.util.OutputFormat.print_items().
Problem: The combination of like lines in formatted output is a
recurring use case in utilities that output line-based output,
e.g. the flux-resource list, status and drain commands. However,
each utility implements the deduplication of lines in their own way.

Attempt to add a reusable class that automates deduplication of
similar lines based on formatted output. Given an output formatter
and a function used to combine similar items, the Deduplicator class
simply combines items where the supplied formatter (minus any excluded
fields) renders the exact same line.

The class can then be iterated like a list to output the combined
items.
Problem: Functions for accessing the scheduler's view of resources
are present in the Python API, but no such functionality exists for
obtaining the resource module view of resource, including drained
and excluded ranks.

Add a new function flux.resource.resource_status() which returns a
ResourceStatus object containining a combination of information from
the resource.status and sched.resource-status RPCs. The resulting
object can be used to get idsets of ranks in various states, e.g.
online, offline, excluded, allocated, drained, draining, along with
any drained rank extra information such as timestamp and reason.
Problem: The `flux resource status` subcommand is awkward to use in
many situations, since it conflates resource "state" (i.e. available,
excluded, drained), with online/offline status. This results in nodes
appearing in multiple lines, which can be confusing, and erroneous
output when attempting to combine similar formatted lines in the
output.

There are also other design issues, such as the use of the floating
point drain timestamp when combining similar drained nodes in output,
which means nodes must be drained at the same millisecond to be
combined.

Rewrite `flux resource status` using the newly added interfaces in
flux.resource.status, and the new flux.util.Deduplicator class.
High level changes include:

* Add a new field for online/offline status, instead of treating the
  online status as a "state".

* Rename the state field header to STATE, and add a new status field
  which returns one of "online" or "offline".

* A node/rank may now only appear in one of the displayed output lines.

* Lines are combined based on the current output format using the
  Deduplicator class.

* The "all" state is dropped. When selected, `--state=all` just displays
  all states, even if they have zero resources in them.

* The "online", "offline" states are dropped. When selected,
  `--state=online|offline` filters the output to only include resource
  with the specified online status.

* Instead of displaying a row of offline ranks/nodes, the command now
  tags offline nodes in a given state (avail, exclude, draining, drained)
  with an asterisk if they are offline. Therefore 'avail*' means the
  resources would be available if they were not offline.
Problem: expected output of `flux resource status` has changed.

Update tests given the changes.
Problem: The status field takes up too much horizontal space
when used in flux-resource status output.

Add an "up" field with a heading of UP and alternate non-ascii
and ascii "y/n" values. The ascii values are available as `up.ascii`.
@codecov
Copy link

codecov bot commented Mar 20, 2023

Codecov Report

Merging #4997 (e09c08f) into master (2fcf4dd) will decrease coverage by 0.48%.
The diff coverage is 98.54%.

❗ Current head e09c08f differs from pull request most recent head e89d31b. Consider uploading reports for the commit e89d31b to get more accurate results

@@            Coverage Diff             @@
##           master    #4997      +/-   ##
==========================================
- Coverage   83.58%   83.11%   -0.48%     
==========================================
  Files         440      444       +4     
  Lines       74740    76257    +1517     
==========================================
+ Hits        62471    63379     +908     
- Misses      12269    12878     +609     
Impacted Files Coverage Δ
src/cmd/flux-resource.py 96.55% <98.07%> (-0.31%) ⬇️
src/bindings/python/flux/resource/status.py 98.43% <98.43%> (ø)
src/bindings/python/flux/resource/__init__.py 100.00% <100.00%> (ø)
src/bindings/python/flux/util.py 95.02% <100.00%> (+0.35%) ⬆️

... and 96 files with indirect coverage changes

@grondo
Copy link
Contributor Author

grondo commented Mar 20, 2023

I've updated the tests and documentation here, so this PR is probably ready for a more thorough review.

Note that there is still one case where a node may appear in multiple lines of flux resource status output (something this PR purports to fix) until #4999 is resolved,

@grondo grondo changed the title WIP: rewrite flux-resource status rewrite flux-resource status Mar 20, 2023
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, and I really like the new output format. Very readable to me.

Just a couple typos to fix.

@@ -36,7 +36,7 @@ class FluxResourceConfig(UtilConfig):
builtin_formats["status"] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the commit message for "status: support color in UP column": s/rsources/resources/ and s/throught/through/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed that and will force a push and set MWP.

Problem: It isn't clear which lines of flux resource status output
correspond to online vs offline nodes.

Add the 'up' field to the default `flux resource status` output.
Problem: The online/offline state of resources could be made more
readily apparent through the use of color.

Add a new field ({color_up}) that emits an ANSI color escape sequence
based on the online/offline status of a resource status line. Also
add a corresponding field to reset colors ({color_off}).

Change `{up:>2}` in the default output format for flux-resource status
to `{color_up}{up:>2}{color_off}` so that the online/offline indicator
character is colorized in status output.
Problem: The "long" output format should probably include the UP
column, which is present in the default format.

Add the colorized UP column to `flux resource status -o long`.
Problem: Python doesn't handle non-ascii characters in output when
the current environment doesn't support it. This causes an exception
to be thrown instead of escaping the characters.

Reopen stdout with the surrogateescape option to avoid these errors.
Problem: The flux-resource status subcommand has been rewritten,
but the tests have not been updated or expanded.

Expand the tests in t2351-resource-status-input.t to include coverage
of many of the newer features in the flux-resource status command.
Problem: The description of subcommands in the flux-resource(1)
man page is compact and therefore difficult to read.

Split descriptions of commands in flux-resource(1) into multiple lines
so it is easier to read. Give each option its own line for the most
part, and move more important options first.

Expand the description of the `status` command so it is clearer what
the command does.
Problem: Recent changes in the `flux resource status` command are
not reflected in the flux-resource(1) man page.

Update flux-resource(1) to describe the new version of the status
subcommand.
@mergify mergify bot merged commit 0a99624 into flux-framework:master Mar 20, 2023
@grondo grondo deleted the issue#4985 branch May 1, 2023 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants