WIP redfishpower: support cray supercomputing ex chassis & power control hierarchy #150

chu11 · 2024-02-26T17:15:36Z

Per discussion in #81, #128, #129

Will split this into multiple PRs later on ... unfortunately all of this had to be done before I could even begin to test :P

The series of commits:

- support setplugs command (w/o parents)
- support setpath command
- support plug substitution
- support parents with setplugs config
- support HPE Cray Supercomputing EX chassis device file (hopefully this is the right name)
- lots of new tests and new device files for tests in `t/etc/`

what parenting does

If ancestors are all on, then redfishpower can perform power on/off/status on the target.

If ancestor is off, power status is defined as off for all descendants. Power on/off cannot be done.

As a special case, if powering on both ancestors and children (e.g. pm --on cmm,blade0,node[0-1]), descendants will wait until ancestor power ons are completed first. This could increase the runtime of the powerman client b/c of multiple "rounds" of power on, so we need a bigger timeout. I chose 100 seconds for the time being (need to test).

As a special case, if powering off both ancestors and children, descendants will wait until ancestor power offs are complete. Then by definition, descendants are now all off.

(Just to help differentiate the special cases, there is a difference between pm --on blade0 and pm --on cmm0,blade0. The former will check cmm0 first to determine if blade0 can be turned on. The latter will turn on cmm0 first, then, if that is successful, power on blade0.)
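To illustrate the "rounds" idea, here is a minimal Python sketch that groups the requested targets by their depth in the hierarchy, so that ancestors are handled before descendants. The `PARENTS` map and the `power_rounds` name are hypothetical and not taken from redfishpower, and the sketch assumes the hierarchy has no loops.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy for illustration only.
PARENTS: Dict[str, Optional[str]] = {
    "cmm0": None,
    "blade0": "cmm0",
    "node0": "blade0",
    "node1": "blade0",
}


def depth(target: str, parents: Dict[str, Optional[str]]) -> int:
    """Number of ancestors above `target` (assumes no loops)."""
    count = 0
    node = parents.get(target)
    while node is not None:
        count += 1
        node = parents.get(node)
    return count


def power_rounds(targets: List[str],
                 parents: Dict[str, Optional[str]]) -> List[List[str]]:
    """Group requested targets into rounds, shallowest level first."""
    rounds: Dict[int, List[str]] = {}
    for target in targets:
        rounds.setdefault(depth(target, parents), []).append(target)
    return [rounds[level] for level in sorted(rounds)]


rounds = power_rounds(["node0", "cmm0", "blade0", "node1"], PARENTS)
```

Each round would only start once every target in the previous round reports the expected status, which is where the extra client runtime comes from.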

If any ancestors have status unknown or get an error, that status carries to descendants and results in errors (for status query, that results in "unknown" status).
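The status rules above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the redfishpower code: the `PARENTS` map and `effective_status` name are hypothetical, statuses are plain strings, and checking ancestors from the root down is my own choice of precedence.

```python
from typing import Dict, Optional

# Hypothetical hierarchy for illustration only.
PARENTS: Dict[str, Optional[str]] = {
    "cmm0": None,
    "blade0": "cmm0",
    "node0": "blade0",
    "node1": "blade0",
}


def effective_status(target: str, raw: Dict[str, str]) -> str:
    """Return the status to report for `target`.

    `raw` holds the status each plug would report on its own. The first
    ancestor (root first) that is not "on" decides the result for
    everything below it: "off" stays "off", "unknown" stays "unknown".
    Assumes the parent map has no loops.
    """
    chain = []
    node = PARENTS[target]
    while node is not None:
        chain.append(node)
        node = PARENTS[node]
    for ancestor in reversed(chain):  # root first
        if raw[ancestor] != "on":
            return raw[ancestor]
    return raw[target]


blade_off = {"cmm0": "on", "blade0": "off", "node0": "on", "node1": "on"}
cmm_unknown = {"cmm0": "unknown", "blade0": "on", "node0": "on", "node1": "off"}
```

With `blade_off`, both nodes are reported as off without being queried; with `cmm_unknown`, everything below cmm0 is reported as unknown.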

annoyances

Gotta do "%%p" for plug substitution instead of "%p", b/c we're passing a string into redfishpower, which stores and parses it, whereas most device scripts are probably using the '%s' as a "format". This is also one of the reasons I chose "%p" instead of "%s".
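A small Python sketch of the substitution, for illustration only. The `substitute_plug` name and the example path are hypothetical, and it follows the simplification of handling at most one "%p" with no "%%" escape processing. The `%` formatting line is just an analogy for why a printf-style layer needs "%%p" to pass a literal "%p" through.

```python
def substitute_plug(template: str, plug: str) -> str:
    """Replace the first "%p" in `template` with `plug`.

    At most one "%p" is expected and no "%%" escaping is attempted.
    """
    return template.replace("%p", plug, 1)


# Analogy only: in a printf-style format, "%%" collapses to a literal "%",
# so "%%p" in the outer layer arrives as "%p" in the stored string.
stored = "redfish/v1/Chassis/%%p/Actions/Chassis.Reset" % ()  # hypothetical path

# Later, the stored string is expanded per plug.
path = substitute_plug(stored, "Blade0")
```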

testing

All testing is done in simulation mode; I still need to test on real hardware. Hopefully it works. When running in verbose mode, I see the messages going out in the right order.

assumptions

- With parenting, some code simply assumes no loops are possible. I think this is a fair assumption, to avoid adding excess code for a rare case.
- With "%p" substitution, the code assumes at most one "%p" and no need for "%%" escaping. I didn't implement a proper "loop through string, look for escape chars" kinda thing. Seemed excessive given what we actually need. But maybe I shouldn't have been so lazy.
- If you power on a parent, once the status of the parent is "on", all children can be powered on; no delay is necessary. But a delay wouldn't be too hard to add; the "delay code" is already there b/c of the status polling delay code. (Note: this "delay" is not in powerman land but in redfishpower land.)
- If the user does pm --on cmm0,blade0,node0 and cmm0/blade0 are on but node0 is not, I just assume the redfish protocol will work out (i.e. sending an "on" to a target that is already on will return success). But I need to try this against real hardware.

mini concerns

When a parent is "off", all children are assumed to be "off", including any node that is missing. This may not necessarily be what the user expects powerman to output. I think this is acceptable b/c 99% of the time parents are on (i.e. chassis management head). It's the nodes down below that are typically turned off. Unfortunately, I think we just have to go with this, as the alternative (sending messages to nodes that won't respond) is what we are trying to avoid. If this is a big deal, we can add some type of "whatsup"-like support, where we ping in the background and identify those targets as gone/missing.

chu11 · 2024-03-01T16:27:34Z

re-pushed with some changes given initial testing against hardware

> If ancestor is off, power status is defined as off for all descendants. Power on/off cannot be done.

Now, if an ancestor is off, a power off is considered successful, not an error. A power on is still an error.

> As special case, if powering on both ancestors and children (e.g. pm --on cmm,blade0,node[0-1]), descendants will wait until ancestor power ons are completed first. This could lead to increased runtime of powerman client b/c of multiple "rounds" of power on and we need bigger timeout. I chose 100 seconds for time being (need to test).

"Phased" power on requires delays between levels. Between a blade and a node, the delay would be > 2 mins, probably 3 minutes to be on the safe side. Adding the delay between cmm & blade, we're probably looking at a 5-7 minute powerman timeout for pm --on cmm,blade,node to work. The feeling is that this:

A) is horrible
B) gives the appearance that powerman is hung
C) means that, in the event of a real error w/ redfishpower/powerman, the timeout is now 5-7 minutes
D) is fraught with danger given the potential error handling. For example, after powering on the blade and the blade is on, the node responds to a "power on" with "network error". Is it a network error b/c the node is not up yet and can't respond (normal case)? Or is the node not there and the blade unpopulated? I have no idea. Given El Cap sizes, an occasional node being removed for servicing is to be expected. Sooo ... should I just keep on retrying? Or not?

Sooo ... we made a rule for now. You cannot power on two targets that have a parent/child relationship. So you can't do pm --on blade0,node[0-1] if blade0 is the parent of node[0-1]. But you can do pm --on blade0,node[14-15] where the parent of node[14-15] is a different blade.

You can still power off targets with a parent/child relationship.
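The rule can be sketched as a check over the requested targets. This is a hypothetical Python illustration, not the actual implementation: the `PARENTS` map, the function names, and the error text are made up, and treating any ancestor (not just the direct parent) as a conflict is my assumption.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical hierarchy: blade0 holds node[0-1], blade7 holds node[14-15].
PARENTS: Dict[str, Optional[str]] = {
    "cmm0": None,
    "blade0": "cmm0",
    "blade7": "cmm0",
    "node0": "blade0",
    "node1": "blade0",
    "node14": "blade7",
    "node15": "blade7",
}


def related_pairs(targets: Iterable[str],
                  parents: Dict[str, Optional[str]]) -> List[Tuple[str, str]]:
    """Return (ancestor, descendant) pairs where both were requested."""
    requested = set(targets)
    pairs = []
    for target in sorted(requested):
        node = parents.get(target)
        while node is not None:
            if node in requested:
                pairs.append((node, target))
            node = parents.get(node)
    return pairs


def check_power_on(targets: Iterable[str],
                   parents: Dict[str, Optional[str]]) -> None:
    """Raise ValueError if a power on request mixes related targets."""
    pairs = related_pairs(targets, parents)
    if pairs:
        detail = ", ".join(f"{a} -> {d}" for a, d in pairs)
        raise ValueError(f"cannot power on related targets together: {detail}")
```

Here `check_power_on(["blade0", "node0", "node1"], PARENTS)` raises, while `check_power_on(["blade0", "node14", "node15"], PARENTS)` passes. A power off request would simply skip this check.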

Problem: A comment about the status polling interval is out of date. Update it to indicate the power on/off wait range is upwards of 50 seconds.

Problem: The status polling interval is hard coded to 1 second long. This can result in an excessive number of polling messages being sent when it is known that some hardware takes 20-50 seconds to complete a power operation. Solution: Support a modified "exponential backoff" of the status polling interval. The modified algorithm is based on observations of how long it typically takes to complete power operations on hardware. The status polling interval begins at one second, but it gets capped at 4 seconds.
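As a rough illustration of a capped backoff, the Python sketch below doubles the interval from 1 second up to a 4 second cap. The commit describes a "modified" algorithm tuned to observed hardware timings, whose exact schedule is not given here, so plain doubling and the names `poll_intervals` and `polls_needed` are assumptions.

```python
import itertools
from typing import Iterable, Iterator


def poll_intervals(start: float = 1.0, cap: float = 4.0) -> Iterator[float]:
    """Yield polling intervals that double each time, capped at `cap`."""
    interval = start
    while True:
        yield interval
        interval = min(interval * 2, cap)


def polls_needed(wait: float, intervals: Iterable[float]) -> int:
    """Count the polls sent before `wait` seconds have elapsed."""
    elapsed, count = 0.0, 0
    for interval in intervals:
        if elapsed >= wait:
            break
        elapsed += interval
        count += 1
    return count


first_five = list(itertools.islice(poll_intervals(), 5))
fixed_polls = polls_needed(50, itertools.repeat(1.0))
backoff_polls = polls_needed(50, poll_intervals())
```

Under these assumptions, an operation that takes 50 seconds costs 14 polls instead of 50.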

Problem: When power control/query to a target fails, there is no way for a user to know why it failed except through the very verbose --telemetry output. Add a new --diag to powerman that will inform powermand to send diagnostic information about why a power operation failed. Common errors from the same host will be collapsed into a hostrange. This only works with setplugstate and the new setresult statement.
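The collapsing step might look something like the sketch below. It is a Python illustration only: the dictionary shapes, the function names, and the error strings are hypothetical and do not reflect powerman's actual --diag output or its hostlist code. Leading zeros in host numbers are not preserved.

```python
import re
from collections import defaultdict
from typing import Dict, List


def to_hostrange(hosts: List[str]) -> str:
    """Collapse names with a shared prefix, e.g. node[0-1,3]."""
    by_prefix = defaultdict(list)
    parts = []
    for host in hosts:
        match = re.fullmatch(r"(.*?)(\d+)", host)
        if match:
            by_prefix[match.group(1)].append(int(match.group(2)))
        else:
            parts.append(host)  # no numeric suffix, keep as-is
    for prefix, nums in by_prefix.items():
        nums = sorted(set(nums))
        if len(nums) == 1:
            parts.append(f"{prefix}{nums[0]}")
            continue
        # Build runs of consecutive numbers.
        runs, start, prev = [], nums[0], nums[0]
        for num in nums[1:]:
            if num != prev + 1:
                runs.append((start, prev))
                start = num
            prev = num
        runs.append((start, prev))
        body = ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)
        parts.append(f"{prefix}[{body}]")
    return ",".join(parts)


def collapse_diag(errors: Dict[str, str]) -> Dict[str, str]:
    """Group hosts that reported the same error message."""
    by_message = defaultdict(list)
    for host, message in errors.items():
        by_message[message].append(host)
    return {to_hostrange(hosts): msg for msg, hosts in by_message.items()}


collapsed = collapse_diag({
    "node0": "network error",
    "node1": "network error",
    "node2": "timeout",
    "node3": "network error",
})
```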

Problem: The new --diag option is not documented. Add it in powerman(1).

Problem: The --bad-plug option in vpcd cannot be called multiple times to specify multiple bad plugs. Support calling --bad-plug multiple times by putting the bad plugs into an array.

Problem: There is no coverage for the new --diag option. Add tests in new t0036-diag.t tests.

Add device file for a HPE Cray Supercomputing EX Chassis. Fixes chaos#128

Problem: There is no testing for the new HPE Cray Supercomputing EX Chassis device file. Add new tests in t0029-redfish.t.

chu11 · 2024-04-05T03:34:44Z

closing this as almost everything has been divided up

mergify · 2024-04-05T03:38:05Z

⚠️ The sha of the head commit of this PR conflicts with #173. Mergify cannot evaluate rules on this PR. ⚠️

chu11 force-pushed the redfishpower_chassis_support branch 8 times, most recently from c8c31b6 to 7e2883e Compare February 28, 2024 19:40

chu11 mentioned this pull request Feb 29, 2024

powerman: support status_ranged script #152

Open

chu11 force-pushed the redfishpower_chassis_support branch from 7e2883e to 3f4200d Compare March 1, 2024 16:15

chu11 force-pushed the redfishpower_chassis_support branch 6 times, most recently from 9eaf69e to 85dda4d Compare March 5, 2024 20:03

chu11 force-pushed the redfishpower_chassis_support branch 3 times, most recently from 986fe67 to 53b4b5e Compare March 21, 2024 23:03

chu11 force-pushed the redfishpower_chassis_support branch from 53b4b5e to aeae52a Compare March 26, 2024 19:59

chu11 added 6 commits April 4, 2024 15:45

redfishpower: update status poll comment

99c6aa8

man: document --diag in powerman(1)

caa1911

t: support multiple calls to --bad-plug in vpcd

f6d50e3

t: add coverage for new --diag option

70bbdc6

chu11 force-pushed the redfishpower_chassis_support branch from aeae52a to 6b5f38d Compare April 4, 2024 22:46

chu11 added 2 commits April 4, 2024 16:37

etc: add device file for HPE Cray EX Chassis

6d6b036

t: test hpe cray supercomputing ex chassis device

b95676d

chu11 force-pushed the redfishpower_chassis_support branch from 6b5f38d to b95676d Compare April 5, 2024 00:14

chu11 closed this Apr 5, 2024
