WIP redfishpower: support cray supercomputing ex chassis & power control hierarchy #150

Closed
wants to merge 8 commits into from

Conversation

chu11
Member

@chu11 chu11 commented Feb 26, 2024

Per discussion in #81, #128, #129

Will split this into multiple PRs later on ... unfortunately all of this had to be done before I even begin to test :P

the series of commits:

  • support setplugs command (w/o parents)
  • support setpath command
  • support plug substitution
  • support parents with setplugs config
  • support HPE Cray Supercomputing EX chassis device file (hopefully this is the right name)
  • lots of new tests and new device files for tests in `t/etc/`

what parenting does

If ancestors are all on, then redfishpower can perform power on/off/status on the target.

If ancestor is off, power status is defined as off for all descendants. Power on/off cannot be done.

As a special case, if powering on both ancestors and children (e.g. pm --on cmm,blade0,node[0-1]), descendants will wait until ancestor power ons are completed first. This could increase the runtime of the powerman client b/c of multiple "rounds" of power on, so we need a bigger timeout. I chose 100 seconds for the time being (need to test).

As a special case, if powering off both ancestors and children, descendants will wait until ancestor power offs are complete. Then, by definition, descendants are now all off.

(Just to help differentiate the special cases: there is a difference between pm --on blade0 and pm --on cmm0,blade0. The former will check cmm0 first to determine if blade0 can be turned on. The latter will turn on cmm0 first and, if that is successful, then power on blade0.)

If any ancestors have status unknown or get an error, that status carries to descendants and results in errors (for status query, that results in "unknown" status).
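The status rules above can be sketched roughly as follows. This is only an illustration in Python, not the redfishpower code: the function name, the `parent_of` and `raw_status` dictionaries, and the status strings are all hypothetical, and the sketch assumes the parent chain has no loops.

```python
# Hypothetical sketch: names and data layout are illustrative only.
def effective_status(target, parent_of, raw_status):
    """Return 'on', 'off', or 'unknown' for target, honoring ancestor state.

    parent_of maps a plug to its parent (a missing key means no parent).
    raw_status maps a plug to the status the hardware itself reports.
    Assumes the parent chain contains no loops.
    """
    # Collect ancestors, nearest parent first.
    chain = []
    node = parent_of.get(target)
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)

    # Walk from the topmost ancestor down; the first one that is not "on"
    # decides the answer for everything below it.
    for ancestor in reversed(chain):
        status = raw_status.get(ancestor, "unknown")
        if status != "on":
            return status  # "off" or "unknown" carries to descendants

    # All ancestors are on, so the target's own status applies.
    return raw_status.get(target, "unknown")


parent_of = {"blade0": "cmm0", "node0": "blade0", "node1": "blade0"}
raw_status = {"cmm0": "on", "blade0": "off", "node0": "on"}
print(effective_status("node0", parent_of, raw_status))
```

Here node0 is reported as off because blade0 is off, even though the hypothetical raw status for node0 says "on".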

annoyances

Gotta do "%%p" for plug substitution instead of "%p", b/c we're passing a string into redfishpower, which stores it and parses it later, vs. most device scripts, which are probably using "%s" as a "format". This is also one of the reasons I chose "%p" instead of "%s".
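To make the two layers concrete, here is a rough Python analogy. It assumes the outer layer treats the string as a printf-style format (so "%%" collapses to "%") and that the stored string later gets its "%p" replaced by the plug name. Both function names and the example path are hypothetical, not the actual powerman or redfishpower code.

```python
# Hypothetical sketch of the two layers; not the actual implementation.

def through_format_layer(script_text):
    # Stand-in for a printf-style layer: "%%" collapses to a literal "%".
    return script_text % ()


def substitute_plug(template, plug):
    # Replace only the first "%p", matching the stated assumption of at
    # most one "%p" per string and no "%%" escape handling at this layer.
    return template.replace("%p", plug, 1)


# What the device file would contain (hypothetical path).
in_device_file = "redfish/v1/Systems/%%p"
# What the inner layer ends up storing.
stored = through_format_layer(in_device_file)
# What is used once a plug name is known.
print(substitute_plug(stored, "Node0"))
```

A string with no "%p" passes through `substitute_plug` unchanged.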

testing

All testing is done in simulation mode; I still need to test on real hardware. Hopefully it works. When running in verbose mode, I see the messages going out in the right order.

assumptions

  • with parenting, some code simply assumes no loops are possible. I think this is a fair assumption to avoid adding excess code for a rare case.

  • with "%p" substitution, assumes at most one "%p" and no need for "%%" escaping. I didn't implement a proper "loop through string, look for escape chars" kinda thing. Seemed excessive given what we actually need. But maybe I shouldn't have been so lazy.

  • if you power on a parent, once the status of the parent is "on", all children can be powered on; no delay is necessary. But a delay wouldn't be too hard to add, the "delay code" is already there b/c of the status polling delay code. (note, this "delay" is not in powerman land but in redfishpower land).

  • If the user does pm --on cmm0,blade0,node0 and cmm0/blade0 are on but node0 is not, I just assume the redfish protocol will work it out (i.e. sending an "on" to something that's already on will return success). But need to try against real hardware.

mini concerns

Due to the parent being "off", it is assumed all children are "off", including any node that is missing. This may not necessarily be what the user expects powerman to output. I think this is acceptable b/c 99% of the time parents are on (i.e. the chassis management head). It's the nodes down below that are typically turned off. Unfortunately, I think we just have to go with this, as the alternative (sending messages to nodes that won't respond) is what we are trying to avoid. If this is a big deal, we can add some type of "whatsup"-like support, where we can ping in the background and identify those targets as gone/missing.

@chu11 chu11 force-pushed the redfishpower_chassis_support branch 8 times, most recently from c8c31b6 to 7e2883e Compare February 28, 2024 19:40
@chu11 chu11 force-pushed the redfishpower_chassis_support branch from 7e2883e to 3f4200d Compare March 1, 2024 16:15
@chu11
Member Author

chu11 commented Mar 1, 2024

re-pushed with some changes given initial testing against hardware

> If ancestor is off, power status is defined as off for all descendants. Power on/off cannot be done.

Now, if an ancestor is off, a power off is considered successful, not an error. A power on is still an error.

> As special case, if powering on both ancestors and children (e.g. pm --on cmm,blade0,node[0-1]), descendants will wait until ancestor power ons are completed first. This could lead to increased runtime of powerman client b/c of multiple "rounds" of power on and we need bigger timeout. I chose 100 seconds for time being (need to test).

"phased" power on requires delays in between levels. In the case between a blade and node would be > 2 mins, probably 3 minutes to be on the safe side. Adding the delay between cmm & blade, we're probably looking at a 5-7 minute powerman timeout for pm --on cmm,blade,node to work. The feeling is this is:

A) horrible
B) gives the appearance that powerman is hung
C) in the event of a real error w/ redfishpower/powerman, the timeout is now 5-7 minutes
D) is fraught with danger given the potential error handling. For example, after powering on the blade and the blade is on, the node responds to a "power on" with "network error". Is it a network error b/c the node is not up yet and can't respond (normal case)? Or is the node not there and the blade unpopulated? I have no idea. Given El Cap sizes, an occasional node being removed for servicing is to be expected. Sooo ... should I just keep on retrying? Or not?

Sooo ... we made a rule for now. You cannot power on two targets that have a parent/child relationship. So you can't do pm --on blade0,node[0-1] if blade0 is the parent of node[0-1]. But you can do pm --on blade0,node[14-15] where the parent of node[14-15] is a different blade.

You can still power off targets with a parent/child relationship.
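A check for this rule could look something like the sketch below. It is only an illustration: the helper names and the `parent_of` mapping are hypothetical, and it assumes the rule applies to any ancestor (e.g. cmm and node), not just a direct parent, which the description above does not spell out.

```python
# Hypothetical sketch of the "no parent/child in one power on" rule.

def ancestors(target, parent_of):
    """Yield each ancestor of target, nearest first (assumes no loops)."""
    node = parent_of.get(target)
    while node is not None:
        yield node
        node = parent_of.get(node)


def power_on_conflicts(targets, parent_of):
    """Return (descendant, ancestor) pairs that are both in the request.

    An empty list means the power on request is allowed.
    """
    requested = set(targets)
    conflicts = []
    for target in targets:
        for ancestor in ancestors(target, parent_of):
            if ancestor in requested:
                conflicts.append((target, ancestor))
    return conflicts


parent_of = {
    "blade0": "cmm0", "blade7": "cmm0",
    "node0": "blade0", "node1": "blade0",
    "node14": "blade7", "node15": "blade7",
}
print(power_on_conflicts(["blade0", "node0", "node1"], parent_of))
print(power_on_conflicts(["blade0", "node14", "node15"], parent_of))
```

The first request is rejected because node0 and node1 sit under blade0; the second has no conflicts because node14 and node15 sit under a different blade. A power off request would simply skip this check.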

@chu11 chu11 force-pushed the redfishpower_chassis_support branch 6 times, most recently from 9eaf69e to 85dda4d Compare March 5, 2024 20:03
@chu11 chu11 force-pushed the redfishpower_chassis_support branch 3 times, most recently from 986fe67 to 53b4b5e Compare March 21, 2024 23:03
@chu11 chu11 force-pushed the redfishpower_chassis_support branch from 53b4b5e to aeae52a Compare March 26, 2024 19:59
Problem: A comment about the status polling interval is out of
date.

Update it to indicate the power on/off wait range is upwards of
50 seconds.

Problem: The status polling interval is hard coded to 1 second long.
This can result in an excessive number of polling messages being sent
when it is known that some hardware takes 20-50 seconds to complete
a power operation.

Solution: Support a modified "exponential backoff" of the status polling
interval.  The modified algorithm is based on observations of how long it
typically takes to complete power operations on hardware.  The status
polling interval begins at one second, but it gets capped at 4 seconds.
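
As a rough illustration of the capped backoff, the sketch below doubles the interval each time until it hits the cap. The doubling step is an assumption; the commit message only states that the interval starts at one second and is capped at 4 seconds, and the actual "modified" algorithm may step differently. The function names are hypothetical.

```python
# Hypothetical sketch of a capped exponential backoff for status polling.
from itertools import repeat


def polling_intervals(start=1.0, cap=4.0):
    """Yield wait times: start, then doubling (assumed), never above cap."""
    interval = start
    while True:
        yield interval
        interval = min(interval * 2, cap)


def polls_until(duration, intervals):
    """Count how many polls are sent before `duration` seconds have passed."""
    elapsed = 0.0
    count = 0
    for wait in intervals:
        elapsed += wait
        count += 1
        if elapsed >= duration:
            return count


# Compare a fixed 1 second interval against the capped backoff for a
# power operation that takes 50 seconds to complete.
print(polls_until(50, repeat(1.0)))
print(polls_until(50, polling_intervals()))
```

Under these assumptions, a 50 second operation needs 50 polls at a fixed one second interval but only 14 with the backoff.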
Problem: When power control/query to a target fails, there is no way for
a user to know why it failed except through the very verbose
--telemetry output.

Add a new --diag option to powerman that will inform powermand to send
diagnostic information about why a power operation failed.  Common errors
from the same host will be collapsed into a hostrange.  This only works
with setplugstate and the new setresult statement.
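
The collapsing step might look roughly like the sketch below. It assumes that identical error messages are grouped and the affected hosts are printed as one hostrange; the actual --diag output format is not shown here, and the real code would presumably use powerman's own hostlist handling rather than this simplified compressor (which, for example, ignores leading zeros). All names are hypothetical.

```python
# Hypothetical sketch: group failures by message, compress host names.
import re
from collections import defaultdict


def compress(hosts):
    """Collapse names like node0,node1,node2,node5 into 'node[0-2,5]'."""
    by_prefix = defaultdict(list)
    parts = []
    for host in hosts:
        match = re.fullmatch(r"(.*?)(\d+)", host)
        if match:
            by_prefix[match.group(1)].append(int(match.group(2)))
        else:
            parts.append(host)  # no numeric suffix, keep as-is
    for prefix, nums in by_prefix.items():
        nums = sorted(set(nums))
        if len(nums) == 1:
            parts.append(f"{prefix}{nums[0]}")
            continue
        # Split the sorted numbers into runs of consecutive values.
        runs = []
        start = prev = nums[0]
        for n in nums[1:]:
            if n != prev + 1:
                runs.append((start, prev))
                start = n
            prev = n
        runs.append((start, prev))
        body = ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)
        parts.append(f"{prefix}[{body}]")
    return ",".join(parts)


def collapse_diagnostics(failures):
    """Map each distinct error message to the hostrange that reported it."""
    by_message = defaultdict(list)
    for host, message in failures:
        by_message[message].append(host)
    return {msg: compress(hosts) for msg, hosts in by_message.items()}


failures = [("node0", "network error"), ("node1", "network error"),
            ("node3", "timeout")]
for message, hostrange in collapse_diagnostics(failures).items():
    print(f"{hostrange}: {message}")
```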
Problem: The new --diag option is not documented.

Add it in powerman(1).

Problem: The --bad-plug option in vpcd cannot be called multiple
times to specify multiple bad plugs.

Support calling --bad-plug multiple times by putting the bad plugs
into an array.

Problem: There is no coverage for the new --diag option.

Add tests in new t0036-diag.t.
@chu11 chu11 force-pushed the redfishpower_chassis_support branch from aeae52a to 6b5f38d Compare April 4, 2024 22:46
Add device file for an HPE Cray Supercomputing EX Chassis.

Fixes chaos#128

Problem: There is no testing for the new HPE Cray Supercomputing
EX Chassis device file.

Add new tests in t0029-redfish.t.
@chu11 chu11 force-pushed the redfishpower_chassis_support branch from 6b5f38d to b95676d Compare April 5, 2024 00:14
@chu11
Member Author

chu11 commented Apr 5, 2024

closing this as almost everything has been divided up

@chu11 chu11 closed this Apr 5, 2024
Contributor

mergify bot commented Apr 5, 2024

⚠️ The sha of the head commit of this PR conflicts with #173. Mergify cannot evaluate rules on this PR. ⚠️
