New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rbd: execute rbd image promote with timeout #2737
Conversation
|
Logs from manual testing |
|
@idryomov PTAL |
|
keeping the timeout as 1 minute but it can be reduced to 20/30 seconds to speed up the failover. @ShyamsundarR @idryomov any suggestion on the timeout? |
| // TODO: remove this workaround when the issue is fixed | ||
| err = rbdVol.forcePromoteImage(cr) | ||
| } else { | ||
| err = rbdVol.promoteImage(req.Force) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| err = rbdVol.promoteImage(req.Force) | |
| err = rbdVol.promoteImage(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once I remove the workaround I need to change this from false to req.Force. keeping it as for now. Hope its fine
| err, | ||
| stderr, | ||
| program, | ||
| sanitizedArgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a generic utility helper, it would be nice to differentiate between the cases of a timeout and a command failing of its own accord within a timeout a bit more explicitly, so that instead of "signal: killed" or something like that the error message actually mentions the timeout. Perhaps check for context.DeadlineExceeded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking is it possible with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update code to have timeout: context deadline exceeded error in the error message.
What I think @ShyamsundarR really wants here is a retry loop, so I'll defer to him for the frequency and the number of retries. |
This will be retried from the caller i.e volume replication operator sidecar. |
361738d
to
6cfd59c
Compare
| // forcePromoteImage promotes image to primary with force option with 1 minute | ||
| // timeout. If there is no response within 1 minute,the rbd CLI process will be | ||
| // killed and an error is returned. | ||
| func (rv *rbdVolume) forcePromoteImage(cr *util.Credentials) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@idryomov if force promote is rolling back a snapshot, then would it take time to respond? and if so, terminating it early would continue the operation or restart fresh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShyamsundarR this is client termination right, will it have any impact on the server-side? but good to get it confirmed 👍
1 minute is fine for now as the default. The current focus is to recover, even if RTO is impacted slightly. |
| program, | ||
| sanitizedArgs) | ||
|
|
||
| if ctx != context.TODO() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rational for checking context.TODO(), should it not be logged for any non-nil contexts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot log the ctx if its TODO as it does not contains the request ID
| err = rbdVol.forcePromoteImage(cr) | ||
| } else { | ||
| err = rbdVol.promoteImage(req.Force) | ||
| } | ||
| if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there was an error, should it not be inspected? A timeout can be retried, but an other failure maybe not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left to the caller to retry all errors will be returned to the users
added ExecCommandWithTimeout helper function to execute the commands with the timeout option, if the command does not return any response with in the timeout time the process will be terminated and error will be returned back to the user. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
use ExecCommandWithTimeout with timeout of 1 minute for the promote operation. If the command doesnot returns error/response in 1 minute the process will be killed and error will be returned to the user. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
6cfd59c
to
e7abdfc
Compare
Pull request has been modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(adding back the review) Thanks @nixpanic
added ExecCommandWithTimeout helper function to execute the commands with the timeout option, if the command does not return any response within the timeout time the process will be terminated and an error will be returned back to the user.
use ExecCommandWithTimeout with a timeout of 1 minute for the promote operation. If the command does not return an error/response in 1 minute the process will be killed and the error will be returned to the user
updates #2736
Signed-off-by: Madhu Rajanna madhupr007@gmail.com