Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd: execute rbd image promote with timeout #2737

Merged
merged 2 commits into from Dec 23, 2021

Conversation

Madhu-1
Copy link
Collaborator

@Madhu-1 Madhu-1 commented Dec 22, 2021

added ExecCommandWithTimeout helper function to execute the commands with the timeout option, if the command does not return any response within the timeout time the process will be terminated and an error will be returned back to the user.

use ExecCommandWithTimeout with a timeout of 1 minute for the promote operation. If the command does not return an error/response in 1 minute the process will be killed and the error will be returned to the user

updates #2736
Signed-off-by: Madhu Rajanna madhupr007@gmail.com

@Madhu-1 Madhu-1 added component/rbd Issues related to RBD ci/skip/e2e skip running e2e CI jobs labels Dec 22, 2021
@Madhu-1 Madhu-1 requested review from ShyamsundarR and a team December 22, 2021 04:21
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Dec 22, 2021

Logs from manual testing

I1222 04:11:47.720355       1 utils.go:177] ID: 19 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 GRPC call: /replication.Controller/PromoteVolume
I1222 04:11:47.720612       1 utils.go:179] ID: 19 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 GRPC request: {"parameters":{"mirroringMode":"snapshot","schedulingInterval":"2m"},"secrets":"***stripped***","volume_id":"0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004"}
I1222 04:11:47.723091       1 omap.go:87] ID: 19 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 got omap values: (pool="replicapool", namespace="", name="csi.volume.79b7de99-6261-11ec-adc1-0242ac110004"): map[csi.imageid:1097daf8aee6 csi.imagename:csi-vol-79b7de99-6261-11ec-adc1-0242ac110004 csi.volname:pvc-7783127f-a758-4ee9-8eac-4b813f61be69 csi.volume.owner:default]
E1222 04:11:47.869356       1 replicationcontrollerserver.go:568] ID: 19 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 failed to promote image "replicapool/csi-vol-79b7de99-6261-11ec-adc1-0242ac110004" with error: rbd: ret=-16, Device or resource busy
E1222 04:11:47.869983       1 utils.go:186] ID: 19 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 GRPC error: rpc error: code = FailedPrecondition desc = failed to promote image "replicapool/csi-vol-79b7de99-6261-11ec-adc1-0242ac110004" with error: rbd: ret=-16, Device or resource busy
I1222 04:11:47.880137       1 utils.go:177] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 GRPC call: /replication.Controller/PromoteVolume
I1222 04:11:47.880426       1 utils.go:179] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 GRPC request: {"force":true,"parameters":{"mirroringMode":"snapshot","schedulingInterval":"2m"},"secrets":"***stripped***","volume_id":"0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004"}
I1222 04:11:47.889445       1 omap.go:87] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 got omap values: (pool="replicapool", namespace="", name="csi.volume.79b7de99-6261-11ec-adc1-0242ac110004"): map[csi.imageid:1097daf8aee6 csi.imagename:csi-vol-79b7de99-6261-11ec-adc1-0242ac110004 csi.volname:pvc-7783127f-a758-4ee9-8eac-4b813f61be69 csi.volume.owner:default]


I1222 04:11:51.808229       1 cephcmds.go:96] executing command rbd [mirror image promote replicapool/csi-vol-79b7de99-6261-11ec-adc1-0242ac110004 --force --id csi-rbd-provisioner -m 192.168.121.3:6789 --keyfile=/tmp/csi/keys/keyfile-1849052362]


I1222 04:11:51.870051       1 replicationcontrollerserver.go:600] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 Added scheduling at interval 2m, start time  for volume replicapool/csi-vol-79b7de99-6261-11ec-adc1-0242ac110004
I1222 04:11:51.873381       1 replicationcontrollerserver.go:608] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 attempting to tickle dummy image for restarting RBD schedules
I1222 04:11:52.376957       1 utils.go:188] ID: 20 Req-ID: 0001-0009-rook-ceph-0000000000000001-79b7de99-6261-11ec-adc1-0242ac110004 GRPC response: {}

@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Dec 22, 2021

@idryomov PTAL

@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Dec 22, 2021

keeping the timeout as 1 minute but it can be reduced to 20/30 seconds to speed up the failover. @ShyamsundarR @idryomov any suggestion on the timeout?

// TODO: remove this workaround when the issue is fixed
err = rbdVol.forcePromoteImage(cr)
} else {
err = rbdVol.promoteImage(req.Force)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err = rbdVol.promoteImage(req.Force)
err = rbdVol.promoteImage(false)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once I remove the workaround I need to change this from false to req.Force. keeping it as for now. Hope its fine

err,
stderr,
program,
sanitizedArgs)
Copy link
Contributor

@idryomov idryomov Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a generic utility helper, it would be nice to differentiate between the cases of a timeout and a command failing of its own accord within a timeout a bit more explicitly, so that instead of "signal: killed" or something like that the error message actually mentions the timeout. Perhaps check for context.DeadlineExceeded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking is it possible with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update code to have timeout: context deadline exceeded error in the error message.

@idryomov
Copy link
Contributor

keeping the timeout as 1 minute but it can be reduced to 20/30 seconds to speed up the failover. @ShyamsundarR @idryomov any suggestion on the timeout?

What I think @ShyamsundarR really wants here is a retry loop, so I'll defer to him for the frequency and the number of retries.

@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Dec 22, 2021

keeping the timeout as 1 minute but it can be reduced to 20/30 seconds to speed up the failover. @ShyamsundarR @idryomov any suggestion on the timeout?

What I think @ShyamsundarR really wants here is a retry loop, so I'll defer to him for the frequency and the number of retries.

This will be retried from the caller i.e volume replication operator sidecar.

@Madhu-1 Madhu-1 force-pushed the command-withtimeout branch 2 times, most recently from 361738d to 6cfd59c Compare December 22, 2021 11:27
// forcePromoteImage promotes image to primary with force option with 1 minute
// timeout. If there is no response within 1 minute,the rbd CLI process will be
// killed and an error is returned.
func (rv *rbdVolume) forcePromoteImage(cr *util.Credentials) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@idryomov if force promote is rolling back a snapshot, then would it take time to respond? and if so, terminating it early would continue the operation or restart fresh?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShyamsundarR this is client termination right, will it have any impact on the server-side? but good to get it confirmed 👍

@ShyamsundarR
Copy link
Contributor

keeping the timeout as 1 minute but it can be reduced to 20/30 seconds to speed up the failover. @ShyamsundarR @idryomov any suggestion on the timeout?

What I think @ShyamsundarR really wants here is a retry loop, so I'll defer to him for the frequency and the number of retries.

This will be retried from the caller i.e volume replication operator sidecar.

1 minute is fine for now as the default. The current focus is to recover, even if RTO is impacted slightly.

ShyamsundarR
ShyamsundarR previously approved these changes Dec 23, 2021
program,
sanitizedArgs)

if ctx != context.TODO() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rational for checking context.TODO(), should it not be logged for any non-nil contexts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot log the ctx if its TODO as it does not contains the request ID

internal/rbd/replicationcontrollerserver.go Outdated Show resolved Hide resolved
err = rbdVol.forcePromoteImage(cr)
} else {
err = rbdVol.promoteImage(req.Force)
}
if err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there was an error, should it not be inspected? A timeout can be retried, but an other failure maybe not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left to the caller to retry all errors will be returned to the users

added ExecCommandWithTimeout helper function
to execute the commands with the timeout option,
if the command does not return any response with
in the timeout time the process will be terminated
and error will be returned back to the user.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
use ExecCommandWithTimeout with timeout
of 1 minute for the promote operation.
If the command doesnot returns error/response
in 1 minute the process will be killed
and error will be returned to the user.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@mergify mergify bot dismissed ShyamsundarR’s stale review December 23, 2021 12:31

Pull request has been modified.

Copy link
Contributor

@ShyamsundarR ShyamsundarR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(adding back the review) Thanks @nixpanic

@mergify mergify bot merged commit e4b7943 into ceph:devel Dec 23, 2021
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/skip/e2e skip running e2e CI jobs component/rbd Issues related to RBD
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants