rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes #13546

Merged
merged 1 commit into from Feb 28, 2017

@cbodley cbodley commented Feb 20, 2017

RGWBackoffControlCR only treats EBUSY and EAGAIN as 'temporary' error
codes, with all other errors being fatal when exit_on_error is set

to RGWMetaSyncShardControlCR, a 'fatal' error means that no further sync
is possible on that shard until the gateway restarts

this changes RGWMetaSyncShardControlCR to set exit_on_error to false, so
that it will continue to retry with backoff no matter what error code it
gets

Fixes: http://tracker.ceph.com/issues/19019
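
The retry policy described above can be sketched roughly as follows. This is a simplified, standalone illustration rather than the actual coroutine code: `should_retry` and `Backoff` are hypothetical names, the negative-errno convention is an assumption, and the doubling schedule with a 30-second default cap is an assumption based on the maximum backoff mentioned later in this conversation.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>

// Hypothetical helper (not Ceph's API): decide whether a failed attempt
// should be retried, following the policy described in this PR.
bool should_retry(int err, bool exit_on_error) {
  // assumption: errors may arrive as negative errno values
  const int code = err < 0 ? -err : err;
  const bool temporary = (code == EBUSY || code == EAGAIN);
  // temporary errors are always retried; any other error is fatal only
  // when exit_on_error is set
  return temporary || !exit_on_error;
}

// Hypothetical backoff timer: doubles the wait after each failure,
// capped at a configurable maximum.
class Backoff {
  int cur_secs = 0;
  int max_secs;

 public:
  explicit Backoff(int cap = 30) : max_secs(cap) {
    assert(cap >= 1);
  }

  // returns the number of seconds to wait before the next attempt
  int next_wait() {
    cur_secs = (cur_secs == 0) ? 1 : std::min(cur_secs * 2, max_secs);
    return cur_secs;
  }

  // call after a successful attempt so the next failure starts small again
  void reset() { cur_secs = 0; }
};
```

With `exit_on_error` set to false, `should_retry` returns true for every error code, which is the behavior this change gives RGWMetaSyncShardControlCR.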

cbodley commented Feb 20, 2017

@yehudasa can you think of any cases where we really do want to give up on a shard?

the only case i can think of is for #13070, where we need to signal that we're done with the current period - but i'd rather make that an explicit part of the RGWBackoffControlCR interface, rather than returning a faked-up error code

cbodley commented Feb 22, 2017

there was some discussion about the risks of continuing to retry on non-transient errors. i'd like to find a way to address these, while maintaining the ability to recover from errors at this level

one risk is related to RGWSyncErrorLogger, which writes each sync error to rados so that admins can view them with radosgw-admin sync error list. if we retry endlessly, we'd spam this error log and eventually fill the storage. however, we could avoid duplicating these error entries if each sync shard were to remember the last key for which it logged an error
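
as a rough sketch of that idea (hypothetical code, not the RGWSyncErrorLogger interface): `ShardErrorLog` and its methods are made-up names, and the in-memory vector only stands in for the rados-backed error log

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-shard wrapper: remembers the last key it logged an
// error for and skips repeated entries for that same key.
class ShardErrorLog {
  std::string last_key;
  bool has_last = false;
  std::vector<std::string> entries;  // stand-in for the rados-backed log

 public:
  // returns true if a new entry was written, false if it was suppressed
  bool log_error(const std::string& key, int err) {
    if (has_last && key == last_key) {
      return false;  // same key as the last logged failure: treat as a retry
    }
    last_key = key;
    has_last = true;
    entries.push_back(key + ": error " + std::to_string(err));
    return true;
  }

  // call once the key syncs successfully, so a later failure is logged again
  void clear() { has_last = false; }

  std::size_t size() const { return entries.size(); }
};
```

this version compares keys only, so a retry that fails on the same key with a different error code would also be suppressed; whether that's acceptable is an open question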

are there other issues that need to be addressed before we can safely retry on all errors?

we might also want to raise the maximum backoff time above 30 seconds to deal with extreme cases

Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley changed the title from "[RFC] rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes" to "rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes" on Feb 27, 2017

cbodley commented Feb 27, 2017

updated to clear the reset_backoff flag on errors from the remote

@mattbenjamin mattbenjamin left a comment

lgtm
