rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes #13546

Merged
mattbenjamin merged 1 commit into ceph:master on Feb 28, 2017

2 participants
@cbodley
Contributor

cbodley commented Feb 20, 2017

RGWBackoffControlCR only treats EBUSY and EAGAIN as 'temporary' error
codes, with all other errors being fatal when exit_on_error is set

to RGWMetaSyncShardControlCR, a 'fatal' error means that no further sync
is possible on that shard until the gateway restarts

this changes RGWMetaSyncShardControlCR to set exit_on_error to false, so
that it will continue to retry with backoff no matter what error code it
gets

Fixes: http://tracker.ceph.com/issues/19019
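
As a rough illustration of the behavior described above, here is a minimal C++ sketch of a retry loop with backoff. It is not the real RGWBackoffControlCR: `run_with_backoff`, `BackoffResult`, `is_temporary`, the doubling delay, the negative errno convention, and the `max_attempts` bound are all assumptions made for the example. The 30 second default only mirrors the ceiling mentioned later in this thread.

```cpp
#include <algorithm>
#include <cerrno>
#include <functional>
#include <vector>

// Hypothetical stand-in for the backoff controller described above.
// Not the real RGWBackoffControlCR; names and structure are assumptions.
struct BackoffResult {
  int ret;                      // final return code (0 on success)
  std::vector<int> waits_secs;  // delays that would have been slept
};

// Only EBUSY and EAGAIN count as temporary, as the description states.
// Assumes errors are reported as negative errno values.
inline bool is_temporary(int err) {
  return err == -EBUSY || err == -EAGAIN;
}

// Call `op` until it succeeds, doubling the delay after each failure up to
// `max_wait_secs`. With exit_on_error set, a non-temporary error is fatal.
// `max_attempts` exists only so this sketch always terminates.
inline BackoffResult run_with_backoff(const std::function<int()>& op,
                                      bool exit_on_error,
                                      int max_wait_secs = 30,
                                      int max_attempts = 16) {
  BackoffResult result{0, {}};
  int wait_secs = 1;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    result.ret = op();
    if (result.ret >= 0) {
      return result;  // success
    }
    if (exit_on_error && !is_temporary(result.ret)) {
      return result;  // fatal: the caller gives up on this shard
    }
    // Record the delay instead of sleeping, to keep the sketch testable.
    result.waits_secs.push_back(wait_secs);
    wait_secs = std::min(wait_secs * 2, max_wait_secs);
  }
  return result;
}
```

With `exit_on_error` set to true, an operation that fails with `-EIO` returns immediately and no retry is scheduled. With it set to false, the same operation is retried after growing delays until it succeeds, which is the behavior this change selects for the metadata sync shards.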

@cbodley

Contributor

cbodley commented Feb 20, 2017

@yehudasa can you think of any cases where we really do want to give up on a shard?

the only case i can think of is for #13070, where we need to signal that we're done with the current period - but i'd rather make that an explicit part of the RGWBackoffControlCR interface, rather than returning a faked-up error code

@cbodley cbodley requested a review from yehudasa Feb 20, 2017

@cbodley

Contributor

cbodley commented Feb 22, 2017

there was some discussion about the risks of continuing to retry on non-transient errors. i'd like to find a way to address these while keeping the ability to recover from errors at this level

one risk is related to RGWSyncErrorLogger, which writes each sync error to rados so that admins can view them with radosgw-admin sync error list. if we retry endlessly, we'd spam this error log and eventually fill the storage. however, we could avoid duplicating these error entries if each sync shard were to remember the last key for which it logged an error

are there other issues that need to be addressed before we can safely retry on all errors?

we might also want to raise the maximum backoff time above 30 seconds to deal with extreme cases
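
The de-duplication idea from the comment above (each sync shard remembers the last key for which it logged an error) could look roughly like the following C++ sketch. `DedupErrorLog` and its methods are hypothetical and are not part of RGWSyncErrorLogger. The `clear` step is an extra assumption about when a shard should forget its remembered key.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical error log that remembers, per shard, the last key it logged
// an error for. Not RGWSyncErrorLogger; all names here are assumptions.
class DedupErrorLog {
 public:
  // Returns true if a new entry was written, false if it was suppressed
  // because the shard's previous logged error was for the same key.
  bool log_error(int shard_id, const std::string& key, int err) {
    auto it = last_key_.find(shard_id);
    if (it != last_key_.end() && it->second == key) {
      return false;  // repeated failure on the same key: skip the duplicate
    }
    last_key_[shard_id] = key;
    entries_.push_back({shard_id, key, err});
    return true;
  }

  // Assumption: forget the remembered key once the shard makes progress, so
  // a later failure on the same key is logged as a new event.
  void clear(int shard_id) { last_key_.erase(shard_id); }

  // Number of entries actually written to the log.
  std::size_t size() const { return entries_.size(); }

 private:
  struct Entry {
    int shard_id;
    std::string key;
    int err;
  };
  std::map<int, std::string> last_key_;  // shard id -> last logged key
  std::vector<Entry> entries_;           // stands in for the rados-backed log
};
```

Under this scheme a shard that keeps failing on one key writes a single entry no matter how many times it retries, so the log grows with the number of distinct failing keys and not with the number of attempts.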

rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes
Signed-off-by: Casey Bodley <cbodley@redhat.com>

@cbodley cbodley changed the title from [RFC] rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes to rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes Feb 27, 2017

@cbodley

Contributor

cbodley commented Feb 27, 2017

updated to clear the reset_backoff flag on errors from the remote

@mattbenjamin mattbenjamin self-requested a review Feb 28, 2017

@mattbenjamin

lgtm

@mattbenjamin mattbenjamin merged commit af7f048 into ceph:master Feb 28, 2017

2 of 3 checks passed

default: Build finished.
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified