rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes #13546

cbodley · 2017-02-20T21:11:21Z

RGWBackoffControlCR only treats EBUSY and EAGAIN as 'temporary' error
codes, with all other errors being fatal when exit_on_error is set

to RGWMetaSyncShardControlCR, a 'fatal' error means that no further sync
is possible on that shard until the gateway restarts

this changes RGWMetaSyncShardControlCR to set exit_on_error to false, so
that it will continue to retry with backoff no matter what error code it
gets

Fixes: http://tracker.ceph.com/issues/19019
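
The retry decision described above can be sketched roughly as follows. This is a simplified, synchronous model rather than the real coroutine code: `Action`, `classify`, and `run_with_retry` are hypothetical names, and the negative-errno return convention is an assumption.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical stand-in for the retry decision; not the real
// RGWBackoffControlCR. Return codes are assumed to be negative errno values.
enum class Action { Done, Retry, GiveUp };

// Decide what to do with the return code of one attempt.
inline Action classify(int retcode, bool exit_on_error) {
  if (retcode >= 0) {
    return Action::Done;    // success: stop retrying
  }
  const bool temporary = (retcode == -EBUSY || retcode == -EAGAIN);
  if (!temporary && exit_on_error) {
    return Action::GiveUp;  // 'fatal': no further attempts are made
  }
  return Action::Retry;     // back off and try again
}

// Run op until classify() says to stop; returns the last return code.
// max_attempts exists only so that this sketch always terminates.
inline int run_with_retry(const std::function<int()>& op,
                          bool exit_on_error, int max_attempts) {
  int ret = -EAGAIN;
  for (int i = 0; i < max_attempts; ++i) {
    ret = op();
    if (classify(ret, exit_on_error) != Action::Retry) {
      break;
    }
    // a real implementation would sleep with backoff here
  }
  return ret;
}
```

With `exit_on_error` set to false, `classify` never returns `GiveUp`, which is the behavior this change asks for on meta sync shards.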

cbodley · 2017-02-20T21:29:30Z

@yehudasa can you think of any cases where we really do want to give up on a shard?

the only case i can think of is for #13070, where we need to signal that we're done with the current period - but i'd rather make that an explicit part of the RGWBackoffControlCR interface, rather than returning a faked-up error code

cbodley · 2017-02-22T15:18:26Z

there was some discussion about the risks of continuing to retry on non-transient errors. i'd like to find a way to address these, while maintaining the ability to recover from errors at this level

one risk is related to RGWSyncErrorLogger, which writes each sync error to rados so that admins can view them with radosgw-admin sync error list. if we retry endlessly, we'd spam this error log and eventually fill the storage. however, we could avoid duplicating these error entries if each sync shard were to remember the last key for which it logged an error
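
a minimal sketch of that de-duplication idea, assuming each shard keeps one small filter object. `ShardErrorDedup` and its methods are hypothetical and not part of RGWSyncErrorLogger:

```cpp
#include <string>

// Hypothetical per-shard filter that suppresses repeated error log entries
// for the same key. Illustrative only; not an existing rgw class.
class ShardErrorDedup {
  std::string last_key_;
  bool has_last_ = false;

public:
  // Returns true if an error for this key should be written to the error log.
  bool should_log(const std::string& key) {
    if (has_last_ && key == last_key_) {
      return false;  // already logged an error for this key; skip the repeat
    }
    last_key_ = key;
    has_last_ = true;
    return true;
  }

  // Forget the remembered key, e.g. once the entry syncs successfully,
  // so that a later failure on the same key is logged again.
  void clear() {
    has_last_ = false;
    last_key_.clear();
  }
};
```

remembering only the last key keeps the per-shard state constant in size, at the cost of logging again if failures alternate between two keys.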

are there other issues that need to be addressed before we can safely retry on all errors?

we might also want to raise the maximum backoff time above 30 seconds to deal with extreme cases
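
for illustration, a capped backoff could look like the sketch below. `CappedBackoff` is a hypothetical class, and the doubling policy and 1-second starting wait are assumptions; only the 30-second maximum comes from the discussion above.

```cpp
#include <algorithm>

// Hypothetical capped exponential backoff; illustrative only.
// The wait doubles on each retry until it reaches max_secs.
class CappedBackoff {
  int max_secs_;
  int cur_secs_ = 0;

public:
  explicit CappedBackoff(int max_secs) : max_secs_(max_secs) {}

  // Returns the number of seconds to wait before the next retry.
  int next() {
    cur_secs_ = (cur_secs_ == 0) ? 1 : std::min(cur_secs_ * 2, max_secs_);
    return cur_secs_;
  }

  // Start over from the shortest wait, e.g. after a success.
  void reset() { cur_secs_ = 0; }
};
```

raising the maximum only changes the constructor argument: with a cap of 30 the waits level off at 30 seconds, and a larger cap lets persistent failures be retried less often.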

RGWBackoffControlCR only treats EBUSY and EAGAIN as 'temporary' error
codes, with all other errors being fatal when exit_on_error is set

to RGWMetaSyncShardControlCR, a 'fatal' error means that no further sync
is possible on that shard until the gateway restarts

this changes RGWMetaSyncShardControlCR to set exit_on_error to false, so
that it will continue to retry with backoff no matter what error code it
gets

Fixes: http://tracker.ceph.com/issues/19019

Signed-off-by: Casey Bodley <cbodley@redhat.com>

cbodley · 2017-02-27T20:10:07Z

updated to clear the reset_backoff flag on errors from the remote

mattbenjamin

lgtm

cbodley added bug-fix rgw labels Feb 20, 2017

cbodley requested a review from yehudasa February 20, 2017 21:29

cbodley force-pushed the wip-19019 branch from 3dbc414 to 3e40595 Compare February 27, 2017 20:06

cbodley changed the title ~~[RFC] rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes~~ rgw: RGWMetaSyncShardControlCR retries with backoff on all error codes Feb 27, 2017

cbodley added wip-cbodley-testing and removed wip-cbodley-testing labels Feb 27, 2017

mattbenjamin self-requested a review February 28, 2017 20:05

mattbenjamin approved these changes Feb 28, 2017

mattbenjamin merged commit af7f048 into ceph:master Feb 28, 2017

cbodley mentioned this pull request Mar 7, 2017

rgw multisite: fixes for meta sync across periods #13070

Merged
