rgw/multisite: maintain endpoints connectable status and retry the requests to them when appropriate #53320
Conversation
Replication lag comparison between pre-fix and post-fix. Multisite clusters configuration:
Pre-fix:
Post-fix:
@@ -67,8 +67,12 @@ inline param_vec_t make_param_list(const std::map<std::string, std::string> *pp)

class RGWRESTConn
{
  /* the endpoint is not able to connect if the timestamp is not real_clock::zero */
  using endpoint_status_map = std::unordered_map<std::string, std::atomic<ceph::real_time>>;
This is a comment @cbodley made in this PR on the same changes, which have been moved and updated here.
endpoints_status could probably just be a std::map<std::string, ceph::real_time> that only stores the failed endpoints. std::unordered_map is probably overkill when we only expect a handful of endpoints per zone. boost::container::flat_map could be a good alternative to reduce memory allocations.
The reasons why I still chose to use std::unordered_map to maintain the status of each and every endpoint are the following:
- the endpoints_status map itself is created when the RGWRESTConn is instantiated and won't be updated after that, hence the map structure itself doesn't need to be thread-safe.
- however, the timestamp/status of each endpoint might be accessed simultaneously by multiple threads, so it should be handled in a thread-safe way. In order to reduce the lock granularity, I chose to use std::atomic for each endpoint instead of locking the entire map. Since the majority of operations on the map are read-only, this reduces the overall time spent on locking.
- std::atomic is neither copyable nor movable, hence I'm not able to use boost::container::flat_map here.
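A self-contained sketch of that design (simplified: plain int64_t seconds stand in for ceph::real_time, and all names here are illustrative, not the actual RGWRESTConn code):

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the per-endpoint status idea. A value of 0 plays the role of the
// real_clock::zero sentinel ("never failed"). The map is populated once at
// construction and never resized afterwards, so only the per-endpoint
// atomics need to be thread-safe -- no lock around the map itself.
class EndpointStatusMap {
 public:
  explicit EndpointStatusMap(const std::vector<std::string>& endpoints) {
    for (const auto& ep : endpoints) {
      status_[ep].store(0);  // 0 == never failed
    }
  }

  // record a connection failure observed at `now` (seconds since epoch)
  void mark_failed(const std::string& ep, int64_t now) {
    status_.at(ep).store(now, std::memory_order_relaxed);
  }

  // an endpoint is usable if it never failed, or its failure record expired
  bool usable(const std::string& ep, int64_t now, int64_t expiry_secs) const {
    const int64_t failed_at = status_.at(ep).load(std::memory_order_relaxed);
    return failed_at == 0 || now - failed_at >= expiry_secs;
  }

 private:
  // the mapped type std::atomic<...> is neither copyable nor movable, which
  // is why boost::container::flat_map (which moves elements) can't hold it
  std::unordered_map<std::string, std::atomic<int64_t>> status_;
};
```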
jenkins test make check
jenkins test ceph API tests
jenkins test make check
@cbodley
tested on a 3x3 setup of RGWs dedicated to sync. the graph shows the sync rate from PRI to SEC, bottom scale in seconds:
- at the 1400-second mark, 1 RGW of the 3 was stopped and syncing continued with 2 RGWs out of 3
- at the 3000-second mark, another RGW was stopped and syncing continued with 1 RGW out of 3
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
force-pushed from fad79d4 to 7014f69
rebased
force-pushed from 7014f69 to cb7af19
jenkins test make check
does this PR need
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
jenkins test api
@jzhu116-bloomberg regarding multisite functional tests, there's a README in src/test/rgw/test_multi.md for running the tests locally from a ceph build directory. the test cases themselves are in src/test/rgw/rgw_multi/tests.py; see test_object_sync() for a simple example. src/test/rgw/rgw_multi/multisite.py contains the abstractions. a central concept of these tests is the 'checkpoints'. does that make sense?
force-pushed from 01dd685 to e467909
Added some multisite test cases trying to cover as much of the code paths changed in this PR as possible. In these new test cases, the 2nd rgw instance in each zone is brought down before the test and brought up after the test. All the newly added test cases are under
All the newly added test cases passed in my local run.
All the newly added test cases worked well until I pulled the main branch, where the test case "test_role_sync" fails in
…quests to them when appropriate
Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
force-pushed from 9612399 to 3ee43f0
Rebased and all
src/test/rgw/rgw_multi/tests.py
Outdated
if not stop_2nd_rgw(zonegroup):
    log.warning('Skipping the test - test_bucket_create_rgw_down: '
                'more than one rgw needed in any one or multiple zone(s)')
    return
can use SkipTest so these get reported as a skip instead of a pass, ex:
raise SkipTest("test_multi_period_incremental_sync skipped. Requires 3 or more zones in master zonegroup.")
done
src/test/rgw/rgw_multi/tests.py
Outdated
zonegroup_conns = ZonegroupConns(zonegroup)
buckets, _ = create_bucket_per_zone(zonegroup_conns, 2)
zonegroup_meta_checkpoint(zonegroup)

for zone in zonegroup_conns.zones:
    assert check_all_buckets_exist(zone, buckets)

start_2nd_rgw(zonegroup)
consider using try/finally to make sure the gateways always get restarted, even if the test fails above
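A self-contained sketch of that try/finally pattern, with stand-in stubs for the multisite helpers (stop_2nd_rgw/start_2nd_rgw are the real names from the diff above; everything else here is illustrative):

```python
# A finally block runs whether the test body passes, fails, or raises
# SkipTest, so the gateways are always restarted.
events = []

def stop_2nd_rgw():
    # stub: pretend we stopped the 2nd gateway in each zone
    events.append('stopped')
    return True

def start_2nd_rgw():
    # stub: pretend we restarted the gateways
    events.append('restarted')

def run_test(body):
    if not stop_2nd_rgw():
        return 'skipped'
    try:
        body()
        return 'passed'
    finally:
        # executed even when body() raises
        start_2nd_rgw()

def failing_body():
    raise AssertionError('sync check failed')

try:
    run_test(failing_body)
except AssertionError:
    pass

# the gateway was restarted even though the test body failed
assert events == ['stopped', 'restarted']
```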
done
verified that the tests pass locally, albeit slowly, with
Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
force-pushed from 3ee43f0 to 9b36e31
The heavier the traffic, the better the performance improvement.
it seems that would cause at most a 2-second delay to each test case after restarting its rgw? i'd guess that more significant delays are caused by the fact that radosgw doesn't drop the data/mdlog shard locks it holds on shutdown, so nobody can make progress on those for the next rgw_sync_lease_period=2 minutes. i'll run this through qa. if it adds too much to the runtime of rgw/multisite jobs, we might look for ways to simplify the tests
the runtimes were 3h38m and 4h53m, but all of the rgw_down tests were skipped because we only run a single rgw per zone. cc @smanjara @yuvalif: do you think we should add an rgw/multisite job that runs two gateways per zone so we can exercise this? i just worry that it'll take much longer. the rest of the rgw suite tends to complete in under an hour, so i often approve/merge things before the multisite jobs even complete
we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval, like during point releases. we may not have to run them for every multisite pr approval, I think?
okay, so it would live outside of qa/suites/rgw. there's already a qa/suites/rgw-multisite-upgrade subsuite that we had to move out of qa/suites/upgrade in #42869 because the multisite tests fail consistently. i guess we could add qa/suites/rgw-multisite for more comprehensive coverage, but we might want to wait until the existing tests can pass reliably
sounds good. about the present multisite failures, I updated the results here: #54442
thanks @smanjara; any objection to merging this as-is? i was able to validate the new tests locally with test_multi.py
sure, please go ahead. thanks!
thanks @jzhu116-bloomberg, and thanks @mkogan1 for the help with testing
<< " diff=" << diff << dendl;

static constexpr uint32_t CONN_STATUS_EXPIRE_SECS = 2;
if (diff >= CONN_STATUS_EXPIRE_SECS) {
@jzhu116-bloomberg .. could you please explain what it means if connection_status is not expired here (i.e., diff < 2 sec)? Shouldn't it be considered a valid endpoint?
This check breaks the cloud-transition code, and perhaps the cloud-sync module too - https://tracker.ceph.com/issues/65251
@soumyakoduri this tracks endpoints that recently returned a connection error so we can retry against other endpoints. 'not expired' here means that it failed recently enough that we shouldn't try it again.
however, this logic is not very good at differentiating between connection errors and other http errors. it just checks for EIO, which turns out to be the default case in rgw_http_error_to_errno().
@yuvalif and i discussed this recently in #55527, and changed some pubsub ops to return ERR_SERVICE_UNAVAILABLE instead of EIO to work around that.
i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite.
for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?
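The failure mode under discussion can be sketched like this (a simplified illustration of an HTTP-status-to-errno mapping with a catch-all EIO default, not ceph's actual rgw_http_error_to_errno() implementation; the specific status-to-errno pairs are assumptions for the example):

```cpp
#include <cerrno>

// Illustrative mapping: a handful of statuses get specific errnos, and
// everything else falls through to -EIO. The problem discussed above is
// exactly this default case: any unlisted status (e.g. a plain 500 from a
// cloud endpoint) becomes EIO and is then indistinguishable from a real
// connection failure, so the endpoint gets marked as not-connectable.
static int http_error_to_errno(int http_status) {
  switch (http_status) {
    case 400: return -EINVAL;
    case 401:
    case 403: return -EPERM;
    case 404: return -ENOENT;
    case 409: return -EEXIST;
    case 503: return -EBUSY;
    default:  return -EIO;  // catch-all: non-connection errors look like EIO
  }
}
```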
for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?
Thanks Casey.
For the simple object transition case, I can think of requests such as target bucket creation (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1217) and fetching the HEAD object (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282) which may fail. This is expected at times and hence the errors are ignored. The errors returned by the cloud endpoint for these ops may include EIO.
There may be a few more such cases with multipart upload to the cloud.
i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite
if we can't prevent these cloud transition ops from returning errors that map to EIO, removing EIO from rgw_http_error_to_errno() seems like the only viable solution. i pushed #56704 as an example. @smanjara do you think we could reasonably validate such a change with multisite testing at this point?
@smanjara i've added that to a test batch already. i'll share the results and we can go through them together
Thanks Casey and Shilpa
Maintain endpoints connectable status and retry the requests to them when appropriate.
Fixes: https://tracker.ceph.com/issues/62710
Signed-off-by: Jane Zhu jzhu116@bloomberg.net