
rgw/multisite: maintain endpoints connectable status and retry the requests to them when appropriate #53320

Merged
merged 2 commits into ceph:main on Jan 24, 2024

Conversation

jzhu116-bloomberg
Contributor

Maintain endpoints connectable status and retry the requests to them when appropriate.

Fixes: https://tracker.ceph.com/issues/62710
Signed-off-by: Jane Zhu jzhu116@bloomberg.net


@jzhu116-bloomberg jzhu116-bloomberg requested a review from a team as a code owner September 6, 2023 19:30
@github-actions github-actions bot added the rgw label Sep 6, 2023
@jzhu116-bloomberg
Contributor Author

Replication lag comparison between pre-fix and post-fix.

Multisite clusters configuration:

  • 2 clusters
  • each cluster has 3 rgw nodes, and 16 rgw instances per node (with 8 client-facing on the primary site)
  • all rgw instances are added in the zonegroup settings
  • shut down all the rgw instances on one primary rgw node

Pre-fix:

  • Client traffic: cosbench write only, 15 users, 30 workers, 600 seconds
  • Replication lag: 49 mins

Post-fix:

  • Client traffic: cosbench write only (with similar pattern to the pre-fix test), 15 users, 30 workers, 3600 seconds
  • Replication lag: roughly 16 mins

@@ -67,8 +67,12 @@ inline param_vec_t make_param_list(const std::map<std::string, std::string> *pp)

class RGWRESTConn
{
/* the endpoint is not able to connect if the timestamp is not real_clock::zero */
using endpoint_status_map = std::unordered_map<std::string, std::atomic<ceph::real_time>>;
Contributor Author

This is a comment @cbodley made earlier in this PR on the same changes, which have since been moved and updated here.

endpoints_status could probably just be a std::map<std::string, ceph::real_time> that only stores the failed endpoints. std::unordered_map is probably overkill when we only expect a handful of endpoints per zone. boost::container::flat_map could be a good alternative to reduce memory allocations

Here are the reasons why I still chose to use std::unordered_map to maintain the status for each and every endpoint:

  • the endpoints_status map itself is created when the RGWRESTConn is instantiated and won't be updated after that, hence the map structure itself doesn't need to be thread-safe.
  • however, the timestamp/status of each endpoint might be accessed simultaneously by multiple threads, so it should be handled in a thread-safe way. In order to reduce the lock granularity, I chose to use std::atomic for each endpoint instead of locking the entire map. Since the majority of operations on the map would be read-only, this can reduce the overall time spent on locking.
  • std::atomic is neither copyable nor movable, hence I'm not able to use boost::container::flat_map here.

@jzhu116-bloomberg
Contributor Author

jenkins test make check

@jzhu116-bloomberg
Contributor Author

jenkins test ceph API tests

@jzhu116-bloomberg
Contributor Author

jenkins test make check

Contributor

@mkogan1 mkogan1 left a comment

@cbodley
tested on a 3x3 setup of RGWs dedicated to sync

the graph shows the sync rate from PRI to SEC; the bottom scale is seconds:
at the 1400-second mark, 1 of the 3 RGWs was stopped and syncing continued with 2 RGWs out of 3
at the 3000-second mark, another RGW was stopped and syncing continued with 1 RGW out of 3

[image: graph of sync rate from PRI to SEC]

@github-actions

github-actions bot commented Oct 9, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@jzhu116-bloomberg
Contributor Author

rebased

@jzhu116-bloomberg
Contributor Author

jenkins test make check

@jzhu116-bloomberg
Contributor Author

does this PR need the needs-qa flag?


github-actions bot commented Nov 6, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@jzhu116-bloomberg
Contributor Author

jenkins test api

@cbodley
Contributor

cbodley commented Nov 29, 2023

@jzhu116-bloomberg regarding multisite functional tests, there's a README in src/test/rgw/test_multi.md for running the tests locally from a ceph build directory

the test cases themselves are in src/test/rgw/rgw_multi/tests.py. see test_object_sync() for a simple example

src/test/rgw/rgw_multi/multisite.py contains the abstractions for Gateway, Zone, etc. individual gateways can be stopped/started with Gateway.stop()/start(). there's also a Zone.stop()/start() that does the same for all gateways in the zone. there are several examples like test_bucket_full_sync_after_data_sync_init() that use the Zone-level stop/start, but i don't know of any that control gateways individually

a central concept of these tests is the 'checkpoints', like zonegroup_bucket_checkpoint() which waits for bucket sync on all zones in the zonegroup to catch up to all others, then verifies that all of their contents match. so you'd use some combination of object writes, bucket checkpoints, and gateway stops/starts to verify that sync continues while gateways are down
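
for example, a rough sketch of how those pieces could fit together (illustrative only, not an existing test case; helper signatures are approximate and the per-gateway accessor is an assumption):

# illustrative sketch only -- the helpers are the ones described above from
# rgw_multi/tests.py, but exact signatures and the gateway accessor may differ
def test_object_sync_with_gateway_down(zonegroup):
    zonegroup_conns = ZonegroupConns(zonegroup)
    buckets, _ = create_bucket_per_zone(zonegroup_conns)
    zonegroup_meta_checkpoint(zonegroup)

    gw = zonegroup.zones[0].gateways[1]   # hypothetical: second gateway of the first zone
    gw.stop()                             # Gateway.stop(): take this one gateway down

    # write an object while the gateway is down (put_object is a hypothetical helper)
    put_object(zonegroup_conns.zones[0], buckets[0], 'obj1', 'data')

    # wait for bucket sync on all zones to catch up to each other, then verify contents match
    zonegroup_bucket_checkpoint(zonegroup_conns, buckets[0])

    gw.start()                            # Gateway.start(): bring the gateway back up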

does that make sense?

@jzhu116-bloomberg
Contributor Author

> does that make sense?
Yup, it makes sense. Thanks!
I'll create some test cases.

@jzhu116-bloomberg
Contributor Author

Added some multisite test cases, trying to cover as much of the code paths changed in this PR as possible. In these new test cases, the 2nd rgw instance in each zone is brought down before the test and brought back up after it.

All the newly added test cases are under @attr('rgw_down'). The following is an example of how to run all of them.

$ MON=1 OSD=1 MGR=0 MDS=0 RGW_MULTI_TEST_CONF=./test_multi.conf nosetests -s /ceph-src/src/test/rgw/test_multi.py -a 'rgw_down'

All the newly added test cases passed in my local run.
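
For reference, the selection works through nose's attrib plugin: each new test carries the @attr('rgw_down') decorator, and the -a 'rgw_down' flag above picks them up. A minimal sketch of the tagging (the test body itself is elided here):

from nose.plugins.attrib import attr

@attr('rgw_down')   # matched by: nosetests ... -a 'rgw_down'
def test_bucket_create_rgw_down():
    # body: skip unless a second rgw per zone is available, stop it, create
    # buckets, checkpoint, verify on every zone, then restart the gateways
    ...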

@jzhu116-bloomberg
Contributor Author

jzhu116-bloomberg commented Jan 10, 2024

All the newly added test cases worked well until I pulled the main branch, where the test case "test_role_sync" fails in forward_iam_request_to_master.
I did some digging and testing. It looks like this PR potentially breaks it.
I've opened this new tracker for this issue, since it doesn't look related to the changes in this PR.

…quests to them when appropriate

Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
@jzhu116-bloomberg
Contributor Author

Rebased, and all the src/test/rgw/test_multi.py -a 'rgw_down' multisite tests pass now.

Comment on lines 2909 to 2912
if not stop_2nd_rgw(zonegroup):
log.warning('Skipping the test - test_bucket_create_rgw_down: '
'more than one rgw needed in any one or multiple zone(s)')
return
Contributor

can use SkipTest so these get reported as a skip instead of a pass, ex:

raise SkipTest("test_multi_period_incremental_sync skipped. Requires 3 or more zones in master zonegroup.")

Contributor Author

done

Comment on lines 2914 to 2921
zonegroup_conns = ZonegroupConns(zonegroup)
buckets, _ = create_bucket_per_zone(zonegroup_conns, 2)
zonegroup_meta_checkpoint(zonegroup)

for zone in zonegroup_conns.zones:
assert check_all_buckets_exist(zone, buckets)

start_2nd_rgw(zonegroup)
Contributor

consider using try/finally to make sure the gateways always get restarted, even if the test fails above
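
e.g. something along these lines (sketch only, reusing the code from this hunk):

zonegroup_conns = ZonegroupConns(zonegroup)
try:
    buckets, _ = create_bucket_per_zone(zonegroup_conns, 2)
    zonegroup_meta_checkpoint(zonegroup)

    for zone in zonegroup_conns.zones:
        assert check_all_buckets_exist(zone, buckets)
finally:
    # restart the gateways even if one of the assertions above fails
    start_2nd_rgw(zonegroup)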

Contributor Author

done

@cbodley
Contributor

cbodley commented Jan 19, 2024

verified that the tests pass locally, albeit slowly, with gateways_per_zone = 2

Ran 14 tests in 1756.486s

Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
@jzhu116-bloomberg
Contributor Author

> verified that the tests pass locally, albeit slowly, with gateways_per_zone = 2
>
> Ran 14 tests in 1756.486s

The heavier the traffic, the better the performance improvement.
Currently the connection status expiration period is hard-coded as CONN_STATUS_EXPIRE_SECS = 2. This can impact the overall performance and can potentially be made configurable.

@cbodley
Contributor

cbodley commented Jan 22, 2024

> verified that the tests pass locally, albeit slowly, with gateways_per_zone = 2
>
> Ran 14 tests in 1756.486s
>
> The heavier the traffic, the better the performance improvement. Currently the connection status expiration period is hard-coded as CONN_STATUS_EXPIRE_SECS = 2. This can impact the overall performance and can potentially be made configurable.

it seems that would cause at most a 2-second delay to each test case after restarting its rgw?

i'd guess that more significant delays are caused by the fact that radosgw doesn't drop the data/mdlog shard locks it holds on shutdown, so nobody can make progress on those for the next rgw_sync_lease_period=2 minutes

i'll run this through qa. if we see that it adds too much to the runtime of rgw/multisite jobs, we might look for ways to simplify the tests

@cbodley
Contributor

cbodley commented Jan 24, 2024

https://pulpito.ceph.com/cbodley-2024-01-23_15:29:56-rgw:multisite-wip-cbodley-testing-distro-default-smithi/

the runtimes were 3h38m and 4h53m, but all of the rgw_down tests were skipped because we only run a single rgw per zone

cc @smanjara @yuvalif do you think we should add a rgw/multisite job that runs two gateways per zone so we can exercise this? i just worry that it'll take much longer. the rest of the rgw suite tends to complete in under an hour, so i often approve/merge things before the multisite jobs even complete

@smanjara
Contributor

> https://pulpito.ceph.com/cbodley-2024-01-23_15:29:56-rgw:multisite-wip-cbodley-testing-distro-default-smithi/
>
> the runtimes were 3h38m and 4h53m, but all of the rgw_down tests were skipped because we only run a single rgw per zone
>
> cc @smanjara @yuvalif do you think we should add a rgw/multisite job that runs two gateways per zone so we can exercise this? i just worry that it'll take much longer. the rest of the rgw suite tends to complete in under an hour, so i often approve/merge things before the multisite jobs even complete

we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval like during point releases. we may not have to run them for every multisite pr approval I think?

@cbodley
Contributor

cbodley commented Jan 24, 2024

> we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval like during point releases. we may not have to run them for every multisite pr approval I think?

okay, so it would live outside of qa/suites/rgw

there's already a qa/suites/rgw-multisite-upgrade subsuite that we had to move out of qa/suites/upgrade in #42869 because the multisite tests fail consistently

i guess we could add qa/suites/rgw-multisite for more comprehensive coverage, but we might want to wait until the existing tests can pass reliably

@smanjara
Contributor

> we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval like during point releases. we may not have to run them for every multisite pr approval I think?
>
> okay, so it would live outside of qa/suites/rgw
>
> there's already a qa/suites/rgw-multisite-upgrade subsuite that we had to move out of qa/suites/upgrade in #42869 because the multisite tests fail consistently
>
> i guess we could add qa/suites/rgw-multisite for more comprehensive coverage, but we might want to wait until the existing tests can pass reliably

sounds good. about the present multisite failures, I updated the result here: #54442

@cbodley
Contributor

cbodley commented Jan 24, 2024

thanks @smanjara; any objection to merging this as-is? i was able to validate the new tests locally with test_multi.py

@smanjara
Contributor

> thanks @smanjara; any objection to merging this as-is? i was able to validate the new tests locally with test_multi.py

sure, please go ahead. thanks!

@cbodley cbodley merged commit d3256c4 into ceph:main Jan 24, 2024
11 checks passed
@cbodley
Contributor

cbodley commented Jan 24, 2024

thanks @jzhu116-bloomberg, and thanks @mkogan1 for the help with testing

<< " diff=" << diff << dendl;

static constexpr uint32_t CONN_STATUS_EXPIRE_SECS = 2;
if (diff >= CONN_STATUS_EXPIRE_SECS) {
Contributor

@jzhu116-bloomberg .. could you please explain what it means if connection_status is not expired here (i.e., diff < 2 sec). Shouldn't it be considered a valid endpoint?

The check added here breaks the cloud-transition code, and perhaps the cloud-sync module too: https://tracker.ceph.com/issues/65251

Contributor

@soumyakoduri this tracks endpoints that recently returned a connection error so we can retry against other endpoints. 'not expired' here means that it failed recently enough that we shouldn't try again

however, this logic is not very good at differentiating between connection errors and other http errors. it just checks for EIO, which turns out to be the default case in rgw_http_error_to_errno()

@yuvalif and i discussed this recently in #55527, and changed some pubsub ops to return ERR_SERVICE_UNAVAILABLE instead of EIO to work around that

i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite

for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?

Contributor

> for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?

Thanks Casey.
For the simple object transition case, I can think of requests such as target bucket creation (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1217) and fetching the HEAD object (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282) which may fail. This is expected at times, and hence the errors are ignored. The errors returned by the cloud endpoint for these ops may include EIO.

There may be a few more such cases with multipart upload to the cloud.

Contributor

> i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite

if we can't prevent these cloud transition ops from returning errors that map to EIO, removing EIO from rgw_http_error_to_errno() seems like the only viable solution. i pushed #56704 as an example. @smanjara do you think we could reasonably validate such a change with multisite testing at this point?

Contributor

there are places where we return EIO explicitly but I am not sure where we map some other error to EIO. I'll pull your changes from #56704 and run it through our suite @cbodley

Contributor

@smanjara i've added that to a test batch already. i'll share the results and we can go through them together

Contributor

Thanks Casey and Shilpa
