
rgw/multisite: maintain endpoints connectable status and retry the requests to them when appropriate #53320

Merged
merged 2 commits into ceph:main on Jan 24, 2024

Conversation

jzhu116-bloomberg
Contributor

Maintain endpoints connectable status and retry the requests to them when appropriate.

Fixes: https://tracker.ceph.com/issues/62710
Signed-off-by: Jane Zhu jzhu116@bloomberg.net


@jzhu116-bloomberg jzhu116-bloomberg requested a review from a team as a code owner September 6, 2023 19:30
@github-actions github-actions bot added the rgw label Sep 6, 2023
@jzhu116-bloomberg
Contributor Author

Replication lag comparison between pre-fix and post-fix.

Multisite clusters configuration:

  • 2 clusters
  • each cluster has 3 rgw nodes, and 16 rgw instances per node (with 8 client-facing on the primary site)
  • all rgw instances are added in the zonegroup settings
  • shut down all the rgw instances on one primary rgw node

Pre-fix:

  • Client traffic: cosbench write only, 15 users, 30 workers, 600 seconds
  • Replication lag: 49 mins

Post-fix:

  • Client traffic: cosbench write only (with similar pattern to the pre-fix test), 15 users, 30 workers, 3600 seconds
  • Replication lag: roughly 16 mins

@@ -67,8 +67,12 @@ inline param_vec_t make_param_list(const std::map<std::string, std::string> *pp)

class RGWRESTConn
{
/* the endpoint is not able to connect if the timestamp is not real_clock::zero */
using endpoint_status_map = std::unordered_map<std::string, std::atomic<ceph::real_time>>;
Contributor Author

This is a comment @cbodley made earlier in this PR on the same changes, which have since been moved and updated here.

endpoints_status could probably just be a std::map<std::string, ceph::real_time> that only stores the failed endpoints. std::unordered_map is probably overkill when we only expect a handful of endpoints per zone. boost::container::flat_map could be a good alternative to reduce memory allocations

Here are the reasons why I still chose to use std::unordered_map to maintain the status for each and every endpoint:

  • the endpoints_status map itself is created when the RGWRESTConn is instantiated and won't be updated after that, hence the map structure itself doesn't need to be thread-safe.
  • however, the timestamp/status of each endpoint might be accessed simultaneously by multiple threads, so it should be handled in a thread-safe way. In order to reduce the lock granularity, I chose to use std::atomic for each endpoint instead of locking the entire map. Since the majority of operations on the map would be read-only, this can reduce the overall time spent on locking.
  • std::atomic is neither copyable nor movable, hence I'm not able to use boost::container::flat_map here.

@jzhu116-bloomberg
Contributor Author

jenkins test make check

@jzhu116-bloomberg
Contributor Author

jenkins test ceph API tests

@jzhu116-bloomberg
Contributor Author

jenkins test make check

Contributor

@mkogan1 mkogan1 left a comment

@cbodley
tested on a 3x3 setup of RGWs dedicated to sync

the graph shows the sync rate from PRI to SEC; the bottom scale is seconds:
at the 1400-second mark, 1 of the 3 RGWs was stopped and syncing continued with 2 RGWs out of 3
at the 3000-second mark, another RGW was stopped and syncing continued with 1 RGW out of 3

[image: graph of sync rate from PRI to SEC]

@github-actions

github-actions bot commented Oct 9, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@jzhu116-bloomberg
Contributor Author

rebased

@jzhu116-bloomberg
Contributor Author

jenkins test make check

@jzhu116-bloomberg
Contributor Author

does this PR need the needs-qa flag?


github-actions bot commented Nov 6, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@jzhu116-bloomberg
Contributor Author

jenkins test api

@cbodley
Contributor

cbodley commented Nov 29, 2023

@jzhu116-bloomberg regarding multisite functional tests, there's a README in src/test/rgw/test_multi.md for running the tests locally from a ceph build directory

the test cases themselves are in src/test/rgw/rgw_multi/tests.py. see test_object_sync() for a simple example

src/test/rgw/rgw_multi/multisite.py contains the abstractions for Gateway, Zone, etc. individual gateways can be stopped/started with Gateway.stop()/start(). there's also a Zone.stop()/start() that does the same for all gateways in the zone. there are several examples like test_bucket_full_sync_after_data_sync_init() that use the Zone-level stop/start, but i don't know of any that control gateways individually

a central concept of these tests is the 'checkpoints', like zonegroup_bucket_checkpoint() which waits for bucket sync on all zones in the zonegroup to catch up to all others, then verifies that all of their contents match. so you'd use some combination of object writes, bucket checkpoints, and gateway stops/starts to verify that sync continues while gateways are down
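
for example, a rough sketch of how those pieces could fit together (illustrative only, not an existing test case; helper signatures are approximate and the per-gateway accessor is an assumption):

# illustrative sketch only -- the helpers are the ones described above from
# rgw_multi/tests.py, but exact signatures and the gateway accessor may differ
def test_object_sync_with_gateway_down(zonegroup):
    zonegroup_conns = ZonegroupConns(zonegroup)
    buckets, _ = create_bucket_per_zone(zonegroup_conns)
    zonegroup_meta_checkpoint(zonegroup)

    gw = zonegroup.zones[0].gateways[1]   # hypothetical: second gateway of the first zone
    gw.stop()                             # Gateway.stop(): take this one gateway down

    # write an object while the gateway is down (put_object is a hypothetical helper)
    put_object(zonegroup_conns.zones[0], buckets[0], 'obj1', 'data')

    # wait for bucket sync on all zones to catch up to each other, then verify contents match
    zonegroup_bucket_checkpoint(zonegroup_conns, buckets[0])

    gw.start()                            # Gateway.start(): bring the gateway back up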

does that make sense?

@jzhu116-bloomberg
Contributor Author

> does that make sense?
Yup, it makes sense. Thanks!
I'll create some test cases.

@jzhu116-bloomberg
Contributor Author

Added some multisite test cases, trying to cover as much of the code paths changed in this PR as possible. In these new test cases, the 2nd rgw instance in each zone is brought down before the test and brought back up after it.

All the newly added test cases are under @attr('rgw_down'). The following is an example of how to run all of them.

$ MON=1 OSD=1 MGR=0 MDS=0 RGW_MULTI_TEST_CONF=./test_multi.conf nosetests -s /ceph-src/src/test/rgw/test_multi.py -a 'rgw_down'

All the newly added test cases passed in my local run.
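
For reference, the selection works through nose's attrib plugin: each new test carries the @attr('rgw_down') decorator, and the -a 'rgw_down' flag above picks them up. A minimal sketch of the tagging (the test body itself is elided here):

from nose.plugins.attrib import attr

@attr('rgw_down')   # matched by: nosetests ... -a 'rgw_down'
def test_bucket_create_rgw_down():
    # body: skip unless a second rgw per zone is available, stop it, create
    # buckets, checkpoint, verify on every zone, then restart the gateways
    ...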

@jzhu116-bloomberg
Contributor Author

jzhu116-bloomberg commented Jan 10, 2024

All the newly added test cases worked well until I pulled the main branch, where the test case "test_role_sync" fails in forward_iam_request_to_master.
I did some digging and testing. It looks like this PR potentially breaks it.
I've opened this new tracker for this issue, since it doesn't look related to the changes in this PR.

…quests to them when appropriate

Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
@jzhu116-bloomberg
Contributor Author

Rebased, and all the src/test/rgw/test_multi.py -a 'rgw_down' multisite tests pass now.

Comment on lines 2909 to 2912
if not stop_2nd_rgw(zonegroup):
log.warning('Skipping the test - test_bucket_create_rgw_down: '
'more than one rgw needed in any one or multiple zone(s)')
return
Contributor

can use SkipTest so these get reported as a skip instead of a pass, ex:

raise SkipTest("test_multi_period_incremental_sync skipped. Requires 3 or more zones in master zonegroup.")

Contributor Author

done

Comment on lines 2914 to 2921
zonegroup_conns = ZonegroupConns(zonegroup)
buckets, _ = create_bucket_per_zone(zonegroup_conns, 2)
zonegroup_meta_checkpoint(zonegroup)

for zone in zonegroup_conns.zones:
assert check_all_buckets_exist(zone, buckets)

start_2nd_rgw(zonegroup)
Contributor

consider using try/finally to make sure the gateways always get restarted, even if the test fails above
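
e.g. something along these lines (sketch only, reusing the code from this hunk):

zonegroup_conns = ZonegroupConns(zonegroup)
try:
    buckets, _ = create_bucket_per_zone(zonegroup_conns, 2)
    zonegroup_meta_checkpoint(zonegroup)

    for zone in zonegroup_conns.zones:
        assert check_all_buckets_exist(zone, buckets)
finally:
    # restart the gateways even if one of the assertions above fails
    start_2nd_rgw(zonegroup)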

Contributor Author

done

@cbodley
Contributor

cbodley commented Jan 19, 2024

verified that the tests pass locally, albeit slowly, with gateways_per_zone = 2

Ran 14 tests in 1756.486s

Signed-off-by: Juan Zhu <jzhu4@dev-10-34-20-139.pw1.bcc.bloomberg.com>
@jzhu116-bloomberg
Contributor Author

> verified that the tests pass locally, albeit slowly, with gateways_per_zone = 2
>
> Ran 14 tests in 1756.486s

The heavier the traffic, the better the performance improvement.
Currently the connection status expiration period is hard-coded as CONN_STATUS_EXPIRE_SECS = 2. This can impact the overall performance and can potentially be made configurable.

@cbodley
Contributor

cbodley commented Jan 22, 2024

> verified that the tests pass locally, albeit slowly, with gateways_per_zone = 2
>
> Ran 14 tests in 1756.486s
>
> The heavier the traffic, the better the performance improvement. Currently the connection status expiration period is hard-coded as CONN_STATUS_EXPIRE_SECS = 2. This can impact the overall performance and can potentially be made configurable.

it seems that would cause at most a 2-second delay to each test case after restarting its rgw?

i'd guess that more significant delays are caused by the fact that radosgw doesn't drop the data/mdlog shard locks it holds on shutdown, so nobody can make progress on those for the next rgw_sync_lease_period=2 minutes

i'll run this through qa. if we see that it adds too much to the runtime of rgw/multisite jobs, we might look for ways to simplify the tests

@cbodley
Contributor

cbodley commented Jan 24, 2024

https://pulpito.ceph.com/cbodley-2024-01-23_15:29:56-rgw:multisite-wip-cbodley-testing-distro-default-smithi/

the runtimes were 3h38m and 4h53m, but all of the rgw_down tests were skipped because we only run a single rgw per zone

cc @smanjara @yuvalif do you think we should add a rgw/multisite job that runs two gateways per zone so we can exercise this? i just worry that it'll take much longer. the rest of the rgw suite tends to complete in under an hour, so i often approve/merge things before the multisite jobs even complete

@smanjara
Contributor

> https://pulpito.ceph.com/cbodley-2024-01-23_15:29:56-rgw:multisite-wip-cbodley-testing-distro-default-smithi/
>
> the runtimes were 3h38m and 4h53m, but all of the rgw_down tests were skipped because we only run a single rgw per zone
>
> cc @smanjara @yuvalif do you think we should add a rgw/multisite job that runs two gateways per zone so we can exercise this? i just worry that it'll take much longer. the rest of the rgw suite tends to complete in under an hour, so i often approve/merge things before the multisite jobs even complete

we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval like during point releases. we may not have to run them for every multisite pr approval I think?

@cbodley
Contributor

cbodley commented Jan 24, 2024

> we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval like during point releases. we may not have to run them for every multisite pr approval I think?

okay, so it would live outside of qa/suites/rgw

there's already a qa/suites/rgw-multisite-upgrade subsuite that we had to move out of qa/suites/upgrade in #42869 because the multisite tests fail consistently

i guess we could add qa/suites/rgw-multisite for more comprehensive coverage, but we might want to wait until the existing tests can pass reliably

@smanjara
Contributor

> we could add two gateways per zone in a separate yaml and have those tests run periodically at some interval like during point releases. we may not have to run them for every multisite pr approval I think?
>
> okay, so it would live outside of qa/suites/rgw
>
> there's already a qa/suites/rgw-multisite-upgrade subsuite that we had to move out of qa/suites/upgrade in #42869 because the multisite tests fail consistently
>
> i guess we could add qa/suites/rgw-multisite for more comprehensive coverage, but we might want to wait until the existing tests can pass reliably

sounds good. about the present multisite failures, I updated the result here: #54442

@cbodley
Contributor

cbodley commented Jan 24, 2024

thanks @smanjara; any objection to merging this as-is? i was able to validate the new tests locally with test_multi.py

@smanjara
Contributor

> thanks @smanjara; any objection to merging this as-is? i was able to validate the new tests locally with test_multi.py

sure, please go ahead. thanks!

@cbodley cbodley merged commit d3256c4 into ceph:main Jan 24, 2024
11 checks passed
@cbodley
Contributor

cbodley commented Jan 24, 2024

thanks @jzhu116-bloomberg, and thanks @mkogan1 for the help with testing

<< " diff=" << diff << dendl;

static constexpr uint32_t CONN_STATUS_EXPIRE_SECS = 2;
if (diff >= CONN_STATUS_EXPIRE_SECS) {
Contributor

@jzhu116-bloomberg .. could you please explain what it means if connection_status is not expired here (i.e., diff < 2 sec). Shouldn't it be considered a valid endpoint?

The check added here breaks the cloud-transition code, and perhaps the cloud-sync module too: https://tracker.ceph.com/issues/65251

Contributor

@soumyakoduri this tracks endpoints that recently returned a connection error so we can retry against other endpoints. 'not expired' here means that it failed recently enough that we shouldn't try again

however, this logic is not very good at differentiating between connection errors and other http errors. it just checks for EIO, which turns out to be the default case in rgw_http_error_to_errno()

@yuvalif and i discussed this recently in #55527, and changed some pubsub ops to return ERR_SERVICE_UNAVAILABLE instead of EIO to work around that

i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite

for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?

Contributor

> for your cloud transition testing, can you tell which earlier request is getting mapped to EIO?

Thanks Casey.
For the simple object transition case, I can think of requests such as target bucket creation (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1217) and fetching the HEAD object (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282) which may fail. This is expected at times, and hence the errors are ignored. The errors returned by the cloud endpoint for these ops may include EIO.

There may be a few more such cases with multipart upload to the cloud.

Contributor

> i would like to find some other default error code for rgw_http_error_to_errno(), but it's not clear what that might break in multisite

if we can't prevent these cloud transition ops from returning errors that map to EIO, removing EIO from rgw_http_error_to_errno() seems like the only viable solution. i pushed #56704 as an example. @smanjara do you think we could reasonably validate such a change with multisite testing at this point?

Contributor

there are places where we return EIO explicitly but I am not sure where we map some other error to EIO. I'll pull your changes from #56704 and run it through our suite @cbodley

Contributor

@smanjara i've added that to a test batch already. i'll share the results and we can go through them together

Contributor

Thanks Casey and Shilpa
