
rgw/multisite: handle shard_id correctly in data sync when bucket num_shards is 0 #51085

Merged
merged 1 commit into from May 15, 2023

Conversation

smanjara
Contributor

@smanjara smanjara commented Apr 15, 2023

For buckets that have num_shards set to 0, the bucket instance id
will not have a shard_id delimiter. When such a bucket instance is parsed,
we end up assigning -1 to shard_id, which is invalid
in data sync. This change ensures that shard_id is represented correctly
by giving it a valid number.

https://tracker.ceph.com/issues/59489

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@smanjara smanjara requested a review from a team as a code owner April 15, 2023 00:11
@smanjara smanjara requested a review from cbodley April 15, 2023 00:12
@github-actions github-actions bot added the rgw label Apr 15, 2023
Contributor

@cbodley cbodley left a comment


the change to init_index() looks correct..

Comment on lines 210 to 213
-  get_bucket_index_objects(bucket_oid_base, idx_layout.layout.normal.num_shards,
+  get_bucket_index_objects(bucket_oid_base, rgw::num_shards(idx_layout.layout.normal),
     idx_layout.gen, bucket_objs, shard_id);
   if (bucket_instance_ids) {
-    get_bucket_instance_ids(bucket_info, idx_layout.layout.normal.num_shards,
+    get_bucket_instance_ids(bucket_info, rgw::num_shards(idx_layout.layout.normal),
Contributor


..but won't this break access to existing buckets that had num_shards=0?

Contributor Author


yeah I am still testing this.

@cbodley
Contributor

cbodley commented Apr 17, 2023

since we can't change the shard object names of existing buckets in open_bucket_index(), i don't think this is really going to fix anything. if num_shards=0 is causing other bugs, we need to root-cause and fix those instead


@smanjara
Contributor Author

smanjara commented Apr 17, 2023

> since we can't change the shard object names of existing buckets in open_bucket_index(), i don't think this is really going to fix anything. if num_shards=0 is causing other bugs, we need to root-cause and fix those instead

@cbodley for a simple case of a bucket with num_shards 0 (without this fix), an incremental sync causes radosgw to crash:

 8: (RGWSyncBucketCR::operate(DoutPrefixProvider const*)+0x1e84) [0x55aa93b1cb92]
 9: (RGWCoroutine::operate_wrapper(DoutPrefixProvider const*)+0xa) [0x55aa936271de]
 10: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x118) [0x55aa937727a8]
 11: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x12e) [0x55aa937731b4]
 12: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x84) [0x55aa9377400a]
 13: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x85) [0x55aa93b216dd]
 14: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x46) [0x55aa93974760]

but you are right that the older buckets will still have the same shard object names and an incremental sync can still result in the crash. Looking further into what's causing it.

seems to be crashing when CheckBucketShardStatusIsIncremental() is called. specifically at:
https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_data_sync.cc#L3382

@smanjara
Contributor Author

> since we can't change the shard object names of existing buckets in open_bucket_index(), i don't think this is really going to fix anything. if num_shards=0 is causing other bugs, we need to root-cause and fix those instead
>
> @cbodley for a simple case of a bucket with num_shards 0 (without this fix), an incremental sync causes radosgw to crash:
>
>  8: (RGWSyncBucketCR::operate(DoutPrefixProvider const*)+0x1e84) [0x55aa93b1cb92]
>  9: (RGWCoroutine::operate_wrapper(DoutPrefixProvider const*)+0xa) [0x55aa936271de]
>  10: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x118) [0x55aa937727a8]
>  11: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x12e) [0x55aa937731b4]
>  12: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x84) [0x55aa9377400a]
>  13: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x85) [0x55aa93b216dd]
>  14: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x46) [0x55aa93974760]
>
> but you are right that the older buckets will still have the same shard object names and an incremental sync can still result in the crash. Looking further into what's causing it.
>
> seems to be crashing when CheckBucketShardStatusIsIncremental() is called. specifically at: https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_data_sync.cc#L3382

huh, actually no. it happens right when we enter the incremental sync at https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_data_sync.cc#L5864.
we assert that shard_id is non-negative, and the shard_id we pass happens to be -1.

@smanjara
Contributor Author

smanjara commented Apr 18, 2023

so it comes down to the way we parse a bucket instance without the shard id delimiter in
https://github.com/ceph/ceph/blob/main/src/rgw/rgw_bucket.cc#L57, where shard_id gets the value of -1.
we could probably fix it by telling data sync to treat it as 0 instead.

@cbodley
Contributor

cbodley commented Apr 18, 2023

yeah, i think it makes sense for data sync to treat -1 as 1 shard

…case

For buckets that have num_shards set to 0, the bucket instance id
will not have a shard_id delimiter. When such a bucket instance is parsed,
we end up assigning -1 to shard_id, which is invalid
in data sync. This change ensures that shard_id is represented correctly
by giving it a valid number.

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
@smanjara smanjara changed the title rgw/multisite: ensure that num_shards 0 is interpreted as 1 while cre… rgw/multisite: handle shard_id correctly in data sync when bucket num_shards is 0 Apr 18, 2023
@smanjara
Contributor Author

@cbodley please review.

@smanjara
Contributor Author

smanjara commented May 1, 2023

@cbodley could you review this one please?

@cbodley
Contributor

cbodley commented May 2, 2023

jenkins test api

@smanjara smanjara merged commit 835de6d into ceph:main May 15, 2023
10 of 15 checks passed
@cbodley
Contributor

cbodley commented May 15, 2023

@smanjara can you please make sure these fixes make it to reef?
