Skip to content

rgw: fix PUT on versioned bucket fails with NoSuchKey#38705

Closed
mkogan1 wants to merge 1 commit intoceph:masterfrom
mkogan1:wip-rgw-broken-pipe
Closed

rgw: fix PUT on versioned bucket fails with NoSuchKey#38705
mkogan1 wants to merge 1 commit intoceph:masterfrom
mkogan1:wip-rgw-broken-pipe

Conversation

@mkogan1
Copy link
Copy Markdown
Contributor

@mkogan1 mkogan1 commented Dec 23, 2020

Fixes: https://tracker.ceph.com/issues/48709

Signed-off-by: Mark Kogan mkogan@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@mkogan1
Copy link
Copy Markdown
Contributor Author

mkogan1 commented Dec 23, 2020

ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << dendl;
if (ret == -ENOENT) {
ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << ", modifying to " << -ERR_INTERNAL_ERROR << dendl;
ret = -ERR_INTERNAL_ERROR;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change on this path seems sane--nothing has I think happened, so request seems definitely retryable

int ret = get_bucket_shard(&bs);
if (ret < 0) {
ldout(store->ctx(), 5) << "failed to get BucketShard object: ret=" << ret << dendl;
if (ret == -ENOENT) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less sure about this path; what request are we executing, and what has happened in the current request to this point?

for (int i = 0; i < NUM_RESHARD_RETRIES; ++i) {
r = bs->init(pobj->bucket, *pobj, nullptr /* no RGWBucketInfo */);
if (r < 0) {
ldout(cct, 5) << "bs.init() returned ret=" << r << dendl;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming this is the same as RGWRados::Bucket::UpdateIndex::guard_reshard()--but I didn't realize there were two guard_reshard() methods

bs.init(obj_instance.bucket, obj_instance, nullptr /* no RGWBucketInfo */);
if (ret < 0) {
ldout(cct, 5) << "bs.init() returned ret=" << ret << dendl;
if (ret == -ENOENT) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here again I'm unsure whether to trust that the current request has had no side effect/is safely retryable

bs.init(obj_instance.bucket, obj_instance, nullptr /* no RGWBucketInfo */);
if (ret < 0) {
ldout(cct, 5) << "bs.init() returned ret=" << ret << dendl;
if (ret == -ENOENT) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto here; what has happened to this point, for the current request?

@cbodley
Copy link
Copy Markdown
Contributor

cbodley commented Jan 5, 2021

this looks valid as a workaround, but i think we're still missing something here

once a reshard completes, guard_reshard() calls target->update_bucket_id() so it can retry the op against the new bucket instance/index. we read that new bucket instance metadata into RGWRados::Bucket::bucket_info and use that for the retry

but when writing versioned objects, there are multiple calls to guard_reshard() in RGWRados::Object::Write::_do_write_meta(). if the first one hits this update_bucket_id() case, the updated RGWRados::Bucket::bucket_info will only be used for the retry - later calls to guard_reshard() would still be using the original pre-resharded version of the bucket_info, and so get ENOENT once reshard finishes deleting it

it sounds like everything under _do_write_meta() should be passing the bucket info as a mutable reference RGWBucketInfo&, so that the call to update_bucket_id() applies throughout. does that make sense?

@mattbenjamin
Copy link
Copy Markdown
Contributor

@cbodley it sounds right in that uses after reshard starts would be uniform?

@github-actions
Copy link
Copy Markdown

github-actions bot commented May 5, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@stale
Copy link
Copy Markdown

stale bot commented Jul 21, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jul 21, 2021
@mkogan1
Copy link
Copy Markdown
Contributor Author

mkogan1 commented Aug 5, 2021

Unstale please

@stale stale bot removed the stale label Aug 5, 2021
@stale
Copy link
Copy Markdown

stale bot commented Jan 9, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jan 9, 2022
@cbodley cbodley removed the needs-qa label Jan 27, 2022
@stale stale bot removed the stale label Jan 27, 2022
@mkogan1
Copy link
Copy Markdown
Contributor Author

mkogan1 commented Mar 16, 2022

fixed by

  1. cls/rgw: rgw_dir_suggest_changes detects race with completion #45345 -- cls/rgw: rgw_dir_suggest_changes detects race with completion
  2. rgw: Update "CEPH_RGW_DIR_SUGGEST_LOG_OP" for remove entries #45300 -- rgw: Update "CEPH_RGW_DIR_SUGGEST_LOG_OP" for remove entries

@mkogan1 mkogan1 closed this Mar 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants