rgw/cloud-transition: fix the crash with publish_commit #57356

Merged: 2 commits into ceph:main on May 13, 2024

Conversation

soumyakoduri
Contributor

As part of cloud transition, the object's head/attrs may get updated, so state->attrs is no longer valid afterwards. Fetch obj_state after the transition to access the attrs.

Fixes: https://tracker.ceph.com/issues/65862
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
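To make the hazard in the description concrete, here is a minimal sketch of reading an etag through a state pointer that was obtained before the transition. `ObjState`, `FakeObject` and the `"etag"` key are hypothetical stand-ins, not the Ceph types. The sketch models only the observable effect (attrs and size changing behind the pointer); in the real code the pointer may no longer be valid at all.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for illustration; these are not the Ceph types.
struct ObjState {
  uint64_t size = 0;
  std::map<std::string, std::string> attrset;
};

class FakeObject {
 public:
  FakeObject(uint64_t size, const std::string& etag) {
    state_.size = size;
    state_.attrset["etag"] = etag;
  }

  // Hands out a pointer into the object's own state.
  ObjState* get_obj_state() { return &state_; }

  // Assumption: models a cloud transition rewriting the head, so attrs
  // and size change behind any pointer obtained earlier.
  void transition_to_cloud() {
    state_.attrset.clear();
    state_.size = 0;
  }

 private:
  ObjState state_;
};

// Reads the etag through a pointer taken before the transition.
std::string etag_read_after_transition() {
  FakeObject obj(1024, "abc123");
  ObjState* st = obj.get_obj_state();
  assert(st->attrset.count("etag") == 1);  // present before the transition

  obj.transition_to_cloud();

  // The pointer still refers to the object's state, but the etag is gone.
  std::map<std::string, std::string>::const_iterator it =
      st->attrset.find("etag");
  return it == st->attrset.end() ? std::string() : it->second;
}
```

In this simplified model the late read yields an empty string rather than a crash.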

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

soumyakoduri requested a review from a team as a code owner May 8, 2024 10:49
github-actions bot added the rgw label May 8, 2024
soumyakoduri requested review from mattbenjamin, yuvalif and cbodley and removed request for a team May 8, 2024 10:51
dang added the needs-qa label May 8, 2024
Comment on lines 1439 to 1440
RGWObjState* obj_state{nullptr};
ret = obj->get_obj_state(oc.dpp, &obj_state, null_yield, true);
Contributor

we don't really need RGWObjState to get attrs, do we? what does obj->get_attrs() return at this point?

Contributor

if I am not mistaken, you do indeed need it

Contributor

if you omit get obj state, you'll get an empty attrs sequence

Contributor

(ask me how I know)

Contributor

RadosObject::transition_to_cloud() calls RadosReadOp::prepare() to read the head object. this part should initialize those attrs:
https://github.com/ceph/ceph/blob/9b6d380/src/rgw/driver/rados/rgw_sal_rados.cc#L2738

Contributor Author

I just noticed that in remove_expired_obj(), which is used in regular LC expiration too, publish_commit is called after the object is deleted - https://github.com/ceph/ceph/blob/9b6d380/src/rgw/rgw_lc.cc#L625 . Will the attrset still be valid?

Contributor Author

Will it be cleaner and safe to just save etag in all the callers of publish_commit() before applying any LC operation (like you mentioned above)?

Contributor

@cbodley cbodley May 8, 2024

> Will it be cleaner and safe to just save etag in all the callers of publish_commit() before applying any LC operation (like you mentioned above)?

i think so

in the refactoring call, we talked about changes to get_obj_state() to avoid the dangling RGWObjState*. @dang's suggestion is to rename it to load_obj_state() and rely on StoreObject::state to store its updated state. so we'd call obj->load_obj_state() then read the etag from obj->get_attrs(). if you leave the get_obj_state() call where it was but just save the etag, it should be easy to follow up with those other changes later

what do you think @mattbenjamin @dang?
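A rough sketch of the shape suggested above, where the object owns its state and callers read attrs from it after an explicit load. The method names `load_obj_state()` and `get_attrs()` come from the comment; the class `SketchObject`, the `Attrs` alias, the `"etag"` key and all method bodies are hypothetical and only illustrate the calling pattern, not the actual refactoring.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

using Attrs = std::map<std::string, std::string>;

// Hypothetical object type; method names follow the suggestion in the
// comment above, but the bodies are invented for illustration.
class SketchObject {
 public:
  explicit SketchObject(Attrs head) : head_(std::move(head)) {}

  // Refreshes the cached state from the (simulated) head object.
  // Returns 0 on success, following an int error-code convention.
  int load_obj_state() {
    attrs_ = head_;
    return 0;
  }

  // Returns the cached attrs; empty until load_obj_state() has run.
  const Attrs& get_attrs() const { return attrs_; }

 private:
  Attrs head_;   // stands in for what is stored in the head object
  Attrs attrs_;  // cached copy owned by the object itself
};

// Caller pattern: load the state, then copy the etag out by value so
// the caller holds no pointer into the object's state.
std::string read_etag(SketchObject& obj) {
  if (obj.load_obj_state() < 0) {
    return std::string();
  }
  const Attrs& attrs = obj.get_attrs();
  Attrs::const_iterator it = attrs.find("etag");
  return it == attrs.end() ? std::string() : it->second;
}
```

Because the caller never receives a raw state pointer in this shape, there is nothing left to dangle once the object refreshes itself.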

Contributor

I took the original commit for 7.1, but don't object to the idea, I'll rebase it if we take that version this weekish

Contributor Author

@dang @cbodley @mattbenjamin updated the PR with the changes as discussed above. Please review.

The obj_state may not be valid anymore post LC operations (esp.,
cloud-transition). Hence read and store etag prior to them to be used
later by notification (publish_commit).

Fixes: https://tracker.ceph.com/issues/65862
Signed-off-by: Soumya Koduri <skoduri@redhat.com>

LC Cloud transition should use set_atomic() to prevent any overwrite
while updating the HEAD object.

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Contributor

cbodley commented May 9, 2024

https://jenkins.ceph.com/job/ceph-api/73729/

ERROR: test_list_enabled_module (tasks.mgr.dashboard.test_mgr_module.MgrModuleTest)

commented (again) on https://tracker.ceph.com/issues/62972

Contributor

cbodley commented May 9, 2024

jenkins test api

@soumyakoduri
Contributor Author

jenkins test make check arm64

cbodley merged commit 7528bf5 into ceph:main on May 13, 2024
11 checks passed
@@ -1445,7 +1460,7 @@ class LCOpAction_Transition : public LCOpAction {
 // send request to notification manager
 int publish_ret = notify->publish_commit(oc.dpp, obj_state->size,
 ceph::real_clock::now(),
-obj_state->attrset[RGW_ATTR_ETAG].to_str(),
+etag,
Contributor

sorry, just noticed when reviewing #57079 - this call to publish_commit() still relies on obj_state->size. don't we have the same lifetime issue there?

Contributor Author

Right, and moreover, post cloud-transition obj_size can be '0', which can be misleading in the notification.

Contributor

right, we should save the value of size before the expiration/transition like you did for etag

what do you want to do about the downstream version of this change?
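The idea of saving size alongside etag before the expiration/transition could look roughly like the sketch below. All names here (`HeadState`, `Published`, `transition`, `transition_and_publish`, the `"etag"` key) are hypothetical and do not match the Ceph signatures. The point is only the ordering: copy the values by value first, run the operation, then publish from the copies.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins; not the Ceph types or signatures.
struct HeadState {
  uint64_t size = 0;
  std::map<std::string, std::string> attrset;
};

// What the (simulated) notification ends up carrying.
struct Published {
  uint64_t size;
  std::string etag;
};

// Stand-in for the LC operation: assumed to rewrite the head in place,
// leaving size 0 and no attrs behind.
void transition(HeadState& head) {
  head.attrset.clear();
  head.size = 0;
}

// Copies what the notification needs before running the operation,
// then publishes from the copies instead of from the live state.
Published transition_and_publish(HeadState& head) {
  const uint64_t size = head.size;
  std::string etag;
  std::map<std::string, std::string>::const_iterator it =
      head.attrset.find("etag");
  if (it != head.attrset.end()) {
    etag = it->second;
  }

  transition(head);

  Published out;
  out.size = size;
  out.etag = etag;
  return out;
}
```

With this ordering the notification reports the pre-transition size and etag even though the live state has been reset.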

4 participants