-
Notifications
You must be signed in to change notification settings - Fork 6.3k
rgw/lifecycle-notification: Do not block lc processing for notification errors. #57079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/rgw/rgw_lc.cc
Outdated
| static std::string lc_id = "rgw lifecycle"; | ||
| static std::string lc_req_id = "0"; | ||
|
|
||
| static void may_be_send_notification(const DoutPrefixProvider *dpp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we can change to send_notification() sending is not conditional, just could fail.
the fact that it does not return the error code mean that the failure should be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
1607dff to
e1f2af9
Compare
e1f2af9 to
ff71cea
Compare
src/rgw/rgw_lc.cc
Outdated
| if (publish_ret < 0) { | ||
| ldpp_dout(dpp, 5) << "WARNING: notify publish_commit failed, with error: " << publish_ret << dendl; | ||
| } | ||
| send_notification(dpp, driver, obj.get(), oc.bucket, etag, obj_state->size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned by Casey in #57356 (comment), obj_state may not be valid post transition/deletion., esp. after cloud transition which can reset obj_size to '0' or delete the object.
Similar to etag, we would need to read obj_state->size before any LC operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obj_state->size
@soumyakoduri is obj_state->size an attribute as well like etag ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obj_state->size
@soumyakoduri is
obj_state->sizean attribute as well like etag ?
done, started using the obj->get_obj_size(), and verified the size is populated.
c810e78 to
f9148ef
Compare
src/rgw/rgw_lc.cc
Outdated
| return; | ||
| } | ||
| ret = | ||
| notify->publish_commit(dpp, obj->get_obj_size(), ceph::real_clock::now(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably this will get replaced with new SAL APIs to fetch right obj_state & size from the backend store.
Are there any test-cases to check notifications with LC expiration & cloud-transition? Post transition to cloud, either the object's HEAD is retained with data erased (so the size will be '0' bytes) or it is deleted (same as the case of LC expiration), in which in case get_obj_size() may return ENOENT if that call is passed to backend store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not made any changes as part of this PR to use sal object, I have just reorganised the code
If a future PR is gonna change this behaviour then should this PR be blocked for that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @yuvalif
the bucket notification tests do include some coverage for lifecycle, but not cloud transition:
https://github.com/ceph/ceph/blob/a052683/src/test/rgw/bucket_notification/test_bn.py#L1720-L1913
the rgw/notifications teuthology suite is what runs this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in which in case get_obj_size() may return ENOENT if that call is passed to backend store.
similar to Object::get_attrs(), get_obj_size() is just returning the cached object state, so shouldn't ever go to the backend. some mutations may update that cached state, but i don't think delete_obj() does. callers probably shouldn't expect getters like get_obj_size() to return sensible answers after deletion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in which in case get_obj_size() may return ENOENT if that call is passed to backend store.
similar to
Object::get_attrs(), get_obj_size() is just returning the cached object state, so shouldn't ever go to the backend. some mutations may update that cached state, but i don't thinkdelete_obj()does. callers probably shouldn't expect getters likeget_obj_size()to return sensible answers after deletion
@cbodley so you suggesting, instead of calling the get_obj_size after the object is removed (lc'd), just get the size prior of deletion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like was done for etag, yeah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
f9148ef to
745beae
Compare
soumyakoduri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest all looks good to me.
valgrind errors are related to |
there were other prs in the test batch so it's hard to tell whether the d4n failure is related: trying to run against main for comparison |
|
https://pulpito.ceph.com/cbodley-2024-06-05_13:55:19-rgw-main-distro-default-smithi/ against main shows the same valgrind and d4n failures, now tracked in https://tracker.ceph.com/issues/66336 and https://tracker.ceph.com/issues/66365 there was also a notification job with these failures:
@yuvalif i'm guessing those are unrelated issues with the mock http server? (note this batch included changes to persistent topics from #57536) |
|
|
jenkins test make check |
the persistent topic stats test is failing unrelated to the change here. |
ok, i'll push a rebase of this and #57536 and rerun against that |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
rerunning dead jobs from https://pulpito.ceph.com/cbodley-2024-06-10_12:56:39-rgw-wip-cbodley-testing-distro-default-smithi/ |
|
re-rerun looks good enough: https://pulpito.ceph.com/cbodley-2024-06-11_19:40:35-rgw-wip-cbodley-testing-distro-default-smithi/ @kchheda3 if the rebase is trivial, we don't need to qa again |
…on errors. Currently if there is any error while calling publish_reserve the lc processing is cancelled for that object. This is different from behavior we have for replication events where the notification errors are not blocking replication. On similar note, lc being internal ceph processing, notification error's should not block the lc processing. Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
…s for processing for object tag decode errors. This was the behavior prior to ceph#55795, however while refactoring this behavior was changed, hence revert back the logic to not send the events if the obj_tag decoding fails. Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
9271b3f to
02295f9
Compare
was a trivial rebase, rename of event |
|
|
jenkins test api |
|
|
jenkins test make check |
1 similar comment
|
jenkins test make check |
|
jenkins test api |
|
jenkins pls 🙏 |
Fixes --> https://tracker.ceph.com/issues/66018
Currently if there is any error while calling publish_reserve the lc processing is cancelled for that object. This is different from behavior we have for replication events where the notification errors are not blocking replication. On similar note, lc being internal ceph processing,
do not block lc if there are notification error's.
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e