@kchheda3 kchheda3 commented Apr 24, 2024

Fixes: https://tracker.ceph.com/issues/66018
Currently, if any error occurs while calling publish_reserve, lc processing is cancelled for that object. This differs from the behavior for replication events, where notification errors do not block replication. Similarly, since lc is internal Ceph processing,
do not block lc when there are notification errors.
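
The intended behavior can be sketched as follows. This is a minimal illustration, not the RGW code: `FakeNotifier`, `process_object`, and the ordering of the lifecycle action relative to the notification are assumptions made for the example. Only the names `publish_reserve`, `publish_commit`, and `send_notification` come from the discussion.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for a notification handle; not a Ceph type.
struct FakeNotifier {
  int reserve_result;    // value publish_reserve() will return
  int commit_calls = 0;  // how many times publish_commit() ran

  explicit FakeNotifier(int r) : reserve_result(r) {}
  int publish_reserve() const { return reserve_result; }
  int publish_commit() { ++commit_calls; return 0; }
};

// Best effort: failures are logged and dropped, so nothing is returned.
void send_notification(FakeNotifier& notifier, const std::string& key) {
  int ret = notifier.publish_reserve();
  if (ret < 0) {
    std::cerr << "WARNING: publish_reserve failed for " << key
              << ", with error: " << ret << '\n';
    return;
  }
  ret = notifier.publish_commit();
  if (ret < 0) {
    std::cerr << "WARNING: publish_commit failed for " << key
              << ", with error: " << ret << '\n';
  }
}

// The result reflects only the lifecycle action, never the notification.
int process_object(FakeNotifier& notifier, const std::string& key,
                   const std::function<int()>& lc_action) {
  assert(lc_action);  // the caller must supply an action
  const int ret = lc_action();
  if (ret < 0) {
    return ret;  // a failed lifecycle action is still reported
  }
  send_notification(notifier, key);
  return 0;
}
```

With a notifier whose reserve step fails, `process_object` still returns 0 when the lifecycle action itself succeeds, which is the non-blocking behavior described above.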


@kchheda3 kchheda3 requested a review from a team as a code owner April 24, 2024 16:28
@kchheda3 kchheda3 self-assigned this Apr 24, 2024
@github-actions github-actions bot added the rgw label Apr 24, 2024
@kchheda3 kchheda3 requested a review from BBoozmen April 24, 2024 16:28
static std::string lc_id = "rgw lifecycle";
static std::string lc_req_id = "0";

static void may_be_send_notification(const DoutPrefixProvider *dpp,
Contributor

nit: I think we can rename this to send_notification(). Sending is not conditional; it just could fail.
The fact that it does not return an error code means that the failure should be ignored.

Contributor Author

done

@kchheda3 kchheda3 force-pushed the wip-lc-notification branch from 1607dff to e1f2af9 Compare May 10, 2024 01:52
@cbodley
Contributor

cbodley commented May 13, 2024

merge conflict from #57356. also note that #57377 is hiding RGWObjState altogether. please just pass the etag to send_notification()

@kchheda3
Contributor Author

> merge conflict from #57356. also note that #57377 is hiding RGWObjState altogether. please just pass the etag to send_notification()

done

if (publish_ret < 0) {
  ldpp_dout(dpp, 5) << "WARNING: notify publish_commit failed, with error: " << publish_ret << dendl;
}
send_notification(dpp, driver, obj.get(), oc.bucket, etag, obj_state->size,
Contributor

As mentioned by Casey in #57356 (comment), obj_state may not be valid after transition or deletion, especially after a cloud transition, which can reset obj_size to '0' or delete the object.

Similar to etag, we would need to read obj_state->size before any LC operation.

Contributor Author

> obj_state->size

@soumyakoduri is obj_state->size an attribute as well, like etag?

Contributor Author

> obj_state->size
>
> @soumyakoduri is obj_state->size an attribute as well like etag ?

done. Started using obj->get_obj_size() and verified that the size is populated.

@kchheda3 kchheda3 force-pushed the wip-lc-notification branch 2 times, most recently from c810e78 to f9148ef Compare May 14, 2024 14:58
return;
}
ret =
notify->publish_commit(dpp, obj->get_obj_size(), ceph::real_clock::now(),
Contributor

This will probably get replaced with new SAL APIs to fetch the right obj_state and size from the backend store.

Are there any test cases to check notifications with LC expiration and cloud transition? After transition to cloud, either the object's HEAD is retained with its data erased (so the size will be '0' bytes) or the object is deleted (same as the LC expiration case), in which case get_obj_size() may return ENOENT if that call is passed to the backend store.

Requesting @cbodley and @dang to comment.

Contributor Author

I have not made any changes in this PR to use the SAL object; I have just reorganised the code.
If a future PR is going to change this behaviour, should this PR be blocked on that?

Contributor

@cbodley cbodley May 14, 2024

cc @yuvalif

the bucket notification tests do include some coverage for lifecycle, but not cloud transition:
https://github.com/ceph/ceph/blob/a052683/src/test/rgw/bucket_notification/test_bn.py#L1720-L1913

the rgw/notifications teuthology suite is what runs this

Contributor

> in which case get_obj_size() may return ENOENT if that call is passed to backend store.

similar to Object::get_attrs(), get_obj_size() is just returning the cached object state, so shouldn't ever go to the backend. some mutations may update that cached state, but i don't think delete_obj() does. callers probably shouldn't expect getters like get_obj_size() to return sensible answers after deletion

Contributor Author

> in which case get_obj_size() may return ENOENT if that call is passed to backend store.
>
> similar to Object::get_attrs(), get_obj_size() is just returning the cached object state, so shouldn't ever go to the backend. some mutations may update that cached state, but i don't think delete_obj() does. callers probably shouldn't expect getters like get_obj_size() to return sensible answers after deletion

@cbodley so you are suggesting that, instead of calling get_obj_size() after the object is removed (lc'd), we just get the size prior to deletion?

Contributor

like was done for etag, yeah
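
The pattern agreed on here, reading the fields the event needs before the object is mutated, might look like the sketch below. `CachedObject`, `EventFields`, and `transition_and_describe` are hypothetical names for illustration, and modelling a cloud transition as resetting the cached size to 0 is an assumption based on the comments above rather than on the SAL implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical object whose getters only return cached state.
class CachedObject {
 public:
  CachedObject(std::string etag, std::uint64_t size)
      : etag_(std::move(etag)), size_(size) {}

  const std::string& get_etag() const { return etag_; }
  std::uint64_t get_obj_size() const { return size_; }
  bool transitioned() const { return transitioned_; }

  // Models a transition that keeps the head but erases the data,
  // so the cached size becomes 0 (an assumption of this sketch).
  int transition_to_cloud() {
    size_ = 0;
    transitioned_ = true;
    return 0;
  }

 private:
  std::string etag_;
  std::uint64_t size_;
  bool transitioned_ = false;
};

// The values a notification event would carry.
struct EventFields {
  std::string etag;
  std::uint64_t size;
};

EventFields transition_and_describe(CachedObject& obj) {
  assert(!obj.transitioned());  // snapshot only makes sense beforehand
  // Copy the fields first: after the operation the getters may no
  // longer describe the object the event is about.
  EventFields fields{obj.get_etag(), obj.get_obj_size()};
  const int ret = obj.transition_to_cloud();
  (void)ret;  // error handling is omitted in this sketch
  return fields;
}
```

The returned fields keep the original size even though the object's own getter reports 0 afterwards.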

Contributor Author

done

@kchheda3 kchheda3 force-pushed the wip-lc-notification branch from f9148ef to 745beae Compare May 14, 2024 18:59
@kchheda3 kchheda3 requested review from cbodley and soumyakoduri May 15, 2024 17:28
@kchheda3 kchheda3 closed this May 15, 2024
@kchheda3 kchheda3 deleted the wip-lc-notification branch May 15, 2024 17:40
@kchheda3 kchheda3 restored the wip-lc-notification branch May 15, 2024 17:40
@kchheda3 kchheda3 reopened this May 15, 2024
Contributor

@soumyakoduri soumyakoduri left a comment

The rest looks good to me.

@kchheda3
Contributor Author

kchheda3 commented Jun 4, 2024

having trouble getting a clean qa run: https://pulpito.ceph.com/cbodley-2024-05-29_22:01:03-rgw-wip-cbodley-testing-distro-default-smithi/

lots of valgrind errors which are showing up on other branches too, and a test_d4n.sh failure that didn't go away with a --rerun

The valgrind errors are related to get_monmap_and_config(), and the single test_d4n failure is due to "failed to remove '/home/ubuntu/cephtest': Directory not empty".
So the failures are unrelated to the changes?

@cbodley
Contributor

cbodley commented Jun 4, 2024

> valgrind errors are related to get_monmap_and_config() and the single test_d4n failure is due to failed to remove '/home/ubuntu/cephtest': Directory not empty so the failures are unrelated to the changes ?

there were other prs in the test batch so it's hard to tell whether the d4n failure is related:

2024-05-29T22:24:47.627 INFO:tasks.workunit.client.0.smithi107.stdout:[==========] Running 11 tests from 2 test suites.
2024-05-29T22:24:47.627 INFO:tasks.workunit.client.0.smithi107.stdout:[----------] Global test environment set-up.
2024-05-29T22:24:47.629 INFO:tasks.workunit.client.0.smithi107.stdout:[----------] 5 tests from ObjectDirectoryFixture
2024-05-29T22:24:47.629 INFO:tasks.workunit.client.0.smithi107.stdout:[ RUN      ] ObjectDirectoryFixture.SetYield
2024-05-29T22:24:47.657 INFO:tasks.workunit.client.0.smithi107.stdout:[       OK ] ObjectDirectoryFixture.SetYield (28 ms)
2024-05-29T22:24:47.657 INFO:tasks.workunit.client.0.smithi107.stdout:[ RUN      ] ObjectDirectoryFixture.GetYield
2024-05-29T22:24:47.682 INFO:tasks.workunit.client.0.smithi107.stdout:[       OK ] ObjectDirectoryFixture.GetYield (25 ms)
2024-05-29T22:24:47.682 INFO:tasks.workunit.client.0.smithi107.stdout:[ RUN      ] ObjectDirectoryFixture.CopyYield
2024-05-29T22:24:47.683 INFO:tasks.workunit.client.0.smithi107.stdout:./src/test/rgw/test_d4n_directory.cc:207: Failure
2024-05-29T22:24:47.683 INFO:tasks.workunit.client.0.smithi107.stdout:Expected equality of these values:
2024-05-29T22:24:47.683 INFO:tasks.workunit.client.0.smithi107.stdout:  0
2024-05-29T22:24:47.683 INFO:tasks.workunit.client.0.smithi107.stdout:  dir->copy(obj, "copyTestName", "copyBucketName", yield)
2024-05-29T22:24:47.684 INFO:tasks.workunit.client.0.smithi107.stdout:    Which is: -22
2024-05-30T01:24:47.605 DEBUG:teuthology.orchestra.run:got remote process result: 124

trying to run against main for comparison

@cbodley
Contributor

cbodley commented Jun 6, 2024

https://pulpito.ceph.com/cbodley-2024-06-05_13:55:19-rgw-main-distro-default-smithi/ against main shows the same valgrind and d4n failures, now tracked in https://tracker.ceph.com/issues/66336 and https://tracker.ceph.com/issues/66365

there was also a notification job with these failures:

FAIL: test persistent topic stats
FAIL: test that when object is deleted due to lifecycle policy, notification is sent on master

@yuvalif i'm guessing those are unrelated issues with the mock http server? (note this batch included changes to persistent topics from #57536)

@cbodley
Contributor

cbodley commented Jun 6, 2024

The following tests FAILED:
34 - run-rbd-unit-tests-127.sh (Failed)

@cbodley
Contributor

cbodley commented Jun 6, 2024

jenkins test make check

@yuvalif
Contributor

yuvalif commented Jun 6, 2024

> https://pulpito.ceph.com/cbodley-2024-06-05_13:55:19-rgw-main-distro-default-smithi/ against main shows the same valgrind and d4n failures, now tracked in https://tracker.ceph.com/issues/66336 and https://tracker.ceph.com/issues/66365
>
> there was also a notification job with these failures:
>
> FAIL: test persistent topic stats
> FAIL: test that when object is deleted due to lifecycle policy, notification is sent on master
>
> @yuvalif i'm guessing those are unrelated issues with the mock http server? (note this batch included changes to persistent topics from #57536)

the persistent topic stats test failure is unrelated to the change here.
however, a lifecycle test failure is uncommon. since this PR is changing this area of the code, it is probably worth investigating.
also note that with the latest test change (#57550), the lifecycle tests are going to be executed against http, kafka and amqp backends. if they fail only with the http backend then we can probably consider it a pass

@cbodley
Contributor

cbodley commented Jun 6, 2024

> however, lifecycle test failure is uncommon. since this PR is changing this area of the code, it is probably worth investigating.
> also note that with the latest test change (#57550), the lifecycle tests are going to be executed against http, kafka and amqp backends. if they fail only with an http backend then we can probably consider it as a pass

ok, i'll push a rebase of this and #57536 and rerun against that

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@cbodley
Contributor

cbodley commented Jun 11, 2024

@cbodley
Contributor

cbodley commented Jun 12, 2024

re-rerun looks good enough: https://pulpito.ceph.com/cbodley-2024-06-11_19:40:35-rgw-wip-cbodley-testing-distro-default-smithi/

@kchheda3 if the rebase is trivial, we don't need to qa again

kchheda3 added 2 commits June 12, 2024 14:09
…on errors.

Currently, if any error occurs while calling publish_reserve, lc processing is cancelled for that object. This differs from the behavior for replication events, where notification errors do not block replication. Similarly, since lc is internal Ceph processing, notification errors should not block lc processing.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
…s for processing for object tag decode errors.

This was the behavior prior to ceph#55795, but it was changed during refactoring. Revert the logic so that events are not sent if obj_tag decoding fails.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
@kchheda3
Contributor Author

> re-rerun looks good enough: https://pulpito.ceph.com/cbodley-2024-06-11_19:40:35-rgw-wip-cbodley-testing-distro-default-smithi/
>
> @kchheda3 if the rebase is trivial, we don't need to qa again

was a trivial rebase, rename of event Noncurrent to NonCurrent

@cbodley
Contributor

cbodley commented Jun 13, 2024

FAIL: test_get_indiv_flag (tasks.mgr.dashboard.test_osd.OsdFlagsTest)

@cbodley
Contributor

cbodley commented Jun 13, 2024

jenkins test api

@cbodley
Contributor

cbodley commented Jun 13, 2024

The following tests FAILED:
203 - unittest_osdmap (Subprocess aborted)

@cbodley
Contributor

cbodley commented Jun 13, 2024

jenkins test make check

@cbodley
Contributor

cbodley commented Jun 19, 2024

jenkins test make check

@cbodley
Contributor

cbodley commented Jun 19, 2024

jenkins test api

@cbodley
Contributor

cbodley commented Jun 19, 2024

jenkins pls 🙏
