-
Notifications
You must be signed in to change notification settings - Fork 6.3k
crimson/osd: fix Notify life-time mismanagement in Watch::notify_ack #51945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
We might not need #51936 after this. |
Matan-B
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's verify we no longer hit this issue in the suite's rbd api tests before merging.
I scheduled a build in Shaman.
athanatos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also appears to be fixing a bug where we were calling complete_watcher on all in progress notifies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[ RUN ] TestLibRBD.QuiesceWatchTimeout is crashing the osds.
0# gsignal in /lib64/libc.so.6
1# abort in /lib64/libc.so.6
..
4# crimson::osd::Watch::notify_ack(unsigned long, ceph::buffer::v15_2_0::list const&) in ceph-osd
See rbd_api_tests_old_format/rbd_api_tests. I will also try to look at it.
https://pulpito.ceph.com/matan-2023-06-08_06:36:16-crimson-rados-wip-matanb-crimson-testing-bug-61504-distro-crimson-smithi/
Edit: On the positive side, rbd_python_api_tests_old_format and rbd_python_api_tests look good! It seems that this was responsible for the recent regression with rbd tests.
src/crimson/osd/watch.cc
Outdated
| }); | ||
| logger().info("{} notify_id={}", __func__, notify_id); | ||
| const auto it = in_progress_notifies.find(notify_id); | ||
| assert(it != std::end(in_progress_notifies)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resolves the TestLibRBD.QuiesceWatchTimeout for me locally, what do you think?
| assert(it != std::end(in_progress_notifies)); | |
| if (it == std::end(in_progress_notifies)) { | |
| logger().debug("Watch::notify_ack gid={} cookie={} notify_id={} already completed", | |
| get_watcher_gid(), | |
| get_cookie(), | |
| notify_id); | |
| return seastar::now(); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we shouldn't be getting a notify ack from something not in the in_progress_notifies list, I don't think. Perhaps make it error() and debug it after this PR merges?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to logging an error.
src/crimson/osd/watch.cc
Outdated
| in_progress_notifies.clear(); | ||
| return seastar::now(); | ||
| }); | ||
| logger().info("{} notify_id={}", __func__, notify_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| logger().info("{} notify_id={}", __func__, notify_id); | |
| logger().debug("{} gid={} cookie={} notify_id={}", | |
| __func__, get_watcher_gid(), get_cookie(), notify_id); |
Fixes: https://tracker.ceph.com/issues/61504 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Replaced the assert with an error log entry but ultimately this commit should be reverted. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Matan-B
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will open a follow-up tracker to the suppressed issue once this is merged.
|
Failures, seem unrelated: 7301383, 7301385, 7301394, 7301396: https://tracker.ceph.com/issues/61651 (new tracker) Merging based on |
|
Follow-up tracker: https://tracker.ceph.com/issues/61652 |
Fixes: https://tracker.ceph.com/issues/61504
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows