mgr/rbd_support: recover from rados client blocklisting #49742
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Fixed them
jenkins test make check
jenkins test make check arm64
Overall, I think there is room for improvement in the tests (probably OK to defer to another PR though):
@idryomov I accidentally requested a re-review. Please ignore.
Current tests appear to be stable with the following fixups:
... requests to be completed. Signed-off-by: Ramana Raja <rraja@redhat.com>
Signed-off-by: Ramana Raja <rraja@redhat.com>
In certain scenarios the OSDs were slow to process RBD requests. This led to the rbd_support module's RBD client not being able to gracefully hand over an RBD exclusive lock to another RBD client. After the condition persisted for some time, the other RBD client forcefully acquired the lock by blocklisting the rbd_support module's RBD client, and consequently blocklisted the module's RADOS client. The rbd_support module stopped working. To recover the module, the entire mgr service had to be restarted, which reloaded other mgr modules. Instead of recovering the rbd_support module from client blocklisting by being disruptive to other mgr modules, recover the module automatically without restarting the mgr service. When the client gets blocklisted, shut down the module's handlers and the blocklisted client, create a new RADOS client for the module, and start the new handlers. Fixes: https://tracker.ceph.com/issues/56724 Signed-off-by: Ramana Raja <rraja@redhat.com>
... after the module's RADOS client is blocklisted. Signed-off-by: Ramana Raja <rraja@redhat.com>
Created tracker ticket for now, https://tracker.ceph.com/issues/59681
No related failures: https://pulpito.ceph.com/dis-2023-05-08_22:48:48-rbd-wip-dis-testing-distro-default-smithi/ This is with #49975 excluded in the last rerun -- it's causing "Exiting scrub checking -- not all pgs scrubbed." errors. Per @neha-ojha the plan is to introduce a more aggressive QoS profile for teuthology tests.
In certain scenarios the OSDs were slow to process RBD requests.
This led to the rbd_support module's RBD client not being able to
gracefully hand over an RBD exclusive lock to another RBD client.
After the condition persisted for some time, the other RBD client
forcefully acquired the lock by blocklisting the rbd_support module's
RBD client, and consequently blocklisted the module's RADOS client. The
rbd_support module stopped working. To recover the module, the entire
mgr service had to be restarted, which reloaded other mgr modules.
Instead of recovering the rbd_support module from client blocklisting
by being disruptive to other mgr modules, recover the module
automatically without restarting the mgr service. When the client gets
blocklisted, shut down the module's handlers and the blocklisted client,
create a new RADOS client for the module, and start the new handlers.
Fixes: https://tracker.ceph.com/issues/56724
Signed-off-by: Ramana Raja <rraja@redhat.com>
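
The recovery sequence described above (shut down the handlers and the blocklisted client, create a new client, start new handlers) can be sketched roughly as below. This is a toy model and not the actual rbd_support code: `FakeClient`, `Handler`, `Module`, and `BlocklistedError` are hypothetical names invented for illustration, and no real librados or mgr API is used.

```python
import threading


class BlocklistedError(Exception):
    """Hypothetical stand-in for the error seen once a client is blocklisted."""


class FakeClient:
    """Hypothetical stand-in for the module's RADOS client."""

    def __init__(self, name):
        self.name = name
        self.blocklisted = False
        self.open = True

    def do_io(self):
        # A blocklisted client can no longer complete requests.
        if self.blocklisted:
            raise BlocklistedError(self.name)
        return "ok"

    def shutdown(self):
        self.open = False


class Handler:
    """Hypothetical handler that issues requests through one client."""

    def __init__(self, client):
        self.client = client
        self.running = True

    def run_once(self):
        return self.client.do_io()

    def shutdown(self):
        self.running = False


class Module:
    """Toy module that replaces its client and handlers on blocklisting."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0
        self._start()

    def _start(self):
        # Create a fresh client and handlers bound to it.
        self._generation += 1
        self.client = FakeClient(f"client-{self._generation}")
        self.handlers = [Handler(self.client) for _ in range(2)]

    def recover(self):
        # Shut down the old handlers and client, then start new ones.
        with self._lock:
            for handler in self.handlers:
                handler.shutdown()
            self.client.shutdown()
            self._start()

    def run_once(self):
        try:
            return [h.run_once() for h in self.handlers]
        except BlocklistedError:
            # Recover in place instead of restarting the whole service.
            self.recover()
            return [h.run_once() for h in self.handlers]
```

In this sketch only the affected module's client and handlers are replaced, which mirrors the goal of not disturbing other mgr modules. How the real module detects blocklisting and coordinates its handler threads is not shown here.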