osd/scrub: tag replica scrub messages to identify stale events #42684
I don't quite understand the token lifecycle. It doesn't look like it gets encoded in any primary->replica messages, so it's purely used for the replica to cancel any outstanding local events when its reservation is canceled?
Yes. (And - even internal to the replica, I started by adding it to some of the events, but found out
yes
@ronen-fr how did the pre-refactor code handle this? It seems like https://tracker.ceph.com/issues/52012#note-1 was possible even earlier?
(Before actually analyzing the code:) I think so. It requires a combination of a "scrub abort" and osd_max_scrubs > 1.
A lot can happen while a replica is waiting for the backend to collect the map data. The Primary might, for example, abort the scrub and start a new one (following no-scrub commands).
The 'token' introduced here tags 'replica scrub resched' messages with an index value that is modified on each 'release scrub resources' request from the Primary.
Fixes: https://tracker.ceph.com/issues/52012
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
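To illustrate the idea in the commit message above, here is a minimal Python sketch of a replica-local token. It is a toy model and not the Ceph implementation: the class and method names (`ReplicaScrubber`, `ReschedEvent`, `schedule_resched`, `on_release_resources`, `handle_resched`) are hypothetical. It assumes, as discussed in the review comments, that the token never travels in primary->replica messages and is only used by the replica to discard its own outdated events.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ReschedEvent:
    """A locally queued 'replica scrub resched' event (hypothetical type)."""
    chunk: str
    token: int  # value of the replica's token when the event was queued


@dataclass
class ReplicaScrubber:
    """Toy replica model; names are illustrative, not Ceph's."""
    current_token: int = 0
    built_maps: List[str] = field(default_factory=list)

    def schedule_resched(self, chunk: str) -> ReschedEvent:
        # Tag the event with the token that is valid right now.
        return ReschedEvent(chunk=chunk, token=self.current_token)

    def on_release_resources(self) -> None:
        # The Primary released the scrub resources: modify the token so
        # that every event queued before this point becomes stale.
        self.current_token += 1

    def handle_resched(self, event: ReschedEvent) -> bool:
        # Drop events that belong to a scrub session that was released.
        if event.token != self.current_token:
            return False
        self.built_maps.append(event.chunk)
        return True


# Usage: an event queued before a release is ignored afterwards.
replica = ReplicaScrubber()
stale = replica.schedule_resched("chunk-1")
replica.on_release_resources()           # Primary aborts the scrub
fresh = replica.schedule_resched("chunk-2")
handled = [replica.handle_resched(stale), replica.handle_resched(fresh)]
```

In this sketch only the event queued after the release is processed, so a delayed event from the aborted scrub cannot interfere with the new one.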
…ir scrubs
Previously, after-repair scrubs were started without waiting for either local or remote OSDs' scrub resources.
The tagging of scrub sessions by the replicas is based on monitoring replica-request and replica-release messages from the primary. Scrub-map requests arriving without any reservations interfere with this mechanism.
The benefits of this fast-track were limited at best, and do not justify the complexity of a solution that accommodates both the bypass and the tagging.
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Initial set of 72 tests is clean.
Runs:
Failures:
#6382926: “cephadm/test_dashboard_e2e.sh” - dashboard-related failure. Seems to be https://tracker.ceph.com/issues/52417
The dead job: test issue (reimaging)