New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd/PG: fix recovery op leak #18524
osd/PG: fix recovery op leak #18524
Conversation
@dzafman I think may have actually come from e708410, which cleared the waiting_for_backfill but didn't finish_recovery_op(max) if it was non-empty. I made the same mistake ("wth is this finish_recovery_op for? i'll leave it off") when I added the unfound and rejected cases recently. In any case, I think this patch clears it up? |
we also meet this problem, backfilling state not cleared with rops > 0. |
pg->state_set(PG_STATE_BACKFILL_TOOFULL); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd feel better if we cleared PG_STATE_BACKFILLING here like the other 2 cases. There is a case where we are able to get reservations, but PrimaryLogPG::do_scan() noticed that we've gone over the backfill_full_ratio while backfilling. We might as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to include that here, I'll put it in later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated! also added more detail to the commit msg (when bug was intorduced and then duplicated)
5014783
to
b6dfb20
Compare
For multiple backfill targets, we start MAX multiple times. Signed-off-by: Sage Weil <sage@redhat.com>
Previously, there was only one time we would end up in this region of code: when the backfill was rejected by the peer. Previously that was apparently reliably when we had an outstanding SCAN request, because we would unconditionally cancle the MAX recovery op and clear waiting_on_backfill. See 624aaf2 for when this code appeared. Now we have several similar paths, and we don't always have an outstanding scan call (I don't think!). Regardless, move most these three cases into a common helper and make the finish_recovery_op completion conditional on whether there is an outstanding SCAN. This fixes a leak of a recovery op when we defer while a scan is outstanding (this bug was recently introduced by e708410 and then duplicated by 2463c64). Note that there is still one other time we register MAX ops: when we are finishing backfill. There, we start one per target. But we will always get back our reply and process it in the normal way (that old commit did not change the timing for these). Signed-off-by: Sage Weil <sage@redhat.com>
b6dfb20
to
efd1a77
Compare
…gets If we have multiple targets, we may still be waiting on them when we get a revocation. Signed-off-by: Sage Weil <sage@redhat.com>
This fixes a recovery op leak recently intorduced by the recovery preemption
changes and the new DeferRecovery stuff. Or possibly a bit early? Either way, I think this fixes it.
/a/sage-2017-10-24_22:15:33-rados-wip-sage-testing-2017-10-24-1544-distro-basic-smithi/1771206