
osd/PG: fix recovery op leak #18524

Merged
merged 3 commits into ceph:master from wip-backfill-rops on Oct 26, 2017

Conversation

@liewegas (Member) commented Oct 25, 2017:

This fixes a recovery op leak recently introduced by the recovery preemption
changes and the new DeferRecovery stuff. Or possibly a bit earlier? Either way, I think this fixes it.

/a/sage-2017-10-24_22:15:33-rados-wip-sage-testing-2017-10-24-1544-distro-basic-smithi/1771206

@liewegas (Member, Author) commented Oct 25, 2017:

@dzafman I think this may have actually come from e708410, which cleared the waiting_for_backfill but didn't finish_recovery_op(max) if it was non-empty. I made the same mistake ("wth is this finish_recovery_op for? i'll leave it off") when I added the unfound and rejected cases recently.

In any case, I think this patch clears it up?

@hjwsm1989 (Contributor) commented:

We also hit this problem: the backfilling state was not cleared while rops > 0.

pg->state_set(PG_STATE_BACKFILL_TOOFULL);

A reviewer (Contributor) commented on this line:

I'd feel better if we cleared PG_STATE_BACKFILLING here like the other 2 cases. There is a case where we are able to get reservations, but PrimaryLogPG::do_scan() noticed that we've gone over the backfill_full_ratio while backfilling. We might as well.

The reviewer (Contributor) added:

If you don't want to include that here, I'll put it in later.

liewegas (Member, Author) replied:

updated! also added more detail to the commit msg (when the bug was introduced and then duplicated)

For multiple backfill targets, we start MAX multiple times.

Signed-off-by: Sage Weil <sage@redhat.com>
Previously, there was only one time we would end up in this region of code:
when the backfill was rejected by the peer.  Previously that was apparently
reliably when we had an outstanding SCAN request, because we would
unconditionally cancel the MAX recovery op and clear waiting_on_backfill.

See 624aaf2 for when this code appeared.

Now we have several similar paths, and we don't always have an outstanding
scan call (I don't think!).  Regardless, move these three cases into
a common helper and make the finish_recovery_op completion conditional
on whether there is an outstanding SCAN.  This fixes a leak of a recovery
op when we defer while a scan is outstanding (this bug was recently
introduced by e708410 and then
duplicated by 2463c64).

Note that there is still one other time we register MAX ops: when we are
finishing backfill.  There, we start one per target.  But we will always
get back our reply and process it in the normal way (that old commit
did not change the timing for these).

Signed-off-by: Sage Weil <sage@redhat.com>
…gets

If we have multiple targets, we may still be waiting on them when we get
a revocation.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas liewegas merged commit 3f6e0b6 into ceph:master Oct 26, 2017
@liewegas liewegas deleted the wip-backfill-rops branch October 26, 2017 22:48