Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd/PG: discover missing objects when an OSD peers and PG is degraded #27288

Merged
merged 1 commit into from Apr 22, 2019

Conversation

Projects
None yet
5 participants
@TheJJ
Copy link

commented Apr 1, 2019

When a PG is remapped from OSD a to OSD b, the objects are
backfilled. When OSD a is restarted, objects become degraded
as a is no longer queried or considered as a backfill source.

As the PG is degraded, PG::discover_all_missing is not called
when a candidate OSD peers with the primary: The PG is already
active, thus PG::activate (and in turn missing object discovery)
is not called. Discovery is also not initiated from
PG::RecoveryState::Active::react(const MNotifyRec& notevt)
as there are no unfound objects.

This patch adds a call to discover_all_missing when
when an OSD sends its MNotifyRec message and the PG is degraded.

Fixes: https://tracker.ceph.com/issues/37439
Signed-off-by: Jonas Jelten jj@stusta.net

  • References tracker ticket

@neha-ojha neha-ojha requested review from neha-ojha and dzafman Apr 1, 2019

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch from 43b4a58 to d564741 Apr 2, 2019

Show resolved Hide resolved src/osd/PG.cc Outdated

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch from d564741 to 7947c93 Apr 2, 2019

@TheJJ TheJJ changed the title osd/PG: discover unfound objects if PG is degraded osd/PG: perform missing objects discovery when a candidate OSD peers Apr 2, 2019

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch from 7947c93 to 8acc738 Apr 2, 2019

Show resolved Hide resolved src/osd/PG.cc Outdated

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch from 8acc738 to 31ade33 Apr 3, 2019

@xiexingguo
Copy link
Member

left a comment

👍

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 3, 2019

This is probably a hot candidate to be backported :)

Show resolved Hide resolved src/osd/PG.cc Outdated

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch 2 times, most recently from 32888aa to b21753a Apr 4, 2019

@TheJJ TheJJ changed the title osd/PG: perform missing objects discovery when a candidate OSD peers osd/PG: discover missing objects when an OSD peers and PG is degraded Apr 4, 2019

@neha-ojha neha-ojha added the needs-qa label Apr 5, 2019

@xiexingguo

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

@TheJJ @neha-ojha Sorry for not responding earlier(There was a national holiday last week).

I think a better idea would be combining those two checker together, e.g.:

if (pg->have_unfound() ||  (pg->is_degraded() && pg->might_have_unfound.count(notevt.from)))

because there are still some corner cases(e.g., because that peer is down, or has identical log with ours) we want to avoid sending out a MQuery message. (Not sure that will be causing problems, but since we are already aware of that, then we'd better not :-)

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch from b21753a to b3f0101 Apr 8, 2019

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 8, 2019

I've update my patch but I'm more and more getting the feeling that the missing object discovery has to be somehow improved as it seems very fragile. In my intuition, calling PG::discover_all_missing should be very robust and only query actual candidate peers and is not able to crash other OSDs. 😃

test passed

@tchaikov

This comment has been minimized.

Copy link
Contributor

commented Apr 11, 2019

@neha-ojha good to merge?

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 11, 2019

I've applied the patch to one of my clusters and experienced horrible things, one OSD is now in such a broken state that whenever it is running, some PGs on it fail to peer and remain offline. When I shut down that OSD, the PGs become active again.

It seems unlikely that this patch causes such meltdown, but I just want to be sure my patch doesn't break things for other people unintentionally...

@neha-ojha

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

I've applied the patch to one of my clusters and experienced horrible things, one OSD is now in such a broken state that whenever it is running, some PGs on it fail to peer and remain offline. When I shut down that OSD, the PGs become active again.

It seems unlikely that this patch causes such meltdown, but I just want to be sure my patch doesn't break things for other people unintentionally...

Well, is just the application of this patch that caused problems or were there other things going on as well? If not, we should make sure that this patch is the right resolution.

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 12, 2019

Well, is just the application of this patch that caused problems or were there other things going on as well?

I've applied the patch and built the package for Ubuntu. Updated the ceph-osd-package and its dependencies. The cluster wasn't healthy, it's been plagued with the issue that this pullrequest aims to fix for a long time (i.e. ongoing rebalancing, then an OSD hangs/restarts/..., the remapped pg is degraded and tries to recover, wait until clean, repeat).

When I restarted some OSDs, some PGs became inactive and remained peering. In order to bring the cluster available again quickly, I didn't spend much time on investigating the exact reason, but after installing the original 14.2.0-osd again and restarting all OSDs, some (not all) PGs were still waiting for peering with one OSD, which is the one that had to be taken offline (systemctl stop) now.

As I said I'm really not sure if my patch caused this. Anyway, I'd really be glad if the bug this patch addresses is fixed one way or another 😃. I can imagine many clusters end up degraded and recovering even though not necessary.

Jonas Jelten
osd/PG: discover missing objects when an OSD peers and PG is degraded
When a PG is remapped from OSD `a` to OSD `b`, the objects are
backfilled. When OSD `a` is restarted, objects become degraded
as `a` is no longer queried or considered as a backfill source.

As the PG is degraded, `PG::discover_all_missing` is not called
when a candidate OSD peers with the primary: The PG is already
active, thus `PG::activate` (and in turn missing object discovery)
is not called. Discovery is also not initiated from
`PG::RecoveryState::Active::react(const MNotifyRec& notevt)`
as there are no unfound objects.

This patch adds a call to `discover_all_missing` when
when an OSD sends its `MNotifyRec` message and the PG is degraded.

Fixes: https://tracker.ceph.com/issues/37439
Signed-off-by: Jonas Jelten <jj@stusta.net>

@TheJJ TheJJ force-pushed the TheJJ:fix-finding-remapped-source-osd branch from b3f0101 to e152d09 Apr 16, 2019

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 16, 2019

Do you need anything else or are we getting closer to a merge (and backport)? :)

@neha-ojha

This comment has been minimized.

Copy link
Member

commented Apr 16, 2019

Do you need anything else or are we getting closer to a merge (and backport)? :)

We were very close to merging this, but the fact that you saw problems after applying this patch, makes me a bit nervous.

I think I will run this through another round of testing, just to make sure we catch any bugs, before merging - better safe than sorry :)

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 16, 2019

Alright, that's good, thanks!

@liewegas liewegas merged commit e152d09 into ceph:master Apr 22, 2019

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

liewegas added a commit that referenced this pull request Apr 22, 2019

Merge PR #27288 into master
* refs/pull/27288/head:
	osd/PG: discover missing objects when an OSD peers and PG is degraded

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Neha Ojha <nojha@redhat.com>
@neha-ojha

This comment has been minimized.

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 23, 2019

\o/

@TheJJ TheJJ deleted the TheJJ:fix-finding-remapped-source-osd branch Apr 23, 2019

@TheJJ

This comment has been minimized.

Copy link
Author

commented Apr 30, 2019

for reference, the backport pulls are:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.