Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: fix getting osd maps on initial osd startup #22585

Merged
merged 1 commit into from Jun 20, 2018
Merged

Conversation

@emmericp
Copy link
Contributor

emmericp commented Jun 15, 2018

This fixes https://tracker.ceph.com/issues/24423

Is there any other case where that osd map might not be available that should be handled by triggering an assert here? I think that can only happen on first startup.

@emmericp emmericp force-pushed the croit:fix-24423 branch from 0c02621 to 04d39da Jun 16, 2018
@Mrxlazuardin

This comment has been minimized.

Copy link

Mrxlazuardin commented Jun 16, 2018

How to apply this fix to installed one?

if (!(lastmap = service.try_get_map(i.first - 1))) {
dout(0) << __func__ << " can't get previous map " << i.first - 1
<< " probably first start of this osd" << dendl;
break;

This comment has been minimized.

Copy link
@liewegas

liewegas Jun 18, 2018

Member

s/break/continue/, and change the dout level to 10

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Jun 18, 2018

This looks like the right fix, with the exception of the break statement

@liewegas liewegas removed the needs-qa label Jun 18, 2018
@gregsfortytwo

This comment has been minimized.

Copy link
Member

gregsfortytwo commented Jun 18, 2018

Oh, I misunderstood this hunk when I saw it go by, but now I get it.

LGTM too!

@emmericp emmericp force-pushed the croit:fix-24423 branch from 04d39da to d8d47e1 Jun 18, 2018
@emmericp

This comment has been minimized.

Copy link
Contributor Author

emmericp commented Jun 18, 2018

updated

@liewegas liewegas added the needs-qa label Jun 18, 2018
@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Jun 18, 2018

Thanks!

Commit 89d0c35 introduced a check for
deleted pools that relies on getting an older osd map that might not be
available in some situations on OSDs starting up for the very first
time.

fixes https://tracker.ceph.com/issues/24423

Signed-off-by: Paul Emmerich <paul.emmerich@croit.io>
@emmericp emmericp force-pushed the croit:fix-24423 branch from d8d47e1 to 02180f6 Jun 18, 2018
@Mrxlazuardin

This comment has been minimized.

Copy link

Mrxlazuardin commented Jun 19, 2018

I have try to do this fix, but still get the same result. Following is my related log lines.

-4> 2018-06-19 07:48:53.336 7f4d0fa16700  3 osd.0 0 handle_osd_map epochs [1758,1758], i have 0, src has [1758,3739]
-2> 2018-06-19 07:48:53.336 7f4d0fa16700 -1 osd.0 0 failed to load OSD map for epoch 1757, got 0 bytes

It seem that the problem could by just following revision.

- lastmap = get_map(i.first - 1);
+ lastmap = get_map(i.first);

The OSD try to load 1757 but the source only have 1758-3739. Following are some of my trials of getting OSDMap of some epochs which success only starting from 1758.

[root@management-a ~]# ceph osd getmap 1755 -o /tmp/ceph_osd_getmap.bin
Error ENOENT: there is no map for epoch 1755
[root@management-a ~]# ceph osd getmap 1756 -o /tmp/ceph_osd_getmap.bin
Error ENOENT: there is no map for epoch 1756
[root@management-a ~]# ceph osd getmap 1757 -o /tmp/ceph_osd_getmap.bin
Error ENOENT: there is no map for epoch 1757
[root@management-a ~]# ceph osd getmap 1758 -o /tmp/ceph_osd_getmap.bin
got osdmap epoch 1758
[root@management-a ~]# ceph osd getmap 1759 -o /tmp/ceph_osd_getmap.bin
got osdmap epoch 1759
[root@management-a ~]# ceph osd getmap 1760 -o /tmp/ceph_osd_getmap.bin
got osdmap epoch 1760

Anyway, why "i.first - 1"? Is there a method to inject OSDMap to such OSD with last epoch?

@emmericp

This comment has been minimized.

Copy link
Contributor Author

emmericp commented Jun 19, 2018

@Mrxlazuardin it looks like you are trying to mess around with things you don't understand on something that seems to be a production cluster. That can be very dangerous, do not do that.

If you broke your production cluster due to this issue: wait for 13.2.1. Or get professional help to backport that to 13.2.0 for you if you absolutely need a new OSD.

@Mrxlazuardin

This comment has been minimized.

Copy link

Mrxlazuardin commented Jun 19, 2018

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Jun 19, 2018

In general we want i.first - 1 because we are looking for changes: newly deleted pools or pg_num adjustments. It only fails in the new osd case because i.first is the oldest osdmap in the system and i.first -1 isn't available. In that case, we just skip it and start with the next epoch.

I reproduced the issue locally and tested the fix and it works for me!

@Mrxlazuardin

This comment has been minimized.

Copy link

Mrxlazuardin commented Jun 19, 2018

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Jun 19, 2018

@Mrxlazuardin

This comment has been minimized.

Copy link

Mrxlazuardin commented Jun 19, 2018

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Jun 19, 2018

@emmericp

This comment has been minimized.

Copy link
Contributor Author

emmericp commented Jun 19, 2018

We sometimes backport fixes like for our clients, we usually just build the full packages and deploy them. The instructions from http://docs.ceph.com/docs/master/install/build-ceph/#rpm-package-manager do work exactly as described for centos, just did that a few weeks ago for a few clusters running into another problem for which a patch already existed.

This patch applies cleanly on 13.2.0, so it's really easy to backport. Good luck!

@liewegas liewegas merged commit 02180f6 into ceph:master Jun 20, 2018
5 checks passed
5 checks passed
Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details
liewegas added a commit that referenced this pull request Jun 20, 2018
* refs/pull/22585/head:
	osd: fix getting osd maps on initial osd startup

Reviewed-by: Sage Weil <sage@redhat.com>
@Mrxlazuardin

This comment has been minimized.

Copy link

Mrxlazuardin commented Jul 27, 2018

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Jul 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.