osd/PG: force rebuild of missing set on jewel upgrade #16950
Merged
gregsfortytwo
approved these changes
Aug 9, 2017
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
I believe the idea is to use the luminous milestone only for PRs targeting luminous. Correct me if I'm wrong!
Yep!
Previously we were detecting the need to rebuild missing based on whether the "divergent_priors" omap key was present. Unfortunately, jewel does not always set this, so it is not a reliable indicator. (It only gets set if you actually have a divergent prior at some point in the PG's lifetime on that OSD.)

Fix by using the info_struct_v on the PG to detect whether we need to do the conversion. We didn't bump the value when we added the missing persistence, but the fastinfo was also added during the same period between jewel and kraken, so it will work just as well.

Fixes: http://tracker.ceph.com/issues/20958
Signed-off-by: Sage Weil <sage@redhat.com>
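The difference between the two detection strategies described above can be sketched in a few lines of C++. This is only an illustration, not the code from this PR: `PGMeta`, `needs_rebuild_old`, `needs_rebuild_new`, and `kFastinfoStructV` are hypothetical names, and the numeric threshold is an assumed value for the struct version at which fastinfo was introduced.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the per-PG metadata an OSD reads at startup.
struct PGMeta {
  std::uint8_t info_struct_v;               // version of the persisted PG info encoding
  std::map<std::string, std::string> omap;  // per-PG omap keys
};

// Assumed value: the info struct version that introduced fastinfo.
// Anything older is treated as having been written by jewel.
constexpr std::uint8_t kFastinfoStructV = 9;

// Old heuristic: rebuild only if the "divergent_priors" key exists.
// This misses jewel PGs that never had a divergent prior on this OSD.
bool needs_rebuild_old(const PGMeta& m) {
  return m.omap.count("divergent_priors") > 0;
}

// New heuristic: rebuild whenever the PG info predates fastinfo,
// regardless of which omap keys happen to be present.
bool needs_rebuild_new(const PGMeta& m) {
  return m.info_struct_v < kFastinfoStructV;
}
```

A jewel-era PG that never recorded a divergent prior has no "divergent_priors" key, so the old check returns false for it while the version-based check still returns true. That is the case the fix is meant to cover.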
We can't kill and restart osds because that will interfere with the upgrade process. We can, however, thrash the layout by tweaking osd weights and so on. This will exercise osd recovery paths during the upgrade that aren't normally exercised (outside of stress-split, which doesn't upgrade individual osds while they are non-clean).

Signed-off-by: Sage Weil <sage@redhat.com>
finally:

teuthology:sage-2017-08-10_02:10:13-upgrade:jewel-x:parallel-wip-20959-distro-basic-smithi 01:21 PM $ zgrep -a 'forced rebuild of missing' */remote/*/log/*osd* | grep -v missing.0\ may_
1503147/remote/smithi168/log/ceph-osd.1.log.gz:2017-08-10 04:35:05.389377 7effffc70d00 10 osd.1 pg_epoch: 276 pg[4.1( v 276'1108 lc 118'436 (0'0,276'1108] local-lis/les=275/275 n=50 ec=17/17 lis/c 274/265 les/c/f 275/265/0 274/274/197) [] r=-1 lpr=0 pi=[264,274)/1 crt=276'1108 lcod 0'0 unknown m=45] read_state forced rebuild of missing got missing(45 may_include_deletes = 0)

It was way harder to trigger this case than I thought. I had to add thrashing to the parallel/ collection.