osd/PG: restart peering for undersized PG on any down stray peer coming back #33106
Conversation
src/osd/PeeringState.cc (outdated)
@@ -525,6 +525,18 @@ bool PeeringState::should_restart_peering(
         << dendl;
    return true;
  }
  if (osdmap->get_pg_size(info.pgid.pgid) >
I don't think this works in general. It's not an actual interval change as far as the clients are concerned, so clients won't resend ops. Because different osds might have different past_intervals, it isn't even necessarily the case that all osds with this pg will consider it an interval change, so some osds may not respond to peering messages, resulting in a broken pg and/or crashed osds.
When osd.2 in your example comes back up, does it send an MOSDPGNotify to the primary? I think the right way to make this work is for
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would trigger a real interval change. I'll note, however, that that's a significantly larger refactor, and if osd.2 is down long enough it would need backfill anyway.
It's not an actual interval change as far as the clients are concerned, so clients won't resend ops. ... so some osds may or may not even respond to peering messages resulting in a broken pg and/or crashed osds.
No, it does not form a new interval change at this moment. But as the primary restarts the peering process, it will redo choose_acting, generate a new pg_temp containing the stray peer that has come back, and hence finally trigger a real interval change. There are also some other corner cases that need to be handled carefully.
I agree that this isn't an ideal fix, but we have locally verified that it (together with another small fix) turned out to work pretty well.
Anyway, thanks for your input. Yan and I will try to figure out a better fix soon.
Force-pushed from 4d65082 to 645e3b2.
@athanatos Could you take another look?
// if so, request pg_temp change to trigger a new interval transition
pg_shard_t auth_log_shard;
bool history_les_bound = false;
ps->choose_acting(auth_log_shard, false, &history_les_bound, true);
Add a debugging statement here at level 10.
Added, and the case is hit with the following log:
2020-02-19 16:44:26.822991 7f846caf7700 10 osd.0 pg_epoch: 73119 pg[1.3e( v 71749'838968 (71624'837389,71749'838968] local-lis/les=73117/73118 n=1481 ec=107/107 lis/c 73117/73093 les/c/f 73118/73094/0 73096/73117/70738) [0,17,33]/[0,33] backfill=[17] r=0 lpr=73117 pi=[73093,73117)/1 rops=1 bft=17 crt=71749'838968 lcod 71749'838967 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={255={}}] state<Started/Primary/Active>: Active: got notify from previous acting member 3 , requesting pg_temp change
Did you observe this in a teuthology run? This is a pretty core change to peering, and it's important that it have coverage in our regular test runs, both because we need to ensure we don't rebreak it and because we want some confidence that it won't break something else.
Nope, but I managed to add a reproducer anyway.
As far as I can tell, this should work. Once you've added the debugging, could you confirm that this case actually gets hit by the existing teuthology tests in a normal run?
retest this please
Force-pushed from 645e3b2 to b3d9254.
works for our test case :-)
Force-pushed from b3d9254 to ed0c9e2.
…coming back

One of our customers wants to verify the data safety of Ceph while scaling the cluster up, and the test case looks like:
- keep checking the status of a specified pg whose up set is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5], pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2: acting stays unchanged, since 2 belongs to neither the current up set nor the acting set, leaving the corresponding pg undersized for a long time until all backfill targets complete

It does not pose any critical problem -- we'll end up getting that pg back to active+clean, except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data safety more than anything else.

The right way to achieve the above goal is for:
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would then trigger a real interval change.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>
Force-pushed from ed0c9e2 to 023524a.
One of our customers wants to verify the data safety of Ceph while scaling the cluster up, and the test case looks like:
- keep checking the status of a specified pg whose up set is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5], pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2: acting stays unchanged, since 2 belongs to neither the current up set nor the acting set, leaving the corresponding pg undersized for a long time until all backfill targets complete

It does not pose any critical problem -- we'll end up getting that pg back to active+clean, except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data safety more than anything else.

The right way to achieve the above goal is for:
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would then trigger a real interval change that will get rid of the DEGRADED warning.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>