osd/PG: restart peering for undersized PG on any down stray peer coming back #33106
Conversation
src/osd/PeeringState.cc (outdated)
@@ -525,6 +525,18 @@ bool PeeringState::should_restart_peering(
         << dendl;
    return true;
  }
  if (osdmap->get_pg_size(info.pgid.pgid) >
I don't think this works in general. It's not an actual interval change as far as the clients are concerned, so clients won't resend ops. Because different osds might have different past_intervals, it isn't even necessarily the case that all osds with this pg will consider it an interval change, so some osds may not respond to peering messages, resulting in a broken pg and/or crashed osds.
When osd.2 in your example comes back up, does it send an MOSDPGNotify to the primary? I think the right way to make this work is for
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would trigger a real interval change. I'll note, however, that that's a significantly larger refactor, and if osd.2 is down long enough it would need backfill anyway.
It's not an actual interval change as far as the clients are concerned, so clients won't resend ops. ... so some osds may or may not even respond to peering messages resulting in a broken pg and/or crashed osds.
No, it does not form a new interval change at this moment. But as the primary restarts the peering process, it will redo choose_acting, generate a new pg_temp containing the stray peer that has come back, and hence finally trigger a real interval change. There are also some other corner cases that need to be handled carefully.
I agree that this isn't an ideal fix, but we have locally verified that it (together with another small fix) turned out to work pretty well.
Anyway, thanks for your input. Yan and I will try to figure out a better fix soon.
Force-pushed from 4d65082 to 645e3b2.
@athanatos Could you take another look?
// if so, request pg_temp change to trigger a new interval transition
pg_shard_t auth_log_shard;
bool history_les_bound = false;
ps->choose_acting(auth_log_shard, false, &history_les_bound, true);
Add a debugging statement here at level 10.
Added, and the case is hit with the following log:
2020-02-19 16:44:26.822991 7f846caf7700 10 osd.0 pg_epoch: 73119 pg[1.3e( v 71749'838968 (71624'837389,71749'838968] local-lis/les=73117/73118 n=1481 ec=107/107 lis/c 73117/73093 les/c/f 73118/73094/0 73096/73117/70738) [0,17,33]/[0,33] backfill=[17] r=0 lpr=73117 pi=[73093,73117)/1 rops=1 bft=17 crt=71749'838968 lcod 71749'838967 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={255={}}] state<Started/Primary/Active>: Active: got notify from previous acting member 3 , requesting pg_temp change
Did you observe this in a teuthology run? This is a pretty core change to peering, and it's important that it have coverage in our regular test runs, both because we need to ensure we don't rebreak it and because we want some confidence that it won't break something else.
Nope, but I managed to add a reproducer anyway.
As far as I can tell, this should work. Once you've added the debugging, could you confirm that this case actually gets hit by the existing teuthology tests in a normal run?
retest this please
Force-pushed from 645e3b2 to b3d9254.
works for our test case :-)
Force-pushed from b3d9254 to ed0c9e2.
…coming back

One of our customers wants to verify the data safety of Ceph while scaling the cluster up, and the test case looks like:
- keep checking the status of a specified pg whose up set is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5], pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2: acting stays unchanged, since 2 belongs to neither the current up set nor the acting set, leaving the corresponding pg undersized for a long time until all backfill targets complete

It does not pose any critical problem -- we'll end up getting that pg back to active+clean, except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data safety more than anything else.

The right way to achieve the above goal is for:
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would then trigger a real interval change.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>
Force-pushed from ed0c9e2 to 023524a.
One of our customers wants to verify the data safety of Ceph while scaling the cluster up, and the test case looks like:
- keep checking the status of a specified pg whose up set is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5], pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2: acting stays unchanged, since 2 belongs to neither the current up set nor the acting set, leaving the corresponding pg undersized for a long time until all backfill targets complete

It does not pose any critical problem -- we'll end up getting that pg back to active+clean, except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data safety more than anything else.

The right way to achieve the above goal is for:
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would then trigger a real interval change that will get rid of the DEGRADED warning.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>