
osd/PG: restart peering for undersized PG on any down stray peer coming back #33106

Merged
merged 1 commit on Feb 22, 2020

Conversation

Contributor

@Yan-waller Yan-waller commented Feb 6, 2020

One of our customers wants to verify the data safety of Ceph during scaling
the cluster up, and the test case looks like:

  • keep checking the status of a specified pg, whose up set is [1, 2, 3]
  • add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5],
    pg is remapped
  • stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
  • restart osd.2: acting stays unchanged since osd.2 belongs to neither the current up nor
    acting set, leaving the corresponding pg pinned undersized for a long time until all
    backfill targets complete

It does not pose any critical problem -- we'll end up getting that pg back into active + clean,
except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data
safety more than anything else.

The right way to achieve the above goal is for:

boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)

to check whether the newly booted node could be validly chosen for the acting set and
request a new temp mapping. The new temp mapping would then trigger a real interval change
that will get rid of the DEGRADED warning.
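
Concretely, the change is roughly of the following shape (a hedged sketch pieced together from the snippet under review further down; the DECLARE_LOCALS/discard_event boilerplate follows the surrounding PeeringState.cc conventions, and the exact guard condition is elided, so this is not the verbatim patch):

    boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
    {
      DECLARE_LOCALS; // binds ps (the PeeringState) inside statechart reactions
      ...
      // If the notifying stray peer could be validly chosen for the acting
      // set, redo choose_acting with the extra flag set so that a pg_temp
      // change is requested, which then forces a real interval change.
      // (The guard condition is elided here; see the reviewed snippet below.)
      pg_shard_t auth_log_shard;
      bool history_les_bound = false;
      ps->choose_acting(auth_log_shard, false, &history_les_bound, true);
      ...
      return discard_event();
    }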

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@@ -525,6 +525,18 @@ bool PeeringState::should_restart_peering(
<< dendl;
return true;
}
if (osdmap->get_pg_size(info.pgid.pgid) >
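
The quoted hunk is truncated by the page. Judging from the patch title and the discussion below, the added check presumably restarted peering whenever the pool's replica count exceeds the current acting set size, roughly along these lines (a hedged reconstruction, not the verbatim diff; the right-hand side of the comparison is a guess):

    if (osdmap->get_pg_size(info.pgid.pgid) >
        newacting.size()) { // hedged guess at the comparison
      // PG is undersized; restarting peering lets the primary redo
      // choose_acting and pick up a stray peer that has come back
      return true;
    }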
Contributor
I don't think this works in general. It's not an actual interval change as far as the clients are concerned, so clients won't resend ops. Because different osds might have different past_intervals, it isn't even necessarily the case that all osds with this pg will consider it an interval change, so some osds may or may not even respond to peering messages resulting in a broken pg and/or crashed osds.

When osd.2 in your example comes back up, does it send an MOSDPGNotify to the primary? I think the right way to make this work is for

boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)

to check whether the newly booted node could be validly chosen for the acting set and request a new temp mapping. The new temp mapping would trigger a real interval change. I'll note, however, that that's a significantly larger refactor and if osd.2 is down long enough it would need backfill anyway.

Member

@xiexingguo xiexingguo Feb 8, 2020


> It's not an actual interval change as far as the clients are concerned, so clients won't resend ops. ... so some osds may or may not even respond to peering messages resulting in a broken pg and/or crashed osds.

No, it does not form a new interval change at this moment. But as the primary restarts the peering process, it will redo choose_acting, generate a new pg_temp containing the stray peer that has come back, and hence finally trigger a real interval change. There are also some other exceptions that need to be carefully handled.

I agree that this isn't an ideal fix, but we have locally verified that it (together with another small fix) works pretty well.

Anyway, thanks for your input. Yan and I will try to figure out a better fix soon.

@Yan-waller Yan-waller force-pushed the wip-walle-fixbackfillundersized branch from 4d65082 to 645e3b2 Compare February 15, 2020 05:57
Contributor Author

Yan-waller commented Feb 15, 2020

@athanatos Could you take another look?

// if so, request pg_temp change to trigger a new interval transition
pg_shard_t auth_log_shard;
bool history_les_bound = false;
ps->choose_acting(auth_log_shard, false, &history_les_bound, true);
Contributor
Add a debugging statement here at level 10.

Contributor Author
Added, and the case is hit with the following log:

2020-02-19 16:44:26.822991 7f846caf7700 10 osd.0 pg_epoch: 73119 pg[1.3e( v 71749'838968 (71624'837389,71749'838968] local-lis/les=73117/73118 n=1481 ec=107/107 lis/c 73117/73093 les/c/f 73118/73094/0 73096/73117/70738) [0,17,33]/[0,33] backfill=[17] r=0 lpr=73117 pi=[73093,73117)/1 rops=1 bft=17 crt=71749'838968 lcod 71749'838967 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={255={}}] state<Started/Primary/Active>: Active: got notify from previous acting member 3 , requesting pg_temp change
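
Judging from that log line, the added statement is presumably something like the following (a hedged sketch; psdout is the debug macro used throughout PeeringState.cc, notevt.from identifies the notifying peer, and the exact wording may differ):

    psdout(10) << "Active: got notify from previous acting member "
               << notevt.from << ", requesting pg_temp change" << dendl;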

Contributor
Did you observe this in a teuthology run? This is a pretty core change to peering, and it's important that it have coverage in our regular test runs, both because we need to ensure we don't rebreak it and because we want some confidence that it won't break something else.

Contributor Author
Nope, but I managed to add a reproducer anyway.

@athanatos
Contributor

As far as I can tell, this should work. Once you've added the debugging, could you confirm that this case actually gets hit by the existing teuthology tests in a normal run?

@athanatos
Contributor

retest this please

@Yan-waller Yan-waller force-pushed the wip-walle-fixbackfillundersized branch from 645e3b2 to b3d9254 Compare February 19, 2020 10:37
Member

@xiexingguo xiexingguo left a comment


works for our test case :-)

@Yan-waller Yan-waller force-pushed the wip-walle-fixbackfillundersized branch from ed0c9e2 to 023524a Compare February 21, 2020 09:53
@athanatos athanatos self-requested a review February 21, 2020 17:17