Fix asserts caused by DNE pgs left behind after lots of OSD restarts #20571

dzafman · 2018-02-23T20:13:51Z

No description provided.

gregsfortytwo · 2018-02-23T22:20:25Z

So all these problems were caused by a partly-imported PG? Several of the backtraces on that ticket are from after the OSD is already running, once OSD::handle_advance_map gets into PG::start_peering_interval. (And one of them reports that the problem migrated to other OSDs as they started removing OSDs... but maybe there was a whole stale set which had started but not finished deleting the PG?)

Are you sure we want to delete the collection, though? Or I guess if epoch_created == 0 then we know there isn't actually any data in the PG?

dzafman · 2018-02-23T22:36:58Z

@gregsfortytwo Yes, a DNE PG is empty unless this is a pg_info_t::pg_history_t corruption. We've noticed these DNE PGs after many crashes, which we assumed came from newly started backfill or recovery. I didn't figure out what would leave behind a newly created PG in DNE state (split?). I tested my fix by hacking ceph-objectstore-tool to clear epoch_created.

dzafman · 2018-02-23T23:08:12Z

@gregsfortytwo Per recent logs for the start_peering_interval() assert crash, the DNE pg also appears empty because last_update.version == 0.

2018-02-14 03:05:02.468466 7f92379c7700 10 osd.46 pg_epoch: 40258 pg[7.77s5( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [51,41,53,23,25,9,58,31]/[51,2147483647,53,33,38,59,2147483647,4] r=-1 lpr=40258 pi=[28396,40202)/9 crt=0'0 unknown NOTIFY] Not blocking outgoing recovery messages
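To make the check discussed above concrete, here is a minimal sketch of the two conditions from this thread: epoch_created == 0 (the log's `ec=0/0`) and last_update.version == 0. The `Toy*` structs and the helper names are hypothetical stand-ins, not Ceph's pg_info_t or its API, and requiring both conditions together is an assumption made for illustration rather than what the patch necessarily does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the fields mentioned in the thread.
// These are NOT the real pg_info_t / pg_history_t definitions.
struct ToyHistory {
  uint32_t epoch_created = 0;  // 0 is what the thread calls "DNE"
};

struct ToyVersion {
  uint32_t epoch = 0;
  uint64_t version = 0;        // 0 means no log entry was ever applied
};

struct ToyInfo {
  ToyHistory history;
  ToyVersion last_update;
};

// A PG whose creation epoch was never recorded.
inline bool looks_dne(const ToyInfo& info) {
  return info.history.epoch_created == 0;
}

// A PG that has never received an update.
inline bool looks_empty(const ToyInfo& info) {
  return info.last_update.version == 0;
}

// Assumption: only treat the PG as removable when both signals agree,
// so a corrupted history on a PG that holds data is not deleted.
inline bool safe_to_remove(const ToyInfo& info) {
  return looks_dne(info) && looks_empty(info);
}
```

Checking both fields is a conservative choice: it covers the pg_history_t corruption case raised earlier, where epoch_created could read as 0 on a PG that still holds data.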

gregsfortytwo

This is a simple enough fix, then. Just one nit.

gregsfortytwo · 2018-02-24T01:54:56Z

src/osd/OSD.cc

+      service.pg_remove_epoch(pg->pg_id);
+      pg->unlock();
+      // Delete pg
+      RWLock::WLocker l(pg_map_lock);


I'm not sure it matters at this point in the boot process, but we need to drop this before invoking recursive_remove_collection() since that prompts disk accesses. I'd just put a block around the pg_map modifying bits.

dzafman · 2018-02-24T17:33:08Z

@gregsfortytwo Dropped pg_map_lock before recursive_remove_collection() call.
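The locking change requested in the review can be sketched as follows: the map modification sits in its own block so the write lock is released before any disk work starts. Everything here is a hypothetical model, not the OSD code: `ToyOSD`, `remove_collection`, and the `events` log are invented for illustration, and `std::mutex` stands in for Ceph's RWLock.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical model of the pattern; none of these are Ceph types.
struct ToyOSD {
  std::mutex pg_map_lock;             // stand-in for the RWLock pg_map_lock
  std::map<std::string, int> pg_map;  // pgid -> placeholder PG handle
  std::vector<std::string> events;    // records the order of operations

  // Stand-in for recursive_remove_collection(): slow disk work that
  // should not run while pg_map_lock is held.
  void remove_collection(const std::string& pgid) {
    (void)pgid;
    events.push_back("disk");
  }

  void remove_dne_pg(const std::string& pgid) {
    {
      // Scope the lock to the map modification only.
      std::lock_guard<std::mutex> l(pg_map_lock);
      events.push_back("lock");
      pg_map.erase(pgid);
      events.push_back("erase");
      events.push_back("unlock");  // logged just before the guard's destructor runs
    }
    // The lock has been released; disk access happens outside it.
    remove_collection(pgid);
  }
};
```

Calling `remove_dne_pg("7.77s5")` on a `ToyOSD` that holds that entry leaves `pg_map` empty and records the lock, erase, unlock, disk order in `events`.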

gregsfortytwo · 2018-02-24T17:46:49Z

LGTM

tchaikov · 2018-02-26T01:44:13Z

http://pulpito.ceph.com/kchai-2018-02-25_02:17:51-rados-wip-kefu-testing-2018-02-25-0801-distro-basic-mira/

tchaikov · 2018-02-26T01:51:25Z

src/osd/OSD.cc

+      {
+	// Delete pg
+	RWLock::WLocker l(pg_map_lock);
+	auto p = pg_map.find(pg->get_pgid());


If a pg is returned by _open_lock_pg(), I think we can assume that this pg has always been added to pg_map, am I right? So this check is not necessary and can be replaced with an assert(), I guess?

@tchaikov Changed to an assert
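A rough sketch of what "changed to an assert" means, assuming the lookup is an invariant rather than a condition to branch on. `erase_known_pg` and the map's key and value types are hypothetical; the real code operates on the OSD's pg_map under pg_map_lock.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: the caller guarantees the pgid is present, so a
// missing entry is a programming error rather than a case to handle.
inline void erase_known_pg(std::map<std::string, int>& pg_map,
                           const std::string& pgid) {
  auto p = pg_map.find(pgid);
  // Before: if (p != pg_map.end()) { pg_map.erase(p); }
  // After: state the invariant, then erase unconditionally.
  // Note: the standard assert() compiles away when NDEBUG is defined.
  assert(p != pg_map.end());
  pg_map.erase(p);
}
```

The assert documents the assumption that every opened PG is already in the map, and it fails loudly in debug builds if that assumption ever breaks.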

Fixes: http://tracker.ceph.com/issues/21833 Signed-off-by: David Zafman <dzafman@redhat.com>

dzafman · 2018-02-26T19:39:06Z

@tchaikov I don't know that another QA run is really necessary. This change passed make check and run-standalone.sh. I don't think rados suite would even go through this code path. I'll manually test this again, remove needs-qa and then merge.

dzafman added bug-fix core needs-qa labels Feb 23, 2018

dzafman requested review from liewegas and jdurgin February 23, 2018 20:13

gregsfortytwo requested changes Feb 24, 2018

dzafman force-pushed the wip-21833-2 branch from cb898dd to bf5387e Compare February 24, 2018 03:59

tchaikov added the wip-kefu-testing label Feb 24, 2018

gregsfortytwo approved these changes Feb 24, 2018

tchaikov reviewed Feb 26, 2018

tchaikov removed needs-qa wip-kefu-testing labels Feb 26, 2018

osd: Remove partially created pg known as DNE

5ca5607

Fixes: http://tracker.ceph.com/issues/21833 Signed-off-by: David Zafman <dzafman@redhat.com>

dzafman force-pushed the wip-21833-2 branch from bf5387e to 5ca5607 Compare February 26, 2018 15:56

dzafman changed the title ~~"FAILED assert(p.same_interval_since)", without importing PG?~~ Fix asserts caused by DNE pgs left behind after lots of OSD restarts Feb 26, 2018

tchaikov approved these changes Feb 26, 2018

tchaikov added the needs-qa label Feb 26, 2018

dzafman removed the needs-qa label Feb 26, 2018

dzafman merged commit a2a6f60 into ceph:master Feb 26, 2018

dzafman deleted the wip-21833-2 branch February 26, 2018 19:53
