Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix asserts caused by DNE pgs left behind after lots of OSD restarts #20571

Merged
merged 1 commit into from Feb 26, 2018

Conversation

Projects
None yet
3 participants
@dzafman
Copy link
Member

commented Feb 23, 2018

No description provided.

@dzafman dzafman requested review from liewegas and jdurgin Feb 23, 2018

@gregsfortytwo

This comment has been minimized.

Copy link
Member

commented Feb 23, 2018

So all these problems were caused by a partly-imported PG? Several of the backtraces on that ticket are once the OSD is already running, after OSD::handle_advance_map gets into PG::start_peering_interval. (And one of them reports it migrated to other OSDs, as they started removing OSDs...but maybe there was a whole stale set which had started but not finished deleting the PG?)

Are you sure we want to delete the collection, though? Or I guess if epoch_created == 0 then we know there isn't actually any data in the PG?

@dzafman

This comment has been minimized.

Copy link
Member Author

commented Feb 23, 2018

@gregsfortytwo Yes, a DNE PG is empty unless this is a pg_info_t::pg_history_t corruption. We've noticed these DNE PGs after many crashes which we assumed was newly started backfill or recovery. I didn't figure out what would leave behind a newly created PG in DNE state (split?). I tested my fix by hacking ceph-objectstore-tool to clear epoch_created.

@dzafman

This comment has been minimized.

Copy link
Member Author

commented Feb 23, 2018

@gregsfortytwo Per recent logs for crash start_peering_interval() assert, the DNE pg is also appears empty because last_update.version == 0.

2018-02-14 03:05:02.468466 7f92379c7700 10 osd.46 pg_epoch: 40258 pg[7.77s5( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [51,41,53,23,25,9,58,31]/[51,2147483647,53,33,38,59,2147483647,4] r=-1 lpr=40258 pi=[28396,40202)/9 crt=0'0 unknown NOTIFY] Not blocking outgoing recovery messages

@gregsfortytwo
Copy link
Member

left a comment

This is a simple enough fix, then. Just one nit.

service.pg_remove_epoch(pg->pg_id);
pg->unlock();
// Delete pg
RWLock::WLocker l(pg_map_lock);

This comment has been minimized.

Copy link
@gregsfortytwo

gregsfortytwo Feb 24, 2018

Member

I'm not sure it matters at this point in the boot process, but we need to drop this before invoking recursive_remove_collection() since that prompts disk accesses. I'd just put a block around the pg_map modifying bits.

@dzafman dzafman force-pushed the dzafman:wip-21833-2 branch from cb898dd to bf5387e Feb 24, 2018

@dzafman

This comment has been minimized.

Copy link
Member Author

commented Feb 24, 2018

@gregsfortytwo Dropped pg_map_lock before recursive_remove_collection() call.

@gregsfortytwo

This comment has been minimized.

Copy link
Member

commented Feb 24, 2018

LGTM

{
// Delete pg
RWLock::WLocker l(pg_map_lock);
auto p = pg_map.find(pg->get_pgid());

This comment has been minimized.

Copy link
@tchaikov

tchaikov Feb 26, 2018

Contributor

if a pg is returned by _open_lock_pg(), i think we can assume that this pg is always added to pg_map, am i right? so this check is not necessary, and can be replaced with an assert() i guess?

This comment has been minimized.

Copy link
@dzafman

dzafman Feb 26, 2018

Author Member

True

This comment has been minimized.

Copy link
@dzafman

dzafman Feb 26, 2018

Author Member

@tchaikov Changed to an assert

osd: Remove partially created pg known as DNE
Fixes: http://tracker.ceph.com/issues/21833

Signed-off-by: David Zafman <dzafman@redhat.com>

@dzafman dzafman force-pushed the dzafman:wip-21833-2 branch from bf5387e to 5ca5607 Feb 26, 2018

@dzafman dzafman changed the title "FAILED assert(p.same_interval_since)", without importing PG? Fix asserts caused by DNE pgs left behind after lots of OSD restarts Feb 26, 2018

@tchaikov tchaikov added the needs-qa label Feb 26, 2018

@dzafman

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2018

@tchaikov I don't know that another QA run is really necessary. This change passed make check and run-standalone.sh. I don't think rados suite would even go through this code path. I'll manually test this again, remove needs-qa and then merge.

@dzafman dzafman removed the needs-qa label Feb 26, 2018

@dzafman dzafman merged commit a2a6f60 into ceph:master Feb 26, 2018

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@dzafman dzafman deleted the dzafman:wip-21833-2 branch Feb 26, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.