Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix asserts caused by DNE pgs left behind after lots of OSD restarts #20571

Merged
merged 1 commit into from Feb 26, 2018

Conversation

dzafman
Copy link
Contributor

@dzafman dzafman commented Feb 23, 2018

No description provided.

@gregsfortytwo
Copy link
Member

So all these problems were caused by a partly-imported PG? Several of the backtraces on that ticket are once the OSD is already running, after OSD::handle_advance_map gets into PG::start_peering_interval. (And one of them reports it migrated to other OSDs, as they started removing OSDs...but maybe there was a whole stale set which had started but not finished deleting the PG?)

Are you sure we want to delete the collection, though? Or I guess if epoch_created == 0 then we know there isn't actually any data in the PG?

@dzafman
Copy link
Contributor Author

dzafman commented Feb 23, 2018

@gregsfortytwo Yes, a DNE PG is empty unless this is a pg_info_t::pg_history_t corruption. We've noticed these DNE PGs after many crashes which we assumed was newly started backfill or recovery. I didn't figure out what would leave behind a newly created PG in DNE state (split?). I tested my fix by hacking ceph-objectstore-tool to clear epoch_created.

@dzafman
Copy link
Contributor Author

dzafman commented Feb 23, 2018

@gregsfortytwo Per recent logs for crash start_peering_interval() assert, the DNE pg is also appears empty because last_update.version == 0.

2018-02-14 03:05:02.468466 7f92379c7700 10 osd.46 pg_epoch: 40258 pg[7.77s5( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [51,41,53,23,25,9,58,31]/[51,2147483647,53,33,38,59,2147483647,4] r=-1 lpr=40258 pi=[28396,40202)/9 crt=0'0 unknown NOTIFY] Not blocking outgoing recovery messages

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simple enough fix, then. Just one nit.

src/osd/OSD.cc Outdated
service.pg_remove_epoch(pg->pg_id);
pg->unlock();
// Delete pg
RWLock::WLocker l(pg_map_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it matters at this point in the boot process, but we need to drop this before invoking recursive_remove_collection() since that prompts disk accesses. I'd just put a block around the pg_map modifying bits.

@dzafman
Copy link
Contributor Author

dzafman commented Feb 24, 2018

@gregsfortytwo Dropped pg_map_lock before recursive_remove_collection() call.

@gregsfortytwo
Copy link
Member

LGTM

{
// Delete pg
RWLock::WLocker l(pg_map_lock);
auto p = pg_map.find(pg->get_pgid());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a pg is returned by _open_lock_pg(), i think we can assume that this pg is always added to pg_map, am i right? so this check is not necessary, and can be replaced with an assert() i guess?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov Changed to an assert

Fixes: http://tracker.ceph.com/issues/21833

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman dzafman changed the title "FAILED assert(p.same_interval_since)", without importing PG? Fix asserts caused by DNE pgs left behind after lots of OSD restarts Feb 26, 2018
@dzafman
Copy link
Contributor Author

dzafman commented Feb 26, 2018

@tchaikov I don't know that another QA run is really necessary. This change passed make check and run-standalone.sh. I don't think rados suite would even go through this code path. I'll manually test this again, remove needs-qa and then merge.

@dzafman dzafman removed the needs-qa label Feb 26, 2018
@dzafman dzafman merged commit a2a6f60 into ceph:master Feb 26, 2018
@dzafman dzafman deleted the wip-21833-2 branch February 26, 2018 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants