Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: show destroyed status in tree view; do not auto-out destroyed osds #16446

Merged
merged 4 commits into from Jul 26, 2017

Conversation

Projects
None yet
3 participants
@xiexingguo
Copy link
Member

xiexingguo commented Jul 20, 2017

E.g:

ID CLASS WEIGHT  TYPE NAME                                        UP/DOWN REWEIGHT PRI-AFF DESTROYED 
-1       3.00000 root default                                                                        
-2       3.00000     host gitbuilder-ceph-rpm-centos7-amd64-basic                                    
 0   ssd 1.00000         osd.0                                       down  1.00000 1.00000       yes 
 1   ssd 1.00000         osd.1                                       down  1.00000 1.00000       yes 
 2   ssd 1.00000         osd.2                                         up  1.00000 1.00000        no 
HEALTH_WARN 2 osds destroyed; 2 osds down; Degraded data redundancy: 42/63 objects degraded (66.667%), 16 pgs unclean, 16 pgs degraded, 16 pgs undersized
OSD_DESTROYED 2 osds destroyed
    osd.0 is destroyed
    osd.1 is destroyed

@xiexingguo xiexingguo force-pushed the xiexingguo:wip-destroyed branch from e71b04f to 807c012 Jul 20, 2017

@xiexingguo xiexingguo requested a review from liewegas Jul 20, 2017

@@ -3112,7 +3113,8 @@ class OSDTreePlainDumper : public CrushTreeDumper::Dumper<TextTable> {
} else {
*tbl << (osdmap->is_up(qi.id) ? "up" : "down")

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 20, 2017

Member

Instead of adding a new column, let's make this column (up, down, destroyed)

This comment has been minimized.

Copy link
@xiexingguo

xiexingguo Jul 21, 2017

Author Member

Do we allow a combination of "down + destroyed" here? Or just "destroyed" will be good enough since it implicitly indicates that it is already down?

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 21, 2017

Member

destroyed shoudl imply down. (maybe reverify that the destroy and purge commands fail is the osd i sup, and that prepare_boot also fails if the osd is marked destroyed?)

ss << "osd." << osd << " is destroyed";
d.detail.push_back(ss.str());
}
}

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 20, 2017

Member

I'm not sure we need a warning here. Just like a down + out osd is not an error (or warning) condition. It's clutter in the map but harmless and not dangerous.

This comment has been minimized.

Copy link
@xiexingguo

xiexingguo Jul 21, 2017

Author Member

Just a reminder. But I think we can merge this into the osd down warnings. What do you think?

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 21, 2017

Member

Perhaps we keep the same policy as OSD_DOWN.. we complain if the osd is 'in'; ignore if it's 'out'?

@@ -3346,7 +3346,8 @@ void OSDMonitor::tick()
}

if (g_conf->mon_osd_down_out_interval > 0 &&
down.sec() >= grace) {
down.sec() >= grace &&
!osdmap.is_destroyed(o)) {

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 20, 2017

Member

Same here, but I'm not sure. What is your line of thinking?

This comment has been minimized.

Copy link
@xiexingguo

xiexingguo Jul 21, 2017

Author Member

The "osd destory" command is introduced mainly for osd replacing process. We want to minimize the data migration, but the auto-out mechanism does quite the opposite(for data safety).

Since user is aware of what he/she is doing(osd destroy can be only manually initiated), we can definitely disable this safely.

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 21, 2017

Member

Yeah, that makes sense. But perhaps we make it explicit by adding a mon_osd_destroyed_out_interval option, with behaiior identical do the down option?

@xiexingguo xiexingguo force-pushed the xiexingguo:wip-destroyed branch from 807c012 to e63a595 Jul 22, 2017

@xiexingguo

This comment has been minimized.

Copy link
Member Author

xiexingguo commented Jul 24, 2017

maybe reverify that the destroy and purge commands fail is the osd i sup

@liewegas Most of your concerns are addressed now, expect the above one which is exactly the current code logic doing. Mind rechecking?

@xiexingguo xiexingguo added this to the luminous milestone Jul 24, 2017

@liewegas
Copy link
Member

liewegas left a comment

Couple nits, otherwise looks good!

}
if ((filter & (OSDMap::DUMP_UP|OSDMap::DUMP_DOWN|OSDMap::DUMP_DESTROYED)) ==
(OSDMap::DUMP_UP|OSDMap::DUMP_DOWN|OSDMap::DUMP_DESTROYED)) {
ss << "can not specify 'up', 'down' and 'destroyed' simutaneously";

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 24, 2017

Member

This isn't quite right... we need to make sure they specify no more than one of the three and this allows any two.

@@ -5678,6 +5678,8 @@ void OSD::_preboot(epoch_t oldest, epoch_t newest)
// if our map within recent history, try to add ourselves to the osdmap.
if (osdmap->get_epoch() == 0) {
derr << "waiting for initial osdmap" << dendl;
} else if (osdmap->is_destroyed(whoami)) {
derr << "osdmap says I am destroyed, waiting for re-creation" << dendl;

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 24, 2017

Member

It is probably better to actually exit here, since if we are recreated it will be a new osd with this id, with a new uuid, that is created.

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 24, 2017

Member

(If we are stuck here we are the old osd with old data trying to start)

@xiexingguo xiexingguo force-pushed the xiexingguo:wip-destroyed branch from e63a595 to 1dab814 Jul 24, 2017

@xiexingguo

This comment has been minimized.

Copy link
Member Author

xiexingguo commented Jul 24, 2017

@liewegas fixed and repushed

@@ -5678,6 +5678,9 @@ void OSD::_preboot(epoch_t oldest, epoch_t newest)
// if our map within recent history, try to add ourselves to the osdmap.
if (osdmap->get_epoch() == 0) {
derr << "waiting for initial osdmap" << dendl;
} else if (osdmap->is_destroyed(whoami)) {
derr << "osdmap says I am destroyed, exiting" << dendl;
exit(1);

This comment has been minimized.

Copy link
@liewegas

liewegas Jul 24, 2017

Member

Do you know what the rule is (if any) on how we should exit if we want systemd not to try to restart us?

@liewegas liewegas added the needs-qa label Jul 24, 2017

@xiexingguo xiexingguo force-pushed the xiexingguo:wip-destroyed branch from 1dab814 to 9c47454 Jul 25, 2017

@xiexingguo

This comment has been minimized.

Copy link
Member Author

xiexingguo commented Jul 25, 2017

Do you know what the rule is (if any) on how we should exit if we want systemd not to try to restart us?

We only restart on-failure, so exit(0) will do. Repushed.

@xiexingguo

This comment has been minimized.

Copy link
Member Author

xiexingguo commented Jul 25, 2017

retest this please

@@ -227,6 +227,7 @@ OPTION(mon_osd_auto_mark_in, OPT_BOOL, false) // mark any booting osds '
OPTION(mon_osd_auto_mark_auto_out_in, OPT_BOOL, true) // mark booting auto-marked-out osds 'in'
OPTION(mon_osd_auto_mark_new_in, OPT_BOOL, true) // mark booting new osds 'in'
OPTION(mon_osd_down_out_interval, OPT_INT, 600) // seconds
OPTION(mon_osd_destroyed_out_interval, OPT_INT, 600) // seconds

This comment has been minimized.

Copy link
@tchaikov

tchaikov Jul 26, 2017

Contributor

could you move this option up before mon_osd_down_out_interval?

This comment has been minimized.

Copy link
@xiexingguo

xiexingguo Jul 26, 2017

Author Member

Sure.

xiexingguo added some commits Jul 21, 2017

osd/OSDMap: show "destroyed" status in tree view
It would be a pain if we destroyed an osd and there is no obvious way
to find it out.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
mon/OSDMonitor: apply new 'destroyed' status to 'osd tree' filter
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
osd/OSD: abort booting if osd is marked as destroyed
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

@xiexingguo xiexingguo force-pushed the xiexingguo:wip-destroyed branch from 9c47454 to 1e867d7 Jul 26, 2017

mon/OSDMonitor: introduce mon_osd_destroyed_out_interval for destroye…
…d out control

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

@tchaikov tchaikov merged commit d85a788 into ceph:master Jul 26, 2017

4 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@xiexingguo xiexingguo deleted the xiexingguo:wip-destroyed branch Jul 26, 2017

@xiexingguo

This comment has been minimized.

Copy link
Member Author

xiexingguo commented Jul 26, 2017

@tchaikov Thanks for testing and merging, kefu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.