mon,mgr: extricate PGmap from monitor #15073

Merged: 131 commits merged into ceph:master from liewegas:wip-mgr-stats, Jun 4, 2017

liewegas (Member) commented May 13, 2017

Rebased #15022, then separated the necessary bits of PGMap into a PGMapDigest parent class.

  • make check
  • robust flush_pg_stats machinery and/or adjust all of the tests that use it
  • qa rados
  • include PGMapDigest in 'mon report'
  • clean out pre-luminous pg stat queue on OSD (current code is racy)
  • limit size or remove health_detail from report
  • send pg scrub and osd scrub commands via mgr
  • trim old mgrstat states
  • clean up PGMapDigest encode/decode
  • clean up the useless digestbl bufferlist wrapper

eventually

  • remove osd_stat from PGMapDigest (it gets big!)

liewegas added the mgr and mon labels May 13, 2017

src/mon/OSDMonitor.cc
// should have prepared the latest pgmap if any
mon->pgservice->maybe_add_creating_pgs(creating_pgs.last_scan_epoch,
                                       &pending_creatings);
if (!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {

gregsfortytwo (Member), May 15, 2017

If I read this correctly, flagging this on LUMINOUS when updating is exactly backwards. @tchaikov? I think we're trying to handle the overlap where the pgmap might have gotten an update which we didn't pick up in the separate creating_pgs set.

This also illustrates the main missing thing we need to deal with, which is the switch on upgrade from using the local pgmap to going via the ceph-mgr.
Also remaining: persistence of whatever state we need (from the mgr to the mon), and a sane way of dealing with stale data.

tchaikov (Contributor), May 16, 2017

no, i don't think it's backwards. we don't need to sync up with pgmap if all osds are luminous, but we need to do so while some osds are still pre-luminous.

i try to keep creating_pgs updated with the bits from pgmap, which is guaranteed to be updated before CEPH_OSDMAP_REQUIRE_LUMINOUS is set, so we can have a complete creating_pgs set when we switch to CEPH_OSDMAP_REQUIRE_LUMINOUS. assume the following upgrade steps:

  1. all mons are luminous, and FEATURE_LUMINOUS in the monmap is set when the last luminous mon joins the quorum. but not all osds are luminous, so we can not set CEPH_OSDMAP_REQUIRE_LUMINOUS yet.

  2. the leader starts to update creating_pgs, but it is not able to have the complete set of pgs being created just by consuming the (inc) osdmaps after FEATURE_LUMINOUS is set, because

    1. some pools are still being created when FEATURE_LUMINOUS is set, so the pgs belonging to those pools could still be being created too, and they only exist in pgmap.
    2. some osds are still pre-luminous, so they keep PGMonitor (now PGMapStatService) updated with MPGStats, but they don't send MOSDPGCreated to the mon. that's what trim_creating_pgs() is for. please note we also guard it with !osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS) in OSDMonitor::encode_pending().

  3. but once we have CEPH_OSDMAP_REQUIRE_LUMINOUS in the osdmap, all osds are luminous. we don't need the pgmap for updating creating_pgs anymore; the luminous osds will ensure this.

am i missing anything?
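A minimal sketch of the guard under discussion, for orientation; the placement inside OSDMonitor's pending-creates update is an editorial reading of this thread, and only the names quoted in the snippet above come from the PR:

// During the upgrade window some OSDs are still pre-luminous and report
// creates only via MPGStats into the PGMap, so fold the PGMap's creating
// pgs into our own set. Once REQUIRE_LUMINOUS is set, every OSD sends
// MOSDPGCreated directly and the PGMap no longer needs to be consulted.
if (!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
  mon->pgservice->maybe_add_creating_pgs(creating_pgs.last_scan_epoch,
                                         &pending_creatings);
}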

gregsfortytwo (Member), May 16, 2017

Ah, that sounds right. The dueling monitor and osdmap LUMINOUS flags just keep jumbling up in my head.

liewegas (Member) commented May 15, 2017

ok, added a bunch more stuff...

  1. can't have the digest in MgrMonitor or else the epoch for MgrMap rolls over every 5 seconds. moved it to a new MgrStatMonitor.
  2. Fixed the PGMap ref in PGMonitor

Now when i start with -o 'mon debug no require luminous = true' it works both before and after i 'ceph osd set require_luminous_osds'.

  • still need to fix the upgrade thing @gregsfortytwo pointed out
  • maybe the report frequency should be every 1 or 2 seconds instead of every 5 seconds?
  • mgr needs to log at INFO for pg stat updates so that you see the io stats etc in the log, i think?

and obviously this whole series needs a lot of clean up...

src/osd/OSD.cc
    newmap->test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
  dout(10) << __func__ << " REQUIRE_LUMINOUS flag changed in "
           << newmap->get_epoch() << dendl;
  clear_pg_stat_queue();

tchaikov (Contributor), May 16, 2017

quoting from liewegas#35 (comment):

this isn't quite sufficient since the pgs' osdmaps lag and they will still queue themselves...

tchaikov (Contributor), May 17, 2017

@liewegas i don't think so. see f71c18a#diff-dfb9ddca0a3ee32b266623e8fa489626R2756. please note, we are checking against the osdmap in OSD, not osdmap_ref or last_persisted_osdmap_ref.

a pg checks the osdmap flag before queuing itself. imagine the following sequence:

  1. the osdmap does not have CEPH_OSDMAP_REQUIRE_LUMINOUS, and a pg enqueues itself
  2. the osd consumes an osdmap with CEPH_OSDMAP_REQUIRE_LUMINOUS, so the queue is cleared
  3. the pg does not enqueue itself anymore.

liewegas (Member), May 17, 2017

i'm assuming that the pg enqueuing itself is based on PG::get_osdmap(), which is often an older map than the OSD's OSDMap.

Perhaps what we need here is a before/after comparison of OSDService::get_min_pg_epoch(), and when that crosses the flag threshold we clear the queue?

tchaikov (Contributor), May 17, 2017

@liewegas, a pg enqueues itself based on PG::get_osdmap(), but see f71c18a#diff-dfb9ddca0a3ee32b266623e8fa489626R2756: i disable the enqueue based on the OSD's latest osdmap.
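A sketch of the behavior described here, assuming the stat-queue path can consult the OSD's newest map; the helper name and the simplified locking are illustrative, not the PR's actual code:

// Gate stat-queue admission on the OSD's newest osdmap rather than the
// PG's own (possibly stale) map, so that once REQUIRE_LUMINOUS arrives
// and the queue is cleared, no PG can re-enqueue itself.
void OSD::pg_stat_queue_enqueue_sketch(PG *pg)
{
  Mutex::Locker l(pg_stat_queue_lock);
  if (get_osdmap()->test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS))
    return;  // post-upgrade: stats flow to ceph-mgr instead
  if (!pg->stat_queue_item.is_on_list())
    pg_stat_queue.push_back(&pg->stat_queue_item);
}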

liewegas changed the title from "DNM mon,mgr: extricate PGmap from monitor" to "mon,mgr: extricate PGmap from monitor" May 16, 2017

liewegas added the needs-qa label May 16, 2017

tchaikov (Contributor) commented May 17, 2017

@liewegas could you elaborate a little bit on

include PGMapDigest in 'mon report'

? what is "mon report"? maybe i can work on it tomorrow.

regarding flush_pg_stats: maybe we can have a helper in ceph-helper.sh that calls osd flush_pg_stats and then mgr send-mon?

liewegas (Member) commented May 17, 2017

@tchaikov look for "report" in Monitor.cc.. it's a big json dump of everything interesting about the cluster state.
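A hedged sketch of what the "include PGMapDigest in 'mon report'" todo could look like inside that handler; the mgrstatmon() accessor and the digest dump call are assumptions for illustration, not code from the PR:

// In the "report" command handler in Monitor.cc: add a section for the
// mgr-fed digest alongside the existing monmap/osdmap/... dumps.
f->open_object_section("pgmap_digest");
mgrstatmon()->get_digest().dump(f);  // hypothetical accessor + dump
f->close_section();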

tchaikov (Contributor) commented May 17, 2017

thanks. i will work on it tomorrow if nobody takes it by then.

tchaikov (Contributor) commented May 28, 2017

being tested at http://pulpito.ceph.com/kchai-2017-05-27_07:55:19-rados-wip-mgr-stats-kefu---basic-mira/; it seems all qa/workunits/rados/test.sh runs failed, but the teuthology.log is too large (1.6GB) to inspect. will find a way to shrink it a little bit.

liewegas#44 should address these failures

tchaikov (Contributor) commented May 28, 2017

will rebase this branch against master tomorrow.

tchaikov (Contributor) commented May 29, 2017

@liewegas how shall we "remove osd_stat from PGMapDigest"? are we going to move the features based on these stats to the mgr? for example, OSDMonitor uses it for preparing the output of osd df. if that's the case, i can do it tomorrow.

liewegas (Member) commented May 29, 2017

I don't think the osd_stats are a blocking issue, and @gregsfortytwo already started on it, I believe. If this is passing tests we should merge.

tchaikov (Contributor) commented May 30, 2017

i see. http://pulpito.ceph.com/kchai-2017-05-29_18:22:20-rados-wip-mgr-stats-kefu---basic-mira/ still has 17 tests failed/timed out; most of them include rados.yaml. i will fix them before the merge.

gregsfortytwo (Member) commented May 30, 2017

Yep, working on the osd_stat map.

Kept getting blocked on other things so I haven't been able to track this the way I'd like; should be able to do so tomorrow. Does upgrading work properly, and has it been tested?

tchaikov (Contributor) commented May 30, 2017

@gregsfortytwo i thought the rados suite run should cover some upgrade tests, but it turns out that my last rados run does not include any of them: http://pulpito.ceph.com/kchai-2017-05-27_07:55:19-rados-wip-mgr-stats-kefu---basic-mira/?suite=rados%3Aupgrade.

i am scheduling an upgrade run at http://pulpito.ceph.com/kchai-2017-05-30_05:52:48-upgrade-wip-mgr-stats-kefu---basic-mira/ . let's see how it goes!

src/mon/OSDMonitor.cc
@@ -1073,14 +1130,17 @@ void OSDMonitor::prime_pg_temp(
   const OSDMap& next,
   pg_t pgid)
 {
-  if (mon->monmap->get_required_features().contains_all(
+  if (mon->monmap->get_required_features().empty()) {
+    dout(10) << __func__ << ": quorum not formed" << dendl;

gregsfortytwo (Member), May 30, 2017

@tchaikov, don't we have a way to directly query the quorum status, instead of this indirect feature query?

liewegas (Member), May 30, 2017

i think this should just be dropped, actually.. the monmap doesn't know anything about quorum, and i can't tell what the intent was here

gregsfortytwo (Member), May 31, 2017

It's trying not to read the data if we don't expect it to be valid. I'm not sure if that's necessary or useful — I think I said elsewhere that it seemed weird for us to be invoking any of these functions while not active...

src/mon/OSDMonitor.cc
@@ -1434,24 +1494,29 @@ void OSDMonitor::share_map_with_random_osd()
 version_t OSDMonitor::get_trim_to()
 {
   epoch_t floor;
+  if (mon->monmap->get_required_features().empty()) {
+    dout(10) << __func__ << ": quorum not formed" << dendl;

gregsfortytwo (Member), May 30, 2017

@tchaikov, same here about direct quorum querying. We can query !Monitor::get_quorum().empty() for instance.
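A minimal sketch of the direct check being suggested, assuming the early return sits where the feature test is today; the return value is illustrative:

// Bail out early if no quorum has formed, querying quorum state directly
// instead of inferring it from the required-features set.
if (mon->get_quorum().empty()) {
  dout(10) << __func__ << ": quorum not formed" << dendl;
  return 0;  // nothing safe to trim yet
}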

@@ -535,6 +578,10 @@ bool PGMonitor::preprocess_query(MonOpRequestRef op)
 bool PGMonitor::prepare_update(MonOpRequestRef op)
 {
+  if (mon->osdmon()->osdmap.require_osd_release >= CEPH_RELEASE_LUMINOUS) {
+    return false;

gregsfortytwo (Member), May 30, 2017

@liewegas, there are some haphazard quick returns on invoking pgmon post-luminous, but shouldn't we be doing something like removing it from the dispatch loops entirely?
I'd actually think we want to just not load it into memory at all once we're past the upgrade point, and maybe even unload it, but I haven't looked into how tricky that is yet....

liewegas (Member), May 30, 2017

There is no 'loop' per se, just a couple call sites in Monitor.cc. We can move these calls to the caller but I don't think it makes things any better or worse. I don't know that there is much more we can do to remove/disable this code until we don't have to support the upgrade case any more (or that we'd improve runtime performance by doing so).

gregsfortytwo (Member), May 30, 2017

I meant the call sites in the dispatch event loop. :)

And speaking of those call sites, see below about disabling it.

if (tx_size > g_conf->mon_sync_max_payload_size*2) {
  mon->store->apply_transaction(t);
  t = MonitorDBStore::TransactionRef();
  tx_size = 0;
}
if (mon->monmap->get_required_features().contains_all(
      ceph::features::mon::FEATURE_LUMINOUS)) {
  creating_pgs = update_pending_pgs(inc);

gregsfortytwo (Member), May 30, 2017

@tchaikov @liewegas did this ever get discussed? From the commit:

We are not persisting the updated creating_pgs here; this is bad! I'm not sure why it was there to begin with?

It originated in 43ded67 as its own hunk, so it was clearly deliberate. I think it must have been so we could identify some of the state-changed PGs, and I'm a bit worried about dropping the hunk. (For instance, if we've deleted a pool with creating pgs, we don't want to tell OSDs to create them! And we don't need to persist that data since any subsequent load will produce the same outcome.)

liewegas (Member), May 30, 2017

It probably made sense at the time, but it is incorrect to call update_pending_pgs() and not persist the result. @tchaikov can you confirm this isn't needed?

tchaikov (Contributor), May 31, 2017

@gregsfortytwo i think we can drop this line.

if we just deleted a pool with creating pgs, the quorum should be aware of this and creating_pgs should be updated, so it should not contain the creating pgs in that pool. if a new mon just joins the quorum, it will be sync'ed with another monitor. OSDMonitor::get_store_prefixes() returns both "osdmap" and "osd_pg_creating", so "pg_creating" will be sync'ed. this ensures that the new joiner will also have an updated copy of creating_pgs.
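A sketch of the sync guarantee being cited; the prefix strings follow the comment above, though the exact spelling of the constant in the tree may differ:

// Both prefixes are handed to the mon's store-sync machinery, so a newly
// joined monitor receives the persisted creating_pgs along with the
// osdmaps.
void OSDMonitor::get_store_prefixes(std::set<std::string>& s)
{
  s.insert(service_name);       // "osdmap"
  s.insert("osd_pg_creating");  // persisted creating_pgs
}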

}
put_last_committed(t, pending_inc.version);
put_value(t, "deleted", 1);
return;

gregsfortytwo (Member), May 30, 2017

@liewegas, I don't remember how much we care about the pgmap state, but should we perhaps copy it elsewhere in case of disaster?

liewegas (Member), May 30, 2017

We could leave it where it is, but then I don't know when we would delete it. My thinking is that since we don't bother storing it now it won't hurt to throw out the legacy info. We won't start trimming again until we reestablish a new last_epoch_clean lower bound, so what would we use it for? Maybe we could pg dump to the log or something?

gregsfortytwo (Member), May 30, 2017

I was thinking more disaster recovery. But yeah, I'm fine with tossing it. Just wanted to check.

std::unique_ptr<PGStatService> pgservice;
bool do_delete = false; ///< propose deleting pgmap data
bool did_delete = false; ///< we already deleted pgmap data

gregsfortytwo (Member), May 30, 2017

ah, we've got some haphazard did_delete checks in the PGMonitor. How about instead we expose that to the Monitor via an is_enabled() interface, and crash out if anybody tries to access a PaxosService which is not enabled?

liewegas (Member), May 31, 2017

It's a one-time event so I'm not sure getting Monitor involved will help. I'm also trying to keep the hackery confined to PGMonitor as much as possible.

I like the idea of making it crash out, though.. maybe pgmon() could return null once the switch is flipped?

gregsfortytwo (Member), May 31, 2017

Yeah, I was just trying to be general about it. Simple hacks are simple, though!
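A sketch of the "return null once the switch is flipped" idea floated above; the disabled flag is an assumption, while PAXOS_PGMAP and the cast follow the existing accessor style:

// Hypothetical accessor: once the legacy pgmap data has been deleted,
// hand back nullptr so any stray caller faults immediately instead of
// silently reading stale state.
PGMonitor *Monitor::pgmon()
{
  if (pgmon_disabled)  // would be set when PGMonitor drops its data
    return nullptr;
  return (PGMonitor *)paxos_service[PAXOS_PGMAP];
}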

gregsfortytwo (Member) commented May 30, 2017

@liewegas, did these commits sneak in from your messenger lossy connections branch?

"mgr/DaemonServer: keep registry of osd sessions
Occasionally we send them messages."
"mgr/DaemonServer: use registered osd session to send scrub messages
We can't rely on the messenger to track lossy connections for us."

src/mon/MgrStatMonitor.cc
dout(10) << " " << version << dendl;
bufferlist digestbl, bl;
::encode(pending_digest, digestbl, mon->get_quorum_con_features());
::encode(digestbl, bl);

gregsfortytwo (Member), May 30, 2017

@liewegas, the commit adding this digestbl says it's "just for bigbang's benefit", although I'm not sure how. What's up, and should it actually go in?

liewegas (Member), May 31, 2017

Yeah, the encoding into a bufferlist that then gets encoded again is there because the first instance of bigbang did that and we had to maintain compat. Same goes for PGMapDigest, which has a somewhat weird encoding. I'll separate those patches out.
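The compat wrinkle, sketched from the snippet above; the decode side is an editorial illustration of what the wrapping implies, not code quoted from the PR:

// Encode side: the digest goes into its own bufferlist, which is then
// encoded again (length-prefixed) into the payload, because that is the
// on-disk format the first bigbang build established.
bufferlist digestbl, bl;
::encode(pending_digest, digestbl, mon->get_quorum_con_features());
::encode(digestbl, bl);

// Decode side has to mirror the extra wrapping:
bufferlist::iterator p = bl.begin();
bufferlist inner;
::decode(inner, p);   // unwrap the outer blob
bufferlist::iterator q = inner.begin();
::decode(digest, q);  // then decode the actual PGMapDigest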

src/mon/Monitor.cc
continue;
reply->pool_stats[pool_name] = *pool_stat;
}
send_reply(op, reply);

gregsfortytwo (Member), May 30, 2017

@tchaikov, we've managed to keep the RADOS cluster stuff out of Monitor.cc so far; do you think we can put these stat handlers in OSDMonitor?

tchaikov (Contributor), May 31, 2017

@gregsfortytwo yes, agreed.

tchaikov (Contributor), May 31, 2017

on second thought, i think they should go to MgrStatMonitor. MgrStatMonitor is "closer" to where these stats requests will be served in the long term than OSDMonitor.

gregsfortytwo (Member), May 31, 2017

Don't forget we need to route it correctly before the upgrade is complete.

tchaikov (Contributor), Jun 1, 2017

ahh, my patch didn't get into this one. it does take this into consideration.
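A hedged sketch of the routing constraint being raised; every name here except osdmon() and require_osd_release is a hypothetical illustration, and the PR's actual patch may be structured quite differently:

// Until all OSDs are luminous, some still feed the legacy PGMap, so
// MStatfs/MGetPoolStats must be answered from it; afterwards the
// mgr-fed digest is authoritative.
PGStatService *Monitor::stat_service_for_queries()
{
  if (osdmon()->osdmap.require_osd_release >= CEPH_RELEASE_LUMINOUS)
    return mgrstat_service.get();  // digest pushed by ceph-mgr
  return pgmon_service.get();      // legacy, fed by MPGStats
}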

gregsfortytwo (Member) commented May 30, 2017

Okay, not a full review but I skimmed through it enough to feel good about what's gone on. :) Notably I did not check the pg temp rework. Let me know if somebody hasn't looked at that yet.

Now, let me rebase some patches and get that osd_stat out of there...

tchaikov (Contributor) commented May 31, 2017

@gregsfortytwo the pg temp changes are being reviewed at #15291. i think they're correct per se, but could be improved a little bit. maybe they could use more reviews.

liewegas (Member) commented May 31, 2017

Good catch, i'll strip out those stray commits, thanks!


gregsfortytwo (Member) commented Jun 1, 2017

Oh, tried to build some patches on top of this and 92d49f9 is broken (it's referring to an osd_con member which I think you undid from the messenger lossy cons thing).

gregsfortytwo (Member) commented Jun 1, 2017

Turns out it wasn't too hard to get rid of the osd stats map (although I haven't run it through any testing, I'm pretty confident): https://github.com/gregsfortytwo/ceph/commits/wip-osd-stats-map

gregsfortytwo (Member) commented Jun 1, 2017

I also looked into it, and at this point the only programmatic access I'm noticing is for pool stats requests. Stale returns or going back in time there can have some nice cascade consequences on e.g. Cinder, but I don't think it requires any serious code on our end. At this point, there are only two non-constant-size members of the report: the osd_epoch vector and the health outputs.

So some remaining things I was considering:

  • make the ceph CLI/rest modules request info from the mon and the mgr, so that we don't need to route anything to the monitor (except for backwards compatibility and answering the pool stats queries)
  • (alternative: forward all requests to the manager so that it serves up the info and the monitor just provides a fallback when the mgr is dead?)
  • remove the MgrStatsMon; its only purpose is persisting useless data and providing a separate "channel" for the pg health messages
  • log a pgmapdigest to the monitor config-key store once a minute or something, so we don't have to go totally stale on manager failover but also don't need special machinery for it.
  • (other alternative: if we really prefer the existing state of affairs, we can at least only persist the pgmapdigest once a minute or something [by handling them in preprocess instead of prepare]; that way we don't need to limit the number of pg health messages we provide)

Oh, and one thing we definitely should do: reset the PGMapDigest encode/decode functions (see the sketch just below). There's no reason to start them at version 6 and have already-dead members on our first release. ;)
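A sketch of that encode/decode reset, using Ceph's versioned-encoding macros; num_pg stands in for whichever live members survive:

// Fresh struct on its first release: start at version 1, compat 1, and
// carry only live members -- no dead fields inherited from PGMap.
void PGMapDigest::encode(bufferlist& bl, uint64_t features) const
{
  ENCODE_START(1, 1, bl);
  ::encode(num_pg, bl);
  // ... remaining live members ...
  ENCODE_FINISH(bl);
}

void PGMapDigest::decode(bufferlist::iterator& p)
{
  DECODE_START(1, p);
  ::decode(num_pg, p);
  // ... remaining live members ...
  DECODE_FINISH(p);
}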

liewegas and others added some commits May 23, 2017

osd: work around bluestore fragmented buffers in get_map_bl
Signed-off-by: Sage Weil <sage@redhat.com>
mgr: drop useless __func__ prints
This is part of the default prefix.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: clean up no-beacon message
Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: use newest creation epoch for pgs that we can
If we have a huge pool it may take a while for the PGs to get out of the
queue and be created.  If we use the epoch the pool was created it may
mean a lot of old OSDMaps the OSD has to process.  If we use the current
epoch (the first epoch in which any OSD learned that this PG should
exist) we limit PastIntervals as much as possible.

It is still possible that we start trying to create a PG but the cluster
is unhealthy for a long time, resulting in a long PastIntervals that
needs to be generated by a primary OSD when it eventually comes up, so
this only partially addresses the issue.

Partially-fixes: http://tracker.ceph.com/issues/20050
Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: encode delta info in digest
It was already in PGMapDigest, but not encoded.

One field we didn't need; move that back to PGMap.

Signed-off-by: Sage Weil <sage@redhat.com>
mgr: apply PGMap incremental at same interval as reports
We were doing an incremental per osd stat report; this screws up the
delta stats updates when there are more than a handful of OSDs.  Instead,
do it with the same period as the mgr->mon reports.

Signed-off-by: Sage Weil <sage@redhat.com>
mgr/DaemonServer: log pgmap usage to cluster log
Signed-off-by: Sage Weil <sage@redhat.com>
mgr/ClusterState: make pg stat filtering less fragile
We want to drop updates for pgs for pools that don't exist.  Keep an
updated set of those pools instead of relying on the previous PGMap
having them instantiated.  (The previous map may drift due to bugs.)

Signed-off-by: Sage Weil <sage@redhat.com>
mgr/MgrStandby: reset subscriptions when we become non-active
This is a goofy workaround that we're also doing in Mgr::init().  Someday
we should come up with a more elegant solution.  In the meantime, this
works just fine!

Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: count 'unknown' pgs
Also, count "not active" (inactive) pgs instead of active so that we
list "bad" things consistently, and so that 'inactive' is a separate
bucket of pgs than the 'unknown' ones.

Signed-off-by: Sage Weil <sage@redhat.com>
mgr: mgr_tick_period = 2
5 seconds is driving me nuts.  We cap the health message size so the
digest is now small and lightweight.

Signed-off-by: Sage Weil <sage@redhat.com>
mon: print log for the creating_pgs changes
print more log messages when updating creating_pgs.

see-also: http://tracker.ceph.com/issues/20067
Signed-off-by: Kefu Chai <kchai@redhat.com>
osd/OSDMap: more efficient PGMapTemp
Use a flat_map with pointers into a buffer with the actual data.  For a
decoded mapping, we have just two allocations (one for flat_map and one
for the encoded buffer).

This can get slow if you make lots of incremental changes after the fact
since flat_map is not efficient for modifications at large sizes.  :/

Signed-off-by: Sage Weil <sage@redhat.com>
osd: do_shutdown takes precedence over fetching more maps
This is making my osd-markdown.sh test fail reliably.

Signed-off-by: Sage Weil <sage@redhat.com>
mgr: reset pending_inc after applying it
we cannot apply pending_inc twice and expect the result is the same. in
other words, pg_map.apply_incremental(pending_inc) is not an idempotent
operation.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/PGMap: strip out PGMapDigest compat cruft
This was needed for bigbang testing, but not for the final version.

Signed-off-by: Sage Weil <sage@redhat.com>
mon: handle MStatfs using PGStatService
otherwise ceph_test_rados_api_stat: LibRadosStat.ClusterStat will always
timeout once the cluster is switched to luminous

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: handle MGetPoolStats using PGStatService
otherwise ceph_test_rados_api_stat: LibRadosStat.PoolStat will always
timeout once the cluster is switched to luminous

Signed-off-by: Kefu Chai <kchai@redhat.com>
osdmonitor: use a new get_full_osd_counts to generate health without osdstat

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon: pgstat: remove get_osd_stat() from post-luminous state
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon: pgstat: remove get_num_pg_by_osd() from post-luminous state
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon: pgstat: remove unused get_pg_sum interface
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon/OSDMonitor: print pgid before looking it up in mapping
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: mgr: pgstats: do available-space calculations on mgr instead of mon
This means we don't need the osd_stat map on the monitor at all. We're
about to remove it entirely!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon: pgstat: remove unused get_osd_sum from post-luminous state
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon: mgr: remove osd_stats map from PGMapDigest
We use this information only for dumps. Stop dumping per-OSD stats as they're
not needed. In order to maintain pool "fullness" information, calculate
the OSDMap-based rule availibility ratios on the monitor and include those
values in the PGMapDigest. Also do it whenever we call dump_pool_stats_full()
on the manager.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
mon/MonmapMonitor: initialize new cluster monmap with default persistent features

This is similar to what we do for OSDMonitor::create_initial().

Avoid setting these initial features just for the mon test that verifies
persistent features get set on a full quorum.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: filter the creating_pgs added from pgmap
the creating_pgs added from pgmap might contain pgs whose containing
pools have been deleted. this is fine with the PGMonitor, as it has the
updated pg mapping which is consistent with itself, but it does not work
with OSDMonitor's creating_pgs, whose pg mapping is calculated by
itself. so we need to filter the pgmap's creating_pgs against the latest
osdmap.get_pools() when adding them to OSDMonitor's creating_pgs.

Fixes: http://tracker.ceph.com/issues/20067
Signed-off-by: Kefu Chai <kchai@redhat.com>
qa/tasks: add a blacklist for flush_pg_stats()
so we don't wait for marked out osds.

Signed-off-by: Kefu Chai <kchai@redhat.com>
selinux: clip the ceph context to ceph-mgr also
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: clean up PGMapDigest encoding
Remove compat cruft due to initial testing on bigbang.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: add a space after __func__ in log message
Signed-off-by: Kefu Chai <kchai@redhat.com>
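Of the commits above, "osd/OSDMap: more efficient PGMapTemp" packs in the most mechanism; a minimal sketch of the layout it describes, with illustrative names (the real class and member types may differ):

// Decoded pg_temp mapping: one buffer holding the encoded entries plus a
// flat_map of pointers into it, i.e. two allocations total. Incremental
// edits after decode are the slow path, since flat_map insertion is O(n)
// at large sizes.
struct PGTempMapSketch {
  bufferlist data;                                   // encoded entries
  boost::container::flat_map<pg_t, ceph_le32*> map;  // pgid -> entry in data
};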
tchaikov (Contributor) commented Jun 4, 2017

@liewegas @gregsfortytwo http://pulpito.ceph.com/kchai-2017-06-04_06:03:05-upgrade-master---basic-mira/: i just ran the master branch through our upgrade suite. it appears that wip-mgr-stats and master perform equally badly. i think this branch is ready to merge once we have liewegas#50 merged and squashed. what do you think?

tchaikov added some commits Jun 3, 2017

qa/workunits/rados/test_health_warning: misc fixes
* do not let the osd shut itself down: enlarge osd_max_markdown_count and
  shorten osd_max_markdown_period
* do not shut down all osds in the last test. if all osds are shut down
  at the same time, none of them will get the updated osdmap after noup
  is unset. we should leave at least one of them up, so the gossip
  protocol can kick in and populate the news to all osds.

Fixes: http://tracker.ceph.com/issues/20174
Signed-off-by: Kefu Chai <kchai@redhat.com>
qa/suites/upgrade/hammer-jewel-x: don't initially start mgr daemons
Signed-off-by: Kefu Chai <kchai@redhat.com>
liewegas (Member) commented Jun 4, 2017

\o/

liewegas merged commit 554cf83 into ceph:master Jun 4, 2017

2 of 3 checks passed

  • default: Build triggered. sha1 is merged.
  • Signed-off-by: all commits in this PR are signed
  • Unmodified Submodules: submodules for project are unmodified

liewegas deleted the liewegas:wip-mgr-stats branch Jun 4, 2017
