New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon,osd: decouple creating pgs from pgmap #13999

Merged
merged 20 commits into from Mar 30, 2017

Conversation

Projects
None yet
4 participants
@tchaikov
Contributor

tchaikov commented Mar 16, 2017

No description provided.

@tchaikov tchaikov added core mon labels Mar 16, 2017

@tchaikov tchaikov changed the title from mon,osd: decouple creating pgs from pgmap to [DNM] mon,osd: decouple creating pgs from pgmap Mar 16, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Mar 16, 2017

some tests time out. looking at them.

void OSD::send_beacon()
{
dout(20) << __func__ << dendl;
monc->send_mon_message(new MOSDBeacon());

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

I think we'll want to throttle this a bit; the failure and full update reports go out pretty quickly, but these should only be every several minutes to minimize mon load.

struct creating_pgs_t {
std::map<pg_t, std::pair<epoch_t, utime_t> > pgs;
std::unordered_set<int64_t> created_pools;

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

std::set probably?

This comment has been minimized.

@tchaikov

tchaikov Mar 17, 2017

Contributor

@liewegas unordered_set is more spatial efficient in this case, as pool id are sequential numbers which are a perfect fit for the hash.

#include <unordered_set>
#include "include/encoding.h"
struct creating_pgs_t {

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

add this to test/encoding/types.h

@@ -38,6 +38,7 @@ class Monitor;
class PGMap;
class MonSession;
class MOSDMap;
class MPGStats;

This comment has been minimized.

@tchaikov

tchaikov Mar 16, 2017

Contributor

remove this.

{
bufferlist bl;
mon->store->get(OSD_PG_CREATINGS_PREFIX, "creatings", bl);

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

s/pg_creatings/pg_creating/

int acting_primary = -1;
auto pgid = pg.first;
auto created = pg.second.first;
mapping.get(pgid, nullptr, nullptr, nullptr, &acting_primary);

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

I don't think we can assume the mapping is usuable and calculated yet; need to check mapping_job first for that.

instead, maybe a separate function that we run when the mapping calculation is done to update creating_pgs_by_osd_epoch and to send out the subs? this will somewhat disconnect creating_pgs notifications from the commit cycle.. i think that should be ok tho

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

key is probably just to make sure the MOSDPGCreate message is tagged with the epoch the mapping was calcualted and not the current paxos epoch (since they might vary?)

This comment has been minimized.

@tchaikov

tchaikov Mar 17, 2017

Contributor

done.

}
scan_for_creating_pgs(osdmap.get_pools(), &pending_creatings);
// do not scan the pending_inc.new_pools now, as the mapping_job will not
// update the mapping with the new osdmap until on_active() after refreshing

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

it seems like we could keep the creating_pgs list strictly in sync with the osdmap, since it doesn't include any mappings. it's only the notifications that need crush calculations and may need a delay?

This comment has been minimized.

@tchaikov

tchaikov Mar 17, 2017

Contributor

yeah, we can.

{}
const char *get_type_name() const override { return "pg_created"; }
void print(ostream& out) const override {
out << "osd_pg_created()";

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

put the list in there? should be a short list per osd.

@@ -2705,6 +2705,9 @@ void PG::publish_stats_to_osd()
if ((state & (PG_STATE_ACTIVE|PG_STATE_PEERED)) &&
!(info.stats.state & (PG_STATE_ACTIVE|PG_STATE_PEERED)))
info.stats.last_became_peered = now;
if (!(state & PG_STATE_CREATING) &&
(info.stats.state & PG_STATE_CREATING))
just_created = true;

This comment has been minimized.

@liewegas

liewegas Mar 16, 2017

Member

what if we send the message from right here? and then also send one if we receive an MOSDPGCreate message on PGs that are already active? then we don't have a time/retry at all.

@tchaikov tchaikov changed the title from [DNM] mon,osd: decouple creating pgs from pgmap to mon,osd: decouple creating pgs from pgmap Mar 19, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Mar 19, 2017

@liewegas thanks for your comments! it's ready for review again. i will run the jewel-x upgrade suites with this PR.

return false;
}
for (auto pg_epoch : m->get_pgs()) {
pending_created_pgs[from][pg_epoch.second].insert(pg_epoch.first);

This comment has been minimized.

@liewegas

liewegas Mar 20, 2017

Member

It looks like we only need the pg_t here.. not the epoch?

This comment has been minimized.

@tchaikov

tchaikov Mar 20, 2017

Contributor

you are looking at the updated code.

This comment has been minimized.

@tchaikov

tchaikov Mar 20, 2017

Contributor

right. we don't use the epoch afterwards.

if (--size)
out << ",";
}
out << ")";

This comment has been minimized.

@liewegas

liewegas Mar 20, 2017

Member

i think we only need vector<pg_t> actually?

This comment has been minimized.

@tchaikov

tchaikov Mar 20, 2017

Contributor

you are looking at the outdated code.

This comment has been minimized.

@tchaikov

tchaikov Mar 20, 2017

Contributor

as we only send a single pg, so, a pg_t would do, i think.

creating_pgs_t creating_pgs;
Spinlock creating_pgs_lock;

This comment has been minimized.

@liewegas

liewegas Mar 20, 2017

Member

this should be a mutex, i think!

This comment has been minimized.

@tchaikov

tchaikov Mar 20, 2017

Contributor

all code block guarded by this lock is non-blocking code, so i think spin lock would suffice.

@@ -3712,14 +3717,14 @@ void OSD::handle_pg_peering_evt(
{
if (service.splitting(pgid)) {
peering_wait_for_split[pgid].push_back(evt);
return;
return -EEXIST;

This comment has been minimized.

@liewegas

liewegas Mar 20, 2017

Member

speaking of which, we should probably stop sending create messages for split pgs entirely.. we already ignore them!

This comment has been minimized.

@tchaikov

tchaikov Mar 20, 2017

Contributor

right, we don't send split pg now. are you suggesting that we should print some log message in this case? as in luminous, it's an unreachable branch.

@liewegas

This comment has been minimized.

Member

liewegas commented Mar 20, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Mar 20, 2017

okay, will do. yeah, i removed the code sending the pg-create for split pgs in this PR.

@liewegas

looks good!

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Mar 24, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Mar 24, 2017

changelog

  • rebase against master
  • set the pg.created to pool.get_last_change() instead of osdmap's epoch.

@tchaikov tchaikov changed the title from mon,osd: decouple creating pgs from pgmap to [DNM] mon,osd: decouple creating pgs from pgmap Mar 24, 2017

@tchaikov tchaikov changed the title from [DNM] mon,osd: decouple creating pgs from pgmap to mon,osd: decouple creating pgs from pgmap Mar 25, 2017

mon->update_creating_pgs();
mon->check_pg_creates_subs();
osdmon->update_creating_pgs();
osdmon->check_pg_creates_subs();

This comment has been minimized.

@tchaikov

tchaikov Mar 27, 2017

Contributor

i found a racing in my rados qa run:

  • an OSD disconnects from monitor, and its subscriptions are removed from the connected monitor, also the subscription data structures were destroyed.
  • at the meanwhile, OSDMonitor just completed calculating the osd-mapping, and was checking the osd-pg-creates subscriptions. in other words, check_pg_creates_subs() was dereferencing a dangling pointer.

hence this change.

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

It looks to me like all of these locations already hold Monitor::lock... I'm not sure this is the problem?

This comment has been minimized.

@tchaikov

tchaikov Mar 28, 2017

Contributor

not all of them, the osdmapping thread does not hold Monitor::lock. and what complicates this problem is that this completion callback could be called right in OSDMonitor::start_mapping() where the Monitor::lock is already acquired if the mapping is completed at that moment. so we have two choices:

  • change the Monitor::lock to a recursive lock
  • add another lock

i think the latter is simpler, and it also avoids the potential problems could be introduced by a single grand lock.

tchaikov referenced this pull request in ceph/ceph-build Mar 27, 2017

ceph-pr-submodules: fix submodule check.
Prior code never generated any output at all.

Fixes: http://tracker.ceph.com/issues/19244
Signed-off-by: Dan Mick <dan.mick@redhat.com>
@@ -1406,8 +1405,13 @@ void OSDMonitor::share_map_with_random_osd()
version_t OSDMonitor::get_trim_to()
{
if (mon->pgmon()->is_readable() &&
mon->pgmon()->pg_map.creating_pgs.empty()) {
{

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

here too

@liewegas

This comment has been minimized.

Member

liewegas commented Mar 28, 2017

@@ -4510,6 +4519,15 @@ void OSD::tick()
do_waiters();
tick_timer.add_event_after(OSD_TICK_INTERVAL, new C_Tick(this));
if (is_active()) {
utime_t now = ceph_clock_now();

This comment has been minimized.

@yuyuyu101

yuyuyu101 Mar 28, 2017

Member

it's better to use monoclock here, we met lots of clock backoff causing weird behavior

This comment has been minimized.

@tchaikov

tchaikov Mar 28, 2017

Contributor

will switch to mono coarse clock.

@@ -788,6 +788,7 @@ OPTION(osd_mon_heartbeat_interval, OPT_INT, 30) // (seconds) how often to ping
OPTION(osd_mon_report_interval_max, OPT_INT, 600)
OPTION(osd_mon_report_interval_min, OPT_INT, 5) // pg stats, failures, up_thru, boot.
OPTION(osd_mon_report_max_in_flight, OPT_INT, 2) // max updates in flight
OPTION(osd_beacon_report_interval, OPT_INT, 500) // (second) how often to send beacon message to monitor

This comment has been minimized.

@yuyuyu101

yuyuyu101 Mar 28, 2017

Member

is it too large? mon_osd_report_timeout is 900, I think we could set to mon_osd_report_timeout/3 to allow at least 3 messages missing?

@liewegas

lgtm! squash away!

tchaikov added some commits Feb 24, 2017

mon: pass const variables by const ref not pointer
* PGMapUpdater::check_down_pgs(): pass a const reference to pgmap
  instead of a pointer
* PGMapUpdater::register_new_pgs(): pass a const reference to pgmap
  instead of a pointer

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/PGmonitor: remove unused last_sent_pg_create
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/PGMonitor: remove unused variable
last_pg_scan is not used in PGMonitor::update_from_paxos(), so remove
it.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: s/check_sub/check_osdmap_sub/
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: add send_pg_create() to OSDMonitor
OSDMonitor will handle the pg-create subscriptions after luminous.
1. scan new pools to get the pgs to be created
2. send pg creates using the collected pgs
3. trim the creating_pgs using the "created!" messages from OSD.

please note that we need to wait for the OSDMonitor::mapping to be fully
populated, so we cannot scan the incrementa map for creating pgs until
it is applied and accepted by other monitors.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: dispatch osd_beacon message to OSDMonitor
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: update comment in update_from_paxos()
this change updates the comment for 7fb3804fb, 97462a3 and e807770,
to reflect the reason why we need to fix latest_full in current code.
as the fix is not a workaround for cuttlefish anymore, it resolves the
issue where

0. mon.c has a latest_full of 5
1. mon.c is shutdown and out of sync with the quorum
2. mon.c starts sync
3. mon.c now has osdmap[31..50], and the latest_full is still 5.

Signed-off-by: Kefu Chai <kchai@redhat.com>
messages/MOSDBeacon: add beacon msg
osd will send beacon message to monitor periodically to inform it that
"i am still alive!", previously, monitor use the pg-stats to check the
status of OSD, but since osd will only send pg stat to mgr after
luminous, we use a dedicated msg for this purpose.

Signed-off-by: Kefu Chai <kchai@redhat.com>
osd: send osd-beacon to mon periodically
add an option named "osd_beacon_report_interval" to specify the interval
to send osd-beacon.

Signed-off-by: Kefu Chai <kchai@redhat.com>
messages/MPGCreated: add MPGCreated
add a new message type MPGCreated. osd is supposed to send this message
to monitor to inform that any pg(s) is created and activated.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: mapping is not optional anymore
as pg_creatings needs mapping to get the acting_primary osd of the
creating pg, so we can send the pg-create message to it if it subscribes
to this information, mapping should always be available now.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: run mapping on peons also
otherwise subcriptions on peons won't get the creating_pgs notification
mapping updated. we want to send the notification from peons also. and
the notifications should be updated with the updated pg mapping.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: acquire lock when accessing mon->session_map
we will access the mon->session_map for sending the osd-pg-creates
messages when finishing osdmapping in OSDMonitor, this could happen in
another thread without the protection of Monitor::lock, or in the same
thread already guarded by Monitor::lock. so instead of changing
Monitor::lock to a recursive lock, a new lock is introduced to protect
session_map.

Signed-off-by: Kefu Chai <kchai@redhat.com>
osd: send pg-created message if any pg is newly created
add an option named "osd_created_report_interval" to specify the
interval to check and send the "pg_created" mesages to mon

because pg could update its state when it is still in the pg_stat_queue,
for example, to change its state to PG_STATE_CLEAN, we cannot tell if we
have sent a "pg-created" message for it or not without introducing a new
member variable in PG.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: handle MOSDPGCreated messages
MOSDPGCreated messages are used to prune the creating_pgs_by_osd_epoch
and creating_pgs, by updating created_pools. as once a pool is created
we will not send MOSDPGCreate to its acting_primary OSD anymore.

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: switch to the new creating_pgs
* prime_pg_temp(): switch to the new creating_pgs
* get_trim_to(): switch to the new creating_pgs

Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: pass by reference not pointer of const param
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: avoid search and lookup anti-pattern
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon: switch to osdmon when handling osd_pg_creates subs
Signed-off-by: Kefu Chai <kchai@redhat.com>
tools/ceph-objectstore-tool: always set first_committed
otherwise the workaround of 7fb3804fb is triggered. we need to remove
that workaround later on.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Mar 30, 2017

@tchaikov tchaikov merged commit ea093ba into ceph:master Mar 30, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details

@tchaikov tchaikov deleted the tchaikov:wip-extract-creating-pgs branch Mar 30, 2017

if (!mapping_job->is_done()) {
dout(1) << __func__ << " skipping prime_pg_temp; mapping job "
<< mapping_job.get() << " did not complete, "
<< mapping_job->shards << " left" << dendl;
mapping_job->abort();
} else if (mapping->get_epoch() == osdmap.get_epoch()) {
} else if (mapping.get_epoch() == osdmap.get_epoch()) {

This comment has been minimized.

@runsisi

runsisi Apr 27, 2017

Contributor

hi @tchaikov @liewegas , the condition here should be mapping.get_epoch() < osdmap.get_epoch() ?
i think we can only prime pg temp when the mapping is up to date, am i wrong ?

This comment has been minimized.

@tchaikov

tchaikov Apr 27, 2017

Contributor

@runsisi nice catch! i posted #14826 to address your comment, could you take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment