New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: restructure prime_pg_temp around a full pg mapping calculated on multiple CPUs #13207

Merged
merged 17 commits into from Feb 18, 2017

Conversation

Projects
None yet
2 participants
@liewegas
Member

liewegas commented Jan 31, 2017

This is removing prime_pg_temps dependency on pg_map, with teh small exception of
checking for creating_pgs (which is being removed in a separate PR).

It also moves processing for calculating PG mappings and checking for needed pg_temp
values to a parallel work queue that runs on multiple cores.

Both work efforts (pg mapping calcs and prime_pg_temp) are time boxed with a graceful fallback.

@liewegas liewegas requested review from tchaikov and jecluis Jan 31, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 3, 2017

i am still reading this.

@liewegas

This comment has been minimized.

Member

liewegas commented Feb 3, 2017

retest this please

@tchaikov

will continue the review tmr.

@@ -459,6 +459,7 @@ set(libcommon_files
msg/msg_types.cc
common/hobject.cc
osd/OSDMap.cc
osd/OSDMapMapping.cc

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

might want to squash this commit into the previous one.

void _dump();
public:
void get(pg_t pgid,

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

could pass by the reference.

}
}
void OSDMapMapping::_dump()

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

could override the operator<<() in case we want to debug it using the dout facility.

for (auto osd : osds) {
const vector<pg_t>& pgs = mapping.get_osd_acting_pgs(osd);
dout(20) << __func__ << " osd." << osd << " " << pgs << dendl;
for (auto pgid : pgs) {

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

const auto& as pg_t is not a primitive type.

ceph::unordered_map<pg_t, pg_stat_t>::iterator pp)
void OSDMonitor::prime_pg_temp(
OSDMap& next,
pg_t pgid)

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

could pass by const pg_t&.

<< g_conf->mon_osd_prime_pg_temp_max_time
<< " seconds, stopping"
<< dendl;
abort = true;

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

could just return here.

<< g_conf->mon_osd_prime_pg_temp_max_time
<< " seconds, stopping"
<< dendl;
abort = true;

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

could just return here.

}
}
return num;
pending_inc.new_pg_temp[pgid] = cur_acting;

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

better off using

pending_inc.new_pg_temp[pgid] = std::move(cur_acting);

so less cpu cycles spent on copy ctor.

<< dendl;
pending_inc.new_pg_temp[pgid] = cur_acting;
pending_inc.new_pg_temp[pgid] = acting;

This comment has been minimized.

@tchaikov

tchaikov Feb 14, 2017

Contributor

might want to take this chance to use std::move(acting)

@liewegas

This comment has been minimized.

Member

liewegas commented Feb 14, 2017

simplified the bool abort thing and squashed the first 2 commits.

leaving pg_t args as pass-by-value since it's only a 16 byte struct?

@liewegas

This comment has been minimized.

Member

liewegas commented Feb 15, 2017

retest this please

void ParallelPGMapper::Job::finish_one()
{
Context *fin = 0;

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

nit, use nullptr for more consistency.

protected:
CephContext *cct;
struct item {

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

s/item/Item/

break;
}
}
ParallelPGMapper::Job *job = new PrimeTempJob(next, this);

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

so we can allocate job on stack?

break;
utime_t stop = ceph_clock_now();
stop += g_conf->mon_osd_prime_pg_temp_max_time;
int chunk = 1000;

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

could mark chunk a const or constexpr.

@@ -111,10 +112,14 @@ struct failure_info_t {
class OSDMonitor : public PaxosService {
CephContext *cct;
ParallelPGMapper mapper; ///< for background pg work
unique_ptr<OSDMapMapping> mapping; ///< pg <-> osd mappings

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

@liewegas i don't understand why we want to use a pointer of OSDMapMapping instead of a plain OSDMapMapping? the same applies to mapping_job.

This comment has been minimized.

@liewegas

liewegas Feb 16, 2017

Member

because they are optional, and because the classes weren't necessarily written with re-use in mind.

}
if (check_all) {
for (auto p : pg_map->pg_stat) {

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

might want to use const auto& p, as pair<pg_t,pg_stat_t> is a relatively large struct.

@@ -67,8 +67,8 @@ void PGMonitor::on_restart()
void PGMonitor::on_active()
{
if (mon->is_leader()) {
check_all_pgs = true;

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

yeah, need_check_down_pgs was misleading before.

class CephContext;
class CrushWrapper;
class OSDMap;

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

this is not necessary any more, as OSDMapMapping is declared in its own header file.

pools.emplace(p.first, PoolMapping(p.second.get_size(),
p.second.get_pg_num()));
}
while (q != pools.end()) {

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

nit, just put pools.erase(q , pools.end());

osdmap.pg_to_up_acting_osds(
pg_t(ps, pool),
&up, &up_primary, &acting, &acting_primary);
i->second.set(ps, up, up_primary, acting, acting_primary);

This comment has been minimized.

@tchaikov

tchaikov Feb 16, 2017

Contributor

could use std::move(up) and std::move(acting) for better performance.

@liewegas

This comment has been minimized.

Member

liewegas commented Feb 16, 2017

updated; will squash once reviewed!

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 16, 2017

lgtm.

liewegas added some commits Jan 18, 2017

osd/OSDMap: add OSDMapMapping class
This is a precalculated table of all pg mappings.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: maintain full mapping, synchronously
This is temporarily; we'll put this in an async, parallelized thread
shortly.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: prime_pg_temp based on OSDMapMapping
Simplify the code and remove most of the dependency on PGMap.  We still
need PGMap for the creating_pgs to ignore; that can be dropped later.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMapMapping: add ParallelMapper
Calculate a mapping in parallel over a workqueue + threadpool.

Signed-off-by: Sage Weil <sage@redhat.com>
mon: add a threadpool (cpu_tp)
Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: calculate new mapping in parallel
Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: simplify prime_pg_temp slightly
Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: run mapping job in background; skip prime if it didn'…
…t finish

Start the background mapping job when we go active.  If it has finished
by the time we need to prime, use it.  If not, skip prime_pg_temp.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMapMapping: disable up_rmap
The reverse mapping from up osds to pgs is not used; disable it for now.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: avoid priming same pgs more than once
Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMapMapper: count total pgs
Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: prime all pgs if we estimate we're approaching that a…
…nyway

If we have enough noteworthy OSDs it is likely we will touch all or most
PGs anyway, at which point it is faster and simpler to check all pgs.

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas added some commits Jan 31, 2017

osd/OSDMapMapping: restructure mapper as generic PG work running
Make Job a parent class that can be specialized for any per-PG batch work.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: use parallel work queue for prime_pg_temp (all)
Do the prime_pg_temp work in parallel using the mapper.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMapMapping: put mappings in a mempool
Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMonitor: move check_down_pgs into PGMapUpdater; use latest map
Two big changes here:

1) Move the updating into PGMapUpdater along with the other related
checks.  Cleaner!

2) We use the current OSDMap to mark things stale.  Before, we had
a weird two stage process: first we would apply the OSDMap to the
PGMap, and then the *next* time the OSDMap updated we would look
for stale PGs.  This was silly.  Instead, we mark the PGs stale
when we process the OSDMap that made them stale.

We have a be slightly careful because the PGMap stats primaries
might be a bit old and, for example, might have a smaller max_osd
than what we have now.

This change is motivated right now because the new prime_pg_temp
code is smarter about a failed OSD.  While previously failing
an OSD would install a pg_temp mapping to the old and new OSDs,
even though the old OSD is now down, the new code does not.  And
previously failing both OSDs for a PG would always result in a
subsequent OSDMap update removing those bad pg_temp values, giving
check_down_pgs() a change to mark the PGs stale, while the new code
has not subsequent OSDMap update, which would leave the PG unstale,
causing the dump_stuck.py test to fail.

Signed-off-by: Sage Weil <sage@redhat.com>
mgr: mark stale PGs
Signed-off-by: Sage Weil <sage@redhat.com>
@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 18, 2017

retest this please.

@liewegas liewegas merged commit c1c1f2e into ceph:master Feb 18, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details

@liewegas liewegas deleted the liewegas:wip-mapping branch Feb 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment