os/bluestore,mon: segregate omap keys by pool; report via df #29292

Merged
merged 24 commits into from Aug 9, 2019
Changes from all commits
Commits
24 commits
5635e88
os/bluestore: change _do_omap_clear() args
liewegas Jul 23, 2019
e4fd293
os/bluestore: add Onode::get_omap_prefix() helper
liewegas Jul 23, 2019
e20f8c0
os/bluestore: make omap key helpers Onode methods
liewegas Jul 23, 2019
072039f
os/bluestore: fix manual omap key manipulation to use Onode::get_omap…
liewegas Jul 23, 2019
22a969a
kv/KeyValueDB: take key_prefix for estimate_prefix_size()
liewegas Jul 23, 2019
91f533b
os/bluestore: add pool prefix to omap keys
liewegas Jul 23, 2019
e2a0717
os/bluestore: report omap_allocated per-pool
liewegas Jul 23, 2019
aa56c41
osd/osd_types: count per-pool omap capable OSDs
liewegas Jul 23, 2019
19f497c
os/bluestore: set per_pool_omap key on mkfs
liewegas Jul 23, 2019
d6ff61e
osd: report per-pool omap support via store_statfs_t
liewegas Jul 23, 2019
b207973
mon/PGMap: add in actual omap usage into per-pool stats
liewegas Jul 23, 2019
5a6ede0
mon/PGMap: fix stored_raw calculation
liewegas Jul 24, 2019
ab2eb6b
osd/osd_types: separate get_{user,allocated}_bytes() into data and om…
liewegas Jul 24, 2019
a076260
mon/PGMap: add data/omap breakouts for 'df detail' view
liewegas Jul 24, 2019
3cab0b3
os/bluestore: ondisk format change to 3 for per-pool omap
liewegas Jul 25, 2019
b2119ff
os/bluestore: teach fsck to tolerate per-pool omap
liewegas Jul 29, 2019
b1e44c3
os/bluestore: make fsck repair convert to per-pool omap
liewegas Jul 30, 2019
1eb10f3
os/bluestore: fsck: only generate 1 error per omap_head
liewegas Aug 5, 2019
9bbe8d0
os/bluestore: do not set both PGMETA_OMAP and PERPOOL_OMAP
liewegas Aug 5, 2019
52a2d4b
os/bluestore: behave if we *do* set PGMETA and PERPOOL flags
liewegas Aug 5, 2019
dbdd1d9
os/bluestore: default size of 1 TB for testing
liewegas Aug 6, 2019
dee8f8c
os/bluestore: fsck: int64_t for error count
liewegas Aug 8, 2019
6c690ae
os/bluestore: fsck: warning (not error) by default on no per-pool omap
liewegas Aug 8, 2019
b850116
os/bluestore: warn on no per-pool omap
liewegas Aug 8, 2019
21 changes: 21 additions & 0 deletions doc/rados/operations/health-checks.rst
Expand Up @@ -356,6 +356,27 @@ This warning can be disabled with::

ceph config set global bluestore_warn_on_legacy_statfs false

BLUESTORE_NO_PER_POOL_OMAP
__________________________

Starting with the Octopus release, BlueStore tracks omap space utilization
by pool, and one or more OSDs have volumes that were created prior to
Octopus and so lack this per-pool tracking. If not all OSDs are running
BlueStore with the new tracking enabled, the cluster will report an
approximate value for per-pool omap usage based on the most recent
deep scrub.

The old OSDs can be updated to track by pool by stopping each OSD,
running a repair operation, and then restarting it. For example, if
``osd.123`` needs to be updated::

systemctl stop ceph-osd@123
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
systemctl start ceph-osd@123

This warning can be disabled with::

ceph config set global bluestore_warn_on_no_per_pool_omap false


BLUESTORE_DISK_SIZE_MISMATCH
____________________________
Expand Down
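The per-OSD procedure documented in the hunk above can be wrapped in a loop when a whole host needs converting. This is a hypothetical sketch, not part of the PR: it assumes the default `/var/lib/ceph/osd/ceph-$ID` data-directory layout, and with `DRY_RUN=1` it only prints the commands instead of executing them, so it can be reviewed before a real run.

```shell
#!/bin/sh
# Repair each OSD on this host so it gains per-pool omap tracking.
# With DRY_RUN=1 the commands are only printed, not executed.
DRY_RUN=1
run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }

for path in /var/lib/ceph/osd/ceph-123; do   # typically: /var/lib/ceph/osd/ceph-*
  id="${path##*-}"                           # .../ceph-123 -> 123
  run systemctl stop "ceph-osd@$id"
  run ceph-bluestore-tool repair --path "$path"
  run systemctl start "ceph-osd@$id"
done
```

Converting one OSD at a time, as above, keeps the rest of the host serving I/O while each repair runs.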
Expand Up @@ -13,6 +13,7 @@ overrides:
conf:
global:
bluestore warn on legacy statfs: false
bluestore warn on no per pool omap: false
mon:
mon warn on osd down out interval zero: false

Expand Down
Expand Up @@ -13,6 +13,7 @@ overrides:
conf:
global:
bluestore warn on legacy statfs: false
bluestore warn on no per pool omap: false
mon:
mon warn on osd down out interval zero: false

Expand Down
1 change: 1 addition & 0 deletions qa/suites/upgrade/mimic-x-singleton/0-cluster/start.yaml
Expand Up @@ -13,6 +13,7 @@ overrides:
ms dump corrupt message level: 0
ms bind msgr2: false
bluestore warn on legacy statfs: false
bluestore warn on no per pool omap: false
mds:
debug ms: 1
debug mds: 20
Expand Down
Expand Up @@ -34,6 +34,7 @@ tasks:
global:
mon warn on pool no app: false
bluestore_warn_on_legacy_statfs: false
bluestore warn on no per pool omap: false
- exec:
osd.0:
- ceph osd require-osd-release mimic
Expand Down
Expand Up @@ -16,6 +16,7 @@ tasks:
conf:
global:
bluestore_warn_on_legacy_statfs: false
bluestore warn on no per pool omap: false
- exec:
osd.0:
- ceph osd require-osd-release mimic
Expand Down
Expand Up @@ -12,6 +12,7 @@ overrides:
global:
ms dump corrupt message level: 0
ms bind msgr2: false
bluestore warn on no per pool omap: false
mds:
debug ms: 1
debug mds: 20
Expand Down
Expand Up @@ -26,6 +26,7 @@ tasks:
global:
mon warn on pool no app: false
bluestore_warn_on_legacy_statfs: false
bluestore warn on no per pool omap: false
- exec:
osd.0:
- ceph osd set-require-min-compat-client nautilus
Expand Down
Expand Up @@ -8,6 +8,7 @@ tasks:
conf:
global:
bluestore_warn_on_legacy_statfs: false
bluestore warn on no per pool omap: false
- exec:
osd.0:
- ceph osd require-osd-release nautilus
Expand Down
2 changes: 2 additions & 0 deletions src/common/legacy_config_opts.h
Expand Up @@ -1071,6 +1071,8 @@ OPTION(bluestore_debug_inject_csum_err_probability, OPT_FLOAT)
OPTION(bluestore_no_per_pool_stats_tolerance, OPT_STR)
OPTION(bluestore_warn_on_bluefs_spillover, OPT_BOOL)
OPTION(bluestore_warn_on_legacy_statfs, OPT_BOOL)
OPTION(bluestore_fsck_error_on_no_per_pool_omap, OPT_BOOL)
OPTION(bluestore_warn_on_no_per_pool_omap, OPT_BOOL)
OPTION(bluestore_log_op_age, OPT_DOUBLE)
OPTION(bluestore_log_omap_iterator_age, OPT_DOUBLE)
OPTION(bluestore_log_collection_list_age, OPT_DOUBLE)
Expand Down
13 changes: 12 additions & 1 deletion src/common/options.cc
Expand Up @@ -341,6 +341,9 @@ constexpr unsigned long long operator"" _M (unsigned long long n) {
constexpr unsigned long long operator"" _G (unsigned long long n) {
return n << 30;
}
constexpr unsigned long long operator"" _T (unsigned long long n) {
return n << 40;
}

std::vector<Option> get_global_options() {
return std::vector<Option>({
Expand Down Expand Up @@ -4371,7 +4374,7 @@ std::vector<Option> get_global_options() {
.set_description("Path to block device/file"),

Option("bluestore_block_size", Option::TYPE_SIZE, Option::LEVEL_DEV)
.set_default(10_G)
.set_default(1_T)
.set_flag(Option::FLAG_CREATE)
.set_description("Size of file to create for backing bluestore"),

Expand Down Expand Up @@ -4861,6 +4864,14 @@ std::vector<Option> get_global_options() {
.set_default(true)
.set_description("Enable health indication on lack of per-pool statfs reporting from bluestore"),

Option("bluestore_fsck_error_on_no_per_pool_omap", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
.set_default(false)
.set_description("Make fsck error (instead of warn) when objects without per-pool omap are found"),

Option("bluestore_warn_on_no_per_pool_omap", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
.set_default(true)
.set_description("Enable health indication on lack of per-pool omap"),

Option("bluestore_log_op_age", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
.set_default(5)
.set_description("log operation if it's slower than this age (seconds)"),
Expand Down
3 changes: 2 additions & 1 deletion src/kv/KeyValueDB.h
Expand Up @@ -367,7 +367,8 @@ class KeyValueDB {
virtual ~KeyValueDB() {}

/// estimate space utilization for a prefix (in bytes)
virtual int64_t estimate_prefix_size(const string& prefix) {
virtual int64_t estimate_prefix_size(const string& prefix,
const string& key_prefix) {
return 0;
}

Expand Down
15 changes: 8 additions & 7 deletions src/kv/RocksDBStore.cc
Expand Up @@ -691,23 +691,24 @@ void RocksDBStore::split_stats(const std::string &s, char delim, std::vector<std
}
}

int64_t RocksDBStore::estimate_prefix_size(const string& prefix)
int64_t RocksDBStore::estimate_prefix_size(const string& prefix,
const string& key_prefix)
{
auto cf = get_cf_handle(prefix);
uint64_t size = 0;
uint8_t flags =
//rocksdb::DB::INCLUDE_MEMTABLES | // do not include memtables...
rocksdb::DB::INCLUDE_FILES;
if (cf) {
string start(1, '\x00');
string limit("\xff\xff\xff\xff");
string start = key_prefix + string(1, '\x00');
string limit = key_prefix + string("\xff\xff\xff\xff");
rocksdb::Range r(start, limit);
db->GetApproximateSizes(cf, &r, 1, &size, flags);
} else {
string limit = prefix + "\xff\xff\xff\xff";
rocksdb::Range r(prefix, limit);
db->GetApproximateSizes(default_cf,
&r, 1, &size, flags);
string start = prefix + key_prefix;
string limit = prefix + key_prefix + "\xff\xff\xff\xff";
rocksdb::Range r(start, limit);
db->GetApproximateSizes(default_cf, &r, 1, &size, flags);
}
return size;
}
Expand Down
3 changes: 2 additions & 1 deletion src/kv/RocksDBStore.h
Expand Up @@ -197,7 +197,8 @@ class RocksDBStore : public KeyValueDB {
return logger;
}

int64_t estimate_prefix_size(const string& prefix) override;
int64_t estimate_prefix_size(const string& prefix,
const string& key_prefix) override;

struct RocksWBHandler: public rocksdb::WriteBatch::Handler {
std::string seen ;
Expand Down
6 changes: 4 additions & 2 deletions src/librados/librados_c.cc
Expand Up @@ -1018,11 +1018,13 @@ extern "C" int _rados_ioctx_pool_stat(rados_ioctx_t io,
}

::pool_stat_t& r = rawresult[pool_name];
uint64_t allocated_bytes = r.get_allocated_bytes(per_pool);
uint64_t allocated_bytes = r.get_allocated_data_bytes(per_pool) +
r.get_allocated_omap_bytes(per_pool);
// FIXME: raw_used_rate is unknown hence use 1.0 here
// meaning we keep net amount aggregated over all replicas
// Not a big deal so far since this field isn't exposed
uint64_t user_bytes = r.get_user_bytes(1.0, per_pool);
uint64_t user_bytes = r.get_user_data_bytes(1.0, per_pool) +
r.get_user_omap_bytes(1.0, per_pool);

stats->num_kb = shift_round_up(allocated_bytes, 10);
stats->num_bytes = allocated_bytes;
Expand Down
6 changes: 4 additions & 2 deletions src/librados/librados_cxx.cc
Expand Up @@ -2578,11 +2578,13 @@ int librados::Rados::get_pool_stats(std::list<string>& v,
pool_stat_t& pv = result[p->first];
auto& pstat = p->second;
store_statfs_t &statfs = pstat.store_stats;
uint64_t allocated_bytes = pstat.get_allocated_bytes(per_pool);
uint64_t allocated_bytes = pstat.get_allocated_data_bytes(per_pool) +
pstat.get_allocated_omap_bytes(per_pool);
// FIXME: raw_used_rate is unknown hence use 1.0 here
// meaning we keep net amount aggregated over all replicas
// Not a big deal so far since this field isn't exposed
uint64_t user_bytes = pstat.get_user_bytes(1.0, per_pool);
uint64_t user_bytes = pstat.get_user_data_bytes(1.0, per_pool) +
pstat.get_user_omap_bytes(1.0, per_pool);

object_stat_sum_t *sum = &p->second.stats.sum;
pv.num_kb = shift_round_up(allocated_bytes, 10);
Expand Down
46 changes: 41 additions & 5 deletions src/mon/PGMap.cc
Expand Up @@ -759,8 +759,16 @@ void PGMapDigest::dump_pool_stats_full(
tbl.define_column("POOL", TextTable::LEFT, TextTable::LEFT);
tbl.define_column("ID", TextTable::LEFT, TextTable::RIGHT);
tbl.define_column("STORED", TextTable::LEFT, TextTable::RIGHT);
if (verbose) {
tbl.define_column("(DATA)", TextTable::LEFT, TextTable::RIGHT);
tbl.define_column("(OMAP)", TextTable::LEFT, TextTable::RIGHT);
}
tbl.define_column("OBJECTS", TextTable::LEFT, TextTable::RIGHT);
tbl.define_column("USED", TextTable::LEFT, TextTable::RIGHT);
if (verbose) {
tbl.define_column("(DATA)", TextTable::LEFT, TextTable::RIGHT);
tbl.define_column("(OMAP)", TextTable::LEFT, TextTable::RIGHT);
}
tbl.define_column("%USED", TextTable::LEFT, TextTable::RIGHT);
tbl.define_column("MAX AVAIL", TextTable::LEFT, TextTable::RIGHT);

Expand Down Expand Up @@ -808,8 +816,9 @@ void PGMapDigest::dump_pool_stats_full(
}
float raw_used_rate = osd_map.pool_raw_used_rate(pool_id);
bool per_pool = use_per_pool_stats();
bool per_pool_omap = use_per_pool_omap_stats();
dump_object_stat_sum(tbl, f, stat, avail, raw_used_rate, verbose, per_pool,
pool);
per_pool_omap, pool);
if (f) {
f->close_section(); // stats
f->close_section(); // pool
Expand Down Expand Up @@ -840,6 +849,7 @@ void PGMapDigest::dump_cluster_stats(stringstream *ss,
f->dump_float("total_used_raw_ratio", osd_sum.statfs.get_used_raw_ratio());
f->dump_unsigned("num_osds", osd_sum.num_osds);
f->dump_unsigned("num_per_pool_osds", osd_sum.num_per_pool_osds);
f->dump_unsigned("num_per_pool_omap_osds", osd_sum.num_per_pool_omap_osds);
f->close_section();
f->open_object_section("stats_by_class");
for (auto& i : osd_sum_by_class) {
Expand Down Expand Up @@ -890,7 +900,7 @@ void PGMapDigest::dump_cluster_stats(stringstream *ss,
void PGMapDigest::dump_object_stat_sum(
TextTable &tbl, ceph::Formatter *f,
const pool_stat_t &pool_stat, uint64_t avail,
float raw_used_rate, bool verbose, bool per_pool,
float raw_used_rate, bool verbose, bool per_pool, bool per_pool_omap,
const pg_pool_t *pool)
{
const object_stat_sum_t &sum = pool_stat.stats.sum;
Expand All @@ -900,7 +910,9 @@ void PGMapDigest::dump_object_stat_sum(
raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;
}

uint64_t used_bytes = pool_stat.get_allocated_bytes(per_pool);
uint64_t used_data_bytes = pool_stat.get_allocated_data_bytes(per_pool);
uint64_t used_omap_bytes = pool_stat.get_allocated_omap_bytes(per_pool_omap);
uint64_t used_bytes = used_data_bytes + used_omap_bytes;

float used = 0.0;
// note avail passed in is raw_avail, calc raw_used here.
Expand All @@ -912,12 +924,26 @@ void PGMapDigest::dump_object_stat_sum(
}
auto avail_res = raw_used_rate ? avail / raw_used_rate : 0;
// an approximation for actually stored user data
auto stored_normalized = pool_stat.get_user_bytes(raw_used_rate, per_pool);
auto stored_data_normalized = pool_stat.get_user_data_bytes(
raw_used_rate, per_pool);
auto stored_omap_normalized = pool_stat.get_user_omap_bytes(
raw_used_rate, per_pool_omap);
auto stored_normalized = stored_data_normalized + stored_omap_normalized;
// same, amplified by replication or EC
auto stored_raw = stored_normalized * raw_used_rate;
if (f) {
f->dump_int("stored", stored_normalized);
if (verbose) {
f->dump_int("stored_data", stored_data_normalized);
f->dump_int("stored_omap", stored_omap_normalized);
}
f->dump_int("objects", sum.num_objects);
f->dump_int("kb_used", shift_round_up(used_bytes, 10));
f->dump_int("bytes_used", used_bytes);
if (verbose) {
f->dump_int("data_bytes_used", used_data_bytes);
f->dump_int("omap_bytes_used", used_omap_bytes);
}
f->dump_float("percent_used", used);
f->dump_unsigned("max_avail", avail_res);
if (verbose) {
Expand All @@ -931,12 +957,20 @@ void PGMapDigest::dump_object_stat_sum(
f->dump_int("compress_bytes_used", statfs.data_compressed_allocated);
f->dump_int("compress_under_bytes", statfs.data_compressed_original);
// Stored by user amplified by replication
f->dump_int("stored_raw", pool_stat.get_user_bytes(1.0, per_pool));
f->dump_int("stored_raw", stored_raw);
}
} else {
tbl << stringify(byte_u_t(stored_normalized));
if (verbose) {
tbl << stringify(byte_u_t(stored_data_normalized));
tbl << stringify(byte_u_t(stored_omap_normalized));
}
tbl << stringify(si_u_t(sum.num_objects));
tbl << stringify(byte_u_t(used_bytes));
if (verbose) {
tbl << stringify(byte_u_t(used_data_bytes));
tbl << stringify(byte_u_t(used_omap_bytes));
}
tbl << percentify(used*100);
tbl << stringify(byte_u_t(avail_res));
if (verbose) {
Expand Down Expand Up @@ -3034,6 +3068,8 @@ void PGMap::get_health_checks(
summary = "Legacy BlueStore stats reporting detected";
} else if (asum.first == "BLUESTORE_DISK_SIZE_MISMATCH") {
summary = "BlueStore has dangerous mismatch between block device and free list sizes";
} else if (asum.first == "BLUESTORE_NO_PER_POOL_OMAP") {
summary = "Legacy BlueStore does not track omap usage by pool";
}
summary += " on ";
summary += stringify(asum.second.first);
Expand Down
4 changes: 4 additions & 0 deletions src/mon/PGMap.h
Expand Up @@ -74,6 +74,9 @@ class PGMapDigest {
bool use_per_pool_stats() const {
return osd_sum.num_osds == osd_sum.num_per_pool_osds;
}
bool use_per_pool_omap_stats() const {
return osd_sum.num_osds == osd_sum.num_per_pool_omap_osds;
}

// recent deltas, and summation
/**
Expand Down Expand Up @@ -175,6 +178,7 @@ class PGMapDigest {
float raw_used_rate,
bool verbose,
bool per_pool,
bool per_pool_omap,
const pg_pool_t *pool);

size_t get_num_pg_by_osd(int osd) const {
Expand Down
3 changes: 2 additions & 1 deletion src/os/ObjectStore.h
Expand Up @@ -337,7 +337,8 @@ class ObjectStore {

virtual int statfs(struct store_statfs_t *buf,
osd_alert_list_t* alerts = nullptr) = 0;
virtual int pool_statfs(uint64_t pool_id, struct store_statfs_t *buf) = 0;
virtual int pool_statfs(uint64_t pool_id, struct store_statfs_t *buf,
bool *per_pool_omap) = 0;

virtual void collect_metadata(std::map<std::string,string> *pm) { }

Expand Down