os/bluestore,mon: segregate omap keys by pool; report via df #29292
Conversation
liewegas commented Jul 24, 2019 (edited)
- prefix omap keys with the pool, so we can estimate per-pool usage
- report this via 'ceph df detail'
- fsck
- make repair do a conversion and then set per_pool_omap=1
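To make the first bullet concrete, here is a minimal sketch of pool-prefixed omap keys. This is illustrative only, not BlueStore's actual encoding (the real code lives in the Onode omap key helpers): the idea is that a fixed-width, big-endian pool id in front of every omap key makes all keys for one pool contiguous in RocksDB, so a per-pool range scan can estimate usage.

```cpp
#include <cstdint>
#include <string>

// Encode a 64-bit pool id big-endian so that lexicographic key order
// matches numeric pool order (assumes non-negative ids for simplicity).
std::string encode_pool_prefix(uint64_t pool) {
  std::string out;
  for (int shift = 56; shift >= 0; shift -= 8)
    out.push_back(static_cast<char>((pool >> shift) & 0xff));
  return out;
}

// Build the on-disk key: 8-byte pool prefix followed by the user key.
std::string make_omap_key(uint64_t pool, const std::string& user_key) {
  return encode_pool_prefix(pool) + user_key;
}

// Recover the user-visible key by stripping the fixed-width prefix.
std::string decode_omap_key(const std::string& key) {
  return key.substr(8);
}
```

Because the prefix is fixed-width and big-endian, all keys for pool N sort strictly before all keys for pool N+1, which is what lets 'ceph df detail' attribute omap usage per pool.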
Nice! I assume this doesn't count space-amp within rocksdb?
In the df output, maybe replace STORED (DATA) (OMAP) with something like STORED= (data omap), to make the relationship between these columns more evident.
@@ -1080,6 +1080,8 @@ class BlueStore : public ObjectStore,
    if (--nref == 0)
      delete this;
  }

  const string& get_omap_prefix();
nit: mark this one and the ones below const?
  void get_omap_key(const string& key, string *out);
  void rewrite_omap_key(const string& old, string *out);
  void get_omap_tail(string *out);
  void decode_omap_key(const string& key, string *user_key);
nit: mark with static?
In general, looks good.
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This makes it easier for them to depend on Onode (Onode::onode) state. Signed-off-by: Sage Weil <sage@redhat.com>
Looks good overall; one repair fix.
@@ -7164,6 +7164,14 @@ int BlueStore::_fsck(bool deep, bool repair)
    errors += r;
  }

  if (!per_pool_omap) {
Shouldn't we be more tolerant of this error and have a configurable warning/error level here, similar to what we do for per-pool stats?
Ah, I forgot about
const auto& no_pps_mode = cct->_conf->bluestore_no_per_pool_stats_tolerance;
bool need_per_pool_stats = no_pps_mode == "until_fsck" ||
(no_pps_mode == "until_repair" && repair);
bool enforce_no_per_pool_stats = no_pps_mode == "enforce";
I wonder if we could do a slightly simpler model, where we have a warnings count (in addition to error), and the repair would similarly have a warnings_repaired count to match up. The until_fsck/until_repair model is still confusing to me.
The current bluestore_no_per_pool_stats_tolerance model allows three modes: disable new stats (enforce), error if not used (until_fsck), and upgrade when possible (until_repair). That's quite useful/flexible IMO; perhaps the names aren't good, though...
The warning-only mode you're suggesting lacks that flexibility, but it's certainly also doable.
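The simpler model Sage suggests above could look something like this sketch (the struct and function names are illustrative, not BlueStore's actual fsck code): warnings are tallied separately from errors, and repair keeps a matching warnings_repaired count so the two can be reconciled afterwards.

```cpp
#include <cstdint>

// Hypothetical counters mirroring the "warnings count + warnings_repaired"
// idea: a tolerated defect bumps warnings, and when repair runs it bumps
// warnings_repaired so the totals can be matched up at the end of fsck.
struct FsckCounters {
  int64_t errors = 0;
  int64_t repaired = 0;
  int64_t warnings = 0;
  int64_t warnings_repaired = 0;
};

// Record the "no per-pool omap" condition: a warning by default rather
// than a hard error; repair converts it and records that it did so.
void note_no_per_pool_omap(FsckCounters& c, bool repair) {
  ++c.warnings;
  if (repair)
    ++c.warnings_repaired;
}

// At the end of fsck --repair, every counted defect should have a
// matching repair entry.
bool counts_match(const FsckCounters& c) {
  return c.errors == c.repaired && c.warnings == c.warnings_repaired;
}
```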
src/os/bluestore/BlueStore.cc
  (!o->onode.is_perpool_omap() && !o->onode.is_pgmeta_omap())) {
    derr << "fsck error: " << oid << " has omap that is not per-pool or pgmeta"
         << dendl;
    ++errors;
This will result in a mismatch between the error count and the repaired count, if any, because the repairer isn't involved in the repair below.
This already happens in _prepare_ondisk_format_super().
…_key() Make this code more general and robust. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Set per-onode flag to indicate whether the object has per-pool keys or not. This will allow us to incrementally transition objects later. Put the new keys under a different prefix. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This is a minimal change: we aren't separately reporting data vs omap usage (like we do in 'osd df' output for individual osds). Signed-off-by: Sage Weil <sage@redhat.com>
The get_user_bytes() helper is a bit weird because it uses the raw_used_rate (replication/EC factor) so that it can work *backwards* from raw usage to normalized user usage. However, the legacy case that works from PG stats does not use this factor... and the stored_raw value (in the JSON output only) was incorrectly passing in a factor of 1.0, which meant that for legacy mode it was a bogus value. Fix by calculating stored_raw as stored_normalized * raw_used_rate. Signed-off-by: Sage Weil <sage@redhat.com>
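The stored_raw fix in the commit message above boils down to a single multiplication: the normalized (user-visible) stored bytes must be scaled by raw_used_rate, not by 1.0. A worked sketch (the function name is illustrative):

```cpp
#include <cstdint>

// stored_raw is the raw on-disk figure: the normalized user bytes
// multiplied by the replication/EC factor (raw_used_rate). The legacy
// bug was passing a factor of 1.0 here, so stored_raw equaled the
// normalized value instead of the raw one.
uint64_t stored_raw(uint64_t stored_normalized, double raw_used_rate) {
  return static_cast<uint64_t>(stored_normalized * raw_used_rate);
}
```

For example, 100 GiB stored in a 3x-replicated pool should report roughly 300 GiB raw, and in a 4+2 EC pool (rate 1.5) roughly 150 GiB raw.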
…ap variants Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Move to ondisk format v3. This means that per-pool omap keys may exist, but does not imply that *all* objects use the new form until the per_pool_omap=1 super key is also set. Signed-off-by: Sage Weil <sage@redhat.com>
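The two-level gating described in that commit message can be sketched as follows (field and function names are illustrative, not BlueStore's actual super-key handling): ondisk format >= 3 means per-pool omap keys *may* exist, while only the per_pool_omap super key guarantees that *all* objects have been converted.

```cpp
#include <cstdint>

// Hypothetical store metadata mirroring the two signals described above.
struct StoreMeta {
  uint32_t ondisk_format = 3;  // v3 permits per-pool omap keys
  bool per_pool_omap = false;  // set only once every object is converted
};

// Format v3 alone just means the new key form is allowed to appear.
bool may_have_per_pool_keys(const StoreMeta& m) {
  return m.ondisk_format >= 3;
}

// Full conversion is only asserted by the per_pool_omap super key.
bool all_objects_converted(const StoreMeta& m) {
  return may_have_per_pool_keys(m) && m.per_pool_omap;
}
```

This separation is what lets fsck/repair convert objects incrementally and only flip the super key at the end.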
Signed-off-by: Sage Weil <sage@redhat.com>
- error if objects not per-pool
- error if per_pool_omap not set
- convert as we go, and set the flag at the end
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
pgmeta trumps perpool. Signed-off-by: Sage Weil <sage@redhat.com>
Less ENOSPC from vstart. Signed-off-by: Sage Weil <sage@redhat.com>
Just to be on the safe side. Signed-off-by: Sage Weil <sage@redhat.com>
Unconditionally error if the global per_pool_omap was set, though! Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
* refs/pull/29292/head:
  os/bluestore: warn on no per-pool omap
  os/bluestore: fsck: warning (not error) by default on no per-pool omap
  os/bluestore: fsck: int64_t for error count
  os/bluestore: default size of 1 TB for testing
  os/bluestore: behave if we *do* set PGMETA and PERPOOL flags
  os/bluestore: do not set both PGMETA_OMAP and PERPOOL_OMAP
  os/bluestore: fsck: only generate 1 error per omap_head
  os/bluestore: make fsck repair convert to per-pool omap
  os/bluestore: teach fsck to tolerate per-pool omap
  os/bluestore: ondisk format change to 3 for per-pool omap
  mon/PGMap: add data/omap breakouts for 'df detail' view
  osd/osd_types: separate get_{user,allocated}_bytes() into data and omap variants
  mon/PGMap: fix stored_raw calculation
  mon/PGMap: add in actual omap usage into per-pool stats
  osd: report per-pool omap support via store_statfs_t
  os/bluestore: set per_pool_omap key on mkfs
  osd/osd_types: count per-pool omap capable OSDs
  os/bluestore: report omap_allocated per-pool
  os/bluestore: add pool prefix to omap keys
  kv/KeyValueDB: take key_prefix for estimate_prefix_size()
  os/bluestore: fix manual omap key manipulation to use Onode::get_omap_key()
  os/bluestore: make omap key helpers Onode methods
  os/bluestore: add Onode::get_omap_prefix() helper
  os/bluestore: change _do_omap_clear() args
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
For recovery or backfill, we use a temp object at the destination if the whole object context cannot be sent in one shot. Since ceph/ceph#29292, we segregate omap keys by the object's pool. However, this does not work for recovery or backfill processes that need to manipulate temp objects, because we deliberately assign each temp object a negative pool id, which is (obviously) different from the corresponding target object's, and we do not fix this up when renaming the temp object back to the target object at the end of recovery/backfill; as a result, we completely lose track of the omap portion of the recovered object. Fix by prefixing all omap-related keys of temp objects with the real (PG's) pool. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
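The heart of that fix can be sketched in a couple of lines (illustrative only, not the actual Ceph code): when building the omap pool prefix, substitute the PG's real pool for the temp object's negative pool id, so the keys survive the rename back to the target object.

```cpp
#include <cstdint>

// Temp objects carry a negative pool id; if we prefixed their omap keys
// with it, those keys would never match the target object's (positive)
// pool after the rename at the end of recovery/backfill. Use the PG's
// real pool instead when choosing the omap prefix.
int64_t omap_pool_for(int64_t object_pool, int64_t pg_pool) {
  return object_pool < 0 ? pg_pool : object_pool;
}
```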