os/bluestore,mon: segregate omap keys by pool; report via df #29292

Merged
merged 24 commits into ceph:master from wip-bluestore-per-pool-omap on Aug 9, 2019

Conversation

@liewegas (Member) commented Jul 24, 2019

  • prefix omap keys with the pool, so we can estimate per-pool usage (see the key-layout sketch below)
  • report this via 'ceph df detail'
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED 
    ssd       33 GiB     21 GiB     9.2 GiB       12 GiB         37.11 
    TOTAL     33 GiB     21 GiB     9.2 GiB       12 GiB         37.11 
 
POOLS:
    POOL     ID     STORED      (DATA)      (OMAP)      OBJECTS     USED        (DATA)      (OMAP)     %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY     USED COMPR     UNDER COMPR 
    foo       1     2.1 GiB     2.1 GiB     8.0 MiB         831     6.3 GiB     6.2 GiB     24 MiB     23.49       6.8 GiB     N/A               N/A               831            0 B             0 B 
  • fsck
  • make repair do a conversion and then set per_pool_omap=1
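
As a rough illustration of the key-layout idea in the first bullet (a sketch only; the names, separators, and handling of negative pool ids below are assumptions, not BlueStore's actual encoding), leading each omap key with the pool id groups all of a pool's omap data into one contiguous key range, so per-pool usage can be estimated with a prefix/range size query (cf. the key_prefix form of estimate_prefix_size in KeyValueDB):

    // Hypothetical sketch only: BlueStore's real key encoding differs in details.
    #include <cstdint>
    #include <string>

    // Append v in big-endian byte order so lexicographic key order matches numeric order.
    static void append_be64(std::string* out, uint64_t v) {
      for (int i = 56; i >= 0; i -= 8) {
        out->push_back(static_cast<char>((v >> i) & 0xff));
      }
    }

    // Build an omap key that sorts by pool first, then by the owning object,
    // then by the user's key; a per-pool size estimate is then a range query
    // over the pool's prefix.
    std::string make_per_pool_omap_key(int64_t pool_id,
                                       uint64_t omap_head,          // per-object omap id
                                       const std::string& user_key) {
      std::string k;
      append_be64(&k, static_cast<uint64_t>(pool_id));  // real code must bias negative pool ids
      append_be64(&k, omap_head);
      k.push_back('.');                                  // separator before the user key
      k += user_key;
      return k;
    }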

@markhpc (Member) commented Jul 25, 2019

Nice! I assume this doesn't count space-amp within rocksdb?

@liewegas (Member, Author) commented Jul 25, 2019 via email

@liewegas changed the title from "os/bluestore,mon: segretate omap keys by pool; report via df" to "os/bluestore,mon: segregate omap keys by pool; report via df" on Jul 25, 2019
@ifed01 (Contributor) left a comment

In the df output, maybe replace
STORED (DATA) (OMAP)
with something like
STORED= (data omap)
to make the relationship between these columns more evident.

@@ -1080,6 +1080,8 @@ class BlueStore : public ObjectStore,
if (--nref == 0)
delete this;
}

const string& get_omap_prefix();
Contributor:
nit: mark this one and the ones below as const?

void get_omap_key(const string& key, string *out);
void rewrite_omap_key(const string& old, string *out);
void get_omap_tail(string *out);
void decode_omap_key(const string& key, string *user_key);
Contributor:
nit: mark with static?

src/os/bluestore/BlueStore.cc (resolved discussion)
@ifed01 (Contributor) commented Jul 26, 2019

In general - looks good.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This makes it easier for the omap key helpers to depend on Onode (Onode::onode) state.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from ac261dc to a6003cf on July 30, 2019 02:24
@jdurgin (Member) left a comment
looks good overall, one repair fix

@liewegas force-pushed the wip-bluestore-per-pool-omap branch from a6003cf to da885c7 on August 1, 2019 09:04
@@ -7164,6 +7164,14 @@ int BlueStore::_fsck(bool deep, bool repair)
errors += r;
}

if (!per_pool_omap) {
Contributor:
Shouldn't we be more tolerant of this error and have a configurable warning/error level here, similar to what we do for per-pool stats?

@liewegas (Member, Author):
Ah, I forgot about

  const auto& no_pps_mode = cct->_conf->bluestore_no_per_pool_stats_tolerance;
  bool need_per_pool_stats = no_pps_mode == "until_fsck" ||
    (no_pps_mode == "until_repair" && repair);
  bool enforce_no_per_pool_stats = no_pps_mode == "enforce";

I wonder if we could do a slightly simpler model, where we have a warnings count (in addition to the error count), and repair would similarly have a warnings_repaired count to match up. The until_fsck/until_repair model is still confusing to me.
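
To make the proposal above concrete, here is a minimal sketch of the idea (an illustration only, not the merged code; the struct and function names are assumptions): fsck counts tolerated conditions as warnings, and repair keeps a matching warnings_repaired tally so the two totals can be compared after a run.

    #include <cstdint>

    struct FsckCounters {
      int64_t errors = 0;            // conditions that always fail fsck
      int64_t warnings = 0;          // tolerated conditions, e.g. legacy non-per-pool omap
      int64_t warnings_repaired = 0; // warnings actually fixed when running with repair
    };

    // Called when fsck finds an object whose omap keys are not per-pool.
    void note_legacy_omap(FsckCounters& c, bool repair) {
      ++c.warnings;                  // a warning rather than a hard error by default
      if (repair) {
        // ... convert this object's omap keys to the per-pool form here ...
        ++c.warnings_repaired;       // so the totals match up afterwards
      }
    }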

Contributor:
The current bluestore_no_per_pool_stats_tolerance model allows three modes: disable the new stats (enforce), error if they are not used (until_fsck), and upgrade when possible (until_repair). That is quite useful/flexible IMO, though perhaps the names aren't great...
The warning-only mode you're suggesting lacks some of that flexibility, but it is certainly doable.

(!o->onode.is_perpool_omap() && !o->onode.is_pgmeta_omap())) {
derr << "fsck error: " << oid << " has omap that is not per-pool or pgmeta"
<< dendl;
++errors;
Contributor:
This will result in a mismatch between the error count and the repaired count (if any), because the repairer isn't involved in the repair below.

@liewegas (Member, Author) commented Aug 1, 2019 via email

…_key()

Make this code more general and robust.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Set per-onode flag to indicate whether the object has per-pool keys or
not.  This will allow us to incrementally transition objects later.

Put the new keys under a different prefix.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This is a minimal change: we aren't separately reporting data vs omap
usage (like we do in 'osd df' output for individual osds).

Signed-off-by: Sage Weil <sage@redhat.com>
The get_user_bytes() helper is a bit weird because it uses the
raw_used_rate (replication/EC factor) so that it can work *backwards*
from raw usage to normalized user usage.  However, the legacy case that
works from PG stats does not use this factor... and the stored_raw value
(in the JSON output only) was incorrectly passing in a factor of 1.0,
which meant that for legacy mode it was a bogus value.

Fix by calculating stored_raw as stored_normalized * raw_used_rate.

Signed-off-by: Sage Weil <sage@redhat.com>
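
A small worked sketch of the fix described above (names are illustrative, not the actual PGMap code): the user-visible stored bytes times the replication/EC multiplier gives the raw bytes, which is consistent with the df example earlier in this PR where the pool shows 2.1 GiB stored and 6.3 GiB used (a raw_used_rate of roughly 3).

    // Hypothetical helper; raw_used_rate is the replication/EC multiplier.
    double stored_raw_bytes(double stored_normalized, double raw_used_rate) {
      return stored_normalized * raw_used_rate;  // e.g. 2.1 GiB * ~3 = ~6.3 GiB raw
    }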
…ap variants

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Move to ondisk format v3.  This means that per-pool omap keys may exist,
but does not imply that *all* objects use the new form until the
per_pool_omap=1 super key is also set.

Signed-off-by: Sage Weil <sage@redhat.com>
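
A sketch of the gating described in this commit (an illustration; the field and function names are assumptions, not BlueStore's actual members): ondisk format v3 only permits per-pool omap keys to exist, while the per_pool_omap super key is what asserts that every object has been converted, so the per-onode flag decides until that key is set.

    #include <cstdint>

    struct OmapLayoutState {
      uint32_t ondisk_format = 3;       // v3: per-pool omap keys may be present
      bool per_pool_omap_super = false; // set once mkfs/repair has converted everything
    };

    // Decide whether a given object's omap keys use the per-pool form.
    bool object_uses_per_pool_keys(const OmapLayoutState& s, bool onode_perpool_flag) {
      if (s.per_pool_omap_super) {
        return true;               // global guarantee: all objects converted
      }
      return onode_perpool_flag;   // otherwise fall back to the per-onode flag
    }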
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from 6a656cc to 71d3ef7 on August 6, 2019 15:18
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from c74235e to ae7003f on August 8, 2019 21:59
- error if objects not per-pool
- error if per_pool_omap not set
- convert as we go, and set the flag at the end

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
pgmeta trumps perpool.

Signed-off-by: Sage Weil <sage@redhat.com>
Less ENOSPC from vstart.

Signed-off-by: Sage Weil <sage@redhat.com>
Just to be on the safe side.

Signed-off-by: Sage Weil <sage@redhat.com>
Unconditionally error if the global per_pool_omap was set, though!

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from a3c0a6e to 35d7a2c on August 8, 2019 22:27
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from 35d7a2c to b850116 on August 9, 2019 13:21
@liewegas merged commit b850116 into ceph:master on Aug 9, 2019
liewegas added a commit that referenced this pull request Aug 9, 2019
* refs/pull/29292/head:
	os/bluestore: warn on no per-pool omap
	os/bluestore: fsck: warning (not error) by default on no per-pool omap
	os/bluestore: fsck: int64_t for error count
	os/bluestore: default size of 1 TB for testing
	os/bluestore: behave if we *do* set PGMETA and PERPOOL flags
	os/bluestore: do not set both PGMETA_OMAP and PERPOOL_OMAP
	os/bluestore: fsck: only generate 1 error per omap_head
	os/bluestore: make fsck repair convert to per-pool omap
	os/bluestore: teach fsck to tolerate per-pool omap
	os/bluestore: ondisk format change to 3 for per-pool omap
	mon/PGMap: add data/omap breakouts for 'df detail' view
	osd/osd_types: separate get_{user,allocated}_bytes() into data and omap variants
	mon/PGMap: fix stored_raw calculation
	mon/PGMap: add in actual omap usage into per-pool stats
	osd: report per-pool omap support via store_statfs_t
	os/bluestore: set per_pool_omap key on mkfs
	osd/osd_types: count per-pool omap capable OSDs
	os/bluestore: report omap_allocated per-pool
	os/bluestore: add pool prefix to omap keys
	kv/KeyValueDB: take key_prefix for estimate_prefix_size()
	os/bluestore: fix manual omap key manipulation to use Onode::get_omap_key()
	os/bluestore: make omap key helpers Onode methods
	os/bluestore: add Onode::get_omap_prefix() helper
	os/bluestore: change _do_omap_clear() args

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
xiexingguo added a commit to ceph/ceph-ci that referenced this pull request Aug 16, 2019
For recovery or backfill, we use a temp object if the whole
object context cannot be sent in one shot.

Since ceph/ceph#29292, we now
segregate omap keys by the object's binding pool. However,
this does not work for recovery or backfill processes that need
to manipulate temp objects, because we deliberately assign
a negative pool id to each temp object, which is (obviously)
different from the corresponding target object's pool.

Fix by prefixing all omap-related keys of temp objects
with the real (pg's) pool.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
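
A minimal sketch of the fix described in this commit (illustrative types and names, not the actual OSD code): when building omap key prefixes for a temp object, substitute the PG's real pool id for the temp object's negative pool id.

    #include <cstdint>

    // Minimal stand-in for the object id; the real hobject_t carries more fields.
    struct ObjectId {
      int64_t pool;  // negative for temp objects
    };

    // Pool id to use when prefixing the object's omap keys.
    int64_t omap_pool_for(const ObjectId& oid, int64_t pg_pool_id) {
      return (oid.pool < 0) ? pg_pool_id : oid.pool;  // temp objects borrow the PG's pool
    }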
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Aug 17, 2019
For recovery or backfill, we use a temp object if the whole
object context cannot be sent in one shot. And since
ceph#29292, we now segregate
omap keys by the object's binding pool.
However, ceph#29292 does not
work well for recovery or backfill processes that need
to manipulate temp objects, because we always (deliberately)
assign a negative pool id to each temp object, which is
(obviously) different from the corresponding target object,
and we do not fix this when renaming the temp object
back to the target object at the end of recovery/backfill.
As a result, we completely lose track of the omap
portion of the recovered object.

Fix by prefixing all omap-related keys of temp objects
with the real (pg's) pool.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Aug 17, 2019
epuertat pushed a commit to rhcs-dashboard/ceph that referenced this pull request Aug 22, 2019
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Sep 5, 2019
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Sep 5, 2019