os/bluestore,mon: segregate omap keys by pool; report via df #29292

Merged
merged 24 commits into ceph:master from wip-bluestore-per-pool-omap on Aug 9, 2019

Conversation

@liewegas (Member) commented Jul 24, 2019

  • prefix omap keys with the pool, so we can estimate per-pool usage (see the key-layout sketch below)
  • report this via 'ceph df detail'
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED 
    ssd       33 GiB     21 GiB     9.2 GiB       12 GiB         37.11 
    TOTAL     33 GiB     21 GiB     9.2 GiB       12 GiB         37.11 
 
POOLS:
    POOL     ID     STORED      (DATA)      (OMAP)      OBJECTS     USED        (DATA)      (OMAP)     %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY     USED COMPR     UNDER COMPR 
    foo       1     2.1 GiB     2.1 GiB     8.0 MiB         831     6.3 GiB     6.2 GiB     24 MiB     23.49       6.8 GiB     N/A               N/A               831            0 B             0 B 
  • fsck
  • make repair do a conversion and then set per_pool_omap=1
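
As a rough illustration of the key-layout idea in the first bullet (a sketch only; the names, separators, and handling of negative pool ids below are assumptions, not BlueStore's actual encoding), leading each omap key with the pool id groups all of a pool's omap data into one contiguous key range, so per-pool usage can be estimated with a prefix/range size query (cf. the key_prefix form of estimate_prefix_size in KeyValueDB):

    // Hypothetical sketch only: BlueStore's real key encoding differs in details.
    #include <cstdint>
    #include <string>

    // Append v in big-endian byte order so lexicographic key order matches numeric order.
    static void append_be64(std::string* out, uint64_t v) {
      for (int i = 56; i >= 0; i -= 8) {
        out->push_back(static_cast<char>((v >> i) & 0xff));
      }
    }

    // Build an omap key that sorts by pool first, then by the owning object,
    // then by the user's key; a per-pool size estimate is then a range query
    // over the pool's prefix.
    std::string make_per_pool_omap_key(int64_t pool_id,
                                       uint64_t omap_head,          // per-object omap id
                                       const std::string& user_key) {
      std::string k;
      append_be64(&k, static_cast<uint64_t>(pool_id));  // real code must bias negative pool ids
      append_be64(&k, omap_head);
      k.push_back('.');                                  // separator before the user key
      k += user_key;
      return k;
    }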

@markhpc (Member) commented Jul 25, 2019

Nice! I assume this doesn't count space-amp within rocksdb?

@liewegas (Member, Author) commented Jul 25, 2019 via email

@liewegas changed the title from "os/bluestore,mon: segretate omap keys by pool; report via df" to "os/bluestore,mon: segregate omap keys by pool; report via df" on Jul 25, 2019
@ifed01 (Contributor) left a comment

In the df output, maybe replace
STORED (DATA) (OMAP)
with something like
STORED= (data omap)
to make the relationship between these columns more evident.

@@ -1080,6 +1080,8 @@ class BlueStore : public ObjectStore,
if (--nref == 0)
delete this;
}

const string& get_omap_prefix();
Contributor:
nit: mark this one and the ones below as const?

void get_omap_key(const string& key, string *out);
void rewrite_omap_key(const string& old, string *out);
void get_omap_tail(string *out);
void decode_omap_key(const string& key, string *user_key);
Contributor:
nit: mark with static?

src/os/bluestore/BlueStore.cc (resolved discussion)
@ifed01 (Contributor) commented Jul 26, 2019

In general - looks good.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This makes it easier for the omap key helpers to depend on Onode (Onode::onode) state.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from ac261dc to a6003cf on July 30, 2019 02:24
@jdurgin (Member) left a comment
looks good overall, one repair fix

@liewegas force-pushed the wip-bluestore-per-pool-omap branch from a6003cf to da885c7 on August 1, 2019 09:04
@@ -7164,6 +7164,14 @@ int BlueStore::_fsck(bool deep, bool repair)
errors += r;
}

if (!per_pool_omap) {
Contributor:
Shouldn't we be more tolerant of this error and have a configurable warning/error level here, similar to what we do for per-pool stats?

@liewegas (Member, Author):
Ah, I forgot about

  const auto& no_pps_mode = cct->_conf->bluestore_no_per_pool_stats_tolerance;
  bool need_per_pool_stats = no_pps_mode == "until_fsck" ||
    (no_pps_mode == "until_repair" && repair);
  bool enforce_no_per_pool_stats = no_pps_mode == "enforce";

I wonder if we could do a slightly simpler model, where we have a warnings count (in addition to the error count), and repair would similarly have a warnings_repaired count to match up. The until_fsck/until_repair model is still confusing to me.
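
To make the proposal above concrete, here is a minimal sketch of the idea (an illustration only, not the merged code; the struct and function names are assumptions): fsck counts tolerated conditions as warnings, and repair keeps a matching warnings_repaired tally so the two totals can be compared after a run.

    #include <cstdint>

    struct FsckCounters {
      int64_t errors = 0;            // conditions that always fail fsck
      int64_t warnings = 0;          // tolerated conditions, e.g. legacy non-per-pool omap
      int64_t warnings_repaired = 0; // warnings actually fixed when running with repair
    };

    // Called when fsck finds an object whose omap keys are not per-pool.
    void note_legacy_omap(FsckCounters& c, bool repair) {
      ++c.warnings;                  // a warning rather than a hard error by default
      if (repair) {
        // ... convert this object's omap keys to the per-pool form here ...
        ++c.warnings_repaired;       // so the totals match up afterwards
      }
    }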

Contributor:
The current bluestore_no_per_pool_stats_tolerance model allows three modes: disable the new stats (enforce), error if they are not used (until_fsck), and upgrade when possible (until_repair). That is quite useful/flexible IMO, though perhaps the names aren't great...
The warning-only mode you're suggesting lacks some of that flexibility, but it is certainly doable.

(!o->onode.is_perpool_omap() && !o->onode.is_pgmeta_omap())) {
derr << "fsck error: " << oid << " has omap that is not per-pool or pgmeta"
<< dendl;
++errors;
Contributor:
This will result in a mismatch between the error count and the repaired count (if any), because the repairer isn't involved in the repair below.

@liewegas (Member, Author) commented Aug 1, 2019 via email

…_key()

Make this code more general and robust.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Set per-onode flag to indicate whether the object has per-pool keys or
not.  This will allow us to incrementally transition objects later.

Put the new keys under a different prefix.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This is a minimal change: we aren't separately reporting data vs omap
usage (like we do in 'osd df' output for individual osds).

Signed-off-by: Sage Weil <sage@redhat.com>
The get_user_bytes() helper is a bit weird because it uses the
raw_used_rate (replication/EC factor) so that it can work *backwards*
from raw usage to normalized user usage.  However, the legacy case that
works from PG stats does not use this factor... and the stored_raw value
(in the JSON output only) was incorrectly passing in a factor of 1.0,
which meant that for legacy mode it was a bogus value.

Fix by calculating stored_raw as stored_normalized * raw_used_rate.

Signed-off-by: Sage Weil <sage@redhat.com>
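
A small worked sketch of the fix described above (names are illustrative, not the actual PGMap code): the user-visible stored bytes times the replication/EC multiplier gives the raw bytes, which is consistent with the df example earlier in this PR where the pool shows 2.1 GiB stored and 6.3 GiB used (a raw_used_rate of roughly 3).

    // Hypothetical helper; raw_used_rate is the replication/EC multiplier.
    double stored_raw_bytes(double stored_normalized, double raw_used_rate) {
      return stored_normalized * raw_used_rate;  // e.g. 2.1 GiB * ~3 = ~6.3 GiB raw
    }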
…ap variants

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Move to ondisk format v3.  This means that per-pool omap keys may exist,
but does not imply that *all* objects use the new form until the
per_pool_omap=1 super key is also set.

Signed-off-by: Sage Weil <sage@redhat.com>
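
A sketch of the gating described in this commit (an illustration; the field and function names are assumptions, not BlueStore's actual members): ondisk format v3 only permits per-pool omap keys to exist, while the per_pool_omap super key is what asserts that every object has been converted, so the per-onode flag decides until that key is set.

    #include <cstdint>

    struct OmapLayoutState {
      uint32_t ondisk_format = 3;       // v3: per-pool omap keys may be present
      bool per_pool_omap_super = false; // set once mkfs/repair has converted everything
    };

    // Decide whether a given object's omap keys use the per-pool form.
    bool object_uses_per_pool_keys(const OmapLayoutState& s, bool onode_perpool_flag) {
      if (s.per_pool_omap_super) {
        return true;               // global guarantee: all objects converted
      }
      return onode_perpool_flag;   // otherwise fall back to the per-onode flag
    }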
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from 6a656cc to 71d3ef7 on August 6, 2019 15:18
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from c74235e to ae7003f on August 8, 2019 21:59
- error if objects not per-pool
- error if per_pool_omap not set
- convert as we go, and set the flag at the end

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
pgmeta trumps perpool.

Signed-off-by: Sage Weil <sage@redhat.com>
Less ENOSPC from vstart.

Signed-off-by: Sage Weil <sage@redhat.com>
Just to be on the safe side.

Signed-off-by: Sage Weil <sage@redhat.com>
Unconditionally error if the global per_pool_omap was set, though!

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from a3c0a6e to 35d7a2c on August 8, 2019 22:27
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas force-pushed the wip-bluestore-per-pool-omap branch from 35d7a2c to b850116 on August 9, 2019 13:21
@liewegas merged commit b850116 into ceph:master on Aug 9, 2019
liewegas added a commit that referenced this pull request Aug 9, 2019
* refs/pull/29292/head:
	os/bluestore: warn on no per-pool omap
	os/bluestore: fsck: warning (not error) by default on no per-pool omap
	os/bluestore: fsck: int64_t for error count
	os/bluestore: default size of 1 TB for testing
	os/bluestore: behave if we *do* set PGMETA and PERPOOL flags
	os/bluestore: do not set both PGMETA_OMAP and PERPOOL_OMAP
	os/bluestore: fsck: only generate 1 error per omap_head
	os/bluestore: make fsck repair convert to per-pool omap
	os/bluestore: teach fsck to tolerate per-pool omap
	os/bluestore: ondisk format change to 3 for per-pool omap
	mon/PGMap: add data/omap breakouts for 'df detail' view
	osd/osd_types: separate get_{user,allocated}_bytes() into data and omap variants
	mon/PGMap: fix stored_raw calculation
	mon/PGMap: add in actual omap usage into per-pool stats
	osd: report per-pool omap support via store_statfs_t
	os/bluestore: set per_pool_omap key on mkfs
	osd/osd_types: count per-pool omap capable OSDs
	os/bluestore: report omap_allocated per-pool
	os/bluestore: add pool prefix to omap keys
	kv/KeyValueDB: take key_prefix for estimate_prefix_size()
	os/bluestore: fix manual omap key manipulation to use Onode::get_omap_key()
	os/bluestore: make omap key helpers Onode methods
	os/bluestore: add Onode::get_omap_prefix() helper
	os/bluestore: change _do_omap_clear() args

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
xiexingguo added a commit to ceph/ceph-ci that referenced this pull request Aug 16, 2019
For recovery or backfill, we use a temp object if the whole
object context cannot be sent in one shot.

Since ceph/ceph#29292, we now
segregate omap keys by the object's binding pool. However,
this does not work for recovery or backfill processes that need
to manipulate temp objects, because we deliberately assign
a negative pool id to each temp object, which is (obviously)
different from the corresponding target object's pool.

Fix by prefixing all omap-related keys of temp objects
with the real (pg's) pool.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
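
A minimal sketch of the fix described in this commit (illustrative types and names, not the actual OSD code): when building omap key prefixes for a temp object, substitute the PG's real pool id for the temp object's negative pool id.

    #include <cstdint>

    // Minimal stand-in for the object id; the real hobject_t carries more fields.
    struct ObjectId {
      int64_t pool;  // negative for temp objects
    };

    // Pool id to use when prefixing the object's omap keys.
    int64_t omap_pool_for(const ObjectId& oid, int64_t pg_pool_id) {
      return (oid.pool < 0) ? pg_pool_id : oid.pool;  // temp objects borrow the PG's pool
    }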
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Aug 17, 2019
For recovery or backfill, we use a temp object if the whole
object context cannot be sent in one shot. And since
ceph#29292, we now segregate
omap keys by the object's binding pool.
However, ceph#29292 does not
work well for recovery or backfill processes that need
to manipulate temp objects, because we always (deliberately)
assign a negative pool id to each temp object, which is
(obviously) different from the corresponding target object,
and we do not fix this when renaming the temp object
back to the target object at the end of recovery/backfill.
As a result, we completely lose track of the omap
portion of the recovered object.

Fix by prefixing all omap-related keys of temp objects
with the real (pg's) pool.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Aug 17, 2019
epuertat pushed a commit to rhcs-dashboard/ceph that referenced this pull request Aug 22, 2019
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Sep 5, 2019
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Sep 5, 2019