
mon/OSDMonitor: Track when pool quotas exceed object counts or bytes, and out… #26873

Closed

Conversation

gregsfortytwo (Member)

…put them

We add a separate pool FLAG_FULL_QUOTA_OBJECTS, which is set only
when FLAG_FULL_QUOTA is set AND the quota exceeded is for the object
count. This lets us output the relevant exceeded quota from
OSDMap::check_health().

Fixes: http://tracker.ceph.com/issues/38653

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
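As a rough illustration of the mechanism the commit message describes (not the actual Ceph code — the flag values and the helper function below are stand-ins), the mon-side decision amounts to comparing pg stats against the pool's quotas and recording which quota tripped, so the health check can report it later without access to pg stats:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in flag values; not Ceph's actual bit assignments.
constexpr uint64_t FLAG_FULL_QUOTA         = 1ULL << 0;
constexpr uint64_t FLAG_FULL_QUOTA_OBJECTS = 1ULL << 1;

// Hypothetical helper: decide the quota flags from current stats.
// A quota of 0 means "no quota set".
uint64_t compute_quota_flags(uint64_t num_objects, uint64_t num_bytes,
                             uint64_t quota_max_objects,
                             uint64_t quota_max_bytes) {
  uint64_t flags = 0;
  const bool objects_full =
      quota_max_objects && num_objects >= quota_max_objects;
  const bool bytes_full = quota_max_bytes && num_bytes >= quota_max_bytes;
  if (objects_full || bytes_full)
    flags |= FLAG_FULL_QUOTA;          // pool is full for some quota reason
  if (objects_full)
    flags |= FLAG_FULL_QUOTA_OBJECTS;  // ...and specifically on object count
  return flags;
}
```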

@gregsfortytwo (Member Author)

Marked DNM until this passes make check, as I've just compile-tested it. o_0

Anyway, this functionality was requested by one of our support teams and it turned out to be a bit difficult because the quota fullness is only detected by comparing pg stats, which of course the OSDMap doesn't have any visibility into. Adding an extra flag to indicate which quota we violated is the easiest way to solve that problem, but is definitely a bit of a hack. What do people think?

(Commit pushed with the same message as the description above.)
@@ -1137,6 +1138,7 @@ struct pg_pool_t {
case FLAG_NOSCRUB: return "noscrub";
case FLAG_NODEEP_SCRUB: return "nodeep-scrub";
case FLAG_FULL_QUOTA: return "full_quota";
case FLAG_FULL_QUOTA_OBJECTS: return "full_quota_objects";
Member

Should we rename FLAG_FULL_QUOTA to FLAG_FULL_QUOTA_BYTES? It will be slightly inconsistent if the flag is already set when we upgrade, but I think the mon will almost immediately readjust the flags to reflect the current stats, right? We can make the compat encoding bitwise-or the two flags into one for older clients. But new clients would need to be updated to check for both flags. :/
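The compat encoding suggested here — bitwise-or the two specific flags back into the one legacy flag for older clients — could be sketched like this (a minimal illustration only; the names, bit values, and helper are hypothetical, not Ceph's actual encoding path):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout after the proposed rename: the old single flag's
// bit becomes the "bytes" flag, and a new "objects" flag is added.
constexpr uint64_t FLAG_FULL_QUOTA_BYTES   = 1ULL << 0;
constexpr uint64_t FLAG_FULL_QUOTA_OBJECTS = 1ULL << 1;
constexpr uint64_t FLAG_FULL_QUOTA_LEGACY  = FLAG_FULL_QUOTA_BYTES;

// Sketch of the compat fold: older clients only know the legacy flag,
// so either specific flag collapses to it on the wire.
uint64_t flags_for_legacy_client(uint64_t flags) {
  if (flags & (FLAG_FULL_QUOTA_BYTES | FLAG_FULL_QUOTA_OBJECTS)) {
    flags &= ~(FLAG_FULL_QUOTA_BYTES | FLAG_FULL_QUOTA_OBJECTS);
    flags |= FLAG_FULL_QUOTA_LEGACY;
  }
  return flags;
}
```

The downside mentioned in the comment is visible here: new clients would have to check both specific flags everywhere the old code checked one.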

Member Author

My thought was actually not to do that, because we don’t want to update all the different pieces that do fullness checks to examine both flags, or to deal with forward-and-backwards versioning so much. This lets us just check for the extra OBJECTS flag when doing the health outputs since in all other cases we want the same behavior anyway.
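The approach described here — all existing fullness checks keep looking only at FLAG_FULL_QUOTA, and the OBJECTS flag is consulted solely when building the health output — might look roughly like this (a self-contained sketch; the flag values and the message strings are illustrative, not Ceph's actual health text):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in flag values; not Ceph's actual bit assignments.
constexpr uint64_t FLAG_FULL_QUOTA         = 1ULL << 0;
constexpr uint64_t FLAG_FULL_QUOTA_OBJECTS = 1ULL << 1;

// Sketch: only the health-detail path looks at the extra OBJECTS flag;
// everything else keeps treating FLAG_FULL_QUOTA as "the pool is full".
std::string quota_health_detail(uint64_t flags) {
  if (!(flags & FLAG_FULL_QUOTA))
    return "";
  if (flags & FLAG_FULL_QUOTA_OBJECTS)
    return "pool is full (running out of objects)";
  return "pool is full (running out of bytes)";
}
```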

Member

yeah, makes sense! a lot of effort for what amounts to flag cosmetics

Member

I guess then I would just adjust the FLAG_FULL_QUOTA comment to say "the pool object or byte quota".

Member Author

Which comment are you looking at? I don't think FULL_QUOTA's comment specifies which quota anywhere, since it's always been either one.

@liewegas (Member) left a comment

otherwise lgtm!

@gregsfortytwo gregsfortytwo changed the title DNM: osdmon: Track when pool quotas exceed object counts or bytes, and out… osdmon: Track when pool quotas exceed object counts or bytes, and out… Mar 11, 2019
@neha-ojha neha-ojha added the core label Mar 13, 2019
@yuriw (Contributor)

yuriw commented Mar 13, 2019

@jecluis (Member) left a comment

Looks good, and far from holding it back, but I have a nagging concern about the lack of output when the byte quota is reached.

src/osd/OSDMap.cc (outdated review thread; resolved)
This is analogous to FLAG_FULL_QUOTA_OBJECTS and behaves the same way,
as having both flags lets us output if a user exceeds both thresholds.

Fixes: http://tracker.ceph.com/issues/38653

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo (Member Author)

@neha-ojha @yuriw I'm having trouble identifying how I could have changed things so that test would fail, since it's definitely still outputting the quota_full flags and I can't even tell where the "max_objects" that the script is looking for comes from (in 'full_quota max_objects'). I do see in the monitor log it is correctly marking the pool as full for object quota...
Anyway, with the extra patches here I've changed it slightly again so let's run it again and if there are still issues I will try and dig through it again. :/

This lets us detect when we stop being full on bytes or objects but stay
or add the full state on the other.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
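The transition logic this commit describes — clearing fullness on one quota while staying (or becoming) full on the other — can be sketched by recomputing all quota flags from scratch on each update (illustrative only; flag names, values, and the helper are stand-ins for the real pg_pool_t logic):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in flag values; not Ceph's actual bit assignments.
constexpr uint64_t FLAG_FULL_QUOTA         = 1ULL << 0;
constexpr uint64_t FLAG_FULL_QUOTA_OBJECTS = 1ULL << 1;
constexpr uint64_t FLAG_FULL_QUOTA_BYTES   = 1ULL << 2;
constexpr uint64_t FLAG_UNRELATED          = 1ULL << 10;  // e.g. noscrub

// Sketch: wipe all three quota bits, then re-derive them from the
// current per-quota fullness, leaving unrelated flags untouched.
uint64_t recompute_quota_flags(uint64_t flags,
                               bool objects_full, bool bytes_full) {
  flags &= ~(FLAG_FULL_QUOTA | FLAG_FULL_QUOTA_OBJECTS |
             FLAG_FULL_QUOTA_BYTES);
  if (objects_full)
    flags |= FLAG_FULL_QUOTA | FLAG_FULL_QUOTA_OBJECTS;
  if (bytes_full)
    flags |= FLAG_FULL_QUOTA | FLAG_FULL_QUOTA_BYTES;
  return flags;
}
```

Recomputing rather than toggling individual bits is what makes "stop being full on bytes but stay full on objects" fall out naturally.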
@gregsfortytwo (Member Author)

@jecluis @liewegas does the update look good?

@liewegas liewegas changed the title osdmon: Track when pool quotas exceed object counts or bytes, and out… mon/OSDMonitor: Track when pool quotas exceed object counts or bytes, and out… Mar 20, 2019
@liewegas (Member)

I think we also need to update pg_pool_t::encode() to filter this flag out if SERVER_NAUTILUS isn't present (if we intend to backport this asap, otherwise nautilus).

@liewegas (Member)

otherwise, this looks much cleaner, yay!

@liewegas (Member)

The goal is that when require_osd_release < nautilus, it doesn't have anything pre-nautilus code doesn't understand or wouldn't encode. Flags are a bit of a gray area, but it's better if we maintain hygiene here IMO.

As for the feature bit, I'm suggesting we fudge a bit since we expect to backport this immediately. Otherwise we'd need to spend a feature bit on this small feature.

@gregsfortytwo gregsfortytwo force-pushed the wip-38653-quota-output branch 2 times, most recently from c2505f3 to e49f995 Compare March 20, 2019 07:46
We don't encode these new flags for pre-Nautilus daemons, but that's
a fudge since we already have a Nautilus release without knowledge of them. So
also strip these flags out of any encoding with an older struct_v.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
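The encode-side stripping this commit describes could be sketched as follows (a hypothetical, self-contained illustration — the helper and bit values are stand-ins; only the struct_v cutoff of 29 comes from the decode snippet quoted later in this thread):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in flag values; not Ceph's actual bit assignments.
constexpr uint64_t FLAG_FULL_QUOTA         = 1ULL << 0;
constexpr uint64_t FLAG_FULL_QUOTA_OBJECTS = 1ULL << 1;
constexpr uint64_t FLAG_FULL_QUOTA_BYTES   = 1ULL << 2;

// Sketch: when encoding with an older struct_v (for daemons that predate
// these flags), mask the new bits out so older code never sees bits it
// doesn't understand.
uint64_t flags_to_encode(uint64_t flags, int struct_v) {
  if (struct_v < 29)
    flags &= ~(FLAG_FULL_QUOTA_OBJECTS | FLAG_FULL_QUOTA_BYTES);
  return flags;
}
```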
@gregsfortytwo (Member Author)

We'd still need to clean it up on both sides, then; so done.

...and it turns out we don't clean up the flags for older encoding schemes, btw...not sure if that's a problem we should worry about. :/

if (struct_v < 29) {
eflags &= ~(FLAG_FULL_QUOTA_OBJECTS|FLAG_FULL_QUOTA_BYTES);
}
flags = eflags;
Member

i don't think we need this part.. if we're decoding an old struct then it won't have the flag set, right?

Member Author

Not if a v14.2.1 monitor encoded FLAG_FULL_QUOTA_OBJECTS for a v14.2.0 monitor, then disappeared or restarted or something?

Member

Oh, hmm, right. I think this will lead to more problems rather than fewer, because simply decoding and re-encoding a Nautilus-era map may change it. I think instead we're best off with what you started with and no filtering of flags at all! Sorry for the noise.

Member

Hrm, I take it back (again).

I still think we should drop this. For the most part the 14.2.1/14.2.0 difference doesn't matter, because 14.2.0 doesn't have a decode check and ignores this flag. So if your case happens, where 14.2.1 encodes the new flag, a 14.2.0 mon will silently keep it and pass it along. That's true both for a 14.2.0 peon and for a 14.2.1 leader that takes over the leader role. Same for 14.2.0 OSDs: they will take the new flag and ignore it.

Member Author

So we do want to drop this commit?
I mean, I think that will work. The failure case is the leader going from a 14.2.1 to a 14.2.0 monitor and incorrectly maintaining the QUOTA_FULL flags; but the next time a 14.2.1 monitor is leader it will:

  • notice that the flags don't match
  • fall into the "update" case after noticing a state mismatch between current and prior QUOTA_FULL states
  • erroneously set the FULL flag in that update case
  • ...not notice that anything is wrong since the smaller quota settings will match, and never fix this.

Shoot, I thought it would converge on the correct settings. I'll have to update it a little bit.

…LL flags"

This reverts commit 1156883.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…xed versions

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo (Member Author)

If this looks good I'll squash it down and we can run through testing.

@liewegas (Member) left a comment

lgtm! squash it!

@liewegas (Member)

http://pulpito.ceph.com/sage-2019-03-25_09:57:19-rados-wip-sage-testing-2019-03-24-1032-distro-basic-smithi/

lots of failures like

"2019-03-25 10:11:17.741128 mon.a (mon.0) 71 : cluster [WRN] Health check failed: 1 pool(s) full (POOL_FULL)" in cluster log

@stale

stale bot commented May 25, 2019

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label May 25, 2019
@tchaikov (Contributor)

@gregsfortytwo ping?

@stale stale bot removed the stale label May 26, 2019
@stale

stale bot commented Jul 25, 2019

(Same stale notice as above.)

@stale stale bot added the stale label Jul 25, 2019
@gregsfortytwo (Member Author)

Ugh I should look at these failures so we can at least get it in for Octopus...

@stale stale bot removed the stale label Aug 30, 2019
@stale

stale bot commented Oct 29, 2019

(Same stale notice as above.)

@stale stale bot added the stale label Oct 29, 2019
@stale

stale bot commented Jan 27, 2020

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@stale stale bot closed this Jan 27, 2020