-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mon/OSDMonitor: Track when pool quotas exceed object counts or bytes, and out… #26873
Conversation
Marked DNM until this passes make check, as I've just compile-tested it. o_0 Anyway, this functionality was requested by one of our support teams and it turned out to be a bit difficult because the quota fullness is only detected by comparing pg stats, which of course the OSDMap doesn't have any visibility into. Adding an extra flag to indicate which quota we violated is the easiest way to solve that problem, but is definitely a bit of a hack. What do people think? |
…put them We add a separate pool FLAG_QUOTA_FULL_OBJECTS, which is set only when the FLAG_QUOTA_FULL is set AND the quota exceeded is for the object count. This lets us output the relevant exceeded quota from OSDMap::check_health(). Fixes: http://tracker.ceph.com/issues/38653 Signed-off-by: Greg Farnum <gfarnum@redhat.com>
a1d8839
to
89b6174
Compare
@@ -1137,6 +1138,7 @@ struct pg_pool_t { | |||
case FLAG_NOSCRUB: return "noscrub"; | |||
case FLAG_NODEEP_SCRUB: return "nodeep-scrub"; | |||
case FLAG_FULL_QUOTA: return "full_quota"; | |||
case FLAG_FULL_QUOTA_OBJECTS: return "full_quota_objects"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename FLAG_FULL_QUOTA to FLAG_FULL_QUOTA_BYTES? it will be slightly inconsistent if the flag is already set when we upgrade, but i think the mon will almost immediately readjust the flags to reflect the current stats, right? We can make the compat encoding bitwise-or the two flags into one for older clients. But new clients would need to be updated to check for both flags. :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought was actually not to do that, because we don’t want to update all the different pieces that do fullness checks to examine both flags, or to deal with forward-and-backwards versioning so much. This lets us just check for the extra OBJECTS flag when doing the health outputs since in all other cases we want the same behavior anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, makes sense! a lot of effort for what amounts to flag cosmetics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess then I would just adjust hte FLAG_FULL_QUOTA comment to say the pool object or byte quota
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which comment are you looking at? I don't think FULL_QUOTA specifies anywhere since it's always been either one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, and far from holding it back, but have an annoying concern about lack of output for quota bytes being reached.
This is analogous to FLAG_FULL_QUOTA_OBJECTS and behaves the same way, as having both flags lets us output if a user exceeds both thresholds. Fixes: http://tracker.ceph.com/issues/38653 Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@neha-ojha @yuriw I'm having trouble identifying how I could have changed things so that test would fail, since it's definitely still outputting the quota_full flags and I can't even tell where the "max_objects" that the script is looking for comes from (in 'full_quota max_objects'). I do see in the monitor log it is correctly marking the pool as full for object quota... |
4c7cbd7
to
1d0561a
Compare
This lets us detect when we stop being full on bytes or objects but stay or add the full state on the other. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
1d0561a
to
664cc21
Compare
I think we also need to update pg_pool_t::encode() to filter this flag out if SERVER_NAUTILUS isn't present (if we intend to backport this asap, otherwise nautilus). |
otherwise, this looks much cleaner, yay! |
The goal is that when require_osd_release < nautilus, it doesn't have anything pre-nautilus code doesn't understand or wouldn't encode. Flags are a bit of a gray area, but it's better if we maintain hygiene here IMO. As for the feature bit, I'm suggesting we fudge a bit since we expect to backport this immediately. Otherwise we'd need to spend a feature bit on this small feature. |
c2505f3
to
e49f995
Compare
We don't encode these new flags for pre-Nautilus daemons, but that's a fudge since we already have a Nautilus release without knowledge of them. So also strip these flags out of any encoding with an older struct_v. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
e49f995
to
1156883
Compare
We'd still need to clean it up on both sides, then; so done. ...and it turns out we don't clean up the flags for older encoding schemes, btw...not sure if that's a problem we should worry about. :/ |
src/osd/osd_types.cc
Outdated
if (struct_v < 29) { | ||
eflags &= ~(FLAG_FULL_QUOTA_OBJECTS|FLAG_FULL_QUOTA_BYTES); | ||
} | ||
flags = eflags; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we need this part.. if we're decoding an old struct then it won't have the flag set, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not if a v14.2.1 monitor encoded FLAG_FULL_QUOTA_OBJECTS for a v14.2.0 monitor, then disappeared or restarted or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, hmm, right. I think this will lead to more instead of less problems because simply decoding and reencoding a nautilus-era map may change it. I think instead we're best with what you started with and no filtering of flags at all! Sorry for the noise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, I take it back (again).
I still think we should drop this. For the most part 14.2.1 and 14.2.0 doesn't matter because 14.2.0 doesn't have a decode check and ignores this flag. So if your case happens, where 14.2.1 encodes the new flag, a 14.2.0 mon will silently keep it and pass it along. that's true both for a 14.2.0 peon, or a 14.2.1 leader that takes over the leader role. Same for 14.2.0 osds.. they will take the new flag and ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we do want to drop this commit?
I mean, I think that will work. The failure case is the leader going from a 14.2.1 to a 14.2.0 monitor and incorrectly maintaining the QUOTA_FULL flags; but the next time a 14.2.1 monitor is leader it will:
- notice that the flags don't match
- fall into the "update" case after noticing a state mismatch between current and prior QUOTA_FULL states
- erroneously set the FULL flag in that update case
- ...not notice that anything is wrong since the smaller quota settings will match, and never fix this.
Shoot, I thought it would converge on the correct settings. I'll have to update it a little bit.
…LL flags" This reverts commit 1156883. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…xed versions Signed-off-by: Greg Farnum <gfarnum@redhat.com>
If this looks good I'll squash it down and we can run through testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! squash it!
lots of failures like
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
@gregsfortytwo ping? |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
Ugh I should look at these failures so we can at least get it in for Octopus... |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
…put them
We add a separate pool FLAG_QUOTA_FULL_OBJECTS, which is set only
when the FLAG_QUOTA_FULL is set AND the quota exceeded is for the object
count. This lets us output the relevant exceeded quota from
OSDMap::check_health().
Fixes: http://tracker.ceph.com/issues/38653
Signed-off-by: Greg Farnum gfarnum@redhat.com