src/mon/OSDMonitor.cc: [Stretch Mode] WRN non-existent CRUSH location assigned to MON #55103
src/mon/MonmapMonitor.cc
Outdated
    if (monmap.stretch_mode_enabled) {
      for (const auto &p : loc) {
        if (!mon.osdmon()->osdmap.crush->name_exists(p.second)) {
          ss << "location doesn't belong to any existing crush buckets!"
One very minor comment:
one long warning line? shouldn't it be broken down into multiple output lines?
Several lines above also have one long warning line, IMHO the warning isn't so large that we need multiple lines.
nvm I'm changing the output to be shorter, thank you for the review though.
jenkins test api
jenkins test windows
jenkins test api
jenkins test api
src/mon/MonmapMonitor.cc
Outdated
    if (!mon.osdmon()->osdmap.crush->name_exists(p.second)) {
      ss << "location doesn't belong to any existing crush buckets!"
         << " If you are trying to replace an arbiter mon, please use the command:"
         << " ceph mon set_new_tiebreaker";
Don't you need to add the tiebreaker before you can configure it as the tiebreaker? Or did we move that to be a special command to avoid all this?
You're right, the output is misleading: you need to run ceph mon add and then ceph mon set_new_tiebreaker. I'll just remove that part of the output.
I mean the problem is the tiebreaker monitor's bucket cannot exist in the crush map (or at least, there can't be OSDs there, so having the map reflect it is not idiomatic). And the documentation would need to be changed even if accepting that solution.
And once in stretch mode you are blocked from adding monitors without a location, if that wasn't clear. So this commit actually prevents replacing the tiebreaker at all.
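The concern raised in this thread can be illustrated with a minimal sketch. It assumes a simplified model in which the CRUSH map is only a set of bucket names and a monitor location is a key/value map; `CrushNames`, `MonLocation`, and `location_in_crush` are hypothetical names for illustration, not Ceph APIs. Under that assumption, a check shaped like the reviewed one rejects any location whose bucket is absent from CRUSH, which is the normal situation for a tiebreaker monitor.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for the CRUSH map and a monitor's location;
// the real code goes through osdmap.crush->name_exists().
using CrushNames = std::set<std::string>;
using MonLocation = std::map<std::string, std::string>;  // e.g. {"datacenter", "dc1"}

// Mirrors the shape of the reviewed check: every location value must
// name a bucket that already exists in the CRUSH map.
bool location_in_crush(const CrushNames& crush_names, const MonLocation& loc) {
  for (const auto& p : loc) {
    if (crush_names.count(p.second) == 0) {
      return false;  // bucket name not present in CRUSH
    }
  }
  return true;
}
```

With CRUSH buckets `dc1` and `dc2`, a monitor located at `datacenter=dc1` passes, while a tiebreaker at `datacenter=dc3` is rejected. Combined with the rule that stretch mode refuses monitors without a location, such a check would leave no way to add a replacement tiebreaker.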
src/mon/MonmapMonitor.cc
Outdated
      goto reply_no_propose;
    }
    if (monmap.stretch_mode_enabled) {
      for (const auto &p : loc) {
There's a TODO comment several lines up that should be removed if this actually works.
Okay will remove that
The documentation certainly still suggests you need to create a monitor and then set it as a tiebreaker. Which means that since this passed testing, it should also add some tests around that. 😮
@kamoltat @gregsfortytwo This was tested ref: https://trello.com/c/79noWWpu
532dd19 to 29bddbf
@gregsfortytwo You're right, I should have added a test to at least qa/standalone. I was just testing the change in vstart.sh. Will create a test for this.
How did you test this in vstart? Am I parsing things wrong, given my assumption it actually blocks switching to a new monitor in real deployments?
@gregsfortytwo I'll test it one more time, been a while since I tested this PR.
@kamoltat any update?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
8d446e8 to 10926fd
@gregsfortytwo ping
jenkins test make check
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Problem: In a stretch cluster, we encountered an assert failure while checking for dead CRUSH zones when a CRUSH bucket did not exist.
Solution: Ignore the non-existent CRUSH bucket instead of asserting.
Fixes: https://tracker.ceph.com/issues/63861
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
d4fc990 to a9c0abc
gregsfortytwo
left a comment
This seems fine
Reviewed-by: Greg Farnum gfarnum@redhat.com
src/mon/HealthMonitor.cc
Outdated
    void HealthMonitor::check_mon_crush_loc_stretch_mode(health_check_map_t *checks)
    {
      // Check if the CRUSH location exists for all MONs
      ceph_assert(mon.monmap->stretch_mode_enabled);
Rather than asserting, why not just return (healthy) if stretch mode is off?
I put in a lot of asserts that caused crashes in my first stretch implementation, and assert crashes are better than data corruption, but not being able to fail is even better!
The reason I have the assert is that the call site is guarded by the condition:
if (mon.monmap->stretch_mode_enabled)
so if we enter check_mon_crush_loc_stretch_mode with stretch mode off, something must have gone wrong. But I understand your point: we could call check_mon_crush_loc_stretch_mode unconditionally and return early if stretch mode is not enabled. I can see this being a cleaner approach.
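The early-return alternative discussed here might look roughly like the sketch below. It is a simplified model, not the actual HealthMonitor code: `MonMapModel`, `HealthChecksModel`, and `check_mon_crush_loc_sketch` are hypothetical stand-ins, and the real function works on `health_check_map_t`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; the real code uses MonMap and health_check_map_t.
struct MonMapModel {
  bool stretch_mode_enabled = false;
};

struct HealthChecksModel {
  std::vector<std::string> warnings;
};

// Early-return variant: if stretch mode is off there is nothing to check,
// so the function leaves `checks` untouched instead of asserting.
// Returns true only when the stretch-mode check actually ran.
bool check_mon_crush_loc_sketch(const MonMapModel& monmap,
                                HealthChecksModel* checks) {
  if (!monmap.stretch_mode_enabled) {
    return false;  // healthy by definition: the check does not apply
  }
  // Placeholder for the per-monitor CRUSH location inspection.
  (void)checks;
  return true;
}
```

The design point is that a guard makes the function safe to call from any call site, whereas an assert turns an unexpected caller into a monitor crash.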
In stretch mode, warn the user when we encounter a MON that has a nonexistent CRUSH location, with the tiebreaker MON being the only exception.
Fixes: https://tracker.ceph.com/issues/63861
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
…TRETCH_MODE Added NONEXISTENT_MON_CRUSH_LOC_STRETCH_MODE to the documentation. Fixes: https://tracker.ceph.com/issues/63861 Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
a9c0abc to 97b815c
Pushed new change regarding Greg's comment above
jenkins test make check
    The CRUSH location specified for the monitor must belong to one of the dividing
    buckets when stretch mode is enabled. With the ``tiebreaker`` monitor being the
    only exception.
Comma not period. The latter part here is not a sentence.
@anthonyeleven Ah okay, I guess I'll file a new PR for this.
Or I can do so; let's get it in while we can.
@anthonyeleven okay I think you'll be faster, thank you!
src/mon/OSDMonitor.cc: [Stretch Mode] WRN non-existent CRUSH location assigned to MON
Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Problem:
In stretch mode, we encountered an assert failure while checking for dead CRUSH zones when a CRUSH bucket did not exist.
Solution:
Ignore the non-existent CRUSH bucket instead of asserting. We then warn the user when a particular MON has a non-existent CRUSH location. The tiebreaker monitor is the only exception: it is allowed to have a non-existent CRUSH location.
Fixes: https://tracker.ceph.com/issues/63861
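As a rough illustration of the behavior described above, the following sketch collects the monitors that would trigger the warning. It assumes a simplified model (a set of CRUSH bucket names and a per-monitor location map); `MonInfoModel` and `mons_with_nonexistent_crush_loc` are hypothetical names, and the merged Ceph code may be structured differently.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified view of a monitor entry in the monmap.
struct MonInfoModel {
  std::string name;
  std::map<std::string, std::string> crush_loc;  // e.g. {"datacenter", "dc1"}
};

// Returns the names of monitors whose CRUSH location refers to a bucket
// that is not in `crush_names`. The tiebreaker is skipped, since it is
// the one monitor allowed to sit outside the CRUSH map.
std::vector<std::string> mons_with_nonexistent_crush_loc(
    const std::vector<MonInfoModel>& mons,
    const std::set<std::string>& crush_names,
    const std::string& tiebreaker_name) {
  std::vector<std::string> offenders;
  for (const auto& mon : mons) {
    if (mon.name == tiebreaker_name) {
      continue;  // documented exception
    }
    for (const auto& kv : mon.crush_loc) {
      if (crush_names.count(kv.second) == 0) {
        offenders.push_back(mon.name);
        break;  // one missing bucket is enough to flag this monitor
      }
    }
  }
  return offenders;
}
```

A non-empty result would correspond to raising a health warning (the PR names it NONEXISTENT_MON_CRUSH_LOC_STRETCH_MODE) rather than asserting.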