New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DNM][Downstream][hotfix]mon/Elector: Change how we handle removed_ranks and notify_rank_removed() #48698
Conversation
|
I have tested this by running my own reproducer local using vstart.sh, removing and adding monitors in stretch_cluster. Please only look at the commits that have actual code changes and not just logging: Also, I was thinking of separating elector.notify_rank_removed() from notify_new_monmap(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find! There's at least one edge case still not covered, and a few others I'm worried about.
Plus of course figuring out what we want to do with all the new logging — much of it probably has value but at 20 or 30 instead of 10?
src/mon/Monitor.cc
Outdated
| notify_new_monmap(false); | ||
| if (!has_ever_joined) { | ||
| dout(20) << " !has_ever_joined " << *monmap << dendl; | ||
| notify_new_monmap(false, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I forget when we start pinging and generating ConnectionTracker state. Is the following possible?
New monitor turns on
Start pinging existing 5 monitors
Start copying paxos state from mon.1
mon.4 is removed from monitor list
finish copying state
get new map with mon.4 in removed_ranks
fail to remove because we are !has_ever_joined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Greg,
How about we only allow mon remove when it has joined quorum before and only is 1 epoch different from the new map, for other cases like
never joined or joined but we have a very outdated version, we don't adjust and also make
sure to nuke peer_report, incase it has already been populated with like a removed_rank:
if (newmap->get_epoch() - monmap->get_epoch() == 1 &&
has_ever_joined) {
notify_new_monmap(false);
} else {
notify_new_monmap(false, false);
elector.notify_clear_peer_state();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: hmm not sure I understood correctly.
That code snippet...probably works? As I noted below, we probably need to toss out our state whenever we have a discontinuity in witnessed monmaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by state, do you mean peer_tracker in the elector? if so I guess you mean even if we are 1 epoch apart we should always elector.notify_clear_peer_state();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, if we see a discontinuity we need to reset our scores because we don't know what might have happened to the monitor set and how the scores should align.
|
thank you for the comments, I will get to them shortly. I just pushed a commit that rewrites ConnectionTracker::notify_rank_removed(): the motivations are in the commit message. But in short, I believe by doing it this way will make the peer_tracker cleaner and less prone to I have tested this in my local vstart, removing the lowest rank, highest rank, middle and etc. |
69f9efa
to
a738e38
Compare
|
@gregsfortytwo Just updated this PR with regards to your comments, I also added the mechanism where we can reset |
fe01fbe
to
d56f185
Compare
9758948
to
c619de7
Compare
|
jenkins test api |
ae8261a
to
c70929a
Compare
|
jenkins test api |
src/mon/ConnectionTracker.cc
Outdated
| peer_reports.erase(pi++); | ||
| // Go through everyone, remove & adjust current and history map. | ||
| for (auto pi = peer_reports.begin(); pi != peer_reports.end(); pi++) { | ||
| peer_reports[pi->first].current.erase(rank_removed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pi->second.current.erase(rank_removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep good catch!
src/mon/ConnectionTracker.cc
Outdated
| // Go through everyone, remove & adjust current and history map. | ||
| for (auto pi = peer_reports.begin(); pi != peer_reports.end(); pi++) { | ||
| peer_reports[pi->first].current.erase(rank_removed); | ||
| peer_reports[pi->first].history.erase(rank_removed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pi->second.history.erase(rank_removed)
etc etc in this loop. Is there a reason you're looking up the peer_report each time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, should just use pi->second
src/mon/ConnectionTracker.cc
Outdated
| peer_reports[pi->first].history[hi->first - 1] = hi->second; | ||
| peer_reports[pi->first].current.erase(ci++); | ||
| peer_reports[pi->first].history.erase(hi++); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I really, really don't like this. The peer_reports mechanism is designed so only the monitor generating a report updates the numbers, and then we assimilate them. Directly manipulating them like this is icky, and actually can lead to differing election outcomes during an upgrade from the old-way monitors to this way.
From the commit message, I think this is the real problem:
However, there is a case where if we remove the highest rank of the quorum, then we will leave a the higest rank in report.current and report.history
If we just resolve that, which sounds like a fairly simple bug in ConnectionTracker::receive_peer_reports, I don't think we need to do this twiddling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think even with the old code, we are already directly manipulating peer_reports by adjusting ranks: https://github.com/ceph/ceph/blob/main/src/mon/ConnectionTracker.cc#L171-L176.
my thoughts are that since we are already adjusting the ranks of peer_reports, might as well adjust the ranks of peer_reports.current and peer_reports.history as well so that they are aligned with each other.
I understand your concerns about manipulating them, but if we go with that, then we should just only remove and not adjust the ranks in peer_reports at all correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think even with the old code, we are already directly manipulating peer_reports by adjusting ranks: https://github.com/ceph/ceph/blob/main/src/mon/ConnectionTracker.cc#L171-L176. my thoughts are that since we are already adjusting the ranks of peer_reports, might as well adjust the ranks of peer_reports.current and peer_reports.history as well so that they are aligned with each other.
I understand your concerns about manipulating them, but if we go with that, then we should just only remove and not adjust the ranks in peer_reports at all correct?
That link is just removing scores, not updating the individual reports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code moves the peer_report down so in a way it is adjusting the ranks implicitly since functions like get_total_connection_score rely on the position of the peer_report rather than the actual peer_report.rank.
If we use the old code we'll have this in this scenario:
we are in mon.0 point of view.
notify_rank_removed: 2
peer_reports before remove 2: {
0=rank=0, current links: {1=1, 2=1, 3=1, 4=1}, history: {1=0.9, 2=0.9 3=0.9, 4=0.9},
1=rank=1, current links: {0=1, 2=1, 3=1, 4=1}, history: {0=0.9, 2=0.9, 3=0.9, 4=0.9},
2=rank=2, current links: {0=1, 1=1, 3=1, 4=1}, history: {0=0.9, 2=0.9, 3=0.9, 4=0.9},
3=rank=3, current links: {0=1, 1=1, 2=1, 4=1}, history: {0=0.8, 1=0.7, 2=0.6, 4=0.9},
4=rank=4, current links: {0=1, 1=1, 2=1, 3=1}, history: {0=0.9, 1=0.6, 2=0.5, 3=0.7},
}
peer_reports after remove 2: {
0=rank=0, current links: {1=1, 2=1, 3=1, 4=1}, history: {1=0.9, 2=0.9 3=0.9, 4=0.9},
1=rank=1, current links: {0=1, 2=1, 3=1, 4=1}, history: {0=0.9, 2=0.9, 3=0.9, 4=0.9},
2=rank=3, current links: {0=1, 1=1, 2=1, 4=1}, history: {0=0.8, 1=0.7, 2=0.6, 4=0.9},
3=rank=4, current links: {0=1, 1=1, 2=1, 3=1}, history: {0=0.9, 1=0.6, 2=0.5, 3=0.7},
}
So now, rank 4 becomes rank 3, and rank 3 becomes rank 2.
let's say mon.0 call get_total_connection_score peer 3, peer_report[2] would give us nothing, but it should have given us 0.9.
So what I am trying to say is that since we are moving the ranks down by 1 already with the old code. I don't think it is fair to not move the ranks in current and history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, the fact that the old code doesn't adjust the actual peer_report.rank leaves us a vulnerability as well since let's say after we remove ranks with the old code and we ping another monitor and they happened to assimilate_connection_reports of the peer_report provided by us, then with our unadjusted peer_report.rank, we will populate the incorrect information on their peer_reports and this also can lead to differing election outcomes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregsfortytwo PTAL
…ElectionLogic & Elector Problem: Currently there are not ConnectionTracker logs, therefore it is really hard to debug Solution: Enable loggings for most functions in ConnectionTracker.cc Most of the logs are in debug_mon = 30. Also Added some logs in Elector and ElectionLogic so that debugging will be easier in the future. Signed-off-by: Kamoltat <ksirivad@redhat.com>
|
LGTM! |
e3d7bda
to
856fef5
Compare
PROBLEM: In `ConnectionTracker::receive_peer_report` we loop through ranks which is bad when there is `notify_rank_removed` before this and the ranks are not adjusted yet. When we rely on the rank in certain scenarios, we end up with extra peer_report copy which we don't want. SOLUTION: In `ConnectionTracker::receive_peer_report` instead of passing `report.rank` in the function `ConnectionTracker::reports`, we pass `i.first` instead so that trim old ranks properly. We also added a assert in notify_rank_removed(), comparing expected rank provided by the monmap against the rank that we adjust ourself to as a sanity check. We edited test/mon/test_election.cc to reflect the changes made in notify_rank_removed(). Signed-off-by: Kamoltat <ksirivad@redhat.com>
when a new monitor joins, there is a chance that it will recive a monmap that recently removed a monitor and ``removed_rank`` will have some content in it. A new monitor that joins should never remove rank in peer_tracker but rather call ``notify_clear_peer_state()`` to reset the `peer_report`. In the case when it is a monitor that has joined quorum before and is only 1 epoch behind the newest monmap provided by the probe_replied monitor. We can actually remove and adjust ranks in `peer_report` since we are sure that if there is any content in removed_ranks, then it has to be because in the next epoch we are removing a rank, since every update of an epoch we always clear the removed_ranks. There is no point in keeping the content of ``removed_ranks`` after monmap gets updated to the epoch. Therefore, clear ``removed_ranks`` every update. When there is discontinuity between monmaps for more 1 epoch or the new monitor never joined quorum before, we always reset `peer_tracker`. Moreover, beneficial for monitor log to also log which rank has been removed at the current time of the monmap. So add removed_ranks to `print_summary` and `dump` in MonMap.cc. Signed-off-by: Kamoltat <ksirivad@redhat.com>
In `notify_clear_peer_state()` we another mechanism in reseting our `peer_tracker.rank` to match our own monitor.rank. This is added so there is a way for us to recover from a scenrio where `peer_tracker.rank` is messed up from adjusting the ranks or removing ranks. `notifiy_clear_peer_state()` can be triggered by using the command: `ceph connection scores reset` Also in `clear_peer_reports`, besides reassigning my_reports to an empty object, we also have to make `my_reports` = `rank` from `peer_tracker`, such that we don't get -1 as a rank in my_reports. Signed-off-by: Kamoltat <ksirivad@redhat.com>
…d connection report When upgrading the monitors (include booting up), we check if `peer_tracker` is dirty or not. If so, we clear it. Added some functions in `Elector` and `ConnectionTracker` class to check for clean `peer_tracker`. Moreover, there could be some cases where due to startup weirdness or abnormal circumstances, we might get a report from our own rank. Therefore, it doesn't hurt to add a sanity check in `ConnectionTracker::report_live_connection` and `ConnectionTracker::report_dead_connection`. Signed-off-by: Kamoltat <ksirivad@redhat.com>
856fef5
to
3c49a06
Compare
|
jenkins test make check |
|
jenkins retest this please |
|
make check failure unrelated, rbd-unit-tests are known to be flaky 12 - run-rbd-unit-tests-1.sh (Failed) |
|
jenkins test make check |
|
jenkins test api |
|
@kamoltat let me know when this is ready for testing. You can tag it with |
|
@kamoltat one of the failures in the run had the following assert failure: Raised a tracker for it here: https://tracker.ceph.com/issues/58114 It looked related to this change. The coredump is available as well: |
|
@amathuria thanks for letting me know, looking at it right now |
|
I have investigated this thoroughly and I have concluded that this failure is due to how we are not properly using |
|
Analysis on my investigation: |
Problem: --mon-initial-members does nothing but causes monmap to populate ``removed_ranks`` because the way we start monitors in standalone tests uses ``run_mon $dir $id ..`` on each mon. Regardless of --mon-initial-members=a,b,c, if we set --mon-host=$MONA,$MONB,$MONC (which we do every single tests), everytime we run a monitor (e.g.,run mon.b) it will pre-build our monmap with ``` noname-a=mon.noname-a addrs v2:127.0.0.1:7127/0, b=mon.b addrs v2:127.0.0.1:7128/0, noname-c=mon.noname-c addrs v2:127.0.0.1:7129/0, ``` Now, with --mon-initial-members=a,b,c we are letting monmap know that we should have initial members name: a,b,c, which we only have `b` as a match. So what ``MonMap::set_initial_members`` do is that it will remove noname-a and noname-c which will populate `removed_ranks`. Solution: remove all instances of --mon-initial-members in the standalone test as it has no impact on the nature of the tests themselves. Fixes: https://tracker.ceph.com/issues/58132 Signed-off-by: Kamoltat <ksirivad@redhat.com>
|
@gregsfortytwo I have decided to get rid of the --mon-initial-members setting in the standalone test, as I have concluded that it does not really play a role in the tests at all except pre-populated |
|
jenkins test api |
|
Closing and moving this PR to #49312 |
Problem:
Currently, there is an issue when performing a DR test with 2 sites stretch cluster
where removing monitors and adding new ones to the cluster
causes incorrect
rankin ConnectionTracker class.This causes the monitor to think that they are someone else
in the
peer_trackercopy and will never update thecorrect field of itself, causing a deadlock in the election process (Ceph becoming unresponsive)
when using election strategy: 3 (Connectivity mode).
Solution:
It was really hard to debug the issue so the first thing we did was to add additional loggings to ConnectionTracker,
Elector and ElectionLogic Classes.
In
ConnectionTracker::receive_peer_reportwe loop through ranks which is bad when there isnotify_rank_removedbefore this and the ranks are not adjusted yet. When we rely on the rank in certain scenarios, we end up with extra peer_report copy which we don't want. Therefore, instead of passingreport.rankin the functionConnectionTracker::reports, we passi.firstinstead so that we trim old ranks properly. We also added an assert in notify_rank_removed(), comparing the expected rank provided by the monmap against the rank that we adjust ourselves to as a sanity check. We edited test/mon/test_election.cc to reflect the changes made in notify_rank_removed().MonMap::removed_rank does not get cleared every update of the epoch, this was the root cause of the problem. Therefore, we fix this by making MonMap clear removed_ranks every update. Moreover, When there is discontinuity between monmaps for more than 1 epoch or the new monitor never joined the quorum before,
we always reset
peer_tracker.Added a way for us to manually reset
peer_tracker.rankwhen executing the command:ceph connection scores resetfor each monitor. The peer_tracker.rank will match the current rank of the Monitor.When upgrading the monitors (including booting up), we check if
peer_trackeris dirty or not. If so, we clear it. Added some functions in theElectorandConnectionTrackerclasses to check for cleanpeer_tracker. Moreover, there could be some cases where due to startup weirdness or abnormal circumstances, we might get a report from our own rank. Therefore, it doesn't hurt to add a sanity check inConnectionTracker::report_live_connectionandConnectionTracker::report_dead_connection.Fixes: https://tracker.ceph.com/issues/58049
Signed-off-by: Kamoltat ksirivad@redhat.com
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows