New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mon/Elector.cc Added additional prank >= ranks_size sanity check #49259
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be compressing these checks into the send_peer_ping function instead of open-coding them above? It could return success or else remove the peer from longing and return failure and have callers bail out. That way we don't need to remember this for future callers too.
src/mon/Elector.cc
Outdated
| // Monitor no longer exists in the monmap, | ||
| // therefore, we shouldn't ping this monitor | ||
| // since we cannot lookup the address! | ||
| dout(5) << __func__ << "peer >= ranks_size ... droping to prevent " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/droping/dropping/
src/mon/Elector.cc
Outdated
| @@ -496,7 +506,8 @@ void Elector::ping_check(int peer) | |||
| // Monitor no longer exists in the monmap, | |||
| // therefore, we shouldn't ping this monitor | |||
| // since we cannot lookup the address! | |||
| dout(20) << __func__ << "peer >= ranks_size" << dendl; | |||
| dout(5) << __func__ << "peer >= ranks_size ... droping to prevent " | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/droping/dropping/
Do we still want to keep this at 5? I guess it's not a common operation so no harm.
src/mon/Elector.cc
Outdated
| @@ -566,6 +577,15 @@ void Elector::handle_ping(MonOpRequestRef op) | |||
| dout(10) << __func__ << " " << *m << dendl; | |||
|
|
|||
| int prank = mon->monmap->get_rank(m->get_source_addr()); | |||
| if (prank >= ssize(mon->monmap->ranks)) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this works -- we just pulled the rank out of the monmap, so if the Mon isn't valid won't we end up with -1 and pass this check, but still fail in the deeper assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this particular case isn't valid.
src/mon/Elector.cc
Outdated
| // Monitor no longer exists in the monmap, | ||
| // therefore, we shouldn't ping this monitor | ||
| // since we cannot lookup the address! | ||
| dout(5) << __func__ << "prank >= ranks_size ... droping to prevent " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also s/droping/dropping/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregsfortytwo Just pushed new changes addressing your comments, thanks!
69cb431
to
27d499f
Compare
|
Running 1 job on cephadm/workunits: test_orch_cli_mon for sanity check Update: Job passed with no crashes or errors. |
|
/a/ksirivad-2022-12-06_21:15:10-orch:cephadm-wip-ksirivad-fix-58155-distro-default-smithi/ 19/20 Jobs passed 1 Dead job --> infrastructure failure. |
|
jenkins test make check arm64 |
|
jenkins test windows |
2 similar comments
|
jenkins test windows |
|
jenkins test windows |
|
@gregsfortytwo gentle reminder, I've addressed your changes, PTAL |
|
jenkins test windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
…r_ping Problem: Currently, ceph#44993 failed to completely fix: https://tracker.ceph.com/issues/50089 There are certain code paths such as Elector::handle_ping → Elector::begin_peer_ping → Elector::send_peer_ping. that when a monitor is removed before shutdown in Cephadm it can hit the assert failure. Solution: Therefore, we have to enforce sanity checks on all code paths. We do this by compressing the `peer >= rank_size` sanity check into `send_peer_ping`. We also make `send_peer_ping` return true/false caller of `send_peer_ping` would drop itself if recieves a `false`. Fixes: https://tracker.ceph.com/issues/58155 Signed-off-by: Kamoltat <ksirivad@redhat.com>
27d499f
to
f4bda36
Compare
|
pushing again to resolve merge conflict after merging of #48991, trivial fix just accepts incoming change |
|
Note that we batch the test together for the commits in this PR with #48991 25 Failures mostly Infra failures, all the other failures are known failures. FAILEDjobid: [7117874] jobid: [7117883] jobid: [7117927] jobid: [7117928] jobid: [7117939] jobid: [7117948] jobid: [7117950] jobid: [7117951] (Non-infra) jobid: [7117977] jobid: [7117979, 7118058, 7118195] (Non infra) jobid: [7117988] (Non infra) jobid: [7117990] jobid: [7118010] jobid: [7118004] (Non infra) jobid: [7118114] (Non infra) DEADAll dead jobs are infra-related behavior |
|
jenkins test api |
|
jenkins test make check |
|
jenkins test windows |
|
jenkins test api |
Problem:
Currently, #44993 failed to completely fix:
https://tracker.ceph.com/issues/50089
There are certain code paths such as
Elector::handle_ping → Elector::begin_peer_ping →
Elector::send_peer_ping.
that when a monitor is removed before shutdown in
Cephadm can hit the assert failure.
Solution:
Therefore, we have to enforce sanity checks on
all code paths leading to Elector::send_peer_ping
which are:
Fixes: https://tracker.ceph.com/issues/58155
Signed-off-by: Kamoltat ksirivad@redhat.com
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows