
mon/Elector.cc Added additional prank >= ranks_size sanity check #49259

Merged
merged 1 commit into ceph:main from wip-ksirivad-fix-58155 on Dec 22, 2022

Conversation

Member

@kamoltat kamoltat commented Dec 5, 2022

Problem:

Currently, #44993 failed to completely fix:

https://tracker.ceph.com/issues/50089

There are certain code paths, such as

Elector::handle_ping → Elector::begin_peer_ping →
Elector::send_peer_ping,

that can hit the assert failure when a monitor is
removed before shutdown in Cephadm.

Solution:

Therefore, we have to enforce sanity checks on
all code paths leading to Elector::send_peer_ping
which are:

  • Elector::begin_peer_ping
  • Elector::ping_check
  • Elector::handle_ping
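The guard added to each of these callers can be sketched as a small standalone predicate. This is illustrative only: the helper name and the `RankList` alias are stand-ins, not the actual Ceph types.

```cpp
#include <string>
#include <vector>

// Stand-in for mon->monmap->ranks; illustrative only.
using RankList = std::vector<std::string>;

// The sanity check each caller performs before entering the ping path:
// a monitor that was removed from the monmap ends up with a rank outside
// [0, ranks.size()), and we must not try to look up its address, so the
// ping is dropped instead of tripping the deeper assert.
bool peer_rank_is_valid(int peer, const RankList& ranks) {
  return peer >= 0 && static_cast<std::size_t>(peer) < ranks.size();
}
```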

Fixes: https://tracker.ceph.com/issues/58155

Signed-off-by: Kamoltat <ksirivad@redhat.com>

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@kamoltat kamoltat self-assigned this Dec 5, 2022
@kamoltat kamoltat requested a review from a team as a code owner December 5, 2022 19:19
Member

@gregsfortytwo gregsfortytwo left a comment


Should we be compressing these checks into the send_peer_ping function instead of open-coding them above? It could return success, or else remove the peer from live_pinging and return failure, and have callers bail out. That way we don't need to remember this for future callers too.
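The refactor suggested here can be sketched as a minimal model. The `MiniElector` struct and its `live_pinging` set are simplified stand-ins for the real Elector state, assumed for illustration; this is not the actual Ceph implementation.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Minimal model of the suggestion: the rank check lives inside
// send_peer_ping, which reports success/failure so every caller
// bails out uniformly instead of open-coding the guard.
struct MiniElector {
  std::vector<std::string> ranks;  // stand-in for mon->monmap->ranks
  std::set<int> live_pinging;      // peers we are actively pinging

  bool send_peer_ping(int peer) {
    if (peer < 0 || static_cast<std::size_t>(peer) >= ranks.size()) {
      live_pinging.erase(peer);  // peer left the monmap: forget it
      return false;              // caller must drop the ping
    }
    // ... construct and send the actual ping message here ...
    return true;
  }

  void begin_peer_ping(int peer) {
    if (!send_peer_ping(peer))
      return;  // bail out instead of asserting deeper in the stack
    live_pinging.insert(peer);
  }
};
```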

// Monitor no longer exists in the monmap,
// therefore, we shouldn't ping this monitor
// since we cannot lookup the address!
dout(5) << __func__ << "peer >= ranks_size ... droping to prevent "
Member


s/droping/dropping/

@@ -496,7 +506,8 @@ void Elector::ping_check(int peer)
// Monitor no longer exists in the monmap,
// therefore, we shouldn't ping this monitor
// since we cannot lookup the address!
dout(20) << __func__ << "peer >= ranks_size" << dendl;
dout(5) << __func__ << "peer >= ranks_size ... droping to prevent "
Member


s/droping/dropping/

Do we still want to keep this at 5? I guess it's not a common operation so no harm.

@@ -566,6 +577,15 @@ void Elector::handle_ping(MonOpRequestRef op)
dout(10) << __func__ << " " << *m << dendl;

int prank = mon->monmap->get_rank(m->get_source_addr());
if (prank >= ssize(mon->monmap->ranks)) {
Member


I don't think this works -- we just pulled the rank out of the monmap, so if the Mon isn't valid won't we end up with -1 and pass this check, but still fail in the deeper assert?

Member Author

@kamoltat kamoltat Dec 5, 2022


You're right, this particular case isn't valid.
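The gap the reviewer points out can be seen with two toy predicates (illustrative, not the actual patch): `monmap->get_rank()` returns -1 for an address that is no longer in the map, and -1 is never `>= ranks_size`, so the removed-monitor case would slip past the check as originally written.

```cpp
// The guard as written in the hunk under review: misses a negative rank,
// so a monitor already removed from the monmap (rank -1) slips through
// and still hits the deeper assert.
bool original_guard_rejects(int prank, int ranks_size) {
  return prank >= ranks_size;
}

// A guard that also catches get_rank()'s -1 "not found" result.
bool stricter_guard_rejects(int prank, int ranks_size) {
  return prank < 0 || prank >= ranks_size;
}
```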

// Monitor no longer exists in the monmap,
// therefore, we shouldn't ping this monitor
// since we cannot lookup the address!
dout(5) << __func__ << "prank >= ranks_size ... droping to prevent "
Member


Also s/droping/dropping/

Member Author


@gregsfortytwo Just pushed new changes addressing your comments, thanks!

@kamoltat kamoltat force-pushed the wip-ksirivad-fix-58155 branch 2 times, most recently from 69cb431 to 27d499f Compare December 5, 2022 21:30
@kamoltat
Member Author

kamoltat commented Dec 6, 2022

Running 1 job on cephadm/workunits: test_orch_cli_mon for sanity check
https://pulpito.ceph.com/ksirivad-2022-12-06_20:03:39-orch:cephadm-wip-ksirivad-fix-58155-distro-default-smithi/

Update:

Job passed with no crashes or errors.

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

/a/ksirivad-2022-12-06_21:15:10-orch:cephadm-wip-ksirivad-fix-58155-distro-default-smithi/

19/20 Jobs passed

1 Dead job --> infrastructure failure.

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

jenkins test make check arm64

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

jenkins test windows

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test windows

@kamoltat
Member Author

jenkins test windows

@kamoltat
Member Author

kamoltat commented Dec 13, 2022

@gregsfortytwo gentle reminder: I've addressed your comments, PTAL.

@kamoltat
Member Author

jenkins test windows

Member

@gregsfortytwo gregsfortytwo left a comment


👍

…r_ping

Problem:

Currently, ceph#44993
failed to completely fix:

https://tracker.ceph.com/issues/50089

There are certain code paths, such as

Elector::handle_ping → Elector::begin_peer_ping →
Elector::send_peer_ping,

that can hit the assert failure when a monitor is
removed before shutdown in Cephadm.

Solution:

Therefore, we have to enforce sanity checks on
all code paths.

We do this by compressing the `peer >= ranks_size`
sanity check into `send_peer_ping`. We also make
`send_peer_ping` return true/false; the caller of
`send_peer_ping` drops the ping if it receives
`false`.

Fixes: https://tracker.ceph.com/issues/58155

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@kamoltat
Member Author

kamoltat commented Dec 15, 2022

Pushing again to resolve a merge conflict after the merge of #48991; trivial fix that just accepts the incoming change.

@kamoltat
Member Author

Note that we batched the tests for the commits in this PR together with #48991.
The merge conflict was resolved during the batching process for testing.

25 failures, mostly infra failures; all the other failures are known failures.


FAILED

jobid: [7117874]
description: rados/singleton-nomsgr/{all/osd_stale_reads mon_election/classic rados supported-random-distro$/{rhel_8}}
failure_reason: Server connection dropped:
traceback:
tracker:
created_tracker: https://tracker.ceph.com/issues/58283

jobid: [7117883]
description: rados/cephadm/osds/{0-distro/ubuntu_20.04 0-nvme-loop 1-start 2-ops/repave-all}
failure_reason: Command failed on smithi094 with status 100: 'sudo apt-get clean'
traceback:
tracker:
created_tracker:

jobid: [7117927]
description: rados/multimon/{clusters/9 mon_election/connectivity msgr-failures/many msgr/async no_pools objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8} tasks/mon_clock_no_skews}
failure_reason: Stale jobs detected, aborting.
traceback:
tracker:
created_tracker:

jobid: [7117928]
description: rados/singleton/{all/lost-unfound-delete mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-hybrid rados supported-random-distro$/{centos_8}}
failure_reason: {'smithi093.front.sepia.ceph.com': {'changed': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi093.front.sepia.ceph.com port 22: No route to host', 'unreachable': True}}
traceback:
tracker:
created_tracker:

jobid: [7117939]
description:
failure_reason: {'smithi089.front.sepia.ceph.com': {'_ansible_no_log': False, 'msg': "failed to transfer file to /home/teuthworker/.ansible/tmp/ansible-local-12547kwq9iumd/tmpb7be8wx6 /home/ubuntu/.ansible/tmp/ansible-tmp-1671089653.5063088-18352-262171494946209/AnsiballZ_file.py:\n\nWarning: Permanently added 'smithi089.front.sepia.ceph.com,172.21.15.89' (ECDSA) to the list of known hosts.\r\ndd: failed to open '/home/ubuntu/.ansible/tmp/ansible-tmp-1671089653.5063088-18352-262171494946209/AnsiballZ_file.py': No such file or directory\n"}}
traceback:
tracker:
created_tracker:

jobid: [7117948]
description: rados/thrash-erasure-code-shec/{ceph clusters/{fixed-4 openstack} mon_election/classic msgr-failures/osd-dispatch-delay objectstore/filestore-xfs rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/ec-rados-plugin=shec-k=4-m=3-c=2}
failure_reason: Stale jobs detected, aborting.
traceback:
tracker:
created_tracker:

jobid: [7117950]
description: rados/cephadm/osds/{0-distro/centos_8.stream_container_tools_crun 0-nvme-loop 1-start 2-ops/rm-zap-flag}
failure_reason: {'smithi016.front.sepia.ceph.com': {'changed': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi016.front.sepia.ceph.com port 22: No route to host', 'unreachable': True}}
traceback:
tracker:
created_tracker:

jobid: [7117951] (Non-infra)
description: rados/singleton-nomsgr/{all/ceph-post-file mon_election/connectivity rados supported-random-distro$/{ubuntu_latest}}
failure_reason: Command failed (workunit test post-file.sh) on smithi093 with status 255: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/post-file.sh'
traceback: connect to host drop.ceph.com port 22: Connection timed out
tracker: https://tracker.ceph.com/issues/58097
created_tracker:

jobid: [7117977]
description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay rados recovery-overrides/{more-async-recovery} supported-random-distro$/{ubuntu_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-fast-read-overwrites}
failure_reason: [Errno None] Unable to connect to port 22 on 172.21.15.89
traceback:
tracker: https://tracker.ceph.com/issues/47343
created_tracker:

jobid: [7117979, 7118058, 7118195] (Non infra)
description: rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench cluster/3-node k8s/1.21 net/calico rook/1.7.2}
failure_reason: Command failed on smithi037 with status 1: 'sudo kubeadm init --node-name smithi037 --token abcdef.q9zcptfltqth0cc4 --pod-network-cidr 10.249.32.0/21'
traceback:
tracker: https://tracker.ceph.com/issues/58258
created_tracker:

jobid: [7117988] (Non infra)
description: rados/objectstore/{backends/objectstore-bluestore-a supported-random-distro$/{rhel_8}}
failure_reason: Command failed on smithi174 with status 1: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c 'mkdir $TESTDIR/archive/ostest && cd $TESTDIR/archive/ostest && ulimit -Sn 16384 && CEPH_ARGS="--no-log-to-stderr --log-file $TESTDIR/archive/ceph_test_objectstore.log --debug-bluestore 20" ceph_test_objectstore --gtest_filter=*/2:-SyntheticMatrixC --gtest_catch_exceptions=0''
traceback: [ FAILED ] ObjectStore/StoreTestSpecificAUSize.SpilloverTest/2, where GetParam() = "bluestore"
tracker: https://tracker.ceph.com/issues/58256
created_tracker:

jobid: [7117990]
description: rados/perf/{ceph mon_election/classic objectstore/bluestore-basic-min-osd-mem-target openstack scheduler/dmclock_1Shard_16Threads settings/optimized ubuntu_latest workloads/radosbench_4M_write}
failure_reason: {'smithi089.front.sepia.ceph.com': {'_ansible_no_log': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi089.front.sepia.ceph.com port 22: No route to host'}}
traceback:
tracker:
created_tracker:

jobid: [7118010]
description: rados/cephadm/workunits/{0-distro/rhel_8.6_container_tools_3.0 agent/off mon_election/classic task/test_iscsi_pids_limit/{centos_8.stream_container_tools test_iscsi_pids_limit}}
failure_reason: package container-selinux-2:2.189.0-1.module_el8.7.0+1217+ea57d1f1.noarch conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1.module_el8.7.0+1217+ea57d1f1.noarch
tracker:
created_tracker:

jobid: [7118004] (Non infra)
description: rados/standalone/{supported-random-distro$/{ubuntu_latest} workloads/osd}
failure_reason: Command failed (workunit test osd/divergent-priors.sh) on smithi040 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh'
traceback: 2022-12-15T08:34:31.500 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:818: TEST_divergent_3: echo failure
2022-12-15T08:34:31.500 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:819: TEST_divergent_3: return 1
2022-12-15T08:34:31.501 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:46: run: return 1
tracker: https://tracker.ceph.com/issues/56034
created_tracker:

jobid: [7118114] (Non infra)
description: rados/singleton/{all/test-crash mon_election/classic msgr-failures/few msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{ubuntu_latest}}
failure_reason: Command failed (workunit test rados/test_crash.sh) on smithi055 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_crash.sh'
traceback: [ 0 = 4 ]
tracker: https://tracker.ceph.com/issues/58098
created_tracker:


DEAD

All dead jobs are due to infra-related failures.

@kamoltat
Member Author

jenkins test api

@kamoltat
Member Author

jenkins test make check

@kamoltat
Member Author

jenkins test windows

@kamoltat
Member Author

jenkins test api

@kamoltat kamoltat merged commit d6b08ad into ceph:main Dec 22, 2022
11 checks passed
@kamoltat kamoltat added needs-quincy-backport backport required for quincy needs-pacific-backport PR needs a pacific backport labels May 15, 2023
@kamoltat kamoltat removed needs-quincy-backport backport required for quincy needs-pacific-backport PR needs a pacific backport labels May 15, 2023