Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quincy: mon/Elector: Change how we handle removed_ranks and notify_rank_removed() #49311

Merged

Conversation

kamoltat
Copy link
Member

@kamoltat kamoltat commented Dec 7, 2022

Problem:

Currently, there is an issue when performing a DR test with 2 sites stretch cluster
where removing monitors and adding new ones to the cluster
causes incorrect rank in ConnectionTracker class.

This causes the monitor to think that they are someone else
in the peer_tracker copy and will never update the
correct field of itself, causing a deadlock in the election process (Ceph becoming unresponsive)
when using election strategy: 3 (Connectivity mode).

Solution:

  1. It was really hard to debug the issue so the first thing we did was to add additional loggings to ConnectionTracker,
    Elector and ElectionLogic Classes.

  2. In ConnectionTracker::receive_peer_report we loop through ranks which is bad when there is notify_rank_removed before this and the ranks are not adjusted yet. When we rely on the rank in certain scenarios, we end up with extra peer_report copy which we don't want. Therefore, instead of passing report.rank in the function
    ConnectionTracker::reports, we pass i.first instead so that we trim old ranks properly. We also added an assert in notify_rank_removed(), comparing the expected rank provided by the monmap against the rank that we adjust ourselves to as a sanity check. We edited test/mon/test_election.cc to reflect the changes made in notify_rank_removed().

  3. MonMap::removed_rank does not get cleared every update of the epoch, this was the root cause of the problem. Therefore, we fix this by making MonMap clear removed_ranks every update. Moreover, When there is discontinuity between monmaps for more than 1 epoch or the new monitor never joined the quorum before,
    we always reset peer_tracker.

  4. Added a way for us to manually reset peer_tracker.rank when executing the command: ceph connection scores reset for each monitor. The peer_tracker.rank will match the current rank of the Monitor.

  5. When upgrading the monitors (including booting up), we check if peer_tracker is dirty or not. If so, we clear it. Added some functions in the Elector and ConnectionTracker classes to check for clean peer_tracker. Moreover, there could be some cases where due to startup weirdness or abnormal circumstances, we might get a report from our own rank. Therefore, it doesn't hurt to add a sanity check in ConnectionTracker::report_live_connection and ConnectionTracker::report_dead_connection.

  6. --mon-initial-members does nothing but cause monmap
    to populate removed_ranks, therefore, remove all instances of --mon-initial-members in the standalone test as it has no impact on the nature of the tests themselves.

  7. Monitor::notify_new_monmap() skips removal of non-exist rank
    In RHCS the user can choose to manually remove a monitor rank
    before shutting the monitor down. Causing inconsistency in monmap.
    Therefore, in Monitor::notify_new_monmap() we prevent the function from going into removing our own rank or ranks that don't exist in monmap.

Fixes: https://tracker.ceph.com/issues/58049
Fixes: https://tracker.ceph.com/issues/58132

Backporting relevant commits from the main PR:

#48991

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@kamoltat kamoltat added this to the quincy milestone Dec 7, 2022
@kamoltat kamoltat requested a review from a team as a code owner December 7, 2022 19:49
@kamoltat kamoltat self-assigned this Dec 7, 2022
@kamoltat
Copy link
Member Author

kamoltat commented Dec 8, 2022

jenkins test make check

@kamoltat
Copy link
Member Author

kamoltat commented Dec 8, 2022

jenkins test windows

@kamoltat
Copy link
Member Author

kamoltat commented Dec 9, 2022

jenkins test make check

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; the right commits are included and the conflicts were just changing unsigned->int. [Please note that in future; it makes checking easier :) ]

@kamoltat
Copy link
Member Author

jenkins test windows

@kamoltat
Copy link
Member Author

0 untracked failures, RADOS approved

Failures

jobid: [7123713]
description: rados/upgrade/parallel/{0-random-distro$/{rhel_8.4_container_tools_3.0} 0-start 1-tasks mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
failure_reason: qa/workunits/cls/test_cls_rbd.sh'
traceback: TestClsRbd.group_snap_list_max_read failure
tracker: https://tracker.ceph.com/issues/58265
created_tracker:

jobid: [7123747, 7123902]
description: rados/singleton-nomsgr/{all/ceph-post-file mon_election/classic rados supported-random-distro$/{ubuntu_latest}}
failure_reason: /home/ubuntu/cephtest/clone.client.0/qa/workunits/post-file.sh
traceback: tasks.workunit:Stopping ['post-file.sh'] on client.0...
tracker: https://tracker.ceph.com/issues/58097
created_tracker:

jobid: [7123774, 7123853, 7123929, 7124006]
description: rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench 3-final cluster/3-node k8s/1.21 net/host rook/1.7.2}
failure_reason: sudo kubeadm init --node-name smithi008 --token abcdef.og2z04kks4haenhg --pod-network-cidr 10.248.56.0/21'
traceback: rook: kubelet fails from connection refused
tracker: https://tracker.ceph.com/issues/58258
created_tracker:

jobid: [7123797, 7123925]
description: rados/cephadm/workunits/{0-distro/centos_8.stream_container_tools agent/off mon_election/classic task/test_cephadm}
failure_reason: /home/ubuntu/cephtest/clone.client.0/qa/workunits/cephadm/test_cephadm.sh
traceback: quay.ceph.io/ceph-ci/ceph: manifest unknown
tracker: https://tracker.ceph.com/issues/58140
created_tracker:

jobid: [7123808, 7123867, 7123962]
description: rados/singleton-nomsgr/{all/librados_hello_world mon_election/connectivity rados supported-random-distro$/{ubuntu_latest}}
failure_reason: /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_librados_build.sh'
traceback:
tracker: https://tracker.ceph.com/issues/58117
created_tracker:


DEAD
All 8 failures are infrastructure known failures: error reimaging machines

…ElectionLogic & Elector

Problem:

Currently there are not ConnectionTracker logs,
therefore it is really hard to debug

Solution:

Enable loggings for most functions in ConnectionTracker.cc
Most of the logs are in debug_mon = 30.

Also Added some logs in Elector and ElectionLogic
so that debugging will be easier in the future.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 58f2bd4)
PROBLEM:

In `ConnectionTracker::receive_peer_report`
we loop through ranks which is bad when
there is `notify_rank_removed` before this and
the ranks are not adjusted yet. When we rely
on the rank in certain scenarios, we end up
with extra peer_report copy which we don't
want.

SOLUTION:

In `ConnectionTracker::receive_peer_report`
instead of passing `report.rank` in the function
`ConnectionTracker::reports`, we pass `i.first`
instead so that trim old ranks properly.

We also added a assert in notify_rank_removed(),
comparing expected rank provided by the monmap
against the rank that we adjust ourself to as
a sanity check.

We edited test/mon/test_election.cc
to reflect the changes made in notify_rank_removed().

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 7c52cce)

Conflicts:
	src/mon/Elector.cc - trivial fix
	src/mon/Elector.h - trivial fix
when a new monitor joins, there is a chance that
it will recive a monmap that recently removed
a monitor and ``removed_rank`` will have some
content in it. A new monitor that joins
should never remove rank in peer_tracker but
rather call ``notify_clear_peer_state()``
to reset the `peer_report`.

In the case when it is a monitor that
has joined quorum before and is only 1
epoch behind the newest monmap provided
by the probe_replied monitor. We can
actually remove and adjust ranks in `peer_report`
since we are sure that if there is any content in
removed_ranks, then it has to be because in the
next epoch we are removing a rank, since every
update of an epoch we always clear the removed_ranks.

There is no point in keeping the content
of ``removed_ranks`` after monmap gets updated
to the epoch.

Therefore, clear ``removed_ranks`` every update.

When there is discontinuity between
monmaps for more 1 epoch or the new monitor never joined quorum before,
we always reset `peer_tracker`.

Moreover, beneficial for monitor log to also log
which rank has been removed at the current time
of the monmap. So add removed_ranks to `print_summary`
and `dump` in MonMap.cc.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 0440257)

Conflicts:
	src/mon/Monitor.cc - trivial fix
In `notify_clear_peer_state()` we another
mechanism in reseting our `peer_tracker.rank`
to match our own monitor.rank.

This is added so there is a way for us
to recover from a scenrio where `peer_tracker.rank`
is messed up from adjusting the ranks or removing
ranks.

`notifiy_clear_peer_state()` can be triggered
by using the command:

`ceph connection scores reset`

Also in `clear_peer_reports`, besides
reassigning my_reports to an empty object,
we also have to make `my_reports` = `rank`
from `peer_tracker`, such that we don't get
-1 as a rank in my_reports.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 55cf717)
…d connection report

When upgrading the monitors (include booting up),
we check if `peer_tracker` is dirty or not. If
so, we clear it. Added some functions in `Elector` and
`ConnectionTracker` class to
check for clean `peer_tracker`.

Moreover, there could be some cases where due
to startup weirdness or abnormal circumstances,
we might get a report from our own rank. Therefore,
it doesn't hurt to add a sanity check in
`ConnectionTracker::report_live_connection` and
`ConnectionTracker::report_dead_connection`.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 25ce77c)
Problem:

--mon-initial-members does nothing but causes monmap
to populate ``removed_ranks`` because the way we start
monitors in standalone tests uses ``run_mon $dir $id ..``
on each mon. Regardless of --mon-initial-members=a,b,c, if
we set --mon-host=$MONA,$MONB,$MONC (which we do every single tests),
everytime we run a monitor (e.g.,run mon.b) it will pre-build
our monmap with

```
noname-a=mon.noname-a addrs v2:127.0.0.1:7127/0,
b=mon.b addrs v2:127.0.0.1:7128/0,
noname-c=mon.noname-c addrs v2:127.0.0.1:7129/0,
```

Now, with --mon-initial-members=a,b,c we are letting
monmap know that we should have initial members name:
a,b,c, which we only have `b` as a match. So what
``MonMap::set_initial_members`` do is that it will
remove noname-a and noname-c which will
populate `removed_ranks`.

Solution:

remove all instances of --mon-initial-members
in the standalone test as it has no impact on
the nature of the tests themselves.

Fixes: https://tracker.ceph.com/issues/58132

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit e1c095d)
Problem:
In RHCS the user can choose to manually remove a monitor rank
before shutting the monitor down. Causing inconsistency in monmap.
for example we remove mon.a from the monmap, there is a short period
where mon.a is still operational and will try to remove itself from
monmap but we will run into an assertion in
ConnectionTracker::notify_ranks_removed().

Solution:
In Monitor::notify_new_monmap() we prevent the func
from going into removing our own rank, or
ranks that doesn't exists in monmap.

FYI: this is an RHCS problem only, in ODF,
we never remove a monitor from monmap
before shutting it down.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 924e7ec)
@kamoltat kamoltat force-pushed the wip-ksirivad-backport-quincy-bz-2121452 branch from 4386d18 to 45bbede Compare December 21, 2022 21:57
@kamoltat
Copy link
Member Author

jenkins test make check

@kamoltat kamoltat merged commit 93c5cfa into ceph:quincy Dec 22, 2022
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants