
pacific:mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() #48803

Merged: 2 commits merged into pacific on Dec 9, 2022

Conversation

kamoltat
Member

@kamoltat commented Nov 8, 2022

Problem:
There are certain scenarios in a degraded
stretched cluster where the monitor will try
to enter Monitor::go_recovery_stretch_mode(),
which leads to a ceph_assert.

Solution:
Make sure dead_mon_buckets.size() == 0
in OSDMonitor::update_from_paxos()
before calling Monitor::go_recovery_stretch_mode()
(see the sketch below).
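
A minimal standalone sketch of the shape of this guard, for context. This is not the backported Ceph code: the stand-in Monitor struct and the maybe_go_recovery_stretch_mode() helper are invented for illustration; only dead_mon_buckets, the size check, and go_recovery_stretch_mode() come from the description above.

```cpp
// Illustrative sketch only -- the real change is in src/mon/OSDMonitor.cc
// (OSDMonitor::update_from_paxos); this stand-in Monitor just models the
// "only recover once no monitor buckets are dead" condition.
#include <iostream>
#include <map>
#include <set>
#include <string>

struct Monitor {
  // buckets (e.g. datacenters) whose monitors are currently down
  std::map<std::string, std::set<std::string>> dead_mon_buckets;
  bool degraded_stretch_mode = true;

  void go_recovery_stretch_mode() {
    // In Ceph this path contains the ceph_assert that fired when the
    // function was reached while dead_mon_buckets was still non-empty.
    std::cout << "entering recovery stretch mode\n";
  }

  // Hypothetical helper showing the shape of the added check: only
  // proceed when every previously dead monitor bucket has come back.
  void maybe_go_recovery_stretch_mode() {
    if (degraded_stretch_mode && dead_mon_buckets.size() == 0) {
      go_recovery_stretch_mode();
    }
  }
};

int main() {
  Monitor mon;
  mon.dead_mon_buckets["dc2"].insert("mon.b");
  mon.maybe_go_recovery_stretch_mode();  // skipped: dc2 is still dead
  mon.dead_mon_buckets.clear();
  mon.maybe_go_recovery_stretch_mode();  // now enters recovery
  return 0;
}
```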

Fixes:
https://tracker.ceph.com/issues/57017

Backporting relevant commits from main PR:

#47340

Signed-off-by: Kamoltat <ksirivad@redhat.com>

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

Added bug reproducer for
https://bugzilla.redhat.com/show_bug.cgi?id=2104207

Added more logs in MON.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 62fe3cb)
Problem:
There are certain scenarios in a degraded
stretched cluster where the monitor will try
to enter ``Monitor::go_recovery_stretch_mode()``,
which leads to a `ceph_assert`.

Solution:
Make sure ``dead_mon_buckets.size() == 0``
in ``OSDMonitor::update_from_paxos()``
before calling ``Monitor::go_recovery_stretch_mode()``.

Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=2104207

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit d95c41a)
@kamoltat kamoltat added this to the pacific milestone Nov 8, 2022
@kamoltat kamoltat self-assigned this Nov 8, 2022
@kamoltat kamoltat requested a review from a team as a code owner November 8, 2022 21:22
@kamoltat
Member Author

kamoltat commented Nov 9, 2022

jenkins test make check

@ljflores
Contributor

ljflores commented Dec 1, 2022

@kamoltat I found a failure that looks related. Can you take a look?

/a/yuriw-2022-11-30_15:10:52-rados-wip-yuri3-testing-2022-11-28-0750-pacific-distro-default-smithi/7098562

2022-11-30T16:20:43.717 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-stretched-cluster.sh:138: TEST_stretched_cluster_failover_add_three_osds:  run_osd td/mon-stretched-cluster 8
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:633: run_osd:  local dir=td/mon-stretched-cluster
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:634: run_osd:  shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:635: run_osd:  local id=8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:636: run_osd:  shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:637: run_osd:  local osd_data=td/mon-stretched-cluster/8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:639: run_osd:  local 'ceph_args=--fsid=dedcc9bc-5228-4881-99b7-42ec5a2680d8 --auth-supported=none  --mon-host=127.0.0.1:7139,127.0.0.1:7141,127.0.0.1:7142,127.0.0.1:7143,127.0.0.1:7144'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:640: run_osd:  ceph_args+=' --osd-failsafe-full-ratio=.99'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:641: run_osd:  ceph_args+=' --osd-journal-size=100'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:642: run_osd:  ceph_args+=' --osd-scrub-load-threshold=2000'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:643: run_osd:  ceph_args+=' --osd-data=td/mon-stretched-cluster/8'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:644: run_osd:  ceph_args+=' --osd-journal=td/mon-stretched-cluster/8/journal'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:645: run_osd:  ceph_args+=' --chdir='
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:646: run_osd:  ceph_args+=
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:647: run_osd:  ceph_args+=' --run-dir=td/mon-stretched-cluster'
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd:  get_asok_path
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:116: get_asok_path:  local name=
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:117: get_asok_path:  '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  get_asok_dir
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir:  '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir:  echo /tmp/ceph-asok.99133
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  echo '/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd:  ceph_args+=' --admin-socket=/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:649: run_osd:  ceph_args+=' --debug-osd=20'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:650: run_osd:  ceph_args+=' --debug-ms=1'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:651: run_osd:  ceph_args+=' --debug-monc=20'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:652: run_osd:  ceph_args+=' --log-file=td/mon-stretched-cluster/$name.log'

...

2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:929: wait_for_osd:  sleep 1
2022-12-01T03:36:44.129 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd:  (( i++ ))
2022-12-01T03:36:44.130 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd:  (( i < 300 ))
2022-12-01T03:36:44.131 INFO:tasks.workunit.client.0.smithi038.stdout:7
2022-12-01T03:36:44.132 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:927: wait_for_osd:  echo 7
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd:  ceph osd dump
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd:  grep 'osd.8 up'
2022-12-01T03:38:58.027 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2022-12-01T03:38:58.034 DEBUG:teuthology.task.console_log:Killing console logger for smithi038
2022-12-01T03:38:58.038 DEBUG:teuthology.exit:Finished running handlers

@ljflores
Contributor

ljflores commented Dec 1, 2022

This one also might be related:

/a/yuriw-2022-11-29_15:35:32-rados-wip-yuri3-testing-2022-11-28-0750-pacific-distro-default-smithi/7097000/

2022-11-29T19:43:49.963 INFO:teuthology.orchestra.run.smithi085.stderr:Generating new minimal ceph.conf...
2022-11-29T19:43:51.565 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:51 smithi085 bash[23745]: audit 2022-11-29T19:43:51.146567+0000 mon.a (mon.0) 10 : audit [DBG] from='client.? 172.21.15.85:0/1007346656' entity='client.admin' cmd=[{"prefix": "config generate-minimal-conf"}]: dispatch
2022-11-29T19:43:51.671 INFO:teuthology.orchestra.run.smithi085.stderr:Restarting the monitor...
2022-11-29T19:43:52.065 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:51 smithi085 systemd[1]: Stopping Ceph mon.a for 0b7cfb4a-701e-11ed-843b-001a4aab830c...
2022-11-29T19:43:52.065 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:51 smithi085 bash[24278]: Error response from daemon: No such container: ceph-0b7cfb4a-701e-11ed-843b-001a4aab830c-mon.a
2022-11-29T19:43:52.065 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:51 smithi085 bash[23745]: debug 2022-11-29T19:43:51.815+0000 7f5f25a97700 -1 received  signal: Terminated from /sbin/docker-init -- /usr/bin/ceph-mon -n mon.a -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug  --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true  (PID: 1) UID: 0
2022-11-29T19:43:52.066 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:51 smithi085 bash[23745]: debug 2022-11-29T19:43:51.815+0000 7f5f25a97700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2022-11-29T19:43:52.592 INFO:teuthology.orchestra.run.smithi085.stderr:Setting mon public_network to 172.21.0.0/20,172.21.15.254/32
2022-11-29T19:43:52.815 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:52 smithi085 bash[24278]: ceph-0b7cfb4a-701e-11ed-843b-001a4aab830c-mon-a
2022-11-29T19:43:52.815 INFO:journalctl@ceph.mon.a.smithi085.stdout:Nov 29 19:43:52 smithi085 bash[24278]: Error response from daemon: No such container: ceph-0b7cfb4a-701e-11ed-843b-001a4aab830c-mon.a

@ljflores
Contributor

ljflores commented Dec 1, 2022

And this one:

/a/yuriw-2022-11-29_15:35:32-rados-wip-yuri3-testing-2022-11-28-0750-pacific-distro-default-smithi/7096933

It seems like there's a problem when the mons restart in some of the jobs.

@ljflores
Contributor

ljflores commented Dec 9, 2022

@yuriw merged this before I had a chance to investigate (my fault). I had seen some failures I thought were related in a previous test batch, and put it through another batch since I thought there were updates. But somehow this run looked fine...

Here is the review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2022-12-07-0821-pacific

Failures, unrelated:
1. https://tracker.ceph.com/issues/57311
2. https://tracker.ceph.com/issues/58140
3. https://tracker.ceph.com/issues/58046
4. https://tracker.ceph.com/issues/58097
5. https://tracker.ceph.com/issues/54071
6. https://tracker.ceph.com/issues/53501
7. https://tracker.ceph.com/issues/56770
8. https://tracker.ceph.com/issues/54992
9. https://tracker.ceph.com/issues/58232 - new tracker created; unrelated to PRs in this batch
10. https://tracker.ceph.com/issues/56028

Details:
1. rook: ensure CRDs are installed first - Ceph - Orchestrator
2. quay.ceph.io/ceph-ci/ceph: manifest unknown - Ceph - Orchestrator
3. qa/workunits/rados/test_librados_build.sh: specify redirect in curl command - Ceph - RADOS
4. qa/workunits/post-file.sh: kex_exchange_identification: read: Connection reset by peer - Infrastructure
5. rados/cephadm/osds: Invalid command: missing required parameter hostname() - Ceph - Orchestrator
6. Exception when running 'rook' task. - Ceph - Orchestrator
7. crash: void OSDShard::register_and_wake_split_child(PG*): assert(p != pg_slots.end()) - Ceph - RADOS
8. pacific: rados/dashboard: tasks/dashboard: cannot stat '/etc/containers/registries.conf': No such file or directory - Ceph - Mgr - Dashboard
9. Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable) - Infrastructure
10. thrash_cache_writeback_proxy_none: FAILED ceph_assert(version == old_value.version) in src/test/osd/RadosModel.h - Ceph - RADOS

@kamoltat
Member Author

kamoltat commented Dec 9, 2022

This PR (#47340) introduced
https://tracker.ceph.com/issues/58239.

We are in the process of fixing it.

kamoltat added a commit to kamoltat/ceph that referenced this pull request Dec 13, 2022
…tch_mode()"

This commit belongs to ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 94dc970.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
kamoltat added a commit to kamoltat/ceph that referenced this pull request Dec 13, 2022
This commit belongs to ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 025d3fa.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
kamoltat added a commit to kamoltat/ceph that referenced this pull request Dec 13, 2022
…tch_mode()"

This commit belongs to ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 94dc970.

Fixes: https://tracker.ceph.com/issues/58239

Signed-off-by: Kamoltat <ksirivad@redhat.com>
kamoltat added a commit to kamoltat/ceph that referenced this pull request Dec 13, 2022
This commit belongs to ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 025d3fa.

Fixes: https://tracker.ceph.com/issues/58239

Signed-off-by: Kamoltat <ksirivad@redhat.com>
kamoltat added a commit to ceph/ceph-ci that referenced this pull request Dec 13, 2022
This commit belongs to ceph/ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 025d3fa.

Fixes: https://tracker.ceph.com/issues/58239

Signed-off-by: Kamoltat <ksirivad@redhat.com>
kamoltat added a commit to ceph/ceph-ci that referenced this pull request Dec 14, 2022
…tch_mode()"

This commit belongs to ceph/ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 94dc970.

Fixes: https://tracker.ceph.com/issues/58239

Signed-off-by: Kamoltat <ksirivad@redhat.com>
kamoltat added a commit to ceph/ceph-ci that referenced this pull request Dec 14, 2022
This commit belongs to ceph/ceph#48803 which
introduced https://tracker.ceph.com/issues/58239.
Therefore, we are reverting it.

This reverts commit 025d3fa.

Fixes: https://tracker.ceph.com/issues/58239

Signed-off-by: Kamoltat <ksirivad@redhat.com>
4 participants