
osd/OSDMap: Add health warning if 'require-osd-release' != current release #44090

Merged on Dec 8, 2021 (2 commits)


@sseshasa (Contributor) commented Nov 24, 2021

After all OSDs are upgraded to a new release, generate a health warning if
the 'require-osd-release' flag doesn't match the new release version.
The cluster will then report a warning in its health state until
the flag is set properly.

Fixes: https://tracker.ceph.com/issues/51984
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>


@sseshasa (Contributor Author)

NOTE: I will run the existing teuthology upgrade tests using a private branch and verify that the health warning is generated if 'require-osd-release' is not set properly.

Review comment on src/osd/OSDMap.cc (outdated, resolved)
@ronen-fr (Contributor)

LGTM, apart from what might be a missing '*'.

@ronen-fr self-requested a review on November 25, 2021 at 13:50
Review comment on src/osd/OSDMap.cc (outdated, resolved)
@sseshasa (Contributor Author)

jenkins test api

@neha-ojha (Member) left a comment

LGTM, but this needs to go through the rados and upgrade suites (since the cephadm-based tests already set this flag, you can use the octopus-x/*-no-cephadm tests for this purpose).

@neha-ojha (Member)

jenkins test api

@sseshasa (Contributor Author) commented Dec 1, 2021

jenkins test make check arm64

@ronen-fr (Contributor) commented Dec 1, 2021

jenkins test make check arm64

@sseshasa : note that this one fails constantly (and is not, I think, a blocker)

@sebastian-philipp (Contributor)

can we also run this through the orch/cephadm/upgrade suite?

@neha-ojha (Member)

> can we also run this through the orch/cephadm/upgrade suite?

Running this through the rados suite, which includes https://github.com/ceph/ceph/blob/master/qa/suites/rados/cephadm, should cover orch/cephadm/upgrade.

@sseshasa (Contributor Author) commented Dec 2, 2021

The teuthology upgrade tests were modified so that the health warning is generated. As expected, most of the tests failed after timing out while waiting for the cluster to become healthy.

See: https://pulpito.ceph.com/sseshasa-2021-12-01_19:19:56-upgrade-wip-fix-require-osd-release-testing-5-distro-default-smithi/

@sseshasa (Contributor Author) commented Dec 3, 2021

jenkins test make check

@sseshasa (Contributor Author) commented Dec 8, 2021

Teuthology Test Results:

http://pulpito.front.sepia.ceph.com/yuriw-2021-12-07_16:02:55-rados-wip-yuri11-testing-2021-12-06-1619-distro-default-smithi/

Unrelated Failures:

  1. 16/22 jobs related to mds_upgrade_sequence failed.
    Predominant failure reason:
    Command failed on smithi001 with status 32: 'sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,conf=/etc/ceph/ceph.conf,norbytes,name=0,mds_namespace=cephfs,nofallback'

    Existing tracker: https://tracker.ceph.com/issues/53487

  2. http://pulpito.front.sepia.ceph.com/yuriw-2021-12-07_16:02:55-rados-wip-yuri11-testing-2021-12-06-1619-distro-default-smithi/6550482
    http://pulpito.front.sepia.ceph.com/yuriw-2021-12-07_16:02:55-rados-wip-yuri11-testing-2021-12-06-1619-distro-default-smithi/6550909
    Command failed (workunit test cephadm/test_dashboard_e2e.sh) on smithi101 with status 1
    New tracker created: https://tracker.ceph.com/issues/53499

  3. http://pulpito.front.sepia.ceph.com/yuriw-2021-12-07_16:02:55-rados-wip-yuri11-testing-2021-12-06-1619-distro-default-smithi/6550484
    [Errno 2] Cannot find file on the remote 'ubuntu@smithi027.front.sepia.ceph.com': 'rook/cluster/examples/kubernetes/ceph/operator.yaml'
    New tracker created: https://tracker.ceph.com/issues/53501

  4. http://pulpito.front.sepia.ceph.com/yuriw-2021-12-07_16:02:55-rados-wip-yuri11-testing-2021-12-06-1619-distro-default-smithi/6550873
    Valgrind error during handle_recovery_delete()
    Existing tracker: https://tracker.ceph.com/issues/52124

  5. http://pulpito.front.sepia.ceph.com/yuriw-2021-12-07_16:02:55-rados-wip-yuri11-testing-2021-12-06-1619-distro-default-smithi/6550960
    ceph_abort_msg("block checksum mismatch: stored = 632344828, computed = 928675071 in db/000047.sst offset 264241 size 3745")
    Existing tracker: https://tracker.ceph.com/issues/47453

@batrick (Member) commented Dec 15, 2021

For next time, please run this through the upgrade suite. This change caused a regression: https://tracker.ceph.com/issues/53615
