
qa/cephadm: start upgrade tests from quincy #52881

Merged
merged 3 commits into ceph:main on Sep 11, 2023

Conversation

adk3798
Contributor

@adk3798 adk3798 commented Aug 8, 2023

Now that reef is released, on main we should only need to start our upgrade tests from quincy. This PR changes the starting version of the upgrade tests that run as part of the orch/cephadm suite (although one is symlinked from the fs suite).
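
For context, the starting point of these upgrade suites is pinned in a small yaml fragment that bootstraps a cluster on the old release and then upgrades it to the build under test. A rough sketch of the kind of fragment involved, with field names modeled on existing suite files (illustrative assumptions only, not this PR's actual diff):

tasks:
- cephadm:
    # bootstrap the initial cluster from a quincy (v17.2.x) image rather than a
    # pacific (v16.2.x) one; this starting version is what the PR bumps
    image: quay.io/ceph/ceph:v17.2.0
    cephadm_branch: v17.2.0
    cephadm_git_url: https://github.com/ceph/ceph
- cephadm.shell:
    env: [sha1]
    mon.a:
      # then upgrade to the build under test
      - ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1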


@adk3798 adk3798 added cephfs Ceph File System tests cephadm labels Aug 8, 2023
@adk3798 adk3798 requested a review from a team as a code owner August 8, 2023 16:37
@adk3798
Contributor Author

adk3798 commented Aug 8, 2023

@dparmar18 @batrick I made a best effort attempt at updating the start version for the mds upgrade sequence tests. Would appreciate feedback on the commit updating those tests.

@adk3798 adk3798 requested a review from dparmar18 August 8, 2023 16:39
@batrick batrick requested a review from a team August 8, 2023 19:21
Member

@batrick batrick left a comment

Please also add qa/suites/fs/upgrade/mds_upgrade_sequence/tasks/0-from/reef.yaml too.

We should have had quincy already but it was forgotten. Let's not forget reef :)

@adk3798
Contributor Author

adk3798 commented Aug 9, 2023

Please also add qa/suites/fs/upgrade/mds_upgrade_sequence/tasks/0-from/reef.yaml too.

We should have had quincy already but it was forgotten. Let's not forget reef :)

Added in a reef yaml.
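
For reference, a 0-from fragment like that pins the install branch and the container image the cluster is bootstrapped from before the upgrade runs. A rough sketch of what the new reef yaml could look like, modeled on the shape of the existing 0-from files and the diff discussed further down (values are assumptions, not the file's exact contents):

tasks:
- install:
    branch: reef                            # install reef packages on the nodes first
- print: "**** done installing reef"
- cephadm:
    image: quay.ceph.io/ceph-ci/ceph:reef   # bootstrap the cluster from a reef container
    roleless: true
    compiled_cephadm_branch: reef           # pull in the compiled reef cephadm package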

@adk3798
Contributor Author

adk3798 commented Aug 15, 2023

The regular upgrade and mgr-nfs-upgrade tests are working here, but mds_upgrade_sequence is not. The failure from reef is something on the cephadm side: we need to implement a way to pull in the new compiled cephadm package, as the current curl from git doesn't work for reef onward. For the quincy start point it seems the umount it does at the end (e.g. sudo umount /home/ubuntu/cephtest/mnt.0) is hanging the same way it did when the start point of the test was pacific. That will have to be looked at and fixed as a separate issue from this PR, I think.

Some of the failed quincy-start-point mds_upgrade_sequence runs:
https://pulpito.ceph.com/adking-2023-08-15_13:44:09-orch:cephadm-wip-adk-testing-2023-08-14-1902-distro-default-smithi/7368074/
https://pulpito.ceph.com/adking-2023-08-15_13:44:09-orch:cephadm-wip-adk-testing-2023-08-14-1902-distro-default-smithi/7368099

@dparmar18
Contributor

For the quincy start point it seems the umount it does at the end (e.g. sudo umount /home/ubuntu/cephtest/mnt.0) is hanging the same way it did when the start point of the test was pacific.

Do we have any tracker for this or any logs to look at?

@adk3798
Contributor Author

adk3798 commented Aug 16, 2023

For the quincy start point it seems the umount it does at the end (e.g. sudo umount /home/ubuntu/cephtest/mnt.0) is hanging the same way it did when the start point of the test was pacific.

Do we have any tracker for this or any logs to look at?

Haven't made a tracker yet. The only logs would be what you can get from the runs I linked.

- cephadm:
    image: quay.ceph.io/ceph-ci/ceph:reef
    roleless: true
    compiled_cephadm_branch: reef
Contributor Author

The good news is that this is actually working in terms of pulling in the reef binary. The bad news is that the upgrade still fails with:

Upgrade: Paused due to UPGRADE_BAD_TARGET_VERSION: Upgrade: cannot upgrade/downgrade to 18.0.0-5596-gdb1309a8

I think that's because it considers an upgrade from 18.2.0 -> 18.0.0 an unsupported downgrade. Pretty sure we didn't add quincy as the new base point for upgrades in the reef cycle until main was already reporting v18, so we didn't hit this issue then. Will need to see if there's some workaround we can use for this case.

Contributor Author

The mds_upgrade_sequence from quincy passed as well: https://pulpito.ceph.com/adking-2023-08-19_17:52:39-orch:cephadm-wip-adk-testing-2023-08-19-1107-distro-default-smithi/7373907. We could also consider dropping the reef start point temporarily to get the rest of this in, and then come back when we either have a good workaround or main starts reporting v19.

Contributor Author

Decided to move the reef start point work into #53105 so that we can get this through and at least have the upgrades start from quincy instead of pacific.

We're now past the reef release, so main is now what
will become squid and we should only be testing upgrades
to squid from quincy onward

Signed-off-by: Adam King <adking@redhat.com>

Now that we're post reef release, the upgrade tests
on main should be starting their upgrades from quincy
rather than pacific

Signed-off-by: Adam King <adking@redhat.com>

Now that reef has been released, on main we
only need to test upgrades starting from quincy
and upgrades from pacific are no longer valid

Signed-off-by: Adam King <adking@redhat.com>
@adk3798 adk3798 removed the DNM label Aug 23, 2023
@adk3798
Contributor Author

adk3798 commented Sep 11, 2023

https://pulpito.ceph.com/adking-2023-09-07_12:40:41-orch:cephadm-wip-adk-testing-2023-09-06-1611-distro-default-smithi/

3 failures

  • 1 failure in the test_nfs task. This test had been blocked from running properly for a while due to https://tracker.ceph.com/issues/55986, which was recently resolved. It seems it's just generally a bit broken at the moment and will need some more work, but that shouldn't block merging the set of PRs in the run.
  • 1 failure deploying jaeger-tracing. Known issue https://tracker.ceph.com/issues/59704
  • 1 strange failure in the mgr-nfs-upgrade sequence. It was failing while redeploying the first mgr as part of the upgrade. Interactive reruns allowed me to find that the issue was:
2023-09-08 19:03:19,673 7f017b1f1b80 DEBUG Determined image: 'quay.ceph.io/ceph-ci/ceph@sha256:29eb1b22bdc86e11facd8e3b821e546994d614ae2a0aec9d47234c7aede558d5'
2023-09-08 19:03:19,693 7f017b1f1b80 INFO Redeploy daemon mgr.smithi012.wqsagl ...
2023-09-08 19:06:22,875 7f017b1f1b80 INFO Non-zero exit code 1 from systemctl daemon-reload
2023-09-08 19:06:22,875 7f017b1f1b80 INFO systemctl: stderr Failed to reload daemon: Connection timed out

which is particularly odd because systemctl daemon-reload isn't even a command specific to the mgr's systemd unit. If it had been failing while starting the systemd unit for the mgr, it could maybe be traced back to something with the mgr in the current build, but for whatever reason it was timing out during the daemon-reload. I would have considered it a weird one-off if it weren't for the fact that it reproduced 3 times in a row. Not really sure what to make of it, but either way I don't think we should hold up other PRs merging for it. It will just need some more investigation in the future.

Overall, I think we can merge the PRs from the run.

@adk3798 adk3798 merged commit 2b83983 into ceph:main Sep 11, 2023
10 of 12 checks passed
@ljflores
Contributor

@adk3798 can you check if this PR is causing this bug? https://tracker.ceph.com/issues/63778
