qa/suites/fs: stop looping in mds upgrade test if upgrade failed #45361

Merged
adk3798 merged 1 commit into ceph:master on Mar 30, 2022

adk3798 commented Mar 11, 2022

Signed-off-by: Adam King <adking@redhat.com>

Testing for https://tracker.ceph.com/issues/54419

Also possibly something we want to actually merge, depending on how the results go here. Having the test fail faster if the upgrade has failed for whatever reason would be a big improvement over running for 6 hours until the job is marked dead.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added the cephfs Ceph File System label Mar 11, 2022

@kotreshhr kotreshhr left a comment

Makes sense. lgtm

adk3798 commented Mar 14, 2022

http://pulpito.front.sepia.ceph.com/adking-2022-03-11_23:05:59-orch:cephadm:mds_upgrade_sequence-wip-adk2-testing-2022-03-11-1538-distro-basic-smithi/

1 failure due to machine not being locked, unrelated infra issue
1 failure by https://tracker.ceph.com/issues/49287 (unrelated to this test in particular)

Didn't hit the actual upgrade failure I was looking for. Need another run.

@@ -15,7 +15,7 @@ upgrade-tasks:
   - cephadm.shell:
       env: [sha1]
       host.a:
-        - while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump; sleep 30 ; done
+        - while ceph orch upgrade status | jq '.in_progress' | grep true && ! ceph orch upgrade status | jq '.message' | grep Error ; do ceph orch ps ; ceph versions ; ceph fs dump; ceph orch upgrade status ; sleep 30 ; done
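
For readers following the new loop condition: the loop keeps polling only while `.in_progress` is true and `.message` does not contain `Error`. In shell, `!` negates the exit status of the whole pipeline that follows it, so the second half succeeds only when `grep Error` finds nothing. A minimal sketch of that logic, with `echo` standing in for `ceph orch upgrade status | jq ...` (the function name `should_continue` and the sample strings are hypothetical and not part of the change):

```shell
# Hypothetical helper mirroring the shape of the new loop condition.
# $1 stands in for the output of: ceph orch upgrade status | jq '.in_progress'
# $2 stands in for the output of: ceph orch upgrade status | jq '.message'
should_continue() {
  echo "$1" | grep -q true && ! echo "$2" | grep -q Error
}

# Simulated states; the message text below is made up for illustration.
should_continue true  '""'                         && echo "in progress, no error: keep polling"
should_continue true  '"Error: simulated failure"' || echo "in progress, error reported: stop"
should_continue false '""'                         || echo "not in progress: stop"
```

The loop therefore ends in two cases: the upgrade is no longer in progress, or the status message reports an error.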
Member

Suggest also exiting with status 1 after 30 minutes. No reason to let this run for 12h if something gets stuck.

@adk3798 adk3798 Mar 15, 2022

added a 30 minute timeout on the command

EDIT: later removed it, as it was causing the test to be run with the old while statement for whatever reason. Will need to figure out a new way to do the timeout at some point.
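
One possible way to bound the wait without prefixing the loop with `timeout` is to track a deadline inside the loop itself. The sketch below is not what this PR ended up doing; `wait_while` and its arguments are hypothetical names, and it assumes `date +%s` is available on the test host.

```shell
# Hypothetical helper: wait_while MAX_SECONDS INTERVAL CMD [ARGS...]
# Re-runs CMD every INTERVAL seconds for as long as it succeeds.
# Returns 0 once CMD fails (nothing left to wait for) and 1 once
# MAX_SECONDS have elapsed while CMD is still succeeding.
wait_while() {
  ww_max=$1
  ww_interval=$2
  shift 2
  ww_start=$(date +%s)
  while "$@"; do
    ww_now=$(date +%s)
    if [ $((ww_now - ww_start)) -ge "$ww_max" ]; then
      echo "timed out after ${ww_max}s" >&2
      return 1
    fi
    sleep "$ww_interval"
  done
}
```

In the upgrade test this might be invoked as `wait_while 1800 30 upgrade_still_running || exit 1`, where `upgrade_still_running` would be a function wrapping the status check. Both the function name and the exact values are assumptions for illustration.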

adk3798 commented Mar 15, 2022

jenkins test make check

adk3798 commented Mar 15, 2022

@vshankar I haven't been able to actually hit the upgrade issue while testing this. What do you think about just merging it in then if you see the test fail again you can ping me so I can take a look? It will have more info I can use for debugging and will at least fail in 30 minutes rather than 6 hours.

@vshankar

> @vshankar I haven't been able to actually hit the upgrade issue while testing this. What do you think about just merging it in then if you see the test fail again you can ping me so I can take a look? It will have more info I can use for debugging and will at least fail in 30 minutes rather than 6 hours.

Absolutely. Let's merge this. I'll let you know how further tests look...

@vshankar

jenkins test make check

adk3798 commented Mar 16, 2022

jenkins test make check

adk3798 commented Mar 16, 2022

Just noticed that this run, made after the timeout was added, seems to just be running the unchanged test: http://pulpito.front.sepia.ceph.com/adking-2022-03-15_17:49:30-orch:cephadm:mds_upgrade_sequence-wip-adk2-testing-2022-03-15-0949-distro-basic-smithi/

    - cephadm.shell:
        env:
        - sha1
        host.a:
        - while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch

But the run from before the timeout was added did pick up the changes: http://pulpito.front.sepia.ceph.com/adking-2022-03-14_12:43:51-orch:cephadm:mds_upgrade_sequence-wip-adk2-testing-2022-03-11-1538-distro-basic-smithi/

    - cephadm.shell:
        env:
        - sha1
        host.a:
        - while ceph orch upgrade status | jq '.in_progress' | grep true && ! ceph orch
          upgrade status | jq '.message' | grep Error ; do ceph orch ps ; ceph versions
          ; ceph fs dump; ceph orch upgrade status ; sleep 30 ; done

It looks like with the timeout added to the front this doesn't actually work? It just ignored the changes entirely. Not sure how this works but at least it doesn't seem to like using timeout like this.
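
One general point that may or may not be related to the behaviour above: `timeout` (GNU coreutils) executes a program, whereas `while` is a shell keyword, so a loop has to be handed to a child shell for `timeout` to bound it. Whether that is what happened in this run, or whether the `cephadm.shell` task would accept this form, is not established in this thread. A minimal sketch, with the hypothetical helper name `run_bounded`:

```shell
# Hypothetical helper: run a shell snippet in a child shell under GNU timeout.
# $1 is the time limit (for example 30m), $2 is the snippet, loops included.
run_bounded() {
  timeout "$1" sh -c "$2"
}

# An endless loop is cut off after 1 second; GNU timeout exits with
# status 124 when the limit is reached.
rc=0
run_bounded 1 'while true; do sleep 1; done' || rc=$?
echo "exit status: $rc"
```

A non-zero status from `run_bounded` could then be used to fail the test step early.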

@adk3798 adk3798 added the DNM label Mar 16, 2022
@vshankar

@adk3798 Noticed that just now -- https://pulpito.ceph.com/vshankar-2022-03-16_09:42:54-fs:upgrade-wip-vshankar-testing-20220316-102808-testing-default-smithi/6739031/

I can't even see (in teuthology log) the grep Error string you introduced :/

@vshankar

@adk3798 any progress on this?

Signed-off-by: Adam King <adking@redhat.com>

adk3798 commented Mar 22, 2022

> @adk3798 any progress on this?

@vshankar I removed the timeout bit for now. I think what's here will cover most of the cases where it's running for a long time anyhow and at the least it will give the info needed to figure out why it isn't finishing. I'd run this through tests again as it is now.

adk3798 commented Mar 30, 2022

jenkins test make check

adk3798 commented Mar 30, 2022

@vshankar do we want to merge this?

@vshankar

> @vshankar do we want to merge this?

I'm ok with this change if the runs are fine.

@adk3798 adk3798 merged commit 6e4dd0e into ceph:master Mar 30, 2022