qa/tasks/daemonwatchdog: Stop thrashers when barking#65978
Not really Crimson-specific, but Crimson would largely benefit from this, as our thrash tests keep running for 8 hrs even when the osds are failing. This results in large log files and makes it harder to find the issue.
Probably related to #65067, which was posted in https://ibm-systems-storage.slack.com/archives/C051Z80F25P/p1755274223074229 -- proposes similar changes to
The watchdog would bark if:
- a daemon has been failed for more than daemon_timeout
- any thrasher had an exception

In the latter case, we would also stop and join the thrashers. For an osd failing for more than the set timeout, we wouldn't try to stop the thrashers (even though we BARKED). This would result in failed test jobs running with failed osds for over 8 hours, until the job timeout is hit.

See, note the timestamps:
```
2025-09-03T01:32:00.471 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
teuthology.exceptions.CommandFailedError: Command failed on smithi094 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json'
2025-09-03T09:01:39.050 DEBUG:teuthology.exit:Finished running handlers
```

Fixes: https://tracker.ceph.com/issues/71417
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
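The intended behaviour can be sketched roughly as below. This is only an illustration of the idea (stop and join the thrashers on every bark, not only when a thrasher raised), not the actual `daemonwatchdog` code. `FakeThrasher`, `should_bark`, `bark` and `watch_once` are hypothetical names made up for this sketch; only `daemon_timeout` comes from the description above.

```python
class FakeThrasher:
    """Hypothetical stand-in for a thrasher; not the teuthology class."""

    def __init__(self, name):
        self.name = name
        self.exception = None  # set if the thrasher hit an error
        self.stopped = False
        self.joined = False

    def stop(self):
        self.stopped = True

    def join(self):
        self.joined = True


def should_bark(daemon_failed_at, thrashers, daemon_timeout, now):
    """Return a reason string if the watchdog should bark, else None.

    daemon_failed_at maps a daemon name to the time it was first seen failed.
    """
    for name, failed_at in daemon_failed_at.items():
        if now - failed_at > daemon_timeout:
            return f"daemon {name} failed for more than {daemon_timeout}s"
    for thrasher in thrashers:
        if thrasher.exception is not None:
            return f"thrasher {thrasher.name} raised {thrasher.exception!r}"
    return None


def bark(thrashers):
    """Stop and join every thrasher, whatever the reason for barking."""
    for thrasher in thrashers:
        thrasher.stop()
    for thrasher in thrashers:
        thrasher.join()


def watch_once(daemon_failed_at, thrashers, daemon_timeout, now):
    """One watchdog iteration: bark (and stop thrashers) if needed."""
    reason = should_bark(daemon_failed_at, thrashers, daemon_timeout, now)
    if reason is not None:
        bark(thrashers)
    return reason
```

With this shape, an osd that stays failed past `daemon_timeout` stops the thrashers in the same way a thrasher exception does, so the job no longer keeps thrashing until the job timeout.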
31b5b24 to bf40efe
The PR above proposes a similar change to the timeout and many more improvements. I don't think merging this PR conflicts with it, as all the other improvements would still be valid. I'll go ahead and merge the PR here so the nightlies are easier to debug. I'll also try to review Bill's PR. Thanks for sharing it here! added logs:
The process should also be killed, and it seems #64889 has already implemented this well! Closing.