New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
qa/cephfs: add sleep to test_cephfs_shell.py to avoid race #52822
base: main
Are you sure you want to change the base?
Conversation
6062f5f
to
8841006
Compare
Have you considered the possibility of stale disk usage stats getting reported which resolves itself after the sleep it added (giving it enough time to sync up the stats)? |
8841006
to
c3e18f9
Compare
With sleep of 10 seconds this test is running fine. |
@rishabh-d-dave The stale stats can be due to an issue (race) in the mds and introducing a sleep (which makes the test pass) isn't the correct approach till the underlying issue is RCA'd. |
Yes, I digging into underlying code that fetches stats. I've looked into |
c3e18f9
to
ca20108
Compare
I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat speed propagation or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context? |
My worry is that the failures are being seen recently and I'm not sure if its due to some change that could possibly have made the stat propagation slower. |
Would it be a nice idea to merge this ticket for now (so that this failure stops showing up in our QA runs) and open a ticket for deeper investigation that can be carried out little later? |
Frankly, I'd rather have the test fail so that it bothers us :) |
I agree but constant failures would block any cephfs-shell PRs from getting merged. |
Don't we note the known issues and proceed with merging PRs? |
We do, but, when a failure occurs, the test suite is run only partially. In such a case, a PR is not tested properly and a buggy PR might get merged. |
Sure. But this failures ins't seen in each run which mean other tests do run to completion, so we aren't really missing test coverage with other runs, unless this test fails on every run, in which case the severity would be high anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a sleep, what about adding some sort of wait_until_true
to wait until it is valid or timeout?
There probably is some kind of race condition due to which issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case in tests for "du" command in cephfs-shell before. Since a sleep was added to those tests, the error never reproduced again, Let's add a sleep to this test to. Hopefully, this issue will stop reproducing altogether. Maybe Fixes: https://tracker.ceph.com/issues/47292 Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
ca20108
to
8d4ac8b
Compare
My concern is in this comment - #52822 (comment). Its good to know which change (if any) caused the stat propagation to slow down and I'd rather have the test fail so that we know something isn't working as expected in the MDS. |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
There probably is some kind of race condition due to which issue
described in tracker ticket https://tracker.ceph.com/issues/47292 only
reproduces sometimes. This has been the case in tests for "du" command
in cephfs-shell before. Since a sleep was added to those tests, the
error never reproduced again,
Let's add a sleep to this test to. Hopefully, this issue will stop
reproducing altogether.
Maybe Fixes: https://tracker.ceph.com/issues/47292
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows