
qa/cephfs: add sleep to test_cephfs_shell.py to avoid race #52822

Open
wants to merge 2 commits into main

Conversation

rishabh-d-dave
Contributor

There is probably some kind of race condition due to which the issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case for the "du" command tests in cephfs-shell before; since a sleep was added to those tests, the error has never reproduced again.

Let's add a sleep to this test too. Hopefully, this issue will stop reproducing altogether.

Maybe Fixes: https://tracker.ceph.com/issues/47292
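For readers skimming the thread, here is a minimal sketch of the kind of change being proposed, assuming a teuthology-style test in qa/tasks/cephfs/test_cephfs_shell.py. The class name, the helper methods, and the 10-second duration below are illustrative assumptions, not copied from the PR's diff:

```python
import time

from tasks.cephfs.cephfs_test_case import CephFSTestCase  # assumed import path


class TestDF(CephFSTestCase):  # illustrative class name, not the PR's exact code
    def test_df_for_valid_file(self):
        # Write some data through the mount so "df" has something to report.
        self.mount_a.write_file('file1', 'some data' * 1024)  # assumed helper

        # The core of the proposed change: pause long enough for the stats to
        # propagate before querying disk usage, so the assertion does not run
        # against stale numbers. The exact duration used in the PR may differ.
        time.sleep(10)

        # Run "df" through cephfs-shell and check the output; helper name assumed.
        df_output = self.get_cephfs_shell_cmd_output('df file1')
        self.assertIn('file1', df_output)
```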


@vshankar
Contributor

There is probably some kind of race condition due to which the issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case for the "du" command tests in cephfs-shell before; since a sleep was added to those tests, the error has never reproduced again.

Let's add a sleep to this test too. Hopefully, this issue will stop reproducing altogether.

Have you considered the possibility of stale disk usage stats getting reported, which resolves itself after the sleep is added (giving it enough time to sync up the stats)?

@vshankar
Contributor

vshankar commented Sep 5, 2023

@rishabh-d-dave The stale stats can be due to an issue (a race) in the MDS, and introducing a sleep (which makes the test pass) isn't the correct approach until the underlying issue has been RCA'd.

@rishabh-d-dave
Contributor Author

rishabh-d-dave commented Sep 5, 2023

@vshankar

@rishabh-d-dave The stale stats can be due to an issue (a race) in the MDS, and introducing a sleep (which makes the test pass) isn't the correct approach until the underlying issue has been RCA'd.

Yes, I am digging into the underlying code that fetches the stats. I've looked into cephfs-shell's do_df(), cephfs.pyx's statfs(), libcephfs.cc's ceph_statfs(), and now finally into Client::statfs().
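For orientation, here is a rough sketch of exercising that same call chain from the top, via the python binding (cephfs.pyx) that cephfs-shell sits on. It assumes default ceph.conf/keyring locations, and the exact shape of the statfs() return value may vary between versions:

```python
import cephfs

fs = cephfs.LibCephFS()
fs.conf_read_file()   # read the default ceph.conf
fs.mount()            # mount the default filesystem at "/"
try:
    # statfs() in cephfs.pyx ends up in Client::statfs() by way of
    # libcephfs' ceph_statfs(); its block/file counts are what back "df".
    stats = fs.statfs(b'/')
    print(stats)
finally:
    fs.unmount()
    fs.shutdown()
```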

@gregsfortytwo
Member

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

@vshankar
Contributor

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

My worry is that these failures have started showing up only recently, and I'm not sure if that's due to some change that could possibly have made the stat propagation slower.

@rishabh-d-dave
Contributor Author

@vshankar

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

My worry is that these failures have started showing up only recently, and I'm not sure if that's due to some change that could possibly have made the stat propagation slower.

Would it be a good idea to merge this PR for now (so that this failure stops showing up in our QA runs) and open a ticket for a deeper investigation that can be carried out a little later?

@vshankar
Contributor

@vshankar

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

My worry is that these failures have started showing up only recently, and I'm not sure if that's due to some change that could possibly have made the stat propagation slower.

Would it be a good idea to merge this PR for now (so that this failure stops showing up in our QA runs) and open a ticket for a deeper investigation that can be carried out a little later?

Frankly, I'd rather have the test fail so that it bothers us :)

@rishabh-d-dave
Contributor Author

@vshankar

Frankly, I'd rather have the test fail so that it bothers us :)

I agree, but constant failures would block any cephfs-shell PR from getting merged.

@vshankar
Contributor

@vshankar

Frankly, I'd rather have the test fail so that it bothers us :)

I agree, but constant failures would block any cephfs-shell PR from getting merged.

Don't we note the known issues and proceed with merging PRs?

@rishabh-d-dave
Contributor Author

@vshankar

Don't we note the known issues and proceed with merging PRs?

We do, but when a failure occurs, the test suite is run only partially. In such a case, a PR is not tested properly and a buggy PR might get merged.

@vshankar
Contributor

vshankar commented Nov 2, 2023

@vshankar

Don't we note the known issues and proceed with merging PRs?

We do, but when a failure occurs, the test suite is run only partially. In such a case, a PR is not tested properly and a buggy PR might get merged.

Sure. But this failure isn't seen in every run, which means the other tests do run to completion, so we aren't really missing test coverage in the other runs, unless this test fails on every run, in which case the severity would be high anyway.

Contributor

@chrisphoffman left a comment


Instead of a sleep, what about adding some sort of wait_until_true to wait until the result is valid or a timeout is reached?

There is probably some kind of race condition due to which the issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case for the "du" command tests in cephfs-shell before; since a sleep was added to those tests, the error has never reproduced again.

Let's add a sleep to this test too. Hopefully, this issue will stop reproducing altogether.

Maybe Fixes: https://tracker.ceph.com/issues/47292
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
@vshankar
Contributor

Instead of a sleep, what about adding some sort of wait_until_true to wait until the result is valid or a timeout is reached?

My concern is in this comment - #52822 (comment).

It's good to know which change (if any) caused the stat propagation to slow down, and I'd rather have the test fail so that we know something isn't working as expected in the MDS.
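For reference, a polling approach along the lines suggested above might look roughly like this. It is a standalone sketch: the test framework's own wait_until_true helper, if one is used instead, may have a different signature, and the usage shown in the trailing comment relies on hypothetical helper names:

```python
import time


def wait_until_true(condition, timeout, period=1):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(period)
    raise RuntimeError(f'condition not met within {timeout} seconds')


# Illustrative use inside the test: keep re-running "df" until it reports the
# expected usage, instead of sleeping for a fixed interval and hoping the
# stats have caught up by then.
#
#   wait_until_true(
#       lambda: expected_usage_in(self.get_cephfs_shell_cmd_output('df file1')),
#       timeout=60)
```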


This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 29, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Feb 28, 2024
@github-actions github-actions bot removed the stale label Feb 28, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 28, 2024