
qa/cephfs: add sleep to test_cephfs_shell.py to avoid race #52822

Open
wants to merge 2 commits into main

Conversation

rishabh-d-dave
Contributor

There is probably some kind of race condition due to which the issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case for the "du" command tests in cephfs-shell before; since a sleep was added to those tests, the error has never reproduced again.

Let's add a sleep to this test too. Hopefully, this issue will stop reproducing altogether.

Maybe Fixes: https://tracker.ceph.com/issues/47292
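For readers skimming the thread, here is a minimal sketch of the kind of change being proposed, assuming a teuthology-style test in qa/tasks/cephfs/test_cephfs_shell.py. The class name, the helper methods, and the 10-second duration below are illustrative assumptions, not copied from the PR's diff:

```python
import time

from tasks.cephfs.cephfs_test_case import CephFSTestCase  # assumed import path


class TestDF(CephFSTestCase):  # illustrative class name, not the PR's exact code
    def test_df_for_valid_file(self):
        # Write some data through the mount so "df" has something to report.
        self.mount_a.write_file('file1', 'some data' * 1024)  # assumed helper

        # The core of the proposed change: pause long enough for the stats to
        # propagate before querying disk usage, so the assertion does not run
        # against stale numbers. The exact duration used in the PR may differ.
        time.sleep(10)

        # Run "df" through cephfs-shell and check the output; helper name assumed.
        df_output = self.get_cephfs_shell_cmd_output('df file1')
        self.assertIn('file1', df_output)
```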


@vshankar
Contributor

There is probably some kind of race condition due to which the issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case for the "du" command tests in cephfs-shell before; since a sleep was added to those tests, the error has never reproduced again.

Let's add a sleep to this test too. Hopefully, this issue will stop reproducing altogether.

Have you considered the possibility of stale disk usage stats getting reported, which resolves itself after the sleep is added (giving it enough time to sync up the stats)?

@vshankar
Contributor

vshankar commented Sep 5, 2023

@rishabh-d-dave The stale stats can be due to an issue (a race) in the MDS, and introducing a sleep (which makes the test pass) isn't the correct approach until the underlying issue has been RCA'd.

@rishabh-d-dave
Contributor Author

rishabh-d-dave commented Sep 5, 2023

@vshankar

@rishabh-d-dave The stale stats can be due to an issue (a race) in the MDS, and introducing a sleep (which makes the test pass) isn't the correct approach until the underlying issue has been RCA'd.

Yes, I am digging into the underlying code that fetches the stats. I've looked into cephfs-shell's do_df(), cephfs.pyx's statfs(), libcephfs.cc's ceph_statfs(), and now finally into Client::statfs().
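For orientation, here is a rough sketch of exercising that same call chain from the top, via the python binding (cephfs.pyx) that cephfs-shell sits on. It assumes default ceph.conf/keyring locations, and the exact shape of the statfs() return value may vary between versions:

```python
import cephfs

fs = cephfs.LibCephFS()
fs.conf_read_file()   # read the default ceph.conf
fs.mount()            # mount the default filesystem at "/"
try:
    # statfs() in cephfs.pyx ends up in Client::statfs() by way of
    # libcephfs' ceph_statfs(); its block/file counts are what back "df".
    stats = fs.statfs(b'/')
    print(stats)
finally:
    fs.unmount()
    fs.shutdown()
```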

@gregsfortytwo
Member

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

@vshankar
Contributor

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

My worry is that these failures have started showing up only recently, and I'm not sure if that's due to some change that could possibly have made the stat propagation slower.

@rishabh-d-dave
Contributor Author

@vshankar

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

My worry is that these failures have started showing up only recently, and I'm not sure if that's due to some change that could possibly have made the stat propagation slower.

Would it be a good idea to merge this PR for now (so that this failure stops showing up in our QA runs) and open a ticket for a deeper investigation that can be carried out a little later?

@vshankar
Contributor

@vshankar

@rishabh-d-dave - https://tracker.ceph.com/issues/47292#note-9

I don't generally like sleeps in testing, but I also don't think we have any meaningful guarantees around either rstat propagation speed or rados pgstat reports (which are the potential sources of df's data), so this kind of wait seems fine in this context?

My worry is that these failures have started showing up only recently, and I'm not sure if that's due to some change that could possibly have made the stat propagation slower.

Would it be a good idea to merge this PR for now (so that this failure stops showing up in our QA runs) and open a ticket for a deeper investigation that can be carried out a little later?

Frankly, I'd rather have the test fail so that it bothers us :)

@rishabh-d-dave
Contributor Author

@vshankar

Frankly, I'd rather have the test fail so that it bothers us :)

I agree, but constant failures would block any cephfs-shell PR from getting merged.

@vshankar
Contributor

@vshankar

Frankly, I'd rather have the test fail so that it bothers us :)

I agree, but constant failures would block any cephfs-shell PR from getting merged.

Don't we note the known issues and proceed with merging PRs?

@rishabh-d-dave
Contributor Author

@vshankar

Don't we note the known issues and proceed with merging PRs?

We do, but when a failure occurs, the test suite is run only partially. In such a case, a PR is not tested properly and a buggy PR might get merged.

@vshankar
Contributor

vshankar commented Nov 2, 2023

@vshankar

Don't we note the known issues and proceed with merging PRs?

We do, but when a failure occurs, the test suite is run only partially. In such a case, a PR is not tested properly and a buggy PR might get merged.

Sure. But this failure isn't seen in every run, which means the other tests do run to completion, so we aren't really missing test coverage in the other runs, unless this test fails on every run, in which case the severity would be high anyway.

Contributor

@chrisphoffman left a comment


Instead of a sleep, what about adding some sort of wait_until_true to wait until the result is valid or a timeout is reached?

There is probably some kind of race condition due to which the issue described in tracker ticket https://tracker.ceph.com/issues/47292 only reproduces sometimes. This has been the case for the "du" command tests in cephfs-shell before; since a sleep was added to those tests, the error has never reproduced again.

Let's add a sleep to this test too. Hopefully, this issue will stop reproducing altogether.

Maybe Fixes: https://tracker.ceph.com/issues/47292
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
@vshankar
Contributor

Instead of a sleep, what about adding some sort of wait_until_true to wait until the result is valid or a timeout is reached?

My concern is in this comment - #52822 (comment).

It's good to know which change (if any) caused the stat propagation to slow down, and I'd rather have the test fail so that we know something isn't working as expected in the MDS.
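For reference, a polling approach along the lines suggested above might look roughly like this. It is a standalone sketch: the test framework's own wait_until_true helper, if one is used instead, may have a different signature, and the usage shown in the trailing comment relies on hypothetical helper names:

```python
import time


def wait_until_true(condition, timeout, period=1):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(period)
    raise RuntimeError(f'condition not met within {timeout} seconds')


# Illustrative use inside the test: keep re-running "df" until it reports the
# expected usage, instead of sleeping for a fixed interval and hoping the
# stats have caught up by then.
#
#   wait_until_true(
#       lambda: expected_usage_in(self.get_cephfs_shell_cmd_output('df file1')),
#       timeout=60)
```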


This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 29, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Feb 28, 2024
@github-actions github-actions bot removed the stale label Feb 28, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 28, 2024