qa/cephfs: clean up evicted client in 4-compat_client.yaml #46988
jenkins test api
@vshankar @lxbsz The fix and the issue have been verified by running the failing job again. http://pulpito.front.sepia.ceph.com/rishabh-2022-07-06_08:49:09-fs-wip-vshankar-testing-20220527-073645-distro-default-smithi/
@@ -13,3 +13,11 @@ tasks:
clients:
client.0: False
client.1: True
# cleanup evicted client so there's no trouble later.
nit: add a comment that only client.0 is upgraded and client.1 is evicted by the mds due to missing feature compat set.
Tested PR #45036 and PR #46988. Ran fine - http://pulpito.front.sepia.ceph.com/rishabh-2022-07-06_11:37:52-fs-wip-vshankar-testing-20220527-073645-distro-default-smithi/. This PR is ready for QA now.
49e51ba to 9ee8f27
- cat mntpt.txt
- sudo umount -f $(cat mntpt.txt)
- sudo rmdir $(cat mntpt.txt)
- rm mntpt.txt
How about doing this in clients_evicted()
in tasks/fs.py
instead ? Then it should be very simple by:
mount.umount_wait()
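A minimal sketch of what this suggestion could look like, assuming the task has access to a mapping of client IDs to mount objects and a set of evicted client IDs. The helper name `unmount_evicted` and both parameters are hypothetical and are not taken from tasks/fs.py; only `umount_wait()` comes from the comment above.

```python
def unmount_evicted(mounts, evicted_ids):
    """Unmount every mount whose client ID is in evicted_ids.

    mounts: dict mapping client ID -> mount object exposing umount_wait()
    evicted_ids: set of client IDs that were evicted (hypothetical input)
    Returns the list of client IDs that were unmounted.
    """
    cleaned = []
    for client_id, mount in mounts.items():
        if client_id in evicted_ids:
            # Only the evicted clients are touched; others stay mounted.
            mount.umount_wait()
            cleaned.append(client_id)
    return cleaned
```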
Could this cause a problem for teuthology jobs that want to keep evicted clients around, as they are, for some time?
There are only two places that use this, and I don't see how it could lead to any issues.
And from my understanding, fs.clients_evicted:
is where the evicted clients need to be kept around, as they are, for some time. So unmounting them in it, or right after it, should be fine.
Okay, I'll make the change in that case.
You'd only need to make sure that the findmnt
(from your changes) does not interleave anywhere, which then could lead to a test hang.
Done. PTAL.
c25d3e1 to 41938cc
Tested the new change; it works fine - http://pulpito.front.sepia.ceph.com/rishabh-2022-07-07_17:52:35-fs-main-distro-default-smithi/
jenkins test make check
84a2841 to f3857ec
jenkins test api
jenkins test api
jenkins test api
@rishabh-d-dave ping?
Elaborating on the problem: The Python code hangs when a blocked client is being operated on, even when [...]. The way around this (which is currently on this PR) is to set [...]. Noting the client socket in the class constructor and checking whether it is blocked in the class destructor should be a better way to deal with this issue. I'll try this out and post the result.
604c3ce to d8416c8
f51e1ec to 79bb6a7
Add a note explaining the reason behind the eviction of "client.1" during this test.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
79bb6a7 to 737f978
Tested this PR individually; testing was successful - http://pulpito.front.sepia.ceph.com/rishabh-2022-08-19_14:15:36-fs-wip-rishabh-client-evict-distro-default-smithi/. Adding it for QA run.
Before unmounting, check whether the client has been evicted and, if so, run "umount -f -l" on the client's mount point and clean up the mount right after it. Attempting to unmount, clean up or otherwise operate on the mount point of an evicted client hangs the operation (and thereby our Python code too). A lazy forced unmount prevents such hangs in our Python code and also frees the mount point.

This commit also adds code to gather session info for kernel mounts once mounting succeeds. This is necessary because the network address of the session is needed to check whether it is blocked by the Ceph cluster.

Fixes: https://tracker.ceph.com/issues/56476
Signed-off-by: Rishabh Dave <ridave@redhat.com>
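The decision described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the function name `plan_cleanup`, its parameters, and the idea of passing in the blocked addresses as a plain collection are all hypothetical. It only builds the command lines and does not run them.

```python
def plan_cleanup(mountpoint, client_addr, blocked_addrs):
    """Return the commands (argv lists) to unmount and remove a mount point.

    client_addr: network address of the client's session, recorded at
                 mount time (hypothetical input)
    blocked_addrs: addresses currently blocked by the cluster
                   (hypothetical input, however they were obtained)
    """
    evicted = client_addr in set(blocked_addrs)
    if evicted:
        # Lazy forced unmount: avoids hanging on an evicted client.
        umount = ["sudo", "umount", "-f", "-l", mountpoint]
    else:
        # Assumption: a plain unmount is enough for a healthy client.
        umount = ["sudo", "umount", mountpoint]
    # Remove the now-unused mount point directory right afterwards.
    return [umount, ["sudo", "rmdir", mountpoint]]
```

Keeping the planning step separate from execution makes the evicted and non-evicted paths easy to check without touching a real mount.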
737f978 to c279b47
In last
Fixed now.
jenkins test api
jenkins test make check arm64
jenkins test windows
QA was successful - https://tracker.ceph.com/projects/cephfs/wiki/Main#2022-Aug-26. The QA job that this PR fixes didn't get executed during the QA run, so I ran it myself. The job ran successfully - http://pulpito.front.sepia.ceph.com/rishabh-2022-08-26_12:11:39-fs-wip-rishabh-testing-2022Aug19-distro-default-smithi/detail. Waiting on CI now.
jenkins test api
jenkins test make check arm64
jenkins test windows
Requested changes were added.
Related PR - PR #45036
4-compat_client.yaml in creates two clients and evicts one of them. The
evicted client is not cleaned up later; that is, it is never unmounted and
its mount point is never deleted. This doesn't cause a failure during
final teardown for main branch but with PR #45036 it does lead to
failure every time.
PR #45036 changes how CephFS code in the "qa" directory checks whether a
CephFS has been unmounted. Instead of relying on the value of the
"is_mounted" attribute, it runs the "findmnt" command to check whether
the client was actually unmounted.
Operating on a CephFS mountpoint after the client has been evicted
causes the operation to hang. Thus with PR #45036 the final teardown
for teuthology job fails every time.
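As a rough illustration of the difference, a check based on "findmnt" might look like the sketch below. It is not the code from PR #45036; the function name, the injectable `run` parameter and the timeout value are assumptions made for this example. The timeout is only a best-effort guard, since a process stuck in uninterruptible I/O may not be killable.

```python
import subprocess


def is_mounted(mountpoint, run=subprocess.run, timeout=30):
    """Ask findmnt whether mountpoint is currently a mount point.

    Returns True or False based on findmnt's exit status, or None if
    the command did not finish within the timeout.
    """
    try:
        proc = run(
            ["findmnt", "--mountpoint", mountpoint],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The state is unknown; the caller decides how to proceed.
        return None
    # findmnt exits with 0 when it finds a matching filesystem.
    return proc.returncode == 0
```

Unlike a cached attribute, this asks the system for the current state each time, which is also why it can block when the mount point belongs to an evicted client.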
Fixes: https://tracker.ceph.com/issues/56476