Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qa/cephfs: clean up evicted client in 4-compat_client.yaml #46988

Merged
merged 3 commits into from Sep 2, 2022

Conversation

rishabh-d-dave
Copy link
Contributor

@rishabh-d-dave rishabh-d-dave commented Jul 6, 2022

Related PR - PR #45036

4-compat_client.yaml in creates two clients and evicts one of them. The
evicted client is not cleaned up later, that is it's left unmounted and
the mount point is left undeleted. This doesn't cause failure during
final teardown for main branch but with PR #45036 it does lead to
failure every time.

PR #45036 changes the fact that CephFS code in directory "qa" depends on
value of attribute "is_mounted" to check if a CephFS has been unmounted
or not. Instead, it runs "findmnt" command to check if the client was
actually unmounted.

Operating on a CephFS mountpoint after the client has been evicted
causes the operation to hang. Thus with PR #45036 the final teardown
for teuthology job fails every time.

Fixes: https://tracker.ceph.com/issues/56476

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@rishabh-d-dave rishabh-d-dave requested review from vshankar, lxbsz and a team July 6, 2022 10:21
@github-actions github-actions bot added the cephfs Ceph File System label Jul 6, 2022
@rishabh-d-dave rishabh-d-dave changed the title qa/cephfs: improvements for xfstests_dev.py qa/cephfs: clean up evicted client in 4-compat_client.yaml Jul 6, 2022
@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@@ -13,3 +13,11 @@ tasks:
clients:
client.0: False
client.1: True
# cleanup evicted client so there's no trouble later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a comment that only client.0 is upgraded and client.1 is evicted by the mds due to missing feature compat set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rishabh-d-dave
Copy link
Contributor Author

rishabh-d-dave commented Jul 6, 2022

- cat mntpt.txt
- sudo umount -f $(cat mntpt.txt)
- sudo rmdir $(cat mntpt.txt)
- rm mntpt.txt
Copy link
Member

@lxbsz lxbsz Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about doing this in clients_evicted() in tasks/fs.py instead ? Then it should be very simple by:

mount.umount_wait()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it lead to a problem for teuthology jobs that wants to keep evicted clients as it is around for some time?

Copy link
Member

@lxbsz lxbsz Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two places using this, I didn't see it could lead potential issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And from my understanding the fs.clients_evicted: is where needs the evicted clients as it is around for some time. So after this or in this to unmount them should be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll make the change in that case.

Copy link
Contributor

@vshankar vshankar Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd only need to make sure that the findmnt (from your changes) does not interleave anywhere, which then could lead to a test hang.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. PTAL.

@rishabh-d-dave rishabh-d-dave force-pushed the qa-suites-upgraded-client branch 4 times, most recently from c25d3e1 to 41938cc Compare July 7, 2022 17:41
@rishabh-d-dave
Copy link
Contributor Author

@rishabh-d-dave
Copy link
Contributor Author

jenkins test make check

@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@rishabh-d-dave
Copy link
Contributor Author

@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@vshankar
Copy link
Contributor

@rishabh-d-dave ping?

@rishabh-d-dave
Copy link
Contributor Author

rishabh-d-dave commented Aug 10, 2022

Spoke to @rishabh-d-dave today. Seems like the pending issue with this PR is to identify if the client has been blocklisted and then (force) unmount it. I suggested that the we save the client address after a successful mount (mount.py) and then during class destruction (or in some helper) check if the address is part of the blocklisted client and then do the needful.

Does that make sense?

Elaborating on the problem: The Python code hangs when a blocked client is being operated on even when timeout option is passed to run() method, timeout option is passed to findmnt command and/or timeout command is used to kill findmnt or any other command. All of these methods have been tried in previous version of this PR. (One more option was tried besides these: run umount -fl <mntpt> for blocked client's mountpoint from within the 4-compat_client.yaml)

The way around this is (which is currently on this PR) to set wait=False and treat client as blocked when no response is received for X seconds. This approach too is not so nice since it involves waiting for X to 2X or 3X seconds (depending on how many times is_stuck() was called) for every mount.

Noting the client socket in class constructor and checking if it is blocked in class destructor should be a better way to deal with this issue. I'll try this out and post the result.

@rishabh-d-dave rishabh-d-dave force-pushed the qa-suites-upgraded-client branch 2 times, most recently from 604c3ce to d8416c8 Compare August 17, 2022 15:26
@rishabh-d-dave rishabh-d-dave force-pushed the qa-suites-upgraded-client branch 2 times, most recently from f51e1ec to 79bb6a7 Compare August 18, 2022 09:05
Add a note explaining the reason behind the eviction of "client.1"
during this test.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rishabh-d-dave
Copy link
Contributor Author

Tested this PR individually, testing was successful - http://pulpito.front.sepia.ceph.com/rishabh-2022-08-19_14:15:36-fs-wip-rishabh-client-evict-distro-default-smithi/.

Adding it for QA run.

Before unmounting check if the client has been evicted and, if so, run
"umount -f -l" for the mount point of the client and cleanup the mount
right after it.

Attempting to unmount, cleanup or operate in any way over mount point
of a evicted client will hang the operation (and thereby our Python
code too). Lazy-force unmount prevents such hangs for our Python code
and also frees the mount point.

This commit also adds code to gather session info for kernel mounts
after mounting is successful. This is a necessity since network address
of session is needed to check if it is blocked by Ceph cluster.

Fixes: https://tracker.ceph.com/issues/56476
Signed-off-by: Rishabh Dave <ridave@redhat.com>
@rishabh-d-dave
Copy link
Contributor Author

In last make check job flake8 had failed -

flake8 run-test: commands[0] | flake8 --select=F,E9 --exclude=venv,.tox
./tasks/cephfs/fuse_mount.py:15:1: F401 'tasks.cephfs.filesystem.Filesystem' imported but unused

Fixed now.

@rishabh-d-dave rishabh-d-dave added the wip-rishabh-testing Rishabh's testing label label Aug 19, 2022
@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@rishabh-d-dave
Copy link
Contributor Author

jenkins test make check arm64

@rishabh-d-dave
Copy link
Contributor Author

jenkins test windows

@rishabh-d-dave
Copy link
Contributor Author

rishabh-d-dave commented Aug 26, 2022

QA was successful - https://tracker.ceph.com/projects/cephfs/wiki/Main#2022-Aug-26.

The QA job that this PR fixes didn't got executed during QA run, so I ran it myself. The job ran successfully - http://pulpito.front.sepia.ceph.com/rishabh-2022-08-26_12:11:39-fs-wip-rishabh-testing-2022Aug19-distro-default-smithi/detail.

Waiting on CI now.

@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@rishabh-d-dave
Copy link
Contributor Author

jenkins test make check arm64

@rishabh-d-dave
Copy link
Contributor Author

jenkins test windows

@rishabh-d-dave
Copy link
Contributor Author

#46988 (comment)

@rishabh-d-dave rishabh-d-dave dismissed batrick’s stale review September 2, 2022 12:19

Requested changes were added.

@rishabh-d-dave rishabh-d-dave merged commit d5ab255 into ceph:main Sep 2, 2022
@rishabh-d-dave rishabh-d-dave deleted the qa-suites-upgraded-client branch November 29, 2022 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cephfs Ceph File System tests wip-rishabh-testing Rishabh's testing label
Projects
None yet
6 participants