qa/tasks/quiescer: dump ops in parallel #57302

Merged
merged 1 commit into from
May 17, 2024

Conversation

batrick
Member

@batrick batrick commented May 6, 2024

Since the ops dump with --flags=locks takes the mds_lock and dumps thousands of ops, it may take a long time to complete on each individual MDS. When ranks are dumped one at a time, the entire quiesce set may time out (and all quiesce ops be killed) before we finish dumping ops.

Fixes: https://tracker.ceph.com/issues/65823
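
To illustrate the motivation, here is a minimal sketch of why dumping in parallel shortens the total wall-clock time. It is not the code from qa/tasks/quiescer.py: `dump_ops`, `DUMP_SECONDS`, and the thread pool are hypothetical stand-ins, and a `time.sleep` simulates a slow per-rank dump.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DUMP_SECONDS = 0.1  # hypothetical duration of one per-rank dump


def dump_ops(rank):
    """Stand-in for a slow per-rank ops dump; returns the rank when done."""
    time.sleep(DUMP_SECONDS)
    return rank


def dump_sequential(ranks):
    """Dump one rank after another; total time grows with the rank count."""
    start = time.monotonic()
    done = [dump_ops(rank) for rank in ranks]
    return done, time.monotonic() - start


def dump_parallel(ranks):
    """Start all dumps at once; total time is roughly the slowest dump."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(ranks)) as pool:
        done = list(pool.map(dump_ops, ranks))
    return done, time.monotonic() - start
```

With three ranks, the sequential variant takes about three times `DUMP_SECONDS`, while the parallel variant takes about one, which leaves more headroom before a quiesce timeout.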

Checklist

  • Tracker (select at least one)
    • References tracker ticket
  • Component impact
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • No doc update is appropriate
  • Tests (select at least one)

@batrick batrick added cephfs Ceph File System needs-review labels May 6, 2024
@batrick batrick requested a review from leonid-s-usov May 6, 2024 18:04
@github-actions github-actions bot added the tests label May 6, 2024
@batrick
Member Author

batrick commented May 6, 2024

Not yet tested.

Contributor

@leonid-s-usov leonid-s-usov left a comment

Thanks for submitting this, Patrick!

qa/tasks/quiescer.py: two review threads (outdated, resolved)
Contributor

@leonid-s-usov leonid-s-usov left a comment

Looks good! It should be easy to test this with any of the upcoming teuthology batches by using this branch as the suite.

@batrick
Member Author

batrick commented May 8, 2024

jenkins test api

@batrick batrick closed this May 8, 2024
@batrick batrick reopened this May 8, 2024
@batrick
Member Author

batrick commented May 8, 2024

jenkins test make check arm64

@batrick
Member Author

batrick commented May 8, 2024

jenkins test make check arm64

@batrick
Member Author

batrick commented May 8, 2024

jenkins test make check arm64

@batrick
Member Author

batrick commented May 8, 2024

This PR is under test in https://tracker.ceph.com/issues/65867.

@batrick
Member Author

batrick commented May 8, 2024

jenkins test make check arm64

@@ -186,14 +186,15 @@ def dump_ops_all_ranks(self, dump_tag):
             self.logger.debug(f"Dumping ops on rank {rank} ({name}) to a remote file {remote_path}")
             try:
-                _ = self.fs.rank_tell(['ops', '--flags=locks', f'--path={daemon_path}'], rank=rank)
-                remote_dumps.append((info, remote_path))
+                p = self.fs.rank_tell(['ops', '--flags=locks', f'--path={daemon_path}'], rank=rank, wait=False)
Member Author

My mistake, I cannot do this:

https://pulpito.ceph.com/pdonnell-2024-05-08_22:06:20-fs-wip-pdonnell-testing-20240508.183908-debug-distro-default-smithi/7698868/

2024-05-09T00:35:43.460 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't pull ops dump at '/var/run/ceph/b96c13bc-0d98-11ef-bc97-c7b262605968/ops-7749c26b-1-mds.i.json' on rank 2 (i), error: 'dict' object has no attribute 'wait'
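
The error in that log line can be reproduced in isolation. The sketch below assumes, based only on the message above, that the helper returns an already-parsed JSON result (a dict) rather than a process handle; `rank_tell_like` and `try_wait` are hypothetical stand-ins, not the real `rank_tell`.

```python
import json


def rank_tell_like(command, rank, **kwargs):
    """Hypothetical stand-in: runs to completion and returns parsed JSON.

    Extra keyword arguments such as wait=False are accepted but have no
    effect here, because the result is parsed before it is returned.
    """
    raw = json.dumps({"rank": rank, "command": command})
    return json.loads(raw)


def try_wait(result):
    """Call .wait() on the result and return the error text, if any."""
    try:
        result.wait()
    except AttributeError as err:
        return str(err)
    return None


p = rank_tell_like(["ops", "--flags=locks"], rank=2, wait=False)
message = try_wait(p)
```

Here `p` is a plain dict, so `message` holds an `AttributeError` text about the missing `wait` attribute, matching the shape of the failure logged by the quiescer task.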

Contributor

:'( I was looking forward to the PR... How hard is it to add the async capability to the rank_tell?

Member Author

Rather than using the rank_tell helper, it is probably simpler to submit the command manually.
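
One way to picture submitting the command manually is the start-everything-then-wait pattern sketched below. This is only an illustration under stated assumptions: `subprocess.Popen` with a tiny Python child stands in for launching the real `ops` command on each MDS, and `start_dump` and `dump_all` are hypothetical names, not functions from the test suite.

```python
import subprocess
import sys


def start_dump(rank):
    """Launch a placeholder command for one rank without waiting for it."""
    # A short-lived Python child stands in for the real remote dump command.
    code = f"import time; time.sleep(0.1); print('rank {rank} dumped')"
    return subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        text=True,
    )


def dump_all(ranks):
    """Start every dump first, then collect the results one by one."""
    procs = [(rank, start_dump(rank)) for rank in ranks]
    results = {}
    for rank, proc in procs:
        out, _ = proc.communicate()  # blocks until this child exits
        results[rank] = (proc.returncode, out.strip())
    return results
```

Because all children are started before the first `communicate()` call, the dumps overlap in time, and the caller still learns the exit status of each one before pulling its output.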

Since this --flags=locks takes the mds_lock and dumps thousands of ops, this
may take a long time to complete for each individual MDS. The entire quiesce
set may timeout (and all q ops killed) before we finish dumping ops.

Fixes: https://tracker.ceph.com/issues/65823
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick batrick marked this pull request as ready for review May 16, 2024 16:12
@batrick
Member Author

batrick commented May 16, 2024

https://pulpito.ceph.com/pdonnell-2024-05-16_16:19:21-fs:workload-main-distro-default-smithi/

@batrick
Member Author

batrick commented May 16, 2024

jenkins test make check arm64

@batrick
Member Author

batrick commented May 17, 2024

https://pulpito.ceph.com/pdonnell-2024-05-16_16:19:21-fs:workload-main-distro-default-smithi/

seems to work as advertised now: /teuthology/pdonnell-2024-05-16_16:19:21-fs:workload-main-distro-default-smithi/7709343/teuthology.log

@batrick batrick merged commit bfe574c into ceph:main May 17, 2024
10 of 11 checks passed
@batrick batrick deleted the i65823 branch May 17, 2024 01:13
Labels
cephfs Ceph File System tests