
mgr: Add one finisher thread per module #47893

Merged
merged 2 commits into ceph:main on Apr 6, 2023

Conversation

kotreshhr
Contributor

@kotreshhr kotreshhr commented Sep 1, 2022

Adds a finisher thread for each module. Each module's commands are
executed via its corresponding finisher thread. A generic finisher
thread runs all other callbacks, such as notify and config-change
events.

Fixes: https://tracker.ceph.com/issues/51177
Signed-off-by: Kotresh HR khiremat@redhat.com
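The design in this description can be illustrated with a toy Python analogue (a hypothetical sketch; the actual implementation is C++ inside ceph-mgr, and the class and method names below are invented for illustration): one finisher queue and thread per module for its command callbacks, plus a shared generic finisher for everything else.

```python
import queue
import threading

class Finisher:
    """Toy analogue of Ceph's C++ Finisher: one thread draining a FIFO
    queue of callbacks, so callbacks queued on one finisher run sequentially."""
    def __init__(self, name):
        self.name = name
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def queue_callback(self, fn):
        self._q.put(fn)

    def _drain(self):
        while True:
            fn = self._q.get()
            if fn is None:          # sentinel: shut down
                return
            fn()

    def stop(self):
        self._q.put(None)
        self._thread.join()

class MgrSketch:
    """One finisher per module for its commands, plus a generic finisher
    for other callbacks (notify, config change), mirroring the PR text."""
    def __init__(self, modules):
        self.module_finishers = {m: Finisher(m) for m in modules}
        self.generic_finisher = Finisher("generic")

    def handle_command(self, module, fn):
        # Commands for one module stay ordered but cannot block other modules.
        self.module_finishers[module].queue_callback(fn)

    def notify(self, fn):
        self.generic_finisher.queue_callback(fn)

# Usage: per-module ordering is kept, notify goes through the generic finisher.
results = []
mgr = MgrSketch(["volumes", "status"])
mgr.handle_command("volumes", lambda: results.append("v1"))
mgr.handle_command("volumes", lambda: results.append("v2"))
mgr.notify(lambda: results.append("notify"))
for f in mgr.module_finishers.values():
    f.stop()
mgr.generic_finisher.stop()
```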

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@kotreshhr kotreshhr requested a review from a team as a code owner September 1, 2022 06:37
@kotreshhr kotreshhr marked this pull request as draft September 1, 2022 06:38
@vshankar
Contributor

vshankar commented Sep 7, 2022

@kotreshhr @batrick I'm not against the approach taken, however, I would like to check if the point made here(*) is worth considering?

(*): https://tracker.ceph.com/issues/51177#note-4

@batrick
Member

batrick commented Sep 7, 2022

@kotreshhr @batrick I'm not against the approach taken, however, I would like to check if the point made here(*) is worth considering?

(*): https://tracker.ceph.com/issues/51177#note-4

I think the issue with that approach is you'd still need to somehow queue other events like mdsmap/osdmap notifications or even subsequent CLI commands. How many async commands should run at once? Leave it to the modules to decide? It gets messy quickly, especially with ordering guarantees.

This approach keeps the sequential nature of commands/events but prevents one module from blocking others from operating. I think it's a nice compromise.
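The trade-off described above (sequential within a module, isolated across modules) can be illustrated with a small hypothetical Python sketch, not the PR's actual C++ code: a slow callback on one module's queue delays later entries on that same queue, but not entries on another module's queue.

```python
import queue
import threading
import time

def run_finisher(q):
    # Drain one module's callback queue sequentially (FIFO ordering kept).
    while True:
        fn = q.get()
        if fn is None:
            return
        fn()

qa, qb = queue.Queue(), queue.Queue()
log = []
ta = threading.Thread(target=run_finisher, args=(qa,))
tb = threading.Thread(target=run_finisher, args=(qb,))
ta.start()
tb.start()

qa.put(lambda: time.sleep(1.0))          # slow command on module A's queue
qa.put(lambda: log.append("A-second"))   # ordered behind it on the same queue
qb.put(lambda: log.append("B-first"))    # module B's queue is unaffected

time.sleep(0.25)
assert log == ["B-first"]                # B ran while A was still blocked

qa.put(None)
qb.put(None)
ta.join()
tb.join()
assert log == ["B-first", "A-second"]    # per-queue ordering preserved
```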

@vshankar
Contributor

vshankar commented Sep 8, 2022

@kotreshhr @batrick I'm not against the approach taken, however, I would like to check if the point made here(*) is worth considering?

(*): https://tracker.ceph.com/issues/51177#note-4

I think the issue with that approach is you'd still need to somehow queue other events like mdsmap/osdmap notifications or even subsequent CLI commands. How many async commands should run at once? Leave it to the modules to decide? It gets messy quickly, especially with ordering guarantees.

This approach keeps the sequential nature of commands/events but prevents one module from blocking others from operating. I think it's a nice compromise.

I'm a tiny bit worried about bloating ceph-mgr's memory usage with this approach. It solves the issue at hand; however, we are adding a finite amount of memory (queue, locks, etc.) for each mgr module. IIRC, there was some discussion (somewhere) about trimming ceph-mgr's memory footprint.

@badone ^^

src/mgr/Mgr.cc (review comment, outdated and resolved)
@badone
Contributor

badone commented Sep 8, 2022

I'm a tiny bit worried about bloating ceph-mgr's memory usage with this approach. It solves the issue at hand; however, we are adding a finite amount of memory (queue, locks, etc.) for each mgr module. IIRC, there was some discussion (somewhere) about trimming ceph-mgr's memory footprint.

@badone ^^

I don't recall. Are you sure it was me involved (did you mean Boris maybe)?

@vshankar
Contributor

vshankar commented Sep 9, 2022

I'm a tiny bit worried about bloating ceph-mgr's memory usage with this approach. It solves the issue at hand; however, we are adding a finite amount of memory (queue, locks, etc.) for each mgr module. IIRC, there was some discussion (somewhere) about trimming ceph-mgr's memory footprint.
@badone ^^

I don't recall. Are you sure it was me involved (did you mean Boris maybe)?

Maybe, I'm not sure. It was a while back...

Anyway, we need more eyes on this change.

@badone
Contributor

badone commented Sep 9, 2022

@vshankar Do you see it adding significantly to the memory usage of the manager? Seems to me the additional overhead of having a finisher for each module plus a generic one shouldn't be massively different to having just a generic one since the content of the combined finisher queues should be about the same size as the queue of the finisher in the one finisher approach right?

@vshankar
Contributor

vshankar commented Sep 9, 2022

@vshankar Do you see it adding significantly to the memory usage of the manager? Seems to me the additional overhead of having a finisher for each module plus a generic one shouldn't be massively different to having just a generic one since the content of the combined finisher queues should be about the same size as the queue of the finisher in the one finisher approach right?

There would be one thread per mgr plugin. That's different from having one thread, which is the case now. And then, there is a (finisher) thread even for modules that are not enabled (which can be worked around). But still...

How many plugins do we have now in ceph-mgr?

@badone
Contributor

badone commented Sep 9, 2022

In the mid-to-high thirties, if my quick count is correct, but they are not all enabled/active, of course.

@kotreshhr
Contributor Author

@vshankar Do you see it adding significantly to the memory usage of the manager? Seems to me the additional overhead of having a finisher for each module plus a generic one shouldn't be massively different to having just a generic one since the content of the combined finisher queues should be about the same size as the queue of the finisher in the one finisher approach right?

Yes, nothing changes w.r.t. total queue size. With this approach, the queue entries are distributed among multiple threads instead of sitting on the single thread as before. The added memory usage is the new Finisher object per thread. I am not yet sure how much that adds to resource consumption.

@kotreshhr
Contributor Author

@vshankar
Contributor

I guess this change is fine since no one has objected to the approach taken here.

@badone - @kotreshhr has additionally run this through rados:mgr. Would that be enough as a source of validation?

@vshankar
Contributor

ping?

@vshankar vshankar marked this pull request as ready for review October 6, 2022 11:22
@vshankar
Contributor

I guess this change is fine since no one has objected to the approach taken here.

@badone - @kotreshhr has additionally run this through rados:mgr. Would that be enough as a source of validation?

ping?

@badone
Contributor

badone commented Oct 12, 2022

I guess this change is fine since no one has objected to the approach taken here.
@badone - @kotreshhr has additionally run this through rados:mgr. Would that be enough as a source of validation?

ping?

Have Patrick's concerns been addressed? It's still showing a requested change.

@kotreshhr
Contributor Author

I guess this change is fine since no one has objected to the approach taken here.
@badone - @kotreshhr has additionally run this through rados:mgr. Would that be enough as a source of validation?

ping?

Have Patrick's concerns been addressed? It's still showing a requested change.

@badone I had missed Patrick's comment. I am working on it and will refresh the PR once done. Is there any additional testing you think needs to be run?

@badone
Contributor

badone commented Oct 17, 2022

@kotreshhr I think you can just run the same suites again on your updated PR, and that should be sufficient.

@kotreshhr
Contributor Author

jenkins test windows

@kotreshhr
Contributor Author

jenkins test api

@kotreshhr
Contributor Author

Can we simply put the new task in rados/mgr/tasks instead of defining an entire new suite?

Yes, done. Thank you for the review. It was added as a separate suite initially because it required a single-mgr cluster. I fixed that down the line, but the new suite remained.

@kotreshhr
Contributor Author

@kotreshhr pls add needs-qa when fixed

Sure, the changes are in place, but I couldn't run a sanity test since teuthology is down. I will do it once it's back and add needs-qa. Thank you.

@kotreshhr kotreshhr force-pushed the ceph-mgr-finisher-block branch 3 times, most recently from af37e52 to 929eb07 on March 2, 2023 18:38
@kotreshhr
Contributor Author

@yuriw marked needs-qa. The PR is ready for testing.

@kotreshhr
Contributor Author

jenkins test windows

1 similar comment
@kotreshhr
Contributor Author

jenkins test windows

@kotreshhr
Contributor Author

jenkins test api

@kotreshhr
Contributor Author

jenkins test windows

2 similar comments
@kotreshhr
Contributor Author

jenkins test windows

@kotreshhr
Contributor Author

jenkins test windows

@kotreshhr
Contributor Author

@athanatos PTAL. Addressed your comments.

@kotreshhr kotreshhr removed the TESTED label Mar 12, 2023
@rzarzynski
Contributor

Repinging @athanatos.

Contributor

@athanatos athanatos left a comment


The suite changes look reasonable to me now -- as long as tests pass, LGTM.

-            if val >= minval:
+            if isinstance(minval, int) and val >= minval:
                 seen.add(name)
+            elif isinstance(expected_val, int) and val == expected_val:
Member


@kotreshhr This caused the following errors:

2023-03-28T10:34:36.302 ERROR:teuthology.run_tasks:Manager failed: check-counter
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_main/teuthology/run_tasks.py", line 192, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_teuthology_main/teuthology/task/__init__.py", line 132, in __exit__
    self.end()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_9d12d2cb27d4a45c58218390bfbb56820b010b77/qa/tasks/check_counter.py", line 117, in end
    elif isinstance(expected_val, int) and val == expected_val:
UnboundLocalError: local variable 'expected_val' referenced before assignment

For more detail, please see https://pulpito.ceph.com/yuriw-2023-03-28_00:15:34-fs-wip-yuri5-testing-2023-03-27-1315-distro-default-smithi/.

Could you have a look? Thanks.
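One plausible shape of a fix (a hypothetical sketch, not necessarily the patch that landed; the `min`/`expected_val` spec keys are assumptions drawn from the snippet above, not the qa task's exact schema) is to bind both values before branching, so neither name can be referenced unassigned:

```python
def counter_ok(spec, val):
    """Check one perf-counter value against its spec.

    Reading both thresholds up front with .get() guarantees that
    expected_val is always bound, avoiding the UnboundLocalError hit
    when only the minval branch had assigned its variable.
    """
    minval = spec.get("min", -1)
    expected_val = spec.get("expected_val", -1)
    if isinstance(minval, int) and minval >= 0 and val >= minval:
        return True
    if isinstance(expected_val, int) and expected_val >= 0 and val == expected_val:
        return True
    return False
```

With this shape, a spec carrying only a minimum, only an exact expected value, or neither, all take a well-defined path through the check.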

Contributor Author


Thanks Xiubo. I will take a look.

Contributor Author


Fixed it. Ran the failed one with the fix. It works. Hopefully this is the last failure.
http://pulpito.front.sepia.ceph.com/khiremat-2023-03-29_06:17:28-fs-wip-yuri5-testing-2023-03-27-1315-distro-default-smithi/

Contributor Author


@yuriw This needs a qa re-run after this fix.

Fixes: https://tracker.ceph.com/issues/51177
Signed-off-by: Kotresh HR <khiremat@redhat.com>
@kotreshhr
Contributor Author

jenkins test make check arm64

1 similar comment
@kotreshhr
Contributor Author

jenkins test make check arm64

Member

@lxbsz lxbsz left a comment


LGTM.

Contributor

@vshankar vshankar left a comment


LGTM. Nice work.

@yuriw yuriw merged commit 23a958d into ceph:main Apr 6, 2023
11 checks passed
10 participants