Throttle clog slow requests sent to monitors #39199

Closed
wants to merge 1 commit into from

@gerald-yang gerald-yang commented Feb 1, 2021

monclient: Throttle clog slow requests sent to monitors

A recent change (https://tracker.ceph.com/issues/43975) logs details for each slow request and sends them to the monitors.
On a large cluster, this can overwhelm the monitors with spurious logs when a performance issue occurs
and cause further instability in the cluster.

In our case, ceph.log quickly grew to more than 14GB, and we needed to restart all monitors to recover.
This patch throttles clog slow requests instead of sending the details of every slow request to the monitors,
and also sends a summary of how many slow requests an OSD has, along with information about the oldest slow request.

Fixes: https://tracker.ceph.com/issues/48909

Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
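The behaviour described above (a capped number of per-request log lines plus a summary with the count and the oldest request) can be sketched roughly as follows. This is only an illustration in Python, not the patch itself: `SlowRequest`, `build_cluster_log_lines`, and the `max_detailed` cap are hypothetical names, and the actual throttling mechanism in the Ceph code may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlowRequest:
    description: str    # hypothetical op description
    age_seconds: float  # how long the op has been outstanding


def build_cluster_log_lines(slow: List[SlowRequest], max_detailed: int = 3) -> List[str]:
    """Return the lines one OSD would send to the monitors in a single pass.

    Assumption: throttling is modelled as a simple cap on detailed lines.
    """
    if not slow:
        return []

    # Summary line: total count plus the oldest outstanding request.
    oldest = max(slow, key=lambda r: r.age_seconds)
    lines = [
        f"{len(slow)} slow requests, oldest: {oldest.description} "
        f"(blocked for {oldest.age_seconds:.1f} sec)"
    ]

    # Only the oldest `max_detailed` requests are logged individually.
    by_age = sorted(slow, key=lambda r: r.age_seconds, reverse=True)
    for req in by_age[:max_detailed]:
        lines.append(
            f"slow request {req.description} blocked for {req.age_seconds:.1f} sec"
        )

    # Note how many requests were left out instead of logging each one.
    suppressed = len(slow) - min(len(slow), max_detailed)
    if suppressed > 0:
        lines.append(f"{suppressed} more slow requests not logged individually")
    return lines
```

With 100 slow requests and `max_detailed=3`, this produces five lines instead of 100, which is the kind of reduction the patch is after.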
@tchaikov
Contributor

@neha-ojha @sseshasa ping?

Contributor

@sseshasa sseshasa left a comment

Logging of slow ops had been disabled and was re-enabled as part of https://tracker.ceph.com/issues/43975, with the intent of helping debug slow ops issues. Throttling does help keep the logs from becoming overwhelming, but we could end up missing some crucial log entries.

For the short term, these changes look fine, but we probably need a configurable throttling mechanism going forward.

Longer term, this calls for a mechanism to log slow ops with a throttling parameter (e.g. terse, normal, verbose) that can be raised or lowered depending on the situation. @neha-ojha @tchaikov what do you think?
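As a rough illustration of the terse/normal/verbose idea, a level could map to a cap on how many detailed slow-op entries are sent. The names `SlowOpLogLevel`, `DETAIL_CAP`, and `detailed_entries_to_send`, and the cap values, are assumptions made for this sketch; no such Ceph option is implied.

```python
from enum import Enum
from typing import Dict, Optional


class SlowOpLogLevel(Enum):
    TERSE = "terse"
    NORMAL = "normal"
    VERBOSE = "verbose"


# Hypothetical caps on detailed entries per reporting pass; None means no cap.
DETAIL_CAP: Dict[SlowOpLogLevel, Optional[int]] = {
    SlowOpLogLevel.TERSE: 0,       # summary only
    SlowOpLogLevel.NORMAL: 5,      # a few of the slow ops
    SlowOpLogLevel.VERBOSE: None,  # every slow op
}


def detailed_entries_to_send(level: SlowOpLogLevel, slow_count: int) -> int:
    """Return how many slow ops get a detailed log entry at the given level."""
    cap = DETAIL_CAP[level]
    if cap is None:
        return slow_count
    return min(slow_count, cap)
```

An operator could then parse the level from a config string, e.g. `SlowOpLogLevel("verbose")`, and raise it temporarily while debugging.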

@tchaikov
Contributor

tchaikov commented Mar 1, 2021

I think the change (#33328) to send slow requests to the monitors defeats the purpose of #18614. I'd suggest finding a better way to make slow requests more visible to administrators.

@tchaikov
Contributor

I sent a mail to the dev mailing list for more input.

@gerald-yang
Author

@tchaikov

Hi Kefu,
I saw there were some discussions about this issue:
https://pad.ceph.com/p/cds-quincy
https://pad.ceph.com/p/cluster_log_cds_quincy

I would just like to check whether there is a better way or idea to fix this.

Thanks,
Gerald

@gerald-yang
Author

Closing this PR per discussion with Kefu. I will open another one that handles slow request info better.

@gerald-yang gerald-yang closed this May 7, 2021