osd/OSD: Log aggregated slow ops detail to cluster logs #43732
The current behavior of logging every slow op to the cluster log can still be enabled through a setting. By default (osd_log_slow_op_to_clog=false), slow ops are reported in an aggregated format.
acb57bb to a3fae2e
Minor comments added, overall it looks good.
a3fae2e to 479db9c
LGTM
Please see the notes
Thanks Ronen. Let me work on your review comments.
IMHO, the monitor is supposed to keep track of critical information on which the cluster needs to reach consensus, but slow ops warnings do not fall into this category. I'd suggest we stop patching a solution that sends the information to the wrong place; please let the mgr collect the metrics and structured warning reports. The mgr was created a couple of years ago to offload burdens like this from the monitor, and continuing this way would be a step back.
Totally agree! That requires incorporating the cluster log changes into the mgr and extending the health metric to report aggregated slow requests. This PR is based on the current implementation of cluster logging. To reduce the burden on the monitor, we aggregate slow ops on the OSD side before sending them to the cluster log. We will improve cluster logging through the mgr with a new RFE in the near future.
479db9c to 7def867
Much improved. Let's get this done in time for this version.
Slow requests can overwhelm the cluster log with every slow op in
detail and also fill up the monitor db. Instead, log slow op
details in aggregated format.
Fixes: https://tracker.ceph.com/issues/52424
Signed-off-by: Prashant D <pdhange@redhat.com>
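As a rough illustration of the idea in the commit message, the sketch below collapses many slow ops into a single summary line instead of emitting one log line per op. It is not the PR's code: the `SlowOp` record, its fields, the `summarize_slow_ops` helper, and the output format are all hypothetical, chosen only to show the aggregation step.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SlowOp:
    """Hypothetical record of one slow op; these fields are assumptions."""
    pool: str
    op_type: str
    age_secs: float


def summarize_slow_ops(ops):
    """Collapse a list of slow ops into one summary string.

    Returns None when there is nothing to report, so the caller can
    skip logging entirely. The format is illustrative only.
    """
    if not ops:
        return None
    by_pool = Counter(op.pool for op in ops)
    by_type = Counter(op.op_type for op in ops)
    oldest = max(op.age_secs for op in ops)
    # Sort keys so the summary is stable across runs.
    pools = ", ".join(f"{p}: {n}" for p, n in sorted(by_pool.items()))
    types = ", ".join(f"{t}: {n}" for t, n in sorted(by_type.items()))
    return (f"{len(ops)} slow ops, oldest {oldest:.1f}s; "
            f"by pool [{pools}]; by type [{types}]")
```

With this shape, the number of log lines sent per reporting interval stays constant no matter how many ops are slow, which is the property the commit message is after.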
7def867 to 9319dc9
@ronen-fr Kindly review the latest changes. Thanks!
jenkins test make check
jenkins test make check
The make check error does not seem related to this PR. I re-ran make check on an old draft PR of mine, and it failed there too: https://jenkins.ceph.com/job/ceph-pull-requests/88381/consoleFull#231334942c19247c4-fcb7-4c61-9a5d-7e2b9731c678
jenkins test make check
http://pulpito.front.sepia.ceph.com/yuriw-2022-01-14_23:22:09-rados-wip-yuri6-testing-2022-01-14-1207-distro-default-smithi/
Failures, unrelated.