
osd/OSD: Log aggregated slow ops detail to cluster logs #43732

Merged

1 commit merged on Jan 18, 2022

Conversation

@pdvian commented Oct 29, 2021

Slow requests can overwhelm the cluster log with every slow op logged in
detail, and they also fill up the monitor db. Instead, log slow op
details in an aggregated format.

Fixes: https://tracker.ceph.com/issues/52424

Signed-off-by: Prashant D pdhange@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@pdvian (Author) commented Oct 29, 2021

Logging every slow op to the cluster log in detail can still be enabled by setting osd_log_slow_op_to_clog=true (default is false).

By default (osd_log_slow_op_to_clog=false), slow ops are reported in the following format:
2021-10-29T09:35:48.104306-0400 osd.0 (osd.0) 3 : cluster [WRN] 10 slow requests (by type [ 'delayed' : 4 'queued for pg' : 2 'started' : 3 'waiting for sub ops' : 1 ] most affected pool [ 'scbench' : 10 ])
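The aggregation behind that single warning line can be sketched roughly as follows. This is a simplified illustration, not the actual Ceph OSD implementation: the `SlowOp` struct, `aggregate_slow_ops` function, and its field names are hypothetical, showing only how slow ops might be counted by type and by most-affected pool before emitting one cluster-log line.

```cpp
// Minimal sketch (hypothetical, not Ceph's actual code) of aggregating
// slow ops by type and most-affected pool into one warning string.
#include <map>
#include <string>
#include <vector>

struct SlowOp {
  std::string state;  // e.g. "delayed", "queued for pg", "started"
  std::string pool;   // pool the op targets
};

// Build one aggregated warning line instead of one line per slow op.
std::string aggregate_slow_ops(const std::vector<SlowOp>& ops) {
  std::map<std::string, int> by_type;
  std::map<std::string, int> by_pool;
  for (const auto& op : ops) {
    ++by_type[op.state];
    ++by_pool[op.pool];
  }
  // Find the pool with the most slow ops.
  std::string top_pool;
  int top_count = 0;
  for (const auto& [pool, n] : by_pool) {
    if (n > top_count) { top_pool = pool; top_count = n; }
  }
  std::string msg = std::to_string(ops.size()) + " slow requests (by type [";
  for (const auto& [type, n] : by_type) {
    msg += " '" + type + "' : " + std::to_string(n);
  }
  msg += " ] most affected pool [ '" + top_pool + "' : " +
         std::to_string(top_count) + " ])";
  return msg;
}
```

Because the OSD sends only this one summary line per reporting interval, the monitor db grows by a constant amount per interval rather than linearly with the number of slow ops.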

@Matan-B (Contributor) left a comment

Minor comments added, overall it looks good.

Review comments on src/osd/OSD.cc (3) — resolved
@Matan-B (Contributor) left a comment

LGTM

Review comments on src/osd/OSD.cc (7), src/osd/OpRequest.cc (2), src/osd/OpRequest.h (1) — resolved
@ronen-fr (Contributor) left a comment

Please see the notes

@pdvian (Author) commented Dec 22, 2021

Please see the notes

Thanks Ronen. Let me work on your review comments.

@tchaikov (Contributor) commented Dec 22, 2021

IMHO, the monitor is supposed to keep track of critical information on which the cluster needs to have a consensus, but slow-op warnings do not fall into that category. I'd suggest we stop patching a solution that sends this information to the wrong place; let the mgr collect the metrics and produce structured warning reports. The mgr was created to offload burdens like this from the monitor a couple of years ago, and continuing in this way would be a step back.

@pdvian (Author) commented Jan 10, 2022

IMHO, the monitor is supposed to keep track of critical information on which the cluster needs to have a consensus, but slow-op warnings do not fall into that category. I'd suggest we stop patching a solution that sends this information to the wrong place; let the mgr collect the metrics and produce structured warning reports. The mgr was created to offload burdens like this from the monitor a couple of years ago, and continuing in this way would be a step back.

Totally agree! This requires incorporating the cluster log changes into the mgr and extending the health metrics to report aggregated slow requests. This PR is based on the current implementation of cluster logging. To reduce the burden on the monitor, we aggregate slow ops on the OSD side before sending them to the cluster log. We will improve cluster logging through the mgr with a new RFE in the near future.

Review comments on src/osd/OSD.cc and src/common/options/osd.yaml.in — resolved
@ronen-fr (Contributor) left a comment

Much improved. Let's get this done in time for this version.

Review comments on src/common/options/osd.yaml.in (1), src/osd/OSD.cc (1), src/osd/OpRequest.cc (2) — resolved
@pdvian (Author) commented Jan 12, 2022

@ronen-fr Kindly review the latest changes. Thanks!

@neha-ojha (Member) commented

jenkins test make check

@ljflores (Contributor) commented

jenkins test make check

@ljflores (Contributor) commented

The make check error does not seem related to this PR. I re-ran make check on an old draft PR of mine, and it failed there too: https://jenkins.ceph.com/job/ceph-pull-requests/88381/consoleFull#231334942c19247c4-fcb7-4c61-9a5d-7e2b9731c678

@neha-ojha (Member) commented

jenkins test make check

@ljflores (Contributor) commented

http://pulpito.front.sepia.ceph.com/yuriw-2022-01-14_23:22:09-rados-wip-yuri6-testing-2022-01-14-1207-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/yuriw-2022-01-17_17:05:17-rados-wip-yuri6-testing-2022-01-14-1207-distro-default-smithi/

Failures, unrelated:
https://tracker.ceph.com/issues/53843
https://tracker.ceph.com/issues/53872
https://tracker.ceph.com/issues/45721
https://tracker.ceph.com/issues/53807

Details:
Bug #53843: mgr/dashboard: Error - yargs parser supports a minimum Node.js version of 12. - Ceph - Mgr - Dashboard
Bug #53872: Errors detected in generated GRUB config file
Bug #45721: CommandFailedError: Command failed (workunit test rados/test_python.sh) FAIL: test_rados.TestWatchNotify.test
Bug #53807: Dead jobs in rados/cephadm/smoke-roleless{...} - Ceph - Orchestrator
