Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/telemetry: handle empty device report when "send" is triggered #44994

Merged

Conversation

ljflores
Copy link
Contributor

@ljflores ljflores commented Feb 11, 2022

On certain environments, such as the "ceph-dev-docker" environment
(https://github.com/ricardoasmarques/ceph-dev-docker), the mgr
module is unable to fetch device metrics. As a result, the device
report generated by "gather_device_report()" in telemetry returns an empty dict.
This causes an AssertionError when the "send" function is triggered
(i.e. by running ceph telemetry status or ceph telemetry send),
and the module crashes.

The fix in this commit checks that the generated device report
contains metrics before trying to send it. If the device report
does not contain metrics (it returns an empty dict), the module
will log an appropriate message in the mgr log and not send the
device report.

If this scenario happens when running the ceph telemetry send command,
the user will additionally see this message:

Ceph report sent to https://telemetry.ceph.com/report
Unable to send device report: channel is on, but generated report was empty.

Fixes: https://tracker.ceph.com/issues/54250
Signed-off-by: Laura Flores lflores@redhat.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@ljflores ljflores force-pushed the wip-telemetry-device-assertion-failure branch from 3953a08 to d44a98b Compare February 11, 2022 19:59
@ljflores ljflores added telemetry needs-quincy-backport backport required for quincy labels Feb 11, 2022
@ljflores
Copy link
Contributor Author

Looks like the assert devices line in send() was added in 964dd93 when typing annotations were introduced, but it was never backported. It's not present in Pacific, so I think this only needs a Quincy backport. I'll still check to see if it reproduces on Pacific though.

Copy link
Member

@nizamial09 nizamial09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ljflores it works now!

src/pybind/mgr/telemetry/module.py Show resolved Hide resolved
Copy link
Contributor

@yaarith yaarith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljflores looks good; I had a couple of comments

src/pybind/mgr/telemetry/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/telemetry/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/telemetry/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/telemetry/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/telemetry/module.py Outdated Show resolved Hide resolved
@yaarith
Copy link
Contributor

yaarith commented Feb 16, 2022

adding a note here that this should not be backported since the assert was introduced with the typing changes to the mgr (not available prior to Quincy)

Copy link
Contributor

@yaarith yaarith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's squash the fixes

@ljflores ljflores force-pushed the wip-telemetry-device-assertion-failure branch from ea14350 to ff53805 Compare February 16, 2022 17:46
On certain environments, such as the "ceph-dev-docker" environment
(https://github.com/ricardoasmarques/ceph-dev-docker), the mgr
module is unable to fetch device metrics. As a result, the device
report generated by "gather_device_report()" returns an empty dict.
This causes an AssertionError when the "send" function is triggered
(i.e. by running `ceph telemetry status` or `ceph telemetry send`),
and the module crashes.

The fix in this commit checks that the generated device report
contains metrics before trying to send it. If the device report
does not contain metrics (it returns an empty dict), the module
will log an appropriate message in the mgr log and not send the
device report.

If this scenario happens when running the `ceph telemetry send` command,
the user will additionally see this message:
```
Ceph report sent to https://telemetry.ceph.com/report
Unable to send device report: channel is on, but generated report was empty.
```

I also added a few more debug messages in gather_device_report() to make
future debugging easier.

Fixes: https://tracker.ceph.com/issues/54250
Signed-off-by: Laura Flores <lflores@redhat.com>
@ljflores ljflores force-pushed the wip-telemetry-device-assertion-failure branch from ff53805 to 54e0e58 Compare February 16, 2022 19:39
@ljflores
Copy link
Contributor Author

Failures, unrelated:
https://tracker.ceph.com/issues/54307
https://tracker.ceph.com/issues/54306
https://tracker.ceph.com/issues/52124

Details:
Bug_#54307: test_cls_rgw.sh: 'index_list_delimited' test times out - Ceph - RGW
Bug_#54306: tasks.cephfs.test_nfs.TestNFS.test_create_multiple_exports: AssertionError: NFS Ganesha cluster deployment failed - Ceph - Orchestrator
Bug_#52124: Invalid read of size 8 in handle_recovery_delete() - Ceph - RADOS

@ljflores ljflores merged commit b62d712 into ceph:master Feb 21, 2022
@ljflores ljflores deleted the wip-telemetry-device-assertion-failure branch February 21, 2022 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants