Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pacific: os/bluestore: update perf counter priorities #47095

Merged
merged 1 commit into from Jul 20, 2022

Conversation

aaSharma14
Copy link
Contributor

@aaSharma14 aaSharma14 commented Jul 14, 2022

⚠️ Note: This backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus. Given Pacific is a stable release and not to add that much extra load we are adapting this backport and only exposing the 2 required perf-counters


partial backport tracker: https://tracker.ceph.com/issues/56559


partial backport of #43405
parent tracker: https://tracker.ceph.com/issues/56558

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@aaSharma14 aaSharma14 requested a review from a team as a code owner July 14, 2022 09:55
@aaSharma14 aaSharma14 added this to the pacific milestone Jul 14, 2022
@github-actions github-actions bot added the core label Jul 14, 2022
@epuertat epuertat requested a review from ljflores July 14, 2022 10:21
@epuertat
Copy link
Member

@ljflores we need to backport this PR to expose bluestore_onode_hits and bluestore_onode_misses for #44650 (that PR is broken without this backport).

However, I counted that this backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus, which for the 8k OSD stretch goal means +272,000 new samples/scrape. Given Pacific is a stable release, I wouldn't want to add that much extra load, so what about adapting this backport and only exposing the 2 required perf-counters?

@aaSharma14 aaSharma14 added the backport: no-conflicts Backport without conflicts label Jul 14, 2022
@ljflores
Copy link
Contributor

ljflores commented Jul 15, 2022

@ljflores we need to backport this PR to expose bluestore_onode_hits and bluestore_onode_misses for #44650 (that PR is broken without this backport).

However, I counted that this backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus, which for the 8k OSD stretch goal means +272,000 new samples/scrape. Given Pacific is a stable release, I wouldn't want to add that much extra load, so what about adapting this backport and only exposing the 2 required perf-counters?

@epuertat That approach sounds okay to me. @aaSharma14 we should be sure to write an explanation for this in the commit message.

@aaSharma14 aaSharma14 force-pushed the wip-56559-pacific branch 3 times, most recently from 8ae578e to a48a4c2 Compare July 15, 2022 09:24
@aaSharma14
Copy link
Contributor Author

@ljflores we need to backport this PR to expose bluestore_onode_hits and bluestore_onode_misses for #44650 (that PR is broken without this backport).
However, I counted that this backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus, which for the 8k OSD stretch goal means +272,000 new samples/scrape. Given Pacific is a stable release, I wouldn't want to add that much extra load, so what about adapting this backport and only exposing the 2 required perf-counters?

@epuertat That approach sounds okay to me. @aaSharma14 we should be sure to write an explanation for this in the commit message.

Thank you @ljflores , I have added a note to the commit message for the change.

@aaSharma14 aaSharma14 removed the backport: no-conflicts Backport without conflicts label Jul 15, 2022
@aaSharma14
Copy link
Contributor Author

jenkins test api

Copy link
Member

@epuertat epuertat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT. Thanks @aaSharma14 !

@ljflores
Copy link
Contributor

jenkins test api

@epuertat epuertat added this to In progress in Dashboard via automation Jul 15, 2022
@epuertat
Copy link
Member

API tests are failing because of this pesky issue with the python-common/ceph package:

ImportError: cannot import name 'OSDMethod' from 'ceph.deployment.drive_group' (/usr/lib/python3/dist-packages/ceph/deployment/drive_group.py)

@yuriw I don't think we have any QA test that covers this code, so it won't make any difference.

@epuertat
Copy link
Member

jenkins test api

@ljflores ljflores requested a review from neha-ojha July 15, 2022 18:56
Copy link
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial backport of 8790f04, but nothing in the commit message explains that. Can we please add an explanation in the commit message to explain why we only care about these two perf counters in pacific?

@aaSharma14
Copy link
Contributor Author

This is a partial backport of 8790f04, but nothing in the commit message explains that. Can we please add an explanation in the commit message to explain why we only care about these two perf counters in pacific?

Thanks @neha-ojha , I have put a note in the commit message for the change.

@epuertat
Copy link
Member

@neha-ojha @ljflores can we have this merged plz? It has no impact (just 2 new bluestore perfcounters exported/OSD) and it's required for unblocking a downstream release. Thanks!

@neha-ojha
Copy link
Member

This is a partial backport of 8790f04, but nothing in the commit message explains that. Can we please add an explanation in the commit message to explain why we only care about these two perf counters in pacific?

Thanks @neha-ojha , I have put a note in the commit message for the change.

@aaSharma14 I meant adding it to the git commit message, something you can see using git log

@epuertat
Copy link
Member

This is a partial backport of 8790f04, but nothing in the commit message explains that. Can we please add an explanation in the commit message to explain why we only care about these two perf counters in pacific?

Thanks @neha-ojha , I have put a note in the commit message for the change.

@aaSharma14 I meant adding it to the git commit message, something you can see using git log

@neha-ojha it's in the commit (probably a blank line in between would have helped too):

Note: This backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus. Given Pacific is a stable release and not to add that much extra load we are adapting this backport and only exposing the 2 required perf-counters

@neha-ojha
Copy link
Member

@neha-ojha @ljflores can we have this merged plz? It has no impact (just 2 new bluestore perfcounters exported/OSD) and it's required for unblocking a downstream release. Thanks!

This is a partial backport of 8790f04, but nothing in the commit message explains that. Can we please add an explanation in the commit message to explain why we only care about these two perf counters in pacific?

Thanks @neha-ojha , I have put a note in the commit message for the change.

@aaSharma14 I meant adding it to the git commit message, something you can see using git log

@neha-ojha it's in the commit (probably a blank line in between would have helped too):

yeah, something to differentiate between the original commit message and this addition will be helpful, maybe add it after the "cherry picked from commit" line like we call out conflicts?

@ljflores how is testing looking on this PR?

Note: This backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus. Given Pacific is a stable release and not to add that much extra load we are adapting this backport and only exposing the 2 required perf-counters

@ljflores
Copy link
Contributor

ljflores commented Jul 19, 2022

@ljflores how is testing looking on this PR?

The Trello card for it is here: https://trello.com/c/62n7lFew/1583-wip-yuri2-testing-2022-07-15-0755-pacific. Yuri has not pasted a link to the test runs yet, but I have taken it on to review the card as soon as it's ready.

@neha-ojha I don't think this would break anything given that it is only changing perf counter priorities, but would you still recommend waiting for the tests to run?

@ljflores
Copy link
Contributor

@epuertat Yuri has already triggered rados suite tests, so they should be done running in several hours. I will prioritize reviewing it so the downstream release isn't blocked for too much longer.

I'm pretty sure this wouldn't have an impact on anything as you said, but I'd feel better confirming it so we're not introducing any breakages in pacific.

@epuertat
Copy link
Member

@epuertat Yuri has already triggered rados suite tests, so they should be done running in several hours. I will prioritize reviewing it so the downstream release isn't blocked for too much longer.

I'm pretty sure this wouldn't have an impact on anything as you said, but I'd feel better confirming it so we're not introducing any breakages in pacific.

@ljflores ok, let's wait till then. Thanks a lot!

These perf counters do not show up in telemetry unless they are set to a "useful" priority or higher. Fetching these counters in telemetry may help to diagnose problems with RocksDB / BlueFS prefetching / insufficient cache sizes.

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 8790f04)

Note: This backport (and the original PR) exposed an extra 34 perf counters/OSD to Prometheus. Given Pacific is a stable release and not to add that much extra load we are adapting this backport and only exposing the 2 required perf-coun>
@aaSharma14
Copy link
Contributor Author

jenkins test dashboard cephadm

@ljflores
Copy link
Contributor

@epuertat we are good to go!

Rados suite results: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2022-07-15-0755-pacific

Failures, unrelated:
1. https://tracker.ceph.com/issues/52124
2. https://tracker.ceph.com/issues/53294
3. https://tracker.ceph.com/issues/53501
4. https://tracker.ceph.com/issues/55347
5. https://tracker.ceph.com/issues/53827
6. https://tracker.ceph.com/issues/49287
7. https://tracker.ceph.com/issues/53939
8. https://tracker.ceph.com/issues/43584
9. https://tracker.ceph.com/issues/52321
10. https://tracker.ceph.com/issues/56652 -- new Tracker; unrelated to PRs in this run
11. https://tracker.ceph.com/issues/49754
12. https://tracker.ceph.com/issues/55809
13. https://tracker.ceph.com/issues/54992
14. https://tracker.ceph.com/issues/54071

Details:
1. Invalid read of size 8 in handle_recovery_delete() - Ceph - RADOS
2. rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush - Ceph - RADOS
3. Exception when running 'rook' task. - Ceph - Orchestrator
4. SELinux Denials during cephadm/workunits/test_cephadm - Ceph - Orchestrator
5. cephadm exited with error code when creating osd: Input/Output error. Faulty NVME? - Ceph - Orchestrator
6. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
7. ceph-nfs-upgrade, pacific: Upgrade Paused due to UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.0 on host smithi103 failed - Ceph - Orchestrator
8. MON_DOWN during mon_join process - Ceph - RADOS -- pending pacific backport
9. qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Ceph - Rook
10. pacific: cephadm/test_repos.sh: rllib.error.HTTPError: HTTP Error 504: Gateway Timeout - Ceph - Orchestrator
11. osd/OSD.cc: ceph_abort_msg("abort() called") during OSD::shutdown() - Ceph - RADOS
12. "Leak_IndirectlyLost" valgrind report on mon.c - Ceph - RADOS
13. cannot stat '/etc/containers/registries.conf': No such file or directory - Ceph - Orchestrator
14. rados/cephadm/osds: Invalid command: missing required parameter hostname() - Ceph - Orchestrator

@epuertat
Copy link
Member

Thanks @ljflores. I'll proceed to merge this.

@epuertat epuertat merged commit c1e9a28 into ceph:pacific Jul 20, 2022
10 checks passed
Dashboard automation moved this from In progress to Done Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Dashboard
  
Done
6 participants