Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: counter dump command revision #51947

Merged
merged 2 commits into from Jul 14, 2023

Conversation

alimaredia
Copy link
Contributor

@alimaredia alimaredia commented Jun 6, 2023

counter dump now emits an array of <labels,counters> pairs for each individual key.

Fixes: https://tracker.ceph.com/issues/61587

Example of format change:

   "rgw-put-op": {
      "labels": {
          "Bucket": "bkt"
      },
      "counters": {
          "put_b": 23436,
      }
  },
  "rgw-put-op": {
      "labels": {
          "Bucket": "bkt2"
      },
      "counters": {
          "put_b": 17577,
      }
  }

Becomes:

  "rgw-put-op": [
      {
          "labels": {
              "Bucket": "bkt"
          },
          "counters": {
              "put_b": 23436,
          }
      },
      {
          "labels": {
              "Bucket": "bkt2"
          },
          "counters": {
              "put_b": 17577,
          }
      }
  ],

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@alimaredia alimaredia requested a review from a team as a code owner June 6, 2023 19:44
@alimaredia alimaredia changed the title common: counter dump command revision [DRAFT] common: counter dump command revision Jun 6, 2023
doc/dev/perf_counters.rst Outdated Show resolved Hide resolved
doc/dev/perf_counters.rst Outdated Show resolved Hide resolved
doc/dev/perf_counters.rst Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
src/common/perf_counters.cc Outdated Show resolved Hide resolved
@alimaredia
Copy link
Contributor Author

@idryomov thank you for the review and the patch to refactor the change I made in src/common/perf_counters.cc. I applied that patch and made the other doc related change in a follow-up commit.

If everything looks fine I'll squash the newest commit and add the exporter patch @avanthakkar is working on when it's done. We want that change in this PR right?

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You adjusted existing unit tests, but a case with more than one item in the array -- the entire point of this PR -- is not covered at all...

Also, see the conversation I left unresolved regarding SimpleTest.

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit, looks good to squash! @avanthakkar How is the ceph-exporter adjustment coming?

src/test/perf_counters.cc Outdated Show resolved Hide resolved
@avanthakkar
Copy link
Contributor

Just one nit, looks good to squash! @avanthakkar How is the ceph-exporter adjustment coming?

I have the patch ready, should I commit it to this PR itself? @idryomov @alimaredia

@idryomov
Copy link
Contributor

I have the patch ready, should I commit it to this PR itself? @idryomov @alimaredia

I'd just point Ali at it, he would be making the final fixup/squashing and can pick your commit then.

`counter dump` now emits an array of
<labels,counters> pairs for each individual key.

Commit includes revisions to perf counters unit
test.

Fixes: https://tracker.ceph.com/issues/61587
Signed-off-by: Ali Maredia <amaredia@redhat.com>
src/exporter/DaemonMetricCollector.cc Outdated Show resolved Hide resolved
src/exporter/DaemonMetricCollector.cc Outdated Show resolved Hide resolved
@alimaredia
Copy link
Contributor Author

@avanthakkar for any modifications to your exporter commit you can just push them to this PR.

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit "exporter: adopt counter dump command output" has a stray "cherry picked from" annotation -- this is the original submission to main, not a backport.

src/exporter/DaemonMetricCollector.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prometheus endpoint output seems wrong, plus the stray "cherry picked from" annotation is still there.

src/exporter/DaemonMetricCollector.cc Outdated Show resolved Hide resolved
src/exporter/DaemonMetricCollector.cc Outdated Show resolved Hide resolved
src/exporter/DaemonMetricCollector.cc Outdated Show resolved Hide resolved
@avanthakkar
Copy link
Contributor

Prometheus endpoint output seems wrong, plus the stray "cherry picked from" annotation is still there.

Yeah seems like when refactoring it missed the extra "_" here
std::string counter_name = perf_group + "_" + counter_name_init;

@avanthakkar
Copy link
Contributor

>> 
counter dump:
"rbd_mirror_snapshot_image": [
        {
            "labels": {
                "image": "image1",
                "namespace": "",
                "pool": "data"
            },
            "counters": {
                "snapshots": 1,
                "sync_time": {
                    "avgcount": 1,
                    "sum": 0.961036955,
                    "avgtime": 0.961036955
                },
                "sync_bytes": 52428800,
                "remote_timestamp": 1686811917.260234241,
                "local_timestamp": 1686811917.260234241,
                "last_sync_time": 0.961036955,
                "last_sync_bytes": 52428800
            }
        },
        {
            "labels": {
                "image": "image2",
                "namespace": "",
                "pool": "data"
            },
            "counters": {
                "snapshots": 1,
                "sync_time": {
                    "avgcount": 1,
                    "sum": 1.398823385,
                    "avgtime": 1.398823385
                },
                "sync_bytes": 52428800,
                "remote_timestamp": 1686811921.763860194,
                "local_timestamp": 1686811921.763860194,
                "last_sync_time": 1.398823385,
                "last_sync_bytes": 52428800
            }
        }
    ],

/metrics endpoint:
https://paste.sh/UKXuJg7p#9_0-fEVZkjKrddlUOVqfpQVs

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change itself looks good but I don't understand why

                "remote_timestamp": 1686811917.260234241,
                "local_timestamp": 1686811917.260234241,
                "last_sync_time": 0.961036955,

and

                "remote_timestamp": 1686811921.763860194,
                "local_timestamp": 1686811921.763860194,
                "last_sync_time": 1.398823385,

became

ceph_rbd_mirror_snapshot_image_local_timestamp{ceph_daemon="client.admin.322498",image="image1",namespace="",pool="data"} 1.686812
ceph_rbd_mirror_snapshot_image_remote_timestamp{ceph_daemon="client.admin.322498",image="image1",namespace="",pool="data"} 1.686812
ceph_rbd_mirror_snapshot_image_last_sync_time{ceph_daemon="client.admin.322498",image="image1",namespace="",pool="data"} 0.000000

and

ceph_rbd_mirror_snapshot_image_local_timestamp{ceph_daemon="client.admin.322498",image="image2",namespace="",pool="data"} 1.686812
ceph_rbd_mirror_snapshot_image_remote_timestamp{ceph_daemon="client.admin.322498",image="image2",namespace="",pool="data"} 1.686812
ceph_rbd_mirror_snapshot_image_last_sync_time{ceph_daemon="client.admin.322498",image="image2",namespace="",pool="data"} 0.000000

respectively. Note the last_sync_time discrepancy for image2 in particular.

@idryomov
Copy link
Contributor

idryomov commented Jun 15, 2023

@avanthakkar This is a third time in a row that you are presenting something that is obviously broken as ready to merge. Please review the code and compare the outputs yourself carefully before pushing/posting.

@idryomov
Copy link
Contributor

I suspect that this

  } else if (type & PERFCOUNTER_TIME) {
    if (perf_values.is_int64()) {
      double value = perf_values.as_int64() / 1000000000.0f;
      add_metric(builder, value, name, description, metric_type, labels);
    } else if (perf_values.is_double()) {
      double value = perf_values.as_double() / 1000000000.0f;
      add_metric(builder, value, name, description, metric_type, labels);
    }

is the problem. Why is the value getting divided by 10^9 here?

@avanthakkar
Copy link
Contributor

avanthakkar commented Jun 15, 2023

1000000000

Well it's converting ns to s, similar to what mgr_module is doing here ,
https://github.com/ceph/ceph/blob/main/src/pybind/mgr/mgr_module.py#L1411
so I don't see a issue here. The output seemed fine to me so I pasted!

@avanthakkar
Copy link
Contributor

jenkins test api

@idryomov
Copy link
Contributor

Well it's converting ns to s, similar to what mgr_module is doing here , https://github.com/ceph/ceph/blob/main/src/pybind/mgr/mgr_module.py#L1411

The manager is getting perf counters over the wire, so there is binary encoding involved. Here, counter dump emits text and does that transformation internally:

} else if (d->type & PERFCOUNTER_TIME) {
f->dump_format_unquoted(d->name, "%" PRId64 ".%09" PRId64,
v / 1000000000ull,
v % 1000000000ull);
} else {

so I don't see a issue here. The output seemed fine to me so I pasted!

If you are assuming a nanosecond unit for "time" values, then the output is telling you that rbd-mirror daemon managed to transfer 50M of data in 0.961 nanoseconds for image1 and 1.399 nanoseconds for image2. That would be very nice but such networks haven't been invented yet :-)

@avanthakkar
Copy link
Contributor

Well it's converting ns to s, similar to what mgr_module is doing here , https://github.com/ceph/ceph/blob/main/src/pybind/mgr/mgr_module.py#L1411

The manager is getting perf counters over the wire, so there is binary encoding involved. Here, counter dump emits text and does that transformation internally:

Yeah makes sense. I'll remove the transformation. Thanks for pointing out

} else if (d->type & PERFCOUNTER_TIME) {
f->dump_format_unquoted(d->name, "%" PRId64 ".%09" PRId64,
v / 1000000000ull,
v % 1000000000ull);
} else {

so I don't see a issue here. The output seemed fine to me so I pasted!

If you are assuming a nanosecond unit for "time" values, then the output is telling you that rbd-mirror daemon managed to transfer 50M of data in 0.961 nanoseconds for image1 and 1.399 nanoseconds for image2. That would be very nice but such networks haven't been invented yet :-)

@avanthakkar
Copy link
Contributor

>> counter dump

"rbd_mirror_snapshot_image": [
        {
            "labels": {
                "image": "image1",
                "namespace": "",
                "pool": "data"
            },
            "counters": {
                "snapshots": 1,
                "sync_time": {
                    "avgcount": 1,
                    "sum": 0.696087588,
                    "avgtime": 0.696087588
                },
                "sync_bytes": 52428800,
                "remote_timestamp": 1687208448.863646796,
                "local_timestamp": 1687208448.863646796,
                "last_sync_time": 0.696087588,
                "last_sync_bytes": 52428800
            }
        },
        {
            "labels": {
                "image": "image2",
                "namespace": "",
                "pool": "data"
            },
            "counters": {
                "snapshots": 1,
                "sync_time": {
                    "avgcount": 1,
                    "sum": 0.612760269,
                    "avgtime": 0.612760269
                },
                "sync_bytes": 52428800,
                "remote_timestamp": 1687208451.804576802,
                "local_timestamp": 1687208451.804576802,
                "last_sync_time": 0.612760269,
                "last_sync_bytes": 52428800
            }
        }
    ],

>>> /metrics endpoint
https://paste.sh/SMKBCQNM#itjl3NIGp16-ptiyUoVusXKc

@avanthakkar avanthakkar force-pushed the wip-counter-dump-array-add branch 2 times, most recently from a3ef4ca to 8013655 Compare June 19, 2023 21:09
@avanthakkar
Copy link
Contributor

Well it's converting ns to s, similar to what mgr_module is doing here , https://github.com/ceph/ceph/blob/main/src/pybind/mgr/mgr_module.py#L1411

The manager is getting perf counters over the wire, so there is binary encoding involved. Here, counter dump emits text and does that transformation internally:

} else if (d->type & PERFCOUNTER_TIME) {
f->dump_format_unquoted(d->name, "%" PRId64 ".%09" PRId64,
v / 1000000000ull,
v % 1000000000ull);
} else {

so I don't see a issue here. The output seemed fine to me so I pasted!

If you are assuming a nanosecond unit for "time" values, then the output is telling you that rbd-mirror daemon managed to transfer 50M of data in 0.961 nanoseconds for image1 and 1.399 nanoseconds for image2. That would be very nice but such networks haven't been invented yet :-)

I've removed the condition now and pasted the output for metrics exported.

Copy link
Contributor

@avanthakkar avanthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format change for counter dump/schema LGTM!

Adopt the counter dump format changes in exporter for extracting the counters.
Removed the condition for `PERFCOUNTER_TIME` as counter dump already does
tranformation internally.

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
@avanthakkar avanthakkar changed the title [DRAFT] common: counter dump command revision common: counter dump command revision Jun 26, 2023
@avanthakkar
Copy link
Contributor

>> counter dump

"rbd_mirror_snapshot_image": [
        {
            "labels": {
                "image": "image1",
                "namespace": "",
                "pool": "data"
            },
            "counters": {
                "snapshots": 2,
                "sync_time": {
                    "avgcount": 2,
                    "sum": 1.272142924,
                    "avgtime": 0.636071462
                },
                "sync_bytes": 73220096,
                "remote_timestamp": 1687797229.414632918,
                "local_timestamp": 1687797229.414632918,
                "last_sync_time": 0.934794481,
                "last_sync_bytes": 20791296
            }
        },
        {
            "labels": {
                "image": "image2",
                "namespace": "",
                "pool": "data"
            },
            "counters": {
                "snapshots": 2,
                "sync_time": {
                    "avgcount": 2,
                    "sum": 1.309814403,
                    "avgtime": 0.654907201
                },
                "sync_bytes": 73207808,
                "remote_timestamp": 1687797234.506100594,
                "local_timestamp": 1687797234.506100594,
                "last_sync_time": 0.926469400,
                "last_sync_bytes": 20779008
            }
        }
    ],

>> /metrics endpoint
https://paste.sh/u_kJWGFx#P9UbPHmZlyowuF_pK2wMmRrW

@idryomov
Copy link
Contributor

jenkins test windows

@avanthakkar
Copy link
Contributor

@idryomov Are you running qa job?

@idryomov
Copy link
Contributor

No, I just set needs-qa label -- it's now in @yuriw's hands.

@avanthakkar
Copy link
Contributor

No, I just set needs-qa label -- it's now in @yuriw's hands.

@idryomov Shouldn't shaman build be enough for testing these changes?

@idryomov
Copy link
Contributor

idryomov commented Jul 3, 2023

@idryomov Shouldn't shaman build be enough for testing these changes?

This is for @ceph/core to decide.

@yuriw yuriw merged commit c57faa6 into ceph:main Jul 14, 2023
12 checks passed
@yuriw
Copy link
Contributor

yuriw commented Jul 14, 2023

From @kamoltat ref: https://trello.com/c/7y6uj4bo

RADOS approved, all failures/dead jobs are unrelated and known failures.

Failures:

7332480 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Dashboard - Ceph -> cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard
7332558 -> Bug #57302: ERROR: test_get_status (tasks.mgr.dashboard.test_cluster.ClusterTest) - Dashboard - Ceph -> Test failure: test_create_access_permissions (tasks.mgr.dashboard.test_pool.PoolTest)
7332565 -> Bug #57754: test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure - Ceph -> test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
7332612 -> Bug #57754: test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure - Ceph -> test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
7332613 -> Bug #55347: SELinux Denials during cephadm/workunits/test_cephadm - Infrastructure - Ceph -> SELinux Denials during cephadm/workunits/test_cephadm
7332636 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Dashboard - Ceph -> cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard

Deads:

7332357 -> Bug #61164: Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds - Infrastructure - Ceph -> Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds
7332405 -> Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph -> rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"
7332559 -> Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph -> rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants