Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[collectd 6] disk plugin: Align metrics with OpenTelemetry recommendations. #4217

Merged
merged 11 commits into from
Jan 22, 2024

Conversation

octo
Copy link
Member

@octo octo commented Dec 29, 2023

  • Rename labels to system.device and disk.io.direction.
  • Rename system.disk.time to system.disk.operation_time.
  • Add descriptions and units to all metric families.
  • Add the "utilization" metric to FreeBSD.

ChangeLog: Disk plugin: The metric schema has been aligned with OpenTelemetry semantic conventions.

@octo octo requested a review from a team as a code owner December 29, 2023 09:08
@collectd-bot collectd-bot added this to the 6.0 milestone Dec 29, 2023
@octo octo added this to In progress in collectd 6 via automation Dec 29, 2023
@octo octo modified the milestones: 6.0, 6 MVP Jan 3, 2024
@octo octo added Feature Automerge Labels PRs to be merged by a bot once approved labels Jan 11, 2024
@octo octo force-pushed the 6/disk branch 2 times, most recently from f78bdb7 to 96a53a6 Compare January 15, 2024 19:25
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general code changes look OK, but:

  • I think it would be better to handle disk: add /proc/diskstats fields 15 to 20 in KERNEL_LINUX #4242 in main, and merge related main changes to disk.c before this PR.

  • Reporting (some) time units as seconds (s) is IMHO not good while collectd counters are integers i.e. do not support subsecond timings. Meaning that one does not see from counters very small usage spikes i.e. is disk completely idle or not.

  • Time manipulation for values appended to fam_ops_time is confusing as they can be multiplied or divided by different values, and variable names do not indicate their time units (like they IMHO should).

src/disk.c Outdated Show resolved Hide resolved
src/disk.c Outdated Show resolved Hide resolved
src/disk.c Outdated Show resolved Hide resolved
src/disk.c Outdated Show resolved Hide resolved
@octo
Copy link
Member Author

octo commented Jan 19, 2024

The v6 code is so different that I don't think resolving the resulting merge conflict is any easier than reimplementing the feature. Writing the metric families is more effort than plumbing the counters through.

Ok, fair enough.

I have changed all platforms to report microseconds, and changed the metric family unit to reflect that.
...
I have added appropriate suffixes to all variable and field names.

Thanks, it's much more readable now!

(I.e. only remaining item is odd fam_disk_io_time -> fam_ops_time change.)

src/disk.c Outdated Show resolved Hide resolved
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value wrap-up calculations in disk.c are broken, see: grep UINT_MAX disk.c

They deduct 64-bit (signed) derive_t value from 32-bit (unsigned) UINT_MAX value.

Maybe collectd should have DERIVE_T_MAX define for this (both in v5 & v6 branches)?

Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. I'll file separate bug about the UINT_MAX vs derive_t issue, as it's already in the v5 version of disk plugin, and not introduced by these changes.

@octo
Copy link
Member Author

octo commented Jan 22, 2024

Value wrap-up calculations in disk.c are broken, see: grep UINT_MAX disk.c

They deduct 64-bit (signed) derive_t value from 32-bit (unsigned) UINT_MAX value.

Maybe collectd should have DERIVE_T_MAX define for this (both in v5 & v6 branches)?

That's an excellent callout, thank you! We actually have a function that is doing this right, counter_diff(). I ended up refactoring a lot of the Linux code; being one of the oldest pieces of code in collectd it was quite in need of a facelift.

After a lot of fiddling I figured out that we calculated the average time per operation … just for that variable never to be read. I think the existing metrics are much better at conveying useful information about the disk, so I remove the calculcation and made the plugin a good deal simpler.

We also unnecessarily stored a bunch of stuff in global state, which I have now removed.

@eero-t
Copy link
Contributor

eero-t commented Jan 22, 2024

Gah, read Linux diskstats documentation, and counters use native unsigned long (see the new bug).

I.e. they are indeed 64-bit, but only on 64-bit host. For 32-bit hosts, UINT_MAX was actually the correct maximum value.

@eero-t
Copy link
Contributor

eero-t commented Jan 22, 2024

Nice, counter_diff() already handles transparently unsigned 32-bit vs 64-bit wrap up => it will fix #4245 for v6.

The very best changes are ones that remove lots of code, and still manage retain the (relevant) functionality. :-)

On (very) quick check of the changes, they seemed OK, i.e. from my side this still seems fine for merging.

@eero-t
Copy link
Contributor

eero-t commented Jan 22, 2024

As to adding tests for disk.c (in another PR)...

I think that could be done for the Linux functionality by extracting that from disk_read() to another function, one that takes path to /proc/diskstats as an argument. Test code can then ask it to read pre-prepared regular file(s) instead, to verify that reported metric updates are as expected, and to verify that invalid file content is handled gracefully.

(CPU plugin test coverage could be similarly extended to parsing of /proc/stats file. At that point some helper functions for dealing with temporary test files could be useful.)

@octo
Copy link
Member Author

octo commented Jan 22, 2024

As to adding tests for disk.c (in another PR)...

I think we need to separate I/O and parsing. Maybe a schema along the lines of:

int parse(char const *data, metric_family_t families[]);

int read_callback(void) {
  char *buffer = read_text_file_contents();
  parse(buffer);
}

This will allow us to test parse() without using I/O.

We should also split plugins into multiple files, one for each operating system we support. E.g. disk.c, disk_linux.c, disk_freebsd.c etc. The current approach is hard to read and hard to maintain.

octo and others added 9 commits January 22, 2024 19:28
* Rename labels to `system.device` and `disk.io.direction`.
* Rename `system.disk.time` to `system.disk.operation_time`.
* Add descriptions and units to all metric families.
* Add the "utilization" metric to FreeBSD.
Also detone the time scale in the variable names.
The fields do not mean what we thought they meant. "wtime" means "wait time",
"rtime" means "run time".

Fixes: collectd#3875 (for collectd 6)
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
…`, and `system.disk.pending_operations` for Solaris.
*   Use `counter_diff` to calculate counter differences.
*   Use `counter_t` as we actually want the counter overflow behavior for these metrics.
*   Remove the `has_...` fields from the global data struct.
*   Use `value_to_rate()` to calculate disk busyness.
*   Use `strtoull()` to parse counter values.
@octo octo merged commit 008ef61 into collectd:collectd-6.0 Jan 22, 2024
27 checks passed
collectd 6 automation moved this from In progress to Done Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Automerge Labels PRs to be merged by a bot once approved Feature
Projects
collectd 6
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants