
Emit metrics for distribution of number of rows per segment #12730

Merged

Conversation

@TSFenwick (Contributor) commented on Jul 1, 2022

Adds the ability to see the number of rows per segment.

Description

Adds metrics to help Druid users understand how many rows there are in a segment. As part of this, we report the average number of rows per segment and a distribution of segment row counts across predefined buckets.

Adding metrics

Buckets were chosen because the average alone isn't enough to understand segment row counts, since outliers can skew it. I looked into reporting min/max/median, but that would require too much memory: it means tracking the row count of every segment for every datasource. Buckets give a good picture of the distribution while only requiring a fixed number of counters.
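The fixed-counter idea above can be sketched as follows. This is a minimal illustration, not the actual SegmentRowCountDistribution implementation; the bucket bounds here are borrowed from the reviewer's suggestion later in this thread and are an assumption.

```java
import java.util.Arrays;

// Hypothetical sketch of a fixed-bucket row-count distribution.
// Memory is O(number of buckets), regardless of how many segments exist.
public class RowCountBuckets {
    // Upper bounds (exclusive) of each bucket; the final bucket is open-ended.
    private static final long[] UPPER_BOUNDS = {
        0, 10_000, 2_000_000, 4_000_000, 6_000_000, 10_000_000
    };
    // One counter per bound, plus one for the open-ended top bucket.
    private final int[] counts = new int[UPPER_BOUNDS.length + 1];

    private int bucketIndex(long numRows) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (numRows < UPPER_BOUNDS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS.length; // the 10M+ bucket
    }

    public void addSegment(long numRows) {
        counts[bucketIndex(numRows)]++;
    }

    public void removeSegment(long numRows) {
        counts[bucketIndex(numRows)]--;
    }

    public int[] snapshot() {
        return Arrays.copyOf(counts, counts.length);
    }
}
```

Incrementing on load and decrementing on drop keeps the histogram current without ever rescanning segments.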

Doing this in SegmentManager whenever we load or drop a segment avoids counting the number of rows (an expensive operation, since it scans the segment) more than once. The downside is handling lazily loaded segments: these metrics will not make sense if lazy loading is enabled.

We chose not to handle lazy loading, since it is hard to track a segment that is loaded twice (once lazily, then actually), and also hard to know whether a dropped segment was originally lazy loaded. Instead, this throws an exception if SegmentStatsMonitor is enabled while lazyLoadOnStart is true.
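The fail-fast behavior described above might look like this sketch. The class and message are illustrative assumptions; the actual Druid configuration wiring and property names differ.

```java
// Hypothetical sketch of the startup check: refuse to run the monitor
// when lazy loading is on, because row counts of lazily loaded segments
// are unknown at load time.
public class SegmentStatsMonitorConfigCheck {
    public static void validate(boolean monitorEnabled, boolean lazyLoadOnStart) {
        if (monitorEnabled && lazyLoadOnStart) {
            throw new IllegalStateException(
                "SegmentStatsMonitor does not support lazy loading: "
                + "row counts of lazily loaded segments are unknown at load time");
        }
    }
}
```

Failing at startup surfaces the misconfiguration immediately, rather than silently emitting misleading metrics.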


Key changed/added classes in this PR
  • SegmentRowCountDistribution.java
  • SegmentStatsMonitor.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

return counts of segments that have rowcount in a bucket size for a datasource
return average value of rowcount per segment in a datasource
added unit test
naming could use a lot of work
buckets right now are not finalized
added javadocs
altered metrics.md
@TSFenwick (Contributor, Author):
The bucket sizing hasn't been decided yet so please give lots of input on those

@TSFenwick changed the title from "initial commit of bucket dimensions for number of rows in a segment" to "num of rows in a segment dimensions" on Jul 1, 2022
Review thread on docs/operations/metrics.md (outdated, resolved)
@suneet-s (Contributor) left a comment:

I really like this bucket idea that shows the distribution of row count in segments!

I've provided some naming suggestions, but I'm not married to those names; however, I do think the current metric names need some workshopping.

Since this metric is new and not applicable in all deployments (e.g. clusters with lazy loading), what are your thoughts on introducing this as a new monitor so that Druid operators can opt in to it? As written, there is no escape hatch in case it starts producing too many metrics for clusters with thousands of very small datasources, where this metric is maybe not as important.

EDIT: As an alternative to making this a separate monitor, we could introduce a property that defines the buckets, set to null by default, so that cluster operators have flexibility in choosing the bucket sizes. A cluster operator could then add something like

druid.historical.metrics.segments.ranges=0,10000,2000000,4000000,6000000,10000000 to create buckets < 0, 0-10k, 10k-2M, 2M-4M, 4M-6M, 6M-10M, 10M+
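Parsing such a property could look like the sketch below. Both the property name and this parser are hypothetical, taken from the suggestion above; they are not part of Druid's actual configuration API.

```java
import java.util.Arrays;

// Hypothetical parser for a comma-separated list of bucket upper bounds,
// e.g. "0,10000,2000000". Returns null when the property is unset, which
// would leave the distribution metric disabled by default.
public class RangeConfigParser {
    public static long[] parseRanges(String value) {
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        long[] bounds = Arrays.stream(value.split(","))
                .mapToLong(s -> Long.parseLong(s.trim()))
                .toArray();
        // Reject misordered bounds early rather than emitting nonsense buckets.
        for (int i = 1; i < bounds.length; i++) {
            if (bounds[i] <= bounds[i - 1]) {
                throw new IllegalArgumentException(
                    "bucket bounds must be strictly increasing");
            }
        }
        return bounds;
    }
}
```

Validating at parse time mirrors the fail-fast choice elsewhere in this PR: a bad bucket list should stop startup, not produce misleading metrics.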

Review threads on docs/operations/metrics.md (outdated, resolved)
add monitor test
move added functionality to new monitor
update docs
@maytasm (Contributor) left a comment:

+1 on the overall design. I like the idea of creating a property that defines the buckets, set to null by default. Please also verify that this works when encountering tombstone segments.

@suneet-s (Contributor) left a comment:

+1 after CI

Review thread on docs/operations/metrics.md (outdated, resolved)
@TSFenwick (Contributor, Author):

@suneet-s made changes to get tests to pass. Please take a look

@suneet-s (Contributor):

Thanks @TSFenwick !

@suneet-s changed the title from "num of rows in a segment dimensions" to "Emit metrics for distribution of number of rows per segment" on Jul 12, 2022
@suneet-s suneet-s merged commit 8c02880 into apache:master Jul 12, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022