-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[Sampling] Improving random sampling performance by lazily calling getSamplingConfiguration() #137223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Sampling] Improving random sampling performance by lazily calling getSamplingConfiguration() #137223
Conversation
|
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes the performance of random sampling by reducing unnecessary calls to getSamplingConfiguration(). The optimization addresses two common scenarios: (1) when an index has reached its sample limit, and (2) when an index has no sampling configuration. Performance improvements show a reduction from ~1000ns/doc to ~55ns/doc in testing.
Key changes:
- Defers
getSamplingConfiguration()call until after theisFullcheck to avoid lookups when sample limits are reached - Introduces a
SampleInfo.NONEmarker to cache the absence of sampling configuration, eliminating repeated lookups for unconfigured indexes - Updates
clusterChanged()to properly remove the NONE marker when a new sampling configuration is added
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| SamplingService.java | Implements lazy configuration lookup with NONE marker caching and moves expensive configuration call after capacity checks |
| SamplingServiceTests.java | Adds test coverage for NONE marker caching behavior before cluster state notification |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
seanzatzdev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
In doing some profiling and looking at sampling stats, I realized that calls to getSamplingConfiguration() were relatively expensive. I'm assuming the most common cases by far are:
For 1, I moved the call to getSamplingConfiguration() to after the
isFullcheck. So when we're not getting past the isFull check, we just never call getSamplingConfiguration().For 2, I added a NONE SampleInfo marker, letting us know that there is no need to lookup the configuration for this index. If a configuration is later added for the index, the clusterChanged() method removes this NONE sample.
The script I was running to profile this does the following:
Creates a sampling configuration for index1
Bulk loads dozens of entries into index1 and index2 tens of thousands of times
Creates a sampling configuration for index2
Bulk loads some entries into index2.
With this change, it looks much better in the profiler -- I never see calls to getSamplingConfiguration() now. And the reported stats from sampling go from ~1000ns/doc to ~55ns/doc (on my laptop with this particular test).