@masseyke
Member

In doing some profiling and looking at sampling stats, I realized that calls to getSamplingConfiguration() were relatively expensive. I'm assuming the most common cases by far are:

  1. An index is configured for sampling, but it has already reached its max samples limit
  2. An index is not configured for sampling at all, but some other index is

For 1, I moved the call to getSamplingConfiguration() to after the isFull check. So when we're not getting past the isFull check, we just never call getSamplingConfiguration().
For 2, I added a NONE SampleInfo marker, letting us know that there is no need to look up the configuration for this index. If a configuration is later added for the index, the clusterChanged() method removes this NONE marker.
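
To make the two shortcuts concrete, here is a minimal sketch of the pattern. It is not the code from SamplingService.java: the class `LazySamplingSketch`, the `SamplingConfig` record, the `maybeSample()` and `onConfigAdded()` methods, and the `lookups` counter are all hypothetical names made up for illustration, and `onConfigAdded()` only stands in for what clusterChanged() does when a configuration appears.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the sampling service; not the actual Elasticsearch code.
final class LazySamplingSketch {

    // Hypothetical per-index configuration; the real one carries more settings.
    record SamplingConfig(int maxSamples) {}

    // Per-index sample state. NONE marks an index known to have no configuration.
    static final class SampleInfo {
        static final SampleInfo NONE = new SampleInfo(0);

        private final int maxSamples;
        private final AtomicInteger count = new AtomicInteger();

        SampleInfo(int maxSamples) {
            this.maxSamples = maxSamples;
        }

        boolean isFull() {
            return count.get() >= maxSamples;
        }

        // Claims one sample slot, or returns false if the limit was already reached.
        boolean tryAdd() {
            while (true) {
                int current = count.get();
                if (current >= maxSamples) return false;
                if (count.compareAndSet(current, current + 1)) return true;
            }
        }
    }

    private final Map<String, SampleInfo> samples = new ConcurrentHashMap<>();
    // Stand-in for the expensive getSamplingConfiguration() call; returns null if unconfigured.
    private final Function<String, SamplingConfig> configLookup;
    // Counts expensive lookups so the effect of the two shortcuts is observable.
    final AtomicInteger lookups = new AtomicInteger();

    LazySamplingSketch(Function<String, SamplingConfig> configLookup) {
        this.configLookup = configLookup;
    }

    boolean maybeSample(String index) {
        SampleInfo info = samples.get(index);
        // Case 2: index already known to be unconfigured, so skip the lookup.
        if (info == SampleInfo.NONE) return false;
        // Case 1: sample is already full, so skip the lookup.
        if (info != null && info.isFull()) return false;

        // Only now pay for the configuration lookup.
        lookups.incrementAndGet();
        SamplingConfig config = configLookup.apply(index);
        if (config == null) {
            samples.put(index, SampleInfo.NONE); // cache the absence of a configuration
            return false;
        }
        if (info == null) {
            info = samples.computeIfAbsent(index, k -> new SampleInfo(config.maxSamples()));
        }
        return info.tryAdd();
    }

    // Stand-in for the clusterChanged() handling: drop the NONE marker (and only that marker)
    // so the next document for this index triggers a fresh lookup.
    void onConfigAdded(String index) {
        samples.remove(index, SampleInfo.NONE);
    }
}
```

In this sketch, once index1 is full or index2 has been marked NONE, further calls to `maybeSample()` return without touching `configLookup`, which is the behavior the profiler showed after the change. The two-argument `remove` ensures that only the NONE marker is dropped, never a real sample.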

The script I was running to profile this does the following:

  1. Creates a sampling configuration for index1
  2. Bulk loads dozens of entries into index1 and index2, tens of thousands of times
  3. Creates a sampling configuration for index2
  4. Bulk loads some entries into index2

With this change, it looks much better in the profiler -- I never see calls to getSamplingConfiguration() now. And the reported stats from sampling go from ~1000ns/doc to ~55ns/doc (on my laptop with this particular test).

@masseyke added the >non-issue, :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), and v9.3.0 labels on Oct 27, 2025
@masseyke requested a review from seanzatzdev on October 27, 2025 21:00
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label on Oct 27, 2025
@masseyke requested a review from Copilot on October 27, 2025 22:29
Contributor

Copilot AI left a comment

Pull Request Overview

This PR optimizes the performance of random sampling by reducing unnecessary calls to getSamplingConfiguration(). The optimization addresses two common scenarios: (1) when an index has reached its sample limit, and (2) when an index has no sampling configuration. Performance improvements show a reduction from ~1000ns/doc to ~55ns/doc in testing.

Key changes:

  • Defers getSamplingConfiguration() call until after the isFull check to avoid lookups when sample limits are reached
  • Introduces a SampleInfo.NONE marker to cache the absence of sampling configuration, eliminating repeated lookups for unconfigured indexes
  • Updates clusterChanged() to properly remove the NONE marker when a new sampling configuration is added

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| SamplingService.java | Implements lazy configuration lookup with NONE marker caching and moves expensive configuration call after capacity checks |
| SamplingServiceTests.java | Adds test coverage for NONE marker caching behavior before cluster state notification |

Contributor

@seanzatzdev left a comment

LGTM!

@masseyke merged commit 3432472 into elastic:main on Oct 29, 2025
34 checks passed
@masseyke deleted the random-sampling-minor-performance-2 branch on October 29, 2025 16:31
@masseyke changed the title from "Improving random sampling performance by lazily calling getSamplingConfiguration()" to "[Sampling] Improving random sampling performance by lazily calling getSamplingConfiguration()" on Oct 29, 2025
shmuelhanoch pushed a commit to shmuelhanoch/elasticsearch that referenced this pull request Oct 29, 2025
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Nov 3, 2025