Rebuild bloom filter index when fpp config changes by Akanksha-kedia · Pull Request #18898 · apache/pinot

Akanksha-kedia · 2026-07-01T09:26:21Z

Description

When a user changes the fpp (false positive probability) config for a bloom filter index, Pinot previously did NOT detect the change and would not rebuild the index. Users had to:

Remove the bloom filter config
Reload table
Re-add bloom filter config with new fpp
Reload table again

This PR adds fpp change detection to BloomFilterHandler, following the same pattern used for H3 index resolution detection (PR #16953). The detection works by comparing the number of hash functions stored in the existing bloom filter with what the new fpp config would produce (given the column's cardinality). If they differ, the bloom filter is removed and recreated with the updated config.

Changes Made

Modified BloomFilterHandler.needUpdateIndices() to check for fpp config changes on existing bloom filter columns
Modified BloomFilterHandler.updateIndices() to remove and rebuild bloom filters when fpp config has changed
Added isFppChanged() helper that reads numHashFunctions from the existing bloom filter data buffer
Added computeExpectedNumHashFunctions() that mirrors Guava's BloomFilter formula to compute the expected number of hash functions from fpp and cardinality

Related Issue

Fixes #17137

Upgrade Notes

None. This is a purely additive behavior change - bloom filter indexes will now be automatically rebuilt when fpp config changes, instead of silently keeping the old index.

Testing Done

Unit tests added: SegmentPreProcessorTest#testBloomFilterFppUpdate (tests both v1 and v3 segment formats)
Test verifies: creating bloom filter with fpp=0.1, confirming no processing needed, changing fpp to 0.01, confirming processing IS needed, rebuilding, and confirming no further processing needed
All existing bloom filter tests pass
Checkstyle, spotless, and license checks pass

When a user changes the fpp (false positive probability) config for a bloom filter index, Pinot now detects the change and rebuilds the index on segment reload. Previously, users had to remove the bloom filter config, reload, re-add with the new fpp, and reload again. The detection works by comparing the number of hash functions stored in the existing bloom filter with the expected number computed from the new fpp config and column cardinality. If they differ, the bloom filter is removed and recreated with the updated config.

codecov-commenter · 2026-07-01T10:22:31Z

Codecov Report

❌ Patch coverage is 71.42857% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.81%. Comparing base (7fe517a) to head (074f4cb).
⚠️ Report is 308 commits behind head on master.

Files with missing lines	Patch %	Lines
...t/index/loader/bloomfilter/BloomFilterHandler.java	71.42%	7 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18898      +/-   ##
============================================
+ Coverage     63.68%   64.81%   +1.13%     
+ Complexity     1684     1347     -337     
============================================
  Files          3262     3392     +130     
  Lines        199826   211675   +11849     
  Branches      31031    33307    +2276     
============================================
+ Hits         127264   137207    +9943     
- Misses        62414    63398     +984     
- Partials      10148    11070     +922

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.81% <71.42%> (+1.13%)`	⬆️
temurin	`64.81% <71.42%> (+1.13%)`	⬆️
unittests	`64.81% <71.42%> (+1.13%)`	⬆️
unittests1	`56.98% <28.57%> (+1.22%)`	⬆️
unittests2	`37.19% <71.42%> (+2.21%)`	⬆️


Akanksha-kedia · 2026-07-01T11:04:00Z

cc @Jackie-Jiang @klsince @xiangfu0 — requesting your review.

What this PR does

Implements bloom filter rebuild detection when the fpp (false positive probability) config changes (#17137).

Problem: When a user updates fpp in their bloom filter config, Pinot's segment pre-processor silently kept the old — now misconfigured — bloom filter. Other index types (H3, range, text) already detect config changes and trigger rebuilds; bloom filters were the odd one out.

Approach: Follows the H3 index resolution change-detection pattern.

Reads the numHashFunctions stored in the existing bloom filter's data buffer (byte offset 9)
Computes the expected value from the new fpp config: max(1, round(-ln(fpp) / ln(2))), matching Guava's optimalNumOfHashFunctions
If they differ → marks the index for rebuild

Files changed:

BloomFilterHandler.java — added isFppChanged() and computeExpectedNumHashFunctions() hooked into needUpdateIndices() / updateIndices()
SegmentPreProcessorTest.java — added testBloomFilterFppUpdate covering both v1 and v3 segment formats

Change scope: Purely additive — existing segments with unchanged fpp are unaffected.

Akanksha-kedia · 2026-07-02T03:43:00Z

Review: Rebuild bloom filter on fpp change

The overall approach is sound and consistent with how RangeIndexHandler detects version changes. One CRITICAL and three MAJOR issues need to be addressed before merging; the CRITICAL one causes an infinite rebuild loop.

CRITICAL — Formula integer truncation causes infinite rebuild loop

BloomFilterHandler.java, computeExpectedNumHashFunctions()

The PR computes k via an intermediate (long) truncation of optimalNumOfBits, but Guava's optimalNumOfHashFunctions computes k directly from fpp only:

// Guava BloomFilter.java
static int optimalNumOfHashFunctions(double p) {
    return max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
}

The integer truncation causes mismatches at small cardinalities: cardinality=1, fpp=0.01 → Guava writes k=7, PR computes expected k=6 → every reload triggers rebuild → infinite rebuild loop.

Fix — use Guava's direct formula:

return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
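To make the mismatch concrete, here is a minimal sketch that evaluates both formulas side by side at fpp=0.01. The class and method names are hypothetical, and the two-step variant is a reconstruction from the description in this thread, not a copy of the PR code.

```java
// Hypothetical demo class; not part of Pinot or Guava.
final class HashFunctionCountDemo {
  // Two-step variant as described in this thread: truncate the bit count m
  // to a long, then derive k from m / n.
  static int twoStepNumHashFunctions(long n, double fpp) {
    long m = (long) (-n * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  // Direct variant quoted from Guava above: depends only on fpp.
  static int directNumHashFunctions(double fpp) {
    return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
  }

  public static void main(String[] args) {
    // Print n, two-step k, and direct k for a few small cardinalities.
    for (long n : new long[] {1, 2, 3, 4, 1000}) {
      System.out.println(n + " " + twoStepNumHashFunctions(n, 0.01) + " " + directNumHashFunctions(0.01));
    }
  }
}
```

At n=1 the truncated bit count is m=9, so the two-step variant rounds 9 * ln(2) ≈ 6.24 down to 6, while the direct formula rounds 6.64 up to 7.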

MAJOR — `updateIndices` duplicates ~20 lines from `isFppChanged` instead of calling the helper

BloomFilterHandler.java, lines 158–183. RangeIndexHandler passes segmentWriter (a Writer extends Reader) directly to the helper that accepts a Reader — same pattern should be used here to avoid duplicated logic that can silently diverge.

MAJOR — Version guard missing before reading at byte offset 9

Without first verifying VERSION == OnHeapGuavaBloomFilterCreator.VERSION, a future format change silently reads garbage. Add:

int version = dataBuffer.getInt(4);
if (version != OnHeapGuavaBloomFilterCreator.VERSION) {
    LOGGER.warn("Unexpected bloom filter version {} for segment/column {}/{}", version, segmentName, column);
    return false;
}

MAJOR — Test does not exercise the formula bug; needs a small-cardinality case

SegmentPreProcessorTest.java, testBloomFilterFppUpdate uses column3 which has high cardinality. The truncation mismatch only manifests for cardinality ∈ {1, 3} with fpp=0.01, so the test passes today despite the bug. Please add a test with a cardinality-1 column verifying: (1) unchanged fpp → no rebuild, (2) changed fpp → rebuild triggered, (3) after rebuild → no rebuild (idempotency).

MINOR — `maxSizeInBytes > 0` path not covered by the new test

New test uses BloomFilterConfig(0.1, 0, false) (no size cap). A test that changes maxSizeInBytes such that the effective fpp and k change would cover the GuavaBloomFilterReaderUtils.computeFPP branch.
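As a rough illustration of why a size cap can change k, the sketch below uses the textbook bloom filter relation p = exp(-(m/n) * ln(2)^2) to derive the lowest fpp a capped filter can reach. This is an assumption about the shape of the computation, not the actual body of GuavaBloomFilterReaderUtils.computeFPP, and every name here is hypothetical.

```java
// Hypothetical sketch; assumes the standard bloom filter sizing relation.
final class SizeCapSketch {
  static final double LN2_SQUARED = Math.log(2) * Math.log(2);

  // Lowest fpp achievable with maxSizeInBytes * 8 bits for the given cardinality.
  static double fppForSize(long maxSizeInBytes, long cardinality) {
    return Math.exp(-(maxSizeInBytes * 8.0) * LN2_SQUARED / cardinality);
  }

  // Assumption: a non-positive cap means "no cap", otherwise the cap can only raise the fpp.
  static double effectiveFpp(double configuredFpp, long maxSizeInBytes, long cardinality) {
    if (maxSizeInBytes <= 0) {
      return configuredFpp;
    }
    return Math.max(configuredFpp, fppForSize(maxSizeInBytes, cardinality));
  }

  // Direct hash-function formula quoted from Guava in this thread.
  static int numHashFunctions(double fpp) {
    return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
  }
}
```

Under these assumptions, a column with cardinality 1,000,000 and fpp=0.01 gets k=7 without a cap, but a 512 KiB cap pushes the effective fpp to roughly 0.13 and k to 3, which is the kind of transition such a test would need to trigger.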

Byte-offset correctness (verified — no issue)

NUM_HASH_FUNCTIONS_OFFSET = 9 is correct: 4 bytes TYPE_VALUE + 4 bytes VERSION + 1 byte Guava strategy ordinal precede the 1-byte numHashFunctions field, which therefore sits at byte offset 9. Consistent with BloomFilterReaderFactory.HEADER_SIZE=8 and BaseGuavaBloomFilterReader.NUM_HASH_FUNCTIONS_OFFSET=1.
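The layout and the version guard can be sketched together with a plain java.nio.ByteBuffer standing in for PinotDataBuffer. The TYPE_VALUE and EXPECTED_VERSION constants below are placeholder values chosen for the sketch, and the class and method names are hypothetical; only the offsets come from the discussion above.

```java
import java.nio.ByteBuffer;
import java.util.OptionalInt;

// Hypothetical sketch; ByteBuffer stands in for PinotDataBuffer.
final class BloomHeaderSketch {
  static final int VERSION_OFFSET = 4;            // after the 4-byte TYPE_VALUE
  static final int NUM_HASH_FUNCTIONS_OFFSET = 9; // after TYPE_VALUE, VERSION, strategy ordinal
  static final int TYPE_VALUE = 1;                // placeholder value for this sketch
  static final int EXPECTED_VERSION = 1;          // placeholder value for this sketch

  // Builds the first 10 bytes of a bloom filter file in the layout described above.
  static ByteBuffer header(int version, int strategyOrdinal, int numHashFunctions) {
    ByteBuffer buf = ByteBuffer.allocate(10); // big-endian by default
    buf.putInt(TYPE_VALUE).putInt(version).put((byte) strategyOrdinal).put((byte) numHashFunctions);
    buf.flip();
    return buf;
  }

  // Returns empty when the version is unknown, so callers never interpret an unexpected layout.
  static OptionalInt readNumHashFunctions(ByteBuffer buf) {
    if (buf.getInt(VERSION_OFFSET) != EXPECTED_VERSION) {
      return OptionalInt.empty();
    }
    return OptionalInt.of(buf.get(NUM_HASH_FUNCTIONS_OFFSET) & 0xFF); // read as unsigned byte
  }

  // Compares the stored k with the k implied by the new fpp (direct formula from this thread).
  static boolean isFppChanged(ByteBuffer buf, double newFpp) {
    OptionalInt existing = readNumHashFunctions(buf);
    if (!existing.isPresent()) {
      return false; // unknown version: skip the check rather than rebuild on garbage
    }
    int expected = Math.max(1, (int) Math.round(-Math.log(newFpp) / Math.log(2)));
    return existing.getAsInt() != expected;
  }
}
```

A header written with k=3 reports no change for fpp=0.1 and a change for fpp=0.01, while a header with an unrecognized version always reports no change.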

…, add version guard and small-cardinality test

- Fix CRITICAL formula bug: replace the two-step formula (optimalNumOfBits via long-cast then k = round(m/n * ln2)) with Guava's direct formula k = max(1, round(-ln(p) / ln(2))). The old formula produced k=6 for cardinality=1 and fpp=0.01 but Guava writes k=7, causing an infinite rebuild loop on every segment reload.
- Deduplicate fpp-change detection: replace the duplicated 25-line inline block in updateIndices with a call to isFppChanged (which accepts SegmentDirectory.Reader, satisfied by both Reader and Writer).
- Add version guard: verify the bloom filter file version matches OnHeapGuavaBloomFilterCreator.VERSION before reading numHashFunctions at byte offset 9; log a warning and skip the fpp check for unknown versions rather than reading from an unexpected layout.
- Add testBloomFilterFppUpdateSmallCardinality: exercises the formula bug with a cardinality-1 column; the final assertFalse(needProcess()) proves idempotency and would fail with the old formula.

Akanksha-kedia · 2026-07-02T12:56:23Z

Thanks for the review! I've pushed fixes for the issues identified:

Fix 1 (CRITICAL — formula integer truncation): The two-step formula m = (long)(-n * ln(p) / ln(2)^2); k = round(m/n * ln2) is wrong at small cardinalities due to integer truncation. At cardinality=1, fpp=0.01 it computes k=6 but Guava actually writes k=7, causing an infinite rebuild loop on every segment reload. Fixed by using Guava's direct formula: k = max(1, round(-ln(p) / ln(2))) which depends only on fpp and matches BloomFilter.optimalNumOfHashFunctions() exactly.

Fix 2 (MAJOR — duplicated detection logic): updateIndices was reimplementing the fpp-change detection inline instead of calling isFppChanged. Fixed by changing isFppChanged to accept SegmentDirectory.Reader (the common supertype of both Reader and Writer) so it can be called from both needUpdateIndices and updateIndices.

Fix 3 (MAJOR — missing version guard): isFppChanged was reading numHashFunctions at byte offset 9 without checking the bloom filter file version first. Added a version check at offset 4: if the version doesn't match OnHeapGuavaBloomFilterCreator.VERSION, the method logs a warning and returns false (skip the check) rather than potentially reading garbage bytes as hash function count.

Fix 4 (MAJOR - test doesn't exercise the formula bug): Added testBloomFilterFppUpdateSmallCardinality which builds a segment with a cardinality-1 STRING column. The test verifies: (a) same fpp → no rebuild, (b) changed fpp → rebuild triggered, (c) after rebuild → no rebuild (idempotency). This test would fail against the pre-fix formula, since cardinality=1 is one of the small cardinalities (1 and 3 at fpp=0.01, per the review above) where the truncation error changes the rounding result.

Jackie-Jiang · 2026-07-02T21:05:35Z

@J-HowHuang Could you help review this?

Copilot

Pull request overview

This PR improves segment reload behavior in pinot-segment-local by detecting bloom filter fpp (false positive probability) configuration changes and triggering bloom-filter index rebuilds during segment preprocessing, reducing the need for manual config removal/re-add cycles.

Changes:

Added fpp-change detection in BloomFilterHandler.needUpdateIndices() / updateIndices() by reading bloom-filter metadata and comparing against expected values from the new config.
Introduced helpers to read numHashFunctions from the existing bloom filter buffer and compute the expected number of hash functions from config.
Added unit tests in SegmentPreProcessorTest to validate rebuild behavior across both V1 and V3 segment formats, including a small-cardinality regression scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/bloomfilter/BloomFilterHandler.java`	Adds bloom-filter fpp change detection and rebuild logic during segment preprocessing.
`pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/SegmentPreProcessorTest.java`	Adds tests verifying preprocessing detects and rebuilds bloom filters when fpp changes.

+      PinotDataBuffer dataBuffer = segmentReader.getIndexFor(column, StandardIndexes.bloomFilter());
+      int version = dataBuffer.getInt(VERSION_OFFSET);
+      if (version != OnHeapGuavaBloomFilterCreator.VERSION) {
+        LOGGER.warn("Unexpected bloom filter version {} for segment: {}, column: {}; skipping fpp check", version,
+            segmentName, column);
+        return false;
+      }


+    int expectedNumHashFunctions = computeExpectedNumHashFunctions(columnMetadata, _bloomFilterConfigs.get(column));
+    if (expectedNumHashFunctions != existingNumHashFunctions) {
+      LOGGER.info("Bloom filter fpp config changed for segment: {}, column: {}, existing numHashFunctions: {}, "
+              + "expected numHashFunctions: {}. Index needs to be rebuilt.",
+          segmentName, column, existingNumHashFunctions, expectedNumHashFunctions);
+      return true;


+    // Create bloom filter with fpp 0.1
+    _bloomFilterConfigs = Map.of("column3", new BloomFilterConfig(0.1, 0, false));
+    runPreProcessor();
+
+    // Verify no processing needed with same config
+    try (SegmentDirectory segmentDirectory = new SegmentLocalFSDirectory(INDEX_DIR, ReadMode.mmap);
+        SegmentPreProcessor processor = new SegmentPreProcessor(segmentDirectory,
+            createIndexLoadingConfig(_schema))) {
+      assertFalse(processor.needProcess());
+    }
+
+    // Update bloom filter fpp to 0.01
+    _bloomFilterConfigs = Map.of("column3", new BloomFilterConfig(0.01, 0, false));
+


J-HowHuang · 2026-07-02T21:50:38Z

@Akanksha-kedia Thanks for addressing this issue!

I don't think reading the number of hash functions directly from the raw bytes that Guava's bloom filter writes is a clean approach. It makes the Pinot bloom filter depend on the implementation of Guava's bloom filter, and this hidden dependency can easily be overlooked in the future.

Instead can we try a different approach to internalize these parameters (mainly fpp here) into the pinot bloom filter header? Currently we have the header layout of our pinot bloom filter:

+------------------+---------------+-----------------------------+
| TYPE_VALUE (int) | VERSION (int) | Guava bloom filter bytes... |
+------------------+---------------+-----------------------------+

Can we create new BloomFilterReader and BloomFilterCreator implementations that write the fpp into, and read it from, our header? Something like:

+------------------+---------------+--------------+-----------------------------+
| TYPE_VALUE (int) | VERSION (int) | FPP (double) | Guava bloom filter bytes... |
+------------------+---------------+--------------+-----------------------------+

The reader factory should be able to tell which implementation to use by looking at TYPE_VALUE and then read the correct bytes. We can default to the new implementation and fall back to the old one for existing indexes, so every newly created bloom filter will be able to detect an fpp change the next time its segment is loaded or reloaded.
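A minimal sketch of what the proposed header could look like, again with java.nio.ByteBuffer standing in for PinotDataBuffer. The two TYPE_VALUE constants, the version number, and all names are hypothetical; the proposal above does not fix any of them.

```java
import java.nio.ByteBuffer;
import java.util.OptionalDouble;

// Hypothetical sketch of the proposed header: TYPE_VALUE (int), VERSION (int), FPP (double).
final class FppHeaderSketch {
  static final int LEGACY_TYPE_VALUE = 1; // placeholder: existing layout without fpp
  static final int FPP_TYPE_VALUE = 2;    // placeholder: new layout carrying fpp
  static final int VERSION = 1;           // placeholder
  static final int FPP_OFFSET = 8;        // after TYPE_VALUE and VERSION

  // Writes the new 16-byte header; the Guava bytes would follow it in a real file.
  static ByteBuffer writeHeader(double fpp) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putInt(FPP_TYPE_VALUE).putInt(VERSION).putDouble(fpp);
    buf.flip();
    return buf;
  }

  // Writes the existing 8-byte header, which carries no fpp.
  static ByteBuffer writeLegacyHeader() {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.putInt(LEGACY_TYPE_VALUE).putInt(VERSION);
    buf.flip();
    return buf;
  }

  // Dispatches on TYPE_VALUE; legacy files yield empty because no fpp was stored.
  static OptionalDouble readFpp(ByteBuffer buf) {
    if (buf.getInt(0) != FPP_TYPE_VALUE) {
      return OptionalDouble.empty();
    }
    return OptionalDouble.of(buf.getDouble(FPP_OFFSET));
  }

  // Legacy files cannot be compared, so they are treated as unchanged in this sketch.
  static boolean isFppChanged(ByteBuffer buf, double configuredFpp) {
    OptionalDouble stored = readFpp(buf);
    return stored.isPresent() && Double.compare(stored.getAsDouble(), configuredFpp) != 0;
  }
}
```

One consequence of storing the fpp itself: a change such as 0.01 to 0.009 leaves k at 7 under the round-based formula, so a hash-function comparison would not notice it, whereas comparing the stored fpp would.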

Akanksha-kedia force-pushed the feat/rebuild-bloom-filter-on-fpp-change branch from 8ea839c to 074f4cb Compare July 2, 2026 11:46

Jackie-Jiang added the enhancement (Improvement to existing functionality) and index (Related to indexing (general)) labels Jul 2, 2026

Jackie-Jiang requested a review from Copilot July 2, 2026 21:03

Copilot started reviewing on behalf of Jackie-Jiang July 2, 2026 21:03

Copilot AI reviewed Jul 2, 2026

Akanksha-kedia wants to merge 2 commits into apache:master from Akanksha-kedia:feat/rebuild-bloom-filter-on-fpp-change

5 participants
