
Rebuild bloom filter index when fpp config changes#18898

Open
Akanksha-kedia wants to merge 2 commits into apache:master from Akanksha-kedia:feat/rebuild-bloom-filter-on-fpp-change

Conversation

@Akanksha-kedia

Contributor

Description

When a user changes the fpp (false positive probability) config for a bloom filter index, Pinot previously did NOT detect the change and would not rebuild the index. Users had to:

  1. Remove the bloom filter config
  2. Reload table
  3. Re-add bloom filter config with new fpp
  4. Reload table again

This PR adds fpp change detection to BloomFilterHandler, following the same pattern used for H3 index resolution detection (PR #16953). The detection works by comparing the number of hash functions stored in the existing bloom filter with what the new fpp config would produce (given the column's cardinality). If they differ, the bloom filter is removed and recreated with the updated config.

Changes Made

  • Modified BloomFilterHandler.needUpdateIndices() to check for fpp config changes on existing bloom filter columns
  • Modified BloomFilterHandler.updateIndices() to remove and rebuild bloom filters when fpp config has changed
  • Added isFppChanged() helper that reads numHashFunctions from the existing bloom filter data buffer
  • Added computeExpectedNumHashFunctions() that mirrors Guava's BloomFilter formula to compute the expected number of hash functions from fpp and cardinality

Related Issue

Fixes #17137

Upgrade Notes

None. This is a purely additive behavior change - bloom filter indexes will now be automatically rebuilt when fpp config changes, instead of silently keeping the old index.

Testing Done

  • Unit tests added: SegmentPreProcessorTest#testBloomFilterFppUpdate (tests both v1 and v3 segment formats)
  • Test verifies: creating bloom filter with fpp=0.1, confirming no processing needed, changing fpp to 0.01, confirming processing IS needed, rebuilding, and confirming no further processing needed
  • All existing bloom filter tests pass
  • Checkstyle, spotless, and license checks pass

When a user changes the fpp (false positive probability) config for a
bloom filter index, Pinot now detects the change and rebuilds the index
on segment reload. Previously, users had to remove the bloom filter
config, reload, re-add with the new fpp, and reload again.

The detection works by comparing the number of hash functions stored in
the existing bloom filter with the expected number computed from the new
fpp config and column cardinality. If they differ, the bloom filter is
removed and recreated with the updated config.
@codecov-commenter

codecov-commenter commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 71.42857% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.81%. Comparing base (7fe517a) to head (074f4cb).
⚠️ Report is 308 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/index/loader/bloomfilter/BloomFilterHandler.java | 71.42% | 7 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18898      +/-   ##
============================================
+ Coverage     63.68%   64.81%   +1.13%     
+ Complexity     1684     1347     -337     
============================================
  Files          3262     3392     +130     
  Lines        199826   211675   +11849     
  Branches      31031    33307    +2276     
============================================
+ Hits         127264   137207    +9943     
- Misses        62414    63398     +984     
- Partials      10148    11070     +922     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.81% <71.42%> (+1.13%) ⬆️ |
| temurin | 64.81% <71.42%> (+1.13%) ⬆️ |
| unittests | 64.81% <71.42%> (+1.13%) ⬆️ |
| unittests1 | 56.98% <28.57%> (+1.22%) ⬆️ |
| unittests2 | 37.19% <71.42%> (+2.21%) ⬆️ |


@Akanksha-kedia

Contributor Author

cc @Jackie-Jiang @klsince @xiangfu0 — requesting your review.

What this PR does

Implements bloom filter rebuild detection when the fpp (false positive probability) config changes (#17137).

Problem: When a user updates fpp in their bloom filter config, Pinot's segment pre-processor silently kept the old — now misconfigured — bloom filter. Other index types (H3, range, text) already detect config changes and trigger rebuilds; bloom filters were the odd one out.

Approach: Follows the H3 index resolution change-detection pattern.

  1. Reads the numHashFunctions stored in the existing bloom filter's data buffer (byte offset 9)
  2. Computes the expected value from the new fpp config: max(1, round(-ln(fpp) / ln(2)))
  3. If they differ → marks the index for rebuild
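
The three steps above can be sketched in isolation. This is a minimal illustration, not the PR's code: the class and method names are hypothetical, the stored value is passed in as a plain int instead of being read from a segment buffer, and the expected value uses the rounding form quoted from Guava elsewhere in this thread. It also ignores any maxSizeInBytes cap.

```java
public class FppChangeCheck {
  // Expected number of hash functions (k) for a given fpp,
  // using max(1, round(-ln(p) / ln(2))) as quoted from Guava in this thread.
  public static int expectedNumHashFunctions(double fpp) {
    return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
  }

  // Steps 2 and 3: rebuild only when the stored k differs from the expected k.
  public static boolean needsRebuild(int storedNumHashFunctions, double configuredFpp) {
    return storedNumHashFunctions != expectedNumHashFunctions(configuredFpp);
  }

  public static void main(String[] args) {
    int stored = expectedNumHashFunctions(0.1);      // index originally built with fpp=0.1
    System.out.println(needsRebuild(stored, 0.1));   // config unchanged
    System.out.println(needsRebuild(stored, 0.01));  // config changed
    int rebuilt = expectedNumHashFunctions(0.01);    // index rebuilt with fpp=0.01
    System.out.println(needsRebuild(rebuilt, 0.01)); // second reload after the rebuild
  }
}
```

Run as written, this prints false, true, false, which is the same sequence the unit test asserts. Because a check of this shape compares k and not fpp itself, two fpp values that round to the same k (0.01 and 0.009 both give 7 under this formula) would not be told apart.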

Files changed:

  • BloomFilterHandler.java — added isFppChanged() and computeExpectedNumHashFunctions() hooked into needUpdateIndices() / updateIndices()
  • SegmentPreProcessorTest.java — added testBloomFilterFppUpdate covering both v1 and v3 segment formats

Change scope: Purely additive — existing segments with unchanged fpp are unaffected.

@Akanksha-kedia

Contributor Author

Review: Rebuild bloom filter on fpp change

The overall approach is sound and consistent with how RangeIndexHandler detects version changes. One CRITICAL and three MAJOR issues need to be addressed before merging; the critical one causes an infinite rebuild loop.


CRITICAL — Formula integer truncation causes infinite rebuild loop

BloomFilterHandler.java, computeExpectedNumHashFunctions()

The PR computes k via an intermediate (long) truncation of optimalNumOfBits, but Guava's optimalNumOfHashFunctions computes k directly from fpp only:

// Guava BloomFilter.java
static int optimalNumOfHashFunctions(double p) {
    return max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
}

The integer truncation causes mismatches at small cardinalities: cardinality=1, fpp=0.01 → Guava writes k=7, PR computes expected k=6 → every reload triggers rebuild → infinite rebuild loop.

Fix — use Guava's direct formula:

return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
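
To make the mismatch concrete, the following sketch evaluates both formulas side by side for a few cardinalities at fpp=0.01. The class and method names are hypothetical; twoStep follows the two-step computation as it is described in this thread (truncate the bit count to a long, then derive k from m/n), and direct follows the Guava snippet quoted above.

```java
public class HashFunctionCountComparison {
  private static final double LOG_TWO = Math.log(2);

  // Two-step variant: m = (long) (-n * ln(p) / ln(2)^2), then k = round(m / n * ln(2)).
  // The (long) cast drops the fractional bits, which matters when n is small.
  public static int twoStep(long n, double p) {
    long m = (long) (-n * Math.log(p) / (LOG_TWO * LOG_TWO));
    return (int) Math.round((double) m / n * LOG_TWO);
  }

  // Direct variant: k depends on p only.
  public static int direct(double p) {
    return Math.max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
  }

  public static void main(String[] args) {
    double fpp = 0.01;
    for (long n = 1; n <= 5; n++) {
      System.out.printf("n=%d twoStep=%d direct=%d%n", n, twoStep(n, fpp), direct(fpp));
    }
  }
}
```

For n=1 the truncated bit count is 9, so the two-step variant yields round(9 × 0.693) = 6 while the direct variant yields round(6.64) = 7. The same disagreement appears at n=3, and the two variants agree for n=2, 4 and 5.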

MAJOR — updateIndices duplicates ~20 lines from isFppChanged instead of calling the helper

BloomFilterHandler.java, lines 158–183. RangeIndexHandler passes segmentWriter (a Writer extends Reader) directly to the helper that accepts a Reader — same pattern should be used here to avoid duplicated logic that can silently diverge.


MAJOR — Version guard missing before reading at byte offset 9

Without first verifying VERSION == OnHeapGuavaBloomFilterCreator.VERSION, a future format change silently reads garbage. Add:

int version = dataBuffer.getInt(4);
if (version != OnHeapGuavaBloomFilterCreator.VERSION) {
    LOGGER.warn("Unexpected bloom filter version {} for segment/column {}/{}", version, segmentName, column);
    return false;
}

MAJOR — Test does not exercise the formula bug; needs a small-cardinality case

SegmentPreProcessorTest.java, testBloomFilterFppUpdate uses column3 which has high cardinality. The truncation mismatch only manifests for cardinality ∈ {1, 3} with fpp=0.01, so the test passes today despite the bug. Please add a test with a cardinality-1 column verifying: (1) unchanged fpp → no rebuild, (2) changed fpp → rebuild triggered, (3) after rebuild → no rebuild (idempotency).


MINOR — maxSizeInBytes > 0 path not covered by the new test

New test uses BloomFilterConfig(0.1, 0, false) (no size cap). A test that changes maxSizeInBytes such that the effective fpp and k change would cover the GuavaBloomFilterReaderUtils.computeFPP branch.


Byte-offset correctness (verified — no issue)

NUM_HASH_FUNCTIONS_OFFSET = 9 is correct: the 4-byte TYPE_VALUE and 4-byte VERSION occupy offsets 0-7, the 1-byte Guava strategy ordinal sits at offset 8, and the 1-byte numHashFunctions follows at offset 9. Consistent with BloomFilterReaderFactory.HEADER_SIZE=8 and BaseGuavaBloomFilterReader.NUM_HASH_FUNCTIONS_OFFSET=1.
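
The layout and the version guard can be sketched with a plain java.nio.ByteBuffer. Everything here is illustrative: the class name, the TYPE_VALUE and VERSION numbers, and the -1 sentinel are assumptions made for the example, not Pinot's constants, and the buffer is assumed to be big-endian. Only the offsets come from the description above.

```java
import java.nio.ByteBuffer;

public class BloomFilterHeaderSketch {
  // Hypothetical placeholder values; the real constants live in Pinot's bloom filter classes.
  static final int TYPE_VALUE = 1;
  static final int VERSION = 1;

  static final int VERSION_OFFSET = 4;
  static final int NUM_HASH_FUNCTIONS_OFFSET = 9;

  // Builds only the first 10 bytes of the layout; the bit array that would follow is omitted.
  public static byte[] buildHeader(int version, int strategyOrdinal, int numHashFunctions) {
    return ByteBuffer.allocate(10)
        .putInt(TYPE_VALUE)              // offsets 0-3
        .putInt(version)                 // offsets 4-7
        .put((byte) strategyOrdinal)     // offset 8
        .put((byte) numHashFunctions)    // offset 9
        .array();
  }

  // Returns the stored number of hash functions, or -1 when the version is not the expected one.
  public static int readNumHashFunctions(byte[] header) {
    ByteBuffer buffer = ByteBuffer.wrap(header); // big-endian by default
    if (buffer.getInt(VERSION_OFFSET) != VERSION) {
      return -1; // unknown layout: skip the check instead of interpreting arbitrary bytes
    }
    return buffer.get(NUM_HASH_FUNCTIONS_OFFSET) & 0xFF; // treat the byte as unsigned
  }

  public static void main(String[] args) {
    System.out.println(readNumHashFunctions(buildHeader(VERSION, 1, 7)));
    System.out.println(readNumHashFunctions(buildHeader(VERSION + 1, 1, 7)));
  }
}
```

With a matching version the sketch returns 7; with any other version it returns the sentinel, which mirrors the "log a warning and skip the fpp check" behaviour requested in the review.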

…, add version guard and small-cardinality test

- Fix CRITICAL formula bug: replace the two-step formula (optimalNumOfBits
  via long-cast then k = round(m/n * ln2)) with Guava's direct formula
  k = max(1, round(-ln(p) / ln(2))). The old formula produced k=6 for
  cardinality=1 and fpp=0.01 but Guava writes k=7, causing an infinite
  rebuild loop on every segment reload.

- Deduplicate fpp-change detection: replace the duplicated 25-line inline
  block in updateIndices with a call to isFppChanged (which accepts
  SegmentDirectory.Reader, satisfied by both Reader and Writer).

- Add version guard: verify the bloom filter file version matches
  OnHeapGuavaBloomFilterCreator.VERSION before reading numHashFunctions
  at byte offset 9; log a warning and skip the fpp check for unknown
  versions rather than reading from an unexpected layout.

- Add testBloomFilterFppUpdateSmallCardinality: exercises the formula bug
  with a cardinality-1 column; the final assertFalse(needProcess()) proves
  idempotency and would fail with the old formula.
Akanksha-kedia force-pushed the feat/rebuild-bloom-filter-on-fpp-change branch from 8ea839c to 074f4cb on July 2, 2026 11:46
@Akanksha-kedia

Contributor Author

Thanks for the review! I've pushed fixes for the issues identified:

Fix 1 (CRITICAL — formula integer truncation): The two-step formula m = (long)(-n * ln(p) / ln(2)^2); k = round(m/n * ln2) is wrong at small cardinalities due to integer truncation. At cardinality=1, fpp=0.01 it computes k=6 but Guava actually writes k=7, causing an infinite rebuild loop on every segment reload. Fixed by using Guava's direct formula: k = max(1, round(-ln(p) / ln(2))) which depends only on fpp and matches BloomFilter.optimalNumOfHashFunctions() exactly.

Fix 2 (MAJOR — duplicated detection logic): updateIndices was reimplementing the fpp-change detection inline instead of calling isFppChanged. Fixed by changing isFppChanged to accept SegmentDirectory.Reader (the common supertype of both Reader and Writer) so it can be called from both needUpdateIndices and updateIndices.

Fix 3 (MAJOR — missing version guard): isFppChanged was reading numHashFunctions at byte offset 9 without checking the bloom filter file version first. Added a version check at offset 4: if the version doesn't match OnHeapGuavaBloomFilterCreator.VERSION, the method logs a warning and returns false (skip the check) rather than potentially reading garbage bytes as hash function count.

Fix 4 (MAJOR: test doesn't exercise the formula bug): Added testBloomFilterFppUpdateSmallCardinality which builds a segment with a cardinality-1 STRING column. The test verifies: (a) same fpp → no rebuild, (b) changed fpp → rebuild triggered, (c) after rebuild → no rebuild (idempotency). This test would fail against the pre-fix formula, since cardinality=1 is one of the two cardinalities (1 and 3, at fpp=0.01) where the truncation error changes the rounding result.

Jackie-Jiang added the labels enhancement (Improvement to existing functionality) and index (Related to indexing (general)) on Jul 2, 2026
Jackie-Jiang requested a review from Copilot on July 2, 2026 21:03
@Jackie-Jiang

Contributor

@J-HowHuang Could you help review this?

Copilot AI left a comment

Contributor


Pull request overview

This PR improves segment reload behavior in pinot-segment-local by detecting bloom filter fpp (false positive probability) configuration changes and triggering bloom-filter index rebuilds during segment preprocessing, reducing the need for manual config removal/re-add cycles.

Changes:

  • Added fpp-change detection in BloomFilterHandler.needUpdateIndices() / updateIndices() by reading bloom-filter metadata and comparing against expected values from the new config.
  • Introduced helpers to read numHashFunctions from the existing bloom filter buffer and compute the expected number of hash functions from config.
  • Added unit tests in SegmentPreProcessorTest to validate rebuild behavior across both V1 and V3 segment formats, including a small-cardinality regression scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/bloomfilter/BloomFilterHandler.java | Adds bloom-filter fpp change detection and rebuild logic during segment preprocessing. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/SegmentPreProcessorTest.java | Adds tests verifying preprocessing detects and rebuilds bloom filters when fpp changes. |

Comment on lines +112 to +118
PinotDataBuffer dataBuffer = segmentReader.getIndexFor(column, StandardIndexes.bloomFilter());
int version = dataBuffer.getInt(VERSION_OFFSET);
if (version != OnHeapGuavaBloomFilterCreator.VERSION) {
  LOGGER.warn("Unexpected bloom filter version {} for segment: {}, column: {}; skipping fpp check", version,
      segmentName, column);
  return false;
}
Comment on lines +124 to +129
int expectedNumHashFunctions = computeExpectedNumHashFunctions(columnMetadata, _bloomFilterConfigs.get(column));
if (expectedNumHashFunctions != existingNumHashFunctions) {
  LOGGER.info("Bloom filter fpp config changed for segment: {}, column: {}, existing numHashFunctions: {}, "
          + "expected numHashFunctions: {}. Index needs to be rebuilt.",
      segmentName, column, existingNumHashFunctions, expectedNumHashFunctions);
  return true;
Comment on lines +1755 to +1768
// Create bloom filter with fpp 0.1
_bloomFilterConfigs = Map.of("column3", new BloomFilterConfig(0.1, 0, false));
runPreProcessor();

// Verify no processing needed with same config
try (SegmentDirectory segmentDirectory = new SegmentLocalFSDirectory(INDEX_DIR, ReadMode.mmap);
    SegmentPreProcessor processor = new SegmentPreProcessor(segmentDirectory,
        createIndexLoadingConfig(_schema))) {
  assertFalse(processor.needProcess());
}

// Update bloom filter fpp to 0.01
_bloomFilterConfigs = Map.of("column3", new BloomFilterConfig(0.01, 0, false));

@J-HowHuang

J-HowHuang commented Jul 2, 2026

Collaborator

@Akanksha-kedia Thanks for addressing this issue!

I don't think reading the number of hash functions directly from the raw bytes that Guava's bloom filter writes is a clean approach. It makes Pinot's bloom filter depend on the implementation of Guava's bloom filter, and this hidden dependency can easily be overlooked in the future.

Instead, can we try a different approach and store these parameters (mainly fpp here) in the Pinot bloom filter header? Currently the header layout of our Pinot bloom filter is:

+------------------+---------------+-----------------------------+
| TYPE_VALUE (int) | VERSION (int) | Guava bloom filter bytes... |
+------------------+---------------+-----------------------------+

Can we create new BloomFilterReader and BloomFilterCreator implementations that write the fpp into our header and read it back? Something like:

+------------------+---------------+--------------+-----------------------------+
| TYPE_VALUE (int) | VERSION (int) | FPP (double) | Guava bloom filter bytes... |
+------------------+---------------+--------------+-----------------------------+

The reader factory should be able to tell which implementation to use by looking at TYPE_VALUE, and then read the correct bytes. We can default to the new implementation and fall back to the old one for existing indexes, so that all newly created bloom filters will be able to detect an fpp change the next time they are loaded or reloaded.
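
A rough sketch of what the proposed header could look like in code, using only java.nio. All names and numeric values here are hypothetical (LEGACY_TYPE_VALUE, FPP_TYPE_VALUE, the class and its methods); the proposal above does not specify them. Treating a legacy index as "no change detected" is also an assumption made for the example.

```java
import java.nio.ByteBuffer;
import java.util.OptionalDouble;

public class FppHeaderSketch {
  // Hypothetical values for illustration only; not Pinot constants.
  static final int LEGACY_TYPE_VALUE = 1;
  static final int FPP_TYPE_VALUE = 2;
  static final int VERSION = 1;

  // Existing layout: TYPE_VALUE (int) | VERSION (int) | Guava bloom filter bytes
  public static byte[] writeLegacy(byte[] guavaBytes) {
    return ByteBuffer.allocate(2 * Integer.BYTES + guavaBytes.length)
        .putInt(LEGACY_TYPE_VALUE).putInt(VERSION).put(guavaBytes).array();
  }

  // Proposed layout: TYPE_VALUE (int) | VERSION (int) | FPP (double) | Guava bloom filter bytes
  public static byte[] writeWithFpp(double fpp, byte[] guavaBytes) {
    return ByteBuffer.allocate(2 * Integer.BYTES + Double.BYTES + guavaBytes.length)
        .putInt(FPP_TYPE_VALUE).putInt(VERSION).putDouble(fpp).put(guavaBytes).array();
  }

  // Dispatches on TYPE_VALUE: only the new layout carries an fpp at offset 8.
  public static OptionalDouble readFpp(byte[] index) {
    ByteBuffer buffer = ByteBuffer.wrap(index);
    return buffer.getInt(0) == FPP_TYPE_VALUE
        ? OptionalDouble.of(buffer.getDouble(2 * Integer.BYTES))
        : OptionalDouble.empty();
  }

  // A legacy index has no stored fpp, so this sketch reports "no change" for it.
  public static boolean fppChanged(byte[] index, double configuredFpp) {
    OptionalDouble stored = readFpp(index);
    return stored.isPresent() && Double.compare(stored.getAsDouble(), configuredFpp) != 0;
  }

  public static void main(String[] args) {
    byte[] guavaBytes = {1, 2, 3};
    System.out.println(fppChanged(writeWithFpp(0.1, guavaBytes), 0.1));
    System.out.println(fppChanged(writeWithFpp(0.1, guavaBytes), 0.01));
    System.out.println(fppChanged(writeLegacy(guavaBytes), 0.01));
  }
}
```

Storing fpp directly means the comparison no longer depends on how Guava derives or serializes its number of hash functions, which is the hidden dependency the comment above is concerned about.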

