Rebuild bloom filter index when fpp config changes#18898
Rebuild bloom filter index when fpp config changes#18898Akanksha-kedia wants to merge 2 commits into
Conversation
When a user changes the fpp (false positive probability) config for a bloom filter index, Pinot now detects the change and rebuilds the index on segment reload. Previously, users had to remove the bloom filter config, reload, re-add with the new fpp, and reload again. The detection works by comparing the number of hash functions stored in the existing bloom filter with the expected number computed from the new fpp config and column cardinality. If they differ, the bloom filter is removed and recreated with the updated config.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18898 +/- ##
============================================
+ Coverage 63.68% 64.81% +1.13%
+ Complexity 1684 1347 -337
============================================
Files 3262 3392 +130
Lines 199826 211675 +11849
Branches 31031 33307 +2276
============================================
+ Hits 127264 137207 +9943
- Misses 62414 63398 +984
- Partials 10148 11070 +922
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
cc @Jackie-Jiang @klsince @xiangfu0 — requesting your review. What this PR doesImplements bloom filter rebuild detection when the fpp (false positive probability) config changes (#17137). Problem: When a user updates fpp in their bloom filter config, Pinot's segment pre-processor silently kept the old — now misconfigured — bloom filter. Other index types (H3, range, text) already detect config changes and trigger rebuilds; bloom filters were the odd one out. Approach: Follows the H3 index resolution change-detection pattern.
Files changed:
Change scope: Purely additive — existing segments with unchanged fpp are unaffected. |
Review: Rebuild bloom filter on fpp changeThe overall approach is sound and consistent with how CRITICAL — Formula integer truncation causes infinite rebuild loop
The PR computes // Guava BloomFilter.java
static int optimalNumOfHashFunctions(double p) {
return max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
}The integer truncation causes mismatches at small cardinalities: Fix — use Guava's direct formula: return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));MAJOR —
|
…, add version guard and small-cardinality test - Fix CRITICAL formula bug: replace the two-step formula (optimalNumOfBits via long-cast then k = round(m/n * ln2)) with Guava's direct formula k = max(1, round(-ln(p) / ln(2))). The old formula produced k=6 for cardinality=1 and fpp=0.01 but Guava writes k=7, causing an infinite rebuild loop on every segment reload. - Deduplicate fpp-change detection: replace the duplicated 25-line inline block in updateIndices with a call to isFppChanged (which accepts SegmentDirectory.Reader, satisfied by both Reader and Writer). - Add version guard: verify the bloom filter file version matches OnHeapGuavaBloomFilterCreator.VERSION before reading numHashFunctions at byte offset 9; log a warning and skip the fpp check for unknown versions rather than reading from an unexpected layout. - Add testBloomFilterFppUpdateSmallCardinality: exercises the formula bug with a cardinality-1 column; the final assertFalse(needProcess()) proves idempotency and would fail with the old formula.
8ea839c to
074f4cb
Compare
|
Thanks for the review! I've pushed fixes for the issues identified: Fix 1 (CRITICAL — formula integer truncation): The two-step formula Fix 2 (MAJOR — duplicated detection logic): Fix 3 (MAJOR — missing version guard): Fix 4 (MAJOR — test doesn't exercise the formula bug): Added |
|
@J-HowHuang Could you help review this? |
There was a problem hiding this comment.
Pull request overview
This PR improves segment reload behavior in pinot-segment-local by detecting bloom filter fpp (false positive probability) configuration changes and triggering bloom-filter index rebuilds during segment preprocessing, reducing the need for manual config removal/re-add cycles.
Changes:
- Added fpp-change detection in
BloomFilterHandler.needUpdateIndices()/updateIndices()by reading bloom-filter metadata and comparing against expected values from the new config. - Introduced helpers to read
numHashFunctionsfrom the existing bloom filter buffer and compute the expected number of hash functions from config. - Added unit tests in
SegmentPreProcessorTestto validate rebuild behavior across both V1 and V3 segment formats, including a small-cardinality regression scenario.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/bloomfilter/BloomFilterHandler.java |
Adds bloom-filter fpp change detection and rebuild logic during segment preprocessing. |
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/SegmentPreProcessorTest.java |
Adds tests verifying preprocessing detects and rebuilds bloom filters when fpp changes. |
| PinotDataBuffer dataBuffer = segmentReader.getIndexFor(column, StandardIndexes.bloomFilter()); | ||
| int version = dataBuffer.getInt(VERSION_OFFSET); | ||
| if (version != OnHeapGuavaBloomFilterCreator.VERSION) { | ||
| LOGGER.warn("Unexpected bloom filter version {} for segment: {}, column: {}; skipping fpp check", version, | ||
| segmentName, column); | ||
| return false; | ||
| } |
| int expectedNumHashFunctions = computeExpectedNumHashFunctions(columnMetadata, _bloomFilterConfigs.get(column)); | ||
| if (expectedNumHashFunctions != existingNumHashFunctions) { | ||
| LOGGER.info("Bloom filter fpp config changed for segment: {}, column: {}, existing numHashFunctions: {}, " | ||
| + "expected numHashFunctions: {}. Index needs to be rebuilt.", | ||
| segmentName, column, existingNumHashFunctions, expectedNumHashFunctions); | ||
| return true; |
| // Create bloom filter with fpp 0.1 | ||
| _bloomFilterConfigs = Map.of("column3", new BloomFilterConfig(0.1, 0, false)); | ||
| runPreProcessor(); | ||
|
|
||
| // Verify no processing needed with same config | ||
| try (SegmentDirectory segmentDirectory = new SegmentLocalFSDirectory(INDEX_DIR, ReadMode.mmap); | ||
| SegmentPreProcessor processor = new SegmentPreProcessor(segmentDirectory, | ||
| createIndexLoadingConfig(_schema))) { | ||
| assertFalse(processor.needProcess()); | ||
| } | ||
|
|
||
| // Update bloom filter fpp to 0.01 | ||
| _bloomFilterConfigs = Map.of("column3", new BloomFilterConfig(0.01, 0, false)); | ||
|
|
|
@Akanksha-kedia Thanks for addressing this issue! I don't think it's a clean way to directly read the number of hash functions from the raw bytes Guava bloom filter writes. This makes Pinot bloom filter depends on the implementation of Guava's bloom filter and this hidden dependency can be easily overlooked in the future. Instead can we try a different approach to internalize these parameters (mainly Can we create a new The reader factory should be able to tell which implementation to use by looking at |
Description
When a user changes the
fpp(false positive probability) config for a bloom filter index, Pinot previously did NOT detect the change and would not rebuild the index. Users had to:This PR adds fpp change detection to
BloomFilterHandler, following the same pattern used for H3 index resolution detection (PR #16953). The detection works by comparing the number of hash functions stored in the existing bloom filter with what the new fpp config would produce (given the column's cardinality). If they differ, the bloom filter is removed and recreated with the updated config.Changes Made
BloomFilterHandler.needUpdateIndices()to check for fpp config changes on existing bloom filter columnsBloomFilterHandler.updateIndices()to remove and rebuild bloom filters when fpp config has changedisFppChanged()helper that readsnumHashFunctionsfrom the existing bloom filter data buffercomputeExpectedNumHashFunctions()that mirrors Guava's BloomFilter formula to compute the expected number of hash functions from fpp and cardinalityRelated Issue
Fixes #17137
Upgrade Notes
None. This is a purely additive behavior change - bloom filter indexes will now be automatically rebuilt when fpp config changes, instead of silently keeping the old index.
Testing Done
SegmentPreProcessorTest#testBloomFilterFppUpdate(tests both v1 and v3 segment formats)