[common] Introduce PrefixFileIndex for prefix query optimization#7750
xuzifu666 wants to merge 3 commits into
JingsongLi left a comment
Review: PrefixFileIndex for prefix query optimization
Thanks for this contribution. The idea of a lightweight prefix-based inverted index for accelerating LIKE 'prefix%' and STARTS_WITH queries is sound, and the benchmarks clearly demonstrate the value. Below are some issues and suggestions.
Correctness Issues
1. Null bitmap offset semantics are ambiguous in the reader
In the Writer, when there is exactly one null row, you encode it as nullOffset = -1 - nullBitmap.first(). However, in the Reader, the code that calls hasPrefix() never actually uses this offset encoding for the null case. The visitIsNull only checks hasNull, while the visitEqual(fieldRef, null) path also only checks hasNull. The nullOffset field is read but never actually used to reconstruct the null bitmap. If a future reader needs to return row-level results (e.g., for row-group filtering), this compact encoding will require documentation so it can be properly decoded.
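To make the encoding concrete, here is a minimal sketch of how a reader could decode it. Only the formula nullOffset = -1 - nullBitmap.first() comes from the writer in this PR; the class and method names below are hypothetical and exist purely for illustration.

```java
/** Hypothetical helper; only the formula "-1 - row" is taken from the PR's writer. */
final class NullOffsetCodec {
    private NullOffsetCodec() {}

    /** Encodes a single row position as a negative offset (0 -> -1, 1 -> -2, ...). */
    static int encodeSingleRow(int row) {
        if (row < 0) {
            throw new IllegalArgumentException("Row position must be non-negative: " + row);
        }
        return -1 - row;
    }

    /** True when the offset inlines a single row instead of pointing into the body. */
    static boolean isSingleRow(int offset) {
        return offset < 0;
    }

    /** Recovers the row position from an inlined (negative) offset. */
    static int decodeSingleRow(int offset) {
        if (offset >= 0) {
            throw new IllegalArgumentException("Not an inlined single-row offset: " + offset);
        }
        return -1 - offset;
    }
}
```

Because the mapping is its own inverse, a short Javadoc on the nullOffset field stating this rule would probably be enough documentation for future readers.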
2. hasPrefix() contains a dead code path for negative offset
if (offset < 0) {
    // single value shortcut
    return true;
}
This branch can never be reached because the offsets stored in prefixOffsets are always >= 0 (they are computed via bodyOffset, which starts at 0 and accumulates). The negative-offset optimization is only used for the null bitmap, which is not stored in prefixOffsets. This is dead code and might confuse future maintainers.
3. Query prefix shorter than the index prefix length falls back to a linear scan
When a query literal (e.g., "hello_world") is longer than prefixLength, both the writer and reader truncate it to prefixLength chars, which is correct. But when the query prefix is shorter than prefixLength, the fallback iteration in hasPrefix() does a linear scan of all entries. This is O(n), where n is the number of distinct prefixes, so for high-cardinality prefix spaces it could regress. Consider building a sorted structure (TreeMap) or at minimum documenting this trade-off.
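As a rough illustration of the sorted-structure idea, the sketch below answers "does any indexed prefix start with this shorter query?" in O(log n) using a NavigableMap. It assumes the prefixes are kept in natural String order and that the map values are body offsets; the class and method names are hypothetical, not the PR's.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch of a sorted lookup for query prefixes shorter than prefixLength. */
final class SortedPrefixLookup {
    private SortedPrefixLookup() {}

    /** Returns true if at least one indexed prefix starts with the given query. */
    static boolean hasAnyWithPrefix(NavigableMap<String, Integer> prefixOffsets, String query) {
        if (query.isEmpty()) {
            return !prefixOffsets.isEmpty();
        }
        // In lexicographic order, every key that starts with `query` sorts at or after
        // `query`, and no non-matching key can sit between `query` and a matching key.
        // So the smallest key >= query is a match if and only if any match exists.
        String candidate = prefixOffsets.ceilingKey(query);
        return candidate != null && candidate.startsWith(query);
    }

    /** Small helper to build a sample index for experiments. */
    static NavigableMap<String, Integer> sample() {
        NavigableMap<String, Integer> map = new TreeMap<>();
        map.put("app", 0);
        map.put("ban", 16);
        map.put("che", 32);
        return map;
    }
}
```

For file-level pruning a single ceilingKey probe is enough. If row-level results were needed, the same map could be walked from that key with tailMap(query, true) until the first key that no longer starts with the query.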
Design Suggestions
4. dataType is accepted but never validated
The constructor accepts any DataType but the index only works with string types. If a user misconfigures a prefix index on an INT column, they will get a confusing ClassCastException at write time. Consider adding a type check in the constructor or factory (similar to how other indexes validate supported types).
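A fail-fast check could look roughly like the sketch below. The ColumnKind enum is a hypothetical stand-in for the engine's own type classification so that the example stays self-contained; in the real code the check would presumably inspect the DataType passed to the constructor or factory.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of a constructor-time type check for the prefix index. */
final class PrefixIndexTypeCheck {
    /** Stand-in for the engine's type classification; not a Paimon type. */
    enum ColumnKind { CHAR, VARCHAR, INT, BIGINT, BOOLEAN }

    private static final Set<ColumnKind> SUPPORTED =
            EnumSet.of(ColumnKind.CHAR, ColumnKind.VARCHAR);

    private PrefixIndexTypeCheck() {}

    /** Throws a descriptive error instead of a ClassCastException at write time. */
    static void validate(ColumnKind kind) {
        if (!SUPPORTED.contains(kind)) {
            throw new IllegalArgumentException(
                    "Prefix file index only supports string columns, but got: " + kind);
        }
    }
}
```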
5. The Reader does not store bitmap lengths — deserialization relies on internal format
The body section stores bitmaps back-to-back, but there is no stored length per bitmap. The readBitmap(offset) method passes data.length - bodyStart - offset as the available bytes, relying on RoaringBitmap32.deserialize() to only read what it needs. While this works for the current RoaringBitmap implementation, it is fragile — if the serialization format changes or if there are trailing bytes, it could break. Consider storing the byte length of each bitmap in the header.
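One way to make the body self-describing is sketched below. It is a variant of the suggestion above: instead of keeping the lengths in the header, each bitmap is preceded by a 4-byte length, so a reader can slice exactly the bytes that belong to it. Plain byte arrays stand in for serialized bitmaps, and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.List;

/** Hypothetical sketch of a length-prefixed body layout: [int length][bytes] repeated. */
final class LengthPrefixedBody {
    private LengthPrefixedBody() {}

    /** Concatenates the blobs, each preceded by its 4-byte big-endian length. */
    static byte[] write(List<byte[]> blobs) {
        int total = 0;
        for (byte[] blob : blobs) {
            total += Integer.BYTES + blob.length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        for (byte[] blob : blobs) {
            out.putInt(blob.length);
            out.put(blob);
        }
        return out.array();
    }

    /** Returns exactly the bytes of the blob whose length prefix starts at offset. */
    static byte[] readAt(byte[] body, int offset) {
        ByteBuffer in = ByteBuffer.wrap(body);
        in.position(offset);
        int length = in.getInt();
        if (length < 0 || length > in.remaining()) {
            throw new IllegalStateException("Corrupt bitmap length at offset " + offset);
        }
        byte[] blob = new byte[length];
        in.get(blob);
        return blob;
    }
}
```

The cost is four extra bytes per distinct prefix, in exchange for a reader that no longer depends on the deserializer stopping at the right byte and that can detect truncated or corrupt entries.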
6. No integration with the predicate pushdown framework
This PR only adds the index implementation but does not wire it into the query planning / file-pruning logic. For example, there is no evidence that StartsWith predicates or LIKE predicates will actually consult this index during scan. This might be intentional (staged PRs), but it would be good to clarify in the PR description whether a follow-up is planned.
7. Missing close() / resource cleanup in the Reader
The Reader class extends FileIndexReader and reads the full byte array eagerly, so there is no resource leak per se. However, the benchmark's queryPrefix method creates a LocalSeekableInputStream on every call and never closes it — this is a resource leak in the benchmark (though not in production code).
Minor / Style
- The Writer class uses java.util.List and java.util.ArrayList with fully-qualified names inside method bodies (lines in sortedPrefixes()). These should be proper imports at the top of the file for consistency with the rest of the codebase.
- The benchmark mixes JUnit 4 (@Rule, TemporaryFolder) with JUnit 5 (@Test from jupiter). @Rule is not processed by JUnit 5 unless rule migration support (@EnableRuleMigrationSupport) is enabled, which means folder.create() must be called manually (which it is), but it is an unusual pattern. Consider switching to JUnit 5's @TempDir.
- The VERSION and PREFIX_LENGTH string constants in PrefixFileIndex shadow similar constants in BitmapFileIndex. If these are user-facing option keys, consider namespacing them (e.g., "prefix.prefix-length").
Summary
The core algorithm is correct and the index design is reasonable for its intended use case. The main concerns are: (1) dead code in the reader's offset handling, (2) missing type validation, (3) no stored bitmap lengths making the format fragile, and (4) the linear fallback in hasPrefix() for short query prefixes. The benchmarks are convincing but would benefit from JUnit 5 alignment. Looking forward to seeing the integration with the scan/pushdown layer.
Purpose
In real-world analytics scenarios, prefix queries on high-cardinality string columns are very common. For example:
Existing file indexes in Paimon, such as BloomFilter and Bitmap, excel at equality lookups but cannot efficiently handle prefix matching. BloomFilter only checks exact value existence; Bitmap Index maps each distinct value to a bitmap, making it impossible to determine which values share a common prefix without scanning all entries.
When no suitable index exists, the query engine must perform a full file scan, reading the entire data file (often tens of MBs) just to discover that no rows match the prefix predicate. This becomes prohibitively expensive at scale.
This PR introduces PrefixFileIndex, a new pluggable file-level index that accelerates prefix queries through a lightweight inverted index structure.
Prefix File Index is an inverted index that maps prefix strings to row number bitmaps. Unlike Bitmap Index which indexes exact values, it extracts the first N characters from each string value and groups rows by their prefix.
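The structure described above can be sketched in a few lines. This is an illustrative in-memory model only, not the PR's implementation: java.util.BitSet stands in for the bitmap type, "characters" are Java chars, and the class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of a prefix-to-rows inverted index. */
final class PrefixIndexSketch {
    private final int prefixLength;
    private final Map<String, BitSet> rowsByPrefix = new TreeMap<>();
    private final BitSet nullRows = new BitSet();
    private int nextRow = 0;

    PrefixIndexSketch(int prefixLength) {
        if (prefixLength <= 0) {
            throw new IllegalArgumentException("prefixLength must be positive");
        }
        this.prefixLength = prefixLength;
    }

    /** Records the next row; null values are tracked separately from prefixes. */
    void add(String value) {
        int row = nextRow++;
        if (value == null) {
            nullRows.set(row);
            return;
        }
        rowsByPrefix.computeIfAbsent(truncate(value), k -> new BitSet()).set(row);
    }

    boolean hasNull() {
        return !nullRows.isEmpty();
    }

    /** Rows that may start with the query; an empty result means the file can be skipped. */
    BitSet candidates(String query) {
        BitSet result = new BitSet();
        if (query.length() >= prefixLength) {
            // Long queries are truncated, so the result may contain false positives.
            BitSet rows = rowsByPrefix.get(truncate(query));
            if (rows != null) {
                result.or(rows);
            }
        } else {
            // Short queries must consider every indexed prefix that extends them.
            for (Map.Entry<String, BitSet> entry : rowsByPrefix.entrySet()) {
                if (entry.getKey().startsWith(query)) {
                    result.or(entry.getValue());
                }
            }
        }
        return result;
    }

    private String truncate(String s) {
        return s.length() <= prefixLength ? s : s.substring(0, prefixLength);
    }
}
```

With prefixLength = 3, the values apple_1 and apple_2 share the single entry "app", which is why the number of entries tracks the number of distinct prefixes rather than the number of distinct values.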
Benchmark results:
Test Environment
{category}_{id} format, 5 categories)
paimon-benchmark/paimon-micro-benchmarks
1. Index Size Comparison
Key Finding: Prefix Index size is independent of data cardinality, depending only on the number of prefix types. Even at cardinality 10000, the index remains at ~649KB.
2. Index Build Time Comparison
3. Query Performance — Skip Scenario (Core Value)
Querying a non-existing prefix; no-index scan must check all 1 million rows to confirm no match:
4. Production Scenario Inference
The above tests were conducted in memory, without accounting for disk I/O. In production:
Conclusion
Tests
PrefixFileIndexTest
PrefixIndexBenchmark