[common] Introduce PrefixFileIndex for prefix query optimization #7750

Open
xuzifu666 wants to merge 3 commits into apache:master from xuzifu666:prefix_file_index_support

@xuzifu666 xuzifu666 commented Apr 30, 2026

Purpose

In real-world analytics scenarios, prefix queries on high-cardinality string columns are very common. For example:

WHERE url LIKE '/api/v1/%'
WHERE order_id LIKE 'ORD2024%'

Existing file indexes in Paimon, such as BloomFilter and Bitmap, excel at equality lookups but cannot efficiently handle prefix matching. BloomFilter only checks whether an exact value exists; Bitmap Index maps each distinct value to a bitmap, so it cannot determine which values share a common prefix without scanning all entries.

When no suitable index exists, the query engine must perform a full file scan, reading the entire data file (often tens of MBs) just to discover that no rows match the prefix predicate. This becomes prohibitively expensive at scale.

This PR introduces PrefixFileIndex, a new pluggable file-level index that accelerates prefix queries through a lightweight inverted index structure.

Prefix File Index is an inverted index that maps prefix strings to row number bitmaps. Unlike Bitmap Index which indexes exact values, it extracts the first N characters from each string value and groups rows by their prefix.
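
The idea described above can be illustrated with a minimal in-memory sketch. This is not the code from this PR: the class name `PrefixIndexSketch` and its methods are hypothetical, and `java.util.BitSet` stands in for whatever row-number bitmap the real index uses. It only shows how grouping rows by their first N characters allows a skip decision for a prefix predicate.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical and BitSet stands in for a row bitmap. */
final class PrefixIndexSketch {
    private final int prefixLength;
    private final Map<String, BitSet> rowsByPrefix = new HashMap<>();
    private int nextRow = 0;

    PrefixIndexSketch(int prefixLength) {
        if (prefixLength <= 0) {
            throw new IllegalArgumentException("prefixLength must be positive");
        }
        this.prefixLength = prefixLength;
    }

    /** Keeps at most the first prefixLength chars of a string. */
    private String truncate(String s) {
        return s.length() <= prefixLength ? s : s.substring(0, prefixLength);
    }

    /** Assigns the next row number to the value and records it under its prefix. */
    void add(String value) {
        int row = nextRow++;
        if (value != null) {
            rowsByPrefix.computeIfAbsent(truncate(value), k -> new BitSet()).set(row);
        }
    }

    /** Returns false only when no indexed value can start with queryPrefix. */
    boolean mightContainPrefix(String queryPrefix) {
        if (queryPrefix.length() >= prefixLength) {
            // Long query: a matching value must share the truncated key exactly.
            return rowsByPrefix.containsKey(truncate(queryPrefix));
        }
        // Short query: any stored key that starts with it is a possible match.
        for (String key : rowsByPrefix.keySet()) {
            if (key.startsWith(queryPrefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With `prefixLength = 3`, adding `"ORD2024-001"` stores the row under the key `"ORD"`, so a query for `'ORD2024%'` reports a possible match, while a query for `'XYZ%'` reports that the file can be skipped.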

Benchmark results:

Test Environment

  • CPU: Apple M4
  • JVM: Java HotSpot 17.0.12
  • Data Volume: 1 million string rows ({category}_{id} format, 5 categories)
  • Test Module: paimon-benchmark/paimon-micro-benchmarks

1. Index Size Comparison

| Cardinality | PrefixLen=2 | PrefixLen=3 | PrefixLen=4 | BitmapIndex | Raw Data | Prefix Space Saving |
|---|---|---|---|---|---|---|
| 100 | 649 KB | 649 KB | 649 KB | 2.0 MB | 13.1 MB | 20x vs data |
| 1000 | 649 KB | 649 KB | 649 KB | 2.7 MB | 14.7 MB | 23x vs data |
| 10000 | 649 KB | 649 KB | 649 KB | 7.7 MB | 15.7 MB | 24x vs data |

Key Finding: Prefix Index size is independent of data cardinality, depending only on the number of prefix types. Even at cardinality 10000, the index remains at ~649KB.


2. Index Build Time Comparison

| Cardinality | PrefixLen=2 | PrefixLen=3 | PrefixLen=4 | BitmapIndex | Prefix Build Speedup |
|---|---|---|---|---|---|
| 100 | 126 ms | 114 ms | 110 ms | 163 ms | 1.3-1.5x |
| 1000 | 112 ms | 112 ms | 110 ms | 246 ms | 2.2x |
| 10000 | 111 ms | 110 ms | 113 ms | 633 ms | 5.6x |

3. Query Performance — Skip Scenario (Core Value)

Querying a non-existent prefix; without an index, the scan must check all 1 million rows to confirm there is no match:

| Cardinality | PrefixIndex | BitmapIndex | No-Index-Full-Scan | Prefix Index Speedup |
|---|---|---|---|---|
| 100 | ~1.3 μs | ~2.1 μs | 12.558 μs | ~9.8x |
| 1000 | ~1.1 μs | ~2.1 μs | 13.147 μs | ~11.9x |
| 10000 | ~1.2 μs | ~1.6 μs | 13.262 μs | ~11.3x |

4. Production Scenario Inference

The above tests were conducted in memory, without accounting for disk I/O. In production:

| Scenario | No Index | Prefix Index | Inferred Speedup |
|---|---|---|---|
| Data file size | ~15 MB (1M rows) | ~649 KB | - |
| Disk read time | ~50-200 ms | ~0.1 ms (cache hit) | 500-2000x |
| Skip decision time | Must read all data | Returns SKIP in 1 μs | Tens of thousands x |

Conclusion

| Dimension | Prefix Index Advantage |
|---|---|
| Index Size | Only 1/20 of raw data, 3-12x smaller than Bitmap Index |
| Build Speed | Up to 5.6x faster in high-cardinality scenarios |
| Skip Performance | 10-12x faster in memory, hundreds to thousands x in real disk I/O |

Tests

PrefixFileIndexTest
PrefixIndexBenchmark

@JingsongLi JingsongLi left a comment

Review: PrefixFileIndex for prefix query optimization

Thanks for this contribution. The idea of a lightweight prefix-based inverted index for accelerating LIKE 'prefix%' and STARTS_WITH queries is sound, and the benchmarks clearly demonstrate the value. Below are some issues and suggestions.


Correctness Issues

1. Null bitmap offset semantics are ambiguous in the reader

In the Writer, when there is exactly one null row, you encode it as nullOffset = -1 - nullBitmap.first(). However, in the Reader, the code that calls hasPrefix() never actually uses this offset encoding for the null case. The visitIsNull only checks hasNull, while the visitEqual(fieldRef, null) path also only checks hasNull. The nullOffset field is read but never actually used to reconstruct the null bitmap. If a future reader needs to return row-level results (e.g., for row-group filtering), this compact encoding will require documentation so it can be properly decoded.
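
To make the encoding concrete, here is a small sketch of how the single-row shortcut could be documented and decoded. The class and method names are hypothetical; only the formula `-1 - row` comes from the writer as described above, and treating every non-negative value as an ordinary body offset is an assumption.

```java
/** Hypothetical helper documenting the "-1 - row" single-row encoding. */
final class NullOffsetCodecSketch {
    /** Encodes a single null row number as a negative offset. */
    static int encodeSingleRow(int row) {
        if (row < 0) {
            throw new IllegalArgumentException("row must be non-negative: " + row);
        }
        return -1 - row;
    }

    /** Negative values carry one row number; non-negative values are assumed to be body offsets. */
    static boolean isSingleRow(int offset) {
        return offset < 0;
    }

    /** Inverts encodeSingleRow. */
    static int decodeSingleRow(int offset) {
        if (offset >= 0) {
            throw new IllegalArgumentException("not a single-row encoding: " + offset);
        }
        return -1 - offset;
    }
}
```

Row 0 maps to -1, so the encoded value never collides with a valid non-negative offset.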

2. hasPrefix() contains a dead code path for negative offset

if (offset < 0) {
    // single value shortcut
    return true;
}

This branch can never be reached because the offsets stored in prefixOffsets are always >= 0 (they are computed via bodyOffset which starts at 0 and accumulates). The negative-offset optimization is only used for the null bitmap, which is not stored in prefixOffsets. This is dead code and might confuse future maintainers.

3. Query prefixes shorter than the index prefix length fall back to a linear scan

When a query literal (e.g., "hello_world") is longer than prefixLength, both the writer and reader truncate it to prefixLength chars. This is correct. However, when the query prefix is shorter than prefixLength, the fallback iteration in hasPrefix() does a linear scan of all entries. This is O(n), where n is the number of distinct prefixes, so it could regress for high-cardinality prefix spaces. Consider building a sorted structure (TreeMap) or at minimum documenting this trade-off.
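
A rough sketch of the sorted-structure idea, assuming the reader can hold its prefix-to-offset entries in a `TreeMap` (the class name and the `Integer` offset type are hypothetical). Because all keys that start with a given string are contiguous in sorted order, one `ceilingKey` lookup replaces the linear scan:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: O(log n) check for "does any stored prefix start with the query?". */
final class SortedPrefixLookupSketch {
    private final NavigableMap<String, Integer> prefixOffsets = new TreeMap<>();

    void put(String prefix, int offset) {
        prefixOffsets.put(prefix, offset);
    }

    boolean hasKeyStartingWith(String queryPrefix) {
        // The smallest key >= queryPrefix starts with queryPrefix
        // whenever any key in the map does.
        String candidate = prefixOffsets.ceilingKey(queryPrefix);
        return candidate != null && candidate.startsWith(queryPrefix);
    }
}
```

If the serialized header already stores prefixes in sorted order, the same check could be done with a binary search over the header instead of materializing a map.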


Design Suggestions

4. dataType is accepted but never validated

The constructor accepts any DataType but the index only works with string types. If a user misconfigures a prefix index on an INT column, they will get a confusing ClassCastException at write time. Consider adding a type check in the constructor or factory (similar to how other indexes validate supported types).
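
Something along these lines would fail fast with a clear message. The `ColumnType` enum below is a hypothetical stand-in for Paimon's type model, used only to keep the sketch self-contained; the real check would inspect the actual DataType.

```java
/** Hypothetical sketch of a fail-fast type check; ColumnType is a stand-in, not a Paimon class. */
final class PrefixIndexTypeCheckSketch {
    enum ColumnType { CHAR, VARCHAR, INT, BIGINT, DOUBLE }

    /** Rejects non-string columns at construction time instead of at write time. */
    static void requireStringType(ColumnType type) {
        switch (type) {
            case CHAR:
            case VARCHAR:
                return;
            default:
                throw new IllegalArgumentException(
                        "Prefix file index supports only string columns, but got: " + type);
        }
    }
}
```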

5. The Reader does not store bitmap lengths — deserialization relies on internal format

The body section stores bitmaps back-to-back, but there is no stored length per bitmap. The readBitmap(offset) method passes data.length - bodyStart - offset as the available bytes, relying on RoaringBitmap32.deserialize() to only read what it needs. While this works for the current RoaringBitmap implementation, it is fragile — if the serialization format changes or if there are trailing bytes, it could break. Consider storing the byte length of each bitmap in the header.
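
As an illustration of what storing lengths could look like, here is a sketch of a header that records an (offset, length) pair per entry so the reader slices exactly the bytes it needs. The layout and names are assumptions made for this example, not the format in this PR, and opaque byte arrays stand in for serialized bitmaps.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

/** Hypothetical layout: int count, then count x (int offset, int length), then the bodies. */
final class LengthPrefixedBodySketch {
    static byte[] write(List<byte[]> blobs) {
        int bodySize = 0;
        for (byte[] blob : blobs) {
            bodySize += blob.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(4 + blobs.size() * 8 + bodySize);
        buf.putInt(blobs.size());
        int offset = 0;
        for (byte[] blob : blobs) {
            buf.putInt(offset);      // offset relative to the start of the body
            buf.putInt(blob.length); // explicit length, so readers never over-read
            offset += blob.length;
        }
        for (byte[] blob : blobs) {
            buf.put(blob);
        }
        return buf.array();
    }

    /** Returns exactly the bytes of entry `index`, ignoring anything stored after it. */
    static byte[] read(byte[] data, int index) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int count = buf.getInt(0);
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        int bodyStart = 4 + count * 8;
        int offset = buf.getInt(4 + index * 8);
        int length = buf.getInt(4 + index * 8 + 4);
        return Arrays.copyOfRange(data, bodyStart + offset, bodyStart + offset + length);
    }
}
```

The cost is four extra bytes per entry in the header; in exchange, trailing bytes or a change in the bitmap serialization no longer affect how much the reader consumes.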

6. No integration with the predicate pushdown framework

This PR only adds the index implementation but does not wire it into the query planning / file-pruning logic. For example, there is no evidence that StartsWith predicates or LIKE predicates will actually consult this index during scan. This might be intentional (staged PRs), but it would be good to clarify in the PR description whether a follow-up is planned.

7. Missing close() / resource cleanup in the Reader

The Reader class extends FileIndexReader and reads the full byte array eagerly, so there is no resource leak per se. However, the benchmark's queryPrefix method creates a LocalSeekableInputStream on every call and never closes it — this is a resource leak in the benchmark (though not in production code).
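
For the benchmark, try-with-resources would be enough. The sketch below uses a plain `java.io.InputStream` from `Files.newInputStream` as a stand-in for LocalSeekableInputStream, and the method name is hypothetical; the same pattern applies to any stream type that implements `Closeable`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch with a stand-in stream type: the stream is closed even if reading throws. */
final class CloseStreamSketch {
    static byte[] readIndexBytes(Path indexFile) throws IOException {
        try (InputStream in = Files.newInputStream(indexFile)) {
            return in.readAllBytes();
        }
    }
}
```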


Minor / Style

  • The Writer class uses java.util.List and java.util.ArrayList with fully-qualified names inside method bodies (lines in sortedPrefixes()). These should be proper imports at the top of the file for consistency with the rest of the codebase.
  • The benchmark mixes JUnit 4 (@Rule, TemporaryFolder) with JUnit 5 (@Test from jupiter). JUnit Jupiter does not process @Rule fields unless rule migration support (@EnableRuleMigrationSupport) is enabled, which means folder.create() must be called manually (which it is), but it is an unusual pattern. Consider switching to JUnit 5's @TempDir.
  • The VERSION and PREFIX_LENGTH string constants in PrefixFileIndex shadow similar constants in BitmapFileIndex. If these are user-facing option keys, consider namespacing them (e.g., "prefix.prefix-length").

Summary

The core algorithm is correct and the index design is reasonable for its intended use case. The main concerns are: (1) dead code in the reader's offset handling, (2) missing type validation, (3) no stored bitmap lengths making the format fragile, and (4) the linear fallback in hasPrefix() for short query prefixes. The benchmarks are convincing but would benefit from JUnit 5 alignment. Looking forward to seeing the integration with the scan/pushdown layer.
