[feature](inverted index) Introduce SPIMI V4 inverted index storage format#63633
Open
airborne12 wants to merge 5 commits into
Open
[feature](inverted index) Introduce SPIMI V4 inverted index storage format#63633airborne12 wants to merge 5 commits into
airborne12 wants to merge 5 commits into
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
d47cb9d to
ab92f46
Compare
4 tasks
ab92f46 to
9281771
Compare
5 tasks
Member
Author
|
run buildall |
Problem Summary: Doris's inverted index writer is CLucene
`IndexWriter`-driven, which keeps the full posting list in
memory until segment flush and roughly doubles RAM during
fulltext ingest. This PR adds an in-house SPIMI (Single-Pass
In-Memory Indexing) writer plus a CLucene `IndexReader`-subclass
read shim, dispatched via a new `V4` value on
`InvertedIndexStorageFormatPB` / `TInvertedIndexFileStorageFormat`.
The user opts in per-table:
```sql
CREATE TABLE … PROPERTIES ("inverted_index_storage_format" = "V4");
```
Memory results vs the V2 CLucene path (matched workloads, equal
input):
- Diverse vocabulary : -51.997 % (mostly-unique tokens)
- All-unique vocab : -57.864 %
- Repetitive vocab : -70.450 % (hybrid compact mode kicks in)
ARCHITECTURE
```
FE: CREATE TABLE inverted_index_storage_format = V4
→ TInvertedIndexFileStorageFormat.V4 (thrift, =4)
→ InvertedIndexStorageFormatPB.V4 (proto, =3)
BE write path (V4):
InvertedIndexColumnWriter::add_values
→ reusableTokenStream → next() loop
→ SpimiPostingBuffer::Append (in-memory accumulator)
→ SpimiPostingBuffer::Sort
→ SegmentWriter::Emit
├── TermDictWriter → .tis / .tii
├── FreqProxEncoder → .frq / .prx (PFOR + skip list)
├── FieldInfosWriter → .fnm
└── (norms) → .nrm
→ SegmentInfosWriter → segments_N / segments.gen
→ IndexFileWriter compound packing → .idx
BE read path (V4):
ColumnReader factory (V4 && analyzed)
→ SpimiFulltextIndexReader
→ SpimiSearcherBuilder
└── SpimiCLuceneIndexReader (lucene::index::IndexReader subclass)
├── SpimiCLuceneTermEnum
├── SpimiCLuceneTermDocs
└── SpimiCLuceneTermPositions
→ CLucene IndexSearcher (drives all 8 query types unchanged)
→ roaring::Roaring
```
By subclassing `lucene::index::IndexReader`, the CLucene query
engine itself drives `MATCH_ANY`, `MATCH_ALL`, `MATCH_PHRASE`,
`MATCH_PHRASE_PREFIX`, `MATCH_REGEXP`, `EQUAL_QUERY`, `IS NULL`,
and raw-scan queries without any Doris-side query reimplementation.
KEY ENGINEERING DECISIONS
- **Pure SPIMI, no CLucene IndexWriter on V4.** Earlier work used
a "shadow" mode running SPIMI alongside CLucene for byte-equality
validation; V4 explicitly drops the CLucene side so the memory
savings are real. Shadow mode is preserved (gated by
`config::inverted_index_fulltext_spimi_shadow`) for ongoing
byte-equality testing.
- **Hybrid compact mode.** A per-term tagged-VInt stream replaces
the flat 12-byte record format when the avg occurrence/term
ratio crosses a threshold; this is what unlocks the -70.45 % on
repetitive vocabularies.
- **Byte format = Lucene 2.x, packaged via Doris's V2 compound
packer.** SPIMI files share the read path with V1/V2/V3 at
the IndexFileWriter layer; only the in-segment encoding
differs.
- **Byte-parser hardening.** Every reader treats its input as
untrusted: DCHECK→throw for every bound check, allocations
capped at `1<<24`, no pointer arithmetic on attacker-influenced
offsets.
- **`SpimiCLuceneIndexReader` marked `final`.** A future CLucene
upgrade adding a pure virtual surfaces as a compile-time
"abstract class marked final" diagnostic instead of throwing
`CL_ERR_UnsupportedOperation` at query time.
- **`doc()` distinguishes pre-start (-1) from post-exhausted
(INT_MAX).** CLucene's `SegmentTermDocs` contract has two
terminal states; conflating them breaks `PhraseQuery::do_next`'s
`_others`-advancement loop for 3+ token phrases.
- **Empty-string rows are NOT marked NULL.** Mirrors V2's
semantics for `WHERE body IS NULL`.
- **Zero-term segments emit a sentinel TII entry.** Honest
all-NULL / all-empty segments are now legitimate input to the
reader without crashing its hardening.
- **V4 + non-analyzed (keyword) columns rejected at writer init**
so users see the misconfiguration before ingesting data.
FILES
- `contrib/clucene` (submodule) — bumped to pull in the
`getTermInfosRAMUsed` / posting RAM accounting hooks SPIMI's
reader relies on, plus a couple of CLucene-side memory cuts
the design doc references.
- `gensrc/proto/olap_file.proto` / `gensrc/thrift/Types.thrift` —
V4 enum addition.
- `fe/fe-core/.../PropertyAnalyzer.java` — accepts `"v4"` literal,
plus `Config.inverted_index_storage_format = "V4"` global.
- `fe/fe-core/.../CloudInternalCatalog.java` — thrift→PB mapping.
- `be/src/storage/index/inverted/spimi/**` (32 files) — new SPIMI
directory.
- `be/src/storage/index/inverted/inverted_index_{writer,reader,
searcher,query_type}.{cpp,h}` — V4 dispatch.
- `be/src/storage/segment/column_reader.cpp` — factory V4 branch.
- `be/src/storage/index/inverted/SPIMI_DESIGN.md` — design doc.
New per-table storage format `inverted_index_storage_format =
"V4"` reduces fulltext index memory by 51–70 % depending on
vocabulary diversity, with no behavior change for the 8 production
query types (validated byte-byte-equal vs V2 on identical data).
- Test: Unit Test (108 SPIMI BE UTs), Regression Test
(`test_storage_format_v4.groovy`, plus all 5
`inverted_index_p0/storage_format` suites on both single-node
and cloud / storage-compute-separation clusters).
- Behavior changed: only when `inverted_index_storage_format =
"V4"` is set; V1/V2/V3/default behavior unchanged.
- Does this need documentation: V4 is opt-in; design doc included
in `be/src/storage/index/inverted/SPIMI_DESIGN.md`.
… regression
### What problem does this PR solve?
Problem Summary: Test coverage for the V4 SPIMI inverted index
storage format. Split into 28 BE unit-test files (per SPIMI
component) plus an end-to-end integration test, and one Groovy
regression suite that drives the full FE→BE→storage pipeline
against a live Doris cluster.
COVERAGE MATRIX
BE unit tests (28 files, 108 tests):
- `posting_buffer_test` — SPIMI accumulator, hybrid compact mode
- `pfor_encoder_test` — PFOR + skip-list encoder roundtrip
- `freq_prox_encoder_test` — .frq / .prx wire format
- `term_dict_{writer,reader}_test` — .tis / .tii roundtrip
- `term_docs_reader_test` — postings decoder
- `prox_reader_test` — positions decoder
- `field_infos_writer_test` — .fnm wire format
- `segment_infos_{writer,reader}_test` — segments_N manifest
- `skip_list_writer_test` — skip-list emit
- `segment_{writer,roundtrip}_test` — full segment-emit
- `clucene_term_{enum,docs,positions}_test` — IndexReader
shim correctness
- `clucene_index_reader_test` — virtual override set + final
marker invariants
- `fulltext_writer_test` — high-level facade
- `clucene_term_dict_differential_test` — SPIMI vs CLucene
byte-equality under shadow mode
- `integration_test` — end-to-end via CLucene
IndexOutput adapter
- `memory_budget_test` — guardrail asserting the ≥ 50 %
reduction on diverse / unique vocab and "at least no
increase" on repetitive vocab
- `lucene_output_test` / `file_lucene_output_test` /
`index_output_lucene_output_test` — output-stream adapters
- `tee_token_stream_test` — shadow-mode tap
- `inverted_index_writer_test::V4FullPipelineEndToEnd` — full
production-chain integration test that proved 3 cloud-only
bugs would have shipped (doc() pre-start sentinel, empty-
string null bitmap, zero-term segment .tii).
Regression suite (`test_storage_format_v4.groovy`, 19 assertions):
- 5 core MATCH queries on hand-crafted data
- 3-token MATCH_PHRASE (exercises `_others` array path)
- MATCH_REGEXP, MATCH_PHRASE_PREFIX
- MATCH_PHRASE vs MATCH_ALL distinguishability (locks in phrase
semantics; `MATCH_PHRASE 'the brown'` returns ∅ while
`MATCH_ALL` returns {1})
- NULL row + empty-string row handling
- Cross-format parity: same data on V2 and V4 tables produces
identical id sets for 6 query types — the strongest possible
black-box correctness check.
VALIDATED ON
- Single-node ASAN BE.
- Cloud cluster (storage-compute separation: FoundationDB
metaservice + recycler + cloud-mode FE + cloud-mode BE).
Cloud-mode surfaced 3 production bugs that single-node passed
by chance.
### Release note
V4 SPIMI ships with 108 BE unit tests and a 19-assertion
regression suite covering 8 query types and cross-format V2/V4
parity.
### Check List (For Author)
- Test: Unit Test, Regression Test.
- Behavior changed: No — tests only.
- Does this need documentation: No.
Issue Number: close #xxx
Problem Summary:
The initial SPIMI V4 write path (9fde735e7df + 39d3a1d6cf0) shipped with
a CLucene-shaped abstraction surface (`LuceneOutput`, `clucene_*` reader
classes) and limited defensive coverage. A multi-agent review surfaced
five concrete gaps that this change closes together:
1. Post-flush self-validation. `EmitSegment` now records per-stream
byte counts (`EmittedSegmentByteCounts`) and a sibling
`ValidateClosedSegmentByteCounts` re-queries on-disk lengths via
the Directory after close, throwing `INVERTED_INDEX_FILE_CORRUPTED`
on any partial-flush mismatch. This catches the P44–P46 cloud
pattern where async-S3 uploads silently truncate a segment file.
2. Corruption-path unit tests. New `corruption_paths_test.cpp`
exercises every `SPIMI_THROW_CORRUPT` site reachable from a public
reader entry point: `SegmentInfosReader::Read`,
`SpimiTermDocsReader::ReadTerm`, `SpimiProxReader::ReadPositions`,
`SpimiPforDecoder::DecodeBlockFromBytes`. Each crafted buffer maps
to a specific bad-input branch.
3. Benchmark statistical rigor. `run_spimi_throughput_workload` now
takes 11 samples (was 3), discards 2 warmup runs, randomizes V2/V4
per-iteration order to defeat page-cache warming bias, and reports
full distribution (min/p25/median/p75/max). Regression guard uses
median-of-9 plus a 2x-worst-run guard against single pathological
runs. Workload size is gated via `SPIMI_BENCH=1|large` so default
UT runs stay as regression guards while honest scaling validation
moves to the env-var-gated tier.
4. `omit_tfap` and cloud regression coverage. `test_storage_format_v4`
gains a `support_phrase=false` column that exercises the no-prox
writer/reader contract that previously had zero end-to-end
coverage. New `test_storage_format_v4_cloud` runs V4 through the
cloud storage vault when `isCloudMode()` so async-flush behavior
gets a dedicated regression.
5. SPIMI/CLucene naming separation. `LuceneOutput` /
`MemoryLuceneOutput` / `FileLuceneOutput` / `IndexOutputLuceneOutput`
renamed to `ByteOutput` family; `clucene_*` reader classes renamed
to `query_*`; `_CLTHROWA` byte-parser hard-fails migrated to
`doris::Exception` via the new `SPIMI_THROW_CORRUPT` macro in
`byte_parser_error.h`. The single remaining CLucene dependency in
`spimi/` is `IndexOutputByteOutput`, the explicit
ByteOutput-to-CLucene IndexOutput adapter that the production
write path must go through until `DorisFSDirectory` itself is
replaced.
None — internal hardening of an unreleased V4 storage format.
- Test:
- Unit Test: 40 BE UTs added/extended, all passing (validator
tests, corruption-path tests, refreshed throughput stats).
- Regression test: `test_storage_format_v4` extended +
`test_storage_format_v4_cloud` added, both passing locally
against the cloud-sim cluster.
- Behavior changed: No (test/observability changes; the validator
surfaces existing failure modes earlier).
- Does this need documentation: No
…ix V4 reader UT dispatch
Issue Number: close #xxx
Problem Summary:
Two gaps in the SPIMI V4 test coverage closed together:
1. **No storage-size benchmark.** Throughput / memory tests measured
write-side wins (V4: -55 % memory, -68 % CPU on diverse vocab) but
nothing measured the on-disk segment size. Customers see segment
size directly (storage cost, scan I/O, cloud transfer); a
structural change in the segment format would slip past existing
tests. Added `FullTextSpimiIdxSize*` tests that write the same
input through V2 and V4, read back the `.idx` file size, and
compare. Empirically the V4 segment is within 1 % of V2 on
diverse-vocab (Lucene 2.x format match) and ~10 % bigger on
repetitive (PFOR sub-block header overhead — documented trade-off).
2. **No V4 query-latency benchmark, and the existing reader UT pattern
crashed on V4 segments.** UT-level V4 reads in
`InvertedIndexReaderTest` previously segfaulted: tests
instantiated `FullTextIndexReader::create_shared(...)` for V4
segments, which routes the searcher build through CLucene's
`FulltextIndexSearcherBuilder` — but V4 segments don't have the
CLucene compound layout that builder expects. Production code at
`column_reader.cpp:657-664` dispatches correctly to
`SpimiFulltextIndexReader::create_shared(...)` for V4, which
`type()`-overrides to `SPIMI_FULLTEXT` and routes the searcher
build to `SpimiSearcherBuilder`. The UT pattern never picked up
that dispatch.
Fixed by adding two new tests in `inverted_index_reader_test.cpp`:
- `SpimiV4SingleQueryProbe`: writes a V4 segment via
`prepare_string_index(..., V4)`, opens with
`SpimiFulltextIndexReader`, runs MATCH_PHRASE — expects 600
matches, completes in ~8 ms. This locks in the production-
correct UT dispatch so the V4 read path stops being a UT-level
black box.
- `SpimiV2V4QueryLatencyBenchmark`: writes the same data through
both formats, runs 11 query iterations per format with V2/V4
alternation per iteration (defeats searcher-cache warm-up bias),
drops 2 warmups, reports min/p25/median/p75/max distribution.
Measured V4/V2 ratio is ~0.99-1.03 across high-freq and low-freq
posting lists — V4 reader is at parity with CLucene at this
scale.
A standalone regression-level latency suite
(`test_storage_format_v4_query_latency.groovy`) is included for
cluster-wide timing — UT-level numbers exclude planner / executor /
network cost, which only the cluster path can measure honestly.
None — test-only additions.
- Test:
- Unit Test: 5 new BE UTs (3 idx-size + 2 query-latency), all
passing locally. The query benchmark also validates V2/V4
cardinality parity per query as a correctness gate before
reporting latency.
- Regression test: 1 new suite
(`test_storage_format_v4_query_latency`); does not need cloud
cluster.
- Behavior changed: No (test-only).
- Does this need documentation: No.
Three cleanups bundled to recover the branch after a master rebase:
1. **Remove SPIMI shadow-mode code path.** Per the V4-pure design
directive, V4 must not co-exist with CLucene at write time. The
shadow-mode toggle `inverted_index_fulltext_spimi_shadow` was a
dev-only validation aid that tee'd SPIMI onto V1/V2/V3 writes for
byte-equality comparison vs CLucene. With V4 shipping as pure
SPIMI, this is dead weight. Removed:
- 10 `FullTextSpimiShadow*` / `FullTextSpimiByteEqual_*_DIAG` tests
- The `inverted_index_fulltext_spimi_shadow` config switch
- `release_spimi_shadow_for_array_path()` and shadow buffer
plumbing from `InvertedIndexColumnWriter` (kept the V4 array
rejection that's still useful)
- `run_spimi_memory_workload` refactored from "shadow on/off" to
"V2/V4 format-switch" — same intent (V2 vs V4 memory compare)
but uses the production path instead of the shadow tee
2. **Adapt to master API changes.** Master refactored two surfaces
the SPIMI branch touched:
- `FullTextIndexReader::query(...)` and `try_query(...)` now take
`const Field&` instead of `const void*`. Updated
`SpimiFulltextIndexReader::try_query` to match the new
signature. Updated all test call sites that built a
`StringRef` and passed `&ref` to construct a typed
`Field::create_field<TYPE_STRING>(str)` and pass it by reference.
- `IndexColumnWriter::create(...)` now accepts `const
TabletColumn*` directly; the `StorageField` / `StorageFieldFactory`
wrapper layer is gone. Updated 17 test call sites accordingly.
3. **clang-format.** Re-ran formatter on all changed C++ files. No
semantic changes — indent normalisation in the writer's
`add_values` slice path is the main visible diff.
### Release note
None — internal cleanup + master-rebase adaptation.
### Check List (For Author)
- Test:
- Unit Test: 163 SPIMI UTs pass after the refactor.
- Behavior changed: No — removed config switch was a dev-only
validation toggle that defaulted off; production users never set it.
- Does this need documentation: No.
9281771 to
0f96cfa
Compare
Contributor
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
Member
Author
|
run buildall |
Contributor
FE UT Coverage ReportIncrement line coverage |
Contributor
FE Regression Coverage ReportIncrement line coverage |
2 similar comments
Contributor
FE Regression Coverage ReportIncrement line coverage |
Contributor
FE Regression Coverage ReportIncrement line coverage |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Introduces a new inverted index storage format V4 powered by SPIMI
(Single-Pass In-Memory Indexing), replacing the CLucene
IndexWriteron the write path for analyzed (fulltext) string columns.
Why
CLucene's
IndexWriteraccumulates per-tokenPostinglinked-listnodes plus a term hash table plus a char[] interning pool. On Doris
fulltext columns this dominates BE memory during write and shows up
in OOM kills on large segments. The encoding is byte-equivalent to
Lucene 2.x, but the in-memory representation is the cost. SPIMI
keeps a flat
(term_id, doc_id, position)record array plus asingle intern arena, then sorts + emits the same Lucene 2.x sibling
files (
.tis/.tii/.frq/.prx/.fnm/segments_N) onfinish(). Theon-disk format is unchanged; only the writer's working memory shape
changes.
Measured impact (SPIMI_BENCH=1, ~614 K occurrences/segment)
.idxon-disk sizeRepetitive vocab is the architectural trade-off region: V4's
compact-mode VInt-delta stream scales per-occurrence while CLucene's
Posting struct scales per-unique-term. Absolute memory in this
regime is sub-MB on both sides, so the percentage swing has no
production impact. Storage-size delta on repetitive is the
documented PFOR header cost.
What's in this PR
be/src/storage/index/inverted/spimi/):SpimiPostingBuffer(flat record + arena + intern map withhybrid compact-mode VInt-delta migration),
SegmentWriter,TermDictWriter,FieldInfosWriter,SegmentInfosWriter,PFOR encoder for high-doc-freq postings,
ByteOutputfamilyabstracting CLucene's
IndexOutput.SpimiQueryIndexReader,SpimiTermDocsReader,SpimiProxReader,SpimiTermEnum,SpimiSearcherBuilder;SpimiFulltextIndexReaderis theDoris-side adapter (overrides
type() -> SPIMI_FULLTEXTsothe searcher cache routes correctly).
column_reader.cppdispatch: V4 storage format → SPIMIreader; V1/V2/V3 unchanged.
EmitSegmentpost-flush self-validation:ValidateClosedSegmentByteCountsre-queries on-disk filelengths after close, throws
INVERTED_INDEX_FILE_CORRUPTEDonmismatch — guards against the async-S3 partial-flush class of
bugs that single-node tests can't see.
be/test/storage/index/inverted/spimi/plus extended tests under
be/test/storage/segment/:SPIMI_THROW_CORRUPTsite (segments_N / .frq / .prx / PFOR / .tis-.tii readers)
fault-injection case
.idxbyte parity)randomized V2/V4 alternation + full distribution report
repetitive workloads
(
InvertedIndexReaderTest.SpimiV2V4QueryLatencyBenchmark)using the corrected
SpimiFulltextIndexReader::create_shareddispatch
SPIMI_BENCHenv-var tier: default UT runs use 12 Koccurrences (fast regression guard);
SPIMI_BENCH=1scales to~614 K,
SPIMI_BENCH=largescales to ~6 M for full-segmentstress. Keeps headline benchmark numbers reproducible without
ballooning every UT pass.
inverted_index_p0/storage_format/test_storage_format_v4— V2 vs V4 black-box parity across MATCH_ANY / MATCH_ALL /
MATCH_PHRASE / MATCH_PHRASE_PREFIX / MATCH_REGEXP, NULL/empty
handling, and the
support_phrase=false(omit_tfap) no-proxwrite+read path.
test_storage_format_v4_cloud— same coverage gated byisCloudMode()so the async-S3 upload path gets exercised.test_storage_format_v4_query_latency— cluster-levelV2 vs V4 query timing distribution.
PropertyAnalyzer,TabletIndex,OlapTable): acceptinverted_index_storage_format=V4inCREATE TABLE PROPERTIES; propagate through the protocol to BE.
What's NOT in this PR (known gaps)
currently emits a single
_0segment per column; compaction isdocumented as a follow-up in
SPIMI_DESIGN.md.omit_norms=true; the readside synthesizes a default-norm array. Score-using paths
(
MATCH_ALLwith relevance ordering) fall back to V2 behavioron V4 columns. Listed in design doc.
(
should_analyzer=false) and numeric (BKD) paths remain on theexisting writers.
Release note
Add inverted index storage format V4, an in-house SPIMI-based writer
that reduces BE write-side memory by ~55 % and CPU by ~68 % on
diverse-vocab fulltext workloads while keeping segment on-disk
format Lucene 2.x compatible. Enable by setting
inverted_index_storage_format = "V4"in CREATE TABLE PROPERTIES.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)