[feature](inverted index) Introduce SPIMI V4 inverted index storage format by airborne12 · Pull Request #63633 · apache/doris

airborne12 · 2026-05-25T14:25:33Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Introduces a new inverted index storage format V4 powered by SPIMI
(Single-Pass In-Memory Indexing), replacing the CLucene IndexWriter
on the write path for analyzed (fulltext) string columns.

Why

CLucene's IndexWriter accumulates per-token Posting linked-list
nodes plus a term hash table plus a char[] interning pool. On Doris
fulltext columns this dominates BE memory during write and shows up
in OOM kills on large segments. The encoding is byte-equivalent to
Lucene 2.x, but the in-memory representation is the cost. SPIMI
keeps a flat (term_id, doc_id, position) record array plus a
single intern arena, then sorts + emits the same Lucene 2.x sibling
files (.tis/.tii/.frq/.prx/.fnm/segments_N) on finish(). The
on-disk format is unchanged; only the writer's working memory shape
changes.

Measured impact (SPIMI_BENCH=1, ~614 K occurrences/segment)

Dimension	V4 vs V2 (mostly_unique / all_unique)	V4 vs V2 (repetitive, vocab=16)
Writer peak memory	−55.6 % / −55.6 %	+406 % (160 KB → 811 KB; both negligible)
Writer CPU (median)	−68 % / −68 %	+5 % (within bench cap)
`.idx` on-disk size	~0 %	+8 % (PFOR sub-block header overhead)
Query latency	~0 %	(not measured at bench scale)

Repetitive vocab is the architectural trade-off region: V4's
compact-mode VInt-delta stream scales per-occurrence while CLucene's
Posting struct scales per-unique-term. Absolute memory in this
regime is sub-MB on both sides, so the percentage swing has no
production impact. Storage-size delta on repetitive is the
documented PFOR header cost.

What's in this PR

V4 writer pipeline (be/src/storage/index/inverted/spimi/):
SpimiPostingBuffer (flat record + arena + intern map with
hybrid compact-mode VInt-delta migration), SegmentWriter,
TermDictWriter, FieldInfosWriter, SegmentInfosWriter,
PFOR encoder for high-doc-freq postings, ByteOutput family
abstracting CLucene's IndexOutput.
V4 reader pipeline: SpimiQueryIndexReader,
SpimiTermDocsReader, SpimiProxReader, SpimiTermEnum,
SpimiSearcherBuilder; SpimiFulltextIndexReader is the
Doris-side adapter (overrides type() -> SPIMI_FULLTEXT so
the searcher cache routes correctly).
column_reader.cpp dispatch: V4 storage format → SPIMI
reader; V1/V2/V3 unchanged.
EmitSegment post-flush self-validation:
ValidateClosedSegmentByteCounts re-queries on-disk file
lengths after close, throws INVERTED_INDEX_FILE_CORRUPTED on
mismatch — guards against the async-S3 partial-flush class of
bugs that single-node tests can't see.
108 BE unit tests under be/test/storage/index/inverted/spimi/
plus extended tests under be/test/storage/segment/:
- 17 corruption-path tests covering every SPIMI_THROW_CORRUPT
  site (segments_N / .frq / .prx / PFOR / .tis-.tii readers)
- 7 byte-count validator tests including the truncation
  fault-injection case
- Storage-size benchmark (V2 vs V4 .idx byte parity)
- Throughput benchmark with 11 runs + 2 warmup discards +
  randomized V2/V4 alternation + full distribution report
- Memory benchmark across mostly_unique / all_unique /
  repetitive workloads
- Query-latency benchmark via the production read path
  (InvertedIndexReaderTest.SpimiV2V4QueryLatencyBenchmark)
  using the corrected SpimiFulltextIndexReader::create_shared
  dispatch
SPIMI_BENCH env-var tier: default UT runs use 12 K
occurrences (fast regression guard); SPIMI_BENCH=1 scales to
~614 K, SPIMI_BENCH=large scales to ~6 M for full-segment
stress. Keeps headline benchmark numbers reproducible without
ballooning every UT pass.
Regression suites:
- inverted_index_p0/storage_format/test_storage_format_v4
  — V2 vs V4 black-box parity across MATCH_ANY / MATCH_ALL /
  MATCH_PHRASE / MATCH_PHRASE_PREFIX / MATCH_REGEXP, NULL/empty
  handling, and the support_phrase=false (omit_tfap) no-prox
  write+read path.
- test_storage_format_v4_cloud — same coverage gated by
  isCloudMode() so the async-S3 upload path gets exercised.
- test_storage_format_v4_query_latency — cluster-level
  V2 vs V4 query timing distribution.
FE plumbing (PropertyAnalyzer, TabletIndex,
OlapTable): accept inverted_index_storage_format=V4 in
CREATE TABLE PROPERTIES; propagate through the protocol to BE.

What's NOT in this PR (known gaps)

V4 segment compaction across multiple SPIMI segments — V4
currently emits a single _0 segment per column; compaction is
documented as a follow-up in SPIMI_DESIGN.md.
BM25-style scoring on V4 — V4 sets omit_norms=true; the read
side synthesizes a default-norm array. Score-using paths
(MATCH_ALL with relevance ordering) fall back to V2 behavior
on V4 columns. Listed in design doc.
V4 only covers analyzed (fulltext) string columns. Keyword-mode
(should_analyzer=false) and numeric (BKD) paths remain on the
existing writers.

Release note

Add inverted index storage format V4, an in-house SPIMI-based writer
that reduces BE write-side memory by ~55 % and CPU by ~68 % on
diverse-vocab fulltext workloads while keeping segment on-disk
format Lucene 2.x compatible. Enable by setting
inverted_index_storage_format = "V4" in CREATE TABLE PROPERTIES.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes. New value 'V4' accepted by inverted_index_storage_format property; V1/V2/V3 paths unchanged.
Does this need documentation?
- No.
- Yes. Doc PR will follow against apache/doris-website.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-05-25T14:25:39Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

gavinchou

LGTM

airborne12 · 2026-05-26T02:43:59Z

run buildall

Problem Summary: Doris's inverted index writer is CLucene `IndexWriter`-driven, which keeps the full posting list in memory until segment flush and roughly doubles RAM during fulltext ingest. This PR adds an in-house SPIMI (Single-Pass In-Memory Indexing) writer plus a CLucene `IndexReader`-subclass read shim, dispatched via a new `V4` value on `InvertedIndexStorageFormatPB` / `TInvertedIndexFileStorageFormat`. The user opts in per-table: ```sql CREATE TABLE … PROPERTIES ("inverted_index_storage_format" = "V4"); ``` Memory results vs the V2 CLucene path (matched workloads, equal input): - Diverse vocabulary : -51.997 % (mostly-unique tokens) - All-unique vocab : -57.864 % - Repetitive vocab : -70.450 % (hybrid compact mode kicks in) ARCHITECTURE ``` FE: CREATE TABLE inverted_index_storage_format = V4 → TInvertedIndexFileStorageFormat.V4 (thrift, =4) → InvertedIndexStorageFormatPB.V4 (proto, =3) BE write path (V4): InvertedIndexColumnWriter::add_values → reusableTokenStream → next() loop → SpimiPostingBuffer::Append (in-memory accumulator) → SpimiPostingBuffer::Sort → SegmentWriter::Emit ├── TermDictWriter → .tis / .tii ├── FreqProxEncoder → .frq / .prx (PFOR + skip list) ├── FieldInfosWriter → .fnm └── (norms) → .nrm → SegmentInfosWriter → segments_N / segments.gen → IndexFileWriter compound packing → .idx BE read path (V4): ColumnReader factory (V4 && analyzed) → SpimiFulltextIndexReader → SpimiSearcherBuilder └── SpimiCLuceneIndexReader (lucene::index::IndexReader subclass) ├── SpimiCLuceneTermEnum ├── SpimiCLuceneTermDocs └── SpimiCLuceneTermPositions → CLucene IndexSearcher (drives all 8 query types unchanged) → roaring::Roaring ``` By subclassing `lucene::index::IndexReader`, the CLucene query engine itself drives `MATCH_ANY`, `MATCH_ALL`, `MATCH_PHRASE`, `MATCH_PHRASE_PREFIX`, `MATCH_REGEXP`, `EQUAL_QUERY`, `IS NULL`, and raw-scan queries without any Doris-side query reimplementation. KEY ENGINEERING DECISIONS - **Pure SPIMI, no CLucene IndexWriter on V4.** Earlier work used a "shadow" mode running SPIMI alongside CLucene for byte-equality validation; V4 explicitly drops the CLucene side so the memory savings are real. Shadow mode is preserved (gated by `config::inverted_index_fulltext_spimi_shadow`) for ongoing byte-equality testing. - **Hybrid compact mode.** A per-term tagged-VInt stream replaces the flat 12-byte record format when the avg occurrence/term ratio crosses a threshold; this is what unlocks the -70.45 % on repetitive vocabularies. - **Byte format = Lucene 2.x, packaged via Doris's V2 compound packer.** SPIMI files share the read path with V1/V2/V3 at the IndexFileWriter layer; only the in-segment encoding differs. - **Byte-parser hardening.** Every reader treats its input as untrusted: DCHECK→throw for every bound check, allocations capped at `1<<24`, no pointer arithmetic on attacker-influenced offsets. - **`SpimiCLuceneIndexReader` marked `final`.** A future CLucene upgrade adding a pure virtual surfaces as a compile-time "abstract class marked final" diagnostic instead of throwing `CL_ERR_UnsupportedOperation` at query time. - **`doc()` distinguishes pre-start (-1) from post-exhausted (INT_MAX).** CLucene's `SegmentTermDocs` contract has two terminal states; conflating them breaks `PhraseQuery::do_next`'s `_others`-advancement loop for 3+ token phrases. - **Empty-string rows are NOT marked NULL.** Mirrors V2's semantics for `WHERE body IS NULL`. - **Zero-term segments emit a sentinel TII entry.** Honest all-NULL / all-empty segments are now legitimate input to the reader without crashing its hardening. - **V4 + non-analyzed (keyword) columns rejected at writer init** so users see the misconfiguration before ingesting data. FILES - `contrib/clucene` (submodule) — bumped to pull in the `getTermInfosRAMUsed` / posting RAM accounting hooks SPIMI's reader relies on, plus a couple of CLucene-side memory cuts the design doc references. - `gensrc/proto/olap_file.proto` / `gensrc/thrift/Types.thrift` — V4 enum addition. - `fe/fe-core/.../PropertyAnalyzer.java` — accepts `"v4"` literal, plus `Config.inverted_index_storage_format = "V4"` global. - `fe/fe-core/.../CloudInternalCatalog.java` — thrift→PB mapping. - `be/src/storage/index/inverted/spimi/**` (32 files) — new SPIMI directory. - `be/src/storage/index/inverted/inverted_index_{writer,reader, searcher,query_type}.{cpp,h}` — V4 dispatch. - `be/src/storage/segment/column_reader.cpp` — factory V4 branch. - `be/src/storage/index/inverted/SPIMI_DESIGN.md` — design doc. New per-table storage format `inverted_index_storage_format = "V4"` reduces fulltext index memory by 51–70 % depending on vocabulary diversity, with no behavior change for the 8 production query types (validated byte-byte-equal vs V2 on identical data). - Test: Unit Test (108 SPIMI BE UTs), Regression Test (`test_storage_format_v4.groovy`, plus all 5 `inverted_index_p0/storage_format` suites on both single-node and cloud / storage-compute-separation clusters). - Behavior changed: only when `inverted_index_storage_format = "V4"` is set; V1/V2/V3/default behavior unchanged. - Does this need documentation: V4 is opt-in; design doc included in `be/src/storage/index/inverted/SPIMI_DESIGN.md`.

… regression ### What problem does this PR solve? Problem Summary: Test coverage for the V4 SPIMI inverted index storage format. Split into 28 BE unit-test files (per SPIMI component) plus an end-to-end integration test, and one Groovy regression suite that drives the full FE→BE→storage pipeline against a live Doris cluster. COVERAGE MATRIX BE unit tests (28 files, 108 tests): - `posting_buffer_test` — SPIMI accumulator, hybrid compact mode - `pfor_encoder_test` — PFOR + skip-list encoder roundtrip - `freq_prox_encoder_test` — .frq / .prx wire format - `term_dict_{writer,reader}_test` — .tis / .tii roundtrip - `term_docs_reader_test` — postings decoder - `prox_reader_test` — positions decoder - `field_infos_writer_test` — .fnm wire format - `segment_infos_{writer,reader}_test` — segments_N manifest - `skip_list_writer_test` — skip-list emit - `segment_{writer,roundtrip}_test` — full segment-emit - `clucene_term_{enum,docs,positions}_test` — IndexReader shim correctness - `clucene_index_reader_test` — virtual override set + final marker invariants - `fulltext_writer_test` — high-level facade - `clucene_term_dict_differential_test` — SPIMI vs CLucene byte-equality under shadow mode - `integration_test` — end-to-end via CLucene IndexOutput adapter - `memory_budget_test` — guardrail asserting the ≥ 50 % reduction on diverse / unique vocab and "at least no increase" on repetitive vocab - `lucene_output_test` / `file_lucene_output_test` / `index_output_lucene_output_test` — output-stream adapters - `tee_token_stream_test` — shadow-mode tap - `inverted_index_writer_test::V4FullPipelineEndToEnd` — full production-chain integration test that proved 3 cloud-only bugs would have shipped (doc() pre-start sentinel, empty- string null bitmap, zero-term segment .tii). Regression suite (`test_storage_format_v4.groovy`, 19 assertions): - 5 core MATCH queries on hand-crafted data - 3-token MATCH_PHRASE (exercises `_others` array path) - MATCH_REGEXP, MATCH_PHRASE_PREFIX - MATCH_PHRASE vs MATCH_ALL distinguishability (locks in phrase semantics; `MATCH_PHRASE 'the brown'` returns ∅ while `MATCH_ALL` returns {1}) - NULL row + empty-string row handling - Cross-format parity: same data on V2 and V4 tables produces identical id sets for 6 query types — the strongest possible black-box correctness check. VALIDATED ON - Single-node ASAN BE. - Cloud cluster (storage-compute separation: FoundationDB metaservice + recycler + cloud-mode FE + cloud-mode BE). Cloud-mode surfaced 3 production bugs that single-node passed by chance. ### Release note V4 SPIMI ships with 108 BE unit tests and a 19-assertion regression suite covering 8 query types and cross-format V2/V4 parity. ### Check List (For Author) - Test: Unit Test, Regression Test. - Behavior changed: No — tests only. - Does this need documentation: No.

Issue Number: close #xxx Problem Summary: The initial SPIMI V4 write path (9fde735e7df + 39d3a1d6cf0) shipped with a CLucene-shaped abstraction surface (`LuceneOutput`, `clucene_*` reader classes) and limited defensive coverage. A multi-agent review surfaced five concrete gaps that this change closes together: 1. Post-flush self-validation. `EmitSegment` now records per-stream byte counts (`EmittedSegmentByteCounts`) and a sibling `ValidateClosedSegmentByteCounts` re-queries on-disk lengths via the Directory after close, throwing `INVERTED_INDEX_FILE_CORRUPTED` on any partial-flush mismatch. This catches the P44–P46 cloud pattern where async-S3 uploads silently truncate a segment file. 2. Corruption-path unit tests. New `corruption_paths_test.cpp` exercises every `SPIMI_THROW_CORRUPT` site reachable from a public reader entry point: `SegmentInfosReader::Read`, `SpimiTermDocsReader::ReadTerm`, `SpimiProxReader::ReadPositions`, `SpimiPforDecoder::DecodeBlockFromBytes`. Each crafted buffer maps to a specific bad-input branch. 3. Benchmark statistical rigor. `run_spimi_throughput_workload` now takes 11 samples (was 3), discards 2 warmup runs, randomizes V2/V4 per-iteration order to defeat page-cache warming bias, and reports full distribution (min/p25/median/p75/max). Regression guard uses median-of-9 plus a 2x-worst-run guard against single pathological runs. Workload size is gated via `SPIMI_BENCH=1|large` so default UT runs stay as regression guards while honest scaling validation moves to the env-var-gated tier. 4. `omit_tfap` and cloud regression coverage. `test_storage_format_v4` gains a `support_phrase=false` column that exercises the no-prox writer/reader contract that previously had zero end-to-end coverage. New `test_storage_format_v4_cloud` runs V4 through the cloud storage vault when `isCloudMode()` so async-flush behavior gets a dedicated regression. 5. SPIMI/CLucene naming separation. `LuceneOutput` / `MemoryLuceneOutput` / `FileLuceneOutput` / `IndexOutputLuceneOutput` renamed to `ByteOutput` family; `clucene_*` reader classes renamed to `query_*`; `_CLTHROWA` byte-parser hard-fails migrated to `doris::Exception` via the new `SPIMI_THROW_CORRUPT` macro in `byte_parser_error.h`. The single remaining CLucene dependency in `spimi/` is `IndexOutputByteOutput`, the explicit ByteOutput-to-CLucene IndexOutput adapter that the production write path must go through until `DorisFSDirectory` itself is replaced. None — internal hardening of an unreleased V4 storage format. - Test: - Unit Test: 40 BE UTs added/extended, all passing (validator tests, corruption-path tests, refreshed throughput stats). - Regression test: `test_storage_format_v4` extended + `test_storage_format_v4_cloud` added, both passing locally against the cloud-sim cluster. - Behavior changed: No (test/observability changes; the validator surfaces existing failure modes earlier). - Does this need documentation: No

…ix V4 reader UT dispatch Issue Number: close #xxx Problem Summary: Two gaps in the SPIMI V4 test coverage closed together: 1. **No storage-size benchmark.** Throughput / memory tests measured write-side wins (V4: -55 % memory, -68 % CPU on diverse vocab) but nothing measured the on-disk segment size. Customers see segment size directly (storage cost, scan I/O, cloud transfer); a structural change in the segment format would slip past existing tests. Added `FullTextSpimiIdxSize*` tests that write the same input through V2 and V4, read back the `.idx` file size, and compare. Empirically the V4 segment is within 1 % of V2 on diverse-vocab (Lucene 2.x format match) and ~10 % bigger on repetitive (PFOR sub-block header overhead — documented trade-off). 2. **No V4 query-latency benchmark, and the existing reader UT pattern crashed on V4 segments.** UT-level V4 reads in `InvertedIndexReaderTest` previously segfaulted: tests instantiated `FullTextIndexReader::create_shared(...)` for V4 segments, which routes the searcher build through CLucene's `FulltextIndexSearcherBuilder` — but V4 segments don't have the CLucene compound layout that builder expects. Production code at `column_reader.cpp:657-664` dispatches correctly to `SpimiFulltextIndexReader::create_shared(...)` for V4, which `type()`-overrides to `SPIMI_FULLTEXT` and routes the searcher build to `SpimiSearcherBuilder`. The UT pattern never picked up that dispatch. Fixed by adding two new tests in `inverted_index_reader_test.cpp`: - `SpimiV4SingleQueryProbe`: writes a V4 segment via `prepare_string_index(..., V4)`, opens with `SpimiFulltextIndexReader`, runs MATCH_PHRASE — expects 600 matches, completes in ~8 ms. This locks in the production- correct UT dispatch so the V4 read path stops being a UT-level black box. - `SpimiV2V4QueryLatencyBenchmark`: writes the same data through both formats, runs 11 query iterations per format with V2/V4 alternation per iteration (defeats searcher-cache warm-up bias), drops 2 warmups, reports min/p25/median/p75/max distribution. Measured V4/V2 ratio is ~0.99-1.03 across high-freq and low-freq posting lists — V4 reader is at parity with CLucene at this scale. A standalone regression-level latency suite (`test_storage_format_v4_query_latency.groovy`) is included for cluster-wide timing — UT-level numbers exclude planner / executor / network cost, which only the cluster path can measure honestly. None — test-only additions. - Test: - Unit Test: 5 new BE UTs (3 idx-size + 2 query-latency), all passing locally. The query benchmark also validates V2/V4 cardinality parity per query as a correctness gate before reporting latency. - Regression test: 1 new suite (`test_storage_format_v4_query_latency`); does not need cloud cluster. - Behavior changed: No (test-only). - Does this need documentation: No.

Three cleanups bundled to recover the branch after a master rebase: 1. **Remove SPIMI shadow-mode code path.** Per the V4-pure design directive, V4 must not co-exist with CLucene at write time. The shadow-mode toggle `inverted_index_fulltext_spimi_shadow` was a dev-only validation aid that tee'd SPIMI onto V1/V2/V3 writes for byte-equality comparison vs CLucene. With V4 shipping as pure SPIMI, this is dead weight. Removed: - 10 `FullTextSpimiShadow*` / `FullTextSpimiByteEqual_*_DIAG` tests - The `inverted_index_fulltext_spimi_shadow` config switch - `release_spimi_shadow_for_array_path()` and shadow buffer plumbing from `InvertedIndexColumnWriter` (kept the V4 array rejection that's still useful) - `run_spimi_memory_workload` refactored from "shadow on/off" to "V2/V4 format-switch" — same intent (V2 vs V4 memory compare) but uses the production path instead of the shadow tee 2. **Adapt to master API changes.** Master refactored two surfaces the SPIMI branch touched: - `FullTextIndexReader::query(...)` and `try_query(...)` now take `const Field&` instead of `const void*`. Updated `SpimiFulltextIndexReader::try_query` to match the new signature. Updated all test call sites that built a `StringRef` and passed `&ref` to construct a typed `Field::create_field<TYPE_STRING>(str)` and pass it by reference. - `IndexColumnWriter::create(...)` now accepts `const TabletColumn*` directly; the `StorageField` / `StorageFieldFactory` wrapper layer is gone. Updated 17 test call sites accordingly. 3. **clang-format.** Re-ran formatter on all changed C++ files. No semantic changes — indent normalisation in the writer's `add_values` slice path is the main visible diff. ### Release note None — internal cleanup + master-rebase adaptation. ### Check List (For Author) - Test: - Unit Test: 163 SPIMI UTs pass after the refactor. - Behavior changed: No — removed config switch was a dev-only validation toggle that defaulted off; production users never set it. - Does this need documentation: No.

hello-stephen · 2026-05-26T03:24:17Z

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	78.06% (1854/2375)
Line Coverage	64.50% (33323/51665)
Region Coverage	65.23% (16530/25343)
Branch Coverage	55.72% (8834/15854)

airborne12 · 2026-05-26T03:37:07Z

run buildall

hello-stephen · 2026-05-26T04:58:49Z

FE UT Coverage Report

Increment line coverage 0.00% (0/8) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-05-26T06:55:05Z

FE Regression Coverage Report

Increment line coverage 1.69% (2/118) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-05-26T07:06:22Z

FE Regression Coverage Report

Increment line coverage 1.69% (2/118) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-05-26T07:19:46Z

FE Regression Coverage Report

Increment line coverage 1.69% (2/118) 🎉
Increment coverage report
Complete coverage report

airborne12 requested review from gavinchou, liaoxin01 and yiguolei as code owners May 25, 2026 14:25

gavinchou previously approved these changes May 25, 2026

View reviewed changes

airborne12 dismissed gavinchou’s stale review via d47cb9d May 25, 2026 15:49

airborne12 force-pushed the inverted-index-spimi branch 2 times, most recently from d47cb9d to ab92f46 Compare May 26, 2026 01:16

airborne12 mentioned this pull request May 26, 2026

[improvement](clucene) Report fulltext writer RAM usage apache/doris-thirdparty#393

Open

4 tasks

airborne12 force-pushed the inverted-index-spimi branch from ab92f46 to 9281771 Compare May 26, 2026 02:33

airborne12 mentioned this pull request May 26, 2026

[improvement](be) Track and cap inverted index writer memory #63651

Open

5 tasks

airborne12 added 5 commits May 26, 2026 10:45

airborne12 force-pushed the inverted-index-spimi branch from 9281771 to 0f96cfa Compare May 26, 2026 03:22

airborne12 changed the title ~~[feature](be)(fe) Introduce SPIMI V4 inverted index storage format~~ [feature](inverted index) Introduce SPIMI V4 inverted index storage format May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature](inverted index) Introduce SPIMI V4 inverted index storage format#63633

[feature](inverted index) Introduce SPIMI V4 inverted index storage format#63633
airborne12 wants to merge 5 commits into
apache:masterfrom
airborne12:inverted-index-spimi

airborne12 commented May 25, 2026 •

edited

Loading

Uh oh!

hello-stephen commented May 25, 2026

Uh oh!

gavinchou left a comment

Uh oh!

airborne12 commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

Uh oh!

airborne12 commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

airborne12 commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Why

Measured impact (SPIMI_BENCH=1, ~614 K occurrences/segment)

What's in this PR

What's NOT in this PR (known gaps)

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

hello-stephen commented May 25, 2026

Uh oh!

gavinchou left a comment

Choose a reason for hiding this comment

Uh oh!

airborne12 commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

Cloud UT Coverage Report

Uh oh!

airborne12 commented May 26, 2026

Uh oh!

hello-stephen commented May 26, 2026

FE UT Coverage Report

Uh oh!

hello-stephen commented May 26, 2026

FE Regression Coverage Report

Uh oh!

hello-stephen commented May 26, 2026

FE Regression Coverage Report

Uh oh!

hello-stephen commented May 26, 2026

FE Regression Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

airborne12 commented May 25, 2026 •

edited

Loading