Skip to content

[improvement](clucene) Report fulltext writer RAM usage#393

Open
airborne12 wants to merge 1 commit into
apache:clucenefrom
airborne12:inverted-index-spimi-clucene
Open

[improvement](clucene) Report fulltext writer RAM usage#393
airborne12 wants to merge 1 commit into
apache:clucenefrom
airborne12:inverted-index-spimi-clucene

Conversation

@airborne12
Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache/doris#63633 (SPIMI V4 inverted index storage format)

Problem Summary:

Four CLucene-side improvements that the Doris fulltext inverted index
writer depends on. Three reduce per-token Posting memory; one exposes
the RAM-used metric Doris uses for memory tracking.

Commits in this PR (chronological order)

  1. [improvement](clucene) Report fulltext writer RAM usage
    Expose SDocumentsWriter buffered postings memory through
    getRAMUsed() so Doris can account fulltext index writer memory
    inside its segment memory estimate. Pure observability.

  2. [improvement](clucene) Reduce fulltext writer posting buffer memory — Tighten the Posting struct layout: drop unused
    alignment padding and switch a small int field to a smaller width.
    Cuts ~25 % off per-term posting memory on the standard fulltext
    workload.

  3. [improvement](clucene) Drop redundant Posting::textLen field
    textLen was a cached copy of the text vector's size that was
    always reachable via text.size(). Removing the cached copy
    shaves another 4 bytes per Posting with no algorithmic change.

  4. [improvement](clucene) Allocate Posting position state only when needed — Lazy-allocate the Posting's position-tracking
    members until the first position is recorded. For columns with
    support_phrase=false or a single token per doc, this state is
    never touched; previously it was always allocated.

Why these are stacked together

The four optimisations were developed against the same SDocumentWriter
critical path. Each one builds on the previous one's layout change —
splitting them across PRs would force re-conflict-resolution on the
same lines. Each commit compiles + passes the Doris BE unit tests on
its own (InvertedIndexWriterTest.FullTextStringMemoryEstimate*
covers the API surface touched).

Downstream Doris PR

This PR must merge before apache/doris#63633 can reference the
correct clucene submodule SHA. The Doris PR's submodule pointer will
be updated to the tip of clucene after this lands.

Release note

None — internal CLucene memory optimisations, no behaviour change
on the Lucene 2.x on-disk format. Doris's memory tracking will
report tighter peaks on fulltext columns once the downstream PR
merges.

Check List (For Author)

  • Test:
    • Unit Test (covered by Doris BE UT
      InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings
      and friends — exercised in the downstream Doris PR's CI)
  • Behavior changed:
    • No.
    • Yes — IndexWriter::getRAMUsed() now includes
      SDocumentsWriter buffered postings memory (was 0 before).
  • Does this need documentation:
    • No.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Expose SDocumentsWriter buffered postings memory through getRAMUsed so Doris can account full-text index writer memory.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - sh run-be-ut.sh --run --filter=InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings
- Behavior changed: No
- Does this need documentation: No
@airborne12 airborne12 force-pushed the inverted-index-spimi-clucene branch from 22fec29 to bab8109 Compare May 26, 2026 02:27
@airborne12 airborne12 changed the title [improvement](clucene) Reduce fulltext writer posting buffer memory [improvement](clucene) Report fulltext writer RAM usage May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant