[improvement](clucene) Report fulltext writer RAM usage #393
Open
airborne12 wants to merge 1 commit into
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Expose SDocumentsWriter buffered postings memory through getRAMUsed so Doris can account full-text index writer memory.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- sh run-be-ut.sh --run --filter=InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache/doris#63633 (SPIMI V4 inverted index storage format)
Problem Summary:
Four CLucene-side improvements that the Doris fulltext inverted index
writer depends on. Three reduce per-token Posting memory; one exposes
the RAM-used metric Doris uses for memory tracking.
Commits in this PR (chronological order)
- [improvement](clucene) Report fulltext writer RAM usage: Expose `SDocumentsWriter` buffered postings memory through `getRAMUsed()` so Doris can account fulltext index writer memory inside its segment memory estimate. Pure observability.
- [improvement](clucene) Reduce fulltext writer posting buffer memory: Tighten the `Posting` struct layout by dropping unused alignment padding and switching a small int field to a smaller width. Cuts ~25% off per-term posting memory on the standard fulltext workload.
- [improvement](clucene) Drop redundant Posting::textLen field: `textLen` was a cached copy of the text vector's size that was always reachable via `text.size()`. Removing the cached copy shaves another 4 bytes per `Posting` with no algorithmic change.
- [improvement](clucene) Allocate Posting position state only when needed: Defer allocating the `Posting`'s position-tracking members until the first position is recorded. For columns with `support_phrase=false` or a single token per doc, this state is never touched; previously it was always allocated.
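
The three `Posting` changes listed above can be illustrated with a minimal sketch. Everything here is an assumption made for the example: `PostingBefore`, `PostingAfter`, `PositionState`, and all fields other than `text` and `textLen` are hypothetical names, and the real CLucene `Posting` layout (which this description does not show) will differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical "before" layout: the cached length duplicates text.size(),
// the bool is followed by padding, one counter is wider than it needs to be,
// and the position state is stored inline even when it is never used.
struct PostingBefore {
    std::vector<char> text;   // term bytes
    int32_t textLen;          // redundant cached copy of text.size()
    bool    hasPositions;     // 1 byte, then alignment padding
    int64_t docFreq;          // assumed to fit in 32 bits in this sketch
    int32_t lastPosition;     // position state, always present
    int32_t positionCount;
};

// Position-tracking members split out so they can be allocated on demand.
struct PositionState {
    int32_t lastPosition = 0;
    int32_t positionCount = 0;
};

// Hypothetical "after" layout: no cached length, narrower counter,
// and position state behind a pointer that stays null until needed.
struct PostingAfter {
    std::vector<char> text;
    std::unique_ptr<PositionState> positions;  // null until first position
    int32_t docFreq = 0;

    // The length is derived from the vector instead of being stored.
    std::size_t textLen() const { return text.size(); }

    // Allocate the position state the first time a position is recorded.
    void recordPosition(int32_t pos) {
        if (!positions) {
            positions.reset(new PositionState());
        }
        positions->lastPosition = pos;
        ++positions->positionCount;
    }
};
```

The trade-off in this sketch is that a posting which does record positions pays for a separate heap allocation, so deferring only wins when many postings never record one, as in the `support_phrase=false` case described above.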
Why these are stacked together
The four optimisations were developed against the same SDocumentWriter
critical path. Each one builds on the previous one's layout change, so
splitting them across PRs would force repeated conflict resolution on the
same lines. Each commit compiles and passes the Doris BE unit tests on
its own (`InvertedIndexWriterTest.FullTextStringMemoryEstimate*` covers
the API surface touched).
Downstream Doris PR
This PR must merge before apache/doris#63633 can reference the
correct clucene submodule SHA. The Doris PR's submodule pointer will
be updated to the tip of
`clucene` after this lands.
### Release note
None — internal CLucene memory optimisations, no behaviour change
on the Lucene 2.x on-disk format. Doris's memory tracking will
report tighter peaks on fulltext columns once the downstream PR
merges.
### Check List (For Author)
- `InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings` and friends (exercised in the downstream Doris PR's CI)
- `IndexWriter::getRAMUsed()` now includes `SDocumentsWriter` buffered postings memory (was 0 before).
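
To make the accounting change concrete, here is a rough sketch of a writer that forwards its buffered-postings size through `getRAMUsed()`. `ToyDocumentsWriter`, `ToyIndexWriter`, `estimateSegmentMemory`, and the per-token byte estimate are all hypothetical stand-ins; they are not the CLucene or Doris implementations.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a documents writer that buffers postings in memory.
class ToyDocumentsWriter {
public:
    void addToken(const std::string& term) {
        buffered_.push_back(term);
        // Crude per-token estimate, chosen only for illustration.
        bytesUsed_ += static_cast<int64_t>(sizeof(std::string) + term.size());
    }

    // Flushing releases the buffered postings, so the counter resets.
    void flush() {
        buffered_.clear();
        bytesUsed_ = 0;
    }

    int64_t ramUsed() const { return bytesUsed_; }

private:
    std::vector<std::string> buffered_;
    int64_t bytesUsed_ = 0;
};

// Toy index writer: getRAMUsed() forwards the buffered-postings figure
// rather than reporting 0.
class ToyIndexWriter {
public:
    void addToken(const std::string& term) { docs_.addToken(term); }
    void flush() { docs_.flush(); }
    int64_t getRAMUsed() const { return docs_.ramUsed(); }

private:
    ToyDocumentsWriter docs_;
};

// Hypothetical caller-side estimate: other segment memory plus whatever
// the index writer currently holds in its buffers.
int64_t estimateSegmentMemory(int64_t otherBytes, const ToyIndexWriter& writer) {
    return otherBytes + writer.getRAMUsed();
}
```

In this sketch the estimate grows as tokens are buffered and drops back after `flush()`, which is the behaviour a memory tracker needs in order to see the peak rather than a constant zero.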