[improvement] Control HNSW build chunk memory#62869
Open
yx-keith wants to merge 7 commits into
Open
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Reduce ANN/HNSW index build input-buffer memory spikes by sizing build chunks with a byte budget, removing the initialization-time full chunk reserve, and releasing temporary vector buffers at index finish.
### Release note
None
### Check List (For Author)
- Test: Partial
- Unit Test: Added/updated BE unit tests in ann_index_writer_test.cpp
- Manual test: git diff --check passed for modified files
- Not run: build-support/clang-format.sh could not run because llvm@16 is not installed locally
- Not run: ./run-be-ut.sh --run --filter=AnnIndexWriterTest failed because JAVA_HOME points to JDK 8 and JDK_17 is not set
- Behavior changed: No
- Does this need documentation: No
82aab7e to
0fb16c7
Compare
…order Follow-up to 0fb16c7 ("Control HNSW build chunk memory"). The first commit reduced input-buffer spikes but left a few correctness and memory issues in AnnIndexColumnWriter. What changes: - Introduce VectorIndex::needs_training() and FaissVectorIndex override so HNSW+FLAT reports false (no training required) while IVF and any scalar/product quantizer reports true. - Drive train() from a single _train_once_if_needed() helper guarded by a _trained flag. Previously every full chunk re-invoked train(), which for IVF re-ran k-means and invalidated vectors already added, and for HNSW pointlessly acquired the OpenMP build budget mutex. - Move _reset_chunk_buffer(true) before save() in finish() so the serialization workspace does not coexist with a stale chunk-bytes allocation (peak shaved by up to ann_index_build_chunk_bytes). - Harden _reset_chunk_buffer(false): when chunk_rows or dimension is unset, target * 2 is zero and the previous predicate degenerated to "free on every batch". Now we short-circuit when target == 0. - Drop the dead "if (_chunk_rows == 0)" branch in add_array_values; init() always populates it, replaced by DCHECK. - Log the index id on every ANN-writer log line so operators can attribute OOM and skip events to a specific tablet index. Tests: - MockVectorIndex grows a needs_training() mock; existing IVF/PQ/SQ tests now assert train() is invoked exactly once (not per chunk). - New TestHnswSkipsTrainCall verifies HNSW pushes multiple chunks through add() with zero train() invocations. - New TestComputeChunkRows* cover dim==0, byte budget smaller than a row, and the chunk_size upper bound. - New TestResetChunkBufferKeepsCapacityWhenTargetUnknown regresses the shrink-bug fix by exercising the protected helper directly.
The chunk-bytes changes in 0fb16c7 and f3afb2a9 only bounded the input buffer. The in-memory HNSW graph, the quantized vector store, and the per-thread workspace are still uncontrolled, so concurrent loads / compactions / schema-change builds can stack peak usage and OOM the BE. This commit adds a global admission-control layer pairing with the existing ScopedOmpThreadBudget (which caps CPU): a build now also has to reserve its estimated memory peak before train/add. New module storage/index/ann/ann_build_memory_budget.{h,cpp}: - AnnBuildMemoryBudget singleton with try_reserve(bytes, timeout_ms), release(bytes), reserved_bytes(). Waiters block on a condition variable and are notified on every release. A single oversized build is allowed when nothing else is in flight, so it cannot self-deadlock. - AnnBuildMemoryReservation RAII handle. Move-only; releases bytes on destruction. try_acquire() returns an inactive handle on failure. - estimate_ann_build_memory(): conservative model covering input chunk buffer, quantizer-aware vector store (FLAT/SQ8/SQ4/PQ), HNSW graph (rows * max_degree * 8 * 2.0) or IVF centroid overhead, plus per-thread workspace (omp_threads * 4 MiB). When expected_rows is unknown (the writer does not know segment size at init() time) it falls back to chunk_rows, giving at least a single-chunk reservation. AnnIndexColumnWriter integration: - init() reserves the estimated peak via AnnBuildMemoryReservation. Failure routes through _apply_oom_action: * "skip" -> writer enters _skip_due_to_oom mode, add_array_values becomes a no-op, finish() deletes the index entry so the surrounding segment write still succeeds (warning logged). * "wait" -> try_reserve has already waited the configured timeout; treated as failure (RuntimeError with diagnostics). * "fail" -> same RuntimeError path without waiting. - The reservation is held in a writer member and released by RAII when the writer is destroyed, after save()/delete_index has completed. New mutable configs (default disabled to keep existing deployments unaffected): - ann_index_build_memory_budget_bytes (default 0 = disabled) - ann_index_build_memory_wait_timeout_ms (default 30000) - ann_index_build_on_oom_action (default "wait", validated) Tests (new ann_build_memory_budget_test.cpp): - Budget primitives: reserve/release accounting, timeout when starved, oversized single build allowed, waiter wakes on release, RAII handle releases on destruction, failed acquisition yields inactive handle. - Estimator: grows with rows, dim=0 returns 0, unknown rows fall back to chunk-size, FLAT > SQ8 > SQ4 > PQ store footprint. - Writer integration: skip mode deletes index entry and swallows rows, fail mode returns RuntimeError from init(), disabled budget is a complete no-op (reserved_bytes stays 0).
Follow-up to the ANN build admission control. Two issues:
- on_oom_action="fail" still waited the full ann_index_build_memory_wait_timeout_ms
before failing, contradicting its "fail immediately" contract. init() now passes
timeout 0 for "fail" via _oom_wait_timeout_ms().
- The admission reservation only ever covered one chunk (expected_rows was always 0
at init time), so the HNSW graph / vector store of a full segment were never
accounted and the global budget could not bound real memory. The reservation now
grows incrementally as rows accumulate:
* AnnBuildMemoryBudget::try_reserve gains a caller_held argument so a build can
grow its own reservation without deadlocking against itself; a grow only
blocks on *other* concurrent builds.
* AnnBuildMemoryReservation::grow() tops up the handle.
* The writer tracks _added_rows and calls _ensure_reservation_for_rows() before
each streaming add() (backpressure via on_oom_action on contention) and a
best-effort grow at finish() for the last partial chunk.
Tests: grow sole-build/contended/wakeup paths, fail-action returns without waiting
despite a large timeout, and the reservation grows with accumulated rows.
# Conflicts: # be/src/storage/index/ann/ann_index_writer.h
Contributor
Author
|
run buildall |
Contributor
Author
|
run buildall |
The admission-control header chain leaked <faiss/*> into every translation unit that includes column_writer.h (via ann_index_writer.h -> ann_build_memory_budget.h -> faiss_ann_index.h), breaking the build with "fatal error: 'faiss/Index.h' file not found" because those units do not have the FAISS include path. Decouple the budget module from FAISS: - Add AnnBuildMemoryParams, a FAISS-free POD mirroring the few build fields the estimator needs (index kind, quantizer, dim, max_degree, ivf_nlist, pq_m). - estimate_ann_build_memory() now takes AnnBuildMemoryParams; the budget header no longer includes faiss_ann_index.h. - AnnIndexColumnWriter stores AnnBuildMemoryParams instead of FaissBuildParameter and drops the faiss_ann_index.h include; init() projects FaissBuildParameter via to_memory_params(), so ann_index_writer.h stays FAISS-free. - Update the estimator unit tests to build AnnBuildMemoryParams. No behavior change: the estimate values and admission logic are identical.
Contributor
Author
|
run buildall |
Contributor
TPC-H: Total hot run time: 31449 ms |
Contributor
TPC-DS: Total hot run time: 171557 ms |
…ntrol WriterAdmissionTest.SkipModeDeletesIndexAndSwallowsRows and FailModeReturnsErrorFromInit set a tiny budget (1 byte) expecting the writer to hit skip / fail. But AnnBuildMemoryBudget deliberately lets a single oversized build through when nothing else is in flight (so it cannot deadlock against itself), so with no other reservation the writer was admitted and neither skip nor fail fired. Pre-occupy the budget with try_reserve(1) before init() in both tests, mirroring FailModeDoesNotWaitDespiteLargeTimeout, so the writer is no longer the sole build and admission control correctly rejects it. No production code change.
Contributor
Author
|
run buildall |
Contributor
TPC-H: Total hot run time: 31372 ms |
Contributor
TPC-DS: Total hot run time: 172815 ms |
Contributor
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
Contributor
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Contributor
TPC-H: Total hot run time: 32190 ms |
Contributor
TPC-DS: Total hot run time: 171717 ms |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Building HNSW / ANN (vector) indexes had several memory and correctness issues:
1,000,000 rows × dim. Fordim=512that is ~2GB reserved up front, even for a segment with only a few thousand rows.train()was called once per buffered chunk. For IVF this re-runs k-means and invalidates vectors already added in earlier chunks (a correctness bug), and for HNSW it needlessly acquires the build lock.What this PR does
1. Byte-budgeted chunking (
ann_index_build_chunk_bytes, default 128MB).The effective per-batch row count is
min(ann_index_build_chunk_size, chunk_bytes / row_bytes). The init-time full reserve is removed, and the input buffer is freed beforesave()so the serialization workspace does not overlap a stale chunk allocation.2. Train once per build.
Adds
VectorIndex::needs_training()(HNSW+FLAT → false; IVF / SQ / PQ → true) and drives training through a single_train_once_if_needed()guarded by a_trainedflag, sotrain()runs at most once. Fixes the IVF re-cluster bug and removes pointless lock acquisition for HNSW.3. Global build-memory admission control (new
ann_build_memory_budget.{h,cpp}).A process-wide
AnnBuildMemoryBudgetsingleton pairs with the existingScopedOmpThreadBudget(which caps CPU): a build must reserve its estimated memory peak before train/add and releases it on finish (RAII). The reservation grows incrementally with the real accumulated row count, so the global budget reflects actual memory rather than just a per-chunk floor. On budget exhaustion the behavior is configurable viaann_index_build_on_oom_action:wait— block up toann_index_build_memory_wait_timeout_ms, then fail;skip— delete the index entry and let the segment write succeed (queries fall back to brute force);fail— return immediately without waiting.A single oversized build is admitted when nothing else is in flight, so it cannot deadlock against itself.
New configs (all default to a no-op so existing deployments are unaffected):
ann_index_build_chunk_bytes=134217728ann_index_build_memory_budget_bytes=0(0 disables admission control)ann_index_build_memory_wait_timeout_ms=30000ann_index_build_on_oom_action="wait"Implementation note: the budget module is intentionally FAISS-free. It uses a lightweight
AnnBuildMemoryParamsPOD (mirroring only the fields the estimator needs) instead ofFaissBuildParameter, soann_index_writer.hdoes not leak<faiss/*>into the many non-ANN translation units that includecolumn_writer.h.Release note
Add memory control for ANN/HNSW index building: byte-budgeted input chunks and an optional global build-memory admission control to bound peak memory across concurrent index builds. Disabled by default (
ann_index_build_memory_budget_bytes=0).Check List (For Author)
ann_build_memory_budget_test.cpp(budget reserve/release/grow paths, estimator monotonicity, writer skip/fail/disabled paths) and extendedann_index_writer_test.cpp(train-once for IVF/PQ/SQ, HNSW skips train).Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)