issue #599: add readerSupplierFactory to PQRetrainer / OnDiskGraphIndexCompactor#14
Merged
Merged
Conversation
…skGraphIndexCompactor Allow callers to inject a Function<OnDiskGraphIndex,ReaderSupplier> into PQRetrainer and OnDiskGraphIndexCompactor so that PQ retraining vector extraction can use a pre-downloaded or locally-buffered copy of each source graph file instead of per-node block-cache round-trips. Changes: - OnDiskGraphIndex: add getView(ReaderSupplier) overload so a View can be opened against a caller-supplied reader rather than the default one - PQRetrainer: new 4-arg constructor accepting an optional readerSupplierFactory; when non-null, extractVectorsSequential opens each source View via getView(supplier) and closes the supplier after all vectors for that source are extracted - OnDiskGraphIndexCompactor: new 6-arg constructor forwarding the factory to PQRetrainer; existing 5-arg constructor passes null (no change to default behaviour) - TestPQRetrainerCustomReader: unit tests covering getView(supplier), factory invocation count, and backward-compatibility of the old constructor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eolivelli
added a commit
to eolivelli/herddb
that referenced
this pull request
May 19, 2026
…le download (#601) Fixes #599. ## Changes - **`DeleteOnCloseReaderSupplier`** (new): wraps a `ReaderSupplier` and deletes the backing temp file on `close()`; ensures temp segment files downloaded for PQ retraining are cleaned up after use - **`SegmentPQReaderSupplier`** (new): builds the `Function<OnDiskGraphIndex,ReaderSupplier>` factory injected into `OnDiskGraphIndexCompactor`; for each source segment, either downloads the graph file directly from object storage (when `DataStorageManager.supportsDirectMultipartDownload()` is `true`) or opens a sequential reader via `DataStorageManager.multipartIndexReaderSupplier()`; temp files live in `store.tmpDirectory()` (= IS data directory) - **`VectorIndexCompactor`**: wire `SegmentPQReaderSupplier.forSegments()` into `rebuildSegmentStreaming()` via the new 6-arg `OnDiskGraphIndexCompactor` constructor (introduced in eolivelli/jvector#14); add `PQ_BULK_READER_COUNT` `AtomicInteger` counter for test observability - **`PersistentVectorStore`**: expose `segmentStorageKey()`, `tableSpaceUUID()`, and `dataStorageManager()` as package-private accessors needed by `SegmentPQReaderSupplier` ## Tests - **`VectorIndexStreamingCompactionTest#streamingCompactionUsesBulkPQReaderSupplier`**: creates a store with `MemoryDataStorageManager` (dim=16, 300 vectors × 5 checkpoints so FusedPQ is enabled), calls `store.runCompactionCycle()`, and asserts that `PQ_BULK_READER_COUNT` increased by ≥ 2 (one per source segment) and that a search on the resulting index returns non-empty results ## Depends on eolivelli/jvector#14 — the 6-arg `OnDiskGraphIndexCompactor` constructor and `OnDiskGraphIndex.getView(ReaderSupplier)` overload 🤖 Implemented by the `pr-worker` agent. --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
OnDiskGraphIndex: addgetView(ReaderSupplier)overload so aViewcan be opened against a caller-supplied reader rather than the default block-cache-backed onePQRetrainer: new 4-arg constructor accepting an optionalFunction<OnDiskGraphIndex, ReaderSupplier> readerSupplierFactory; when non-null,extractVectorsSequentialopens each sourceViewviagetView(supplier)and closes the supplier after all vectors for that source are extracted — eliminating per-node block-cache round-trips (issue add post-test diagnostics capture for GHA datastax/jvector#599 Option B)OnDiskGraphIndexCompactor: new 6-arg constructor forwarding the factory toPQRetrainer; existing 5-arg constructor passesnull(unchanged default behaviour)TestPQRetrainerCustomReader: unit tests coveringgetView(supplier)correctness, factory invocation count, and backward-compatibility of the old constructorTests
TestPQRetrainerCustomReader#getViewWithCustomSupplierReadsVectorsCorrectly: verifies the newgetView(ReaderSupplier)returns the same vectors as the default pathTestPQRetrainerCustomReader#pqRetrainerInvokesCustomFactoryOncePerSource: verifies the factory is called exactly once per source segment and the resulting PQ codebook is non-nullTestPQRetrainerCustomReader#defaultConstructorUsesNullFactory: verifies the 3-arg constructor still works without a factory (backward-compat)🤖 Implemented by the
pr-workeragent.