Skip to content

issue #599: add readerSupplierFactory to PQRetrainer / OnDiskGraphIndexCompactor#14

Merged
eolivelli merged 1 commit into
mainfrom
issue-599-pq-bulk-reader-supplier
May 19, 2026
Merged

issue #599: add readerSupplierFactory to PQRetrainer / OnDiskGraphIndexCompactor#14
eolivelli merged 1 commit into
mainfrom
issue-599-pq-bulk-reader-supplier

Conversation

@eolivelli
Copy link
Copy Markdown
Owner

Summary

  • OnDiskGraphIndex: add getView(ReaderSupplier) overload so a View can be opened against a caller-supplied reader rather than the default block-cache-backed one
  • PQRetrainer: new 4-arg constructor accepting an optional Function<OnDiskGraphIndex, ReaderSupplier> readerSupplierFactory; when non-null, extractVectorsSequential opens each source View via getView(supplier) and closes the supplier after all vectors for that source are extracted — eliminating per-node block-cache round-trips (issue add post-test diagnostics capture for GHA datastax/jvector#599 Option B)
  • OnDiskGraphIndexCompactor: new 6-arg constructor forwarding the factory to PQRetrainer; existing 5-arg constructor passes null (unchanged default behaviour)
  • TestPQRetrainerCustomReader: unit tests covering getView(supplier) correctness, factory invocation count, and backward-compatibility of the old constructor

Tests

  • TestPQRetrainerCustomReader#getViewWithCustomSupplierReadsVectorsCorrectly: verifies the new getView(ReaderSupplier) returns the same vectors as the default path
  • TestPQRetrainerCustomReader#pqRetrainerInvokesCustomFactoryOncePerSource: verifies the factory is called exactly once per source segment and the resulting PQ codebook is non-null
  • TestPQRetrainerCustomReader#defaultConstructorUsesNullFactory: verifies the 3-arg constructor still works without a factory (backward-compat)

🤖 Implemented by the pr-worker agent.

…skGraphIndexCompactor

Allow callers to inject a Function<OnDiskGraphIndex,ReaderSupplier> into
PQRetrainer and OnDiskGraphIndexCompactor so that PQ retraining vector
extraction can use a pre-downloaded or locally-buffered copy of each
source graph file instead of per-node block-cache round-trips.

Changes:
- OnDiskGraphIndex: add getView(ReaderSupplier) overload so a View can
  be opened against a caller-supplied reader rather than the default one
- PQRetrainer: new 4-arg constructor accepting an optional
  readerSupplierFactory; when non-null, extractVectorsSequential opens
  each source View via getView(supplier) and closes the supplier after
  all vectors for that source are extracted
- OnDiskGraphIndexCompactor: new 6-arg constructor forwarding the
  factory to PQRetrainer; existing 5-arg constructor passes null (no
  change to default behaviour)
- TestPQRetrainerCustomReader: unit tests covering getView(supplier),
  factory invocation count, and backward-compatibility of the old constructor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eolivelli eolivelli merged commit cae0118 into main May 19, 2026
2 of 10 checks passed
eolivelli added a commit to eolivelli/herddb that referenced this pull request May 19, 2026
…le download (#601)

Fixes #599.

## Changes

- **`DeleteOnCloseReaderSupplier`** (new): wraps a `ReaderSupplier` and
deletes the backing temp file on `close()`; ensures temp segment files
downloaded for PQ retraining are cleaned up after use
- **`SegmentPQReaderSupplier`** (new): builds the
`Function<OnDiskGraphIndex,ReaderSupplier>` factory injected into
`OnDiskGraphIndexCompactor`; for each source segment, either downloads
the graph file directly from object storage (when
`DataStorageManager.supportsDirectMultipartDownload()` is `true`) or
opens a sequential reader via
`DataStorageManager.multipartIndexReaderSupplier()`; temp files live in
`store.tmpDirectory()` (= IS data directory)
- **`VectorIndexCompactor`**: wire
`SegmentPQReaderSupplier.forSegments()` into `rebuildSegmentStreaming()`
via the new 6-arg `OnDiskGraphIndexCompactor` constructor (introduced in
eolivelli/jvector#14); add `PQ_BULK_READER_COUNT` `AtomicInteger`
counter for test observability
- **`PersistentVectorStore`**: expose `segmentStorageKey()`,
`tableSpaceUUID()`, and `dataStorageManager()` as package-private
accessors needed by `SegmentPQReaderSupplier`

## Tests

-
**`VectorIndexStreamingCompactionTest#streamingCompactionUsesBulkPQReaderSupplier`**:
creates a store with `MemoryDataStorageManager` (dim=16, 300 vectors × 5
checkpoints so FusedPQ is enabled), calls `store.runCompactionCycle()`,
and asserts that `PQ_BULK_READER_COUNT` increased by ≥ 2 (one per source
segment) and that a search on the resulting index returns non-empty
results

## Depends on

eolivelli/jvector#14 — the 6-arg `OnDiskGraphIndexCompactor` constructor
and `OnDiskGraphIndex.getView(ReaderSupplier)` overload

🤖 Implemented by the `pr-worker` agent.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant