Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE by michaeljmarshall · Pull Request #610 · datastax/jvector

michaeljmarshall · 2026-02-04T05:43:49Z

Fixes #590

Uses Floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases: first it selects MAX_PQ_TRAINING_SET_SIZE distinct random ordinals, then it maps those ordinals to vectors. Here is a reference to the algorithm: https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin.
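For readers unfamiliar with the algorithm, the first phase can be sketched roughly as follows. This is a minimal 0-based illustration, not the code from this PR: the class name `FloydSampling` and the method `sampleOrdinals` are hypothetical, and `n` and `k` stand in for ravv.size() and MAX_PQ_TRAINING_SET_SIZE.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SplittableRandom;

// Hypothetical helper for illustration only; not the implementation in this PR.
final class FloydSampling {
    /** Returns k distinct ordinals drawn uniformly from [0, n), using exactly k random draws. */
    static Set<Integer> sampleOrdinals(int n, int k, SplittableRandom rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= k <= n");
        }
        Set<Integer> selected = new HashSet<>(k * 2);
        // 0-based form of Floyd's loop: j runs over the last k ordinals of [0, n).
        for (int j = n - k; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            // If t was already chosen, take j instead; j cannot be in the set yet
            // because every earlier pick is at most j - 1.
            if (!selected.add(t)) {
                selected.add(j);
            }
        }
        return selected;
    }
}
```

The loop body runs k times regardless of n, and the only extra work per iteration is the hash set membership check.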

The algorithm's running time depends only on MAX_PQ_TRAINING_SET_SIZE, not on ravv.size(), so it is essentially constant time with respect to the number of vectors, which is an improvement on what we currently had. We will now generate only MAX_PQ_TRAINING_SET_SIZE random numbers instead of ravv.size() random numbers. The slight added cost is checking a hash set for containment.

This change also handles the boundary case where the vector values object has at most MAX_PQ_TRAINING_SET_SIZE vectors.
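A rough sketch of how the boundary case and the two phases might fit together is shown below. It makes several assumptions: a plain `List<float[]>` stands in for RandomAccessVectorValues, the `maxSize` parameter stands in for MAX_PQ_TRAINING_SET_SIZE, and the class and method names are hypothetical rather than taken from ProductQuantization.java.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SplittableRandom;

// Illustrative stand-in only; names and types here are assumptions, not jvector's API.
final class TrainingVectorSketch {
    static List<float[]> extract(List<float[]> vectors, int maxSize, SplittableRandom rng) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must be non-negative");
        }
        int n = vectors.size();
        if (n <= maxSize) {
            // Boundary case: there is nothing to sample, so return every vector.
            return new ArrayList<>(vectors);
        }
        // Phase one: choose maxSize distinct ordinals in [0, n) with Floyd's algorithm.
        Set<Integer> ordinals = new HashSet<>(maxSize * 2);
        for (int j = n - maxSize; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            if (!ordinals.add(t)) {
                ordinals.add(j);
            }
        }
        // Phase two: map the chosen ordinals to their vectors.
        List<float[]> result = new ArrayList<>(maxSize);
        for (int ordinal : ordinals) {
            result.add(vectors.get(ordinal));
        }
        return result;
    }
}
```

In this sketch the returned list never holds more than `maxSize` entries, which is the property named in the PR title.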

Copilot

Pull request overview

This PR optimizes the training vector extraction process in ProductQuantization by implementing Floyd's random sampling algorithm to ensure at most MAX_PQ_TRAINING_SET_SIZE vectors are selected. This replaces the previous approach of filtering all vectors with a random probability check.

Changes:

Replaced probabilistic filtering with deterministic sampling using Floyd's algorithm
Added special handling for cases where total vectors ≤ MAX_PQ_TRAINING_SET_SIZE
Changed from ThreadLocalRandom to SplittableRandom with a fixed seed for reproducibility

Copilot · 2026-02-04T05:44:45Z

jvector-base/src/main/java/io/github/jbellis/jvector/quantization/ProductQuantization.java

+        } else {
+            // Uses Floyd’s sampling algorithm to select MAX_PQ_TRAINING_SET_SIZE random ordinals from 0 to ravv.size()
+            // while only iterating MAX_PQ_TRAINING_SET_SIZE times.
+            SplittableRandom rng = new SplittableRandom(1);


Using a hardcoded seed (1) removes randomness from the sampling process, making training deterministic. This could lead to consistently selecting the same subset of vectors across runs, potentially missing diverse data patterns. Consider either removing the seed for true randomness or making it configurable.

Suggested change

SplittableRandom rng = new SplittableRandom(1);

SplittableRandom rng = new SplittableRandom();

@tlwillke - do we want deterministic randomness here? Given that vectors are already randomly ordered, it seems somewhat beneficial to remove the extra layer of randomness so that graph construction is reproducible for a given size of ravv and order of vectors. Let me know.
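To make the trade-off in this thread concrete, here is a small sketch of how a fixed seed behaves. The class name `SeedDemo` and the helper `draws` are hypothetical; only the `SplittableRandom` constructors and `nextInt(int)` come from the JDK.

```java
import java.util.Arrays;
import java.util.SplittableRandom;

// Hypothetical demo class; it only illustrates seeded vs. unseeded SplittableRandom.
final class SeedDemo {
    /** Collects `count` draws in [0, bound) from the given generator. */
    static int[] draws(SplittableRandom rng, int count, int bound) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = rng.nextInt(bound);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two generators built with the same seed produce the same sequence.
        int[] first = draws(new SplittableRandom(1), 5, 1000);
        int[] second = draws(new SplittableRandom(1), 5, 1000);
        System.out.println(Arrays.equals(first, second)); // prints "true"

        // The no-argument constructor picks its own seed, so this sequence
        // is not expected to be reproducible from run to run.
        int[] unseeded = draws(new SplittableRandom(), 5, 1000);
        System.out.println(Arrays.toString(unseeded));
    }
}
```

With a fixed seed, the selected ordinals depend only on the seed and on the bounds passed to the generator, so the same ravv size yields the same sample on every run.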

michaeljmarshall added 3 commits February 3, 2026 23:28

Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAININ…

98acb5b

…G_SET_SIZE

Make the sampling alg. 0-based

517af31

Adding this as its own commit since the canonical implementation is shown as 1-based and I want to make it clear how I've modified it.

Add an assertion

5c2214b

michaeljmarshall self-assigned this Feb 4, 2026

michaeljmarshall requested review from MarkWolters, jshook and tlwillke as code owners February 4, 2026 05:43

michaeljmarshall requested a review from Copilot February 4, 2026 05:44

Copilot AI reviewed Feb 4, 2026

michaeljmarshall wants to merge 3 commits into datastax:main from michaeljmarshall:jvector-590
