Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE#610
Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE#610michaeljmarshall wants to merge 3 commits intodatastax:mainfrom
Conversation
Adding this as its own commit since the canonical implementation is shown as 1-based and I want to make it clear how I've modified it.
There was a problem hiding this comment.
Pull request overview
This PR optimizes the training vector extraction process in ProductQuantization by implementing Floyd's random sampling algorithm to ensure at most MAX_PQ_TRAINING_SET_SIZE vectors are selected. This replaces the previous approach of filtering all vectors with a random probability check.
Changes:
- Replaced probabilistic filtering with deterministic sampling using Floyd's algorithm
- Added special handling for cases where total vectors ≤ MAX_PQ_TRAINING_SET_SIZE
- Changed from ThreadLocalRandom to SplittableRandom with a fixed seed for reproducibility
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } else { | ||
| // Uses Floyd’s sampling algorithm to select MAX_PQ_TRAINING_SET_SIZE random ordinals from 0 to ravv.size() | ||
| // while only iterating MAX_PQ_TRAINING_SET_SIZE times. | ||
| SplittableRandom rng = new SplittableRandom(1); |
There was a problem hiding this comment.
Using a hardcoded seed (1) removes randomness from the sampling process, making training deterministic. This could lead to consistently selecting the same subset of vectors across runs, potentially missing diverse data patterns. Consider either removing the seed for true randomness or making it configurable.
| SplittableRandom rng = new SplittableRandom(1); | |
| SplittableRandom rng = new SplittableRandom(); |
There was a problem hiding this comment.
@tlwillke - do we want deterministic randomness here? Given that vectors are already randomly ordered, it seems somewhat beneficial to remove the extra layer of randomness so that graph construction is reproducible for a given size of ravv and order of vectors. Let me know.
Fixes #590
Uses floyd's random sampling algorithm to select random training vectors from the
RandomAccessVectorValues. The solution has two phases. The first is to selectMAX_PQ_TRAINING_SET_SIZErandom ordinals. Then, it maps those ordinals to vectors. Here is a reference to the algorithm: https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin.The algorithm is essentially constant time, which is an improvement on what we currently had. We will now only generate
MAX_PQ_TRAINING_SET_SIZErandom numbers instead ofravv.size()random numbers. The slight increase cost is checking a hash set for containment.This change also handles the boundary case where the vector values object has at most
MAX_PQ_TRAINING_SET_SIZE.