Dataset scrubbing in DataSetUtils invalidates ground truth and can inject null ground truth ordinals #649
Description
DataSetUtils.getScrubbedDataSet(...) currently mutates benchmark datasets at load time by removing zero base vectors, removing duplicate base vectors, and removing query vectors that are present in the base set. It then attempts to remap the original ground-truth ordinals onto the scrubbed base set using rawToScrubbed.get(j).
When a ground-truth ordinal points to a base vector that was removed during scrubbing, rawToScrubbed.get(j) returns null, and that null is inserted into the in-memory ground truth.
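A minimal sketch of this failure mode (toy vectors and method names are hypothetical, not the actual DataSetUtils code): the scrubbed ordinal map simply has no entry for removed base vectors, so Map.get returns null and that null flows into the in-memory ground truth.

```java
import java.util.*;

public class ScrubRemapDemo {
    // Hypothetical miniature base set: ordinal 2 is a zero vector,
    // so it is dropped during scrubbing and never enters rawToScrubbed.
    static final float[][] BASE = { {1, 0}, {0, 1}, {0, 0}, {1, 1} };

    static Map<Integer, Integer> buildRawToScrubbed(float[][] base) {
        Map<Integer, Integer> rawToScrubbed = new HashMap<>();
        int next = 0;
        for (int j = 0; j < base.length; j++) {
            boolean isZero = true;
            for (float v : base[j]) if (v != 0) { isZero = false; break; }
            if (!isZero) rawToScrubbed.put(j, next++); // removed ordinals get no entry
        }
        return rawToScrubbed;
    }

    // Remap original ground-truth ordinals onto the scrubbed base set.
    static List<Integer> remap(int[] groundTruth) {
        Map<Integer, Integer> rawToScrubbed = buildRawToScrubbed(BASE);
        List<Integer> remapped = new ArrayList<>();
        for (int j : groundTruth) {
            remapped.add(rawToScrubbed.get(j)); // null when ordinal j was scrubbed
        }
        return remapped;
    }

    public static void main(String[] args) {
        // Original GT referenced ordinal 2, which was removed during scrubbing.
        System.out.println(remap(new int[]{0, 2, 3})); // [0, null, 2]
    }
}
```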
This later fails with the new exception planned for AccuracyMetrics, for example:
Null ground truth ordinal in top-10 at index 3
More importantly, this is not just a null-handling bug. It is a benchmark-correctness bug. Once the base set is changed, the original ground truth is no longer valid for the modified dataset. For ANN datasets, the base vectors, query vectors, and ground truth form a coupled benchmark definition. If we remove or deduplicate base vectors, then the correct top-k neighbors may change.
If the supplied ground truth only goes to top-100 and we need recall@100, there is no safe way to recover correctness from the original file after such scrubbing.
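To make the correctness point concrete, here is a self-contained illustration (toy vectors and dot-product similarity; not the project's code) where deduplicating the base set changes the true top-k: the correct second neighbor of the scrubbed set corresponds to an original ordinal that the original ground truth never contained, so no amount of remapping or null-filtering can recover it.

```java
import java.util.*;

public class DedupChangesNeighbors {
    // Exact top-k by descending dot product (stand-in for the dataset's metric).
    static List<Integer> topK(float[][] base, float[] query, int k) {
        Integer[] ids = new Integer[base.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, (a, b) -> Float.compare(dot(base[b], query), dot(base[a], query)));
        return Arrays.asList(ids).subList(0, k);
    }

    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        float[] query = {1, 0};
        // Ordinal 1 duplicates ordinal 0; original top-2 GT is [0, 1].
        float[][] base = { {1, 0}, {1, 0}, {0.9f, 0} };
        System.out.println(topK(base, query, 2));     // [0, 1]
        // After dedup, the correct top-2 is scrubbed [0, 1], i.e. original
        // ordinals [0, 2] -- but ordinal 2 was never in the original top-2 GT.
        float[][] scrubbed = { {1, 0}, {0.9f, 0} };
        System.out.println(topK(scrubbed, query, 2)); // [0, 1]
    }
}
```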
When this occurs
- Dot-product datasets where zero vectors are removed
- Any dataset where duplicate base vectors are removed
- Any benchmark that uses the original ground truth after getScrubbedDataSet(...)
- Especially problematic when evaluating recall/MAP at k equal to the full available GT depth, since there is no spare GT to fill gaps created by removed neighbors
Actual behavior
- Benchmark loader silently changes the dataset
- Original GT is remapped against the changed base
- Removed GT ordinals become null
- Evaluation later fails, or worse, could become semantically incorrect even if nulls were filtered out
Expected behavior
- Benchmark datasets should not be altered at evaluation time in a way that invalidates the supplied ground truth
- If preprocessing is desired, it should happen offline as an explicit dataset-preparation step
- Any preprocessed dataset must have its own matching ground truth recomputed for that final dataset
Proposed fix
- Remove this benchmark-time scrubbing/remapping logic from DataSetUtils.getScrubbedDataSet(...)
- Treat benchmark datasets as fixed inputs during evaluation
- Move any zero-vector removal / deduplication / query filtering into explicit offline preprocessing tools
- Require preprocessed datasets to ship with matching recomputed ground truth
- Fail fast if a benchmark configuration attempts to evaluate against mismatched or transformed data rather than trying to remap original GT
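The fail-fast check could be as simple as validating, before evaluation starts, that every ground-truth ordinal is non-null and in range for the base set it is paired with. A sketch, with hypothetical names:

```java
import java.util.Arrays;
import java.util.List;

public class GroundTruthValidator {
    // Hypothetical fail-fast check: reject evaluation when the ground truth
    // cannot possibly refer to the base set it is paired with.
    static void validate(List<List<Integer>> groundTruth, int baseSize) {
        for (int q = 0; q < groundTruth.size(); q++) {
            List<Integer> row = groundTruth.get(q);
            for (int i = 0; i < row.size(); i++) {
                Integer ordinal = row.get(i);
                if (ordinal == null) {
                    throw new IllegalStateException(
                        "Null ground truth ordinal for query " + q + " at index " + i);
                }
                if (ordinal < 0 || ordinal >= baseSize) {
                    throw new IllegalStateException(
                        "Ground truth ordinal " + ordinal + " out of range for base size " + baseSize);
                }
            }
        }
    }

    public static void main(String[] args) {
        validate(List.of(List.of(0, 1, 2)), 3); // passes silently
        try {
            // A null injected by scrubbing-time remapping is caught up front.
            validate(List.of(Arrays.asList(0, null, 2)), 3);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Null ground truth ordinal for query 0 at index 1
        }
    }
}
```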
Rationale
This keeps benchmark semantics correct and avoids hidden data mutation inside the benchmark harness. It also aligns with standard ANN benchmarking practice: if the dataset changes, the ground truth must be regenerated for the changed dataset.