Dataset scrubbing in DataSetUtils invalidates ground truth and can inject null ground truth ordinals #649

@tlwillke

Description

DataSetUtils.getScrubbedDataSet(...) currently mutates benchmark datasets at load time: it removes zero base vectors, removes duplicate base vectors, and removes query vectors that also appear in the base set. It then attempts to remap the original ground-truth ordinals onto the scrubbed base set via rawToScrubbed.get(j).

When a ground-truth ordinal points to a base vector that was removed during scrubbing, rawToScrubbed.get(j) returns null, and that null is inserted into the in-memory ground truth.
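The remapping step presumably looks something like the following sketch (the method shape and class name are illustrative; only the rawToScrubbed.get(j) lookup is taken from the code in question):

```java
import java.util.*;

public class ScrubRemapSketch {
    // Hypothetical sketch of the remapping in getScrubbedDataSet(...):
    // rawToScrubbed maps original base ordinals to their post-scrub positions.
    // Ordinals whose base vectors were removed have no entry, so get(j) returns null.
    public static List<List<Integer>> remapGroundTruth(List<List<Integer>> rawGt,
                                                       Map<Integer, Integer> rawToScrubbed) {
        List<List<Integer>> remapped = new ArrayList<>();
        for (List<Integer> neighbors : rawGt) {
            List<Integer> row = new ArrayList<>();
            for (int j : neighbors) {
                // BUG: inserts null whenever ordinal j was scrubbed away
                row.add(rawToScrubbed.get(j));
            }
            remapped.add(row);
        }
        return remapped;
    }

    public static void main(String[] args) {
        // Base ordinal 2 was removed during scrubbing, so it has no mapping.
        Map<Integer, Integer> rawToScrubbed = Map.of(0, 0, 1, 1, 3, 2);
        List<List<Integer>> gt = List.of(List.of(0, 2, 3));
        System.out.println(remapGroundTruth(gt, rawToScrubbed)); // [[0, null, 2]]
    }
}
```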

This later triggers a new exception planned for AccuracyMetrics. Example:

Null ground truth ordinal in top-10 at index 3

More importantly, this is not just a null-handling bug. It is a benchmark-correctness bug. Once the base set is changed, the original ground truth is no longer valid for the modified dataset. For ANN datasets, the base vectors, query vectors, and ground truth form a coupled benchmark definition. If we remove or deduplicate base vectors, then the correct top-k neighbors may change.

If the supplied ground truth only goes to top-100 and we need recall@100, there is no safe way to recover correctness from the original file after such scrubbing.
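The arithmetic behind this claim (numbers here are just the example from the sentence above, not measured values): once even one of the 100 supplied neighbors is scrubbed away, a perfect index can match at most 99 of them, and there is no GT depth beyond 100 to backfill from.

```java
public class RecallDepthExample {
    // Upper bound on recall@k when the supplied GT depth equals k and some
    // GT neighbors were removed from the base set by scrubbing.
    public static double bestPossibleRecall(int k, int gtDepth, int removedNeighbors) {
        int usableGt = gtDepth - removedNeighbors;
        return (double) usableGt / k;
    }

    public static void main(String[] args) {
        // recall@100 requested, GT only goes to top-100, one neighbor scrubbed:
        System.out.println(bestPossibleRecall(100, 100, 1)); // 0.99 at best
    }
}
```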

When this occurs

  • Dot-product datasets where zero vectors are removed
  • Any dataset where duplicate base vectors are removed
  • Any benchmark that uses the original ground truth after getScrubbedDataSet(...)
  • Especially problematic when evaluating recall/MAP at k equal to the full available GT depth, since there is no spare GT to fill gaps created by removed neighbors

Actual behavior

  • Benchmark loader silently changes the dataset
  • Original GT is remapped against the changed base
  • Removed GT ordinals become null
  • Evaluation later fails, or worse, could become semantically incorrect even if nulls were filtered out

Expected behavior

  • Benchmark datasets should not be altered at evaluation time in a way that invalidates the supplied ground truth
  • If preprocessing is desired, it should happen offline as an explicit dataset-preparation step
  • Any preprocessed dataset must have its own matching ground truth recomputed for that final dataset
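Recomputing matching ground truth for a preprocessed dataset amounts to exact brute-force top-k over the final base set. A minimal sketch, assuming dot-product similarity (class and method names are illustrative, not existing jvector API):

```java
import java.util.*;

public class ExactGroundTruth {
    // Brute-force exact top-k by descending dot product over the FINAL base set,
    // so the ground truth always matches the dataset actually being benchmarked.
    public static int[] topK(float[][] base, float[] query, int k) {
        Integer[] ordinals = new Integer[base.length];
        for (int i = 0; i < base.length; i++) ordinals[i] = i;
        Arrays.sort(ordinals, (a, b) -> Double.compare(dot(base[b], query), dot(base[a], query)));
        int[] result = new int[Math.min(k, base.length)];
        for (int i = 0; i < result.length; i++) result[i] = ordinals[i];
        return result;
    }

    static double dot(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

Running this over every query after any offline zero-removal/deduplication yields a GT file that is valid by construction for the final dataset.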

Proposed fix

  1. Remove this benchmark-time scrubbing/remapping logic from DataSetUtils.getScrubbedDataSet(...)
  2. Treat benchmark datasets as fixed inputs during evaluation
  3. Move any zero-vector removal / deduplication / query filtering into explicit offline preprocessing tools
  4. Require preprocessed datasets to ship with matching recomputed ground truth
  5. Fail fast if a benchmark configuration attempts to evaluate against mismatched or transformed data rather than trying to remap original GT
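The fail-fast check in step 5 could look roughly like this sketch (validateGroundTruth is a hypothetical name, not existing API): reject evaluation outright when the ground truth contains nulls or ordinals outside the base set, instead of silently remapping.

```java
import java.util.List;

public class GroundTruthValidator {
    // Illustrative fail-fast validation: every GT ordinal must be a non-null
    // index into the base set actually being evaluated.
    public static void validateGroundTruth(List<List<Integer>> groundTruth, int baseVectorCount) {
        for (int q = 0; q < groundTruth.size(); q++) {
            List<Integer> neighbors = groundTruth.get(q);
            for (int k = 0; k < neighbors.size(); k++) {
                Integer ordinal = neighbors.get(k);
                if (ordinal == null) {
                    throw new IllegalStateException(
                            "Null ground truth ordinal at query " + q + ", position " + k);
                }
                if (ordinal < 0 || ordinal >= baseVectorCount) {
                    throw new IllegalStateException(
                            "Ground truth ordinal " + ordinal + " out of range for base size "
                            + baseVectorCount + " (query " + q + ", position " + k + ")");
                }
            }
        }
    }
}
```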

Rationale

This keeps benchmark semantics correct and avoids hidden data mutation inside the benchmark harness. It also aligns with standard ANN benchmarking practice: if the dataset changes, the ground truth must be regenerated for the changed dataset.
