Dataset scrubbing in DataSetUtils invalidates ground truth and can inject null ground truth ordinals #649

@tlwillke

Description

DataSetUtils.getScrubbedDataSet(...) currently mutates benchmark datasets at load time: it removes zero base vectors, removes duplicate base vectors, and removes query vectors that also appear in the base set. It then attempts to remap the original ground-truth ordinals onto the scrubbed base set via rawToScrubbed.get(j).

When a ground-truth ordinal points to a base vector that was removed during scrubbing, rawToScrubbed.get(j) returns null, and that null is inserted into the in-memory ground truth.
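The remapping step presumably looks something like the following sketch (the method shape and class name are illustrative; only the rawToScrubbed.get(j) lookup is taken from the code in question):

```java
import java.util.*;

public class ScrubRemapSketch {
    // Hypothetical sketch of the remapping in getScrubbedDataSet(...):
    // rawToScrubbed maps original base ordinals to their post-scrub positions.
    // Ordinals whose base vectors were removed have no entry, so get(j) returns null.
    public static List<List<Integer>> remapGroundTruth(List<List<Integer>> rawGt,
                                                       Map<Integer, Integer> rawToScrubbed) {
        List<List<Integer>> remapped = new ArrayList<>();
        for (List<Integer> neighbors : rawGt) {
            List<Integer> row = new ArrayList<>();
            for (int j : neighbors) {
                // BUG: inserts null whenever ordinal j was scrubbed away
                row.add(rawToScrubbed.get(j));
            }
            remapped.add(row);
        }
        return remapped;
    }

    public static void main(String[] args) {
        // Base ordinal 2 was removed during scrubbing, so it has no mapping.
        Map<Integer, Integer> rawToScrubbed = Map.of(0, 0, 1, 1, 3, 2);
        List<List<Integer>> gt = List.of(List.of(0, 2, 3));
        System.out.println(remapGroundTruth(gt, rawToScrubbed)); // [[0, null, 2]]
    }
}
```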

This later triggers a new exception planned for AccuracyMetrics. Example:

Null ground truth ordinal in top-10 at index 3

More importantly, this is not just a null-handling bug. It is a benchmark-correctness bug. Once the base set is changed, the original ground truth is no longer valid for the modified dataset. For ANN datasets, the base vectors, query vectors, and ground truth form a coupled benchmark definition. If we remove or deduplicate base vectors, then the correct top-k neighbors may change.

If the supplied ground truth only goes to top-100 and we need recall@100, there is no safe way to recover correctness from the original file after such scrubbing.
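The arithmetic behind this claim (numbers here are just the example from the sentence above, not measured values): once even one of the 100 supplied neighbors is scrubbed away, a perfect index can match at most 99 of them, and there is no GT depth beyond 100 to backfill from.

```java
public class RecallDepthExample {
    // Upper bound on recall@k when the supplied GT depth equals k and some
    // GT neighbors were removed from the base set by scrubbing.
    public static double bestPossibleRecall(int k, int gtDepth, int removedNeighbors) {
        int usableGt = gtDepth - removedNeighbors;
        return (double) usableGt / k;
    }

    public static void main(String[] args) {
        // recall@100 requested, GT only goes to top-100, one neighbor scrubbed:
        System.out.println(bestPossibleRecall(100, 100, 1)); // 0.99 at best
    }
}
```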

When this occurs

  • Dot-product datasets where zero vectors are removed
  • Any dataset where duplicate base vectors are removed
  • Any benchmark that uses the original ground truth after getScrubbedDataSet(...)
  • Especially problematic when evaluating recall/MAP at k equal to the full available GT depth, since there is no spare GT to fill gaps created by removed neighbors

Actual behavior

  • Benchmark loader silently changes the dataset
  • Original GT is remapped against the changed base
  • Removed GT ordinals become null
  • Evaluation later fails, or worse, could become semantically incorrect even if nulls were filtered out

Expected behavior

  • Benchmark datasets should not be altered at evaluation time in a way that invalidates the supplied ground truth
  • If preprocessing is desired, it should happen offline as an explicit dataset-preparation step
  • Any preprocessed dataset must have its own matching ground truth recomputed for that final dataset
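Recomputing matching ground truth for a preprocessed dataset amounts to exact brute-force top-k over the final base set. A minimal sketch, assuming dot-product similarity (class and method names are illustrative, not existing jvector API):

```java
import java.util.*;

public class ExactGroundTruth {
    // Brute-force exact top-k by descending dot product over the FINAL base set,
    // so the ground truth always matches the dataset actually being benchmarked.
    public static int[] topK(float[][] base, float[] query, int k) {
        Integer[] ordinals = new Integer[base.length];
        for (int i = 0; i < base.length; i++) ordinals[i] = i;
        Arrays.sort(ordinals, (a, b) -> Double.compare(dot(base[b], query), dot(base[a], query)));
        int[] result = new int[Math.min(k, base.length)];
        for (int i = 0; i < result.length; i++) result[i] = ordinals[i];
        return result;
    }

    static double dot(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

Running this over every query after any offline zero-removal/deduplication yields a GT file that is valid by construction for the final dataset.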

Proposed fix

  1. Remove this benchmark-time scrubbing/remapping logic from DataSetUtils.getScrubbedDataSet(...)
  2. Treat benchmark datasets as fixed inputs during evaluation
  3. Move any zero-vector removal / deduplication / query filtering into explicit offline preprocessing tools
  4. Require preprocessed datasets to ship with matching recomputed ground truth
  5. Fail fast if a benchmark configuration attempts to evaluate against mismatched or transformed data rather than trying to remap original GT
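The fail-fast check in step 5 could look roughly like this sketch (validateGroundTruth is a hypothetical name, not existing API): reject evaluation outright when the ground truth contains nulls or ordinals outside the base set, instead of silently remapping.

```java
import java.util.List;

public class GroundTruthValidator {
    // Illustrative fail-fast validation: every GT ordinal must be a non-null
    // index into the base set actually being evaluated.
    public static void validateGroundTruth(List<List<Integer>> groundTruth, int baseVectorCount) {
        for (int q = 0; q < groundTruth.size(); q++) {
            List<Integer> neighbors = groundTruth.get(q);
            for (int k = 0; k < neighbors.size(); k++) {
                Integer ordinal = neighbors.get(k);
                if (ordinal == null) {
                    throw new IllegalStateException(
                            "Null ground truth ordinal at query " + q + ", position " + k);
                }
                if (ordinal < 0 || ordinal >= baseVectorCount) {
                    throw new IllegalStateException(
                            "Ground truth ordinal " + ordinal + " out of range for base size "
                            + baseVectorCount + " (query " + q + ", position " + k + ")");
                }
            }
        }
    }
}
```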

Rationale

This keeps benchmark semantics correct and avoids hidden data mutation inside the benchmark harness. It also aligns with standard ANN benchmarking practice: if the dataset changes, the ground truth must be regenerated for the changed dataset.
