Describe the bug
approx_distinct over a Utf8View column can report an inflated distinct count. The same string value may be hashed in two different ways depending on the batch it arrives in: in the reproduction below, "aaa" appears once in an all-inline batch (no data buffers) and once in a batch that has a data buffer. The two hashes can land in different HyperLogLog registers, so one distinct value is counted more than once.
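To illustrate why two hashing paths inflate the count, here is a minimal standalone sketch. It is not DataFusion's code: `pack_inline` is a hypothetical stand-in for an inline string view, `DefaultHasher` replaces the real hasher, and a `HashSet` of hashes stands in for the HyperLogLog sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for an inline view: the length in the first
/// 4 bytes, followed by up to 12 bytes of string data.
fn pack_inline(s: &str) -> u128 {
    assert!(s.len() <= 12, "only short strings fit inline");
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

/// Hash any value with a fixed-key hasher so results are repeatable.
fn hash_one<T: Hash>(value: T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Path A: hash the packed 128-bit view.
fn hash_via_view(s: &str) -> u64 {
    hash_one(pack_inline(s))
}

/// Path B: hash the string contents.
fn hash_via_str(s: &str) -> u64 {
    hash_one(s)
}

/// Count distinct hashes when the first batch uses path A and the
/// second batch uses path B.
fn count_mixed_paths(batch_a: &[&str], batch_b: &[&str]) -> usize {
    let mut seen = HashSet::new();
    seen.extend(batch_a.iter().map(|&s| hash_via_view(s)));
    seen.extend(batch_b.iter().map(|&s| hash_via_str(s)));
    seen.len()
}

/// Count distinct hashes when every value goes through path B.
fn count_single_path(values: &[&str]) -> usize {
    values
        .iter()
        .map(|&s| hash_via_str(s))
        .collect::<HashSet<u64>>()
        .len()
}
```

With these helpers, `count_single_path(&["aaa", "bbb", "aaa"])` sees two distinct hashes, while `count_mixed_paths(&["aaa", "bbb"], &["aaa"])` sees three, because "aaa" is hashed once per path. Any fix needs every occurrence of a value to go through the same path, whatever the batch layout.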
To Reproduce
// Assumes this test lives in the module that defines StringViewHLLAccumulator.
use std::sync::Arc;

use arrow::array::{ArrayRef, AsArray, StringViewArray};
use datafusion_common::ScalarValue;
use datafusion_expr::Accumulator;

// Unwrap the UInt64 estimate returned by the accumulator.
fn distinct_count(acc: &mut StringViewHLLAccumulator) -> u64 {
    match acc.evaluate().unwrap() {
        ScalarValue::UInt64(Some(v)) => v,
        other => panic!("unexpected evaluate result: {other:?}"),
    }
}

// A string longer than the 12-byte inline limit
const LONG: &str = "this string is definitely longer than twelve bytes";

#[test]
fn split_batches_match_single_mixed_batch() {
    // Multiset: {"aaa" x2, "bbb", LONG}, so 3 distinct values.
    let mixed: ArrayRef =
        Arc::new(StringViewArray::from(vec!["aaa", "bbb", LONG, "aaa"]));
    let mut acc_single = StringViewHLLAccumulator::new();
    acc_single.update_batch(&[mixed]).unwrap();

    // Same multiset, but split so "aaa" lands in both an all-inline batch
    // and a batch with a data buffer (forced by LONG).
    let inline_only: ArrayRef = Arc::new(StringViewArray::from(vec!["aaa", "bbb"]));
    let with_buffer: ArrayRef = Arc::new(StringViewArray::from(vec!["aaa", LONG]));
    assert!(inline_only.as_string_view().data_buffers().is_empty());
    assert!(!with_buffer.as_string_view().data_buffers().is_empty());

    let mut acc_split = StringViewHLLAccumulator::new();
    acc_split.update_batch(&[inline_only]).unwrap();
    acc_split.update_batch(&[with_buffer]).unwrap();

    // Batching must not change the estimate.
    assert_eq!(
        distinct_count(&mut acc_single),
        distinct_count(&mut acc_split)
    );
    assert_eq!(distinct_count(&mut acc_single), 3);
}
Expected behavior
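approx_distinct should return the same estimate regardless of how the rows are split into batches. In the reproduction above, both accumulators should report 3.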
Additional context
Found this while working on #22768.