approx_distinct over-counts Utf8View because the hash strategy is chosen per batch instead of per value #22796

@haohuaijin

Describe the bug

approx_distinct over a Utf8View column can report an inflated distinct count. The hash strategy is chosen once per batch, so the same string value may be hashed in two different ways depending on which batch it arrives in (for example, an all-inline batch versus a batch that carries a data buffer). One distinct value then feeds two different hashes into the HyperLogLog sketch and is counted more than once.
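To make the failure mode concrete, here is a minimal, std-only sketch of the same effect. It is not the DataFusion implementation: the two hash paths (`hash_inline_view`, `hash_bytes`), the `INLINE_LIMIT` constant, and the use of a `HashSet<u64>` as a stand-in for the HyperLogLog sketch are all assumptions made for illustration. It only shows why choosing the hash path per batch can record one value under two hashes, while choosing it per value cannot.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Assumption: strings up to 12 bytes are stored inline in a 16-byte view.
const INLINE_LIMIT: usize = 12;
const LONG: &str = "this string is definitely longer than twelve bytes";

/// Hypothetical path A: hash the string bytes themselves.
fn hash_bytes(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.as_bytes().hash(&mut h);
    h.finish()
}

/// Hypothetical path B: hash a 16-byte inline view (4-byte length + padded bytes).
fn hash_inline_view(s: &str) -> u64 {
    debug_assert!(s.len() <= INLINE_LIMIT);
    let mut view = [0u8; 16];
    view[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    view[4..4 + s.len()].copy_from_slice(s.as_bytes());
    let mut h = DefaultHasher::new();
    u128::from_le_bytes(view).hash(&mut h);
    h.finish()
}

/// Path chosen once per batch: path B only if every value in the batch is inline.
fn update_per_batch(seen: &mut HashSet<u64>, batch: &[&str]) {
    let all_inline = batch.iter().all(|s| s.len() <= INLINE_LIMIT);
    for s in batch {
        let hash = if all_inline { hash_inline_view(s) } else { hash_bytes(s) };
        seen.insert(hash);
    }
}

/// Path chosen per value: a given string always takes the same path.
fn update_per_value(seen: &mut HashSet<u64>, batch: &[&str]) {
    for s in batch {
        let hash = if s.len() <= INLINE_LIMIT { hash_inline_view(s) } else { hash_bytes(s) };
        seen.insert(hash);
    }
}

/// Feed all batches through `update` and count the distinct hashes recorded.
fn count_distinct(update: fn(&mut HashSet<u64>, &[&str]), batches: &[&[&str]]) -> usize {
    let mut seen = HashSet::new();
    for batch in batches {
        update(&mut seen, batch);
    }
    seen.len()
}

fn main() {
    // Same multiset of values, delivered as one batch or as two.
    let single: &[&[&str]] = &[&["aaa", "bbb", LONG, "aaa"]];
    let split: &[&[&str]] = &[&["aaa", "bbb"], &["aaa", LONG]];

    // Per-batch choice: "aaa" is hashed via both paths in the split case.
    println!("per batch, single: {}", count_distinct(update_per_batch, single)); // 3
    println!("per batch, split:  {}", count_distinct(update_per_batch, split)); // 4

    // Per-value choice: the result does not depend on batch boundaries.
    println!("per value, single: {}", count_distinct(update_per_value, single)); // 3
    println!("per value, split:  {}", count_distinct(update_per_value, split)); // 3
}
```

In this toy model the per-value variant is insensitive to how rows are grouped into batches, which is the property the reproduction below checks against the real accumulator.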

To Reproduce

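    // Imports assumed for running this as a unit test next to the accumulator;
    // StringViewHLLAccumulator itself is expected to come from `use super::*;`.
    use std::sync::Arc;
    use arrow::array::{ArrayRef, AsArray, StringViewArray};
    use datafusion_common::ScalarValue;
    use datafusion_expr::Accumulator;

    fn distinct_count(acc: &mut StringViewHLLAccumulator) -> u64 {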
        match acc.evaluate().unwrap() {
            ScalarValue::UInt64(Some(v)) => v,
            other => panic!("unexpected evaluate result: {other:?}"),
        }
    }

    // A string longer than the 12-byte inline limit
    const LONG: &str = "this string is definitely longer than twelve bytes";

    #[test]
    fn split_batches_match_single_mixed_batch() {
        // Multiset: {"aaa" x2, "bbb", LONG}, so 3 distinct values.
        let mixed: ArrayRef =
            Arc::new(StringViewArray::from(vec!["aaa", "bbb", LONG, "aaa"]));
        let mut acc_single = StringViewHLLAccumulator::new();
        acc_single.update_batch(&[mixed]).unwrap();

        // Same multiset, but split so "aaa" lands in both an all-inline batch
        // and a batch with a data buffer (forced by LONG).
        let inline_only: ArrayRef = Arc::new(StringViewArray::from(vec!["aaa", "bbb"]));
        let with_buffer: ArrayRef = Arc::new(StringViewArray::from(vec!["aaa", LONG]));
        assert!(inline_only.as_string_view().data_buffers().is_empty());
        assert!(!with_buffer.as_string_view().data_buffers().is_empty());

        let mut acc_split = StringViewHLLAccumulator::new();
        acc_split.update_batch(&[inline_only]).unwrap();
        acc_split.update_batch(&[with_buffer]).unwrap();

        assert_eq!(
            distinct_count(&mut acc_single),
            distinct_count(&mut acc_split)
        );
        assert_eq!(distinct_count(&mut acc_single), 3);
    }

Expected behavior

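Splitting the same values across batches should not change the result: both accumulators in the test above should report 3 distinct values.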

Additional context

Found this while working on #22768.

Labels: bug