Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Dec 2, 2025

This adds tests that prove some bugs that are fixed in #18832 so we can level the playing field on benchmarks.

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 2, 2025
@adriangb
Copy link
Contributor Author

adriangb commented Dec 2, 2025

I expect tests to fail. We can fix them once we confirm.

Comment on lines +230 to +237
downcast_dictionary_array! {
v => {
let values_contains = self.contains(v.values().as_ref(), negated)?;
let result = take(&values_contains, v.keys(), None)?;
return Ok(downcast_array(result.as_ref()))
}
_ => {}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lack of this was a bug

Comment on lines 233 to 237
let result = match (v.null_count() > 0, negated) {
(true, false) => {
// has nulls, not negated"
BooleanArray::from_iter(
v.iter().map(|value| Some(self.values.contains(&value?))),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were not handling nulls properly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i double checked with postgres and the new code looks right

postgres=# select NULL IN (1);
 ?column?
----------

(1 row)

postgres=# select 1 IN (1, NULL);
 ?column?
----------
 t
(1 row)

postgres=# select 1 IN (2, NULL);
 ?column?
----------

(1 row)

postgres=# select 1 NOT IN (2, NULL);
 ?column?
----------

(1 row)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I checked postgres and duckdb to get the right behavior

@adriangb adriangb force-pushed the add-in-list-tests branch 3 times, most recently from 76fd945 to f2eb513 Compare December 8, 2025 17:54
@adriangb
Copy link
Contributor Author

adriangb commented Dec 8, 2025

@alamb I've added test harnesses to remove duplication as you requested 😄

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb -- I went through the code fixes and tests and this PR looks like a definitely improvement to me

I left some small suggestions

I also ran code coverage on this PR to double check and it did find one more case that is not covered

cargo llvm-cov test --html -p datafusion-physical-expr
Image

/// This validates data types, evaluates the list as constants, and uses specialized
/// StaticFilter implementations for better performance (e.g., Int32StaticFilter for Int32).
///
/// Returns an error if data types don't match or if the list contains non-constant expressions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is correct -- the code appears to handle the case of non-constant expressions as well

I think this might also be clearer if it were simply named try_new() rather than try_from_static_filter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks updated

BooleanArray::from_iter(
v.iter().map(|value| Some(self.values.contains(&value?))),
)
// Either needle or haystack has nulls, not negated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Either needle or haystack has nulls, not negated
// needle has nulls, not negated

BooleanArray::from_iter(
v.iter().map(|value| Some(!self.values.contains(&value?))),
)
// Either needle or haystack has nulls, negated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Either needle or haystack has nulls, negated
// needle has nulls, negated

negated,
Some(instantiate_static_filter(in_array)?),
)),
Err(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit -- I think it would be clearer if this try_evaluate_constant_list evaluated to Result<Option<..>> rather than Result -- that way we avoid the string allocation on error, and could pass real errors .

This code responds to all errors, even if the issue is something different than the types are not supported

If you went this route, I suspect you wouldn't need the second error check for type matches (it would already be handled)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed and refactored to Result<Option<>> as you suggested

name: &'static str,
value_in: ScalarValue,
value_not_in: ScalarValue,
value_in_list: ScalarValue,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my reading of these tests is that they always test a one element list -- as in the list does not have multiple values

Do you think we need coverage for multi-value lists?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous tests all had mutli-value lists, such as

        // expression: "a not in ("a", "b")"

Maybe we could make this something like values_in: Vec<ScalarValue>

However, I see there are multi-value tests below too, so maybe this is ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to accept other_list_values: Vec<ScalarValue> and have the tests throw in some extra values to give coverage

Comment on lines 233 to 237
let result = match (v.null_count() > 0, negated) {
(true, false) => {
// has nulls, not negated"
BooleanArray::from_iter(
v.iter().map(|value| Some(self.values.contains(&value?))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i double checked with postgres and the new code looks right

postgres=# select NULL IN (1);
 ?column?
----------

(1 row)

postgres=# select 1 IN (1, NULL);
 ?column?
----------
 t
(1 row)

postgres=# select 1 IN (2, NULL);
 ?column?
----------

(1 row)

postgres=# select 1 NOT IN (2, NULL);
 ?column?
----------

(1 row)

// Create dictionary-encoded batch with values [1, 2, 5]
// Dictionary: keys [0, 1, 2] -> values [1, 2, 5]
let keys = Int8Array::from(vec![0, 1, 2]);
let values = Int32Array::from(vec![1, 2, 5]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally recommend using values that are clearly not the keys - for example

Suggested change
let values = Int32Array::from(vec![1, 2, 5]);
let values = Int32Array::from(vec![100, 200, 500]);

( you would also have to change the literals above)

}
(false, true) => {
// no null, negated
// No nulls anywhere, negated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to code coverage, this branch is not covered by tests (see PR comments)

RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::clone(&array)])?;

// Helper to format SQL-like representation for error messages
let _format_sql = |negated: bool, with_null: bool| -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this helper seems unused (why is it named starting with _ 🤔 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a pipe dream to have nice error messages. I updated it and implemented it. now if a test fails you get something like:

  assertion `left == right` failed: Failed for: a IN (5, 3, 7)
  a: PrimitiveArray<Int32>
  [
    1,
    3,
    4,
  ]
    left: BooleanArray
  [
    true,
    true,
    true,
  ]
   right: BooleanArray
  [
    false,
    true,
    false,
  ]

@adriangb adriangb added this pull request to the merge queue Dec 9, 2025
Merged via the queue into apache:main with commit 21a16e4 Dec 9, 2025
31 checks passed
@adriangb adriangb deleted the add-in-list-tests branch December 9, 2025 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants