Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InList: fix bug for comparing with Null in the list using the set optimization #2809

Merged
merged 5 commits into from
Jul 4, 2022

Conversation

liukun4515
Copy link
Contributor

@liukun4515 liukun4515 commented Jun 29, 2022

Which issue does this PR close?

close #2817

In this pr: #2156, @Ted-Jiang add the set to optimize the Inlist.
But the implementation does not consider the NULL case for IN or NOT IN expr.

In the SQL system like Mysql or spark, NULL is not equal NULL and NULL can't be compared with other data type.
If A compare with NULL, the result must be NULL.

select 1 not in (2,NULL);
NULL 
select 2 in (1,NULL)
NULL
select NULL in (1,NULL)
NULL
select NULL not in (1,NULL)
NULL

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Jun 29, 2022
},
ColumnarValue::Array(_) => {
unimplemented!("InList does not yet support nested columns.")
}
})
.collect::<Vec<_>>();

// TODO do we need to replace this below logic by `in_list_primitive`?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to replace below logic by the macro collection_contains_check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?

Yes, I want to replace the logic in the in_list_primitive by the macro collection_contains_check.
I will replace the duplicated code in the in_list_primitive.

.collect::<BooleanArray>(),
)));
.map(|vop| {
match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The f32 is not implemented the eq trait, we can't use the set::<f32>.contains

Copy link
Contributor Author

@liukun4515 liukun4515 Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just use the set::<Scalarvalue::Float32/Float64>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere in DataFusion (including in ScalarValue) we use ordered_float to compare floating point numbers

It might be possible to use set::<OrderedFloat<f32>>, which would be more space efficient (fewer bytes than ScalarValue) as well as faster (as the comparison doens't have to dispatch on the type each time)

https://github.com/apache/arrow-datafusion/blob/88b88d4360054a85982987aa07b3f3afd2db7d70/datafusion/common/src/scalar.rs#L33

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I will optimize this with follow-up pr with issue #2831

@liukun4515 liukun4515 changed the title InList: fix bug for compare with Null in the list using the set optimization InList: fix bug for comparing with Null in the list using the set optimization Jun 29, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 29, 2022

Codecov Report

Merging #2809 (f2730a6) into master (b47ab7c) will increase coverage by 0.09%.
The diff coverage is 85.79%.

@@            Coverage Diff             @@
##           master    #2809      +/-   ##
==========================================
+ Coverage   85.18%   85.27%   +0.09%     
==========================================
  Files         275      275              
  Lines       48564    48675     +111     
==========================================
+ Hits        41367    41508     +141     
+ Misses       7197     7167      -30     
Impacted Files Coverage Δ
...atafusion/physical-expr/src/expressions/in_list.rs 81.83% <85.79%> (+13.63%) ⬆️
datafusion/expr/src/logical_plan/plan.rs 74.31% <0.00%> (-0.20%) ⬇️
datafusion/expr/src/window_frame.rs 93.27% <0.00%> (+0.84%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b47ab7c...f2730a6. Read the comment docs.

@viirya
Copy link
Member

viirya commented Jun 29, 2022

I don't get it clearly what this is targeted to fix. The issue it closes is "support decimal in (NULL)", but by a quick look, seems the change is more than that. Although there is a few description, but I cannot get it from it too.

Could you rephrase the issue this is going to fix?

@liukun4515
Copy link
Contributor Author

liukun4515 commented Jun 30, 2022

I don't get it clearly what this is targeted to fix. The issue it closes is "support decimal in (NULL)", but by a quick look, seems the change is more than that. Although there is a few description, but I cannot get it from it too.

Could you rephrase the issue this is going to fix?

sorry for the confused description.

I have described the issue in #2817 and the description of this pull request.

PTAL @viirya

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liukun4515 -- I reviewed the tests carefully and I think this looks good.

The code "feels" a little more complicated than needed but I think it is doing what it needs to and the test coverage is good.

I wonder if it is worth just 1 test in sql_integration somewhere that ensures this implementation is hooked up correctly rather than just relying on tests in in_list.rs

I think it would be good if @Ted-Jiang was also able to review this PR prior to merging it, but it isn't necessary.

lit(ScalarValue::Boolean(Some(true))),
lit(ScalarValue::Boolean(None)),
];
for _ in 1..(OPTIMIZER_INSET_THRESHOLD + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the choice of bounds of 1 ... OPTIMIZER_INSET_THRESHOLD + 1 -- why not 0..OPTIMIZER_INSET_THRESHOLD? Not that this way is wrong, I just don't understand it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to 0..OPTIMIZER_INSET_THRESHOLD

let col_a = col("a", &schema)?;
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?;

// expression: "a in (0,3,4....)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// expression: "a in (0,3,4....)"
// expression: "a in (0,Null,3,4....)"

batch,
list.clone(),
&false,
vec![Some(true), None, None],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let col_a = col("a", &schema)?;
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?;

// expression: "a in (0.0,3.0,4.0 ....)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// expression: "a in (0.0,3.0,4.0 ....)"
// expression: "a in (0.0,Null,3.0,4.0 ....)"

@@ -1139,4 +1242,192 @@ mod tests {

Ok(())
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the test coverage

.collect::<BooleanArray>(),
)));
.map(|vop| {
match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere in DataFusion (including in ScalarValue) we use ordered_float to compare floating point numbers

It might be possible to use set::<OrderedFloat<f32>>, which would be more space efficient (fewer bytes than ScalarValue) as well as faster (as the comparison doens't have to dispatch on the type each time)

https://github.com/apache/arrow-datafusion/blob/88b88d4360054a85982987aa07b3f3afd2db7d70/datafusion/common/src/scalar.rs#L33

},
ColumnarValue::Array(_) => {
unimplemented!("InList does not yet support nested columns.")
}
})
.collect::<Vec<_>>();

// TODO do we need to replace this below logic by `in_list_primitive`?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?

@liukun4515
Copy link
Contributor Author

Thank you @liukun4515 -- I reviewed the tests carefully and I think this looks good.

The code "feels" a little more complicated than needed but I think it is doing what it needs to and the test coverage is good.

I wonder if it is worth just 1 test in sql_integration somewhere that ensures this implementation is hooked up correctly rather than just relying on tests in in_list.rs

I think it would be good if @Ted-Jiang was also able to review this PR prior to merging it, but it isn't necessary.

#2832 issue to track it

collection_contains_check!(array, native_set, negated, contains_null)
}

fn set_contains_utf<OffsetSize: OffsetSizeTrait>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn set_contains_utf<OffsetSize: OffsetSizeTrait>(
fn make_set_contains_utf8<OffsetSize: OffsetSizeTrait>(

@@ -532,66 +521,131 @@ impl PhysicalExpr for InListExpr {
match value_data_type {
DataType::Boolean => {
let array = array.as_any().downcast_ref::<BooleanArray>().unwrap();
set_contains_with_negated!(array, set, self.negated)
Ok(set_contains_for_primitive!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why set_contains_for_primitive is macro, but set_contains_utf and make_set_contains_decimal are methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_contains_utf and make_set_contains_decimal are just apply to a specified data type case which are not compatible with other primitive case.
set_contains_for_primitive will apply to many similar case and same logic

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. The logic handling null looks correct to me. I have a few comments about method naming and macro/method.

return Ok(ColumnarValue::Array(Arc::new(
macro_rules! collection_contains_check {
($ARRAY:expr, $VALUES:expr, $NEGATED:expr, $CONTAINS_NULL:expr) => {{
let bool_array = if $NEGATED {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two $NEGATED cases look very similar, is it possible to combine them to save code?

e.g.

if $CONTAINS_NULL {
  $ARRAY
    .iter()
    .map(|vop| match vop.map(|v| {
      if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
    }) {
      Some(true) if $NEGATED => None,
      Some(false) if !$NEGATED => None,
      x => x,
    })
  .collect::<BooleanArray>()
} else {
  $ARRAY
    .iter()
    .map(|vop| vop.map(|v| {
      if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
    })
    .collect::<BooleanArray>()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya
these two branches can be merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya @alamb
each loop will check the $NEGATED and has two branches for switching.
Is this can use the vectorization?
I am not sure about this, so I will remain current implementation.
We can discuss it until getting the conclusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2833 track this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have definitely seen cases where

if $NEGATED {
  // loop
} else {
  // loop
}

Was faster than

loop {
  if $NEGATED {
    // ..
  } else {
   // ..
  } 
}

But I think the only way to find out if that applies in this case would be with a benchmark

@liukun4515 liukun4515 merged commit 57f47ab into apache:master Jul 4, 2022
@liukun4515 liukun4515 deleted the inlist_set_bug branch July 4, 2022 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IN/NOT IN List: NULL is not equal to NULL
4 participants