Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10356: [Rust][DataFusion] Add support for is_in #9038

Closed
wants to merge 1 commit into from

Conversation

seddonm1
Copy link
Contributor

This PR is a work-in-progress simple implementation of InList ('ABC' IN ('ABC', 'DEF')) which currently only operates on strings.

It uses the kernels::comparison::contains implementation but there are a few issues I am struggling with:

  1. kernels::comparison::contains allows each value in the input array to match against potentially different value arrays. My implementation is very inefficiently creating the same array n times to prevent the error of mismatched input lengths (https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs#L696). Is there a more efficient way to create these ListArrays?

  2. kernels::comparison::contains returns false if either of the comparison values is null. Is this the desired behavior? If not I can modify the kernel to return null instead.

  3. If the basic implementation looks correct I can add the rest of the data types (via macros).

@github-actions
Copy link

/// The value to compare
expr: Box<Expr>,
/// The low end of the range
list: Vec<Expr>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be easier to convert it here already to a vec where each element should have the same datatype,. And we check that while generating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find another example where we do validation like checking same datatypes in the Logical Plan. Most of this type of validation is performed in the Physical Plan: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1650

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see. Maybe could be a future optimization so that we can convert it to a more efficient representation upfront, and generating an error earlier when it can not be executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).

You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118

And coerce the in list items all to the same types

/// The low end of the range
list: Vec<Expr>,
/// Whether the expression is negated
negated: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might keep negated out and use not instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps keeping the logical plan simple, and also makes future code that uses the LP tree simple, e.g. an optimization rule on not(..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly included negated to allow pretty printing like: 'z' NOT IN ('x','y'). I have changed this so it now uses the not expr so will now display NOT 'z' IN ('x','y')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think supporting sql style NOT IN would be nice (though no changes needed in this PR)

Copy link
Contributor

@Dandandan Dandandan Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice indeed for a next PR, I think we could have a special case to match on Not (ListIn (...) in the formatter instead 👍

Copy link
Contributor

@alamb alamb Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between c NOT IN (...) and NOT c IN (...) FWIW that might require representing them differently

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm ok... in that case my initial suggestion might have been wrong... would good to have some tests for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. I have done some testing with Postgres 13.1 and found that it does not appear to make a difference. These are all equivalent and return NULL.

SELECT NOT NULL IN ('a');
SELECT NULL NOT IN ('a');
SELECT NOT 'a' IN (NULL);
SELECT 'a' NOT IN (NULL);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks @seddonm1 for checking . sounds good to me

.iter()
.any(|dt| *dt != value_data_type)
{
return Err(DataFusionError::Internal(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this earlier already when creating/checking the logical plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that the appropriate place would be to "coerce" all the in list item types to the same data type during Logical --> Physical plan creation.

@Dandandan
Copy link
Contributor

I think this is a great start @seddonm1 !

Just some thoughts: an "ideal" implementation would convert the items upfront to an Vec<i64>, Vec<String>, etc on creation, so less work would be needed when executing the plan. Also, at some threshold for bigger arrays we could create a HashSet<i64>, HashSet<String> etc instead and write a kernel using a this HashSet.

@seddonm1
Copy link
Contributor Author

Thanks @Dandandan

Do you think we should re-evaluate the current behavior of the kernels::comparison::contains based on my comments 1 and 2?

Creating the same array n times feels very inefficient if we can modify the kernel.

@Dandandan
Copy link
Contributor

  1. Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.
  2. I believe the current behavior related to NULL is correct, just as field=NULL will also be false.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Dec 30, 2020

  1. Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.

Agree. I can have a look as part of this PR.

  1. I believe the current behavior related to NULL is correct, just as field=NULL will also be false.

I have tested in Postgres and it will return NULL if value is NULL unlike this implementation which returns false.

@seddonm1
Copy link
Contributor Author

I have updated this PR with a reimplementation of the logic so that the kernel which has two undesired behaviour (see points 1 and 2) is no longer invoked. It should also support the full range of types as well.

@alamb
Copy link
Contributor

alamb commented Dec 31, 2020

The full set of Rust CI tests did not run on this PR :(

Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do?

I apologize for the inconvenience.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Jan 1, 2021

@Dandandan @alamb rebased and added some tests.

@codecov-io
Copy link

codecov-io commented Jan 1, 2021

Codecov Report

Merging #9038 (96cd76d) into master (98159f1) will decrease coverage by 0.03%.
The diff coverage is 77.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9038      +/-   ##
==========================================
- Coverage   82.60%   82.57%   -0.04%     
==========================================
  Files         204      204              
  Lines       50496    50879     +383     
==========================================
+ Hits        41713    42013     +300     
- Misses       8783     8866      +83     
Impacted Files Coverage Δ
rust/arrow-pyarrow-integration-testing/src/lib.rs 0.00% <ø> (ø)
rust/benchmarks/src/bin/tpch.rs 7.02% <ø> (ø)
...datafusion/src/physical_plan/string_expressions.rs 87.50% <ø> (ø)
rust/datafusion/src/optimizer/utils.rs 58.18% <45.45%> (-0.54%) ⬇️
rust/datafusion/src/sql/utils.rs 53.92% <47.05%> (-0.68%) ⬇️
rust/datafusion/src/physical_plan/functions.rs 78.51% <63.63%> (-1.49%) ⬇️
rust/arrow/src/compute/kernels/sort.rs 93.56% <68.62%> (+0.14%) ⬆️
rust/datafusion/src/logical_plan/expr.rs 76.92% <77.27%> (+0.02%) ⬆️
rust/datafusion/src/physical_plan/expressions.rs 83.77% <78.53%> (-0.71%) ⬇️
rust/arrow/src/ffi.rs 75.95% <80.00%> (+0.28%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0050795...96cd76d. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like great work -- thanks @seddonm1 !

I think starting with basic functionality and then making it faster / more full featured is a great idea.

For this PR in particular, I think the minimum required work would be:

  1. Tests for the other data types and null handling (even if the null handling doesn't strictly follow SQL)

Bonus points for the following (and they would be fine to file as follow on PRs):

  1. Type coercion / checking for the types of the in list during planning time
  2. ANSI null handling semantics
  3. More optimized runtime implementation (e.g. optimizing the comparisons / make a hashset, etc).

All in all, this is a great start. Thank you @seddonm1

/// The value to compare
expr: Box<Expr>,
/// The low end of the range
list: Vec<Expr>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).

You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118

And coerce the in list items all to the same types

.iter()
.any(|dt| *dt != value_data_type)
{
return Err(DataFusionError::Internal(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that the appropriate place would be to "coerce" all the in list item types to the same data type during Logical --> Physical plan creation.

datatype => unimplemented!("Unexpected type {} for InList", datatype),
},
ColumnarValue::Array(_) => {
unimplemented!("InList should not receive Array")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably generate an error earlier in planing too (e.g. if you see an expression like my_col IN (my_other_col, 'foo') )

} => {
let list_expr = list
.iter()
.map(|e| self.sql_expr_to_logical_expr(e))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where I think you could add the type coercion / checking logic

@@ -1849,3 +1849,45 @@ async fn string_expressions() -> Result<()> {
assert_eq!(expected, actual);
Ok(())
}

#[tokio::test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is also support in this PR for numeric types, I would also suggest some basic tests for IN lists with numbers as well (e.g. c1 IN (1, 2 3) as well as c1 IN (1, NULL)

Ok(ColumnarValue::Array(Arc::new(
array
.iter()
.map(|x| x.map(|x| values.contains(&&x)))
Copy link
Contributor

@alamb alamb Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this handles NULL correctly -- like for a value of where expr is NULL the output should be NULL (not true/false). The semantics when there is a literal NULL in the inlist are even stranger (but likely could be handled as a follow on PR)

For example:

sqlite> create table t(c1 int);
sqlite> insert into t values (10);
sqlite> insert into t values (20);
sqlite> insert into t values(NULL);
sqlite> select c1, c1 IN (20, NULL) from t;
10|
20|1
|
sqlite> select c1, c1 IN (20) from t;
10|0
20|1
|

Note that 10 IN (20, NULL) is actually NULL rather than FALSE. Crazy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach of mapping the array was suggested by @jorgecarleitao when helping me with the StringExpressions: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/string_expressions.rs#L84. The benefit is that if the input value is NULL (i.e. None) then we don't have to do any work on it (the second map).

I have confirmed this is the desired behavior against Postgres 13.1 so that any NULL input expr should return null:

SELECT NULL IN ('a'); -> NULL
SELECT NULL NOT IN ('a'); -> NULL
SELECT NULL IN (NULL, 'a'); -> NULL
SELECT NULL NOT IN (NULL, 'a'); -> NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to the second problem, NULL in the list component it gets even crazier than your example (Postgres 13.1). What a mess.

SELECT 'a' IN (NULL); -> NULL
SELECT 'a' IN (NULL, 'a'); -> TRUE
SELECT 'a' IN (NULL, 'b'); -> NULL

@alamb
Copy link
Contributor

alamb commented Jan 1, 2021

Note that the clippy error https://github.com/apache/arrow/pull/9038/checks?check_run_id=1632226138 has been fixed on master so if you rebase this PR against master that CI check should pass

@alamb
Copy link
Contributor

alamb commented Jan 1, 2021

kernels::comparison::contains returns false if either of the comparison values is null. Is this the desired behavior? If not I can modify the kernel to return null instead.

FWIW I think the contains kernel should return NULL in these cases

@jhorstmann
Copy link
Contributor

An alternative implementation would be to translate x IN ('ABC', 'DEF', 'GHI') into (x = 'ABC') OR (x = 'DEF') OR (x = 'GHI') on the logical plan level. That should handle nulls correctly and for a short set of values it should also have good performance since there are special comparison kernels for comparing against literals. For a larger set of values the contains might be more performant because of the possibility for returning early.

@Dandandan
Copy link
Contributor

@jhorstmann nice idea! Maybe it would be better to do that in an optimization rule?

@seddonm1
Copy link
Contributor Author

seddonm1 commented Jan 1, 2021

@jhorstmann @Dandandan I like the simplicity of this idea but there are a lot of strange cases that need to be considered given how ANSI SQL handles NULL values.

@jhorstmann
Copy link
Contributor

Semantics for the IN and OR versions should be the same in SQL, for example try the following query in postgres:

SELECT 'abc' in ('abc', 'def')
     , 'abc' in ('abc', 'def', null)
     , 'abc' in ('def', null)
     , null in ('abc', 'def')
     -- same expressions rewritten using OR
     , (('abc' = 'abc') OR ('abc' = 'def'))
     , (('abc' = 'abc') OR ('abc' = 'def') or ('abc' = null))
     , (('abc' = 'def') or ('abc' = null))
     , ((null = 'abc') OR (null = 'def'))
  1. result is true, no nulls involved
  2. result is true, additional null on the rhs does not change this since (true OR null) = true
  3. result is null since (false OR null) IS NULL
  4. result is null since lhs of each comparison is null

But you are right, it's not that simple since the arrow or / and kernels currently do not follow this sql behaviour. We might want to change that or rather introduce separate boolean kernels with the sql behaviour regarding nulls.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Jan 1, 2021

But you are right, it's not that simple since the arrow or / and kernels currently do not follow this sql behaviour. We might want to change that or rather introduce separate boolean kernels with the sql behaviour regarding nulls.

@jhorstmann I like this approach the best. We temporarily shelve this PR (and I can do more work on the early validation) whilst these kernels are implemented then invoke them like your idea.

@alamb
Copy link
Contributor

alamb commented Jan 2, 2021

FWIW I think having a native col IN (constant list) is a useful thing to have (eventually). On other words I do like the direction of this PR

The rationale for having col IN (list) type predicates is that they make recognizing the special case of "single column predicates" more easily (e.g. so you can split them off and push them down the plans into scans)

@seddonm1
Copy link
Contributor Author

seddonm1 commented Jan 4, 2021

Ok, I have done a major refactor against a rebased master.

I believe this now meets the ANSI behavior with regard to NULL handling but it does not yet support syntax where columns are referenced in the list parameter like:

SELECT TRUE IN (col1, col2, FALSE)

This has been implemented with a "make it work then make it fast" approach as this InList expression actually has a lot of complexity.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code, and the tests. The tests are 👍 and I think this is a great initial implementation on which to build. Thank you @seddonm1

}

#[tokio::test]
async fn in_list_scalar() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

.project(vec![col("c1").in_list(list, false)])?
.build()?;
let execution_plan = plan(&logical_plan)?;
// verify that the plan correctly adds cast from Int64(1) to Utf8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -305,6 +312,7 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> {
low.as_ref().to_owned(),
high.as_ref().to_owned(),
]),
Expr::InList { expr, .. } => Ok(vec![expr.as_ref().to_owned()]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also include the exprs in list as well ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have updated the PR with this.

@@ -416,6 +424,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr>
Ok(expr)
}
}
Expr::InList { .. } => Ok(expr.clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here, I think we might want to include the list -- even though at the moment it only contains constants, it is a Vec<Expr>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here this is just cloning the while InList expression (not the expr in InList) as the optimiser is not doing anything for this Expression yet.

let list = vec![
lit(ScalarValue::Float64(Some(0.0))),
lit(ScalarValue::Float64(Some(0.1))),
lit(ScalarValue::Utf8(None)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it hurts, but given the coercion logic you added in the planner, I think the literals at this point should all be the same type as the expr value. In other words, can you really see a NOT IN (NULL::Utf8)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The literal Expr::Literal(ScalarValue::Utf8(None)) is a special case in DataFusion at the moment which represents the SQL logical NULL. It is being passed through to the evaluation as it is required to identify whether the list contains any literal NULL so that we can override the return value with NULL. I think this could be optimised in future.

@yordan-pavlov
Copy link
Contributor

yordan-pavlov commented Jan 6, 2021

@seddonm1 looks great, the 'IN' operator is one of the features I have been missing and thinking about implementing myself but looks like you beat me to it :)
my only concern is that the current implementation doesn't appear to make use of SIMD; have you looked into comparing performance against an expression of the form (eq(array, scalar1) or eq(array, scalar2) or ... or eq(array, scalarN))?

@seddonm1
Copy link
Contributor Author

seddonm1 commented Jan 6, 2021

@alamb thanks for taking the time to review this as I know it ended up as quite a large PR 👍 . I have updated based on your comment.

@yordan-pavlov yes this is basically as naive implementation as possible and could be heavily optimised. I think we should merge this PR to unblock TPC-H Query 12: l_shipmode in ('MAIL', 'SHIP') then look at optimisation. The test cases should help with any future optimisation work anyway.

@alamb
Copy link
Contributor

alamb commented Jan 6, 2021

I'll plan to merge this in as soon as the CI passes

@@ -656,7 +656,7 @@ fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<Logic
on
l_orderkey = o_orderkey
where
(l_shipmode = 'MAIL' or l_shipmode = 'SHIP')
l_shipmode in ('MAIL', 'SHIP')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Jan 7, 2021

@seddonm1 sadly this PR has some small conflicts -- can you please rebase it so I can merge it in? Thanks again for all your work to get this done.

@alamb alamb changed the title ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP) ARROW-10356: [Rust][DataFusion] Add support for is_in Jan 7, 2021
@alamb
Copy link
Contributor

alamb commented Jan 7, 2021

I also changed the title of this PR so that it doesn't say "WIP" anymore -- as I don't think it is WIP (I hope not, given that I plan to merge it!)

@seddonm1
Copy link
Contributor Author

seddonm1 commented Jan 7, 2021

Thanks @alamb . Yes the WIP was a leftover :D I have rebased so once the CI passes it should merge!

@alamb
Copy link
Contributor

alamb commented Jan 8, 2021

I filed https://issues.apache.org/jira/browse/ARROW-11182 to track possible improvements to performance

@alamb alamb closed this in c4ee536 Jan 8, 2021
@alamb
Copy link
Contributor

alamb commented Jan 8, 2021

Thanks again @seddonm1 .

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR is a work-in-progress simple implementation of `InList` (`'ABC' IN ('ABC', 'DEF')`) which currently only operates on strings.

It uses the `kernels::comparison::contains` implementation but there are a few issues I am struggling with:

1. `kernels::comparison::contains` allows each value in the input array to match against potentially different value arrays. My implementation is very inefficiently creating the same array n times to prevent the error of mismatched input lengths (https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs#L696). Is there a more efficient way to create these `ListArray`s?

2. `kernels::comparison::contains` returns `false` if either of the comparison values is `null`. Is this the desired behavior? If not I can modify the kernel to return null instead.

3. If the basic implementation looks correct I can add the rest of the data types (via macros).

Closes apache#9038 from seddonm1/in-list

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants