ARROW-10356: [Rust][DataFusion] Add support for is_in #9038
Conversation
/// The value to compare
expr: Box<Expr>,
/// The low end of the range
list: Vec<Expr>,
I think it might be easier to convert it here already to a vec in which every element has the same datatype, and to check that while generating it?
I cannot find another example where we do validation like checking same datatypes in the Logical Plan. Most of this type of validation is performed in the Physical Plan: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1650
Yes, I see. Maybe it could be a future optimization, so that we can convert it to a more efficient representation upfront and generate an error earlier when it cannot be executed.
I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).
You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118
and coerce all the IN list items to the same type.
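To illustrate the coercion idea discussed above, here is a minimal sketch that folds the value type and all list item types into one common type. The `DataType` enum, `common_type`, and `coerce_in_list` are hypothetical stand-ins written for this example; they are not DataFusion's actual types or its coercion rules, and the rule "anything mixed with `Utf8` becomes `Utf8`" is an assumption.

```rust
/// Toy stand-in for a data type enum (hypothetical, not the Arrow type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

/// Assumed rule: pick a type that both inputs can be cast to.
fn common_type(a: DataType, b: DataType) -> DataType {
    use DataType::*;
    match (a, b) {
        // Identical types need no cast.
        (x, y) if x == y => x,
        // Assumption: any mix involving strings is compared as strings.
        (Utf8, _) | (_, Utf8) => Utf8,
        // The only remaining mix is Int64 with Float64.
        _ => Float64,
    }
}

/// Fold the value type and every list item type into one target type.
/// A planner could then insert casts to this type for each expression.
fn coerce_in_list(value: DataType, list: &[DataType]) -> DataType {
    list.iter().fold(value, |acc, dt| common_type(acc, *dt))
}
```

With these assumed rules, a `Utf8` column compared against an `Int64` literal resolves to `Utf8`, so the literal would be the side that receives a cast.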
/// The low end of the range
list: Vec<Expr>,
/// Whether the expression is negated
negated: bool,
We might keep `negated` out and use `not` instead?
This helps keep the logical plan simple, and also keeps future code that uses the LP tree simple, e.g. an optimization rule on not(..)
I mainly included `negated` to allow pretty printing like `'z' NOT IN ('x','y')`. I have changed this so it now uses the `not` expr, so it will now display `NOT 'z' IN ('x','y')`.
I think supporting SQL-style `NOT IN` would be nice (though no changes needed in this PR).
Would be nice indeed for a follow-up PR. I think we could have a special case that matches on `Not(InList(...))` in the formatter instead 👍
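A rough sketch of what such a formatter special case could look like, using a toy `Expr` enum defined only for this example (it is not DataFusion's `Expr`, and the `join` helper is hypothetical):

```rust
use std::fmt;

/// Toy expression tree, defined here purely for illustration.
enum Expr {
    Literal(String),
    InList { expr: Box<Expr>, list: Vec<Expr> },
    Not(Box<Expr>),
}

/// Render the list items separated by commas.
fn join(list: &[Expr]) -> String {
    list.iter()
        .map(|e| e.to_string())
        .collect::<Vec<_>>()
        .join(",")
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Literal(s) => write!(f, "'{}'", s),
            Expr::InList { expr, list } => write!(f, "{} IN ({})", expr, join(list)),
            Expr::Not(inner) => match inner.as_ref() {
                // Special case: print NOT wrapped around IN as SQL-style NOT IN.
                Expr::InList { expr, list } => {
                    write!(f, "{} NOT IN ({})", expr, join(list))
                }
                // Any other negated expression keeps the prefix form.
                other => write!(f, "NOT {}", other),
            },
        }
    }
}
```

The logical plan still only stores `Not` and `InList`; only the display changes.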
I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between `c NOT IN (...)` and `NOT c IN (...)`.
FWIW, that might require representing them differently.
Hm, ok... in that case my initial suggestion might have been wrong. It would be good to have some tests for this.
Thanks for the comments. I have done some testing with Postgres 13.1 and found that it does not appear to make a difference. These are all equivalent and return `NULL`:
SELECT NOT NULL IN ('a');
SELECT NULL NOT IN ('a');
SELECT NOT 'a' IN (NULL);
SELECT 'a' NOT IN (NULL);
Ok, thanks @seddonm1 for checking. Sounds good to me.
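One way to see why the two spellings agree in the Postgres checks above is to model a SQL boolean as `Option<bool>`, with `None` standing for NULL. This is only a sketch of three-valued negation; the `sql_not` name is made up for this example.

```rust
/// SQL NOT under three-valued logic: NULL stays NULL,
/// TRUE and FALSE are swapped. `None` models SQL NULL.
fn sql_not(v: Option<bool>) -> Option<bool> {
    v.map(|b| !b)
}
```

Under this model, if `x IN (...)` evaluates to NULL, then negating it also gives NULL, which matches all four queries listed in the thread.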
    .iter()
    .any(|dt| *dt != value_data_type)
{
    return Err(DataFusionError::Internal(format!(
I think we should do this earlier already when creating/checking the logical plan.
I suggest that the appropriate place would be to "coerce" all the in list item types to the same data type during Logical --> Physical plan creation.
I think this is a great start @seddonm1 ! Just some thoughts: an "ideal" implementation would convert the items upfront to an |
Thanks @Dandandan Do you think we should re-evaluate the current behavior of the Creating the same array |
Agree. I can have a look as part of this PR.
I have tested in Postgres and it will return |
I have updated this PR with a reimplementation of the logic so that the kernel, which has two undesired behaviours (see points 1 and 2), is no longer invoked. It should also support the full range of types.
The full set of Rust CI tests did not run on this PR :( Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do? I apologize for the inconvenience.
@Dandandan @alamb rebased and added some tests.
Codecov Report
@@ Coverage Diff @@
## master #9038 +/- ##
==========================================
- Coverage 82.60% 82.57% -0.04%
==========================================
Files 204 204
Lines 50496 50879 +383
==========================================
+ Hits 41713 42013 +300
- Misses 8783 8866 +83
This looks like great work -- thanks @seddonm1 !
I think starting with basic functionality and then making it faster / more full featured is a great idea.
For this PR in particular, I think the minimum required work would be:
- Tests for the other data types and null handling (even if the null handling doesn't strictly follow SQL)
Bonus points for the following (and they would be fine to file as follow on PRs):
- Type coercion / checking for the types of the in list during planning time
- ANSI null handling semantics
- More optimized runtime implementation (e.g. optimizing the comparisons / make a hashset, etc).
All in all, this is a great start. Thank you @seddonm1
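As a rough illustration of the hashset idea in the last bullet, the sketch below builds the set once and then probes it per row. It works on plain slices rather than Arrow arrays, the function name `in_list_hashed` is hypothetical, and it assumes the list itself contains no NULLs (NULL list items would need the extra handling discussed elsewhere in this review).

```rust
use std::collections::HashSet;

/// Evaluate `column IN (list)` row by row.
/// `None` in the column models a NULL input and yields a NULL output.
fn in_list_hashed(column: &[Option<i64>], list: &[i64]) -> Vec<Option<bool>> {
    // Build the lookup set once instead of scanning the list for every row.
    let set: HashSet<i64> = list.iter().copied().collect();
    column
        .iter()
        .map(|v| v.map(|x| set.contains(&x)))
        .collect()
}
```

Each row then costs one hash lookup on average instead of a scan over the whole list.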
    datatype => unimplemented!("Unexpected type {} for InList", datatype),
},
ColumnarValue::Array(_) => {
    unimplemented!("InList should not receive Array")
This should probably generate an error earlier in planning too (e.g. if you see an expression like `my_col IN (my_other_col, 'foo')`).
} => {
    let list_expr = list
        .iter()
        .map(|e| self.sql_expr_to_logical_expr(e))
Here is where I think you could add the type coercion / checking logic
@@ -1849,3 +1849,45 @@ async fn string_expressions() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]
Since there is also support in this PR for numeric types, I would also suggest some basic tests for IN lists with numbers (e.g. `c1 IN (1, 2, 3)` as well as `c1 IN (1, NULL)`).
Ok(ColumnarValue::Array(Arc::new(
    array
        .iter()
        .map(|x| x.map(|x| values.contains(&&x)))
I wonder if this handles `NULL` correctly: for a row where `expr` is NULL, the output should be NULL (not true/false). The semantics when there is a literal `NULL` in the IN list are even stranger (but likely could be handled as a follow-on PR).
For example:
sqlite> create table t(c1 int);
sqlite> insert into t values (10);
sqlite> insert into t values (20);
sqlite> insert into t values(NULL);
sqlite> select c1, c1 IN (20, NULL) from t;
10|
20|1
|
sqlite> select c1, c1 IN (20) from t;
10|0
20|1
|
Note that `10 IN (20, NULL)` is actually `NULL` rather than `FALSE`. Crazy.
This approach of mapping the array was suggested by @jorgecarleitao when helping me with the StringExpressions: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/string_expressions.rs#L84. The benefit is that if the input value is `NULL` (i.e. `None`) then we don't have to do any work on it (the second `map`).
I have confirmed against Postgres 13.1 that this is the desired behavior, so that any `NULL` input `expr` should return null:
SELECT NULL IN ('a'); -> NULL
SELECT NULL NOT IN ('a'); -> NULL
SELECT NULL IN (NULL, 'a'); -> NULL
SELECT NULL NOT IN (NULL, 'a'); -> NULL
As to the second problem, `NULL` in the `list` component, it gets even crazier than your example (Postgres 13.1). What a mess.
SELECT 'a' IN (NULL); -> NULL
SELECT 'a' IN (NULL, 'a'); -> TRUE
SELECT 'a' IN (NULL, 'b'); -> NULL
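The cases listed in this thread can be captured in one small function. This is a sketch of the three-valued semantics only, over plain `Option` values rather than Arrow arrays; the name `in_list` and its signature are chosen for this example and are not the PR's implementation.

```rust
/// Three-valued `value IN (list)`. `None` models SQL NULL.
fn in_list<T: PartialEq>(value: Option<T>, list: &[Option<T>]) -> Option<bool> {
    // A NULL input always yields NULL, whatever the list contains.
    let v = value?;
    let mut saw_null = false;
    for item in list {
        match item {
            // A definite match wins even if the list also contains NULL.
            Some(x) if *x == v => return Some(true),
            Some(_) => {}
            None => saw_null = true,
        }
    }
    // No match: the answer is unknown if any item was NULL, else FALSE.
    if saw_null {
        None
    } else {
        Some(false)
    }
}
```

For instance, `in_list(Some("a"), &[None, Some("b")])` gives `None`, mirroring `SELECT 'a' IN (NULL, 'b')` above.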
Note that the clippy error https://github.com/apache/arrow/pull/9038/checks?check_run_id=1632226138 has been fixed on master, so if you rebase this PR against master that CI check should pass.
FWIW I think the |
An alternative implementation would be to translate |
@jhorstmann nice idea! Maybe it would be better to do that in an optimization rule?
@jhorstmann @Dandandan I like the simplicity of this idea but there are a lot of strange cases that need to be considered given how ANSI SQL handles NULL values.
Semantics for the
But you are right, it's not that simple since the arrow |
@jhorstmann I like this approach the best. We could temporarily shelve this PR (and I can do more work on the early validation) whilst these kernels are implemented, then invoke them as in your idea.
FWIW I think having a native The rationale for having |
Ok, I have done a major refactor against a rebased master. I believe this now meets the ANSI behavior with regard to SELECT TRUE IN (col1, col2, FALSE) This has been implemented with a "make it work then make it fast" approach as this |
I reviewed the code, and the tests. The tests are 👍 and I think this is a great initial implementation on which to build. Thank you @seddonm1
}

#[tokio::test]
async fn in_list_scalar() -> Result<()> {
❤️
    .project(vec![col("c1").in_list(list, false)])?
    .build()?;
let execution_plan = plan(&logical_plan)?;
// verify that the plan correctly adds cast from Int64(1) to Utf8
👍
@@ -305,6 +312,7 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> {
     low.as_ref().to_owned(),
     high.as_ref().to_owned(),
 ]),
+Expr::InList { expr, .. } => Ok(vec![expr.as_ref().to_owned()]),
Shouldn't this also include the exprs in `list` as well?
Thanks. I have updated the PR with this.
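For illustration, returning both the compared expression and the list items might look like the sketch below. The `Expr` enum and the `sub_expressions` function here are simplified stand-ins written for this example; the actual change in the PR may differ.

```rust
/// Simplified expression tree (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    InList { expr: Box<Expr>, list: Vec<Expr> },
}

/// Collect the direct children of an expression.
fn sub_expressions(e: &Expr) -> Vec<Expr> {
    match e {
        Expr::InList { expr, list } => {
            // The compared expression comes first, followed by every
            // list item, so visitors and rewriters see all children.
            let mut out = vec![expr.as_ref().to_owned()];
            out.extend(list.iter().cloned());
            out
        }
        // Leaves have no children.
        Expr::Column(_) | Expr::Literal(_) => vec![],
    }
}
```

Keeping a fixed order (value first, then the list) would let a matching rewrite step rebuild the `InList` from the returned vector.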
@@ -416,6 +424,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr>
         Ok(expr)
     }
 }
+Expr::InList { .. } => Ok(expr.clone()),
Likewise here, I think we might want to include the `list`: even though at the moment it only contains constants, it is a `Vec<Expr>`.
Here this is just cloning the whole `InList` expression (not the `expr` in `InList`), as the optimiser is not doing anything for this expression yet.
let list = vec![
    lit(ScalarValue::Float64(Some(0.0))),
    lit(ScalarValue::Float64(Some(0.1))),
    lit(ScalarValue::Utf8(None)),
I don't think it hurts, but given the coercion logic you added in the planner, I think the literals at this point should all be the same type as the expr value. In other words, can you really see a `NOT IN (NULL::Utf8)`?
The literal `Expr::Literal(ScalarValue::Utf8(None))` is a special case in DataFusion at the moment which represents the SQL logical `NULL`. It is being passed through to the evaluation as it is required to identify whether the list contains any literal `NULL`, so that we can override the return value with `NULL`. I think this could be optimised in future.
@seddonm1 looks great, the 'IN' operator is one of the features I have been missing and thinking about implementing myself, but looks like you beat me to it :)
@alamb thanks for taking the time to review this as I know it ended up as quite a large PR 👍 . I have updated based on your comment. @yordan-pavlov yes this is basically as naive implementation as possible and could be heavily optimised. I think we should merge this PR to unblock TPC-H Query 12: |
I'll plan to merge this in as soon as the CI passes.
@@ -656,7 +656,7 @@ fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<Logic
     on
         l_orderkey = o_orderkey
     where
-        (l_shipmode = 'MAIL' or l_shipmode = 'SHIP')
+        l_shipmode in ('MAIL', 'SHIP')
👍
@seddonm1 sadly this PR has some small conflicts -- can you please rebase it so I can merge it in? Thanks again for all your work to get this done.
I also changed the title of this PR so that it doesn't say "WIP" anymore -- as I don't think it is WIP (I hope not, given that I plan to merge it!)
Thanks @alamb. Yes, the WIP was a leftover :D I have rebased, so once the CI passes it should merge!
I filed https://issues.apache.org/jira/browse/ARROW-11182 to track possible improvements to performance.
Thanks again @seddonm1.
This PR is a work-in-progress simple implementation of `InList` (`'ABC' IN ('ABC', 'DEF')`) which currently only operates on strings.
It uses the `kernels::comparison::contains` implementation but there are a few issues I am struggling with:
1. `kernels::comparison::contains` allows each value in the input array to match against potentially different value arrays. My implementation is very inefficiently creating the same array n times to prevent the error of mismatched input lengths (https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs#L696). Is there a more efficient way to create these `ListArray`s?
2. `kernels::comparison::contains` returns `false` if either of the comparison values is `null`. Is this the desired behavior? If not I can modify the kernel to return null instead.
3. If the basic implementation looks correct I can add the rest of the data types (via macros).
Closes apache#9038 from seddonm1/in-list
Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>