ARROW-10356: [Rust][DataFusion] Add support for is_in #9038

seddonm1 · 2020-12-28T23:30:40Z

This PR is a work-in-progress simple implementation of InList ('ABC' IN ('ABC', 'DEF')) which currently only operates on strings.

It uses the kernels::comparison::contains implementation but there are a few issues I am struggling with:

kernels::comparison::contains allows each value in the input array to match against potentially different value arrays. My implementation is very inefficiently creating the same array n times to prevent the error of mismatched input lengths (https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs#L696). Is there a more efficient way to create these ListArrays?
kernels::comparison::contains returns false if either of the comparison values is null. Is this the desired behavior? If not I can modify the kernel to return null instead.
If the basic implementation looks correct I can add the rest of the data types (via macros).

github-actions · 2020-12-28T23:44:02Z

https://issues.apache.org/jira/browse/ARROW-10356

Dandandan · 2020-12-29T02:57:10Z

rust/datafusion/src/logical_plan/expr.rs

+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,


I think it might be easier to convert it here already to a vec where each element should have the same datatype,. And we check that while generating it?

I cannot find another example where we do validation like checking same datatypes in the Logical Plan. Most of this type of validation is performed in the Physical Plan: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1650

Yes, I see. Maybe could be a future optimization so that we can convert it to a more efficient representation upfront, and generating an error earlier when it can not be executed.

I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).

You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118

And coerce the in list items all to the same types

Dandandan · 2020-12-29T03:04:31Z

rust/datafusion/src/logical_plan/expr.rs

+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,


We might keep negated out and use not instead?

This helps keeping the logical plan simple, and also makes future code that uses the LP tree simple, e.g. an optimization rule on not(..)

I mainly included negated to allow pretty printing like: 'z' NOT IN ('x','y'). I have changed this so it now uses the not expr so will now display NOT 'z' IN ('x','y')

I think supporting sql style NOT IN would be nice (though no changes needed in this PR)

Would be nice indeed for a next PR, I think we could have a special case to match on Not (ListIn (...) in the formatter instead 👍

I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between c NOT IN (...) and NOT c IN (...) FWIW that might require representing them differently

hm ok... in that case my initial suggestion might have been wrong... would good to have some tests for this

Thanks for the comments. I have done some testing with Postgres 13.1 and found that it does not appear to make a difference. These are all equivalent and return NULL.

SELECT NOT NULL IN ('a'); SELECT NULL NOT IN ('a'); SELECT NOT 'a' IN (NULL); SELECT 'a' NOT IN (NULL);

Ok, thanks @seddonm1 for checking . sounds good to me

Dandandan · 2020-12-29T03:07:06Z

rust/datafusion/src/physical_plan/expressions.rs

+            .iter()
+            .any(|dt| *dt != value_data_type)
+        {
+            return Err(DataFusionError::Internal(format!(


I think we should do this earlier already when creating/checking the logical plan.

I suggest that the appropriate place would be to "coerce" all the in list item types to the same data type during Logical --> Physical plan creation.

Dandandan · 2020-12-29T03:17:44Z

I think this is a great start @seddonm1 !

Just some thoughts: an "ideal" implementation would convert the items upfront to an Vec<i64>, Vec<String>, etc on creation, so less work would be needed when executing the plan. Also, at some threshold for bigger arrays we could create a HashSet<i64>, HashSet<String> etc instead and write a kernel using a this HashSet.

seddonm1 · 2020-12-29T23:47:25Z

Thanks @Dandandan

Do you think we should re-evaluate the current behavior of the kernels::comparison::contains based on my comments 1 and 2?

Creating the same array n times feels very inefficient if we can modify the kernel.

Dandandan · 2020-12-30T09:04:06Z

Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.
I believe the current behavior related to NULL is correct, just as field=NULL will also be false.

seddonm1 · 2020-12-30T22:26:03Z

Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.

Agree. I can have a look as part of this PR.

I believe the current behavior related to NULL is correct, just as field=NULL will also be false.

I have tested in Postgres and it will return NULL if value is NULL unlike this implementation which returns false.

seddonm1 · 2020-12-31T01:07:47Z

I have updated this PR with a reimplementation of the logic so that the kernel which has two undesired behaviour (see points 1 and 2) is no longer invoked. It should also support the full range of types as well.

alamb · 2020-12-31T13:18:03Z

The full set of Rust CI tests did not run on this PR :(

Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do?

I apologize for the inconvenience.

seddonm1 · 2021-01-01T01:54:36Z

@Dandandan @alamb rebased and added some tests.

codecov-io · 2021-01-01T02:14:15Z

Codecov Report

Merging #9038 (96cd76d) into master (98159f1) will decrease coverage by 0.03%.
The diff coverage is 77.75%.

@@            Coverage Diff             @@
##           master    #9038      +/-   ##
==========================================
- Coverage   82.60%   82.57%   -0.04%     
==========================================
  Files         204      204              
  Lines       50496    50879     +383     
==========================================
+ Hits        41713    42013     +300     
- Misses       8783     8866      +83

Impacted Files	Coverage Δ
rust/arrow-pyarrow-integration-testing/src/lib.rs	`0.00% <ø> (ø)`
rust/benchmarks/src/bin/tpch.rs	`7.02% <ø> (ø)`
...datafusion/src/physical_plan/string_expressions.rs	`87.50% <ø> (ø)`
rust/datafusion/src/optimizer/utils.rs	`58.18% <45.45%> (-0.54%)`	⬇️
rust/datafusion/src/sql/utils.rs	`53.92% <47.05%> (-0.68%)`	⬇️
rust/datafusion/src/physical_plan/functions.rs	`78.51% <63.63%> (-1.49%)`	⬇️
rust/arrow/src/compute/kernels/sort.rs	`93.56% <68.62%> (+0.14%)`	⬆️
rust/datafusion/src/logical_plan/expr.rs	`76.92% <77.27%> (+0.02%)`	⬆️
rust/datafusion/src/physical_plan/expressions.rs	`83.77% <78.53%> (-0.71%)`	⬇️
rust/arrow/src/ffi.rs	`75.95% <80.00%> (+0.28%)`	⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0050795...96cd76d. Read the comment docs.

alamb

This looks like great work -- thanks @seddonm1 !

I think starting with basic functionality and then making it faster / more full featured is a great idea.

For this PR in particular, I think the minimum required work would be:

Tests for the other data types and null handling (even if the null handling doesn't strictly follow SQL)

Bonus points for the following (and they would be fine to file as follow on PRs):

Type coercion / checking for the types of the in list during planning time
ANSI null handling semantics
More optimized runtime implementation (e.g. optimizing the comparisons / make a hashset, etc).

All in all, this is a great start. Thank you @seddonm1

alamb · 2021-01-01T11:39:05Z

rust/datafusion/src/logical_plan/expr.rs

+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,


I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).

You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118

And coerce the in list items all to the same types

alamb · 2021-01-01T11:41:45Z

rust/datafusion/src/physical_plan/expressions.rs

+            .iter()
+            .any(|dt| *dt != value_data_type)
+        {
+            return Err(DataFusionError::Internal(format!(


I suggest that the appropriate place would be to "coerce" all the in list item types to the same data type during Logical --> Physical plan creation.

alamb · 2021-01-01T11:43:07Z

rust/datafusion/src/physical_plan/expressions.rs

+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")


this should probably generate an error earlier in planing too (e.g. if you see an expression like my_col IN (my_other_col, 'foo') )

alamb · 2021-01-01T11:43:52Z

rust/datafusion/src/sql/planner.rs

+            } => {
+                let list_expr = list
+                    .iter()
+                    .map(|e| self.sql_expr_to_logical_expr(e))


Here is where I think you could add the type coercion / checking logic

alamb · 2021-01-01T11:46:16Z

rust/datafusion/tests/sql.rs

@@ -1849,3 +1849,45 @@ async fn string_expressions() -> Result<()> {
    assert_eq!(expected, actual);
    Ok(())
 }
+
+#[tokio::test]


Since there is also support in this PR for numeric types, I would also suggest some basic tests for IN lists with numbers as well (e.g. c1 IN (1, 2 3) as well as c1 IN (1, NULL)

alamb · 2021-01-01T11:56:03Z

rust/datafusion/src/physical_plan/expressions.rs

+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))


I wonder if this handles NULL correctly -- like for a value of where expr is NULL the output should be NULL (not true/false). The semantics when there is a literal NULL in the inlist are even stranger (but likely could be handled as a follow on PR)

For example:

sqlite> create table t(c1 int); sqlite> insert into t values (10); sqlite> insert into t values (20); sqlite> insert into t values(NULL); sqlite> select c1, c1 IN (20, NULL) from t; 10| 20|1 | sqlite> select c1, c1 IN (20) from t; 10|0 20|1 |

Note that 10 IN (20, NULL) is actually NULL rather than FALSE. Crazy

This approach of mapping the array was suggested by @jorgecarleitao when helping me with the StringExpressions: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/string_expressions.rs#L84. The benefit is that if the input value is NULL (i.e. None) then we don't have to do any work on it (the second map).

I have confirmed this is the desired behavior against Postgres 13.1 so that any NULL input expr should return null:

SELECT NULL IN ('a'); -> NULL SELECT NULL NOT IN ('a'); -> NULL SELECT NULL IN (NULL, 'a'); -> NULL SELECT NULL NOT IN (NULL, 'a'); -> NULL

As to the second problem, NULL in the list component it gets even crazier than your example (Postgres 13.1). What a mess.

SELECT 'a' IN (NULL); -> NULL SELECT 'a' IN (NULL, 'a'); -> TRUE SELECT 'a' IN (NULL, 'b'); -> NULL

alamb · 2021-01-01T12:02:06Z

Note that the clippy error https://github.com/apache/arrow/pull/9038/checks?check_run_id=1632226138 has been fixed on master so if you rebase this PR against master that CI check should pass

alamb · 2021-01-01T12:17:56Z

kernels::comparison::contains returns false if either of the comparison values is null. Is this the desired behavior? If not I can modify the kernel to return null instead.

FWIW I think the contains kernel should return NULL in these cases

jhorstmann · 2021-01-01T12:30:35Z

An alternative implementation would be to translate x IN ('ABC', 'DEF', 'GHI') into (x = 'ABC') OR (x = 'DEF') OR (x = 'GHI') on the logical plan level. That should handle nulls correctly and for a short set of values it should also have good performance since there are special comparison kernels for comparing against literals. For a larger set of values the contains might be more performant because of the possibility for returning early.

Dandandan · 2021-01-01T17:25:35Z

@jhorstmann nice idea! Maybe it would be better to do that in an optimization rule?

seddonm1 · 2021-01-01T22:11:04Z

@jhorstmann @Dandandan I like the simplicity of this idea but there are a lot of strange cases that need to be considered given how ANSI SQL handles NULL values.

jhorstmann · 2021-01-01T22:33:50Z

Semantics for the IN and OR versions should be the same in SQL, for example try the following query in postgres:

SELECT 'abc' in ('abc', 'def')
     , 'abc' in ('abc', 'def', null)
     , 'abc' in ('def', null)
     , null in ('abc', 'def')
     -- same expressions rewritten using OR
     , (('abc' = 'abc') OR ('abc' = 'def'))
     , (('abc' = 'abc') OR ('abc' = 'def') or ('abc' = null))
     , (('abc' = 'def') or ('abc' = null))
     , ((null = 'abc') OR (null = 'def'))

result is true, no nulls involved
result is true, additional null on the rhs does not change this since (true OR null) = true
result is null since (false OR null) IS NULL
result is null since lhs of each comparison is null

But you are right, it's not that simple since the arrow or / and kernels currently do not follow this sql behaviour. We might want to change that or rather introduce separate boolean kernels with the sql behaviour regarding nulls.

seddonm1 · 2021-01-01T22:44:43Z

But you are right, it's not that simple since the arrow or / and kernels currently do not follow this sql behaviour. We might want to change that or rather introduce separate boolean kernels with the sql behaviour regarding nulls.

@jhorstmann I like this approach the best. We temporarily shelve this PR (and I can do more work on the early validation) whilst these kernels are implemented then invoke them like your idea.

alamb · 2021-01-02T12:35:18Z

FWIW I think having a native col IN (constant list) is a useful thing to have (eventually). On other words I do like the direction of this PR

The rationale for having col IN (list) type predicates is that they make recognizing the special case of "single column predicates" more easily (e.g. so you can split them off and push them down the plans into scans)

seddonm1 · 2021-01-04T21:16:02Z

Ok, I have done a major refactor against a rebased master.

I believe this now meets the ANSI behavior with regard to NULL handling but it does not yet support syntax where columns are referenced in the list parameter like:

SELECT TRUE IN (col1, col2, FALSE)

This has been implemented with a "make it work then make it fast" approach as this InList expression actually has a lot of complexity.

alamb

I reviewed the code, and the tests. The tests are 👍 and I think this is a great initial implementation on which to build. Thank you @seddonm1

alamb · 2021-01-05T21:04:25Z

rust/datafusion/tests/sql.rs

+}
+
+#[tokio::test]
+async fn in_list_scalar() -> Result<()> {


alamb · 2021-01-05T21:05:20Z

rust/datafusion/src/physical_plan/planner.rs

+            .project(vec![col("c1").in_list(list, false)])?
+            .build()?;
+        let execution_plan = plan(&logical_plan)?;
+        // verify that the plan correctly adds cast from Int64(1) to Utf8


alamb · 2021-01-05T21:08:20Z

rust/datafusion/src/optimizer/utils.rs

@@ -305,6 +312,7 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> {
            low.as_ref().to_owned(),
            high.as_ref().to_owned(),
        ]),
+        Expr::InList { expr, .. } => Ok(vec![expr.as_ref().to_owned()]),


Shouldn't this also include the exprs in list as well ?

Thanks. I have updated the PR with this.

alamb · 2021-01-05T21:09:30Z

rust/datafusion/src/optimizer/utils.rs

@@ -416,6 +424,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr>
                Ok(expr)
            }
        }
+        Expr::InList { .. } => Ok(expr.clone()),


likewise here, I think we might want to include the list -- even though at the moment it only contains constants, it is a Vec<Expr>

Here this is just cloning the while InList expression (not the expr in InList) as the optimiser is not doing anything for this Expression yet.

alamb · 2021-01-05T21:15:00Z

rust/datafusion/src/physical_plan/expressions.rs

+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+            lit(ScalarValue::Utf8(None)),


I don't think it hurts, but given the coercion logic you added in the planner, I think the literals at this point should all be the same type as the expr value. In other words, can you really see a NOT IN (NULL::Utf8)?

The literal Expr::Literal(ScalarValue::Utf8(None)) is a special case in DataFusion at the moment which represents the SQL logical NULL. It is being passed through to the evaluation as it is required to identify whether the list contains any literal NULL so that we can override the return value with NULL. I think this could be optimised in future.

yordan-pavlov · 2021-01-06T14:04:55Z

@seddonm1 looks great, the 'IN' operator is one of the features I have been missing and thinking about implementing myself but looks like you beat me to it :)
my only concern is that the current implementation doesn't appear to make use of SIMD; have you looked into comparing performance against an expression of the form (eq(array, scalar1) or eq(array, scalar2) or ... or eq(array, scalarN))?

seddonm1 · 2021-01-06T20:43:12Z

@alamb thanks for taking the time to review this as I know it ended up as quite a large PR 👍 . I have updated based on your comment.

@yordan-pavlov yes this is basically as naive implementation as possible and could be heavily optimised. I think we should merge this PR to unblock TPC-H Query 12: l_shipmode in ('MAIL', 'SHIP') then look at optimisation. The test cases should help with any future optimisation work anyway.

alamb · 2021-01-06T23:31:59Z

I'll plan to merge this in as soon as the CI passes

alamb · 2021-01-06T23:32:06Z

rust/benchmarks/src/bin/tpch.rs

@@ -656,7 +656,7 @@ fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<Logic
            on
                l_orderkey = o_orderkey
            where
-                (l_shipmode = 'MAIL' or l_shipmode = 'SHIP')
+                l_shipmode in ('MAIL', 'SHIP')


alamb · 2021-01-07T16:53:49Z

@seddonm1 sadly this PR has some small conflicts -- can you please rebase it so I can merge it in? Thanks again for all your work to get this done.

alamb · 2021-01-07T17:04:45Z

I also changed the title of this PR so that it doesn't say "WIP" anymore -- as I don't think it is WIP (I hope not, given that I plan to merge it!)

seddonm1 · 2021-01-07T20:31:15Z

Thanks @alamb . Yes the WIP was a leftover :D I have rebased so once the CI passes it should merge!

alamb · 2021-01-08T11:49:42Z

I filed https://issues.apache.org/jira/browse/ARROW-11182 to track possible improvements to performance

alamb · 2021-01-08T11:52:24Z

Thanks again @seddonm1 .

This PR is a work-in-progress simple implementation of `InList` (`'ABC' IN ('ABC', 'DEF')`) which currently only operates on strings. It uses the `kernels::comparison::contains` implementation but there are a few issues I am struggling with: 1. `kernels::comparison::contains` allows each value in the input array to match against potentially different value arrays. My implementation is very inefficiently creating the same array n times to prevent the error of mismatched input lengths (https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs#L696). Is there a more efficient way to create these `ListArray`s? 2. `kernels::comparison::contains` returns `false` if either of the comparison values is `null`. Is this the desired behavior? If not I can modify the kernel to return null instead. 3. If the basic implementation looks correct I can add the rest of the data types (via macros). Closes apache#9038 from seddonm1/in-list Authored-by: Mike Seddon <seddonm1@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

github-actions bot added Component: Rust - DataFusion Component: Rust labels Dec 28, 2020

Dandandan reviewed Dec 29, 2020

View reviewed changes

seddonm1 force-pushed the in-list branch from 1ca78e3 to ae61b57 Compare January 1, 2021 01:53

alamb reviewed Jan 1, 2021

View reviewed changes

seddonm1 force-pushed the in-list branch from ae61b57 to 119c021 Compare January 1, 2021 21:41

seddonm1 force-pushed the in-list branch from 119c021 to 7a44a99 Compare January 4, 2021 21:10

alamb approved these changes Jan 5, 2021

View reviewed changes

alamb reviewed Jan 6, 2021

View reviewed changes

alamb changed the title ~~ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)~~ ARROW-10356: [Rust][DataFusion] Add support for is_in Jan 7, 2021

in_list

96cd76d

seddonm1 force-pushed the in-list branch from 9e8f888 to 96cd76d Compare January 7, 2021 20:30

alamb closed this in c4ee536 Jan 8, 2021

asfimport mentioned this pull request Jan 10, 2021

[Rust] [DataFusion] Add support for is_in #26342

Closed

ARROW-10356: [Rust][DataFusion] Add support for is_in #9038

ARROW-10356: [Rust][DataFusion] Add support for is_in #9038

Conversation

seddonm1 commented Dec 28, 2020

github-actions bot commented Dec 28, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan Jan 1, 2021 • edited Loading

Choose a reason for hiding this comment

alamb Jan 1, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan commented Dec 29, 2020

seddonm1 commented Dec 29, 2020

Dandandan commented Dec 30, 2020

seddonm1 commented Dec 30, 2020 • edited Loading

seddonm1 commented Dec 31, 2020

alamb commented Dec 31, 2020

seddonm1 commented Jan 1, 2021

codecov-io commented Jan 1, 2021 • edited Loading

Codecov Report

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb Jan 1, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Jan 1, 2021

alamb commented Jan 1, 2021

jhorstmann commented Jan 1, 2021

Dandandan commented Jan 1, 2021

seddonm1 commented Jan 1, 2021

jhorstmann commented Jan 1, 2021

seddonm1 commented Jan 1, 2021

alamb commented Jan 2, 2021

seddonm1 commented Jan 4, 2021

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yordan-pavlov commented Jan 6, 2021 • edited Loading

seddonm1 commented Jan 6, 2021

alamb commented Jan 6, 2021

Choose a reason for hiding this comment

alamb commented Jan 7, 2021

alamb commented Jan 7, 2021 • edited Loading

seddonm1 commented Jan 7, 2021

alamb commented Jan 8, 2021

alamb commented Jan 8, 2021

Dandandan Jan 1, 2021 •

edited

Loading

alamb Jan 1, 2021 •

edited

Loading

seddonm1 commented Dec 30, 2020 •

edited

Loading

codecov-io commented Jan 1, 2021 •

edited

Loading

alamb Jan 1, 2021 •

edited

Loading

yordan-pavlov commented Jan 6, 2021 •

edited

Loading

alamb commented Jan 7, 2021 •

edited

Loading