Use coerced type in inlist expr planning #2794

viirya · 2022-06-25T20:10:05Z

Which issue does this PR close?

Closes #2793
Closes #2787

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

viirya · 2022-06-25T20:33:24Z

After this change, sql::expr::in_list_array can pass with apache/arrow-rs#1943

For another failed test case sql::expr::test_in_list_scalar, we don't have Utf8 -> Int64 type coerce rule. But I feel that it is another issue. ~~So I will fix it in another PR.~~ There are some tests failed too due to the lack of coerce rule of Utf8 -> Int64, I will add it here together.

cc @alamb

codecov-commenter · 2022-06-25T22:48:10Z

Codecov Report

Merging #2794 (26881a6) into master (7c60412) will decrease coverage by 0.03%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master    #2794      +/-   ##
==========================================
- Coverage   85.14%   85.11%   -0.04%     
==========================================
  Files         273      273              
  Lines       48248    48243       -5     
==========================================
- Hits        41079    41060      -19     
- Misses       7169     7183      +14

Impacted Files	Coverage Δ
datafusion/expr/src/binary_rule.rs	`84.28% <75.00%> (-0.22%)`	⬇️
datafusion/core/src/physical_plan/planner.rs	`80.83% <100.00%> (-0.04%)`	⬇️
...atafusion/physical-expr/src/expressions/in_list.rs	`71.35% <100.00%> (-4.11%)`	⬇️
datafusion/physical-expr/src/planner.rs	`92.36% <100.00%> (+0.56%)`	⬆️
datafusion/physical-expr/src/expressions/cast.rs	`98.00% <0.00%> (-0.40%)`	⬇️
datafusion/expr/src/logical_plan/plan.rs	`73.91% <0.00%> (ø)`
datafusion/core/src/physical_plan/metrics/value.rs	`87.43% <0.00%> (+0.50%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c60412...26881a6. Read the comment docs.

viirya · 2022-06-25T23:30:02Z

datafusion/physical-expr/src/planner.rs

                    })
                    .collect::<Result<Vec<_>>>()?;

-                expressions::in_list(value_expr, list_exprs, negated, input_schema)
+                let (cast_expr, cast_list_exprs) =
+                    in_list_cast(value_expr, list_exprs, input_schema)?;


We should check data type casting for in_list during expression planning.

This is much better in my opinion. Thank you @viirya

@viirya I just go through the code for the data coerced for the InList.
When I want to support decimal in the InList, i created the pr #2764 and added coercion rule for InList in the in_list.rs file.

It make sense to me that you move the coerced rule from inlist to there.

From my option known from codebase of other database, the coercion should be done before generating the physical plan or physical expr.

We have discussed about the coercion rule in #1356 (comment)

viirya · 2022-06-25T23:30:40Z

datafusion/physical-expr/src/planner.rs

-                                // TODO: Can't cast from list type to value type directly
-                                // We should use the coercion rule to get the common data type


Use type coercion rule now.

alamb

Looks good to me @viirya -- thank you!

The only thing I am worried about is the change of the planner.rs case so it no longer checks c1 = c2 -- but as long as there is a good reason for that change this PR looks really nice to me

cc @Ted-Jiang and @liukun4515

alamb · 2022-06-26T12:36:02Z

datafusion/expr/src/binary_rule.rs

@@ -185,6 +186,17 @@ fn comparison_order_coercion(
        .or_else(|| null_coercion(lhs_type, rhs_type))
 }

+fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {


I have some concerns about the rule between string and number.
I check some situation in the spark:

spark-sql> desc t3; c1 int spark-sql> explain extended select * from t3 where c1 = cast(123.123 as string); == Parsed Logical Plan == 'Project [*] +- 'Filter ('c1 = cast(123.123 as string)) +- 'UnresolvedRelation [t3], [], false == Analyzed Logical Plan == c1: int Project [c1#186] +- Filter (c1#186 = cast(cast(123.123 as string) as int)) +- SubqueryAlias spark_catalog.default.t3 +- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []] == Optimized Logical Plan == Filter (isnotnull(c1#186) AND (c1#186 = 123)) +- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []] == Physical Plan == *(1) Filter (isnotnull(c1#186) AND (c1#186 = 123)) +- Scan hive default.t3 [c1#186], HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

In the previous case, the result of coercion is Int.
I think we need to create an issue to track this.
@viirya @alamb

alamb · 2022-06-26T12:39:16Z

datafusion/expr/src/binary_rule.rs

@@ -166,6 +166,7 @@ pub fn comparison_eq_coercion(
        .or_else(|| temporal_coercion(lhs_type, rhs_type))
        .or_else(|| string_coercion(lhs_type, rhs_type))
        .or_else(|| null_coercion(lhs_type, rhs_type))
+        .or_else(|| string_numeric_coercion(lhs_type, rhs_type))


This makes a lot of sense -- thanks. 👍

alamb · 2022-06-26T12:39:43Z

datafusion/physical-expr/src/planner.rs

@@ -294,43 +292,18 @@ pub fn create_physical_expr(
                            input_schema,
                            execution_props,
                        ),
-                        // TODO refactor the logic of coercion the data type


cc @liukun4515

This pr resolve some TODO inList for me.
it make sense to me

datafusion/physical-expr/src/planner.rs

alamb · 2022-06-26T12:45:54Z

datafusion/physical-expr/src/planner.rs

                    })
                    .collect::<Result<Vec<_>>>()?;

-                expressions::in_list(value_expr, list_exprs, negated, input_schema)
+                let (cast_expr, cast_list_exprs) =
+                    in_list_cast(value_expr, list_exprs, input_schema)?;


This is much better in my opinion. Thank you @viirya

alamb · 2022-06-26T12:46:53Z

datafusion/core/src/physical_plan/planner.rs

@@ -1739,8 +1739,6 @@ mod tests {
            col("c1").and(col("c1")),
            // u8 AND u8
            col("c3").and(col("c3")),
-            // utf8 = u32


Why remove this case?

utf8 and u32 now have coerced type (utf8).

alamb · 2022-06-26T12:49:56Z

datafusion/core/src/physical_plan/planner.rs

@@ -1887,7 +1884,7 @@ mod tests {
            .project(vec![col("c1").in_list(list, false)])?
            .build()?;
        let execution_plan = plan(&logical_plan).await?;
-        let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, CastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }], negated: false, set: Some(InSet { set:";
+        let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }], negated: false, set: None }";


Here is this expression formatted for anyone else who is interested:

(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, TryCastExpr { expr: Literal { value: Int64(1) },cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }], negated: false, set: None }

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

liukun4515

LGTM

Ted-Jiang · 2022-06-27T07:08:36Z

datafusion/expr/src/binary_rule.rs

@@ -185,6 +186,17 @@ fn comparison_order_coercion(
        .or_else(|| null_coercion(lhs_type, rhs_type))
 }

+fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
+    use arrow::datatypes::DataType::*;
+    match (lhs_type, rhs_type) {


I test in 748b6a65a5fa801595fd80a3c7b2728be3c9cdaa(not this commit)

explain select * from part where p_partkey in (1, 2, '3'); +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment | | | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) | | | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] | | physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] | | | CoalesceBatchesExec: target_batch_size=4096 | | | FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }]) | | | RepartitionExec: partitioning=RoundRobinBatch(16) | | | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] | | | | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

int, int,utf8 cast to -> int, int, int,

In my opinion, after apply this patch it will get int, int,utf8cast to ->utf8, utf8, utf8I think when list_values_size is large, we will construct a hashSet in https://github.com/apache/arrow-datafusion/pull/2156, change toint` will get better performance in build hasSet, Am i right? 😄

test in this patch

explain select * from part where p_partkey in (1, 2, '3'); +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment | | | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) | | | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] | | physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] | | | CoalesceBatchesExec: target_batch_size=4096 | | | FilterExec: CAST(p_partkey@0 AS Utf8) IN ([TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, Literal { value: Utf8("3") }]) | | | RepartitionExec: partitioning=RoundRobinBatch(16) | | | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] | | | | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I guess that you are basically talk the same as #2794 (comment), right?

Actually the string_numeric_coercion rule coerces Utf8 and LargeUtf8 to numeric type in its first version. But a few test cases in sql::expr::test_in_list_scalar failed. For example,

test_expression!("'2' IN ('a','b',1)", "false");

Because 'a' and 'b' cannot converted to int, they will be null. So the result of this in_list expression is null, instead of false now. There are also other similar cases.

So I changed the coercion rule to use Utf8 and LargeUtf8 to more fit with existing logic, without changing too much from existing behavior.

I'm fine if we can get a consensus about if numeric type is more correct for such cases. Then I can change them (the test cases) and the coercion rule.

As it is somehow following current behavior, I can address it in the other issue #2799.

I test in 748b6a65a5fa801595fd80a3c7b2728be3c9cdaa(not this commit)

explain select * from part where p_partkey in (1, 2, '3'); +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment | | | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) | | | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] | | physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] | | | CoalesceBatchesExec: target_batch_size=4096 | | | FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }]) | | | RepartitionExec: partitioning=RoundRobinBatch(16) | | | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] | | | | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

int, int,utf8 cast to -> int, int, int,

In my opinion, after apply this patch it will get int, int,utf8cast to ->utf8, utf8, utf8I think when list_values_size is large, we will construct a hashSet in https://github.com/apache/arrow-datafusion/pull/2156, change toint` will get better performance in build hasSet, Am i right? 😄

Yes, the performance is greater for comparing integer.

Now the coercion rule is unstable, we should do mush work in that.

I'm fine if we can get a consensus about if numeric type is more correct for such cases. Then I can change them (the test cases) and the coercion rule.

I think what is important here is to have a consistent set of semantics. I don't have any particular preference related to the automatic coercion of int, int, utf8 as I think there are different tradeoffs

For example coercing int, int, utf8 to utf8, utf8 utf8 would allow a predicate like c1 IN (1, 2 'foo') to run without error (assuming c1 can be coerced to utf8), but would result in a runtime error if we attempted to automatically coerce to int, int int. However, as @Ted-Jiang notes, the performance will be slower for predicates like c1 IN (1, 2, '3')

Best practice would be to explicitly cast all columns to i32 in the query if they were supposed to be compared as i32:

c1 in (1::smallint, 2::smallint, '3'::smallint)

but I realize that may not be practical for all users. 🤔

viirya · 2022-06-27T16:37:14Z

Thanks @alamb @liukun4515 @Ted-Jiang

github-actions bot added the physical-expr Physical Expressions label Jun 25, 2022

Use coerced type in inlist expr planning

9f56adf

viirya force-pushed the fix_inlist_types branch from e2df99c to 9f56adf Compare June 25, 2022 20:34

github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions labels Jun 25, 2022

Add coerce rule

5f5259d

viirya force-pushed the fix_inlist_types branch from e020a1b to 5f5259d Compare June 25, 2022 23:28

viirya commented Jun 25, 2022

View reviewed changes

alamb mentioned this pull request Jun 26, 2022

Test arrow without null (DEMO) #2787

Closed

alamb approved these changes Jun 26, 2022

View reviewed changes

viirya and others added 4 commits June 26, 2022 10:13

Update datafusion/physical-expr/src/planner.rs

8b040c6

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Update datafusion/physical-expr/src/planner.rs

e8a5686

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Update datafusion/physical-expr/src/planner.rs

368da33

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Fix test

26881a6

liukun4515 mentioned this pull request Jun 27, 2022

The framework about expression type coercion #1356

Closed

7 tasks

liukun4515 approved these changes Jun 27, 2022

View reviewed changes

liukun4515 mentioned this pull request Jun 27, 2022

coercion rule about eq and InList between string type and numeric type #2799

Open

Ted-Jiang reviewed Jun 27, 2022

View reviewed changes

liukun4515 merged commit 7afd437 into apache:master Jun 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use coerced type in inlist expr planning #2794

Use coerced type in inlist expr planning #2794

viirya commented Jun 25, 2022 •

edited by alamb

viirya commented Jun 25, 2022 •

edited

codecov-commenter commented Jun 25, 2022 •

edited

viirya Jun 25, 2022

alamb Jun 26, 2022

liukun4515 Jun 27, 2022

viirya Jun 25, 2022

alamb left a comment

alamb Jun 26, 2022

liukun4515 Jun 27, 2022

alamb Jun 26, 2022

alamb Jun 26, 2022

liukun4515 Jun 27, 2022

alamb Jun 26, 2022

alamb Jun 26, 2022

viirya Jun 26, 2022

alamb Jun 26, 2022

liukun4515 left a comment

Ted-Jiang Jun 27, 2022

Ted-Jiang Jun 27, 2022

viirya Jun 27, 2022 •

edited

viirya Jun 27, 2022

liukun4515 Jun 27, 2022

alamb Jun 27, 2022

viirya commented Jun 27, 2022

		// TODO: Can't cast from list type to value type directly
		// We should use the coercion rule to get the common data type

Use coerced type in inlist expr planning #2794

Use coerced type in inlist expr planning #2794

Conversation

viirya commented Jun 25, 2022 • edited by alamb

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

viirya commented Jun 25, 2022 • edited

codecov-commenter commented Jun 25, 2022 • edited

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liukun4515 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viirya Jun 27, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viirya commented Jun 27, 2022

viirya commented Jun 25, 2022 •

edited by alamb

viirya commented Jun 25, 2022 •

edited

codecov-commenter commented Jun 25, 2022 •

edited

viirya Jun 27, 2022 •

edited