Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use coerced type in inlist expr planning #2794

Merged
merged 6 commits into from Jun 27, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Jun 25, 2022

Which issue does this PR close?

Closes #2793
Closes #2787

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Jun 25, 2022
@viirya
Copy link
Member Author

viirya commented Jun 25, 2022

After this change, sql::expr::in_list_array can pass with apache/arrow-rs#1943

For another failed test case sql::expr::test_in_list_scalar, we don't have Utf8 -> Int64 type coerce rule. But I feel that it is another issue. So I will fix it in another PR. There are some tests failed too due to the lack of coerce rule of Utf8 -> Int64, I will add it here together.

cc @alamb

@github-actions github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions labels Jun 25, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 25, 2022

Codecov Report

Merging #2794 (26881a6) into master (7c60412) will decrease coverage by 0.03%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master    #2794      +/-   ##
==========================================
- Coverage   85.14%   85.11%   -0.04%     
==========================================
  Files         273      273              
  Lines       48248    48243       -5     
==========================================
- Hits        41079    41060      -19     
- Misses       7169     7183      +14     
Impacted Files Coverage Δ
datafusion/expr/src/binary_rule.rs 84.28% <75.00%> (-0.22%) ⬇️
datafusion/core/src/physical_plan/planner.rs 80.83% <100.00%> (-0.04%) ⬇️
...atafusion/physical-expr/src/expressions/in_list.rs 71.35% <100.00%> (-4.11%) ⬇️
datafusion/physical-expr/src/planner.rs 92.36% <100.00%> (+0.56%) ⬆️
datafusion/physical-expr/src/expressions/cast.rs 98.00% <0.00%> (-0.40%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 73.91% <0.00%> (ø)
datafusion/core/src/physical_plan/metrics/value.rs 87.43% <0.00%> (+0.50%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c60412...26881a6. Read the comment docs.

})
.collect::<Result<Vec<_>>>()?;

expressions::in_list(value_expr, list_exprs, negated, input_schema)
let (cast_expr, cast_list_exprs) =
in_list_cast(value_expr, list_exprs, input_schema)?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check data type casting for in_list during expression planning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better in my opinion. Thank you @viirya

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya I just go through the code for the data coerced for the InList.
When I want to support decimal in the InList, i created the pr #2764 and added coercion rule for InList in the in_list.rs file.

It make sense to me that you move the coerced rule from inlist to there.

From my option known from codebase of other database, the coercion should be done before generating the physical plan or physical expr.

We have discussed about the coercion rule in #1356 (comment)

Comment on lines -316 to -317
// TODO: Can't cast from list type to value type directly
// We should use the coercion rule to get the common data type
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use type coercion rule now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @viirya -- thank you!

The only thing I am worried about is the change of the planner.rs case so it no longer checks c1 = c2 -- but as long as there is a good reason for that change this PR looks really nice to me

cc @Ted-Jiang and @liukun4515

@@ -185,6 +186,17 @@ fn comparison_order_coercion(
.or_else(|| null_coercion(lhs_type, rhs_type))
}

fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about the rule between string and number.
I check some situation in the spark:

spark-sql> desc t3;
c1                      int

spark-sql> explain extended select * from t3 where c1 = cast(123.123 as string);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('c1 = cast(123.123 as string))
   +- 'UnresolvedRelation [t3], [], false

== Analyzed Logical Plan ==
c1: int
Project [c1#186]
+- Filter (c1#186 = cast(cast(123.123 as string) as int))
   +- SubqueryAlias spark_catalog.default.t3
      +- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

== Optimized Logical Plan ==
Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

== Physical Plan ==
*(1) Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- Scan hive default.t3 [c1#186], HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

In the previous case, the result of coercion is Int.
I think we need to create an issue to track this.
@viirya @alamb

@@ -166,6 +166,7 @@ pub fn comparison_eq_coercion(
.or_else(|| temporal_coercion(lhs_type, rhs_type))
.or_else(|| string_coercion(lhs_type, rhs_type))
.or_else(|| null_coercion(lhs_type, rhs_type))
.or_else(|| string_numeric_coercion(lhs_type, rhs_type))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense -- thanks. 👍

@@ -294,43 +292,18 @@ pub fn create_physical_expr(
input_schema,
execution_props,
),
// TODO refactor the logic of coercion the data type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr resolve some TODO inList for me.
it make sense to me

datafusion/physical-expr/src/planner.rs Show resolved Hide resolved
datafusion/physical-expr/src/planner.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/planner.rs Outdated Show resolved Hide resolved
})
.collect::<Result<Vec<_>>>()?;

expressions::in_list(value_expr, list_exprs, negated, input_schema)
let (cast_expr, cast_list_exprs) =
in_list_cast(value_expr, list_exprs, input_schema)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better in my opinion. Thank you @viirya

@@ -1739,8 +1739,6 @@ mod tests {
col("c1").and(col("c1")),
// u8 AND u8
col("c3").and(col("c3")),
// utf8 = u32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8 and u32 now have coerced type (utf8).

@@ -1887,7 +1884,7 @@ mod tests {
.project(vec![col("c1").in_list(list, false)])?
.build()?;
let execution_plan = plan(&logical_plan).await?;
let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, CastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }], negated: false, set: Some(InSet { set:";
let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }], negated: false, set: None }";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is this expression formatted for anyone else who is interested:

(InListExpr {
  expr: Column { name: \"c1\", index: 0 },
  list: [Literal { value: Utf8(\"a\") },
                 TryCastExpr { expr: Literal { value: Int64(1) },cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 },
                 TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 }, 
                 TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }],
  negated: false,
  set: None
  }

viirya and others added 4 commits June 26, 2022 10:13
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -185,6 +186,17 @@ fn comparison_order_coercion(
.or_else(|| null_coercion(lhs_type, rhs_type))
}

fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test in 748b6a65a5fa801595fd80a3c7b2728be3c9cdaa(not this commit)

explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                      |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment                                                                                                               |
|               |   Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])                                                                                                                                                                                                            |
|               |     TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])]                                                                  |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment]                           |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                             |
|               |     FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }])                                                                 |
|               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                                                                                                                   |
|               |         ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
|               |                                                                                                                                                                                                                                                                           |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

int, int,utf8 cast to -> int, int, int,

In my opinion, after apply this patch it will get int, int,utf8cast to ->utf8, utf8, utf8I think when list_values_size is large, we will construct a hashSet in https://github.com/apache/arrow-datafusion/pull/2156, change toint` will get better performance in build hasSet, Am i right? 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test in this patch

explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                      |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment                                                                                                               |
|               |   Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])                                                                                                                                                                                                            |
|               |     TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])]                                                                  |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment]                           |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                             |
|               |     FilterExec: CAST(p_partkey@0 AS Utf8) IN ([TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, Literal { value: Utf8("3") }])                                                     |
|               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                                                                                                                   |
|               |         ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
|               |                                                                                                                                                                                                                                                                           |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Copy link
Member Author

@viirya viirya Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that you are basically talk the same as #2794 (comment), right?

Actually the string_numeric_coercion rule coerces Utf8 and LargeUtf8 to numeric type in its first version. But a few test cases in sql::expr::test_in_list_scalar failed. For example,

test_expression!("'2' IN ('a','b',1)", "false");

Because 'a' and 'b' cannot converted to int, they will be null. So the result of this in_list expression is null, instead of false now. There are also other similar cases.

So I changed the coercion rule to use Utf8 and LargeUtf8 to more fit with existing logic, without changing too much from existing behavior.

I'm fine if we can get a consensus about if numeric type is more correct for such cases. Then I can change them (the test cases) and the coercion rule.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is somehow following current behavior, I can address it in the other issue #2799.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test in 748b6a65a5fa801595fd80a3c7b2728be3c9cdaa(not this commit)

explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                      |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment                                                                                                               |
|               |   Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])                                                                                                                                                                                                            |
|               |     TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])]                                                                  |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment]                           |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                             |
|               |     FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }])                                                                 |
|               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                                                                                                                   |
|               |         ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
|               |                                                                                                                                                                                                                                                                           |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

int, int,utf8 cast to -> int, int, int,

In my opinion, after apply this patch it will get int, int,utf8cast to ->utf8, utf8, utf8I think when list_values_size is large, we will construct a hashSet in https://github.com/apache/arrow-datafusion/pull/2156, change toint` will get better performance in build hasSet, Am i right? 😄

Yes, the performance is greater for comparing integer.

Now the coercion rule is unstable, we should do mush work in that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine if we can get a consensus about if numeric type is more correct for such cases. Then I can change them (the test cases) and the coercion rule.

I think what is important here is to have a consistent set of semantics. I don't have any particular preference related to the automatic coercion of int, int, utf8 as I think there are different tradeoffs

For example coercing int, int, utf8 to utf8, utf8 utf8 would allow a predicate like c1 IN (1, 2 'foo') to run without error (assuming c1 can be coerced to utf8), but would result in a runtime error if we attempted to automatically coerce to int, int int. However, as @Ted-Jiang notes, the performance will be slower for predicates like c1 IN (1, 2, '3')

Best practice would be to explicitly cast all columns to i32 in the query if they were supposed to be compared as i32:

c1 in (1::smallint, 2::smallint, '3'::smallint)

but I realize that may not be practical for all users. 🤔

@liukun4515 liukun4515 merged commit 7afd437 into apache:master Jun 27, 2022
@viirya
Copy link
Member Author

viirya commented Jun 27, 2022

Thanks @alamb @liukun4515 @Ted-Jiang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use coerced data type from value and list expressions during planning inlist expression
5 participants