New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use coerced type in inlist expr planning #2794
Conversation
After this change, For another failed test case cc @alamb |
Codecov Report
@@ Coverage Diff @@
## master #2794 +/- ##
==========================================
- Coverage 85.14% 85.11% -0.04%
==========================================
Files 273 273
Lines 48248 48243 -5
==========================================
- Hits 41079 41060 -19
- Misses 7169 7183 +14
Continue to review full report at Codecov.
|
}) | ||
.collect::<Result<Vec<_>>>()?; | ||
|
||
expressions::in_list(value_expr, list_exprs, negated, input_schema) | ||
let (cast_expr, cast_list_exprs) = | ||
in_list_cast(value_expr, list_exprs, input_schema)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check data type casting for in_list during expression planning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better in my opinion. Thank you @viirya
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I just go through the code for the data coerced for the InList.
When I want to support decimal in the InList, i created the pr #2764 and added coercion rule for InList in the in_list.rs
file.
It make sense to me that you move the coerced rule from inlist
to there.
From my option known from codebase of other database, the coercion should be done before generating the physical plan or physical expr.
We have discussed about the coercion rule in #1356 (comment)
// TODO: Can't cast from list type to value type directly | ||
// We should use the coercion rule to get the common data type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use type coercion rule now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me @viirya -- thank you!
The only thing I am worried about is the change of the planner.rs case so it no longer checks c1 = c2
-- but as long as there is a good reason for that change this PR looks really nice to me
cc @Ted-Jiang and @liukun4515
@@ -185,6 +186,17 @@ fn comparison_order_coercion( | |||
.or_else(|| null_coercion(lhs_type, rhs_type)) | |||
} | |||
|
|||
fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about the rule between string and number.
I check some situation in the spark:
spark-sql> desc t3;
c1 int
spark-sql> explain extended select * from t3 where c1 = cast(123.123 as string);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('c1 = cast(123.123 as string))
+- 'UnresolvedRelation [t3], [], false
== Analyzed Logical Plan ==
c1: int
Project [c1#186]
+- Filter (c1#186 = cast(cast(123.123 as string) as int))
+- SubqueryAlias spark_catalog.default.t3
+- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]
== Optimized Logical Plan ==
Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]
== Physical Plan ==
*(1) Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- Scan hive default.t3 [c1#186], HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]
In the previous case, the result of coercion is Int
.
I think we need to create an issue to track this.
@viirya @alamb
@@ -166,6 +166,7 @@ pub fn comparison_eq_coercion( | |||
.or_else(|| temporal_coercion(lhs_type, rhs_type)) | |||
.or_else(|| string_coercion(lhs_type, rhs_type)) | |||
.or_else(|| null_coercion(lhs_type, rhs_type)) | |||
.or_else(|| string_numeric_coercion(lhs_type, rhs_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a lot of sense -- thanks. 👍
@@ -294,43 +292,18 @@ pub fn create_physical_expr( | |||
input_schema, | |||
execution_props, | |||
), | |||
// TODO refactor the logic of coercion the data type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @liukun4515
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pr resolve some TODO inList for me.
it make sense to me
}) | ||
.collect::<Result<Vec<_>>>()?; | ||
|
||
expressions::in_list(value_expr, list_exprs, negated, input_schema) | ||
let (cast_expr, cast_list_exprs) = | ||
in_list_cast(value_expr, list_exprs, input_schema)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better in my opinion. Thank you @viirya
@@ -1739,8 +1739,6 @@ mod tests { | |||
col("c1").and(col("c1")), | |||
// u8 AND u8 | |||
col("c3").and(col("c3")), | |||
// utf8 = u32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utf8 and u32 now have coerced type (utf8).
@@ -1887,7 +1884,7 @@ mod tests { | |||
.project(vec![col("c1").in_list(list, false)])? | |||
.build()?; | |||
let execution_plan = plan(&logical_plan).await?; | |||
let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, CastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }], negated: false, set: Some(InSet { set:"; | |||
let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }], negated: false, set: None }"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is this expression formatted for anyone else who is interested:
(InListExpr {
expr: Column { name: \"c1\", index: 0 },
list: [Literal { value: Utf8(\"a\") },
TryCastExpr { expr: Literal { value: Int64(1) },cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }],
negated: false,
set: None
}
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -185,6 +186,17 @@ fn comparison_order_coercion( | |||
.or_else(|| null_coercion(lhs_type, rhs_type)) | |||
} | |||
|
|||
fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | |||
use arrow::datatypes::DataType::*; | |||
match (lhs_type, rhs_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test in 748b6a65a5fa801595fd80a3c7b2728be3c9cdaa
(not this commit)
explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment |
| | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) |
| | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] |
| | CoalesceBatchesExec: target_batch_size=4096 |
| | FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }]) |
| | RepartitionExec: partitioning=RoundRobinBatch(16) |
| | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
int, int,utf8
cast to -> int, int, int
,
In my opinion, after apply this patch it will get int, int,utf8cast to ->
utf8, utf8, utf8I think when list_values_size is large, we will construct a hashSet in https://github.com/apache/arrow-datafusion/pull/2156, change to
int` will get better performance in build hasSet, Am i right? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test in this patch
explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment |
| | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) |
| | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] |
| | CoalesceBatchesExec: target_batch_size=4096 |
| | FilterExec: CAST(p_partkey@0 AS Utf8) IN ([TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, Literal { value: Utf8("3") }]) |
| | RepartitionExec: partitioning=RoundRobinBatch(16) |
| | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
| | |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that you are basically talk the same as #2794 (comment), right?
Actually the string_numeric_coercion
rule coerces Utf8 and LargeUtf8 to numeric type in its first version. But a few test cases in sql::expr::test_in_list_scalar
failed. For example,
test_expression!("'2' IN ('a','b',1)", "false");
Because 'a' and 'b' cannot converted to int, they will be null. So the result of this in_list expression is null, instead of false
now. There are also other similar cases.
So I changed the coercion rule to use Utf8 and LargeUtf8 to more fit with existing logic, without changing too much from existing behavior.
I'm fine if we can get a consensus about if numeric type is more correct for such cases. Then I can change them (the test cases) and the coercion rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is somehow following current behavior, I can address it in the other issue #2799.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test in
748b6a65a5fa801595fd80a3c7b2728be3c9cdaa
(not this commit)explain select * from part where p_partkey in (1, 2, '3'); +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment | | | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) | | | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] | | physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] | | | CoalesceBatchesExec: target_batch_size=4096 | | | FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }]) | | | RepartitionExec: partitioning=RoundRobinBatch(16) | | | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] | | | | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
int, int,utf8
cast to ->int, int, int
,In my opinion, after apply this patch it will get int, int,utf8
cast to ->
utf8, utf8, utf8I think when list_values_size is large, we will construct a hashSet in https://github.com/apache/arrow-datafusion/pull/2156, change to
int` will get better performance in build hasSet, Am i right? 😄
Yes, the performance is greater for comparing integer.
Now the coercion rule is unstable, we should do mush work in that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine if we can get a consensus about if numeric type is more correct for such cases. Then I can change them (the test cases) and the coercion rule.
I think what is important here is to have a consistent set of semantics. I don't have any particular preference related to the automatic coercion of int, int, utf8
as I think there are different tradeoffs
For example coercing int, int, utf8
to utf8, utf8 utf8
would allow a predicate like c1 IN (1, 2 'foo')
to run without error (assuming c1 can be coerced to utf8
), but would result in a runtime error if we attempted to automatically coerce to int, int int
. However, as @Ted-Jiang notes, the performance will be slower for predicates like c1 IN (1, 2, '3')
Best practice would be to explicitly cast all columns to i32
in the query if they were supposed to be compared as i32
:
c1 in (1::smallint, 2::smallint, '3'::smallint)
but I realize that may not be practical for all users. 🤔
Thanks @alamb @liukun4515 @Ted-Jiang |
Which issue does this PR close?
Closes #2793
Closes #2787
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?