support decimal data type for the optimizer rule of PreCastLitInComparisonExpressions #3245

Merged
merged 5 commits into apache:master from the support_decimal_type_#3031 branch on Aug 27, 2022

@liukun4515 liukun4515 commented Aug 24, 2022

Which issue does this PR close?

part of #3031

Rationale for this change

What changes are included in this PR?

In our case, we have many columns with the decimal data type. Filters are commonly applied to these decimal columns, for example:

select xxx from t where c1 = literal

Here c1 has a decimal data type, and the literal may have an integer or a decimal data type.

After this PR, the optimizer will try to cast the literal to the type of c1.
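
To make the cast more concrete, here is a minimal sketch of rescaling a literal to the scale of a decimal column. It is not the code from this PR: `rescale_to_decimal` and its signature are hypothetical, and a decimal is modelled only as an unscaled `i128` plus a scale. An integer literal is assumed to behave like a decimal with scale 0.

```rust
/// Hypothetical helper: rescale an unscaled value from `lit_scale` to
/// `target_scale`. An integer literal is treated as a decimal with scale 0.
/// Returns `None` when the value cannot be represented exactly (overflow,
/// or non-zero digits would be dropped); a rule built on this would then
/// leave the expression unchanged.
fn rescale_to_decimal(value: i128, lit_scale: u32, target_scale: u32) -> Option<i128> {
    if target_scale >= lit_scale {
        // Scaling up: multiply by 10^(target_scale - lit_scale).
        let factor = 10_i128.checked_pow(target_scale - lit_scale)?;
        value.checked_mul(factor)
    } else {
        // Scaling down: exact only when the dropped digits are all zero.
        let factor = 10_i128.checked_pow(lit_scale - target_scale)?;
        if value % factor == 0 {
            Some(value / factor)
        } else {
            None
        }
    }
}
```

Under these assumptions, `rescale_to_decimal(12, 0, 0)` yields `Some(12)`, and `rescale_to_decimal(12, 0, 2)` yields `Some(1200)`, the unscaled form of 12.00. A literal such as 12.55 (unscaled 1255, scale 2) cannot be rescaled to scale 1 without losing a digit, so the sketch returns `None`.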

For example, given this table:

❯ \d aggregate_test_100
+---------------+--------------+--------------------+-------------+-------------------+-------------+
| table_catalog | table_schema | table_name         | column_name | data_type         | is_nullable |
+---------------+--------------+--------------------+-------------+-------------------+-------------+
| datafusion    | public       | aggregate_test_100 | c1          | Utf8              | NO          |
| datafusion    | public       | aggregate_test_100 | c2          | Int32             | NO          |
| datafusion    | public       | aggregate_test_100 | c3          | Decimal128(12, 0) | NO          |
| datafusion    | public       | aggregate_test_100 | c4          | Int16             | NO          |
| datafusion    | public       | aggregate_test_100 | c5          | Int32             | NO          |
| datafusion    | public       | aggregate_test_100 | c6          | Int64             | NO          |
| datafusion    | public       | aggregate_test_100 | c7          | Int16             | NO          |
| datafusion    | public       | aggregate_test_100 | c8          | Int32             | NO          |
| datafusion    | public       | aggregate_test_100 | c9          | Int64             | NO          |
| datafusion    | public       | aggregate_test_100 | c10         | Utf8              | NO          |
| datafusion    | public       | aggregate_test_100 | c11         | Float32           | NO          |
| datafusion    | public       | aggregate_test_100 | c12         | Float64           | NO          |
| datafusion    | public       | aggregate_test_100 | c13         | Utf8              | NO          |
+---------------+--------------+--------------------+-------------+-------------------+-------------+

The query plan:

❯ explain select c1,c3 from aggregate_test_100 where c3  = 12;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                         |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #aggregate_test_100.c1, #aggregate_test_100.c3                                                                                                   |
|               |   Filter: #aggregate_test_100.c3 = Decimal128(Some(12),12,0)                                                                                                 |
|               |     TableScan: aggregate_test_100 projection=[c1, c3], partial_filters=[#aggregate_test_100.c3 = Decimal128(Some(12),12,0)]                                  |
| physical_plan | ProjectionExec: expr=[c1@0 as c1, c3@1 as c3]                                                                                                                |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                |
|               |     FilterExec: c3@1 = Some(12),12,0                                                                                                                         |
|               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                      |
|               |         CsvExec: files=[Users/kliu3/Documents/github/arrow-datafusion/target/debug/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1, c3] |
|               |                                                                                                                                                              |
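+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+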

With explain verbose:

❯ explain verbose select c1,c3 from aggregate_test_100 where c3  = 12;
+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                             | plan                                                                                                                                                         |
+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                  | Projection: #aggregate_test_100.c1, #aggregate_test_100.c3                                                                                                   |
|                                                       |   Filter: #aggregate_test_100.c3 = Int64(12)                                                                                                                 |
|                                                       |     TableScan: aggregate_test_100                                                                                                                            |
| logical_plan after simplify_expressions               | SAME TEXT AS ABOVE                                                                                                                                           |
| logical_plan after pre_cast_lit_in_comparison         | Projection: #aggregate_test_100.c1, #aggregate_test_100.c3                                                                                                   |
|                                                       |   Filter: #aggregate_test_100.c3 = Decimal128(Some(12),12,0)                                                                                                 |
|                                                       |     TableScan: aggregate_test_100

Are there any user-facing changes?

codecov-commenter commented Aug 24, 2022

Codecov Report

Merging #3245 (3c64aef) into master (92110dd) will increase coverage by 0.02%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master    #3245      +/-   ##
==========================================
+ Coverage   85.89%   85.92%   +0.02%     
==========================================
  Files         294      294              
  Lines       53373    53442      +69     
==========================================
+ Hits        45845    45918      +73     
+ Misses       7528     7524       -4     
| Impacted Files | Coverage Δ |
|---|---|
| ...fusion/optimizer/src/pre_cast_lit_in_comparison.rs | 91.95% <91.66%> (+8.62%) ⬆️ |
| datafusion/expr/src/logical_plan/plan.rs | 78.55% <0.00%> (-0.18%) ⬇️ |
| datafusion/core/src/datasource/file_format/csv.rs | 98.91% <0.00%> (ø) |
| datafusion/core/tests/sql/create_drop.rs | 95.74% <0.00%> (+0.45%) ⬆️ |

target_type: &DataType,
) -> Result<bool> {
if integer_lit_value.is_null() {
) -> Result<(bool, Option<ScalarValue>)> {
Member

It looks like this function always returns either (true, Some(_)) or (false, None) so maybe it should just return the Option without the bool?

Contributor Author

Good suggestion, done.
PTAL @andygrove
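
The suggestion in this thread can be illustrated with a small sketch. The function below is a hypothetical stand-in, not the function under review: it only shows that when a `bool` flag always mirrors whether the `Option` is `Some`, returning the `Option` alone carries the same information and leaves the caller a single thing to check.

```rust
/// Hypothetical stand-in for the reviewed function: try to fit an
/// integer literal into an `i8` target. `None` means "cannot cast".
fn try_cast_to_i8(value: i64) -> Option<i8> {
    i8::try_from(value).ok()
}

/// A caller inspects one value instead of a flag plus a payload.
fn describe(value: i64) -> String {
    match try_cast_to_i8(value) {
        Some(v) => format!("cast to Int8({})", v),
        None => "left unchanged".to_string(),
    }
}
```

With a `(bool, Option<_>)` return type, the states `(true, None)` and `(false, Some(_))` would be representable but meaningless; the plain `Option` rules them out by construction.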

match lit_value_target_type {
    None => Ok((false, None)),
    Some(value) => {
        match value >= target_min && value <= target_max {
Member

I think it is better to use an if statement rather than a match on a boolean expression

Contributor Author

Done
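
For readers unfamiliar with the style point in this thread, here is a side-by-side sketch. Both functions are hypothetical and only mirror the shape of the range check in the snippet above; they are not taken from the PR.

```rust
/// Range check written as a `match` on a boolean expression.
fn in_range_match(value: i128, min: i128, max: i128) -> Option<i128> {
    match value >= min && value <= max {
        true => Some(value),
        false => None,
    }
}

/// The same logic written with `if`, as the reviewer suggests.
fn in_range_if(value: i128, min: i128, max: i128) -> Option<i128> {
    if value >= min && value <= max {
        Some(value)
    } else {
        None
    }
}
```

The two behave identically; the `if` form is simply the more conventional way to branch on a boolean in Rust.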

@liukun4515
Contributor Author

@andygrove PTAL. I think this will conflict with your #3260, though.

let (target_min, target_max) = match target_type {
    DataType::Int8 => (i8::MIN as i128, i8::MAX as i128),
    DataType::Int16 => (i16::MIN as i128, i16::MAX as i128),
    DataType::Int32 => (i32::MIN as i128, i32::MAX as i128),
    DataType::Int64 => (i64::MIN as i128, i64::MAX as i128),
    DataType::Decimal128(precision, _) => (
        MIN_DECIMAL_FOR_EACH_PRECISION[*precision - 1],
Member

Could you add a comment here to explain what is going on?
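
One possible reading of the indexing questioned here, offered as an assumption rather than as the author's explanation: a decimal with precision p holds at most p digits, so its unscaled value lies between -(10^p - 1) and 10^p - 1. A table of bounds for precisions 1 through 38, stored in a zero-based array, is then indexed by `precision - 1`. The sketch below computes such bounds itself instead of using the constants from the snippet; all function names are hypothetical.

```rust
/// Largest unscaled value for a decimal of the given precision: 10^p - 1.
/// Assumes 1 <= precision <= 38, the range this sketch uses for Decimal128.
fn max_decimal_for_precision(precision: u32) -> i128 {
    10_i128.pow(precision) - 1
}

/// Zero-based table: entry `p - 1` holds the upper bound for precision `p`.
fn build_max_table() -> Vec<i128> {
    (1..=38).map(max_decimal_for_precision).collect()
}

/// Check whether an unscaled value fits a decimal of the given precision.
/// The lower bound is taken as the negated upper bound.
fn fits_precision(value: i128, precision: u32) -> bool {
    let table = build_max_table();
    let max = table[(precision - 1) as usize];
    value >= -max && value <= max
}
```

For the `Decimal128(12, 0)` column in the PR description, this gives bounds of plus or minus 999,999,999,999, so the literal 12 fits and can be cast.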

Member

@andygrove andygrove left a comment

Thanks @liukun4515. LGTM, although I am not an expert with decimal manipulation.

@liukun4515 liukun4515 merged commit b1db5ff into apache:master Aug 27, 2022
@liukun4515 liukun4515 deleted the support_decimal_type_#3031 branch August 27, 2022 04:29
ursabot commented Aug 27, 2022

Benchmark runs are scheduled for baseline = 90a0e7c and contender = b1db5ff. b1db5ff is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
