Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Coalesce casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion #10268

Merged
merged 43 commits into from
May 26, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Apr 27, 2024

Which issue does this PR close?

Closes #10261
Closes #10241
Part of #10507

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest labels Apr 27, 2024
@@ -92,14 +92,19 @@ pub enum TypeSignature {
/// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])`
Variadic(Vec<DataType>),
/// One or more arguments of an arbitrary but equal type.
/// DataFusion attempts to coerce all argument types to match the first argument's type
/// DataFusion attempts to coerce all argument types to match to the common type with comparision coercion.
///
/// # Examples
/// Given types in signature should be coercible to the same final type.
/// A function such as `make_array` is `VariadicEqual`.
///
/// `make_array(i32, i64) -> make_array(i64, i64)`
VariadicEqual,

This comment was marked as outdated.

// Note that not all rules in `comparison_coercion` can be reused here.
// For example, all numeric types can be coerced into Utf8 for comparison,
// but not for function arguments.
_ => comparison_binary_numeric_coercion(type_into, type_from).and_then(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is introduced in #9459, so I think it is safe to remove together with this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @viirya

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 closed this Apr 28, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 reopened this Apr 28, 2024
@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules labels Apr 28, 2024
@jayzhan211 jayzhan211 changed the title Avoid casting for coalesce, introduce signature VariadicEqualOrNull for coalesce. Fix Coalesce casting logic that follows Postgres and DuckDB. Introduce signature that do non-comparison coercion Apr 28, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot removed the optimizer Optimizer rules label Apr 28, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot removed the physical-expr Physical Expressions label Apr 28, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the optimizer Optimizer rules label May 12, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review May 12, 2024 08:40
@jayzhan211 jayzhan211 marked this pull request as draft May 14, 2024 08:23
@jayzhan211 jayzhan211 marked this pull request as ready for review May 14, 2024 11:27
@jayzhan211 jayzhan211 marked this pull request as draft May 14, 2024 11:31
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review May 14, 2024 13:18
@alamb
Copy link
Contributor

alamb commented May 21, 2024

In general I am concerned about the potential downstream effects of this change. I don't fully understand them

What I would ideally like to do is to run the influxdb_iox regression tests with this change and see what happens.

I am not sure I have the time to do that in the next week -- maybe @appletreeisyellow does 🤔

@alamb
Copy link
Contributor

alamb commented May 21, 2024

Basically I worry this is just changing behavior rather than fixing a bug and will result in churn for no benefit downstream. I may be mis understanding the change and rationale however

@jayzhan211
Copy link
Contributor Author

I believe the coercion rule is quite messy as it currently stands. It would be more understandable and maintainable to move the coercion rule from coerced_from to each individual function. This way, it would be clear which coercion rule applies to each function or signature. Having a single function handle multiple coercion rules makes the code harder to reason about and could cause downstream issues, as you have pointed out. I've also filed an issue to track this at @10507.

Would it be better to start by removing the rule from coerce_from and moving it to the specific function or signature?

@appletreeisyellow
Copy link
Contributor

I am not sure I have the time to do that in the next week -- maybe @appletreeisyellow does 🤔

@alamb I'm happy to coordinate with our performance team and run an influxdb_iox regression tests if needed

@alamb
Copy link
Contributor

alamb commented May 22, 2024

I don't think this needs a performance test -- I was thinking just the main test suite

@alamb
Copy link
Contributor

alamb commented May 22, 2024

I believe the coercion rule is quite messy as it currently stands. It would be more understandable and maintainable to move the coercion rule from coerced_from to each individual function. This way, it would be clear which coercion rule applies to each function or signature. Having a single function handle multiple coercion rules makes the code harder to reason about and could cause downstream issues, as you have pointed out. I've also filed an issue to track this at @10507.

Would it be better to start by removing the rule from coerce_from and moving it to the specific function or signature?

Hi @jayzhan211 -- I am sorry I haven't had a chance to devote more time to this and review. I will try and find some time in the next few days

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented May 22, 2024

I don't think this needs a performance test -- I was thinking just the main test suite

I was thinking performance regression 😅 Got it! I can put up an Influxdata test against this branch, sometime this afternoon

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update here is that @appletreeisyellow did run the internal influx db tests against this branch and we didn't find any problems. Not that it is required that we don't have to change our code, but I have found that our test suite has a bit broader coverage and thus the fact it doesn't have problems suggests to me this PR is less likely to have knock on effects

I had another look at this PR this morning and Other than the changes in select.slt I think it looks great to me

Thanks again @jayzhan211

@@ -1473,7 +1473,7 @@ DROP TABLE t;

# related to https://github.com/apache/datafusion/issues/8814
statement ok
create table t(x int, y int) as values (1,1), (2,2), (3,3), (0,0), (4,0);
create table t(x bigint, y bigint) as values (1,1), (2,2), (3,3), (0,0), (4,0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please keep the existing test and add a new test that uses bigint so it is clearer what behavior is changing?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@@ -1778,7 +1778,7 @@ AS VALUES
('BB', 6, 1);

query TII
select col1, col2, coalesce(sum_col3, 0) as sum_col3
select col1, col2, coalesce(sum_col3, arrow_cast(0, 'UInt64')) as sum_col3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also we can change this test back too

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@alamb
Copy link
Contributor

alamb commented May 26, 2024

Looks good, and existing tests are passing 👍 🚀

@alamb alamb merged commit 0836500 into apache:main May 26, 2024
23 checks passed
@jayzhan211
Copy link
Contributor Author

Thanks @alamb and @appletreeisyellow

jayzhan211 added a commit to jayzhan211/arrow-datafusion that referenced this pull request May 26, 2024
…Introduce signature that do non-comparison coercion (apache#10268)

* remove casting for coalesce

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add more test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add more test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* crate only visibility

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* polish comment

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* improve test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* backup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* introduce new signautre for coalesce

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* ignore err msg

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix doc

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add more test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* switch to type_resolution coercion

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix i64 and u64 case

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add more tests

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add null case

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rename to type_union_resolution

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add comment

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add comment

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup since rebase

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add more test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add more test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix msg

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm pure_string_coercion

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm duplicate

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* change type in select.slt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix slt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@alamb
Copy link
Contributor

alamb commented May 27, 2024

Thank you for sticking with it @jayzhan211 -- so much is going on!@

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest
Projects
None yet
4 participants