Bug fix: Window frame range value outside the type range #5384

mustafasrepo · 2023-02-24T07:09:34Z

Which issue does this PR close?

Closes #5346.

Rationale for this change

What changes are included in this PR?

If the range value cannot be cast to the target type, we now check whether it can be cast to the largest type of the same family (Int64 for Int types, UInt64 for UInt types, and so on). If that cast succeeds, the range value is treated as unbounded. See #5346 for context.
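As a rough illustration of the "largest type of the family" lookup described above, here is a minimal sketch. `ToyType` and `largest_in_family` are hypothetical names for this example only and are not the Arrow `DataType` or any DataFusion function; the float mapping is an assumption read into "and so on".

```rust
/// Toy stand-in for a data type enum (hypothetical, not Arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
    Utf8,
}

/// Returns the widest type of the family `dt` belongs to, or `None`
/// when `dt` has no wider numeric family to fall back on.
pub fn largest_in_family(dt: ToyType) -> Option<ToyType> {
    use ToyType::*;
    match dt {
        Int8 | Int16 | Int32 | Int64 => Some(Int64),
        UInt8 | UInt16 | UInt32 | UInt64 => Some(UInt64),
        // Assumption: floats follow the same pattern as the integer families.
        Float32 | Float64 => Some(Float64),
        Utf8 => None,
    }
}
```

With this mapping, an `Int8` target falls back to `Int64` and a `UInt16` target falls back to `UInt64`.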

Are these changes tested?

Yes

Are there any user-facing changes?

# Conflicts: # datafusion/core/tests/sql/window.rs

waynexia · 2023-02-24T11:20:20Z

datafusion/expr/src/type_coercion.rs

+/// Determine whether the given data type `dt` is a `Utf8`.
+pub fn is_utf8(dt: &DataType) -> bool {
    matches!(dt, DataType::Utf8)
 }


Not related to this PR, but I think we missed DataType::LargeUtf8 in #5234. cc @jackwener
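For illustration only, a check that also accepts the large string type could look like the sketch below. `ToyDataType` and `is_utf8_or_large_utf8` are hypothetical names made up for this example; this is not the code from #5234 or from any follow-up PR.

```rust
/// Toy stand-in for a data type enum (hypothetical, not Arrow's `DataType`).
#[derive(Debug)]
pub enum ToyDataType {
    Utf8,
    LargeUtf8,
    Int64,
}

/// Returns true for both string variants, false for everything else.
pub fn is_utf8_or_large_utf8(dt: &ToyDataType) -> bool {
    matches!(dt, ToyDataType::Utf8 | ToyDataType::LargeUtf8)
}
```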

waynexia · 2023-02-24T11:56:22Z

datafusion/optimizer/src/type_coercion.rs

+                    Err(DataFusionError::NotImplemented(format!(
+                        "Cannot cast {:?} to {:?}",
+                        value, target_type
+                    )))


If the largest type doesn't work then this coercion is impossible rather than unimplemented?

Indeed you are right. Fixed it. Thanks.

avantgardnerio

Much cleaner overall, @mustafasrepo. If you could clarify my confusion on the behavior of Null I'd be willing to hit "merge".

avantgardnerio · 2023-02-24T16:40:00Z

datafusion/optimizer/src/type_coercion.rs

+                        value, target_type
+                    )))
+                },
+                |_| ScalarValue::try_from(target_type),


I don't understand - this is ignoring the result of coerce_scalar() and re-parsing it as ScalarValue::try_from(target_type)?

Consider the case where the target type is Int8 and the range value in the query is 10000, as in OVER(ORDER BY c1 RANGE BETWEEN 10000 PRECEDING AND 10000 FOLLOWING). 10000 cannot be cast to Int8, so we try to cast it to Int64 instead. If that succeeds, the first cast must have failed because of overflow. We can then treat OVER(ORDER BY c1 RANGE BETWEEN 10000 PRECEDING AND 10000 FOLLOWING) as OVER(ORDER BY c1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) without loss of generality, since we know the range covers the whole table.

In this case we are not interested in the cast value itself; we only need to know whether the value can be cast to the large type. So we ignore the result and return NULL, which converts the range to UNBOUNDED. (By NULL we mean typed nulls such as ScalarValue::Int8(None), not just ScalarValue::Null, in case the terminology is misleading.)

If the cast to the large type also fails, the range is not compatible with the original column, as in OVER(ORDER BY c1 RANGE BETWEEN '3 day' PRECEDING AND '1 hour' FOLLOWING). In that case we return an error.

Thanks for the clarification!
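The fallback discussed in this thread can be sketched with plain standard-library parsing. This is only an approximation under stated assumptions: `Coerced` and `coerce_range_offset_to_i8` are hypothetical names, the target type is fixed to Int8, and the PR itself signals the overflow case with a null `ScalarValue` rather than a dedicated enum variant.

```rust
/// Hypothetical outcome of coercing a RANGE frame offset for an Int8 column.
#[derive(Debug, PartialEq)]
pub enum Coerced {
    /// The offset fits the target type.
    Value(i8),
    /// The offset overflows Int8 but is a valid Int64, so the frame
    /// is treated as UNBOUNDED.
    Unbounded,
}

/// Tries the target type first, then the largest type of the family.
pub fn coerce_range_offset_to_i8(raw: &str) -> Result<Coerced, String> {
    // First attempt: parse directly as the target type.
    if let Ok(v) = raw.parse::<i8>() {
        return Ok(Coerced::Value(v));
    }
    // Fallback: if the wider type accepts the value, the first attempt
    // failed only because of overflow. The parsed value itself is ignored.
    match raw.parse::<i64>() {
        Ok(_) => Ok(Coerced::Unbounded),
        // Neither type accepts the value, so the coercion is impossible.
        Err(_) => Err(format!("Cannot cast {:?} to Int8", raw)),
    }
}
```

Under these assumptions, "10" yields a concrete value, "10000" yields the unbounded marker, and "3 day" yields an error.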

avantgardnerio · 2023-02-24T16:43:24Z

datafusion/optimizer/src/type_coercion.rs

+/// If the coercion is successful, we return an `Ok` value with the result.
+/// If the coercion fails because `target_type` is not wide enough (i.e. we
+/// can not coerce to `target_type`, but we can to a wider type in the same
+/// family), we return a `Null` value of this type to signal this situation.


Where is the Null introduced? ScalarValue::try_from(target_type)?

Rust has Option and Result to indicate an operation was unsuccessful, is there a reason to use a valid value (Null) which callers must be aware of, vs returning one of the above to indicate an error? I could see how callers that did not read this documentation might continue on with the Null introducing subtle errors later in the call stack.

Oh, it looks like this is because that's the existing behavior of WindowFrameBound?:

pub enum WindowFrameBound {
    Preceding(ScalarValue),
    CurrentRow,
    Following(ScalarValue),
}
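To make the "null means unbounded" convention concrete, here is a toy model of the bound. `ToyFrameBound` is a hypothetical simplification that uses `Option<i64>` in place of `ScalarValue`; it is a sketch of the idea, not the DataFusion type.

```rust
/// Toy frame bound: the offset is `None` when the scalar would be null.
#[derive(Debug, PartialEq)]
pub enum ToyFrameBound {
    Preceding(Option<i64>),
    CurrentRow,
    Following(Option<i64>),
}

impl ToyFrameBound {
    /// A bound with a null offset is read as UNBOUNDED PRECEDING/FOLLOWING.
    pub fn is_unbounded(&self) -> bool {
        match self {
            ToyFrameBound::Preceding(v) | ToyFrameBound::Following(v) => v.is_none(),
            ToyFrameBound::CurrentRow => false,
        }
    }
}
```

In this model a caller that ignores the convention would still see a well-formed bound, which is the concern raised above about sentinel values compared with `Option` or `Result`.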

avantgardnerio · 2023-02-24T16:44:21Z

datafusion/expr/src/type_coercion.rs


-pub fn is_uft8(dt: &DataType) -> bool {
+/// Determine whether the given data type `dt` is a `Utf8`.
+pub fn is_utf8(dt: &DataType) -> bool {


Thanks for fixing this.

avantgardnerio · 2023-02-24T16:50:36Z

datafusion/optimizer/src/type_coercion.rs

 }

-/// Casts the ScalarValue `value` to coerced type.
-// When coerced type is `Interval` we use `parse_interval` since `try_from_string` not


The issue was not introduced by this PR, but it seems like it would be a good idea to update try_from_string()...

ursabot · 2023-02-24T19:19:59Z

Benchmark runs are scheduled for baseline = d6ef463 and contender = 49237a2. 49237a2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

mustafasrepo and others added 5 commits February 23, 2023 14:59

Initial Implementation

2a49020

Change error type

a1e20d4

just use largest type

78985b1

Refactors, simplifications, comment improvements

518f1ea

Merge branch 'main' into bug/window_range_coerce

19d31d0

# Conflicts: # datafusion/core/tests/sql/window.rs

github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 24, 2023

add new test

b9f907a

waynexia reviewed Feb 24, 2023

View reviewed changes

Change error type

f953d3e

avantgardnerio approved these changes Feb 24, 2023

View reviewed changes

avantgardnerio merged commit 49237a2 into apache:main Feb 24, 2023

jackwener mentioned this pull request Feb 25, 2023

minor: add forgotten large_utf8 #5393

Merged

mustafasrepo deleted the bug/window_range_coerce branch March 2, 2023 10:54

andygrove added the bug Something isn't working label Mar 12, 2023


6 participants
