Skip to content

Conversation

@mustafasrepo
Copy link
Contributor

@mustafasrepo mustafasrepo commented Feb 24, 2023

Which issue does this PR close?

Closes #5346.

Rationale for this change

What changes are included in this PR?

If we cannot cast range value to the target type. We now check for If range value can be successfully casted to largest type of the family (for Int types largest type is Int64, for Uint types it is Uint64 and so on). If we can accomplish so this range value is treated as unbounded. See #5346 for context.

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 24, 2023
Comment on lines +75 to 78
/// Determine whether the given data type `dt` is a `Utf8`.
pub fn is_utf8(dt: &DataType) -> bool {
matches!(dt, DataType::Utf8)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but I think we missed DataType::LargeUtf8 in #5234. cc @jackwener

Comment on lines 472 to 475
Err(DataFusionError::NotImplemented(format!(
"Cannot cast {:?} to {:?}",
value, target_type
)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the largest type doesn't work then this coercion is impossible rather than unimplemented?

Copy link
Contributor Author

@mustafasrepo mustafasrepo Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed you are right. Fixed it. Thanks.

Copy link
Contributor

@avantgardnerio avantgardnerio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much cleaner overall @mustafasrepo . If you could clarify my confusion on the behavior of Null I'd be willing to hit "merge".

value, target_type
)))
},
|_| ScalarValue::try_from(target_type),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand - this is ignoring the result of coerce_scalar() and re-parsing it as ScalarValue::try_from(target_type)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the case when target type is Int8 and range value used in the query is 10000 like following: OVER(ORDER BY c1 RANGE BETWEEN 10000 PRECEDING AND 10000 FOLLOWING)). 10000 cannot be casted to Int8. In this case we try to cast value to Int64 type. If it succeeds it means that 10000 couldn't be casted to Int8 because of overflow in the first case. It means that we can treat the OVER(ORDER BY c1 RANGE BETWEEN 10000 PRECEDING AND 10000 FOLLOWING) as OVER(ORDER BY c1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) without loss of generality, since we know that range would cover the whole table. In this case we are not interested in the casted version (We ignore the result. Then return NULL to convert range to UNBOUNDED. By the way we also use NULL for the types in the form ScalarValue::Int8(None) not just for the types ScalarValue::Null in case that terminology is misleading.). We just need to know if value can be casted to large type. If large type casting fails also it means that some strange range entered not compatible with the original column like following: OVER(ORDER BY c1 RANGE BETWEEN '3 day' PRECEDING AND '1 hour' FOLLOWING). In that case we return error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

/// If the coercion is successful, we return an `Ok` value with the result.
/// If the coercion fails because `target_type` is not wide enough (i.e. we
/// can not coerce to `target_type`, but we can to a wider type in the same
/// family), we return a `Null` value of this type to signal this situation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the Null introduced? ScalarValue::try_from(target_type)?

Rust has Option and Result to indicate an operation was unsuccessful, is there a reason to use a valid value (Null) which callers must be aware of, vs returning one of the above to indicate an error? I could see how callers that did not read this documentation might continue on with the Null introducing subtle errors later in the call stack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it looks like this is because that's the existing behavior of WindowFrameBound?:

pub enum WindowFrameBound {
    Preceding(ScalarValue),
    CurrentRow,
    Following(ScalarValue),
}


pub fn is_uft8(dt: &DataType) -> bool {
/// Determine whether the given data type `dt` is a `Utf8`.
pub fn is_utf8(dt: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this.

}

/// Casts the ScalarValue `value` to coerced type.
// When coerced type is `Interval` we use `parse_interval` since `try_from_string` not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue was not introduced by this PR, but it seems like it would be a good idea to update try_from_string()...

@avantgardnerio avantgardnerio merged commit 49237a2 into apache:main Feb 24, 2023
@ursabot
Copy link

ursabot commented Feb 24, 2023

Benchmark runs are scheduled for baseline = d6ef463 and contender = 49237a2. 49237a2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@mustafasrepo mustafasrepo deleted the bug/window_range_coerce branch March 2, 2023 10:54
@andygrove andygrove added the bug Something isn't working label Mar 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Window function: Internal error: Operator + is not implemented for types

6 participants