feat: support Spark-compatible abs math function part 1 - non-ANSI mode
#18205
Conversation
cc @comphead for code review, thank you.
```
# abs: signed int minimal values
query IIII
select abs(c1), abs(c2), abs(c3), abs(c4) from test_nullable_integer where dataset = 'mins'
```
Wondering, would it be easier to test like:

```
query II
select abs(1), abs(-1)
----
1 1
```

instead of creating/dropping tables?
Doing abs(-128), abs(-32768) and abs(-2147483648) doesn't work because of type widening.
Doing abs(-128::TINYINT), abs(-32768::SMALLINT), abs(-2147483648::INT), abs(-9223372036854775808::BIGINT) throws a casting error. For example: DataFusion error: Arrow error: Cast error: Can't cast value 128 to type Int8
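For context on why the minimum values matter here: the true absolute value of a signed minimum (e.g. 128 for Int8) does not fit back into the same type, which is exactly the overflow case this PR's non-ANSI mode must tolerate. A minimal standard-library Rust sketch of the two behaviors being distinguished (checked vs. wrapping abs; the mapping to ANSI/non-ANSI follows this PR's description):

```rust
fn main() {
    // i8::MIN is -128; its absolute value (128) does not fit in i8.
    assert_eq!(i8::MIN.checked_abs(), None);       // overflow detected (ANSI-like)
    assert_eq!(i8::MIN.wrapping_abs(), i8::MIN);   // wraps back to -128 (non-ANSI-like)
    assert_eq!(i64::MIN.wrapping_abs(), i64::MIN); // same story for BIGINT
}
```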
I think this is a bug in SQL parsing:

```
> select -128::tinyint;
Arrow error: Cast error: Can't cast value 128 to type Int8

> select (-128)::tinyint;
+-------------+
| Int64(-128) |
+-------------+
| -128        |
+-------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
```

It casts the 128 value without accounting for the negative sign; might need to raise an issue for this? Not sure if this is intended behaviour or not.
So you can wrap it in parentheses to ensure the correct precedence, or alternatively use arrow_cast:

```
> select arrow_cast(-128, 'Int8');
+--------------------------------------+
| arrow_cast(Int64(-128),Utf8("Int8")) |
+--------------------------------------+
| -128                                 |
+--------------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.
```

```
0 0
1 1
1 1
NULL NULL
```
It's better to use an inline query; in this example the answers and input data are out of order, which makes it more difficult to read.
```
## PySpark 3.5.5 Result: {"abs(INTERVAL '-1-1' YEAR TO MONTH)": 13, "typeof(abs(INTERVAL '-1-1' YEAR TO MONTH))": 'interval year to month', "typeof(INTERVAL '-1-1' YEAR TO MONTH)": 'interval year to month'}
#query
#SELECT abs(INTERVAL '-1-1' YEAR TO MONTH::interval year to month);
query error DataFusion error: This feature is not implemented: Unsupported SQL type INTERVAL YEAR TO MONTH
```
Let's create a GitHub ticket to fix this and refer to it in the comments, in addition to the error. Looks like abs works with intervals for Spark only.
Jefffrey left a comment:
I've raised a question on the epic about how we plan to support ANSI mode. From what I see in this PR, it is done via an extra argument to abs (though I'm not sure it's actually being passed through coerce_types correctly 🤔).
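For readers following along: the extra argument in question is the fail_on_error flag, passed as a trailing boolean scalar (see the test quoted below). A hypothetical sketch of what a coercion rule would have to preserve for that convention to survive coerce_types; the names here are illustrative, not the PR's actual code:

```rust
use arrow::datatypes::DataType;
use datafusion_common::{plan_err, Result};

// Hypothetical coercion for a two-argument abs(value, fail_on_error):
// pass the numeric type through and keep the trailing ANSI flag Boolean.
fn coerce_abs_types(arg_types: &[DataType]) -> Result<Vec<DataType>> {
    match arg_types {
        [value_type] => Ok(vec![value_type.clone()]),
        [value_type, DataType::Boolean] => {
            Ok(vec![value_type.clone(), DataType::Boolean])
        }
        _ => plan_err!("abs expects (value) or (value, boolean), got {arg_types:?}"),
    }
}
```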
```rust
#[test]
fn test_abs_u8_scalar() {
    with_fail_on_error(|fail_on_error| {
        let args = ColumnarValue::Scalar(ScalarValue::UInt8(Some(u8::MAX)));
        let fail_on_error_arg =
            ColumnarValue::Scalar(ScalarValue::Boolean(Some(fail_on_error)));
        match spark_abs(&[args, fail_on_error_arg]) {
            Ok(ColumnarValue::Scalar(ScalarValue::UInt8(Some(result)))) => {
                assert_eq!(result, u8::MAX);
                Ok(())
            }
            Err(e) => {
                if fail_on_error {
                    assert!(
                        e.to_string().contains("ARITHMETIC_OVERFLOW"),
                        "Error message did not match. Actual message: {e}"
                    );
                    Ok(())
                } else {
                    panic!("Didn't expect error, but got: {e:?}")
                }
            }
            _ => unreachable!(),
        }
    });
}
```
This test design is very confusing; we can't tell whether a test case is meant to return Ok or Err, since it automatically does the "correct" verification for each case. This automatic way of passing the test on Err should be changed: if a test case is meant to return Err, that should be the only thing we check for.
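A hedged sketch of the split being requested, with one test per expected outcome instead of a closure that accepts either; the spark_abs signature and the ARITHMETIC_OVERFLOW message are taken from the test quoted above, while the i8 case is illustrative:

```rust
#[test]
fn test_abs_i8_min_non_ansi_wraps() {
    // Non-ANSI mode (fail_on_error = false): overflow is tolerated,
    // so this test only ever expects Ok.
    let args = [
        ColumnarValue::Scalar(ScalarValue::Int8(Some(i8::MIN))),
        ColumnarValue::Scalar(ScalarValue::Boolean(Some(false))),
    ];
    match spark_abs(&args) {
        Ok(ColumnarValue::Scalar(ScalarValue::Int8(Some(v)))) => assert_eq!(v, i8::MIN),
        other => panic!("expected Ok(i8::MIN), got: {other:?}"),
    }
}

#[test]
fn test_abs_i8_min_ansi_overflows() {
    // ANSI mode (fail_on_error = true): the same input must return Err,
    // and an error is the only outcome this test accepts.
    let args = [
        ColumnarValue::Scalar(ScalarValue::Int8(Some(i8::MIN))),
        ColumnarValue::Scalar(ScalarValue::Boolean(Some(true))),
    ];
    let e = spark_abs(&args).expect_err("ANSI abs(i8::MIN) must overflow");
    assert!(
        e.to_string().contains("ARITHMETIC_OVERFLOW"),
        "Error message did not match. Actual message: {e}"
    );
}
```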
@Jefffrey You're right, thanks for the feedback.
I've refactored the test cases; please take a look. Thank you.
```rust
fn arithmetic_overflow_error(from_type: &str) -> DataFusionError {
    ArrowError(
        Box::from(arrow::error::ArrowError::ComputeError(format!(
            "arithmetic overflow from {from_type}",
        ))),
        None,
    )
}
```
I feel we should return a DataFusionError::Execution here instead of creating an Arrow error and wrapping it in a DataFusion error, given that the error occurs in our DataFusion code.
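A minimal sketch of the suggested alternative, using the DataFusionError::Execution variant that datafusion_common already provides:

```rust
use datafusion_common::DataFusionError;

fn arithmetic_overflow_error(from_type: &str) -> DataFusionError {
    // The overflow is detected in our own code, so an Execution error
    // is returned directly instead of wrapping an Arrow ComputeError.
    DataFusionError::Execution(format!("arithmetic overflow from {from_type}"))
}
```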
This method is removed. I reused macros from DataFusion's own abs implementation, and the arithmetic overflow is thrown from those macros.
```rust
let n = $ARRAY.as_any().downcast_ref::<$TYPE>();
match n {
    Some(array) => {
```
I would prefer we unwrap n directly instead of matching on it, as we are guaranteed it will be of the correct array type; the same goes for ansi_compute_op below.
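A sketch of the direct unwrap being suggested, assuming the caller has already dispatched on the data type so the downcast cannot fail (Int32Array is used here purely for illustration):

```rust
use arrow::array::{ArrayRef, Int32Array};

fn as_int32(array: &ArrayRef) -> &Int32Array {
    // The macro is only invoked for DataType::Int32, so this downcast is
    // infallible; expect() documents that invariant instead of a match.
    array
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("abs: array must be an Int32Array")
}
```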
Force-pushed from e152413 to 832a6ed.
Thanks @Jefffrey
```
## PySpark 3.5.5 Result: {"abs(INTERVAL '-1-1' YEAR TO MONTH)": 13, "typeof(abs(INTERVAL '-1-1' YEAR TO MONTH))": 'interval year to month', "typeof(INTERVAL '-1-1' YEAR TO MONTH)": 'interval year to month'}
#query
#SELECT abs(INTERVAL '-1-1' YEAR TO MONTH::interval year to month);
# See GitHub issue for ANSI interval support: https://github.com/apache/datafusion/issues/18793
```
Fyi, you can cast to a specific interval type like so (datafusion/sqllogictest/test_files/aggregate.slt, lines 2345-2347 at 0304cda):

```
(arrow_cast('-1 year', 'Interval(YearMonth)')),
(arrow_cast('13 months', 'Interval(YearMonth)')),
(arrow_cast('1 year', 'Interval(YearMonth)'));
```
```
----
-128 -32768 -2147483648 -9223372036854775808

# abs: floats, NULL and NaN
```
Thanks @hsiang-c. Can we also add -Inf and Inf for float/double, and -0.0?
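For reference, the expected semantics for these new cases are the IEEE-754 ones, which Rust's standard float abs already follows; a quick sketch:

```rust
fn main() {
    assert_eq!(f64::NEG_INFINITY.abs(), f64::INFINITY); // abs(-Inf) = Inf
    assert_eq!(f64::INFINITY.abs(), f64::INFINITY);     // abs(Inf)  = Inf
    assert!((-0.0f64).abs().is_sign_positive());        // abs(-0.0) = 0.0
    assert!(f64::NAN.abs().is_nan());                   // abs(NaN)  = NaN
}
```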
Good catch, added.
comphead left a comment:
Which issue does this PR close?

Part of datafusion-spark: Spark Compatible Functions #15914.

Rationale for this change

Spark's abs() behaves differently than DataFusion's, and its behavior depends on spark.sql.ansi.enabled. When ANSI mode is off, arithmetic overflow doesn't throw an exception the way DataFusion does. See also abs datafusion-comet#2595.

What changes are included in this PR?

A Spark-compatible abs math function (per Spark v4.0.1): the abs expression for numeric types only and non-ANSI mode, i.e. spark.sql.ansi.enabled=false.

Tasks breakdown

Are these changes tested?

Yes, in test_files/spark/math/abs.slt.

Are there any user-facing changes?

Yes, the abs function can be specified in SQL.
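To make that last point concrete, a hedged sketch of invoking the new function directly, reusing the calling convention from the tests quoted earlier in this review; the import path is an assumption, and only the spark_abs(&[ColumnarValue]) shape comes from this PR's tests:

```rust
use datafusion_common::ScalarValue;
use datafusion_expr::ColumnarValue;
// Assumed import path for the function added in this PR.
use datafusion_spark::function::math::abs::spark_abs;

fn main() -> datafusion_common::Result<()> {
    // Non-ANSI mode (fail_on_error = false): abs(-1) is simply 1, and per
    // the PR description, overflow on i32::MIN would not raise an error.
    let args = [
        ColumnarValue::Scalar(ScalarValue::Int32(Some(-1))),
        ColumnarValue::Scalar(ScalarValue::Boolean(Some(false))),
    ];
    if let ColumnarValue::Scalar(ScalarValue::Int32(Some(v))) = spark_abs(&args)? {
        assert_eq!(v, 1);
    }
    Ok(())
}
```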