
Conversation


@hsiang-c hsiang-c commented Oct 21, 2025

Which issue does this PR close?

Rationale for this change

  • Apache Spark's abs() behaves differently from DataFusion's.
  • Apache Spark's ANSI-compliant dialect can be toggled via the SparkConf setting spark.sql.ansi.enabled. When ANSI mode is off, arithmetic overflow doesn't throw an exception the way DataFusion does (see the sketch after this list).
  • DataFusion Comet can leverage this in fix: re-enable Comet abs datafusion-comet#2595
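
A minimal sketch of the non-ANSI overflow semantics in plain Rust (a hypothetical helper for illustration, not the PR's actual API): Spark's non-ANSI abs wraps on overflow, so abs of a signed minimum returns the input itself.

// Hypothetical helper illustrating Spark's non-ANSI abs semantics for i32:
// negating i32::MIN overflows, and non-ANSI mode returns the input unchanged.
fn non_ansi_abs_i32(v: i32) -> i32 {
    v.wrapping_abs() // i32::MIN.wrapping_abs() == i32::MIN
}

fn main() {
    assert_eq!(non_ansi_abs_i32(-5), 5);
    assert_eq!(non_ansi_abs_i32(i32::MIN), i32::MIN); // no error in non-ANSI mode
}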

What changes are included in this PR?

  • This is the first PR toward a non-ANSI-mode, Spark-compatible abs math function.
  • It mimics Apache Spark v4.0.1's abs expression for numeric types only, in non-ANSI mode, i.e. spark.sql.ansi.enabled=false.

Tasks breakdown

Non-ANSI mode | ANSI mode | ANSI Interval Types
this PR       | #18828    | TODO

Are these changes tested?

  • unit tests
  • sqllogictest: test_files/spark/math/abs.slt

Are there any user-facing changes?

Yes, the abs function can be used in SQL.

  • An error will NOT be thrown on arithmetic overflow.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Oct 21, 2025
@hsiang-c
Contributor Author

cc @comphead for code review, thank you.


# abs: signed int minimal values
query IIII
select abs(c1), abs(c2), abs(c3), abs(c4) from test_nullable_integer where dataset = 'mins'
Contributor

Wondering, would it be easier to test like

query II
select abs(1), abs(-1)
----
1 1

instead of creating/dropping tables?

Contributor Author
@hsiang-c hsiang-c Oct 25, 2025

Doing abs(-128), abs(-32768) and abs(-2147483648) doesn't work because of type widening.

Doing abs(-128::TINYINT), abs(-32768::SMALLINT), abs(-2147483648::INT), abs(-9223372036854775808::BIGINT) throws a casting error. For example: DataFusion error: Arrow error: Cast error: Can't cast value 128 to type Int8

Contributor
@Jefffrey Jefffrey Oct 25, 2025

I think this is a bug in SQL parsing:

> select -128::tinyint;
Arrow error: Cast error: Can't cast value 128 to type Int8
> select (-128)::tinyint;
+-------------+
| Int64(-128) |
+-------------+
| -128        |
+-------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
  • It casts the 128 value without accounting for the negative sign; we might need to raise an issue for this? Not sure if this is intended behaviour or not.

So you can wrap it in parentheses to ensure the correct precedence, or alternatively use arrow_cast:

> select arrow_cast(-128, 'Int8');
+--------------------------------------+
| arrow_cast(Int64(-128),Utf8("Int8")) |
+--------------------------------------+
| -128                                 |
+--------------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.

0 0
1 1
1 1
NULL NULL
Contributor

It's better to use an inline query; in this example the answers and input data are out of order, which makes it more difficult to read.

## PySpark 3.5.5 Result: {"abs(INTERVAL '-1-1' YEAR TO MONTH)": 13, "typeof(abs(INTERVAL '-1-1' YEAR TO MONTH))": 'interval year to month', "typeof(INTERVAL '-1-1' YEAR TO MONTH)": 'interval year to month'}
#query
#SELECT abs(INTERVAL '-1-1' YEAR TO MONTH::interval year to month);
query error DataFusion error: This feature is not implemented: Unsupported SQL type INTERVAL YEAR TO MONTH
Contributor

Let's create a GitHub ticket to fix this and refer to it in the comments, in addition to the error.

Looks like abs works with intervals only in Spark.

Contributor
@Jefffrey Jefffrey left a comment

I've raised a question on the epic about how we plan to support ANSI mode:

#15914 (comment)

From what I see in this PR, this is done via an extra argument to abs (though I'm not sure it's actually being passed through coerce_types correctly 🤔 )

Comment on lines 492 to 342
#[test]
fn test_abs_u8_scalar() {
    with_fail_on_error(|fail_on_error| {
        let args = ColumnarValue::Scalar(ScalarValue::UInt8(Some(u8::MAX)));
        let fail_on_error_arg =
            ColumnarValue::Scalar(ScalarValue::Boolean(Some(fail_on_error)));
        match spark_abs(&[args, fail_on_error_arg]) {
            Ok(ColumnarValue::Scalar(ScalarValue::UInt8(Some(result)))) => {
                assert_eq!(result, u8::MAX);
                Ok(())
            }
            Err(e) => {
                if fail_on_error {
                    assert!(
                        e.to_string().contains("ARITHMETIC_OVERFLOW"),
                        "Error message did not match. Actual message: {e}"
                    );
                    Ok(())
                } else {
                    panic!("Didn't expect error, but got: {e:?}")
                }
            }
            _ => unreachable!(),
        }
    });
}
Contributor

This test design is very confusing; we can't tell whether a test case is meant to return Ok or Err, as it automatically does the "correct" verification for each case. This automatic way of passing the test on Err should be changed: if a test case is meant to return Err, that should be the only thing we check for.
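
For example, a sketch of the suggested split (test names and crate paths are illustrative; it assumes the PR's spark_abs is in scope), so each test checks exactly one expected outcome:

use datafusion_common::ScalarValue;
use datafusion_expr::ColumnarValue;

// Non-ANSI case: expects Ok, and only Ok.
#[test]
fn test_abs_i8_min_non_ansi_returns_input() {
    let args = ColumnarValue::Scalar(ScalarValue::Int8(Some(i8::MIN)));
    let fail_on_error = ColumnarValue::Scalar(ScalarValue::Boolean(Some(false)));
    match spark_abs(&[args, fail_on_error]).unwrap() {
        ColumnarValue::Scalar(ScalarValue::Int8(Some(v))) => assert_eq!(v, i8::MIN),
        other => panic!("unexpected result: {other:?}"),
    }
}

// ANSI case: expects Err, and only Err.
#[test]
fn test_abs_i8_min_ansi_overflows() {
    let args = ColumnarValue::Scalar(ScalarValue::Int8(Some(i8::MIN)));
    let fail_on_error = ColumnarValue::Scalar(ScalarValue::Boolean(Some(true)));
    let err = spark_abs(&[args, fail_on_error]).unwrap_err();
    assert!(err.to_string().contains("ARITHMETIC_OVERFLOW"), "unexpected error: {err}");
}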

Contributor Author

@Jefffrey You're right, thanks for the feedback.

Contributor Author

I've refactored the test cases; please take another look, thank you.

Comment on lines 158 to 123
fn arithmetic_overflow_error(from_type: &str) -> DataFusionError {
    ArrowError(
        Box::from(arrow::error::ArrowError::ComputeError(format!(
            "arithmetic overflow from {from_type}",
        ))),
        None,
    )
}
Contributor

I feel we should return a DataFusionError::Execution here instead of creating an Arrow error and wrapping it in a DataFusion error, given that the error occurs in our DataFusion code.
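
A minimal sketch of that suggestion (same function name as the PR; crate path as in datafusion-common):

use datafusion_common::DataFusionError;

// Raise a DataFusion-native execution error directly, without wrapping an ArrowError.
fn arithmetic_overflow_error(from_type: &str) -> DataFusionError {
    DataFusionError::Execution(format!("arithmetic overflow from {from_type}"))
}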

Contributor Author

This method is removed. I reused macros from DataFusion's own abs implementation, and the arithmetic overflow error is thrown from those macros.
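
For reference, the overflow detection in this style of abs boils down to checked_abs, which returns None at the signed minimum (a standalone illustration, not DataFusion's actual macro code):

fn main() {
    assert_eq!((-5i32).checked_abs(), Some(5));
    // i32::MIN has no positive counterpart, so checked_abs signals overflow with None.
    assert_eq!(i32::MIN.checked_abs(), None);
}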

Comment on lines 119 to 121
let n = $ARRAY.as_any().downcast_ref::<$TYPE>();
match n {
Some(array) => {
Contributor

I would prefer we unwrap n directly instead of matching on it, as we are guaranteed it will be of the correct array type; the same goes for ansi_compute_op below.
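
A minimal illustration of the suggested pattern with arrow-rs (the array contents are made up):

use arrow::array::{Array, ArrayRef, Int32Array};
use std::sync::Arc;

fn main() {
    let array: ArrayRef = Arc::new(Int32Array::from(vec![-1, 2, -3]));
    // When dispatch guarantees the type, unwrap the downcast directly
    // instead of matching on the Option.
    let ints = array
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("guaranteed Int32Array");
    assert_eq!(ints.value(0), -1);
}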

@hsiang-c hsiang-c force-pushed the spark_abs branch 2 times, most recently from e152413 to 832a6ed Compare November 8, 2025 23:56
@github-actions github-actions bot added the functions Changes to functions implementation label Nov 8, 2025
@github-actions github-actions bot added sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate substrait Changes to the substrait crate catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate labels Nov 9, 2025
@github-actions github-actions bot removed sql SQL Planner development-process Related to development process of DataFusion labels Nov 9, 2025
@hsiang-c
Contributor Author

Thanks @Jefffrey

## PySpark 3.5.5 Result: {"abs(INTERVAL '-1-1' YEAR TO MONTH)": 13, "typeof(abs(INTERVAL '-1-1' YEAR TO MONTH))": 'interval year to month', "typeof(INTERVAL '-1-1' YEAR TO MONTH)": 'interval year to month'}
#query
#SELECT abs(INTERVAL '-1-1' YEAR TO MONTH::interval year to month);
# See GitHub issue for ANSI interval support: https://github.com/apache/datafusion/issues/18793
Contributor
@Jefffrey Jefffrey Nov 18, 2025

FYI, you can cast to a specific interval type like so:

(arrow_cast('-1 year', 'Interval(YearMonth)')),
(arrow_cast('13 months', 'Interval(YearMonth)')),
(arrow_cast('1 year', 'Interval(YearMonth)'));

----
-128 -32768 -2147483648 -9223372036854775808

# abs: floats, NULL and NaN
Contributor

Thanks @hsiang-c, can we also add -Inf and Inf for float/double, and -0.0?
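
These cases matter because abs must normalize -0.0 to 0.0 and map -Inf to Inf while leaving NaN as NaN; in plain Rust terms:

fn main() {
    // -0.0 normalizes to +0.0 (compare bit patterns, since -0.0 == 0.0 numerically).
    assert_eq!((-0.0f64).abs().to_bits(), 0.0f64.to_bits());
    assert_eq!(f64::NEG_INFINITY.abs(), f64::INFINITY);
    assert!(f64::NAN.abs().is_nan());
}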

Contributor Author

Good catch, added.

Contributor
@comphead comphead left a comment

Thanks @hsiang-c and @Jefffrey for the review

Pending some more tests on real numbers

@comphead comphead added this pull request to the merge queue Nov 19, 2025
Merged via the queue into apache:main with commit ac9c6b4 Nov 19, 2025
28 checks passed