
[SPARK-38604][SQL] Keep ceil and floor with only a single argument the same as before #35913

Closed
wants to merge 2 commits

Conversation

@revans2 (Contributor) commented Mar 19, 2022

What changes were proposed in this pull request?

This PR proposes to make ceil(column) in Scala/Python/Java APIs to keep the previous behaviour by returning LongType instead of DecimalType.

Why are the changes needed?

We shouldn't break compatibility and should keep the old behaviour of ceil(column) that returns LongType.

Does this PR introduce any user-facing change?

No to end users.

This fixes a return type. In older versions, ceil(column) returned LongType, but commit 6242145 mistakenly missed this case for the Scala/Python APIs, so it returns DecimalType.

This PR restores the previous behaviour back.

How was this patch tested?

A unit test was added.

@awdavidson (Contributor):

@revans2 I think it would be wise to add tests as part of this PR to prevent the same issue from being re-introduced. I guess it would make sense to add them in the same place as 6242145, the commit that changed the functionality?

@HyukjinKwon (Member):

Yeah, +1 for @awdavidson's advice. The change itself looks sensible from a cursory look. cc @cloud-fan FYI.

@@ -1783,7 +1783,9 @@ object functions {
* @group math_funcs
* @since 1.4.0
*/
-  def ceil(e: Column): Column = ceil(e, lit(0))
+  def ceil(e: Column): Column = withExpr {
+    UnresolvedFunction(Seq("ceil"), Seq(e.expr), isDistinct = false)
+  }
Contributor:

Hmm, is this just a code cleanup, or does it fix a bug?

Contributor:

To confirm, the problem you hit is: ceil(input) and ceil(input, 0) return the same result, but use different expressions which break some custom catalyst rules?

If that's the case, I'd suggest we also fix CeilFloorExpressionBuilderBase and call buildWithOneParam if the scale is 0, to make ceil(input) and ceil(input, 0) exactly the same.
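A minimal sketch of what that suggestion could look like inside the builder. This is hypothetical: the match shape and the exact signatures of build, buildWithOneParam, and buildWithTwoParams are assumed for illustration, not taken from this PR's diff.

```scala
// Hypothetical sketch, not actual Spark source: when the two-argument form
// is called with a literal scale of 0, resolve through the one-parameter
// path so that ceil(input) and ceil(input, 0) build the same expression.
override def build(funcName: String, expressions: Seq[Expression]): Expression =
  expressions match {
    case Seq(input) => buildWithOneParam(input)
    case Seq(input, Literal(0, IntegerType)) => buildWithOneParam(input)  // new case
    case Seq(input, scale) => buildWithTwoParams(input, scale)
    case _ => throw new IllegalArgumentException(
      s"$funcName expects 1 or 2 arguments, got ${expressions.length}")
  }
```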

Contributor:

@cloud-fan I agree, as the current change does not take into account that ceil(input) and ceil(input, 0) return different types even though they mean the same thing (which could confuse end users).

Contributor:

I think we can call ceil(input)/floor(input) if scale = 0, but we need a test case to validate the modifications. @awdavidson @revans2 could you please provide the test case so we can be sure? Thanks.

@revans2 (Contributor, Author):

It is a bug.

scala> spark.range(1).selectExpr("id", "ceil(id) as one_arg_sql", "ceil(id, 0) as two_arg_sql").select(col("*"), ceil(col("id")).alias("one_arg_func"), ceil(col("id"), lit(0)).alias("two_arg_func")).printSchema
root
 |-- id: long (nullable = false)
 |-- one_arg_sql: long (nullable = true)
 |-- two_arg_sql: decimal(20,0) (nullable = true)
 |-- one_arg_func: decimal(20,0) (nullable = true)
 |-- two_arg_func: decimal(20,0) (nullable = true)
 

scala> spark.range(1).selectExpr("cast(id as double) as id").selectExpr("id", "ceil(id) as one_arg_sql", "ceil(id, 0) as two_arg_sql").select(col("*"), ceil(col("id")).alias("one_arg_func"), ceil(col("id"), lit(0)).alias("two_arg_func")).printSchema
root
 |-- id: double (nullable = false)
 |-- one_arg_sql: long (nullable = true)
 |-- two_arg_sql: decimal(30,0) (nullable = true)
 |-- one_arg_func: decimal(30,0) (nullable = true)
 |-- two_arg_func: decimal(30,0) (nullable = true) 

Without this patch the SQL and Scala APIs produce different results, and the Scala API produces a result that differs from Spark 3.2.

I documented this in the JIRA https://issues.apache.org/jira/browse/SPARK-38604

After this patch the single argument version behaves like it did in 3.2 and is also consistent with the SQL API.
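For reference, a sketch of the fixed behaviour (the expected types are inferred from the description above and the transcript before the patch; this is not a captured run):

```scala
import org.apache.spark.sql.functions.{ceil, col, lit}

// After the patch, the one-argument forms agree across the SQL and Scala
// APIs and match Spark 3.2:
spark.range(1)
  .selectExpr("id", "ceil(id) as one_arg_sql")
  .select(col("*"), ceil(col("id")).alias("one_arg_func"))
  .printSchema
// one_arg_sql and one_arg_func should both come back as long; only the
// explicit two-argument form ceil(col("id"), lit(0)) still returns
// decimal(20,0).
```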

@revans2 (Contributor, Author):

I added test cases that explicitly check the result type.

From a consistency standpoint, if the return type is going to depend on the scale, then the scale can only ever be a literal value. If we want to break backwards compatibility, then I would suggest that we also fix the overflow issue https://issues.apache.org/jira/browse/SPARK-28135 with a double being rounded to a long. That technically also applies to a double being cast to a decimal type and then rounded.

@cloud-fan (Contributor):

This may not be the right place to discuss it, but this PR reminds me of the new overload def ceil(e: Column, scale: Column). Since we only allow literals as the scale parameter, shall we make its type Int instead of Column? cc @sathiyapk @HyukjinKwon
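A sketch of what that overload change would look like (hypothetical signature; the API under discussion here takes a Column):

```scala
// Hypothetical: scale as Int, converted to a literal internally, so that
// non-literal scales are rejected at compile time rather than at analysis.
def ceil(e: Column, scale: Int): Column = ceil(e, lit(scale))
```

A caller would then write ceil(col("x"), 2) directly, and passing a non-literal column as the scale would no longer type-check.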

@HyukjinKwon (Member):

I'm fine if we're 100% sure scale will be a literal in the future too.

@cloud-fan (Contributor) left a comment:

Let's take this fix first, after adding a test.

It's a bit arguable whether or not ceil(input) and ceil(input, 0) should be treated exactly the same. It seems fine to keep it as is, as there are no backward compatibility issues.

@revans2 (Contributor, Author) commented Mar 21, 2022

The test failure looks to be unrelated. It is in Docker tests that appear to have nothing to do with this change.

[info] org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite *** ABORTED *** (3 minutes, 54 seconds)
[info]   The code passed to eventually never returned normally. Attempted 137 times over 3.0002535127 minutes. (DockerJDBCIntegrationSuite.scala:166)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185)
[info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)

@HyukjinKwon (Member) commented Mar 22, 2022

Merged to master and branch-3.3.

HyukjinKwon pushed a commit that referenced this pull request Mar 22, 2022
…e same as before

This is just the fix. I didn't add any tests yet. I am happy to do it, I just wasn't sure where the right place to put the tests would be. Once I have tests I will cherry-pick this back to the 3.3 branch and put up a PR for that too. I am also happy to update the comments because this is a bit confusing that there is no indication that things have changed.

Closes #35913 from revans2/ceil_floor_no_arg_behavior.

Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 692e4b0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>