[SPARK-42635][SQL][3.3] Fix the TimestampAdd expression #40264

Closed

chenhao-db
Contributor

This is a backport of #40237.

What changes were proposed in this pull request?

This PR fixes the counter-intuitive behaviors of the TimestampAdd expression described in https://issues.apache.org/jira/browse/SPARK-42635. See the user-facing changes below for details.

Does this PR introduce any user-facing change?

Yes. This PR fixes the three problems mentioned in SPARK-42635:

  1. When the timestamp is close to a daylight saving time transition, the result may be discontinuous and non-monotonic.
  2. Adding month, quarter, and year units silently ignores Int overflow during unit conversion.
  3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.

Some examples of the result changes:

Old results:

// In America/Los_Angeles timezone:
timestampadd(DAY, 1, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 (already correct; listed for comparison)
timestampadd(HOUR, 23, 2011-03-12 03:00:00) = 2011-03-13 03:00:00
timestampadd(HOUR, 24, 2011-03-12 03:00:00) = 2011-03-13 03:00:00
timestampadd(SECOND, 86400 - 1, 2011-03-12 03:00:00) = 2011-03-13 03:59:59
timestampadd(SECOND, 86400, 2011-03-12 03:00:00) = 2011-03-13 03:00:00
// In UTC timezone:
timestampadd(quarter, 1431655764, 1970-01-01 00:00:00) = 1969-09-01 00:00:00
timestampadd(day, 106751992, 1970-01-01 00:00:00) = -290308-12-22 15:58:10.448384
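
The two UTC results above are consistent with plain two's-complement wraparound during unit conversion. The sketch below is an illustration in Python, not Spark's code: `wrap_signed`, `checked`, and `add_months` are hypothetical helpers that emulate JVM `Int`/`Long` arithmetic, and the conversion factors (3 months per quarter, 86,400,000,000 microseconds per day) are assumptions inferred from the problem description.

```python
MICROS_PER_DAY = 86_400 * 1_000_000


def wrap_signed(value: int, bits: int) -> int:
    """Emulate silent two's-complement overflow of a signed `bits`-wide integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def checked(value: int, bits: int) -> int:
    """Return value unchanged, or raise if it does not fit in `bits` signed bits."""
    limit = 1 << (bits - 1)
    if not -limit <= value < limit:
        raise OverflowError(f"{value} does not fit in {bits} signed bits")
    return value


def add_months(year: int, month: int, delta: int) -> tuple[int, int]:
    """Shift a (year, month) pair by `delta` months; `month` is 1-based."""
    y, m = divmod(year * 12 + (month - 1) + delta, 12)
    return y, m + 1


# QUARTER -> MONTH in a 32-bit Int: 1431655764 * 3 wraps to -4 months.
months = wrap_signed(1431655764 * 3, 32)
print(months, add_months(1970, 1, months))  # -4 (1969, 9)

# DAY -> microseconds in a 64-bit Long: the product wraps to a negative value
# whose sub-second part matches the .448384 in the old result.
micros = wrap_signed(106751992 * MICROS_PER_DAY, 64)
print(micros < 0, micros % 1_000_000)  # True 448384
```

With a checked conversion, `checked(106751992 * MICROS_PER_DAY, 64)` raises `OverflowError`, while one day fewer (`106751991`) still fits.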

New results:

// In America/Los_Angeles timezone:
timestampadd(DAY, 1, 2011-03-12 03:00:00) = 2011-03-13 03:00:00
timestampadd(HOUR, 23, 2011-03-12 03:00:00) = 2011-03-13 03:00:00
timestampadd(HOUR, 24, 2011-03-12 03:00:00) = 2011-03-13 04:00:00
timestampadd(SECOND, 86400 - 1, 2011-03-12 03:00:00) = 2011-03-13 03:59:59
timestampadd(SECOND, 86400, 2011-03-12 03:00:00) = 2011-03-13 04:00:00
// In UTC timezone:
timestampadd(quarter, 1431655764, 1970-01-01 00:00:00) = throw overflow exception
timestampadd(day, 106751992, 1970-01-01 00:00:00) = throw overflow exception
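
The America/Los_Angeles rows in both tables can be reproduced with a small model. This is a sketch under stated assumptions, not Spark's implementation: `add_split` is a hypothetical model that moves whole days on the wall clock and adds the remainder as elapsed time, which matches the old results, and `add_elapsed` adds the full amount as elapsed time on the UTC timeline, which matches the new results for sub-day units. It relies on Python's standard `zoneinfo` module and the system time zone database.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")
START = datetime(2011, 3, 12, 3, 0, 0, tzinfo=LA)  # the day before DST starts


def add_elapsed(ts: datetime, delta: timedelta) -> datetime:
    """Add elapsed time on the UTC timeline, then convert back to ts's zone."""
    return (ts.astimezone(timezone.utc) + delta).astimezone(ts.tzinfo)


def add_split(ts: datetime, seconds: int) -> datetime:
    """Hypothetical model of the old results: days on the wall clock, rest elapsed."""
    days, rest = divmod(seconds, 86_400)
    shifted = ts + timedelta(days=days)  # same-zone arithmetic keeps the local time of day
    return add_elapsed(shifted, timedelta(seconds=rest))


def show(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d %H:%M:%S")


for seconds in (23 * 3600, 86_400 - 1, 86_400):
    old = show(add_split(START, seconds))
    new = show(add_elapsed(START, timedelta(seconds=seconds)))
    print(seconds, old, new)
# 82800 2011-03-13 03:00:00 2011-03-13 03:00:00
# 86399 2011-03-13 03:59:59 2011-03-13 03:59:59
# 86400 2011-03-13 03:00:00 2011-03-13 04:00:00
```

In the split model, adding one more second (86399 to 86400) moves the result back by almost an hour, which is the non-monotonic behavior described in item 1. Adding the whole amount as elapsed time keeps the result monotonic.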

How was this patch tested?

Existing tests pass, and new tests were added.

Closes apache#40237 from chenhao-db/SPARK-42635.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@github-actions github-actions bot added the SQL label Mar 3, 2023
@chenhao-db
Contributor Author

@MaxGekk Please take a look, thanks for reviewing!

@MaxGekk
Member

MaxGekk commented Mar 3, 2023

@chenhao-db Could you fix the build errors:

[error] /home/runner/work/apache-spark/apache-spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala:2047:7: not found: value checkErrorInExpression
[error]       checkErrorInExpression[SparkArithmeticException](TimestampAdd("DAY",
[error]       ^
[error] /home/runner/work/apache-spark/apache-

@chenhao-db
Contributor Author

@MaxGekk It seems that checkErrorInExpression doesn't exist in 3.3, so I still have to use the old checkExceptionInExpression. Is that okay?

@MaxGekk
Member

MaxGekk commented Mar 4, 2023

Seems like the test failure is related to the changes:

[info] - SPARK-42635: timestampadd unit conversion overflow *** FAILED *** (12 milliseconds)
[info]   (non-codegen mode) Expected error message is `[DATETIME_OVERFLOW] Datetime operation overflow`, but `Datetime operation overflow: add 106751992 DAY to TIMESTAMP '1970-01-01 00:00:00'.` found (ExpressionEvalHelper.scala:176)

@chenhao-db
Contributor Author

@MaxGekk I see. In older versions, Spark doesn't include the error class in the error message: https://github.com/apache/spark/blob/branch-3.3/core/src/main/scala/org/apache/spark/ErrorInfo.scala#L74. I have removed the error class prefix from the expected error message.
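
To illustrate the mismatch in the failure above, here is a minimal sketch in Python. It assumes the test helper checks that the expected text is contained in the actual message; that containment check and the helper `strip_error_class` are assumptions for illustration, not Spark's test code. The two message strings are copied from the failure log.

```python
# Message reported as found on branch-3.3 (copied from the test log above).
ACTUAL = "Datetime operation overflow: add 106751992 DAY to TIMESTAMP '1970-01-01 00:00:00'."

# Expected text carried over from the original PR, with the error class prefix.
EXPECTED_WITH_CLASS = "[DATETIME_OVERFLOW] Datetime operation overflow"


def strip_error_class(expected: str) -> str:
    """Drop a leading "[ERROR_CLASS] " prefix, if there is one."""
    if expected.startswith("[") and "] " in expected:
        return expected.split("] ", 1)[1]
    return expected


print(EXPECTED_WITH_CLASS in ACTUAL)                     # False
print(strip_error_class(EXPECTED_WITH_CLASS) in ACTUAL)  # True
```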

@MaxGekk
Member

MaxGekk commented Mar 4, 2023

+1, LGTM. All GAs passed. Merging to 3.3.
Thank you, @chenhao-db.

MaxGekk pushed a commit that referenced this pull request Mar 4, 2023
Closes #40264 from chenhao-db/cherry-pick-SPARK-42635.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk MaxGekk closed this Mar 4, 2023