
[SPARK-36323][SQL] Support ANSI interval literals for TimeWindow #33551

Closed
sarutak wants to merge 3 commits into apache:master from sarutak:window-interval

Conversation


@sarutak sarutak commented Jul 28, 2021

What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for TimeWindow.

Why are the changes needed?

Watermark already supports ANSI interval literals, so it makes sense to support them for TimeWindow as well.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test.

@github-actions github-actions bot added the SQL label Jul 28, 2021

SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46264/


SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46264/


SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46275/


SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46275/


SparkQA commented Jul 28, 2021

Test build #141751 has finished for PR 33551 at commit 85eb021.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  case NonFatal(e) =>
    throw QueryCompilationErrors.cannotParseTimeDelayError(interval, e)
}
cal.days * MICROS_PER_DAY + cal.microseconds
Member:

Not related to the PR but:

  1. `*` and `+` can overflow. I think we should use exact arithmetic ops here.
  2. One day in `CalendarInterval` can be 23, 24, or 25 hours; see the class doc note: "one day may be equal to 23, 24 or 25 hours (daylight saving)".
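To illustrate point 1, a minimal sketch (hypothetical names, not the PR's actual code): `Math.multiplyExact` and `Math.addExact` throw `ArithmeticException` on overflow instead of silently wrapping.

```scala
// Hypothetical sketch of point 1: exact arithmetic ops fail loudly on
// overflow instead of silently wrapping around.
object ExactMicros {
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // days/microseconds mirror CalendarInterval's fields (names assumed here)
  def toMicros(days: Int, microseconds: Long): Long =
    Math.addExact(Math.multiplyExact(days.toLong, MicrosPerDay), microseconds)
}
```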

Member Author:

OK, let's fix this in this PR.

Member Author:

For 1, I've fixed it in this PR.
For 2, it's about intervals, not timestamps, so we don't need to consider daylight saving here, right?

Member:

For 2, it's about intervals, not timestamps, so we don't need to consider daylight saving here, right?

OK. Let's keep the assumption of 24 hours per day here. I'm not sure it will work correctly across daylight saving transitions, but it seems we don't have enough test coverage for that at the moment.

Maybe open a JIRA to test the case?
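As an aside, a quick `java.time` check (illustrative only, not from the PR) shows why a wall-clock day is not always 24 hours:

```scala
import java.time.{Duration, LocalDateTime, ZoneId}

// Illustration: Europe/Paris springs forward on 2021-03-28 (02:00 -> 03:00),
// so that calendar day spans only 23 hours of real time.
val zone = ZoneId.of("Europe/Paris")
val dayStart = LocalDateTime.of(2021, 3, 28, 0, 0).atZone(zone)
val dayEnd = dayStart.plusDays(1) // same wall-clock time, next calendar day
val elapsedHours = Duration.between(dayStart.toInstant, dayEnd.toInstant).toHours
// elapsedHours is 23, not 24
```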

Member Author:

Hmm, I'll check it out and open a JIRA.

    throw new IllegalArgumentException(
      s"Intervals greater than a month is not supported ($interval).")
val ymIntervalErrMsg = s"Intervals greater than a month is not supported ($interval)."
val cal = try {
Member:

This is similar to

val parsedDelay = try {
  if (delayThreshold.toLowerCase(Locale.ROOT).trim.startsWith("interval")) {
    CatalystSqlParser.parseExpression(delayThreshold) match {
      case Literal(months: Int, _: YearMonthIntervalType) =>
        new CalendarInterval(months, 0, 0)
      case Literal(micros: Long, _: DayTimeIntervalType) =>
        new CalendarInterval(0, 0, micros)
    }
  } else {
    IntervalUtils.stringToInterval(UTF8String.fromString(delayThreshold))
  }
} catch {
  case NonFatal(e) =>
    throw QueryCompilationErrors.cannotParseTimeDelayError(delayThreshold, e)
}
Could you put the code into a common helper function and re-use it here and in Dataset?

Member Author:

At first I tried to create a common helper function, but the two pieces of code are slightly different, so I didn't.
Still, it's OK to try to factor out the similar parts.
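A rough, self-contained sketch of the factoring idea (all names hypothetical; the real code would delegate to `CatalystSqlParser`/`IntervalUtils` rather than the stub parser below):

```scala
import scala.util.control.NonFatal

// Hypothetical sketch: one shared helper parses an interval string and
// funnels any parse failure through a caller-supplied error constructor,
// so TimeWindow and Dataset.withWatermark could reuse the same
// try/catch shape with their own error messages.
final case class SimpleInterval(months: Int, days: Int, micros: Long)

// Stub stand-in for IntervalUtils.stringToInterval / CatalystSqlParser.
def parseInterval(s: String): SimpleInterval = {
  val DayPattern = """(\d+)\s+days?""".r
  s.trim.toLowerCase match {
    case DayPattern(d) => SimpleInterval(0, d.toInt, 0L)
    case other => throw new IllegalArgumentException(s"cannot parse: $other")
  }
}

def parseIntervalOrThrow(s: String, wrap: Throwable => Throwable): SimpleInterval =
  try parseInterval(s) catch { case NonFatal(e) => throw wrap(e) }
```

Each call site then supplies only its own error wrapper, e.g. `parseIntervalOrThrow(interval, e => QueryCompilationErrors.cannotParseTimeDelayError(interval, e))`.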

import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampNTZType, TimestampType}

class TimeWindowSuite extends SparkFunSuite with ExpressionEvalHelper with PrivateMethodTester {

Member:

Let's avoid unnecessary changes. This can cause conflicts downstream.

Member Author:

Oops. Thanks.

Seq(StructField("start", TimestampNTZType), StructField("end", TimestampNTZType))))
}

test("SPARK-36323: Support ANSI interval literals for TimeWindow") {
Member:

Could you test the case when `spark.sql.legacy.interval.enabled` is `true`?


SparkQA commented Jul 28, 2021

Test build #141762 has finished for PR 33551 at commit 2db8f0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AlterTableColumnCommand extends UnaryCommand
  • case class AlterTableAddColumns(
  • case class AlterTableReplaceColumns(


SparkQA commented Jul 28, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46295/


SparkQA commented Jul 28, 2021

Test build #141783 has finished for PR 33551 at commit 5b80f21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@MaxGekk MaxGekk left a comment


+1, LGTM. Merging to master/3.2.
Thank you, @sarutak .

@MaxGekk MaxGekk closed this in db18866 Jul 29, 2021
MaxGekk pushed a commit that referenced this pull request Jul 29, 2021
### What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for `TimeWindow`.

### Why are the changes needed?

Watermark already supports ANSI interval literals, so it makes sense to support them for `TimeWindow` as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33551 from sarutak/window-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit db18866)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Aug 3, 2021
…legacy.interval.enabled is true

### What changes were proposed in this pull request?

This PR adds test considering the case `spark.sql.legacy.interval.enabled` is `true` for SPARK-35815.

### Why are the changes needed?

SPARK-35815 (#33456) changed `Dataset.withWatermark` to accept ANSI interval literals as `delayThreshold`, but I noticed the change didn't work with `spark.sql.legacy.interval.enabled=true`.
We couldn't detect this issue because there was no test covering the legacy interval type at that time.
SPARK-36323 (#33551) resolved the issue, but it's better to add a test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33606 from sarutak/test-watermark-with-legacy-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Aug 3, 2021
…legacy.interval.enabled is true

(Backport; same commit message as the previous commit.)

Closes #33606 from sarutak/test-watermark-with-legacy-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 92cdb17)
Signed-off-by: Max Gekk <max.gekk@gmail.com>