[SPARK-34605][SQL] Support java.time.Duration as an external type of the day-time interval type #31729

Closed
wants to merge 4 commits into from

MaxGekk
Member

@MaxGekk MaxGekk commented Mar 3, 2021

What changes were proposed in this pull request?

This PR extends the Spark SQL API to accept java.time.Duration as an external type of the recently added Catalyst type DayTimeIntervalType (see #31614). The Java class java.time.Duration has semantics similar to the ANSI SQL day-time interval type, which makes it the most suitable external type for DayTimeIntervalType. In more detail:

  1. Added DurationConverter, which converts java.time.Duration instances to and from the internal representation of the Catalyst type DayTimeIntervalType (a Long). The DurationConverter object uses new methods of IntervalUtils:
    • durationToMicros() converts the input duration to its total length in microseconds. If the duration is too large to fit in a Long, the method throws an ArithmeticException. Note: the input duration has nanosecond precision; the method converts the nanosecond part to microseconds by dividing by 1000, discarding any sub-microsecond remainder.
    • microsToDuration() obtains a java.time.Duration representing the given number of microseconds.
  2. Added support for the new type DayTimeIntervalType in RowEncoder via the methods createDeserializerForDuration() and createSerializerForJavaDuration().
  3. Extended the Literal API to construct literals from java.time.Duration instances.
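
For readers who want to see the conversion described in item 1 in isolation, here is a minimal sketch that uses only java.time. The object name DurationMicrosSketch and its constants are hypothetical and this is not the code added to IntervalUtils; it only follows the behavior stated above (total length in microseconds, ArithmeticException on overflow, nanoseconds divided by 1000).

```scala
import java.time.Duration
import java.time.temporal.ChronoUnit

// Hypothetical standalone object for illustration; not Spark's IntervalUtils.
object DurationMicrosSketch {
  private val MicrosPerSecond = 1000000L
  private val NanosPerMicro = 1000L

  // Total length of the duration in microseconds.
  // multiplyExact/addExact throw ArithmeticException if the result does not fit in a Long.
  def durationToMicros(duration: Duration): Long = {
    val wholeSeconds = Math.multiplyExact(duration.getSeconds, MicrosPerSecond)
    // getNano is always in [0, 999999999]; integer division drops sub-microsecond digits.
    Math.addExact(wholeSeconds, duration.getNano / NanosPerMicro)
  }

  // Builds a Duration from a number of microseconds.
  def microsToDuration(micros: Long): Duration = Duration.of(micros, ChronoUnit.MICROS)
}
```

With this sketch, DurationMicrosSketch.durationToMicros(Duration.ofDays(10)) yields 864000000000L, and passing Duration.ofSeconds(Long.MaxValue) raises an ArithmeticException.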

Why are the changes needed?

  1. To allow users to parallelize collections of java.time.Duration and construct day-time interval columns from them, and to collect such columns back to the driver.
  2. To make it possible to write tests in other sub-tasks of SPARK-27790.

Does this PR introduce any user-facing change?

Yes. The PR extends existing functionality: users can parallelize instances of the java.time.Duration class and collect them back:

scala> val ds = Seq(java.time.Duration.ofDays(10)).toDS
ds: org.apache.spark.sql.Dataset[java.time.Duration] = [value: daytimeinterval]

scala> ds.collect
res0: Array[java.time.Duration] = Array(PT240H)
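
The collected value prints as PT240H rather than as 10 days because java.time.Duration renders itself in ISO-8601 form using hours, minutes and seconds only. A small sketch without Spark shows the same text form (the object name DurationToStringSketch is hypothetical):

```scala
import java.time.Duration

// Hypothetical object for illustration; it only exercises java.time.
object DurationToStringSketch {
  // Ten days are stored as a number of seconds; toString has no day field.
  val tenDays: Duration = Duration.ofDays(10)

  // ISO-8601 text form, the same string shown in the collect output above.
  val rendered: String = tenDays.toString // "PT240H"
}
```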

How was this patch tested?

  • Added tests to CatalystTypeConvertersSuite to check conversion from/to java.time.Duration.
  • Checked row encoding with new tests in RowEncoderSuite.
  • Tested construction of DayTimeIntervalType literals in LiteralExpressionSuite.
  • Checked collecting in DatasetSuite and JavaDatasetSuite.

@SparkQA

SparkQA commented Mar 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40303/

@github-actions github-actions bot added the SQL label Mar 3, 2021
@SparkQA

SparkQA commented Mar 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40303/

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135721 has finished for PR 31729 at commit b5b021e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-34605][SQL] Support java.time.Duration as an external type of the day-time interval type [SPARK-34605][SQL] Support java.time.Duration as an external type of the day-time interval type Mar 4, 2021
@MaxGekk
Member Author

MaxGekk commented Mar 4, 2021

@cloud-fan @HyukjinKwon @dongjoon-hyun @viirya @maropu @yaooqinn Could you review this PR, please.

@yaooqinn
Member

yaooqinn commented Mar 4, 2021

I believe that Duration fits the ANSI day-time interval well. Not directly related to this PR, but what will we use for the year-month?

@MaxGekk
Member Author

MaxGekk commented Mar 4, 2021

what will we use for the year-month?

@yaooqinn I think of java.time.Period (https://docs.oracle.com/javase/8/docs/api/java/time/Period.html)

@yaooqinn
Member

yaooqinn commented Mar 4, 2021

what will we use for the year-month?

@yaooqinn I think of java.time.Period (https://docs.oracle.com/javase/8/docs/api/java/time/Period.html)

Does it mean that we only use the period with year and month fields and fail for the day part?

BTW, what is the behavior of DataFrame.show()? Will it be something like PT1H2H3S?

@MaxGekk
Member Author

MaxGekk commented Mar 4, 2021

Does it mean that we only use the period with year and month fields ...

Yes, we will use only year and month fields.

and fail for the day part?

Let's discuss this in https://issues.apache.org/jira/browse/SPARK-34615
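
To make the question concrete: java.time.Period carries years, months and days as separate fields, and its toTotalMonths method only combines the first two. The sketch below illustrates just that java.time behavior. It does not show how Spark treats the day part, which is left to SPARK-34615; the object name PeriodMonthsSketch is hypothetical.

```scala
import java.time.Period

// Hypothetical object for illustration; it does not reflect any Spark decision.
object PeriodMonthsSketch {
  // 1 year, 2 months, 3 days
  val period: Period = Period.of(1, 2, 3)

  // Years and months combined: 1 * 12 + 2. The day field is not included.
  val totalMonths: Long = period.toTotalMonths

  // The day part stays separate and would need its own handling.
  val leftoverDays: Int = period.getDays
}
```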

BTW, what is the behavior of DataFrame.show()? Will it be something like PT1H2H3S?

No. It will be in a format that conforms to the ANSI SQL standard, like:

INTERVAL '12 45:20:59.123' DAY TO SECOND

BTW, the format PT1H2H3S is from the ISO-8601 standard.
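
To illustrate the difference between the two notations, here is a rough sketch that renders a java.time.Duration in an ANSI-style day-time shape. The helper toAnsiString, the object AnsiDayTimeSketch, and the exact layout (zero-padded fields, six fractional digits, sign inside the quotes) are assumptions made for this example; the thread does not show the formatter Spark uses.

```scala
import java.time.Duration

// Hypothetical helper for illustration only; not part of Spark.
object AnsiDayTimeSketch {
  private val MicrosPerSecond = 1000000L
  private val MicrosPerMinute = 60L * MicrosPerSecond
  private val MicrosPerHour = 60L * MicrosPerMinute
  private val MicrosPerDay = 24L * MicrosPerHour

  def toAnsiString(duration: Duration): String = {
    // Total microseconds; throws ArithmeticException if the value does not fit in a Long.
    val totalMicros = Math.addExact(
      Math.multiplyExact(duration.getSeconds, MicrosPerSecond),
      duration.getNano / 1000L)
    val sign = if (totalMicros < 0) "-" else ""
    val abs = Math.abs(totalMicros) // assumes totalMicros != Long.MinValue
    val days = abs / MicrosPerDay
    val hours = abs / MicrosPerHour % 24
    val minutes = abs / MicrosPerMinute % 60
    val seconds = abs / MicrosPerSecond % 60
    val micros = abs % MicrosPerSecond
    f"INTERVAL '$sign$days $hours%02d:$minutes%02d:$seconds%02d.$micros%06d' DAY TO SECOND"
  }
}
```

Under these assumptions, the ISO-8601 string P2DT3H4M parsed with Duration.parse is rendered as INTERVAL '2 03:04:00.000000' DAY TO SECOND.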

Seq(
  "PT20.123456S",
  "P2DT3H4M",
  "P2D").foreach { intervalStr =>
Contributor

I think many people are not familiar with this format. Can we add comments or use a different way to create Duration?

Member

Can we add some negative cases?

Member Author

@cloud-fan I now create the durations using other Duration methods. I also added negative tests.
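
As an illustration of what such a change can look like, the ISO-8601 strings from the snippet under review can be replaced by Duration factory methods, which read more plainly. This is a sketch only: the object name DurationFactorySketch and the particular negative values are assumptions, and it does not reproduce the tests committed in this PR.

```scala
import java.time.Duration

// Hypothetical object for illustration; not the test code from the PR.
object DurationFactorySketch {
  // Same value as Duration.parse("PT20.123456S"): 20 seconds plus 123456 microseconds.
  val seconds: Duration = Duration.ofSeconds(20, 123456000L)

  // Same value as Duration.parse("P2DT3H4M"): 2 days, 3 hours, 4 minutes.
  val dayTime: Duration = Duration.ofDays(2).plusHours(3).plusMinutes(4)

  // Same value as Duration.parse("P2D").
  val days: Duration = Duration.ofDays(2)

  // Negative cases: negate an existing duration or pass a negative amount.
  val negativeDayTime: Duration = dayTime.negated()
  val negativeDays: Duration = Duration.ofDays(-2)
}
```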

}

test("SPARK-34605: construct literals from arrays of java.time.Duration") {
  val duration0 = Duration.parse("P2DT3H4M")
Contributor

ditto

@yaooqinn
Member

yaooqinn commented Mar 4, 2021

It's better we update the sql-ref-datatypes.md within this PR or create a separate JIRA to do it later.

@MaxGekk
Member Author

MaxGekk commented Mar 4, 2021

It's better we update the sql-ref-datatypes.md

@yaooqinn Let's update docs separately. I opened the sub-task SPARK-34619 for that.

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40335/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40335/

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135752 has finished for PR 31729 at commit b826afd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

4 participants