
[SPARK-29440][SQL] Support java.time.Duration as an external type of CalendarIntervalType #26092

Closed

Conversation

@MaxGekk (Member) commented Oct 11, 2019

What changes were proposed in this pull request?

In this PR, I propose to convert values of the Catalyst CalendarIntervalType to java.time.Duration values when such values are needed outside of Spark, for example in UDFs. If an INTERVAL value has a non-zero months field, the months are converted to a number of seconds assuming 2629746 seconds per month. This average number of seconds per month follows from the average year length of the Gregorian calendar, 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar): 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746.
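
A minimal sketch of this conversion rule in plain Scala (the helper name below is hypothetical and only illustrates the arithmetic; it is not the PR's actual code):

import java.time.Duration

// Average seconds per month as derived above: 60 * 60 * 24 * 365.2425 / 12
val SECONDS_PER_MONTH = 2629746L

// Hypothetical helper: the months component is approximated by the fixed constant,
// the microseconds component is carried over exactly.
def intervalToDuration(months: Int, microseconds: Long): Duration =
  Duration.ofSeconds(months * SECONDS_PER_MONTH).plusNanos(microseconds * 1000L)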

For example, in a UDF:

scala> val plusDay = udf((i: java.time.Duration) => i.plusDays(1))
plusDay: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$1855/165450258@485996f7,CalendarIntervalType,List(Some(Schema(CalendarIntervalType,true))),None,true,true)

scala> val df = spark.sql("SELECT interval 40 minutes as i")
df: org.apache.spark.sql.DataFrame = [i: interval]

scala> df.show
+-------------------+                                                           
|                  i|
+-------------------+
|interval 40 minutes|
+-------------------+

scala> df.select(plusDay('i)).show(false)
+--------------------------+
|UDF(i)                    |
+--------------------------+
|interval 1 days 40 minutes|
+--------------------------+

I added an implicit encoder for java.time.Duration which allows creating Spark dataframes from external collections:

scala> Seq(Duration.ofDays(10), Duration.ofHours(10)).toDS.show(false)
+-----------------------+
|value                  |
+-----------------------+
|interval 1 weeks 3 days|
|interval 10 hours      |
+-----------------------+

Why are the changes needed?

This should allow users to:

  • Write UDFs over interval inputs
  • Use Java 8 libraries for java.time.Duration in manipulations on collected values or in UDFs
  • Create dataframes from collections of java.time.Duration values.

Does this PR introduce any user-facing change?

Yes. Currently, collect() returns the non-public class CalendarInterval:

scala> spark.sql("select interval 1 week").collect().apply(0).get(0).isInstanceOf[org.apache.spark.unsafe.types.CalendarInterval]
res2: Boolean = true

After the changes:

scala> spark.sql("select interval 1 week").collect().apply(0).get(0).isInstanceOf[Duration]
res8: Boolean = true
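
A usage sketch of what this enables, assuming the proposed conversion is in place (this is not output from the PR itself):

// Row.getAs would then return a java.time.Duration directly
val d = spark.sql("select interval 1 week").collect()(0).getAs[java.time.Duration](0)
d.toDays  // 7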

How was this patch tested?

  • Added new tests to CatalystTypeConvertersSuite to check conversion of CalendarIntervalType to/from java.time.Duration
  • Added tests to JavaUDFSuite / UDFSuite for usage of the Duration type in Java/Scala UDFs.

@MaxGekk (Member Author) commented Oct 11, 2019

ping @cloud-fan @hvanhovell

@hvanhovell (Contributor)

I have a bit of a problem with the assumption that a month has a fixed number of seconds. This is just not true: the duration of a month is not a constant and should be determined by the current date.

If you really want to go down this path, then we need to make sure an interval can only be month-based or microsecond-based.
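
To illustrate the point with plain java.time (a standalone example, not code from this PR):

import java.time.LocalDate
import java.time.temporal.ChronoUnit

// "One month" spans a different number of days depending on the starting date
val jan31 = LocalDate.of(2019, 1, 31)
val mar31 = LocalDate.of(2019, 3, 31)
ChronoUnit.DAYS.between(jan31, jan31.plusMonths(1))  // 28
ChronoUnit.DAYS.between(mar31, mar31.plusMonths(1))  // 30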

@MaxGekk (Member Author) commented Oct 12, 2019

... should be determined by the current date.

@hvanhovell Strictly speaking, it depends not only on the current date but on the current time zone as well. The final result of <local date-time> + interval 1 month 10 hours can differ between your time zone and mine because we may have different Daylight Saving Time rules. So, if you and I collect intervals and add them to some dates, we could get different results.
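
For instance, the same calendar-day shift covers a different amount of physical time depending on the zone (a standalone java.time illustration, not code from this PR):

import java.time.{Duration, LocalDateTime, ZoneId}

val ldt = LocalDateTime.of(2019, 3, 30, 12, 0)
val ams = ldt.atZone(ZoneId.of("Europe/Amsterdam"))  // DST starts the next day
val tokyo = ldt.atZone(ZoneId.of("Asia/Tokyo"))      // no DST
Duration.between(ams.toInstant, ams.plusDays(1).toInstant)      // PT23H
Duration.between(tokyo.toInstant, tokyo.plusDays(1).toInstant)  // PT24H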

Also, the result depends on the order in which the interval components are applied. Adding the interval's months to a date and then its microseconds, versus adding the microseconds first and then the months, can produce different dates. For now, the CalendarInterval class does not define the order.
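
A small illustration of the ordering issue with plain java.time, using hours in place of microseconds (not code from this PR):

import java.time.LocalDateTime

val t = LocalDateTime.of(2019, 1, 30, 12, 0)
t.plusMonths(1).plusHours(24)  // 2019-03-01T12:00
t.plusHours(24).plusMonths(1)  // 2019-02-28T12:00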

... the assumption that a month has a fixed number of seconds. This is just not true ...

Maybe this is not true, but such an assumption has already been made by Spark in some places, for example:

private def parseDuration(duration: String): Long = {
  val cal = CalendarInterval.fromCaseInsensitiveString(duration)
  if (cal.milliseconds < 0 || cal.months < 0) {
    throw new IllegalArgumentException(s"Provided duration ($duration) is not positive")
  }
  cal.milliseconds + cal.months * MILLIS_PER_MONTH
}

def getDelayMs(delay: CalendarInterval): Long = {
  delay.milliseconds + delay.months * MILLIS_PER_MONTH
}


and other DBMSs like PostgreSQL have a predefined constant month duration:
https://github.com/postgres/postgres/blob/97c39498e5ca9208d3de5a443a2282923619bf91/src/include/datatype/timestamp.h#L77, which is used in the calculation of interval length: https://github.com/postgres/postgres/blob/bffe1bd68457e43925c362d8728ce3b25bdf1c94/src/backend/utils/adt/timestamp.c#L5016-L5022

I just want to say that such an assumption can be made, as other DBMSs do.

If you really want to go down this path, then we need to make sure an interval can only be month-based or microsecond-based.

This is what I have been trying to say in many PRs like #25022 and #25981 (comment)

One more thing: where do intervals in Spark come from? They can be specified in SQL queries (via interval literals written by a user; they cannot be loaded from data sources because the interval type is not supported there now) or appear as the result of timestamp1 - timestamp2. The latter (timestamp subtraction) always sets months to zero:

new CalendarInterval(0, end.asInstanceOf[Long] - start.asInstanceOf[Long])

So, months can be non-zero only if specified by a user in a SQL query. Your objection relates only to this use case. If the user has strong concerns about the average month duration, they could change the interval literal, or we could fail the query if the months component is non-zero in the conversion to Duration. This is arguable, but we could introduce a special SQL config for this and fail the query with a clear error message, if you insist.

@MaxGekk (Member Author) commented Oct 12, 2019

jenkins, retest this, please

@SparkQA commented Oct 12, 2019

Test build #111944 has finished for PR 26092 at commit 4aeb9b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member Author) commented Oct 12, 2019

I think this should be hidden behind a flag, for example spark.sql.datetime.java8API.enabled, or a separate SQL config only for the interval conversion.
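
For reference, the existing flag can be set from user code like this (a usage sketch; whether this proposal would be tied to that particular config, or to a new dedicated one, is exactly the open question):

spark.conf.set("spark.sql.datetime.java8API.enabled", true)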

@MaxGekk (Member Author) commented Oct 13, 2019

@srowen (Member) commented Oct 13, 2019

(I don't think we need a feature flag for this.)

The problem here is that java.time.Duration isn't quite the same thing as the INTERVAL type. The latter supports (separately) a number of months because, of course, a month isn't a fixed unit of time. Duration does not.

It seems fine to support Duration -> INTERVAL conversion as it is not ambiguous. The other way is ambiguous if there are months involved. I could imagine allowing it for INTERVAL with 0 months but continuing to fail otherwise? At least we would not introduce surprising behavior while supporting the unambiguous cases.

I don't think an interval is, or should be, timezone-dependent. It's a number of microseconds and months, and the timezone only matters w.r.t. converting epoch time to/from a date/time representation. That is, yes, "date + interval" can be different in different timezones, but this is true just because a date maps to different times depending on the timezone.

What would this do in PySpark?
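
A minimal sketch of the 0-months policy suggested above (the helper is hypothetical, not code from this PR):

import java.time.Duration
import java.time.temporal.ChronoUnit

// Convert only microsecond-based intervals; refuse anything with a months component.
def toDurationStrict(months: Int, microseconds: Long): Duration = {
  require(months == 0, "Cannot convert an interval with a non-zero months component to java.time.Duration")
  Duration.of(microseconds, ChronoUnit.MICROS)
}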

@MaxGekk (Member Author) commented Oct 13, 2019

@cloud-fan (Contributor)

My opinion on this is to expose the CalendarInterval class directly, with 2 new methods: extractPeriod and extractDuration. Semantically, CalendarInterval is Java's Period + Duration. I don't think we can map CalendarInterval to Duration, as it's kind of a truncation.

Like @MaxGekk said, we could also separate the interval type into a year-month interval and a day-time interval. But it's a lot of effort to change the type system, and it is not compatible with Parquet.
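
A hedged sketch of what the two suggested methods could return for an interval's months and microseconds components (the implementation below is an assumption, not Spark code):

import java.time.{Duration, Period}
import java.time.temporal.ChronoUnit

// Split the calendar part (months) from the exact-time part (microseconds).
def extractPeriod(months: Int): Period = Period.ofMonths(months)
def extractDuration(microseconds: Long): Duration = Duration.of(microseconds, ChronoUnit.MICROS)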

@MaxGekk (Member Author) commented Oct 15, 2019

I don't think we can map CalendarInterval to Duration as it's kind of a truncation.

OK. What about mapping java.time.Duration (and maybe java.time.Period) to CalendarInterval?

@MaxGekk (Member Author) commented Oct 16, 2019

I am closing this since there is no consensus.

@MaxGekk closed this Oct 16, 2019
@MaxGekk deleted the interval-duration-extrn-type branch June 5, 2020