[SPARK-29440][SQL] Support java.time.Duration as an external type of CalendarIntervalType #26092
Conversation
ping @cloud-fan @hvanhovell
I have a bit of a problem with the assumption that a month has a fixed number of seconds. This is just not true: the duration of a month is not a constant and should be determined by the current date. If you really want to go down this path, then we need to make sure an interval can only be month-based or microsecond-based.
@hvanhovell Strictly speaking, it depends not only on the current date but on the current time zone as well. Also, the result depends on the order in which interval components are applied.
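Both effects can be seen with plain `java.time`, independent of Spark: the span of one month depends on the starting date, and applying the same components in a different order gives different results.

```java
import java.time.Duration;
import java.time.LocalDate;

// Demonstrates that the span of "one month" depends on the start date,
// and that the order of applying interval components matters.
public class MonthSpan {
    static long monthDays(LocalDate start) {
        return Duration.between(start.atStartOfDay(),
                                start.plusMonths(1).atStartOfDay()).toDays();
    }

    public static void main(String[] args) {
        System.out.println(monthDays(LocalDate.of(2019, 1, 1))); // 31
        System.out.println(monthDays(LocalDate.of(2019, 2, 1))); // 28

        LocalDate start = LocalDate.of(2019, 1, 30);
        System.out.println(start.plusMonths(1).plusDays(1)); // 2019-03-01
        System.out.println(start.plusDays(1).plusMonths(1)); // 2019-02-28
    }
}
```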
Maybe this is not true, but such an assumption has already been made by Spark in some places, for example:
- spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/GroupStateImpl.scala (lines 161 to 168 in 18b7ad2)
- lines 29 to 31 in 18b7ad2
- spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (line 631 in 18b7ad2)

Other DBMSs such as PostgreSQL also use a predefined constant month duration: https://github.com/postgres/postgres/blob/97c39498e5ca9208d3de5a443a2282923619bf91/src/include/datatype/timestamp.h#L77, which is used in the calculation of interval length: https://github.com/postgres/postgres/blob/bffe1bd68457e43925c362d8728ce3b25bdf1c94/src/backend/utils/adt/timestamp.c#L5016-L5022. I just want to say that such an assumption can be made, as other DBMSs do.
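The constant in question can be reproduced directly from the Gregorian average year length; a quick sketch of the arithmetic (not Spark code):

```java
// Derives the 2629746-seconds-per-month constant from the average
// Gregorian year of 365.2425 days (the 400-year cycle: 146097 days / 400).
public class AvgMonth {
    static final double SECONDS_PER_YEAR = 60 * 60 * 24 * 365.2425; // 31556952.0
    static final double SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12;  // 2629746.0

    public static void main(String[] args) {
        System.out.println(SECONDS_PER_YEAR);
        System.out.println(SECONDS_PER_MONTH);
    }
}
```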
This is what I have been trying to say in many PRs, like #25022 and #25981 (comment). One more thing: where do intervals in Spark come from? They can be specified in SQL queries (via interval literals, by a user, but cannot be loaded from datasources because the interval type is not supported now), or they can appear as the result of (line 2111 in eecef75)
months can be non-zero only if it is specified by a user in a SQL query. Your objection relates only to this use case. If the user has strong concerns about the average month duration, he/she could change the interval literal, or we could fail the query if the months component is non-zero in the conversion to Duration. This is arguable, but we could introduce a special SQL config for this and fail the query with a clear error message, if you insist.
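A minimal sketch of what such a guard could look like; `allowApproxMonths` is a hypothetical stand-in for the suggested SQL config, not a real Spark flag:

```java
// Hypothetical guard sketch: reject intervals with a months component
// unless the (illustrative) flag explicitly allows an average-month
// approximation.
public class MonthsGuard {
    static void checkMonths(int months, boolean allowApproxMonths) {
        if (months != 0 && !allowApproxMonths) {
            throw new IllegalArgumentException(
                "Interval has months = " + months + "; cannot convert to "
                + "java.time.Duration without assuming an average month length");
        }
    }

    public static void main(String[] args) {
        checkMonths(0, false); // fine: no months component
        checkMonths(3, true);  // fine: approximation explicitly allowed
        // checkMonths(3, false) would throw IllegalArgumentException
    }
}
```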
jenkins, retest this, please
Test build #111944 has finished for PR 26092 at commit
I think this should be hidden under a flag. For example, …
(I don't think we need a feature flag for this.) The problem here is that … It seems fine to support … I don't think an interval is, or should be, timezone-dependent. It's a number of microseconds and months, and the timezone only matters w.r.t. converting epoch time to/from a date/time representation. That is, yes, "date + interval" can be different in different timezones, but this is true just because a date maps to different times depending on the timezone. What would this do in PySpark?
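The timezone point can be illustrated with `java.time`: adding a fixed duration to an instant is zone-independent, while the wall-clock reading of the result is not. This sketch assumes the standard `America/Los_Angeles` tzdata rules (DST began on 2019-03-10):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Adding a fixed 24-hour Duration to an instant is timezone-independent,
// but the wall-clock result depends on the zone's offset rules.
public class DstShift {
    public static void main(String[] args) {
        Instant start = Instant.parse("2019-03-10T08:00:00Z"); // 00:00 in Los Angeles
        Instant later = start.plus(Duration.ofHours(24));

        int laHour = later.atZone(ZoneId.of("America/Los_Angeles")).getHour();  // 1 (DST began)
        int utcHour = later.atZone(ZoneOffset.UTC).getHour();                   // 8
        System.out.println(laHour + " " + utcHour);
    }
}
```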
I would use …
My opinion on this is to expose … Like @MaxGekk said, we can also separate the interval type into a year-month interval and a day-time interval. But it's a lot of effort to change the type system, and it is not compatible with Parquet.
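For comparison, `java.time` already separates the two kinds of amounts that a year-month/day-time split would introduce: `Period` is calendar-based, `Duration` is fixed-length. A sketch, not Spark code:

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.Period;

// Period ~ year-month interval (calendar-based),
// Duration ~ day-time interval (fixed-length).
public class IntervalKinds {
    public static void main(String[] args) {
        Period yearMonth = Period.of(1, 2, 0);   // 1 year 2 months
        Duration dayTime = Duration.ofHours(36); // exactly 36 hours

        // Period arithmetic is date-aware (Jan 31 + 1 month clamps to Feb 28):
        LocalDate end = LocalDate.of(2019, 1, 31).plus(Period.ofMonths(1));
        System.out.println(yearMonth.toTotalMonths() + " "
            + dayTime.getSeconds() + " " + end);
    }
}
```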
ok. What about mapping of …
I am closing this since there is no consensus. |
What changes were proposed in this pull request?
In the PR, I propose to convert values of the `CalendarIntervalType` Catalyst type to `java.time.Duration` values when such values are needed outside of Spark, for example in UDFs. If an `INTERVAL` value has a non-zero `months` field, it is converted to a number of seconds assuming `2629746` seconds per month. This average number of seconds per month is derived by assuming that the average year of the Gregorian calendar is `365.2425` days long (see https://en.wikipedia.org/wiki/Gregorian_calendar): `60 * 60 * 24 * 365.2425` = `31556952.0` = `12 * 2629746`. For example:
I added an implicit encoder for `java.time.Duration`, which allows creating a Spark dataframe from an external collection:

Why are the changes needed?
This should allow users to:
- use `java.time.Duration` in manipulations on collected values or in UDFs;
- create dataframes from collections of `java.time.Duration` values.

Does this PR introduce any user-facing change?
Yes. Currently, `collect()` returns the non-public class `CalendarInterval`:

After the changes:
How was this patch tested?
- `CatalystTypeConvertersSuite` checks the conversion of `CalendarIntervalType` to/from `java.time.Duration`.
- `JavaUDFSuite` / `UDFSuite` test the usage of the `Duration` type in Scala/Java UDFs.