[SPARK-29364][SQL] Return an interval from datediff() in the ANSI mode and Spark dialect #26034

Conversation
Hi, @MaxGekk . We agreed to use the POSTGRESQL dialect for this kind of new behavior. Given that, the new ANSI-compatible behavior should return `Interval` for the POSTGRESQL dialect instead of the SPARK dialect (which is what this PR aims to do). Is there a reason why you aim to use this in the default SPARK dialect?
override def dataType: DataType = IntegerType

private val returnInterval: Boolean = {
  val isSparkDialect = SQLConf.get.getConf(DIALECT) == Dialect.SPARK.toString()
  SQLConf.get.ansiEnabled && isSparkDialect
Maybe a one-liner `SQLConf.get.ansiEnabled && SQLConf.get.getConf(DIALECT) == Dialect.POSTGRESQL.toString` is enough. Please note that I suggested `Dialect.POSTGRESQL`.
> Please note that I suggested `Dialect.POSTGRESQL`.

If you do what you suggest, this will change the results of `date.sql`:
-- !query 46
SELECT f1 - date '2000-01-01' AS `Days From 2K` FROM DATE_TBL
-- !query 46 schema
-struct<Days From 2K:int>
+struct<Days From 2K:interval>
-- !query 46 output
--1035
--1036
--1037
--1400
--1401
--1402
--1403
--15542
--15607
-13977
-14343
-14710
-91
-92
-93
+interval -2 years -10 months
+interval -2 years -10 months -1 days
+interval -2 years -9 months -4 weeks -2 days
+interval -3 years -10 months
+interval -3 years -10 months -1 days
+interval -3 years -10 months -2 days
+interval -3 years -9 months -4 weeks -2 days
+interval -42 years -6 months -2 weeks -4 days
+interval -42 years -8 months -3 weeks -1 days
+interval 3 months
+interval 3 months 1 days
+interval 3 months 2 days
+interval 38 years 3 months 1 weeks
+interval 39 years 3 months 1 weeks 1 days
+interval 40 years 3 months 1 weeks 2 days
That test suite aims to test the ANSI and PostgreSQL dialect. Given that, regenerating the golden file would be a correct solution to the question. Or, if we want to test only the Spark dialect there, we can turn off the ANSI mode before that test case by adding a one-line configuration (and turn it back on after that test case); a rough sketch follows below.
cc @maropu , @gatorsmile , @gengliangwang .
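For reference, a rough sketch of that one-line toggle around the query in `date.sql` (this assumes a plain `set` statement is acceptable in the golden-file test; the exact mechanism may differ):

```sql
set spark.sql.ansi.enabled=false;

SELECT f1 - date '2000-01-01' AS `Days From 2K` FROM DATE_TBL;

set spark.sql.ansi.enabled=true;
```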
@@ -836,7 +836,7 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     }
   }

-  test("datediff") {
+  test("datediff returns an integer") {
Although we didn't touch the other test cases, at least for this one we had better have `withSQLConf(SQLConf.ANSI_ENABLED.key -> "true", SQLConf.DIALECT.key -> "...` here too.
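A rough illustration of wrapping the test body in `withSQLConf` so it does not depend on the session defaults (the config value and dates below are placeholders, not the reviewer's exact, truncated snippet):

```scala
// Inside DateExpressionsSuite; assumes the suite's existing imports
// (java.sql.Date, Literal, DateDiff, SQLConf) and a withSQLConf helper.
test("datediff returns an integer") {
  // Pin the config this expectation relies on instead of relying on the default.
  withSQLConf(SQLConf.ANSI_ENABLED.key -> "false") {
    checkEvaluation(
      DateDiff(Literal(Date.valueOf("2015-07-24")), Literal(Date.valueOf("2015-07-22"))),
      2)
  }
}
```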
In that case, it will not test `SQLConf.ANSI_ENABLED.key -> "false"`, which is the default behavior.
Sorry, I blindly copied. What I meant was setting `SQLConf` with the default configuration. The default configurations can be changed easily.
I think it would be better to add asserts to check this assumption; otherwise, if the defaults of the SQL configs are changed in the future, this test will not check the default behavior.
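A minimal sketch of such asserts, assuming the `DIALECT`/`Dialect` names from the diff above:

```scala
// Hypothetical sketch: fail fast if the session defaults this test relies on ever change.
assert(!SQLConf.get.ansiEnabled, "this test assumes the ANSI mode is off by default")
assert(SQLConf.get.getConf(DIALECT) == Dialect.SPARK.toString,
  "this test assumes the Spark dialect is the default")
```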
Test build #111807 has finished for PR 26034 at commit
Because PostgreSQL returns integers for date subtraction:
# select pg_typeof(date'2019-10-05' - date'1970-01-01');
 pg_typeof
-----------
 integer
(1 row)
I don't know how I can use the PostgreSQL dialect for this behavior.
Could you point out where we agreed? Just in case, these changes are not related to PostgreSQL at all.
Oh, it doesn't in date subtraction.
This introduces new behavior in a non-default configuration, when `spark.sql.ansi.enabled` is set to `true`.
What is the purpose of such a comparison? Would you like to see how others conform to the SQL standard? If, let's say, most of them don't, should Spark also not?
@MaxGekk . If a popular DBMS like PostgreSQL ignores that part of the ANSI SQL standard, why do we need to respect it? We are not a SQL reference model; we are trying to improve our SQL users' UX. If there is no meaningful reference for this, I also need to give -1 for this. That's the reason why I ask you to strengthen the description. You need to prove that this PR is not a kind of over-engineering with which no actual users are familiar.
I haven't found any DBMS that returns an interval for date subtraction so far.
This is more interesting. Let's look at a use case: a user needs to find the difference between two date columns and add it to a timestamp column. Currently this is hard to do because date subtraction returns an integer, but timestamp +/- accepts only intervals:
spark-sql> select timestamp'now' + (date'2019-10-01' - date'2019-09-15');
Error in query: cannot resolve '(TIMESTAMP('2019-10-06 22:42:28.374') + datediff(DATE '2019-10-01', DATE '2019-09-15'))' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 22:42:28.374') + datediff(DATE '2019-10-01', DATE '2019-09-15'))' (timestamp and int).; line 1 pos 7;
'Project [unresolvedalias((1570390948374000 + datediff(18170, 18154)), None)]
+- OneRowRelation
and we cannot cast to the interval type either:
spark-sql> select timestamp'now' + cast(date'2019-10-01' - date'2019-09-15' as interval);
Error in query: cannot resolve 'CAST(datediff(DATE '2019-10-01', DATE '2019-09-15') AS INTERVAL)' due to data type mismatch: cannot cast int to interval; line 1 pos 24;
'Project [unresolvedalias((1570391177861000 + cast(datediff(18170, 18154) as interval)), None)]
+- OneRowRelation
Thank you for the detailed explanation, @MaxGekk . Please add the following into the PR description clearly. That is what I can recommend at this stage.
cc @cloud-fan , @gatorsmile , @maropu again since there is no reference for this.
I looked into this problem a while ago. IIRC some DBs do not support subtracting one date from another directly. The SQL standard defines the behavior of date subtraction.
Test build #111845 has finished for PR 26034 at commit
I would agree with you. I will create a separate expression for date subtraction.
… SQL standard

### What changes were proposed in this pull request?
Proposed a new expression `SubtractDates` which is used in `date1` - `date2`. It has the `INTERVAL` type and returns the interval between `date1` (inclusive) and `date2` (exclusive). For example:
```sql
> select date'tomorrow' - date'yesterday';
interval 2 days
```
Closes #26034

### Why are the changes needed?
- To conform to the SQL standard, which states that the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- To improve Spark SQL UX and allow mixing dates and timestamps in subtractions, for example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')`

### Does this PR introduce any user-facing change?
Before, the query below returns the number of days:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
399
```
After, it returns an interval:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days
```

### How was this patch tested?
- By new tests in `DateExpressionsSuite` and `TypeCoercionSuite`.
- By existing tests in `date.sql`.

Closes #26112 from MaxGekk/date-subtract.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
(cherry picked from commit d11cbf2)
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
What changes were proposed in this pull request?
Modified the `DateDiff` expression to return `CalendarIntervalType` according to the SQL standard; see 4.5.3 Operations involving datetimes and intervals. This affects the `datediff` function and date subtraction. By default, the expression returns an integer which is the number of days between the dates, but when `spark.sql.ansi.enabled` is set to `true` and `spark.sql.dialect` is `Spark`, the result type is the `INTERVAL` type; see the example below.

Note: I haven't found any DBMS that returns an interval for date subtraction so far.
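A minimal illustration of the new behavior, borrowed from the commit message quoted above (it assumes `spark.sql.ansi.enabled=true` and `spark.sql.dialect=Spark`):

```sql
> select date'tomorrow' - date'yesterday';
interval 2 days
```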
Why are the changes needed?
- To conform to the SQL standard, which states that the result type of `date operand 1` - `date operand 2` must be the interval type. See 4.5.3 Operations involving datetimes and intervals.
- To improve Spark SQL UX and allow mixing dates and timestamps in subtractions, for example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')`
Does this PR introduce any user-facing change?
Yes, when the ANSI mode is enabled in the Spark dialect.
Before, the query below returns the number of days:
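```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
399
```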
After, it returns an interval:
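```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days
```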
How was this patch tested?
- By new tests in `DateExpressionsSuite`.
- By the examples of `DateDiff` that are checked due to [SPARK-21914][SQL][TESTS] Check results of expression examples #25942