
[SPARK-29364][SQL] Return an interval from datediff() in the ANSI mode and Spark dialect #26034

Closed
MaxGekk wants to merge 5 commits

Conversation

@MaxGekk commented Oct 5, 2019

What changes were proposed in this pull request?

Modified the DateDiff expression to return CalendarIntervalType according to the SQL standard; see 4.5.3 Operations involving datetimes and intervals. This affects the datediff function and date subtraction. By default, the expression returns an integer, the number of days between the dates, but when spark.sql.ansi.enabled is set to true and spark.sql.dialect is Spark, the result type is the INTERVAL type. For example:

> select date'tomorrow' - date'yesterday';
interval 2 days

Note: I haven't found any DBMS that returns an interval for date subtraction so far.

Why are the changes needed?

  • To conform to the SQL standard, which states that the result type of date operand 1 - date operand 2 must be the interval type.
  • Improve Spark SQL UX and allow mixing dates and timestamps in subtractions. For example: select timestamp'now' + (date'2019-10-01' - date'2019-09-15')

Does this PR introduce any user-facing change?

Yes, when the ANSI mode is enabled in the Spark dialect.
Before, the query below returns the number of days:

spark-sql> SET spark.sql.ansi.enabled=true;
spark.sql.ansi.enabled	true
spark-sql> SET spark.sql.dialect=Spark;
spark.sql.dialect	Spark
spark-sql> select date'2019-10-05' - date'2018-09-01';
399

After, it returns an interval:

spark-sql> SET spark.sql.ansi.enabled=true;
spark.sql.ansi.enabled	true
spark-sql> SET spark.sql.dialect=Spark;
spark.sql.dialect	Spark
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days

How was this patch tested?

@MaxGekk changed the title from "[SPARK-29364][SQL] Return intervals from date subtract according to SQL standard" to "[SPARK-29364][SQL] Return an interval from date subtract according to SQL standard" on Oct 5, 2019
@dongjoon-hyun left a comment

Hi, @MaxGekk. We agreed to use the POSTGRESQL dialect for this kind of new behavior.
Given that, the new ANSI-compatible behavior should return an interval for the POSTGRESQL dialect instead of the SPARK dialect (which is what this PR does). Is there a reason why you aim to apply this to the default SPARK dialect?

override def dataType: DataType = IntegerType
private val returnInterval: Boolean = {
  val isSparkDialect = SQLConf.get.getConf(DIALECT) == Dialect.SPARK.toString()
  SQLConf.get.ansiEnabled && isSparkDialect
}
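Presumably the dataType override then branches on this flag; a minimal sketch of the post-change shape (an assumption, not a quote from the patch):

// Sketch: return an interval only when ANSI mode is on and the dialect is
// Spark; otherwise keep the legacy integer result.
override def dataType: DataType =
  if (returnInterval) CalendarIntervalType else IntegerType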
@dongjoon-hyun:

Maybe the one-liner SQLConf.get.ansiEnabled && SQLConf.get.getConf(DIALECT) == Dialect.POSTGRESQL.toString is enough. Please note that I suggested Dialect.POSTGRESQL.

@MaxGekk (author):

Please note that I suggested Dialect.POSTGRESQL

Doing what you suggest will change the results of date.sql:

-- !query 46
 SELECT f1 - date '2000-01-01' AS `Days From 2K` FROM DATE_TBL
 -- !query 46 schema
-struct<Days From 2K:int>
+struct<Days From 2K:interval>
 -- !query 46 output
--1035
--1036
--1037
--1400
--1401
--1402
--1403
--15542
--15607
-13977
-14343
-14710
-91
-92
-93
+interval -2 years -10 months
+interval -2 years -10 months -1 days
+interval -2 years -9 months -4 weeks -2 days
+interval -3 years -10 months
+interval -3 years -10 months -1 days
+interval -3 years -10 months -2 days
+interval -3 years -9 months -4 weeks -2 days
+interval -42 years -6 months -2 weeks -4 days
+interval -42 years -8 months -3 weeks -1 days
+interval 3 months
+interval 3 months 1 days
+interval 3 months 2 days
+interval 38 years 3 months 1 weeks
+interval 39 years 3 months 1 weeks 1 days
+interval 40 years 3 months 1 weeks 2 days

@dongjoon-hyun:

That test suite aims to test the ANSI and PostgreSQL dialect behavior. Given that, regenerating the golden file is the correct solution here. Or, if we want to test only the Spark dialect there, we can turn off the ANSI mode before that test case by adding a one-line configuration (and turn it back on after that test case).

cc @maropu, @gatorsmile, @gengliangwang.

@@ -836,7 +836,7 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     }
   }
 
-  test("datediff") {
+  test("datediff returns an integer") {
@dongjoon-hyun:

Although we didn't touch the other test cases, at least for this one we had better have withSQLConf(SQLConf.ANSI_ENABLED.key -> "true", SQLConf.DIALECT.key -> "... here too.

@MaxGekk (author):

In that case, it will not test SQLConf.ANSI_ENABLED.key -> "false", which is the default behavior.

@dongjoon-hyun:

Sorry, I blindly copied that. What I meant was setting SQLConf to the default configuration explicitly, because the default configurations can be changed easily.

@MaxGekk (author):

I think it would be better to add asserts that check this assumption; otherwise, if the defaults of the SQL configs are changed in the future, this test will no longer check the default behavior.
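A minimal sketch contrasting the two options discussed above (the literals and test names are illustrative assumptions, styled after the existing suite, where DateDiff, Literal, and checkEvaluation are in scope):

import java.sql.Date

// Option A (reviewer): pin the configuration explicitly around the case.
test("datediff returns an integer (pinned conf)") {
  withSQLConf(SQLConf.ANSI_ENABLED.key -> "false") {
    checkEvaluation(
      DateDiff(Literal(Date.valueOf("2015-07-24")), Literal(Date.valueOf("2015-07-21"))),
      3)
  }
}

// Option B (author): assert the default instead, so the test fails loudly
// if the default ever changes rather than silently pinning a stale value.
test("datediff returns an integer (asserted default)") {
  assert(!SQLConf.get.ansiEnabled, "test assumes ANSI mode is off by default")
  checkEvaluation(
    DateDiff(Literal(Date.valueOf("2015-07-24")), Literal(Date.valueOf("2015-07-21"))),
    3)
}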

@SparkQA commented Oct 5, 2019

Test build #111807 has finished for PR 26034 at commit efc9cb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk commented Oct 5, 2019

Is there a reason why you aim to apply this to the default SPARK dialect?

Because PostgreSQL returns integers for date subtraction:

# select pg_typeof(date'2019-10-05' - date'1970-01-01');
 pg_typeof 
-----------
 integer
(1 row)

I don't know how I can use the PostgreSQL dialect for this behavior.

@MaxGekk commented Oct 5, 2019

Hi, @MaxGekk. We agreed to use the POSTGRESQL dialect for this kind of new behavior.

Could you point out where we agreed? Just in case, these changes are not related to PostgreSQL at all.

@dongjoon-hyun commented Oct 5, 2019

Oh, PostgreSQL doesn't follow ANSI like this PR does?
If so, please add another DBMS reference for this ANSI feature to the PR description.
If this is different from the existing Apache Spark behavior and also different from PostgreSQL, it makes the decision on this PR much harder.

@MaxGekk commented Oct 6, 2019

Oh, PostgreSQL doesn't follow ANSI like this PR does?

It doesn't for date subtraction.

If this is different from the existing Apache Spark behavior and also different from PostgreSQL, it makes the decision on this PR much harder.

This introduces new behavior only in a non-default configuration, when spark.sql.ansi.enabled = true, which by definition means "When true, Spark tries to conform to the ANSI SQL specification ..."

If so, please add another DBMS reference for this ANSI feature to the PR description.

What is the purpose of such a comparison? Would you like to see how others conform to the SQL standard? If, let's say, most of them don't, should Spark not either?

@dongjoon-hyun commented Oct 6, 2019

@MaxGekk, if a popular DBMS like PostgreSQL ignores that part of the ANSI SQL standard, why do we need to respect it? We are not a SQL reference model; we are trying to improve our SQL users' UX. If there is no meaningful reference for this, I also need to give -1 to this. That's the reason why I asked you to strengthen the description. You need to show that this PR is not a kind of over-engineering that no actual users are familiar with.

@MaxGekk commented Oct 6, 2019

If a popular DBMS like PostgreSQL ignores that part of the ANSI SQL standard, why do we need to respect it?

I haven't found any DBMS that returns an interval for date subtraction so far.

we are trying to improve our SQL users' UX.

This is more interesting. Let's look at a use case: a user needs to find the difference between two date columns and add it to a timestamp column. Right now that is hard to do because date subtraction returns an integer, but timestamp +/- accepts only intervals:

spark-sql> select timestamp'now' + (date'2019-10-01' - date'2019-09-15');
Error in query: cannot resolve '(TIMESTAMP('2019-10-06 22:42:28.374') + datediff(DATE '2019-10-01', DATE '2019-09-15'))' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 22:42:28.374') + datediff(DATE '2019-10-01', DATE '2019-09-15'))' (timestamp and int).; line 1 pos 7;
'Project [unresolvedalias((1570390948374000 + datediff(18170, 18154)), None)]
+- OneRowRelation

and we cannot cast INT to INTERVAL:

spark-sql> select timestamp'now' + cast(date'2019-10-01' - date'2019-09-15' as interval);
Error in query: cannot resolve 'CAST(datediff(DATE '2019-10-01', DATE '2019-09-15') AS INTERVAL)' due to data type mismatch: cannot cast int to interval; line 1 pos 24;
'Project [unresolvedalias((1570391177861000 + cast(datediff(18170, 18154) as interval)), None)]
+- OneRowRelation
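With the proposed change, the same expression type-checks because the subtraction yields an interval; a sketch of the expected session (the timestamp literal and echoed output are illustrative assumptions):

spark-sql> SET spark.sql.ansi.enabled=true;
spark-sql> select timestamp'2019-10-06 00:00:00' + (date'2019-10-01' - date'2019-09-15');
2019-10-22 00:00:00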

@dongjoon-hyun:

Thank you for the detailed explanation, @MaxGekk. Please state the following clearly in the PR description. That is what I can recommend at this stage.

I haven't found any DBMS that returns an interval for date subtraction so far.

@dongjoon-hyun:

cc @cloud-fan, @gatorsmile, @maropu again since there is no reference for this.

@cloud-fan (Contributor):

I looked into this problem a while ago. IIRC, some DBs do not support dateValue - dateValue; they can only subtract date values via functions like datediff.

The SQL standard defines the behavior of dateValue - dateValue, but not a specific function. It seems that the common behavior of the datediff function is to return an int. Shall we keep it and only change dateValue - dateValue?
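Under that proposal, only the operator form would change, while the function keeps its INT result. A sketch, reusing the outputs shown earlier in this PR:

spark-sql> select datediff(date'2019-10-05', date'2018-09-01');
399
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days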

@SparkQA commented Oct 7, 2019

Test build #111845 has finished for PR 26034 at commit a55fe7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk commented Oct 8, 2019

Shall we keep it and only change dateValue - dateValue?

I would agree with you. I will create a separate expression DateSubtract (in a separate PR) that returns the INTERVAL type and use it for dateValue - dateValue. I like this approach, actually.
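A minimal sketch of the assumed shape of such an expression (the follow-up actually landed as SubtractDates via PR #26112, per the commit message below; the helper and its signature are assumptions):

import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types._

// Dates are stored internally as days since the epoch, so the expression can
// delegate to a days-to-interval helper (assumed here to be in DateTimeUtils).
case class SubtractDates(left: Expression, right: Expression)
    extends BinaryExpression with ImplicitCastInputTypes {
  override def inputTypes: Seq[AbstractDataType] = Seq(DateType, DateType)
  override def dataType: DataType = CalendarIntervalType
  override def nullSafeEval(end: Any, start: Any): Any = {
    DateTimeUtils.subtractDates(end.asInstanceOf[Int], start.asInstanceOf[Int])
  }
}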

@MaxGekk changed the title from "[SPARK-29364][SQL] Return an interval from date subtract according to SQL standard" to "[SPARK-29364][SQL] Return an interval from datediff() in the ANSI mode and Spark dialect" on Oct 14, 2019
@wangyum closed this in d11cbf2 Oct 16, 2019
wangyum pushed a commit that referenced this pull request Oct 16, 2019
… SQL standard

### What changes were proposed in this pull request?
Proposed a new expression `SubtractDates`, which is used in `date1` - `date2`. It has the `INTERVAL` type and returns the interval between `date1` (inclusive) and `date2` (exclusive). For example:
```sql
> select date'tomorrow' - date'yesterday';
interval 2 days
```

Closes #26034

### Why are the changes needed?
- To conform to the SQL standard, which states that the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3  Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- Improve Spark SQL UX and allow mixing date and timestamp in subtractions. For example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')`

### Does this PR introduce any user-facing change?
Before, the query below returns the number of days:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
399
```
After, it returns an interval:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days
```

### How was this patch tested?
- by new tests in `DateExpressionsSuite` and `TypeCoercionSuite`.
- by existing tests in `date.sql`

Closes #26112 from MaxGekk/date-subtract.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
(cherry picked from commit d11cbf2)
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Dec 9, 2019
@MaxGekk deleted the date-diff-interval branch June 5, 2020 19:41