
[SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres #26412

Closed · wants to merge 32 commits

Conversation

@yaooqinn (Member) commented Nov 6, 2019

What changes were proposed in this pull request?

Add an analyzer rule to convert unresolved Add, Subtract, etc. to TimeAdd, DateAdd, etc. according to the following policy:

 /**
   * For [[Add]]:
   * 1. if both sides are intervals, stays the same;
   * 2. else if one side is interval, turns it to [[TimeAdd]];
   * 3. else if one side is date, turns it to [[DateAdd]];
   * 4. else stays the same.
   *
   * For [[Subtract]]:
   * 1. if both sides are intervals, stays the same;
   * 2. else if the right side is an interval, turns it to [[TimeSub]];
   * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
   * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
   * 5. else if the left side is date, turns it to [[DateSub]];
   * 6. else stays the same.
   *
   * For [[Multiply]]:
   * 1. If one side is interval, turns it to [[MultiplyInterval]];
   * 2. otherwise, stays the same.
   *
   * For [[Divide]]:
   * 1. If the left side is interval, turns it to [[DivideInterval]];
   * 2. otherwise, stays the same.
   */

Besides, we change datetime functions from implicitly casting input types to strict ones; all available type coercions happen in the DateTimeOperations coercion rule.
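
To make the policy concrete, here is a minimal sketch of the Add branch as an analyzer rule. This is an illustration only: the object name ResolveBinaryArithmeticSketch is ours, and the rule actually merged with this PR also covers Subtract, Multiply, Divide and the PostgreSQL dialect flag.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types._

// Sketch of the Add policy above: rewrite an Add into TimeAdd or DateAdd
// once both children are resolved, otherwise leave it unchanged.
object ResolveBinaryArithmeticSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveExpressions {
    case a @ Add(l, r) if a.childrenResolved => (l.dataType, r.dataType) match {
      // 1. interval + interval stays the same
      case (CalendarIntervalType, CalendarIntervalType) => a
      // 2. one side is an interval: turn it into TimeAdd on the other side
      case (CalendarIntervalType, _) => Cast(TimeAdd(r, l), r.dataType)
      case (_, CalendarIntervalType) => Cast(TimeAdd(l, r), l.dataType)
      // 3. one side is a date: turn it into DateAdd
      case (DateType, _) => DateAdd(l, r)
      case (_, DateType) => DateAdd(r, l)
      // 4. anything else stays the same (plain numeric Add)
      case _ => a
    }
  }
}
```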

Why are the changes needed?

Feature parity between PostgreSQL and Spark, and making the null semantics consistent within Spark.

Does this PR introduce any user-facing change?

  1. date_add/date_sub functions now only accept int/tinyint/smallint as the second argument; double/string etc. are forbidden, as in Hive, because they produce weird results.
  2. datetime arithmetic operations become NullIntolerant, e.g. select timestamp'1999-12-31 00:00:00' - null is valid now (see the sketch below).
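
A quick spark-shell illustration of both changes (a sketch; the exact error message wording is not reproduced here):

```scala
// Change 1: only integral second arguments are accepted by date_add/date_sub.
spark.sql("select date_add(date'2011-11-11', 1)").show()
// spark.sql("select date_add(date'2011-11-11', 1.5)")  // now throws AnalysisException

// Change 2: datetime +/- null now resolves and evaluates to NULL.
spark.sql("select timestamp'1999-12-31 00:00:00' - null").show()
```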

How was this patch tested?

Added unit tests.

@HyukjinKwon (Member)

cc @maropu and @MaxGekk

@SparkQA commented Nov 6, 2019

Test build #113314 has finished for PR 26412 at commit 0b293db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Nov 27, 2019

Ah, I see. The change looks reasonable to me. Just in case, can you check the behaviour in the other systems?

@yaooqinn (Member Author)

Also checked with Presto:

presto> select date('1900-01-01') - null;
 _col0
-------
 NULL
(1 row)

Query 20191127_065501_00001_9md27, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

@SparkQA commented Nov 27, 2019

Test build #114511 has finished for PR 26412 at commit e7225a3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author)

retest this please

case Subtract(l @ DateType(), r @ IntegerType()) => DateSub(l, r)
case Subtract(l @ DateType(), r @ NullType()) => DateSub(l, Cast(r, IntegerType))
case Subtract(l @ DateType(), r @ DateType()) =>
Member

Can we merge the multiple rules above into one, like this?

      case b @ BinaryOperator(l @ DateType(), r @ NullType()) =>
        b.withNewChildren(Seq(l, Cast(r, IntegerType)))

Member Author

If so, we might leave a subtle bug here: with spark.sql.optimizer.maxIterations=1, the first pass only inserts the cast, and there is no second pass to transform the expression to DateAdd.

Member

Hmm..., I personally think that behaviour looks a little weird to me. Probably the root cause is that Subtract(l @ DateType(), r @ NullType()).checkInputDataTypes.isSuccess returns true. To fix this issue, we might need to modify that check code to return false. cc: @cloud-fan

Member Author

FYI, Add with a numeric type and a null type is also handled in TypeCoercion.

case Subtract(l @ DateType(), r @ DateType()) =>
if (SQLConf.get.usePostgreSQLDialect) DateDiff(l, r) else SubtractDates(l, r)
case Subtract(l @ TimestampType(), r @ TimestampType()) =>
SubtractTimestamps(l, r)
case Subtract(l @ TimestampType(), r @ DateType()) =>
SubtractTimestamps(l, Cast(r, TimestampType))
case Subtract(l @ TimestampType(), r @ NullType()) =>
Contributor

how about null - timestamp?

Member Author

Yes, we need them too; checked with PostgreSQL.

@SparkQA commented Nov 27, 2019

Test build #114528 has finished for PR 26412 at commit e7225a3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 27, 2019

Test build #114534 has finished for PR 26412 at commit b925517.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment

All the BinaryArithmetic operators are NullIntolerant. Why is this only applied to Date/Timestamp types?

@yaooqinn (Member Author) commented Dec 2, 2019

IIUC, NullIntolerant is not related to this type coercion issue.

@cloud-fan (Contributor)

I think it's all because we hack the Add operator to do date add. Now we need to add more hacks in the type coercion rules.

How about we create UnresolvedAdd in the parser, and convert it to either Add or DateAdd in the analyzer?
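
For reference, a hypothetical sketch of what such a placeholder could look like (this is our illustration of the idea, not the PR's final code; the PR ultimately moved away from this approach, as discussed below):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedException
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, Unevaluable}
import org.apache.spark.sql.types.DataType

// Hypothetical parser-only placeholder: it never resolves, so the analyzer
// must rewrite it into Add, DateAdd, TimeAdd, etc. based on the child types.
case class UnresolvedAdd(left: Expression, right: Expression)
  extends BinaryExpression with Unevaluable {
  override lazy val resolved: Boolean = false
  override def nullable: Boolean = left.nullable || right.nullable
  override def dataType: DataType = throw new UnresolvedException(this, "dataType")
}
```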

@yaooqinn (Member Author) commented Dec 2, 2019

I think it's all because we hack the Add operator to do date add. Now we need to add more hacks in the type coercion rules.

How about we create UnresolvedAdd in the parser, and convert it to either Add or DateAdd in the analyzer?

This is better; I will follow this suggestion, thanks.

Cast(TimeSub(l, r), l.dataType)
case (CalendarIntervalType, TimestampType | DateType | StringType) =>
Cast(TimeSub(r, l), r.dataType)
case (DateType | NullType, DateType) => if (conf.usePostgreSQLDialect) {
Contributor

Do we need to handle NullType here? Subtract should work for null.

Member Author

Yeah, the result is the same but the expressions are not semantically equal; is that OK?

Member Author

Actually no: Subtract(null, date) will not pass type checking.

} else {
SubtractDates(l, r)
}
case (TimestampType, TimestampType | DateType | NullType) => SubtractTimestamps(l, r)
Contributor

ditto

SubtractDates(l, r)
}
case (TimestampType, TimestampType | DateType | NullType) => SubtractTimestamps(l, r)
case (DateType | NullType, TimestampType) => SubtractTimestamps(Cast(l, TimestampType), r)
Contributor

ditto

case (_, _) => Subtract(l, r)
}
case UnresolvedMultiply(l, r) => (l.dataType, r.dataType) match {
case (CalendarIntervalType, _: NumericType | NullType) => MultiplyInterval(l, r)
Contributor

ditto

}
case UnresolvedSubtract(l, r) => (l.dataType, r.dataType) match {
case (TimestampType | DateType | StringType, CalendarIntervalType) =>
Cast(TimeSub(l, r), l.dataType)
Member Author

I notice that TimeSub is replaceable by TimeAdd(l, UnaryMinus(r)), which makes it redundant.

@SparkQA commented Dec 2, 2019

Test build #114718 has finished for PR 26412 at commit e8b75ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

It seems like UnresolvedBinaryExpression brings some trouble and may add maintenance overhead.

How about this:

  1. We still create Add in the parser.
  2. Type coercion rules only deal with the normal Add operation, e.g. int + int, interval + interval.
  3. A new rule, ResolveBinaryArithmetic, finds the unresolved Add and turns it into DateAdd, etc., depending on the data types (a sketch of the Subtract branch follows below).
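
A sketch of what step 3's Subtract branch could look like, mirroring the policy from the PR description (our simplification; the merged rule also honors the PostgreSQL dialect flag when choosing DateDiff vs SubtractDates):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

// Dispatch a Subtract by its children's types, per the policy: interval -
// interval stays; x - interval becomes TimeAdd with a negated interval;
// timestamps subtract as timestamps; dates as dates or via DateSub.
def resolveSubtract(s: Subtract): Expression = (s.left.dataType, s.right.dataType) match {
  case (CalendarIntervalType, CalendarIntervalType) => s
  case (_, CalendarIntervalType) => Cast(TimeAdd(s.left, UnaryMinus(s.right)), s.left.dataType)
  case (TimestampType, _) | (_, TimestampType) =>
    SubtractTimestamps(Cast(s.left, TimestampType), Cast(s.right, TimestampType))
  case (_, DateType) => SubtractDates(s.left, s.right)
  case (DateType, _) => DateSub(s.left, s.right)   // date - int
  case _ => s                                      // plain numeric subtract
}
```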

@SparkQA commented Dec 5, 2019

Test build #114886 has finished for PR 26412 at commit 5dd632c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class UnresolvedBinaryExpression(operator: String)
  • case class UnresolvedAdd(left: Expression, right: Expression)

@yaooqinn (Member Author) commented Dec 5, 2019

It seems like UnresolvedBinaryExpression brings some trouble and may add maintenance overhead.

How about this:

  1. We still create Add in the parser.
  2. Type coercion rules only deal with the normal Add operation, e.g. int + int, interval + interval.
  3. A new rule, ResolveBinaryArithmetic, finds the unresolved Add and turns it into DateAdd, etc., depending on the data types.

I simply replaced the UnresolvedXX expressions and changed them back to the old ones.

@SparkQA commented Dec 5, 2019

Test build #114893 has finished for PR 26412 at commit c84d46e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Dec 5, 2019

retest this please

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
Contributor

unnecessary change

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
Contributor

ditto

-- !query 13 schema
struct<date_sub(DATE '2001-10-01', 7):date>
struct<CAST(TIMESTAMP '2011-11-11 11:11:11' + INTERVAL '2 days' AS TIMESTAMP):timestamp>
Contributor

Can we avoid adding the cast if it's not necessary?

Contributor

OK, it's the existing behavior too; we can revisit it later.

-- !query 17 schema
struct<DATE '2019-01-01':date>
struct<CAST(CAST(2011-11-11 AS TIMESTAMP) - INTERVAL '2 days' AS STRING):string>
Contributor

It's super weird that this returns a string. What was the behavior before?

Contributor

OK it's the existing behavior. We can revisit it later.

Member Author

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDTIMinus.java#L111-121

   // Allowed operations:
   // IntervalYearMonth - IntervalYearMonth = IntervalYearMonth
   // Date - IntervalYearMonth = Date (operands not reversible)
   // Timestamp - IntervalYearMonth = Timestamp (operands not reversible)
   // IntervalDayTime - IntervalDayTime = IntervalDayTime
   // Date - IntervalDayTime = Timestamp (operands not reversible)
   // Timestamp - IntervalDayTime = Timestamp (operands not reversible)
   // Timestamp - Timestamp = IntervalDayTime
   // Date - Date = IntervalDayTime
   // Timestamp - Date = IntervalDayTime (operands reversible)
   // Date - Int = Date

Hive's behavior is more convincing; we can check this later.

@cloud-fan (Contributor)

looks pretty good, let's see how tests go this time.

@SparkQA commented Dec 5, 2019

Test build #114896 has finished for PR 26412 at commit c84d46e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 5, 2019

Test build #114897 has finished for PR 26412 at commit a44948e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in b9cae37 on Dec 5, 2019
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
… Postgres


Closes apache#26412 from yaooqinn/SPARK-29774.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon (Member) commented Dec 16, 2019

Sorry, @cloud-fan, I just checked the cc.

The result is unexpected. In ResolveAlias, we only generate the alias if expression is resolved. How does pyspark generate alias for its Row object? cc @HyukjinKwon

I don't think there are any differences in how column names are generated in PySpark specifically.

@gengliangwang (Member)

@yaooqinn Thanks for the work, but I can't tell the behavior before this PR from the PR description and discussions. I would suggest adding that to the PR description as well.

I had to check with Spark 2.4.4 to find the previous behavior:

> spark.sql("select timestamp'1999-12-31 00:00:00' - null").show()
org.apache.spark.sql.AnalysisException: cannot resolve '(TIMESTAMP('1999-12-31 00:00:00.0') - NULL)' due to data type mismatch: differing types in '(TIMESTAMP('1999-12-31 00:00:00.0') - NULL)' (timestamp and null).; line 1 pos 7;
'Project [unresolvedalias((946627200000000 - null), None)]
+- OneRowRelation
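
For contrast, after this PR the same query resolves and evaluates to NULL. A minimal check (a sketch; the result column name depends on the resolved expression):

```scala
// NULL instead of an AnalysisException, consistent with PostgreSQL/Presto.
val row = spark.sql("select timestamp'1999-12-31 00:00:00' - null").collect().head
assert(row.isNullAt(0))
```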

@yaooqinn (Member Author)

Hi @gengliangwang, thanks for your suggestion; I have updated the description. Can you check whether it is clear enough?

@cloud-fan (Contributor)

date_add/date_sub functions only accept int/tinyint/smallint as the second arg

Do you mean time_add/time_sub? BTW we should have a migration guide for it.

@yaooqinn (Member Author)

It is date_add and date_sub; we have made them ExpectsInputTypes (see the sketch below). I'll raise a follow-up to add a migration guide.
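
Roughly, ExpectsInputTypes lets an expression declare strict input types so the analyzer rejects anything else instead of implicitly casting. A sketch, simplified relative to Spark's actual DateAdd, whose input types this PR pins to a date plus an integral type:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

// Simplified DateAdd-like expression: the second argument must be an
// int/smallint/tinyint, so a double or string argument fails type checking.
case class DateAddSketch(startDate: Expression, days: Expression)
  extends BinaryExpression with ExpectsInputTypes {
  override def left: Expression = startDate
  override def right: Expression = days
  override def inputTypes: Seq[AbstractDataType] =
    Seq(DateType, TypeCollection(IntegerType, ShortType, ByteType))
  override def dataType: DataType = DateType
  // Dates are stored internally as days since the epoch, so adding days
  // is plain integer addition.
  override protected def nullSafeEval(start: Any, d: Any): Any =
    start.asInstanceOf[Int] + d.asInstanceOf[Number].intValue()
}
```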

cloud-fan pushed a commit that referenced this pull request Dec 18, 2019
…ate_sub

### What changes were proposed in this pull request?

Add a migration guide for date_add and date_sub to indicate their behavior change. It is a follow-up for #26412.

### Why are the changes needed?
add a migration guide

### Does this PR introduce any user-facing change?

yes, doc change

### How was this patch tested?

no

Closes #26932 from yaooqinn/SPARK-29774-f.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Mar 24, 2020
…ate_add/date_sub functions

### What changes were proposed in this pull request?

#26412 introduced a behavior change: the `date_add`/`date_sub` functions can't accept string and double values as the second parameter. This is reasonable, as it's error-prone to cast string/double to int at runtime.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases where the string literal is indeed an integer, this PR proposes to add ansi_cast for string literals in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compile time because of constant folding.
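
A sketch of the resulting behavior in a spark-shell session (the exact error wording is an assumption and varies):

```scala
spark.sql("select date_add('2011-11-11', '1')").show()  // OK: '1' ansi-casts to 1, as in 2.4
// spark.sql("select date_add('2011-11-11', 'x')")      // fails at compile time via constant folding
```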

### Why are the changes needed?

avoid breaking changes

### Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4

### How was this patch tested?

new tests.

Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Mar 24, 2020
…ate_add/date_sub functions

(cherry picked from commit 1d0f549)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ate_add/date_sub functions
