[SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres #26412

yaooqinn · 2019-11-06T11:24:54Z

What changes were proposed in this pull request?

Add an analyzer rule to convert unresolved Add, Subtract, etc. to TimeAdd, DateAdd, etc. according to the following policy:

 /**
   * For [[Add]]:
   * 1. if both side are interval, stays the same;
   * 2. else if one side is interval, turns it to [[TimeAdd]];
   * 3. else if one side is date, turns it to [[DateAdd]] ;
   * 4. else stays the same.
   *
   * For [[Subtract]]:
   * 1. if both side are interval, stays the same;
   * 2. else if the right side is an interval, turns it to [[TimeSub]];
   * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
   * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
   * 5. else if the left side is date, turns it to [[DateSub]];
   * 6. else turns it to stays the same.
   *
   * For [[Multiply]]:
   * 1. If one side is interval, turns it to [[MultiplyInterval]];
   * 2. otherwise, stays the same.
   *
   * For [[Divide]]:
   * 1. If the left side is interval, turns it to [[DivideInterval]];
   * 2. otherwise, stays the same.
   */

Besides, we change datetime functions from implicit cast types to strict ones, all available type coercions happen in DateTimeOperations coercion rule.

Why are the changes needed?

Feature Parity between PostgreSQL and Spark, and make the null semantic consistent with Spark.

Does this PR introduce any user-facing change?

date_add/date_sub functions only accept int/tinynit/smallint as the second arg, double/string etc, are forbidden like hive, which produce weird results.
datetime arithmetic operations become NullIntolerant, e.g. select timestamp'1999-12-31 00:00:00' - null is valid now

How was this patch tested?

add ut

… Postgres

HyukjinKwon · 2019-11-06T11:55:16Z

cc @maropu and @MaxGekk

SparkQA · 2019-11-06T11:57:47Z

Test build #113314 has finished for PR 26412 at commit 0b293db.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-11-27T06:11:18Z

Ah, I see. The change looks reasonable to me. Just in case, can you check behaivours in the other systems?

yaooqinn · 2019-11-27T06:55:46Z

also check with presto

presto> select date('1900-01-01') - null;
 _col0
-------
 NULL
(1 row)

Query 20191127_065501_00001_9md27, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

SparkQA · 2019-11-27T08:05:02Z

Test build #114511 has finished for PR 26412 at commit e7225a3.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2019-11-27T09:41:20Z

retest this please

maropu · 2019-11-27T11:20:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

      case Subtract(l @ DateType(), r @ IntegerType()) => DateSub(l, r)
+      case Subtract(l @ DateType(), r @ NullType()) => DateSub(l, Cast(r, IntegerType))
      case Subtract(l @ DateType(), r @ DateType()) =>


Can we merge the multiple rule above into one like this?

case b @ BinaryOperator(l @ DateType(), r @ NullType()) => b.withNewChildren(Seq(l, Cast(r, IntegerType)))

If so, we might leave a trivial bug here if we set spark.sql.optimizer.maxIterations=1, it will not be transformed to DateAdd

hmm..., I personally think that behaivour looks a little weired to me. Probably, the root cause is that Subtract(l @ DateType(), r @ NullType()).checkInputDataTypes.isSuccess returns true. To fix this issue, we might need to modify that check code to return false. cc: @cloud-fan

FYI, add with numeric type and null type is also handled in TypeCoercion too

cloud-fan · 2019-11-27T12:10:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

      case Subtract(l @ DateType(), r @ DateType()) =>
        if (SQLConf.get.usePostgreSQLDialect) DateDiff(l, r) else SubtractDates(l, r)
      case Subtract(l @ TimestampType(), r @ TimestampType()) =>
        SubtractTimestamps(l, r)
      case Subtract(l @ TimestampType(), r @ DateType()) =>
        SubtractTimestamps(l, Cast(r, TimestampType))
+      case Subtract(l @ TimestampType(), r @ NullType()) =>


how about null - timestamp?

yes, we need them too. checked with pg

SparkQA · 2019-11-27T13:50:31Z

Test build #114528 has finished for PR 26412 at commit e7225a3.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-27T17:11:08Z

Test build #114534 has finished for PR 26412 at commit b925517.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

All the BinaryArithmetic operators are NullIntolerant. Why this is only against Date/Timestamp types?

yaooqinn · 2019-12-02T02:33:00Z

IIUC, the NullIntolerant is not for this type coercion thing.

cloud-fan · 2019-12-02T04:19:58Z

I think it's all because we hack the Add operator to do date add. Now we need to add more hacks in the type coercion rules.

How about we create UnresolvedAdd in the parser, and convert it to either Add or DateAdd in the analyzer?

yaooqinn · 2019-12-02T07:12:42Z

I think it's all because we hack the Add operator to do date add. Now we need to add more hacks in the type coercion rules.

How about we create UnresolvedAdd in the parser, and convert it to either Add or DateAdd in the analyzer?

This is better, I will follow this sugguestion, thanks.