
[SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions #27965

Closed
wants to merge 3 commits

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

#26412 introduced a behavior change: the `date_add`/`date_sub` functions can no longer accept string and double values as the second parameter. This is reasonable, as casting string/double to int at runtime is error-prone.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases where the string literal is indeed an integer, this PR proposes to apply `ansi_cast` to string literals in the `date_add`/`date_sub` functions. If the string value is not a valid integer, the query fails at compile time thanks to constant folding.
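The mechanism described above can be sketched as a minimal, self-contained model (hypothetical names, not actual Spark code): a string-literal second argument is wrapped in a strict cast to int, and constant folding evaluates that cast eagerly, so an invalid literal fails before any row is processed.

```scala
// Tiny expression tree standing in for Catalyst expressions (names are
// illustrative only; this is not Spark's API).
sealed trait Expr
case class StrLit(v: String) extends Expr
case class IntLit(v: Int) extends Expr
case class AnsiCastToInt(child: Expr) extends Expr
case class DateAddExpr(date: Expr, days: Expr) extends Expr

// Coercion step: only a *literal* string argument gets wrapped in the cast.
def coerce(e: Expr): Expr = e match {
  case DateAddExpr(d, s: StrLit) => DateAddExpr(d, AnsiCastToInt(s))
  case other                     => other
}

// Constant folding: evaluating the cast eagerly surfaces a bad literal as a
// compile-time error instead of a runtime one.
def constantFold(e: Expr): Expr = e match {
  case AnsiCastToInt(StrLit(v)) =>
    v.toIntOption match {
      case Some(i) => IntLit(i)
      case None    => throw new NumberFormatException(s"invalid int literal: '$v'")
    }
  case DateAddExpr(d, days) => DateAddExpr(constantFold(d), constantFold(days))
  case other                => other
}
```

With this sketch, `date_add('2011-11-11', '1')` folds to an int argument, while `date_add(..., '12.34')` fails during folding, mirroring the compile-time failure the PR describes.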

Why are the changes needed?

avoid breaking changes

Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4

How was this patch tested?

new tests.

@cloud-fan
Contributor Author

cc @yaooqinn @maropu @HyukjinKwon

// Skip nodes whose children have not been resolved yet.
case e if !e.childrenResolved => e
case DateAdd(l, r) if r.dataType == StringType && r.foldable =>
DateAdd(l, AnsiCast(r, IntegerType))
Member

Do we need to depend on AnsiCast here, even though we can detect cast errors in the analysis phase? IMHO an analysis exception with an explicit error message is better than the current number format exception: https://github.com/apache/spark/pull/27965/files#diff-79dd276be45ede6f34e24ad7005b0a7cR279

Member

Also, is the behaviour independent of the ANSI mode? Any string->int casts should not be allowed when ansi=true.

Contributor Author

Any string->int casts should not be allowed when ansi=true.

Do you mean type coercion? Many databases can cast string to int and they are ANSI-compliant.

maropu commented Mar 20, 2020

Ah, yes. I rechecked the implicit-cast references for some databases and, as you said, the cast seems to be acceptable.

@@ -113,7 +113,7 @@ license: |

### UDFs and Built-in Functions

- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types (excluding string literals) are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
Member

Given the current behaviour of this PR, the statement that `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException` seems to be wrong.

@yaooqinn
Member

I'm -1 on this kind of restore.

  1. It restores only the string-literal cases, not the non-literal column cases, which makes these functions more complicated to understand.
  2. The string literals have to be int-castable, which makes the scope even smaller. So I wonder why a user would prefer an integer value with extra single quotes over the integer itself.

@maropu
Member

maropu commented Mar 20, 2020

I think this issue is worth filing a new jira for better traceability.

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120086 has finished for PR 27965 at commit 24df520.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-29774][SQL][followup] support string literal as the second argument of date_add/date_sub functions [SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions Mar 20, 2020
@cloud-fan
Contributor Author

@yaooqinn the reason almost all mainstream databases accept string literals as SQL function arguments is that it's convenient and common. Some BI tools even generate SQL queries with string literals in function arguments regardless of the required type.

Spark is mostly compatible with this use case because our type coercion can implicitly cast string to other types. This is bad, as it can lead to runtime null values for invalid strings, and we are trying to improve it under the ANSI mode.

#26412 breaks this use case. I'm OK with adding back string type support completely, but I do agree the previous behavior was very confusing. This PR proposes a temporary workaround to support valid string literals only, and I believe we will have a holistic solution for ANSI type coercion in the next release, which handles string literals and invalid string values.

/**
* A special rule to support string literal as the second argument of date_add/date_sub functions,
* to keep backward compatibility.
* TODO: revisit the type coercion rules for string.
maropu commented Mar 20, 2020

nit: how about filing a jira for this TODO and then leaving the jira ID here?

Contributor Author

AFAIK we started the ANSI type coercion work (you have a design doc, right?). Did we create a JIRA ticket at that time?

Member

I wrote the design doc, but I didn't file a jira at that time. Probably @gengliangwang might have done so?

maropu left a comment

Looks fine to me. cc: @gengliangwang, who's familiar with the coercion rules.

@@ -113,7 +113,7 @@ license: |

### UDFs and Built-in Functions

- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. Note that, string literals are still allowed, but Spark will throw Analysis Exception if the string is not a valid integer. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
Member

`accepts` -> `accept`; "fractional and string types are not valid anymore" seems to be inconsistent with "string literals are still allowed"

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120099 has finished for PR 27965 at commit 94e2fce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120095 has finished for PR 27965 at commit 9caf98c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The failures seem to be relevant. Could you take a look?

@@ -113,7 +113,7 @@ license: |

### UDFs and Built-in Functions

- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accept int, smallint, tinyint as the 2nd argument, fractional and non-literal string are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), 12.34)` will cause `AnalysisException`. Note that, string literals are still allowed, but Spark will throw Analysis Exception if the string content is not a valid integer. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
Member

The new approach looks good.

* to keep backward compatibility as a temporary workaround.
* TODO(SPARK-28589): implement ANSI type coercion and handle string literals.
*/
object StringLiteralCoercion extends TypeCoercionRule {
dongjoon-hyun commented Mar 20, 2020

This causes a behavior difference in arithmetic operations, too. Could you describe the following change in the PR description? The new one looks reasonable to me.

2.4.5 and 3.0.0-preview2

scala> sql("select (cast('2020-03-28' AS DATE) + '1')").show
org.apache.spark.sql.AnalysisException: cannot resolve '(CAST('2020-03-28' AS DATE) + CAST('1' AS DOUBLE))' due to data type mismatch: differing types in '(CAST('2020-03-28' AS DATE) + CAST('1' AS DOUBLE))' (date and double).; line 1 pos 8;

This PR.

scala> sql("select (cast('2020-03-28' AS DATE) + '1')").show
+-------------------------------------+
|date_add(CAST(2020-03-28 AS DATE), 1)|
+-------------------------------------+
|                           2020-03-29|
+-------------------------------------+

Member

Did you forget to write a function name in the second query above?

scala> sql("select (cast('2020-03-28' AS DATE) + '1')").show
                 ^^^^

dongjoon-hyun commented Mar 22, 2020

No~ Both queries are the same. What I meant was that this is the behavior of this PR; this PR extends expressions, too.
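The arithmetic-operator extension discussed in this thread can be sketched as a standalone model (hypothetical names, not Spark code): a date plus a string *literal* is rewritten into a `date_add` call with the literal eagerly cast to int, instead of failing resolution as in 2.4.5 / 3.0.0-preview2.

```scala
// Tiny stand-in AST; names are illustrative only.
sealed trait Node
case class DateLit(iso: String) extends Node
case class StrLit(v: String) extends Node
case class Add(left: Node, right: Node) extends Node
case class DateAddCall(date: Node, days: Int) extends Node

// Hypothetical rewrite mirroring the PR's behavior: `date + '1'` becomes
// date_add(date, 1); an invalid literal fails here, at analysis time.
def rewrite(n: Node): Node = n match {
  case Add(d: DateLit, StrLit(s)) =>
    DateAddCall(d, s.toIntOption.getOrElse(
      sys.error(s"invalid int literal: '$s'")))
  case other => other
}
```

Under this sketch, `cast('2020-03-28' as date) + '1'` resolves to `date_add(..., 1)`, matching the query output shown above.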

Member

Oh, I see.

dongjoon-hyun left a comment

The PR title/description/migration guide needs to be broadened. Or we should narrow it down to the two functions directly.

@dongjoon-hyun
Member

cc @gatorsmile

@cloud-fan
Contributor Author

@dongjoon-hyun good catch! I've fixed it and added tests.

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120184 has finished for PR 27965 at commit 879238c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120194 has finished for PR 27965 at commit bd6be16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

merging to master/3.0, thanks for the review!

@cloud-fan cloud-fan closed this in 1d0f549 Mar 24, 2020
cloud-fan added a commit that referenced this pull request Mar 24, 2020
…ate_add/date_sub functions


Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1d0f549)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ate_add/date_sub functions


Closes apache#27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>