
[SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions #27965

Closed
wants to merge 3 commits

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

#26412 introduced a behavior change: the `date_add`/`date_sub` functions can no longer accept string and double values as the second parameter. This is reasonable, as casting string/double to int at runtime is error-prone.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases where the string literal is indeed an integer, this PR proposes to apply `ansi_cast` to string literals in the `date_add`/`date_sub` functions. If the string value is not a valid integer, the query fails at compile time thanks to constant folding.
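The mechanism described above can be sketched as a minimal, self-contained model (hypothetical names, not actual Spark code): a string-literal second argument is wrapped in a strict cast to int, and constant folding evaluates that cast eagerly, so an invalid literal fails before any row is processed.

```scala
// Tiny expression tree standing in for Catalyst expressions (names are
// illustrative only; this is not Spark's API).
sealed trait Expr
case class StrLit(v: String) extends Expr
case class IntLit(v: Int) extends Expr
case class AnsiCastToInt(child: Expr) extends Expr
case class DateAddExpr(date: Expr, days: Expr) extends Expr

// Coercion step: only a *literal* string argument gets wrapped in the cast.
def coerce(e: Expr): Expr = e match {
  case DateAddExpr(d, s: StrLit) => DateAddExpr(d, AnsiCastToInt(s))
  case other                     => other
}

// Constant folding: evaluating the cast eagerly surfaces a bad literal as a
// compile-time error instead of a runtime one.
def constantFold(e: Expr): Expr = e match {
  case AnsiCastToInt(StrLit(v)) =>
    v.toIntOption match {
      case Some(i) => IntLit(i)
      case None    => throw new NumberFormatException(s"invalid int literal: '$v'")
    }
  case DateAddExpr(d, days) => DateAddExpr(constantFold(d), constantFold(days))
  case other                => other
}
```

With this sketch, `date_add('2011-11-11', '1')` folds to an int argument, while `date_add(..., '12.34')` fails during folding, mirroring the compile-time failure the PR describes.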

Why are the changes needed?

avoid breaking changes

Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4

How was this patch tested?

new tests.

@cloud-fan
Contributor Author

cc @yaooqinn @maropu @HyukjinKwon

// Skip nodes whose children have not been resolved yet.
case e if !e.childrenResolved => e
case DateAdd(l, r) if r.dataType == StringType && r.foldable =>
DateAdd(l, AnsiCast(r, IntegerType))
Member

Do we need to depend on AnsiCast here, even though we can detect cast errors in the analysis phase? IMHO an analysis exception with an explicit error message is better than the current number format exception: https://github.com/apache/spark/pull/27965/files#diff-79dd276be45ede6f34e24ad7005b0a7cR279

Member

Also, is the behaviour independent of the ANSI mode? Any string->int casts should not be allowed when ansi=true.

Contributor Author

Any string->int casts should not be allowed when ansi=true.

Do you mean type coercion? Many databases can cast string to int and they are ANSI-compliant.

maropu commented Mar 20, 2020

Ah, yes. I rechecked the implicit-cast references for some databases and, as you said, the cast seems to be acceptable.

@@ -113,7 +113,7 @@ license: |

### UDFs and Built-in Functions

- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types (excluding string literals) are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
Member

Given the current behaviour of this PR, the statement that `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException` seems to be wrong.

@yaooqinn
Member

I'm -1 on this kind of restore.

  1. It restores only the string-literal cases, not the non-literal column cases, which makes these functions more complicated to understand.
  2. The string literals have to be int-castable, which makes the scope even smaller. So I wonder why a user would prefer an integer value with extra single quotes over the integer itself.

@maropu
Member

maropu commented Mar 20, 2020

I think this issue is worth filing a new jira for better traceability.

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120086 has finished for PR 27965 at commit 24df520.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-29774][SQL][followup] support string literal as the second argument of date_add/date_sub functions [SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions Mar 20, 2020
@cloud-fan
Contributor Author

@yaooqinn the reason almost all mainstream databases accept string literals as SQL function arguments is that it's convenient and common. Some BI tools even generate SQL queries with string literals in function arguments regardless of the required type.

Spark is mostly compatible with this use case because our type coercion can implicitly cast string to other types. This is bad, as it can lead to runtime null values for invalid strings, and we are trying to improve it under the ANSI mode.

#26412 breaks this use case. I'm OK with adding back string type support completely, but I do agree the previous behavior was very confusing. This PR proposes a temporary workaround to support valid string literals only, and I believe we will have a holistic solution for ANSI type coercion in the next release, which handles string literals and invalid string values.

/**
* A special rule to support string literal as the second argument of date_add/date_sub functions,
* to keep backward compatibility.
* TODO: revisit the type coercion rules for string.
maropu commented Mar 20, 2020

nit: how about filing a jira for this TODO and then leaving the jira ID here?

Contributor Author

AFAIK we started the ANSI type coercion work (you have a design doc, right?). Did we create a JIRA ticket at that time?

Member

I wrote the design doc, but I didn't file a jira at that time. Probably @gengliangwang might have done so?

maropu left a comment

Looks fine to me. cc: @gengliangwang, who's familiar with the coercion rules.

@@ -113,7 +113,7 @@ license: |

### UDFs and Built-in Functions

- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. Note that, string literals are still allowed, but Spark will throw Analysis Exception if the string is not a valid integer. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
Member

`accepts` -> `accept`; "fractional and string types are not valid anymore" seems to be inconsistent with "string literals are still allowed"

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120099 has finished for PR 27965 at commit 94e2fce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120095 has finished for PR 27965 at commit 9caf98c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The failures seem to be relevant. Could you take a look?

@@ -113,7 +113,7 @@ license: |

### UDFs and Built-in Functions

- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accept int, smallint, tinyint as the 2nd argument, fractional and non-literal string are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), 12.34)` will cause `AnalysisException`. Note that, string literals are still allowed, but Spark will throw Analysis Exception if the string content is not a valid integer. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
Member

The new approach looks good.

* to keep backward compatibility as a temporary workaround.
* TODO(SPARK-28589): implement ANSI type coercion and handle string literals.
*/
object StringLiteralCoercion extends TypeCoercionRule {
dongjoon-hyun commented Mar 20, 2020

This causes a behavior difference in arithmetic operations, too. Could you describe the following change in the PR description? The new one looks reasonable to me.

2.4.5 and 3.0.0-preview2

scala> sql("select (cast('2020-03-28' AS DATE) + '1')").show
org.apache.spark.sql.AnalysisException: cannot resolve '(CAST('2020-03-28' AS DATE) + CAST('1' AS DOUBLE))' due to data type mismatch: differing types in '(CAST('2020-03-28' AS DATE) + CAST('1' AS DOUBLE))' (date and double).; line 1 pos 8;

This PR.

scala> sql("select (cast('2020-03-28' AS DATE) + '1')").show
+-------------------------------------+
|date_add(CAST(2020-03-28 AS DATE), 1)|
+-------------------------------------+
|                           2020-03-29|
+-------------------------------------+

Member

Did you forget to write a function name in the second query above?

scala> sql("select (cast('2020-03-28' AS DATE) + '1')").show
                 ^^^^

dongjoon-hyun commented Mar 22, 2020

No~ Both queries are the same. What I meant was that this is the behavior of this PR; this PR extends expressions, too.
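The arithmetic-operator extension discussed in this thread can be sketched as a standalone model (hypothetical names, not Spark code): a date plus a string *literal* is rewritten into a `date_add` call with the literal eagerly cast to int, instead of failing resolution as in 2.4.5 / 3.0.0-preview2.

```scala
// Tiny stand-in AST; names are illustrative only.
sealed trait Node
case class DateLit(iso: String) extends Node
case class StrLit(v: String) extends Node
case class Add(left: Node, right: Node) extends Node
case class DateAddCall(date: Node, days: Int) extends Node

// Hypothetical rewrite mirroring the PR's behavior: `date + '1'` becomes
// date_add(date, 1); an invalid literal fails here, at analysis time.
def rewrite(n: Node): Node = n match {
  case Add(d: DateLit, StrLit(s)) =>
    DateAddCall(d, s.toIntOption.getOrElse(
      sys.error(s"invalid int literal: '$s'")))
  case other => other
}
```

Under this sketch, `cast('2020-03-28' as date) + '1'` resolves to `date_add(..., 1)`, matching the query output shown above.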

Member

Oh, I see.

dongjoon-hyun left a comment

The PR title/description/migration guide needs to be broadened. Or we should narrow it down to the two functions directly.

@dongjoon-hyun
Member

cc @gatorsmile

@cloud-fan
Contributor Author

@dongjoon-hyun good catch! I've fixed it and added tests.

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120184 has finished for PR 27965 at commit 879238c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120194 has finished for PR 27965 at commit bd6be16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

merging to master/3.0, thanks for the review!

@cloud-fan cloud-fan closed this in 1d0f549 Mar 24, 2020
cloud-fan added a commit that referenced this pull request Mar 24, 2020
…ate_add/date_sub functions


Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1d0f549)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ate_add/date_sub functions


Closes apache#27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>