[SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres #26412
Conversation
Test build #113314 has finished for PR 26412 at commit
Ah, I see. The change looks reasonable to me. Just in case, can you check the behaviours in the other systems?
Also checked with Presto:

```
presto> select date('1900-01-01') - null;
 _col0
-------
 NULL
(1 row)

Query 20191127_065501_00001_9md27, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
```
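The null semantics checked above can be modeled outside of any engine. The sketch below is plain Scala with `java.time` (not Spark or Presto code): it treats SQL NULL as `Option` and shows that date subtraction with a NULL operand yields NULL.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Simplified model of SQL null semantics: NULL is None, and any
// arithmetic involving a NULL operand propagates NULL, as in Postgres/Presto.
def subtractDates(l: Option[LocalDate], r: Option[LocalDate]): Option[Long] =
  for (a <- l; b <- r) yield ChronoUnit.DAYS.between(b, a)

val d = LocalDate.parse("1900-01-01")
println(subtractDates(Some(d), None))                 // None, i.e. SQL NULL
println(subtractDates(Some(d.plusDays(7)), Some(d)))  // Some(7)
```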
Test build #114511 has finished for PR 26412 at commit
retest this please
```scala
case Subtract(l @ DateType(), r @ IntegerType()) => DateSub(l, r)
case Subtract(l @ DateType(), r @ NullType()) => DateSub(l, Cast(r, IntegerType))
case Subtract(l @ DateType(), r @ DateType()) =>
```
Can we merge the multiple rules above into one, like this?

```scala
case b @ BinaryOperator(l @ DateType(), r @ NullType()) =>
  b.withNewChildren(Seq(l, Cast(r, IntegerType)))
```
If so, we might leave a trivial bug here: if we set `spark.sql.optimizer.maxIterations=1`, it will not be transformed to `DateAdd`.
Hmm, I personally think that behaviour looks a little weird. Probably, the root cause is that `Subtract(l @ DateType(), r @ NullType()).checkInputDataTypes.isSuccess` returns true. To fix this issue, we might need to modify that check code to return false. cc: @cloud-fan
FYI, `Add` with numeric type and null type is also handled in `TypeCoercion` too.
```scala
case Subtract(l @ DateType(), r @ DateType()) =>
  if (SQLConf.get.usePostgreSQLDialect) DateDiff(l, r) else SubtractDates(l, r)
case Subtract(l @ TimestampType(), r @ TimestampType()) =>
  SubtractTimestamps(l, r)
case Subtract(l @ TimestampType(), r @ DateType()) =>
  SubtractTimestamps(l, Cast(r, TimestampType))
case Subtract(l @ TimestampType(), r @ NullType()) =>
```
How about `null - timestamp`?
yes, we need them too. checked with pg
Test build #114528 has finished for PR 26412 at commit
Test build #114534 has finished for PR 26412 at commit
All the `BinaryArithmetic` operators are `NullIntolerant`. Why is this only against Date/Timestamp types? IIUC, the
I think it's all because we hack the ... How about we create ...
This is better, I will follow this suggestion, thanks.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
```scala
      Cast(TimeSub(l, r), l.dataType)
    case (CalendarIntervalType, TimestampType | DateType | StringType) =>
      Cast(TimeSub(r, l), r.dataType)
    case (DateType | NullType, DateType) => if (conf.usePostgreSQLDialect) {
```
Do we need to handle `NullType` here? The `Subtract` should work for null.
Yeah, the results are the same but they are not semantically equal. Is that OK?
Actually not; `subtract(null, date)` will not pass type checking.
```scala
      } else {
        SubtractDates(l, r)
      }
    case (TimestampType, TimestampType | DateType | NullType) => SubtractTimestamps(l, r)
```
ditto
```scala
        SubtractDates(l, r)
      }
    case (TimestampType, TimestampType | DateType | NullType) => SubtractTimestamps(l, r)
    case (DateType | NullType, TimestampType) => SubtractTimestamps(Cast(l, TimestampType), r)
```
ditto
```scala
    case (_, _) => Subtract(l, r)
  }
  case UnresolvedMultiply(l, r) => (l.dataType, r.dataType) match {
    case (CalendarIntervalType, _: NumericType | NullType) => MultiplyInterval(l, r)
```
ditto
```scala
  }
  case UnresolvedSubtract(l, r) => (l.dataType, r.dataType) match {
    case (TimestampType | DateType | StringType, CalendarIntervalType) =>
      Cast(TimeSub(l, r), l.dataType)
```
I notice that `TimeSub` is replaceable by `TimeAdd(l, UnaryMinus(r))`, which makes it useless.
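The equivalence this relies on can be checked with plain `java.time` arithmetic (a sketch, not Catalyst code): subtracting an interval is the same as adding its negation, which is why a dedicated subtraction node is redundant.

```scala
import java.time.{Duration, Instant}

// t - i == t + (-i): the identity behind replacing TimeSub(l, r)
// with TimeAdd(l, UnaryMinus(r)).
val t = Instant.parse("2019-01-01T00:00:00Z")
val i = Duration.ofDays(2)

val viaSub    = t.minus(i)           // models TimeSub(t, i)
val viaAddNeg = t.plus(i.negated())  // models TimeAdd(t, UnaryMinus(i))
println(viaSub == viaAddNeg)         // true
```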
Test build #114718 has finished for PR 26412 at commit
It seems like ... How about this:
Test build #114886 has finished for PR 26412 at commit
Simply replace the
Test build #114893 has finished for PR 26412 at commit
retest this please
```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
```
unnecessary change
```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
```
ditto
```
-- !query 13 schema
struct<date_sub(DATE '2001-10-01', 7):date>
struct<CAST(TIMESTAMP '2011-11-11 11:11:11' + INTERVAL '2 days' AS TIMESTAMP):timestamp>
```
can we avoid adding cast if not necessary?
OK it's the existing behavior too, we can revisit it later.
```
-- !query 17 schema
struct<DATE '2019-01-01':date>
struct<CAST(CAST(2011-11-11 AS TIMESTAMP) - INTERVAL '2 days' AS STRING):string>
```
it's super weird that this returns string. What was the behavior before?
OK it's the existing behavior. We can revisit it later.
Yeah, but it follows the behavior before: https://github.com/apache/spark/pull/26412/files/c84d46ea6d384dcb1f442ca54abad48e59c92bb3#diff-383a8cdd0a9c58cae68e0a79295520a3L846
```scala
// Allowed operations:
// IntervalYearMonth - IntervalYearMonth = IntervalYearMonth
// Date - IntervalYearMonth = Date (operands not reversible)
// Timestamp - IntervalYearMonth = Timestamp (operands not reversible)
// IntervalDayTime - IntervalDayTime = IntervalDayTime
// Date - IntervalYearMonth = Timestamp (operands not reversible)
// Timestamp - IntervalYearMonth = Timestamp (operands not reversible)
// Timestamp - Timestamp = IntervalDayTime
// Date - Date = IntervalDayTime
// Timestamp - Date = IntervalDayTime (operands reversible)
// Date - Int = Date
```
Hive's behavior is more convincing; we can check this later.
looks pretty good, let's see how tests go this time.
Test build #114896 has finished for PR 26412 at commit
Test build #114897 has finished for PR 26412 at commit
thanks, merging to master!
… Postgres

### What changes were proposed in this pull request?

Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy:

```scala
/**
 * For [[Add]]:
 * 1. if both side are interval, stays the same;
 * 2. else if one side is interval, turns it to [[TimeAdd]];
 * 3. else if one side is date, turns it to [[DateAdd]] ;
 * 4. else stays the same.
 *
 * For [[Subtract]]:
 * 1. if both side are interval, stays the same;
 * 2. else if the right side is an interval, turns it to [[TimeSub]];
 * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
 * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
 * 5. else if the left side is date, turns it to [[DateSub]];
 * 6. else turns it to stays the same.
 *
 * For [[Multiply]]:
 * 1. If one side is interval, turns it to [[MultiplyInterval]];
 * 2. otherwise, stays the same.
 *
 * For [[Divide]]:
 * 1. If the left side is interval, turns it to [[DivideInterval]];
 * 2. otherwise, stays the same.
 */
```

Besides, we change datetime functions from implicit cast types to strict ones; all available type coercions happen in the `DateTimeOperations` coercion rule.

### Why are the changes needed?

Feature parity between PostgreSQL and Spark, and make the null semantics consistent with Spark.

### Does this PR introduce any user-facing change?

1. date_add/date_sub functions only accept int/tinyint/smallint as the second arg; double/string etc. are forbidden, like Hive, as they produce weird results.

### How was this patch tested?

add ut

Closes apache#26412 from yaooqinn/SPARK-29774.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
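The `Subtract` part of the policy above can be sketched as a standalone pattern match. In the sketch below, `DateT`, `TimestampT`, `IntervalT`, and `NumericT` are toy stand-ins for illustration, not Spark's type objects; the real rule works on Catalyst expressions.

```scala
// Toy model of the Subtract resolution policy: returns the name of the
// concrete operator the analyzer rule would rewrite "l - r" to.
sealed trait T
case object DateT      extends T
case object TimestampT extends T
case object IntervalT  extends T
case object NumericT   extends T

def resolveSubtract(l: T, r: T): String = (l, r) match {
  case (IntervalT, IntervalT)            => "Subtract"           // 1. both interval: unchanged
  case (_, IntervalT)                    => "TimeSub"            // 2. right side is interval
  case (TimestampT, _) | (_, TimestampT) => "SubtractTimestamps" // 3. one side is timestamp
  case (_, DateT)                        => "SubtractDates"      // 4. right side is date
  case (DateT, _)                        => "DateSub"            // 5. left side is date
  case _                                 => "Subtract"           // 6. unchanged
}

println(resolveSubtract(DateT, DateT))       // SubtractDates
println(resolveSubtract(DateT, NumericT))    // DateSub
println(resolveSubtract(TimestampT, DateT))  // SubtractTimestamps
```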
Sorry, @cloud-fan, I just checked the cc.
I don't think there are any differences in the column names being generated in PySpark specifically.
@yaooqinn Thanks for the work, but I don't know the behavior before this PR from the PR description and discussions. I would suggest adding that in the PR description as well. I have to check with Spark 2.4.4 to find the previous behavior:
Hi @gengliangwang, thanks for your suggestion; I have updated the description. Can you check whether it is clear enough?
Do you mean time_add/time_sub? BTW we should have a migration guide for it.
It is date_add and date_sub, we have made them
…ate_sub

### What changes were proposed in this pull request?

Add a migration guide for date_add and date_sub to indicate their behavior change. It is a followup for #26412.

### Why are the changes needed?

add a migration guide

### Does this PR introduce any user-facing change?

yes, doc change

### How was this patch tested?

no

Closes #26932 from yaooqinn/SPARK-29774-f.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ate_add/date_sub functions

### What changes were proposed in this pull request?

#26412 introduced a behavior change: the `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable, as it's error-prone to cast string/double to int at runtime. However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases where the string literal is indeed an integer, this PR proposes to add ansi_cast for string literals in the date_add/date_sub functions. If the string value is not a valid integer, we fail at query compile time because of constant folding.

### Why are the changes needed?

avoid breaking changes

### Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4.

### How was this patch tested?

new tests.

Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
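The constant-folding point can be illustrated without Spark. Below is a hedged sketch in plain Scala (the helper name is hypothetical, not Spark's API): a string literal is accepted only when it parses as a whole integer, so a bad literal fails when the query is compiled rather than producing wrong results at runtime.

```scala
// Hypothetical sketch mimicking ansi_cast + constant folding on a string
// literal: a valid integer literal folds to its value; anything else is
// rejected up front with an error.
def foldStringLiteralToInt(s: String): Either[String, Int] =
  s.trim.toIntOption.toRight(s"cannot cast string literal '$s' to int")

println(foldStringLiteralToInt("1"))    // Right(1)
println(foldStringLiteralToInt("1.5"))  // Left(cannot cast string literal '1.5' to int)
```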
### What changes were proposed in this pull request?

Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy.

Besides, we change datetime functions from implicit cast types to strict ones; all available type coercions happen in the `DateTimeOperations` coercion rule.

### Why are the changes needed?

Feature parity between PostgreSQL and Spark, and make the null semantics consistent with Spark.

### Does this PR introduce any user-facing change?

Date/Timestamp +/- null is `NullIntolerant`, e.g. `select timestamp'1999-12-31 00:00:00' - null` is valid now.

### How was this patch tested?

add ut