
[SPARK-31710][SQL]Add compatibility flag to cast long to timestamp #28568

Closed
wants to merge 11 commits

Conversation

GuoPhilipse
Member

What changes were proposed in this pull request?
In Hive, a long value cast to timestamp is interpreted as milliseconds since the epoch, while in Spark it is interpreted as seconds. We have been getting incorrect data while migrating Hive SQL to Spark SQL; with a compatibility flag we can fix this.
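For illustration, a minimal sketch of the divergence, assuming a UTC session time zone (exact output depends on the time zone setting):

```sql
-- The same literal cast to TIMESTAMP: Spark reads the value as seconds
-- since the epoch, Hive reads it as milliseconds.
SELECT CAST(1589875200 AS TIMESTAMP);
-- Spark (seconds):      2020-05-19 08:00:00
-- Hive (milliseconds):  1970-01-19 09:37:55.2
```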

Why are the changes needed?
We have many SQL queries running in production, so we need a compatibility flag to migrate them smoothly while leaving Spark's default behavior unchanged.

Does this PR introduce any user-facing change?
Yes, but only if the user opts in: users who want the Hive behavior must set the new parameter; users who do nothing see no change.

How was this patch tested?
Unit tests added.

…710-3

[SPARK-31710][SQL]Add compatibility flag to cast long to timestamp
…710-2

[SPARK-31710][SQL]Add compatibility flag to cast long to timestamp
…710-1

[SPARK-31710][SQL]Add compatibility flag to cast long to timestamp
@AmplabJenkins

Can one of the admins verify this patch?

…710-4

[SPARK-31710][SQL]Add compatibility flag to cast long to timestamp
@bart-samwel

@cloud-fan @MaxGekk FYI

There's also PR #28534, which tries to solve the same thing using explicit functions.

To be honest, I'm not a big fan of using compatibility flags unless we're actually planning to deprecate the old behavior and change the behavior by default. Realistically, the next time we can change the default behavior is in Spark 4.0, which is likely to be several years out. And until then, throughout the Spark 3.x line, you may have Spark deployments out there where some query unexpectedly has different semantics than on other Spark deployments. The behavior change also doesn't stick if you port the same workload to other Spark deployments, and since the queries don't make their intended semantics explicit and there are no errors, you may silently produce incorrect results after changing deployments.

If anything, I'd be in favor of:

  • Doing the thing from PR #28534 ([SPARK-31797][SQL] Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions), i.e. adding TIMESTAMP_FROM_SECONDS etc. (see the example after this list).
  • If we really care enough to change the behavior (and hence break existing workloads), we should use a legacy compatibility flag that disables this CAST by default, and to let people choose between the (legacy) Spark behavior or the (new) Hive behavior. With the strong advice in the "this is disabled" error message to migrate to the functions above instead and to leave the setting at "disabled". Then people can shoot themselves in the foot if they really want to, but then at least we told them so.
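For reference, a rough sketch of how the explicit functions proposed in #28534 would be used; availability and exact output depend on the Spark version and session time zone (UTC assumed here):

```sql
-- Explicit constructors make the unit unambiguous, so readers and other
-- engines don't have to guess the cast semantics.
SELECT TIMESTAMP_SECONDS(1589875200);      -- 2020-05-19 08:00:00
SELECT TIMESTAMP_MILLIS(1589875200000);    -- 2020-05-19 08:00:00
SELECT TIMESTAMP_MICROS(1589875200000000); -- 2020-05-19 08:00:00
```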

@GuoPhilipse
Member Author

GuoPhilipse commented May 18, 2020 via email

@cloud-fan
Contributor

We need the same SQL to run on both Hive and Spark during migration, otherwise Spark fails or behaves unexpectedly. With a compatibility flag, as you said, we can migrate easily and don't need to change users' SQL.

We did something similar before with the pgsql dialect. That project was canceled because it's too much effort to keep two systems behaving exactly the same, and people may keep adding other dialects, which could increase maintenance costs dramatically.

Hive is a bit different as Spark already provides a lot of Hive compatibility. But still, it's not the right direction for Spark to provide 100% compatibility with another system.

For this particular case, I agree with @bart-samwel that we can fail by default when casting long to timestamp, and provide a legacy config to allow it with either the Spark or the Hive behavior. Allowing a cast from long to timestamp is non-standard and weird behavior, so in the long term we do want to forbid it, with a clear error message suggesting the TIMESTAMP_MILLIS or TIMESTAMP_MICROS functions.
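A hypothetical sketch of that direction; the flag name below is illustrative only and is not defined in this PR:

```sql
-- Assumed legacy flag (illustrative name): with the cast disallowed by
-- default, old queries fail with a message pointing at the new functions.
SET spark.sql.legacy.allowCastNumericToTimestamp=false;
SELECT CAST(1589875200000 AS TIMESTAMP);  -- would fail with a clear error
SELECT TIMESTAMP_MILLIS(1589875200000);   -- suggested explicit replacement
```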

@bart-samwel

@GuoPhilipse I agree that if you want to do the smooth migration like that, then you need to have all queries using a subset of the language that works in both systems. It's a luxury to have that, but it's nice to have. If you had the new functions AND the legacy config, then you would be in a good place. First migrate, then move over to the new functions after migration. So I propose to do both things. @cloud-fan do you agree?

@GuoPhilipse
Member Author

GuoPhilipse commented May 19, 2020 via email

@HyukjinKwon
Member

HyukjinKwon commented May 22, 2020

Let me close this for now, see also #28593
