-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31797][SQL] Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions #28534
Conversation
This is a behavior change for existing Spark workloads. Even if the code is not compatible with Hive here, we shouldn't break existing workloads. See also the versioning policy.. If we really want to fix this, we would have to introduce a compatibility flag, and fail this construct by default so that people can choose. A better solution would be to introduce functions that are explicit about the meaning of their source operand. E.g. this is how it is done in BigQuery, with functions like @cloud-fan @MaxGekk FYI |
@bart-samwel Nice suggestion, I am both ok in conform to BigQuery or Hive. |
This is not standard behavior. It's OK to follow Hive, but this behavior has been released a long time ago, and is the behavior of Spark for many years. It's too late to change now. Agree with @bart-samwel to add a dedicated function to convert milli/micro seconds to timestamp. |
@cloud-fan @bart-samwel Thanks, I will make a revert and to add a dedicated function to support it. |
487197b
to
39eac6c
Compare
@cloud-fan @MaxGekk Could you please help me review it ? |
ok to test |
Test build #122749 has finished for PR 28534 at commit
|
Test build #122765 has finished for PR 28534 at commit
|
@cloud-fan Could you please verify this patch |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
withSQLConf() { | ||
checkEvaluation( | ||
GetTimestamp( | ||
Literal(1580184371847000L), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test with literals that don't end in three 0s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will replace three 0s with 123 in order to avoid mistake
Literal("milli")), 1580184371847000000L) | ||
checkEvaluation( | ||
GetTimestamp( | ||
Literal(1580184371847000L), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test with negative (pre-Unix epoch) timestamps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I will add negative test.
checkExceptionInExpression[IllegalArgumentException]( | ||
GetTimestamp( | ||
Literal(1580184371847000L), | ||
Literal("other")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test with a non-literal input. The error should still make sense then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bart-samwel Could you please give me an example about what a non-literal means.
@@ -846,6 +853,8 @@ abstract class ToTimestamp | |||
case NonFatal(_) => null | |||
} | |||
} | |||
case LongType => | |||
Math.multiplyExact(t.asInstanceOf[Long], scaleFactor.asInstanceOf[Long]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Do we do range checks on timestamps? (Do we even have published valid ranges?) Because if we do, then this could produce valid Long values that are out of range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a check when decide the value of scaleFactor
to avoid the result of multiply out of range of Long value.
Test build #122799 has finished for PR 28534 at commit
|
019d6b2
to
c2be46b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct some issue base on suggestions
withSQLConf() { | ||
checkEvaluation( | ||
GetTimestamp( | ||
Literal(1580184371847000L), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will replace three 0s with 123 in order to avoid mistake
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
@@ -846,6 +853,8 @@ abstract class ToTimestamp | |||
case NonFatal(_) => null | |||
} | |||
} | |||
case LongType => | |||
Math.multiplyExact(t.asInstanceOf[Long], scaleFactor.asInstanceOf[Long]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a check when decide the value of scaleFactor
to avoid the result of multiply out of range of Long value.
checkExceptionInExpression[IllegalArgumentException]( | ||
GetTimestamp( | ||
Literal(1580184371847000L), | ||
Literal("other")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bart-samwel Could you please give me an example about what a non-literal means.
Literal("milli")), 1580184371847000000L) | ||
checkEvaluation( | ||
GetTimestamp( | ||
Literal(1580184371847000L), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I will add negative test.
@bart-samwel a patch is added, Could you please help me review it. |
case LongType => | ||
val input = t.asInstanceOf[Long] | ||
if ( minRange < input && input < maxRange ) { | ||
Math.multiplyExact(input, scaleFactor.asInstanceOf[Long]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Math.multiplyExact
throws ArithmeticException
. You don't need to explicitly check the ranges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My earlier concern was about whether we support the full range of Long as valid timestamps, or if the actual supported range is smaller. I googled a bit, and it seems that the full range is supported. So that's fine then without checks.
@@ -806,6 +806,17 @@ abstract class ToTimestamp | |||
case NonFatal(_) => null | |||
} | |||
|
|||
private lazy val scaleFactor = right.toString match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, look at other places in the file how to handle foldable expressions.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
@@ -789,7 +789,7 @@ abstract class ToTimestamp | |||
protected def downScaleFactor: Long | |||
|
|||
override def inputTypes: Seq[AbstractDataType] = | |||
Seq(TypeCollection(StringType, DateType, TimestampType), StringType) | |||
Seq(TypeCollection(StringType, DateType, TimestampType, LongType ), StringType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only LongType. Can be any IntegralType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, IntegerType
should also be considered.
$javaType ${ev.value} = ${CodeGenerator.defaultValue(dataType)}; | ||
if (!${ev.isNull}) { | ||
if ( (${minRange}L < ${eval1.value}) && (${eval1.value} < ${maxRange}L) ) { | ||
${ev.value} = ${eval1.value} * $scaleFactor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent with non-codegen where you use Math.multiplyExact
@@ -846,6 +857,15 @@ abstract class ToTimestamp | |||
case NonFatal(_) => null | |||
} | |||
} | |||
case LongType => | |||
val input = t.asInstanceOf[Long] | |||
if ( minRange < input && input < maxRange ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case, why do you exclude min and max values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, min and max should also be considered as correct input.Thanks for your advise.
"input [" + ${eval1.value} + "] not from " + ${minRange}L + " to " | ||
+ ${maxRange}L + " for format " + "${right.toString}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, does this string interpolation work in Java?
test("SPARK-31710:Fix millisecond and microsecond convert to timestamp in to_timestamp") { | ||
withSQLConf() { | ||
checkEvaluation( | ||
GetTimestamp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we are going to add 3 new functions: TIMESTAMP_SECONDS
, TIMESTAMP_MILLIS
and TIMESTAMP_MICROS
, to follow big query. Why do we overload GetTimestamp
? Passing unit string as parameter is fragile, as we need to define/document the supported units and the behavior of invalid units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your idea is right, I will add the three functions later and make a revert.
I'd like to cross reference this with PR #28568 which is trying to solve this same issue with a compatibility flag. I think those two solutions are orthogonal -- there can be a flag to change the behavior (to easy migration from Hive), and there can be a better way to specify the conversion type explicitly. So I think the direction of this PR should go in no matter what happens with the other PR. |
Test build #122804 has finished for PR 28534 at commit
|
c2be46b
to
0355ce4
Compare
@@ -401,6 +401,92 @@ case class DayOfYear(child: Expression) extends UnaryExpression with ImplicitCas | |||
} | |||
} | |||
|
|||
@ExpressionDescription( | |||
usage = "_FUNC_(date) - Returns timestamp from seconds.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FUNC_(seconds) - Creates timestamp from the number of seconds since UTC epoch.
} | ||
|
||
@ExpressionDescription( | ||
usage = "_FUNC_(date) - Returns timestamp from milliseconds.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FUNC_(seconds) - Creates timestamp from the number of milliseconds since UTC epoch.
} | ||
|
||
@ExpressionDescription( | ||
usage = "_FUNC_(date) - Returns timestamp from microseconds.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
|
||
protected def upScaleFactor: Long | ||
|
||
override def inputTypes: Seq[AbstractDataType] = Seq(LongType, IntegerType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should accept LongType
only. The analyzer will cast the input to long for you, if possible.
…amp_milliseconds to timestamp_millis
fba8827
to
f0cc631
Compare
Test build #122972 has finished for PR 28534 at commit
|
Test build #122974 has finished for PR 28534 at commit
|
Test build #122987 has finished for PR 28534 at commit
|
thanks, merging to master! |
Hi @TJX2014 , the original JIRA doesn't fit this PR very well. Can you create a new JIRA ticket and update the PR title? Thanks! |
Thanks very much. Got a lot. |
@cloud-fan ok, I will create the jira. |
### What changes were proposed in this pull request? This is a followup of #28534 , to make `TIMESTAMP_SECONDS` function support fractional input as well. ### Why are the changes needed? Previously the cast function can cast fractional values to timestamp. Now we suggest users to ues these new functions, and we need to cover all the cast use cases. ### Does this PR introduce _any_ user-facing change? Yes, now `TIMESTAMP_SECONDS` function accepts fractional input. ### How was this patch tested? new tests Closes #28956 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…X_MICROS ### What changes were proposed in this pull request? As #28534 adds functions from [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) for converting numbers to timestamp, this PR is to add functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to numbers. ### Why are the changes needed? 1. Symmetry of the conversion functions 2. Casting timestamp type to numeric types is disallowed in ANSI mode, we should provide functions for users to complete the conversion. ### Does this PR introduce _any_ user-facing change? 3 new functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to long type. ### How was this patch tested? Unit tests. Closes #30566 from gengliangwang/timestampLong. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
Add and register three new functions:
TIMESTAMP_SECONDS
,TIMESTAMP_MILLIS
andTIMESTAMP_MICROS
A test is added.
Reference: BigQuery
Why are the changes needed?
People will have convenient way to get timestamps from seconds,milliseconds and microseconds.
Does this PR introduce any user-facing change?
Yes, people will have the following ways to get timestamp:
How was this patch tested?
Unit test.