[SPARK-37433][SQL] Uses TimeZone.getDefault when timeZoneId is None for TimeZoneAwareExpression #34675
Conversation
Can one of the admins verify this patch?
/** Returns a copy of this expression with the specified timeZoneId. */
def withTimeZone(timeZoneId: String): TimeZoneAwareExpression

@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
timeZoneId is guaranteed to be defined by resolved above. How did you face the error case?
Using the expression outside the context of a DataFrame gives the error. For example, calling the functions hour or date_format in a Spark shell reproduces it:
scala> hour(current_timestamp).expr.eval()
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.Hour.zoneId$lzycompute(datetimeExpressions.scala:321)
at org.apache.spark.sql.catalyst.expressions.Hour.zoneId(datetimeExpressions.scala:321)
at org.apache.spark.sql.catalyst.expressions.GetTimeField.nullSafeEval(datetimeExpressions.scala:302)
at org.apache.spark.sql.catalyst.expressions.GetTimeField.nullSafeEval$(datetimeExpressions.scala:301)
at org.apache.spark.sql.catalyst.expressions.Hour.nullSafeEval(datetimeExpressions.scala:321)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
... 47 elided
scala> date_format(current_timestamp, "dd").expr.eval()
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId$lzycompute(datetimeExpressions.scala:772)
at org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId(datetimeExpressions.scala:772)
at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter(datetimeExpressions.scala:70)
at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter$(datetimeExpressions.scala:67)
at org.apache.spark.sql.catalyst.expressions.DateFormatClass.getFormatter(datetimeExpressions.scala:772)
at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.$anonfun$formatterOption$1(datetimeExpressions.scala:64)
at scala.Option.map(Option.scala:230)
These expressions are not supposed to be used standalone, and they are supposed to be internal (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22).
I agree that expressions are not supposed to be used standalone, but the error is not due to the usage of expressions; it occurs during the usage of Column. It's common practice to pass a Column as the argument to a transformation function, as with lit(".."). Later in the transformation pipeline, if we want to get a string value back from the Column, we should be able to get it back.
In our use case, we pass the value of a date as a Column at the beginning of the transformation pipeline to do the filtering on different DataFrames, and at the end of the transformation we use the date value for partitioning the results. Currently, we have no better choice than passing the date value as a string and converting it to a Column using lit(..) during filtering, which is neither elegant nor consistent with the rest of the pipeline. A minimal sketch of the pattern is below.
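A minimal sketch of this pattern, with illustrative names (filterByDate, partitionValue and the event_date column are not from the PR; only Column.expr.eval() mirrors the failing call):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// The date arrives as a Column at the start of the pipeline and is used
// to filter several DataFrames.
def filterByDate(df: DataFrame, date: Column): DataFrame =
  df.filter(col("event_date") === date)

// At the end of the pipeline the same Column is evaluated standalone to
// recover a plain string for partitioning -- where None.get can surface.
def partitionValue(date: Column): String =
  date.expr.eval().toString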
Moreover, calling .get on an Option has never been good practice; it can lead to unhelpful error messages. We should always prefer .getOrElse on an Option. So I propose we use either .getOrElse(TimeZone.getDefault) or .getOrElse(throw UnsupportedUsageException("...")), sketched below.
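Both proposals, sketched with a small adaptation so the types line up (DateTimeUtils.getZoneId takes a String, hence TimeZone.getDefault.getID; UnsupportedUsageException is the hypothetical exception type named above, not an existing class):

// Option 1: fall back to the JVM default time zone.
@transient lazy val zoneId: ZoneId =
  DateTimeUtils.getZoneId(timeZoneId.getOrElse(TimeZone.getDefault.getID))

// Option 2: fail fast with a meaningful message instead of None.get.
@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.getOrElse(
  throw new UnsupportedUsageException("timeZoneId must be set before evaluation")))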
The expression is only resolved when the time zone has been set:
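The check being referenced in TimeZoneAwareExpression presumably looks like this (reproduced from the Spark source as recalled, so treat it as approximate):

override lazy val resolved: Boolean =
  childrenResolved && checkInputDataTypes().isSuccess && timeZoneId.isDefined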
I agree, but it will get resolved only when used in the context of a DataFrame/Dataset. Using a Column outside of that, for example passing it as an argument to a function to do the filtering, will show the confusing error java.util.NoSuchElementException: None.get.
@wangyum @HyukjinKwon Maybe we could set the time zone when the None case is hit, or do something better in that case. What do you think?
If you are ready to use internal stuff, just set the time zone manually:
scala> hour(current_timestamp).expr.asInstanceOf[TimeZoneAwareExpression].withTimeZone("UTC").eval()
res6: Any = 20
scala> hour(current_timestamp).expr.asInstanceOf[TimeZoneAwareExpression].withTimeZone("Europe/Moscow").eval()
res7: Any = 23
MaxGekk left a comment:
I don't think we should allow setting zoneId in such a way. If you are ready to use internal APIs, write a helper function which initialises zoneId to the value you desire; a sketch of such a helper follows.
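A sketch of such a helper, assuming the internal Catalyst APIs quoted above (Column.expr, TimeZoneAwareExpression.withTimeZone); evalWithDefaultTimeZone is an illustrative name, not an existing function:

import java.util.TimeZone
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression

// Set the JVM default time zone on every time-zone-aware subexpression
// that has none, then evaluate the expression standalone.
def evalWithDefaultTimeZone(col: Column): Any = {
  val withZone = col.expr.transform {
    case tz: TimeZoneAwareExpression if tz.timeZoneId.isEmpty =>
      tz.withTimeZone(TimeZone.getDefault.getID)
  }
  withZone.eval()
}

In a shell with org.apache.spark.sql.functions._ imported, the failing example then becomes evalWithDefaultTimeZone(hour(current_timestamp)), without touching TimeZoneAwareExpression itself.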
The change under review:

- @transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
+ @transient lazy val zoneId: ZoneId = timeZoneId match {
+   case Some(x) => DateTimeUtils.getZoneId(x)
+   case None => TimeZone.getDefault.toZoneId
+ }
I don't think we should use the default JVM time zone. At the least, use spark.sql.session.timeZone; see the sketch below.
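A sketch of that alternative, assuming the internal SQLConf API (SQLConf.get.sessionLocalTimeZone is how Catalyst reads spark.sql.session.timeZone):

import java.time.ZoneId
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.internal.SQLConf

// Fall back to the session time zone rather than the JVM default.
@transient lazy val zoneId: ZoneId = timeZoneId match {
  case Some(x) => DateTimeUtils.getZoneId(x)
  case None => DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone)
}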
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Calling timeZoneId.get on an Option[String] leads to NoSuchElementException: None.get. This PR matches on the value of the Option[String] and uses TimeZone.getDefault.toZoneId when timeZoneId is None, which avoids surfacing an unexpected exception to users.
Why are the changes needed?
Calling .get on an Option has never been a good idea. We can either use a default value or choose to throw a meaningful exception. In this case, TimeZone.getDefault is a good fit for a default value.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested locally and via unit tests; a sketch of such a test is below.
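The test body is not shown in this thread; a sketch of what such a unit test might look like (test name and setup are illustrative, assuming org.apache.spark.sql.catalyst.expressions.{CurrentTimestamp, Hour} and java.util.TimeZone are in scope):

test("SPARK-37433: zoneId falls back to the JVM default when timeZoneId is None") {
  // Hour is built without a time zone, as happens when a Column is used
  // outside a DataFrame; accessing zoneId must no longer throw None.get.
  val hourExpr = Hour(CurrentTimestamp())
  assert(hourExpr.zoneId === TimeZone.getDefault.toZoneId)
}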