[SPARK-37433][SQL] Uses TimeZone.getDefault when timeZoneId is None for TimeZoneAwareExpression #34675
Conversation
Can one of the admins verify this patch?
/** Returns a copy of this expression with the specified timeZoneId. */
def withTimeZone(timeZoneId: String): TimeZoneAwareExpression

@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
timeZoneId is guaranteed to be defined by resolved above. How did you face the error case?
Using the expression outside the context of a DataFrame gives the error. For example, calling the functions hour or date_format in a Spark shell reproduces it:
scala> hour(current_timestamp).expr.eval()
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.Hour.zoneId$lzycompute(datetimeExpressions.scala:321)
at org.apache.spark.sql.catalyst.expressions.Hour.zoneId(datetimeExpressions.scala:321)
at org.apache.spark.sql.catalyst.expressions.GetTimeField.nullSafeEval(datetimeExpressions.scala:302)
at org.apache.spark.sql.catalyst.expressions.GetTimeField.nullSafeEval$(datetimeExpressions.scala:301)
at org.apache.spark.sql.catalyst.expressions.Hour.nullSafeEval(datetimeExpressions.scala:321)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
... 47 elided
scala> date_format(current_timestamp, "dd").expr.eval()
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
at org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId$lzycompute(datetimeExpressions.scala:772)
at org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId(datetimeExpressions.scala:772)
at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter(datetimeExpressions.scala:70)
at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter$(datetimeExpressions.scala:67)
at org.apache.spark.sql.catalyst.expressions.DateFormatClass.getFormatter(datetimeExpressions.scala:772)
at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.$anonfun$formatterOption$1(datetimeExpressions.scala:64)
at scala.Option.map(Option.scala:230)
These expressions are not supposed to be used standalone, and they are supposed to be internal (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22).
I agree that expressions are not supposed to be used standalone, but the error is not due to the usage of expressions; it occurs during the usage of Column. It's common practice to pass a Column as the argument to a transformation function, as with lit(".."). Later in the transformation pipeline, if we want to get a string value back from the Column, we should be able to get it back.
In our use case, we pass the value of a date as a Column at the beginning of the transformation pipeline to do the filtering on different DataFrames, and at the end of the transformation we use the date value for partitioning the results. Currently, we have no better choice than passing the date value as a string and converting it to a Column using lit(..) during filtering, which is neither elegant nor consistent with the rest of the pipeline. A minimal sketch of the pattern is below.
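A minimal sketch of this pattern, with illustrative names (filterByDate, partitionValue and the event_date column are not from the PR; only Column.expr.eval() mirrors the failing call):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// The date arrives as a Column at the start of the pipeline and is used
// to filter several DataFrames.
def filterByDate(df: DataFrame, date: Column): DataFrame =
  df.filter(col("event_date") === date)

// At the end of the pipeline the same Column is evaluated standalone to
// recover a plain string for partitioning -- where None.get can surface.
def partitionValue(date: Column): String =
  date.expr.eval().toString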
Moreover, calling .get on an Option has never been good practice; it can lead to unhelpful error messages. We should always prefer .getOrElse on an Option. So I propose we use either .getOrElse(TimeZone.getDefault) or .getOrElse(throw UnsupportedUsageException("...")), sketched below.
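Both proposals, sketched with a small adaptation so the types line up (DateTimeUtils.getZoneId takes a String, hence TimeZone.getDefault.getID; UnsupportedUsageException is the hypothetical exception type named above, not an existing class):

// Option 1: fall back to the JVM default time zone.
@transient lazy val zoneId: ZoneId =
  DateTimeUtils.getZoneId(timeZoneId.getOrElse(TimeZone.getDefault.getID))

// Option 2: fail fast with a meaningful message instead of None.get.
@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.getOrElse(
  throw new UnsupportedUsageException("timeZoneId must be set before evaluation")))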
The expression is only resolved when the time zone has been set:
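The check being referenced in TimeZoneAwareExpression presumably looks like this (reproduced from the Spark source as recalled, so treat it as approximate):

override lazy val resolved: Boolean =
  childrenResolved && checkInputDataTypes().isSuccess && timeZoneId.isDefined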
I agree, but it will get resolved only when used in the context of a DataFrame/Dataset. Using a Column outside of that, for example passing it as an argument to a function to do the filtering, will show the confusing error java.util.NoSuchElementException: None.get.
@wangyum @HyukjinKwon Maybe we could set the time zone when the None case is hit, or do something better in that case. What do you think?
If you are ready to use internal stuff, just set the time zone manually:
scala> hour(current_timestamp).expr.asInstanceOf[TimeZoneAwareExpression].withTimeZone("UTC").eval()
res6: Any = 20
scala> hour(current_timestamp).expr.asInstanceOf[TimeZoneAwareExpression].withTimeZone("Europe/Moscow").eval()
res7: Any = 23
MaxGekk left a comment:
I don't think we should allow setting zoneId in such a way. If you are ready to use internal APIs, write a helper function which initialises zoneId to the value you desire; a sketch of such a helper follows.
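A sketch of such a helper, assuming the internal Catalyst APIs quoted above (Column.expr, TimeZoneAwareExpression.withTimeZone); evalWithDefaultTimeZone is an illustrative name, not an existing function:

import java.util.TimeZone
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression

// Set the JVM default time zone on every time-zone-aware subexpression
// that has none, then evaluate the expression standalone.
def evalWithDefaultTimeZone(col: Column): Any = {
  val withZone = col.expr.transform {
    case tz: TimeZoneAwareExpression if tz.timeZoneId.isEmpty =>
      tz.withTimeZone(TimeZone.getDefault.getID)
  }
  withZone.eval()
}

In a shell with org.apache.spark.sql.functions._ imported, the failing example then becomes evalWithDefaultTimeZone(hour(current_timestamp)), without touching TimeZoneAwareExpression itself.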
The change under review:

- @transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
+ @transient lazy val zoneId: ZoneId = timeZoneId match {
+   case Some(x) => DateTimeUtils.getZoneId(x)
+   case None => TimeZone.getDefault.toZoneId
+ }
I don't think we should use the default JVM time zone. At the least, use spark.sql.session.timeZone; see the sketch below.
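A sketch of that alternative, assuming the internal SQLConf API (SQLConf.get.sessionLocalTimeZone is how Catalyst reads spark.sql.session.timeZone):

import java.time.ZoneId
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.internal.SQLConf

// Fall back to the session time zone rather than the JVM default.
@transient lazy val zoneId: ZoneId = timeZoneId match {
  case Some(x) => DateTimeUtils.getZoneId(x)
  case None => DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone)
}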
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Calling timeZoneId.get on an Option[String] leads to NoSuchElementException: None.get. This PR matches on the value of the Option[String] and uses TimeZone.getDefault.toZoneId when timeZoneId is None, which avoids surfacing an unexpected exception to users.
Why are the changes needed?
Calling .get on an Option has never been a good idea. We can either use a default value or choose to throw a meaningful exception. In this case, TimeZone.getDefault is a good fit for a default value.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested locally and via unit tests; a sketch of such a test is below.
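The test body is not shown in this thread; a sketch of what such a unit test might look like (test name and setup are illustrative, assuming org.apache.spark.sql.catalyst.expressions.{CurrentTimestamp, Hour} and java.util.TimeZone are in scope):

test("SPARK-37433: zoneId falls back to the JVM default when timeZoneId is None") {
  // Hour is built without a time zone, as happens when a Column is used
  // outside a DataFrame; accessing zoneId must no longer throw None.get.
  val hourExpr = Hour(CurrentTimestamp())
  assert(hourExpr.zoneId === TimeZone.getDefault.toZoneId)
}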