
[SPARK-37433][SQL] Uses TimeZone.getDefault when timeZoneId is None for ZoneAwareExpression #34675

Closed

sathiyapk wants to merge 2 commits into apache:master from sathiyapk:SPARK-37433-TimeZoneAwareExpression-None.get

Conversation

@sathiyapk
Contributor

What changes were proposed in this pull request?

Calling timeZoneId.get on an Option[String] leads to NoSuchElementException: None.get. This PR pattern matches on the Option[String] and uses TimeZone.getDefault.toZoneId when timeZoneId is None, which avoids surfacing an unexpected exception to users.
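A minimal sketch of the proposed change (it mirrors the diff discussed in the review below; the ZoneId, TimeZone, and DateTimeUtils imports are assumed from the surrounding file):

// Fall back to the JVM default time zone instead of calling .get,
// which throws NoSuchElementException when timeZoneId is None.
@transient lazy val zoneId: ZoneId = timeZoneId match {
  case Some(id) => DateTimeUtils.getZoneId(id)
  case None => TimeZone.getDefault.toZoneId
}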

Why are the changes needed?

Calling .get on an Option is never a good idea. We can either fall back to a default value or throw a meaningful exception. In this case, TimeZone.getDefault is a good fit for the default value.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested locally and via unit tests.

@github-actions github-actions bot added the SQL label Nov 21, 2021
@AmplabJenkins

Can one of the admins verify this patch?

/** Returns a copy of this expression with the specified timeZoneId. */
def withTimeZone(timeZoneId: String): TimeZoneAwareExpression

@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
Member


timeZoneId is guaranteed to be defined by the resolved check above. How did you hit the error case?

Contributor Author


Using the expression outside the context of a DataFrame raises the error. For example, calling the functions hour or date_format in a Spark shell fails:

scala> hour(current_timestamp).expr.eval()
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
  at org.apache.spark.sql.catalyst.expressions.Hour.zoneId$lzycompute(datetimeExpressions.scala:321)
  at org.apache.spark.sql.catalyst.expressions.Hour.zoneId(datetimeExpressions.scala:321)
  at org.apache.spark.sql.catalyst.expressions.GetTimeField.nullSafeEval(datetimeExpressions.scala:302)
  at org.apache.spark.sql.catalyst.expressions.GetTimeField.nullSafeEval$(datetimeExpressions.scala:301)
  at org.apache.spark.sql.catalyst.expressions.Hour.nullSafeEval(datetimeExpressions.scala:321)
  at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
  ... 47 elided

scala> date_format(current_timestamp, "dd").expr.eval()
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
  at org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId$lzycompute(datetimeExpressions.scala:772)
  at org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId(datetimeExpressions.scala:772)
  at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter(datetimeExpressions.scala:70)
  at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter$(datetimeExpressions.scala:67)
  at org.apache.spark.sql.catalyst.expressions.DateFormatClass.getFormatter(datetimeExpressions.scala:772)
  at org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.$anonfun$formatterOption$1(datetimeExpressions.scala:64)
  at scala.Option.map(Option.scala:230)

Member


These expressions are not supposed to be used standalone; they are internal (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22).

Contributor Author


I agree that expressions are not supposed to be used standalone, but the error does not come from using expressions directly; it comes from using a Column. It is common practice to pass a Column as the argument to a transformation function, as with lit(".."). Later in the transformation pipeline, if we want to get a string value back out of the Column, we should be able to do so.

In our use case, we pass a date value as a Column at the beginning of the transformation pipeline to filter different DataFrames, and at the end of the transformation we use the date value to partition the results. Currently, we have no better choice than passing the date value as a string and converting it to a Column with lit(..) during filtering, which is neither elegant nor consistent with the rest of the pipeline.

Contributor Author


Moreover, calling .get on an Option has never been good practice; it can lead to unhelpful error messages. We should always prefer .getOrElse. So I propose we use either .getOrElse(TimeZone.getDefault) or .getOrElse(throw UnsupportedUsageException("...")). Both alternatives are sketched below.
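A minimal sketch of the two alternatives (note that timeZoneId is an Option[String], so the default has to be a zone ID string; the UnsupportedUsageException named above is hypothetical, so IllegalStateException stands in for it here):

// Alternative 1: fall back to the JVM default time zone ID.
@transient lazy val zoneId: ZoneId =
  DateTimeUtils.getZoneId(timeZoneId.getOrElse(TimeZone.getDefault.getID))

// Alternative 2: fail fast with a meaningful message instead of None.get.
@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(
  timeZoneId.getOrElse(throw new IllegalStateException(
    "timeZoneId must be set before evaluating this expression")))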

Member


The expression is only resolved when the time zone has been set:

/** The expression is only resolved when the time zone has been set. */
override lazy val resolved: Boolean =
  childrenResolved && checkInputDataTypes().isSuccess && timeZoneId.isDefined

Contributor Author


I agree, but it only gets resolved when used in the context of a DataFrame/Dataset. Using a Column outside of one, for example passing it as an argument to a filtering function, surfaces the confusing error java.util.NoSuchElementException: None.get.

Contributor Author


@wangyum @HyukjinKwon Maybe we could set the time zone when the None case is hit?

Or we could do something better in the None case. What do you think?

Member


If you are ready to use internal stuff, just set the time zone manually:

scala> hour(current_timestamp).expr.asInstanceOf[TimeZoneAwareExpression].withTimeZone("UTC").eval()
res6: Any = 20

scala> hour(current_timestamp).expr.asInstanceOf[TimeZoneAwareExpression].withTimeZone("Europe/Moscow").eval()
res7: Any = 23

Member

@MaxGekk left a comment


I don't think we should allow setting zoneId this way. If you are ready to use internal APIs, write a helper function that initialises zoneId to the value you want.
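A minimal sketch of such a helper (evalWithTimeZone is a hypothetical name; it relies on the internal Column.expr and TimeZoneAwareExpression.withTimeZone APIs shown above, and only handles a time-zone-aware expression at the top level):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression

// Hypothetical helper: evaluate a Column standalone, initialising the
// time zone on a top-level TimeZoneAwareExpression before eval().
def evalWithTimeZone(col: Column, timeZoneId: String): Any = col.expr match {
  case tz: TimeZoneAwareExpression => tz.withTimeZone(timeZoneId).eval()
  case other => other.eval()
}

// Usage, mirroring the shell examples above:
// evalWithTimeZone(hour(current_timestamp), "UTC")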


-  @transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get)
+  @transient lazy val zoneId: ZoneId = timeZoneId match {
+    case Some(x) => DateTimeUtils.getZoneId(x)
+    case None => TimeZone.getDefault.toZoneId
+  }
Member


I don't think we should use the default JVM time zone. At the very least, use spark.sql.session.timeZone.
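A hedged sketch of that alternative (assuming the catalyst-internal SQLConf, whose sessionLocalTimeZone reads spark.sql.session.timeZone):

import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.internal.SQLConf

// Fall back to the session time zone (spark.sql.session.timeZone)
// rather than the JVM default.
@transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(
  timeZoneId.getOrElse(SQLConf.get.sessionLocalTimeZone))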

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 10, 2022
@github-actions github-actions bot closed this Mar 11, 2022