[SPARK-20018][SQL] Pivot with timestamp and count should not print internal representation #17348
Conversation
cc @aray and @cloud-fan, could you take a look and see if it makes sense?
The branch was force-pushed from 470f36c to 3c619df.
LGTM
Thank you for your sign-off @aray.
Test build #74824 has finished for PR 17348 at commit
@@ -486,14 +486,16 @@ class Analyzer(
    case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, child) =>
      val singleAgg = aggregates.size == 1
      def outputName(value: Literal, aggregate: Expression): String = {
        val scalaValue = CatalystTypeConverters.convertToScala(value.value, value.dataType)
        val stringValue = Option(scalaValue).getOrElse("null").toString
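For context, a minimal sketch of what this first version of outputName does, assuming Spark 2.x catalyst internals: a timestamp literal stores microseconds since the epoch as a Long, and convertToScala maps that back to an external java.sql.Timestamp whose toString is human-readable.

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.TimestampType

// The literal's internal value is a Long (microseconds since the epoch),
// which is what leaked into the column names before this fix.
val value = Literal.create(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011"), TimestampType)

// convertToScala maps the internal Long back to java.sql.Timestamp, and the
// null guard keeps a null pivot value from throwing a NullPointerException.
val scalaValue = CatalystTypeConverters.convertToScala(value.value, value.dataType)
val stringValue = Option(scalaValue).getOrElse("null").toString
// stringValue == "2012-12-31 16:00:10.011" (rendered in the JVM default timezone)
```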
The impact is not only on the data type timestamp. Any test case to cover null?
Maybe, I thought https://github.com/HyukjinKwon/spark/blob/3c619dfb94723bd7a7d6a0811ab6329bf107f81b/sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala#L220-L232 covers this.
Literal.toString handled the null case before. If we remove Option(...).getOrElse("null") there, it throws an NPE in those tests.
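For illustration, a minimal sketch of that null case, assuming a running SparkSession named spark as in the REPL snippets elsewhere in this thread (the data is illustrative, not the exact suite test):

```scala
import spark.implicits._

// A pivot column containing a null: convertToScala returns null for it, so
// without the Option(...).getOrElse("null") guard, .toString would throw an NPE.
val df = Seq(Some("a"), Some("b"), None).toDF("x")

// With the guard, this yields a column literally named "null" instead of failing.
df.groupBy("x").pivot("x").count().show()
```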
What if the session local timezone is changed?
@ueshin, you are right. I think we should consider the timezone.

val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show()
@@ -486,14 +486,16 @@ class Analyzer(
    case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, child) =>
      val singleAgg = aggregates.size == 1
      def outputName(value: Literal, aggregate: Expression): String = {
        val utf8Value = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
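For context, a minimal sketch of what this cast evaluates to, assuming the timezone-aware Cast expression (its timeZoneId parameter comes from the session-local timezone work referenced later in this thread): eval returns a UTF8String rendered in the given zone.

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal}
import org.apache.spark.sql.types.{StringType, TimestampType}

val value = Literal.create(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011"), TimestampType)

// Evaluating the cast renders the internal microseconds in the given timezone;
// the exact text depends on the JVM default zone used by Timestamp.valueOf.
val utf8Value = Cast(value, StringType, Some("America/Los_Angeles")).eval(EmptyRow)
val columnName = Option(utf8Value).map(_.toString).getOrElse("null")
```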
It seems we can cast into StringType from any type -
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (line 41 at e9e2c61):
case (_, StringType) => true
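A brief, hedged illustration of that rule: the case (_, StringType) => true branch in Cast.canCast means any source type may be cast to StringType, so the approach works for arbitrary pivot column types.

```scala
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types._

// All of these hold because canCast accepts any source type when the target is StringType.
assert(Cast.canCast(TimestampType, StringType))
assert(Cast.canCast(IntegerType, StringType))
assert(Cast.canCast(ArrayType(DoubleType), StringType))
```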
BTW, is this a correct way of handling the timezone, @ueshin?
Yes, it looks good.
Thank you for your confirmation.
Test build #75018 has finished for PR 17348 at commit
val df = Seq(java.sql.Timestamp.valueOf(ts)).toDF("a").groupBy("a").pivot("a").count()
val expected = StructType(
  StructField("a", TimestampType) ::
  StructField(tsWithZone, LongType) :: Nil)
Is it expected? Users will see different values now.
Yeah, I was confused by it too because the original values are apparently rendered differently. However, it seems intended.
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011
scala> Seq(timestamp).toDF("a").show()
+--------------------+
| a|
+--------------------+
|2012-12-30 23:00:...|
+--------------------+
The internal values stay as they are; it seems only the human-readable format changes according to the given timezone.
I guess this is as described in #16308
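To illustrate that the internal values are unchanged, a minimal sketch, assuming a SparkSession named spark (Asia/Seoul is just an illustrative second zone): casting the timestamp to a Long yields epoch seconds, which are unaffected by the session timezone.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")).toDF("a")

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
val inLA = df.select(col("a").cast("long")).first().getLong(0)

spark.conf.set("spark.sql.session.timeZone", "Asia/Seoul")
val inSeoul = df.select(col("a").cast("long")).first().getLong(0)

// The internal instant is identical; only show()'s human-readable form differs.
assert(inLA == inSeoul)
```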
The column name changes with the timezone, but what about the value? Can you also check the result?
Ah, sure.
scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show(false)
+-----------------------+-----------------------+
|a |2012-12-30 23:00:10.011|
+-----------------------+-----------------------+
|2012-12-30 23:00:10.011|1 |
+-----------------------+-----------------------+
With the default timezone ...
scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011
scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show(false)
+-----------------------+-----------------------+
|a |2012-12-31 16:00:10.011|
+-----------------------+-----------------------+
|2012-12-31 16:00:10.011|1 |
+-----------------------+-----------------------+
A few more tests with a string cast ...
scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011
scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().selectExpr("cast(a as string)", "`2012-12-31 16:00:10.011`").show(false)
+-----------------------+-----------------------+
|a |2012-12-31 16:00:10.011|
+-----------------------+-----------------------+
|2012-12-31 16:00:10.011|1 |
+-----------------------+-----------------------+
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011
scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().selectExpr("cast(a as string)", "`2012-12-30 23:00:10.011`").show(false)
+-----------------------+-----------------------+
|a |2012-12-30 23:00:10.011|
+-----------------------+-----------------------+
|2012-12-30 23:00:10.011|1 |
+-----------------------+-----------------------+
Test build #75019 has finished for PR 17348 at commit
val expected = StructType(
  StructField("a", TimestampType) ::
  StructField(tsWithZone, LongType) :: Nil)
assert(df.schema == expected)
Can we add a checkAnswer to make sure the value is also tsWithZone?
Sure.
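A hedged sketch of what that check could look like, reusing df and tsWithZone from the snippet above (illustrative; the committed code may differ):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Casting the group-by column to a string should render tsWithZone in the
// session timezone, and the pivoted column (named tsWithZone) holds the count.
checkAnswer(
  df.select(col("a").cast("string"), col(s"`$tsWithZone`")),
  Row(tsWithZone, 1L)
)
```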
LGTM
Test build #75051 has finished for PR 17348 at commit
Thanks! Merging to master.
What changes were proposed in this pull request?
Currently, when we perform a count with timestamp types, it prints the internal representation as the column name, as below:
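(The original example block was not preserved on this page. Illustratively, assuming the timestamp used in the review discussion, the pivoted column was named with the internal Long value in microseconds:)

```scala
// Illustrative REPL session; the exact Long depends on the JVM timezone.
scala> Seq(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")).toDF("a").groupBy("a").pivot("a").count().schema.fieldNames
res0: Array[String] = Array(a, 1356969610011000)
```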
This PR proposes to use the external Scala value instead of the internal representation in the column names, as below:
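(Again illustrative, for the same input: with this change, the column name is the external string representation of the timestamp:)

```scala
scala> Seq(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")).toDF("a").groupBy("a").pivot("a").count().schema.fieldNames
res1: Array[String] = Array(a, 2012-12-31 16:00:10.011)
```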
How was this patch tested?
Unit test in DataFramePivotSuite and manual tests.
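For reference, a hedged reconstruction of the kind of test added to DataFramePivotSuite, based on the fragments quoted in the review above; the GMT rendering assumes the JVM runs in America/Los_Angeles, as Spark's test harness does. This is a sketch, not the exact committed code.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}

// Assumed to live inside DataFramePivotSuite, which provides checkAnswer,
// withSQLConf, and testImplicits (for toDF and the $ interpolator).
test("pivot with timestamp and count should not print internal representation") {
  val ts = "2012-12-31 16:00:10.011"
  val tsWithZone = "2013-01-01 00:00:10.011"  // ts rendered in GMT, assuming a PST JVM

  withSQLConf("spark.sql.session.timeZone" -> "GMT") {
    val df = Seq(java.sql.Timestamp.valueOf(ts)).toDF("a").groupBy("a").pivot("a").count()
    val expected = StructType(
      StructField("a", TimestampType) ::
      StructField(tsWithZone, LongType) :: Nil)
    assert(df.schema == expected)
    // The rendered value should match the pivoted column name in the session timezone.
    checkAnswer(df.select($"a".cast("string")), Row(tsWithZone))
  }
}
```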