
[SPARK-20018][SQL] Pivot with timestamp and count should not print internal representation #17348

Closed · wants to merge 4 commits

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

Currently, when we pivot on a timestamp column and aggregate with count, the internal representation (microseconds since the epoch) is printed as the column name, as below:

Seq(new java.sql.Timestamp(1)).toDF("a").groupBy("a").pivot("a").count().show()
+--------------------+----+
|                   a|1000|
+--------------------+----+
|1969-12-31 16:00:...|   1|
+--------------------+----+
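
For context on why the column is named "1000": TimestampType values are stored internally as microseconds since the epoch, so a timestamp of 1 millisecond becomes the Long value 1000. A minimal illustration (not part of the patch):

val t = new java.sql.Timestamp(1)       // 1 millisecond after the epoch
val internalMicros = t.getTime * 1000L  // 1000 -- the value that leaks into the column name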

This PR proposes to use the external Scala value instead of the internal representation for the column names, as below:

+--------------------+-----------------------+
|                   a|1969-12-31 16:00:00.001|
+--------------------+-----------------------+
|1969-12-31 16:00:...|                      1|
+--------------------+-----------------------+

How was this patch tested?

Unit tests in DataFramePivotSuite and manual tests.

@HyukjinKwon
Member Author

cc @aray and @cloud-fan, could you take a look and see if it makes sense?

@aray
Contributor

aray commented Mar 19, 2017

LGTM

@HyukjinKwon
Member Author

Thank you for your sign-off @aray.

@SparkQA

SparkQA commented Mar 19, 2017

Test build #74824 has finished for PR 17348 at commit 3c619df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -486,14 +486,16 @@ class Analyzer(
    case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, child) =>
      val singleAgg = aggregates.size == 1
      def outputName(value: Literal, aggregate: Expression): String = {
        val scalaValue = CatalystTypeConverters.convertToScala(value.value, value.dataType)
        val stringValue = Option(scalaValue).getOrElse("null").toString
Member

The impact is not only on the timestamp data type. Is there any test case to cover null?

Member Author

Maybe. I thought https://github.com/HyukjinKwon/spark/blob/3c619dfb94723bd7a7d6a0811ab6329bf107f81b/sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala#L220-L232 covers this.

Literal.toString handled the null case before. If we remove Option(...).getOrElse("null") here, those tests throw an NPE.
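
To make the null handling concrete, a minimal, self-contained sketch (illustrative, not the Spark source):

// Option(...).getOrElse("null") turns a null pivot value into the string "null"
// instead of throwing when .toString is called on a null reference.
val nullValue: Any = null
val safeName = Option(nullValue).getOrElse("null").toString  // "null"
// nullValue.toString                                        // would throw a NullPointerException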

@ueshin
Member

ueshin commented Mar 22, 2017

What if the session local timezone is changed?

@HyukjinKwon
Member Author

HyukjinKwon commented Mar 22, 2017

@ueshin, you are right. I think we should consider the timezone.

val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show()
+--------------------+-----------------------+
|                   a|2012-12-31 16:00:10.011|
+--------------------+-----------------------+
|2012-12-30 23:00:...|                      1|
+--------------------+-----------------------+

@@ -486,14 +486,16 @@ class Analyzer(
    case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, child) =>
      val singleAgg = aggregates.size == 1
      def outputName(value: Literal, aggregate: Expression): String = {
        val utf8Value = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
Member Author

It seems we can cast to StringType in all the cases -
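
For reference, a minimal sketch of what this timezone-aware cast evaluates to (assumes Spark 2.2-era catalyst internals; the literal and time zone here are only examples):

import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal}
import org.apache.spark.sql.types.StringType

// Casting the pivot literal to StringType with an explicit time zone renders it
// in that zone; the resulting string becomes the pivot output column name.
val value = Literal(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011"))
val utf8Value = Cast(value, StringType, Some("America/Los_Angeles")).eval(EmptyRow)
val stringValue = Option(utf8Value).map(_.toString).getOrElse("null")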

Member Author

BTW, is this the correct way to handle the timezone, @ueshin?

Member

Yes, it looks good.

Member Author

Thank you for your confirmation.

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75018 has finished for PR 17348 at commit 93f05f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val df = Seq(java.sql.Timestamp.valueOf(ts)).toDF("a").groupBy("a").pivot("a").count()
val expected = StructType(
  StructField("a", TimestampType) ::
  StructField(tsWithZone, LongType) :: Nil)
Contributor

Is it expected? Users will see different values now.

Member Author

Yeah, I was confused by it too because the original values are apparently rendered differently. However, it seems intended.

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").show()
+--------------------+
|                   a|
+--------------------+
|2012-12-30 23:00:...|
+--------------------+

The internal values stay as they are; only the human-readable format changes according to the given timezone.

I guess this is as described in #16308
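
To make that concrete, a small sketch (run in spark-shell, where spark.implicits._ is already imported; illustrative only):

import java.sql.Timestamp

val ts = Timestamp.valueOf("2012-12-31 16:00:10.011")
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
// The stored instant is unchanged; only show()'s rendering follows the session timezone.
val collected = Seq(ts).toDF("a").collect().head.getTimestamp(0)
assert(collected.getTime == ts.getTime)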

Contributor

The column name changes with the timezone, but what about the value? Can you also check the result?

Member Author

Ah, sure.

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show(false)
+-----------------------+-----------------------+
|a                      |2012-12-30 23:00:10.011|
+-----------------------+-----------------------+
|2012-12-30 23:00:10.011|1                      |
+-----------------------+-----------------------+

Member Author

With the default timezone ...

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show(false)
+-----------------------+-----------------------+
|a                      |2012-12-31 16:00:10.011|
+-----------------------+-----------------------+
|2012-12-31 16:00:10.011|1                      |
+-----------------------+-----------------------+

Member Author

A few more tests with a string cast ...

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().selectExpr("cast(a as string)", "`2012-12-31 16:00:10.011`").show(false)
+-----------------------+-----------------------+
|a                      |2012-12-31 16:00:10.011|
+-----------------------+-----------------------+
|2012-12-31 16:00:10.011|1                      |
+-----------------------+-----------------------+

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().selectExpr("cast(a as string)", "`2012-12-30 23:00:10.011`").show(false)
+-----------------------+-----------------------+
|a                      |2012-12-30 23:00:10.011|
+-----------------------+-----------------------+
|2012-12-30 23:00:10.011|1                      |
+-----------------------+-----------------------+

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75019 has finished for PR 17348 at commit 4e4cfa7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val expected = StructType(
  StructField("a", TimestampType) ::
  StructField(tsWithZone, LongType) :: Nil)
assert(df.schema == expected)
Contributor

Can we add a checkAnswer to make sure the value is also tsWithZone?

Member Author

Sure.
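
A sketch of what such a check could look like in DataFramePivotSuite (tsWithZone and df come from the surrounding test; the exact assertion in the merged change may differ):

// Both the string-cast grouping column and the pivoted column name should
// reflect the session timezone, and the count should be 1.
checkAnswer(
  df.selectExpr("cast(a as string)", s"`$tsWithZone`"),
  Seq(Row(tsWithZone, 1L)))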

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75051 has finished for PR 17348 at commit 803a094.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFramePivotSuite extends QueryTest with SharedSQLContext

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 80fd070 Mar 22, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-20018 branch January 2, 2018 03:43