
[SPARK-20018][SQL] Pivot with timestamp and count should not print internal representation #17348

Closed · wants to merge 4 commits

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

Currently, when we pivot on a timestamp column and aggregate with count, the internal representation (microseconds since the epoch) is printed as the column name, as below:

Seq(new java.sql.Timestamp(1)).toDF("a").groupBy("a").pivot("a").count().show()
+--------------------+----+
|                   a|1000|
+--------------------+----+
|1969-12-31 16:00:...|   1|
+--------------------+----+
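
For context on why the column is named "1000": TimestampType values are stored internally as microseconds since the epoch, so a timestamp of 1 millisecond becomes the Long value 1000. A minimal illustration (not part of the patch):

val t = new java.sql.Timestamp(1)       // 1 millisecond after the epoch
val internalMicros = t.getTime * 1000L  // 1000 -- the value that leaks into the column name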

This PR proposes to use the external Scala value instead of the internal representation for the column names, as below:

+--------------------+-----------------------+
|                   a|1969-12-31 16:00:00.001|
+--------------------+-----------------------+
|1969-12-31 16:00:...|                      1|
+--------------------+-----------------------+

How was this patch tested?

Unit tests in DataFramePivotSuite and manual tests.

@HyukjinKwon
Member Author

cc @aray and @cloud-fan, could you take a look and see if it makes sense?

@aray
Contributor

aray commented Mar 19, 2017

LGTM

@HyukjinKwon
Member Author

Thank you for your sign-off @aray.

@SparkQA

SparkQA commented Mar 19, 2017

Test build #74824 has finished for PR 17348 at commit 3c619df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -486,14 +486,16 @@ class Analyzer(
    case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, child) =>
      val singleAgg = aggregates.size == 1
      def outputName(value: Literal, aggregate: Expression): String = {
        val scalaValue = CatalystTypeConverters.convertToScala(value.value, value.dataType)
        val stringValue = Option(scalaValue).getOrElse("null").toString
Member

The impact is not only on the timestamp data type. Is there any test case to cover null?

Member Author

Maybe. I thought https://github.com/HyukjinKwon/spark/blob/3c619dfb94723bd7a7d6a0811ab6329bf107f81b/sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala#L220-L232 covers this.

Literal.toString handled the null case before. If we remove Option(...).getOrElse("null") here, those tests throw an NPE.
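
To make the null handling concrete, a minimal, self-contained sketch (illustrative, not the Spark source):

// Option(...).getOrElse("null") turns a null pivot value into the string "null"
// instead of throwing when .toString is called on a null reference.
val nullValue: Any = null
val safeName = Option(nullValue).getOrElse("null").toString  // "null"
// nullValue.toString                                        // would throw a NullPointerException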

@ueshin
Member

ueshin commented Mar 22, 2017

What if the session local timezone is changed?

@HyukjinKwon
Member Author

HyukjinKwon commented Mar 22, 2017

@ueshin, you are right. I think we should consider the timezone.

val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show()
+--------------------+-----------------------+
|                   a|2012-12-31 16:00:10.011|
+--------------------+-----------------------+
|2012-12-30 23:00:...|                      1|
+--------------------+-----------------------+

@@ -486,14 +486,16 @@ class Analyzer(
    case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, child) =>
      val singleAgg = aggregates.size == 1
      def outputName(value: Literal, aggregate: Expression): String = {
        val utf8Value = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
Member Author

It seems we can cast to StringType in all the cases -
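
For reference, a minimal sketch of what this timezone-aware cast evaluates to (assumes Spark 2.2-era catalyst internals; the literal and time zone here are only examples):

import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal}
import org.apache.spark.sql.types.StringType

// Casting the pivot literal to StringType with an explicit time zone renders it
// in that zone; the resulting string becomes the pivot output column name.
val value = Literal(java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011"))
val utf8Value = Cast(value, StringType, Some("America/Los_Angeles")).eval(EmptyRow)
val stringValue = Option(utf8Value).map(_.toString).getOrElse("null")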

Member Author

BTW, is this the correct way to handle the timezone, @ueshin?

Member

Yes, it looks good.

Member Author

Thank you for your confirmation.

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75018 has finished for PR 17348 at commit 93f05f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val df = Seq(java.sql.Timestamp.valueOf(ts)).toDF("a").groupBy("a").pivot("a").count()
val expected = StructType(
  StructField("a", TimestampType) ::
  StructField(tsWithZone, LongType) :: Nil)
Contributor

Is it expected? Users will see different values now.

Member Author

Yeah, I was confused by it too because the original values are apparently rendered differently. However, it seems intended.

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").show()
+--------------------+
|                   a|
+--------------------+
|2012-12-30 23:00:...|
+--------------------+

The internal values stay as they are; only the human-readable format changes according to the given timezone.

I guess this is as described in #16308
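
To make that concrete, a small sketch (run in spark-shell, where spark.implicits._ is already imported; illustrative only):

import java.sql.Timestamp

val ts = Timestamp.valueOf("2012-12-31 16:00:10.011")
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
// The stored instant is unchanged; only show()'s rendering follows the session timezone.
val collected = Seq(ts).toDF("a").collect().head.getTimestamp(0)
assert(collected.getTime == ts.getTime)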

Contributor

The column name changes with the timezone, but what about the value? Can you also check the result?

Member Author

Ah, sure.

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show(false)
+-----------------------+-----------------------+
|a                      |2012-12-30 23:00:10.011|
+-----------------------+-----------------------+
|2012-12-30 23:00:10.011|1                      |
+-----------------------+-----------------------+

Member Author

With the default timezone ...

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().show(false)
+-----------------------+-----------------------+
|a                      |2012-12-31 16:00:10.011|
+-----------------------+-----------------------+
|2012-12-31 16:00:10.011|1                      |
+-----------------------+-----------------------+

Member Author

A few more tests with a string cast ...

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().selectExpr("cast(a as string)", "`2012-12-31 16:00:10.011`").show(false)
+-----------------------+-----------------------+
|a                      |2012-12-31 16:00:10.011|
+-----------------------+-----------------------+
|2012-12-31 16:00:10.011|1                      |
+-----------------------+-----------------------+

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
timestamp: java.sql.Timestamp = 2012-12-31 16:00:10.011

scala> Seq(timestamp).toDF("a").groupBy("a").pivot("a").count().selectExpr("cast(a as string)", "`2012-12-30 23:00:10.011`").show(false)
+-----------------------+-----------------------+
|a                      |2012-12-30 23:00:10.011|
+-----------------------+-----------------------+
|2012-12-30 23:00:10.011|1                      |
+-----------------------+-----------------------+

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75019 has finished for PR 17348 at commit 4e4cfa7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val expected = StructType(
  StructField("a", TimestampType) ::
  StructField(tsWithZone, LongType) :: Nil)
assert(df.schema == expected)
Contributor

Can we add a checkAnswer to make sure the value is also tsWithZone?

Member Author

Sure.
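
A sketch of what such a check could look like in DataFramePivotSuite (tsWithZone and df come from the surrounding test; the exact assertion in the merged change may differ):

// Both the string-cast grouping column and the pivoted column name should
// reflect the session timezone, and the count should be 1.
checkAnswer(
  df.selectExpr("cast(a as string)", s"`$tsWithZone`"),
  Seq(Row(tsWithZone, 1L)))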

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 22, 2017

Test build #75051 has finished for PR 17348 at commit 803a094.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFramePivotSuite extends QueryTest with SharedSQLContext

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 80fd070 Mar 22, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-20018 branch January 2, 2018 03:43