[SPARK-22100] [SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type #19321
Conversation
@@ -270,7 +270,6 @@ class ApproximatePercentileSuite extends SparkFunSuite {
      percentageExpression = percentageExpression,
      accuracyExpression = Literal(100))

    val result = wrongPercentage.checkInputDataTypes()
This is duplicated by line 274.
Test build #82073 has finished for PR 19321 at commit
Test build #82099 has finished for PR 19321 at commit
Test build #82101 has finished for PR 19321 at commit
@@ -85,7 +85,8 @@ case class ApproximatePercentile(
  private lazy val accuracy: Int = accuracyExpression.eval().asInstanceOf[Int]

  override def inputTypes: Seq[AbstractDataType] = {
-   Seq(DoubleType, TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
+   Seq(TypeCollection(NumericType, DateType, TimestampType),
+     TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
This will cause the result difference. We need to document it.
Are we losing precision in the result with this change? All the tests use values ending in .0, so I'm not sure.
@felixcheung For percentiles, I think the type of the results should be the same as the input data type. In these tests the data type is int, so no precision is actually lost.
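On the precision question above: a Double represents every integer up to 2^53 exactly, so Int-backed dates and microsecond-backed timestamps survive the trip through the double-based sketch. A quick stand-alone check of that round trip (a sketch for this discussion, not Spark code):

```scala
// Long values up to 2^53 round-trip through Double exactly; this covers
// all Int date values and microsecond timestamps far beyond the year 2200.
def roundTrips(v: Long): Boolean = v.toDouble.toLong == v

println(roundTrips(Int.MaxValue.toLong)) // true: any DateType value
println(roundTrips(7258118400000000L))   // true: ~year 2200 as microseconds
println(roundTrips((1L << 53) + 1))      // false: past the exact Double range
```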
Test build #82108 has finished for PR 19321 at commit
That's a good point, thanks
case LongType => doubleResult.map(_.toLong)
case FloatType => doubleResult.map(_.toFloat)
case DoubleType => doubleResult
case _: DecimalType => doubleResult.map(Decimal(_))
Add

    case other: DataType =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
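The cases above, plus the suggested fallback, can be sketched outside Spark as a plain function. This is a simplified stand-in for illustration only: string tags replace Spark's `DataType` objects, and the `Decimal` case is omitted.

```scala
// Simplified sketch of mapping the sketch's Double results back to the
// input column's type. String tags stand in for Spark's DataType objects.
def generateOutput(doubleResult: Seq[Double], inputType: String): Seq[Any] =
  inputType match {
    case "date"      => doubleResult.map(_.toInt)   // days since epoch
    case "timestamp" => doubleResult.map(_.toLong)  // microseconds since epoch
    case "int"       => doubleResult.map(_.toInt)
    case "long"      => doubleResult.map(_.toLong)
    case "float"     => doubleResult.map(_.toFloat)
    case "double"    => doubleResult
    case other =>
      // Explicit fallback instead of a silent MatchError on unexpected types.
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }

println(generateOutput(Seq(3.0, 7.0), "int")) // List(3, 7)
```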
val doubleValue = child.dataType match {
  case DateType => value.asInstanceOf[Int].toDouble
  case TimestampType => value.asInstanceOf[Long].toDouble
  case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
The same here.
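The forward direction relies on Spark's internal representations: dates are stored as Int days since the epoch and timestamps as Long microseconds, so both widen to Double losslessly. A minimal stand-alone sketch of that conversion (not Spark's actual code):

```scala
// Simplified sketch: DateType values are internally Int days since the
// epoch and TimestampType values are Long microseconds, so both widen to
// Double before being fed into the percentile sketch.
def toDoubleValue(value: Any): Double = value match {
  case days: Int    => days.toDouble   // DateType internal form
  case micros: Long => micros.toDouble // TimestampType internal form
  case d: Double    => d
  case other =>
    throw new UnsupportedOperationException(s"Unexpected value $other")
}

println(toDoubleValue(17000)) // 17000.0
```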
Could you document the change in the output type in the migration guide? https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide
docs/sql-programming-guide.md
@@ -1553,6 +1553,7 @@ options.
## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
- The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This is not right? Before this PR, we already support numeric types. We automatically cast it to Double, right?
Right, my description is not accurate, I'll correct it.
Please update the PR title and description.
Test build #82140 has finished for PR 19321 at commit
Test build #82149 has finished for PR 19321 at commit
LGTM
Thanks! Merged to master. |
What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric type input and output double type results. But since all numeric types, as well as date and timestamp types, are represented as numerics internally, `percentile_approx` can support them easily. After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This change is also required when we generate equi-height histograms for these types.
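The output-type contract can be illustrated with a toy exact percentile over a small Seq. This is not Spark's approximate algorithm; the function name and index formula here are made up for the example, and only the type behavior mirrors the PR.

```scala
// Toy exact percentile: the result has the input element type (Int),
// mirroring the PR's change where percentile_approx now returns the
// input type instead of always returning Double.
def percentileExact(sorted: Seq[Int], p: Double): Int = {
  require(sorted.nonEmpty && p >= 0.0 && p <= 1.0)
  // Nearest-rank style index: ceil(p * n), clamped to at least the first rank.
  val idx = math.ceil(p * sorted.length).toInt.max(1) - 1
  sorted(idx)
}

println(percentileExact(Seq(1, 2, 3, 4, 5), 0.5)) // 3, an Int rather than 3.0
```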
How was this patch tested?
Added a new test and modified some existing tests.