
[SPARK-22100] [SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type #19321

Status: Closed (wants to merge 6 commits)

Conversation

@wzhfy (Contributor) commented Sep 22, 2017

What changes were proposed in this pull request?

The percentile_approx function previously accepted numeric type input and produced double type results.

Since date and timestamp values are represented as numerics internally, percentile_approx can support them with little extra work.

After this PR, it accepts numeric, date, and timestamp types as input. The result type is also changed to match the input type, which is more reasonable for percentiles.

This change is also required when we generate equi-height histograms for these types.
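The internal-numeric idea above can be sketched as follows (a minimal Python illustration, not Spark code; the function names are hypothetical): a date is stored internally as days since the Unix epoch, the percentile is computed on those internal numbers, and the result is converted back to the input type.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def date_to_internal(d):
    # Spark stores DateType internally as days since the Unix epoch (an int).
    return (d - EPOCH).days

def internal_to_date(days):
    return EPOCH + timedelta(days=int(days))

def approx_percentile_date(dates, p):
    # Percentile is computed on the internal numerics; a naive exact
    # percentile stands in here for Spark's approximate sketch.
    internal = sorted(date_to_internal(d) for d in dates)
    idx = min(int(p * len(internal)), len(internal) - 1)
    # The result is converted back to the input type (here: a date).
    return internal_to_date(internal[idx])

dates = [date(2017, 9, d) for d in range(1, 11)]
print(approx_percentile_date(dates, 0.5))  # a date, not a double
```

The same round trip applies to timestamps (stored as microseconds since the epoch) and to the numeric types.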

How was this patch tested?

Added a new test and modified some existing tests.

@@ -270,7 +270,6 @@ class ApproximatePercentileSuite extends SparkFunSuite {
       percentageExpression = percentageExpression,
       accuracyExpression = Literal(100))

-    val result = wrongPercentage.checkInputDataTypes()
@wzhfy (Contributor, Author) commented Sep 22, 2017

This is duplicated by line 274.

@wzhfy commented Sep 22, 2017

cc @cloud-fan @gatorsmile

@SparkQA commented Sep 22, 2017

Test build #82073 has finished for PR 19321 at commit 958715b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 23, 2017

Test build #82099 has finished for PR 19321 at commit db2c110.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 23, 2017

Test build #82101 has finished for PR 19321 at commit 45e655f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -85,7 +85,8 @@ case class ApproximatePercentile(
   private lazy val accuracy: Int = accuracyExpression.eval().asInstanceOf[Int]

   override def inputTypes: Seq[AbstractDataType] = {
-    Seq(DoubleType, TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
+    Seq(TypeCollection(NumericType, DateType, TimestampType),
+      TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
A Member commented:

This will cause a result difference. We need to document it.

@felixcheung (Member) left a comment

Are we losing precision in the result with this change? All the tests use values ending in .0, so I'm not sure.

@wzhfy commented Sep 23, 2017

@felixcheung For percentiles, I think the result type should be the same as the input data type. In these tests, the data type is int, so 30 is actually more accurate than 30.0. The previous answer was 30.0 because ApproximatePercentile only accepted double input and produced double results.
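The point above can be illustrated with a small, hypothetical sketch (plain Python, not Spark code): with int input, converting the double result back to the input type returns 30 rather than an artificial 30.0.

```python
# Hypothetical illustration of the output-type change for int input.
values = [10, 20, 30, 40, 50]

# The percentile is still computed on doubles internally.
double_result = sorted(float(v) for v in values)[len(values) // 2]

# Previous behavior: the double is returned as-is (30.0).
# New behavior: the result is converted back to the input type (30).
int_result = int(double_result)

print(double_result)  # 30.0
print(int_result)     # 30
```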

@SparkQA commented Sep 23, 2017

Test build #82108 has finished for PR 19321 at commit 0d34053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) commented Sep 23, 2017 via email

case LongType => doubleResult.map(_.toLong)
case FloatType => doubleResult.map(_.toFloat)
case DoubleType => doubleResult
case _: DecimalType => doubleResult.map(Decimal(_))
A Member commented:

Add

        case other: DataType =>
          throw new UnsupportedOperationException(s"Unexpected data type $other")

val doubleValue = child.dataType match {
case DateType => value.asInstanceOf[Int].toDouble
case TimestampType => value.asInstanceOf[Long].toDouble
case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
A Member commented:

The same here.
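The two conversions under review can be sketched in plain Python (an illustrative stand-in for the Scala code, with hypothetical names, not Spark's API): the input value is mapped to a double for the percentile sketch, the double result is mapped back to the input type, and, as the reviewer suggests, an unexpected type raises rather than failing silently.

```python
def to_double(data_type, value):
    """Map an internal value to a double for the percentile sketch."""
    if data_type == "date":        # stored as days since the epoch (int)
        return float(value)
    if data_type == "timestamp":   # stored as microseconds (long)
        return float(value)
    if data_type in ("int", "long", "float", "double", "decimal"):
        return float(value)
    # The reviewer's suggested guard against unexpected types.
    raise TypeError(f"Unexpected data type {data_type}")

def from_double(data_type, result):
    """Map a double percentile result back to the input type."""
    if data_type in ("date", "int", "timestamp", "long"):
        return int(result)
    if data_type in ("float", "double"):
        return float(result)
    raise TypeError(f"Unexpected data type {data_type}")
```

Handling every expected type explicitly and raising on the rest keeps a silently wrong result from slipping through if a new type is added later.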

@gatorsmile (Member) commented:

Could you document the change in the output type of percentile_approx in the following section?

https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide

@@ -1553,6 +1553,7 @@ options.
 ## Upgrading From Spark SQL 2.2 to 2.3

 - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
+- The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
A Member commented:

This is not right? Before this PR, we already supported numeric types; we automatically cast them to Double, right?

@wzhfy (Contributor, Author) replied:

Right, my description is not accurate, I'll correct it.

@gatorsmile commented:

Please update the PR title and description.

@SparkQA commented Sep 25, 2017

Test build #82140 has finished for PR 19321 at commit 1d26f50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy wzhfy changed the title [SPARK-22100] [SQL] Make percentile_approx support numeric/date/timestamp types [SPARK-22100] [SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type Sep 25, 2017
@SparkQA commented Sep 25, 2017

Test build #82149 has finished for PR 19321 at commit d59fe37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile commented:

LGTM

@asfgit asfgit closed this in 365a29b Sep 25, 2017
@gatorsmile commented:

Thanks! Merged to master.
