[SPARK-22100] [SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type #19321
Conversation
@@ -270,7 +270,6 @@ class ApproximatePercentileSuite extends SparkFunSuite {
      percentageExpression = percentageExpression,
      accuracyExpression = Literal(100))

    val result = wrongPercentage.checkInputDataTypes()
This is duplicated by line 274.
Test build #82073 has finished for PR 19321 at commit
Test build #82099 has finished for PR 19321 at commit
Test build #82101 has finished for PR 19321 at commit
@@ -85,7 +85,8 @@ case class ApproximatePercentile(
  private lazy val accuracy: Int = accuracyExpression.eval().asInstanceOf[Int]

  override def inputTypes: Seq[AbstractDataType] = {
-   Seq(DoubleType, TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
+   Seq(TypeCollection(NumericType, DateType, TimestampType),
+     TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
This will cause the result difference. We need to document it.
Are we losing precision in the result with this change? All the tests use values ending in .0, so I'm not sure.
@felixcheung For percentiles, I think the type of the results should be the same as the input data type. In these tests the data type is int, so no precision is actually lost.
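On the precision question above: a Double represents every integer up to 2^53 exactly, so Int-backed dates and microsecond-backed timestamps survive the trip through the double-based sketch. A quick stand-alone check of that round trip (a sketch for this discussion, not Spark code):

```scala
// Long values up to 2^53 round-trip through Double exactly; this covers
// all Int date values and microsecond timestamps far beyond the year 2200.
def roundTrips(v: Long): Boolean = v.toDouble.toLong == v

println(roundTrips(Int.MaxValue.toLong)) // true: any DateType value
println(roundTrips(7258118400000000L))   // true: ~year 2200 as microseconds
println(roundTrips((1L << 53) + 1))      // false: past the exact Double range
```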
Test build #82108 has finished for PR 19321 at commit
That's a good point, thanks
case LongType => doubleResult.map(_.toLong)
case FloatType => doubleResult.map(_.toFloat)
case DoubleType => doubleResult
case _: DecimalType => doubleResult.map(Decimal(_))
Add

    case other: DataType =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
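The cases above, plus the suggested fallback, can be sketched outside Spark as a plain function. This is a simplified stand-in for illustration only: string tags replace Spark's `DataType` objects, and the `Decimal` case is omitted.

```scala
// Simplified sketch of mapping the sketch's Double results back to the
// input column's type. String tags stand in for Spark's DataType objects.
def generateOutput(doubleResult: Seq[Double], inputType: String): Seq[Any] =
  inputType match {
    case "date"      => doubleResult.map(_.toInt)   // days since epoch
    case "timestamp" => doubleResult.map(_.toLong)  // microseconds since epoch
    case "int"       => doubleResult.map(_.toInt)
    case "long"      => doubleResult.map(_.toLong)
    case "float"     => doubleResult.map(_.toFloat)
    case "double"    => doubleResult
    case other =>
      // Explicit fallback instead of a silent MatchError on unexpected types.
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }

println(generateOutput(Seq(3.0, 7.0), "int")) // List(3, 7)
```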
val doubleValue = child.dataType match {
  case DateType => value.asInstanceOf[Int].toDouble
  case TimestampType => value.asInstanceOf[Long].toDouble
  case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
The same here.
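The forward direction relies on Spark's internal representations: dates are stored as Int days since the epoch and timestamps as Long microseconds, so both widen to Double losslessly. A minimal stand-alone sketch of that conversion (not Spark's actual code):

```scala
// Simplified sketch: DateType values are internally Int days since the
// epoch and TimestampType values are Long microseconds, so both widen to
// Double before being fed into the percentile sketch.
def toDoubleValue(value: Any): Double = value match {
  case days: Int    => days.toDouble   // DateType internal form
  case micros: Long => micros.toDouble // TimestampType internal form
  case d: Double    => d
  case other =>
    throw new UnsupportedOperationException(s"Unexpected value $other")
}

println(toDoubleValue(17000)) // 17000.0
```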
Could you document the change in the output type in the migration guide? https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide
docs/sql-programming-guide.md
@@ -1553,6 +1553,7 @@ options.
## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
- The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This is not right? Before this PR, we already support numeric types. We automatically cast it to Double, right?
Right, my description is not accurate, I'll correct it.
Please update the PR title and description.
Test build #82140 has finished for PR 19321 at commit
Test build #82149 has finished for PR 19321 at commit
LGTM
Thanks! Merged to master. |
What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric type input and output double type results. But since all numeric types, as well as date and timestamp types, are represented as numerics internally, `percentile_approx` can support them easily. After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This change is also required when we generate equi-height histograms for these types.
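The output-type contract can be illustrated with a toy exact percentile over a small Seq. This is not Spark's approximate algorithm; the function name and index formula here are made up for the example, and only the type behavior mirrors the PR.

```scala
// Toy exact percentile: the result has the input element type (Int),
// mirroring the PR's change where percentile_approx now returns the
// input type instead of always returning Double.
def percentileExact(sorted: Seq[Int], p: Double): Int = {
  require(sorted.nonEmpty && p >= 0.0 && p <= 1.0)
  // Nearest-rank style index: ceil(p * n), clamped to at least the first rank.
  val idx = math.ceil(p * sorted.length).toInt.max(1) - 1
  sorted(idx)
}

println(percentileExact(Seq(1, 2, 3, 4, 5), 0.5)) // 3, an Int rather than 3.0
```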
How was this patch tested?
Added a new test and modified some existing tests.