Skip to content

Commit

Permalink
[SPARK-32908][SQL] Fix target error calculation in percentile_approx()
Browse files Browse the repository at this point in the history
### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error `e = max(gi, deltai)/2` (see the page 59). Also this has clear explanation [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.

### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.

Closes #29784 from MaxGekk/fix-percentile_approx-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 75dd864)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
  • Loading branch information
MaxGekk authored and HyukjinKwon committed Sep 18, 2020
1 parent ecc2f5d commit 5581a92
Show file tree
Hide file tree
Showing 3 changed files with 21 additions and 2 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -254,7 +254,7 @@ class QuantileSummaries(

// Target rank
val rank = math.ceil(quantile * count).toLong
val targetError = relativeError * count
val targetError = sampled.map(s => s.delta + s.g).max / 2
// Minimum rank at current sample
var minRank = 0L
var i = 0
Expand Down
Binary file not shown.
Original file line number Diff line number Diff line change
Expand Up @@ -150,7 +150,7 @@ class ApproximatePercentileQuerySuite extends QueryTest with SharedSparkSession
(1 to 1000).toDF("col").createOrReplaceTempView(table)
checkAnswer(
spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 800) FROM $table"),
Row(Seq(499))
Row(Seq(500))
)
}
}
Expand Down Expand Up @@ -296,4 +296,23 @@ class ApproximatePercentileQuerySuite extends QueryTest with SharedSparkSession
buffer.quantileSummaries
assert(buffer.isCompressed)
}

test("SPARK-32908: maximum target error in percentile_approx") {
withTempView(table) {
spark.read
.schema("col int")
.csv(testFile("test-data/percentile_approx-input.csv.bz2"))
.repartition(1)
.createOrReplaceTempView(table)
checkAnswer(
spark.sql(
s"""SELECT
| percentile_approx(col, 0.77, 1000),
| percentile_approx(col, 0.77, 10000),
| percentile_approx(col, 0.77, 100000),
| percentile_approx(col, 0.77, 1000000)
|FROM $table""".stripMargin),
Row(18, 17, 17, 17))
}
}
}

0 comments on commit 5581a92

Please sign in to comment.