Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32908][SQL][2.4] Fix target error calculation in percentile_approx() #29786

Closed
wants to merge 5 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Sep 17, 2020

What changes were proposed in this pull request?

  1. Change the target error calculation according to the paper Space-Efficient Online Computation of Quantile Summaries. It says that the error e = max(gi, deltai)/2 (see the page 59). Also this has clear explanation ε-approximate quantiles.
  2. Added a test to check different accuracies.
  3. Added an input CSV file percentile_approx-input.csv.bz2 to the resource folder sql/catalyst/src/main/resources for the test.

Why are the changes needed?

To fix incorrect percentile calculation, see an example in SPARK-32908.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

  • By running existing tests in QuantileSummariesSuite and in ApproximatePercentileQuerySuite.
  • Added new test SPARK-32908: maximum target error in percentile_approx to ApproximatePercentileQuerySuite.

@SparkQA
Copy link

SparkQA commented Sep 17, 2020

Test build #128827 has finished for PR 29786 at commit 06b44a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Sep 18, 2020
…pprox()`

### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error `e = max(gi, deltai)/2` (see the page 59). Also this has clear explanation [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.

### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.

Closes #29786 from MaxGekk/fix-percentile_approx-2.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@MaxGekk MaxGekk deleted the fix-percentile_approx-2.4 branch December 11, 2020 20:28
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…pprox()`

1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error `e = max(gi, deltai)/2` (see the page 59). Also this has clear explanation [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.

To fix incorrect percentile calculation, see an example in SPARK-32908.

Yes

- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.

Closes apache#29786 from MaxGekk/fix-percentile_approx-2.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Ref: LIHADOOP-55803
(cherry picked from eed7c62)

RB=2293402
BUG=LIHADOOP-55803
G=spark-reviewers
R=rhu,vsowrira,mmuralid,mshen
A=vsowrira
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants