[SPARK-17306] [SQL] QuantileSummaries doesn't compress #14976

srowen · 2016-09-06T13:55:04Z

What changes were proposed in this pull request?

Actually call compress() in insert() when the threshold is exceeded
Also, avoid using ArrayBuffer where all data is prepended, because each prepend is O(n)

How was this patch tested?

Existing tests.

srowen · 2016-09-06T13:55:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala

-      this.withHeadBufferInserted
+      val result = this.withHeadBufferInserted
+      if (result.sampled.length >= compressThreshold) {
+        result.compress()


CC @thunterdb -- is this the right fix?
I also 'adjusted' the calls to .append(), which is actually a varargs method; += appends a single element.
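To illustrate the two buffer points raised above, here is a minimal sketch. It is not the code from this patch: `BufferSketch`, `appendAll` and `prependAll` are hypothetical names, and swapping in `ListBuffer` for the prepend-heavy case is just one possible choice, assumed here for illustration.

```scala
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

object BufferSketch {
  // Append single elements with +=, which is amortized O(1) on ArrayBuffer
  // and avoids going through a varargs overload for one element.
  def appendAll(xs: Seq[Double]): ArrayBuffer[Double] = {
    val buf = ArrayBuffer.empty[Double]
    xs.foreach(x => buf += x)
    buf
  }

  // Prepending to an ArrayBuffer shifts every existing element, so each
  // call is O(n). ListBuffer prepends in O(1); this is an assumed
  // alternative, not necessarily what the patch uses.
  def prependAll(xs: Seq[Double]): ListBuffer[Double] = {
    val buf = ListBuffer.empty[Double]
    xs.foreach(x => x +=: buf)
    buf
  }
}
```

Another option, when the final order does not need to be available during construction, is to append with `+=` and reverse once at the end.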

@srowen
I think the compression decision needs to be related to the relative error setting. (The smaller the relative error is, the less frequently we compress.)

When implementing aggregation function percentile_approx, I have implemented compression like this:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala#L214

By the way, would you mind changing ApproximatePercentile as well? The compression in ApproximatePercentile will not be necessary once compression is done in QuantileSummaries.

Yes, this is the fix, thanks for doing it. I had never realized that .append takes a vararg input, thanks for the hint.

@clockfly Oh, hm, did this code land only recently? It seems to call compress() itself rather than leave it to QuantileSummaries, in which case I'm not clear why there's a compressThreshold in QuantileSummaries. It seems like the new class is trying to manage it. What's the right way to rationalize this? Are you saying QuantileSummaries shouldn't manage compression at all? That's fine too (in which case this can just turn into a very small optimization change).

@clockfly @srowen The compression threshold is just here to amortize the cost of performing compression. If you wanted to, you could run compression every iteration (it is an idempotent operation). Internally, the compress method uses a merging threshold that does depend on the number of elements seen, but it operates on a number of samples that is bounded by O(1/ε).
This patch will work. @clockfly I suspect some of the wrappers in ApproximatePercentile will not be required either, once I submit a PR that fixes an off-by-one error.
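The amortization argument can be sketched with a toy model. This is only an illustration under loud assumptions: `ToySummary` is hypothetical, and its `compress()` (sort and de-duplicate) is a stand-in chosen because it is trivially idempotent, not the merging logic QuantileSummaries actually uses. The point it shows is that the threshold changes how often compression runs, not the final result.

```scala
object AmortizedCompressSketch {
  // Toy stand-in for a quantile summary. NOT the real QuantileSummaries:
  // compress() here just sorts and de-duplicates the samples.
  final case class ToySummary(
      compressThreshold: Int,
      sampled: Vector[Double] = Vector.empty,
      compressCalls: Int = 0) {

    // Idempotent by construction: applying it twice yields the same samples.
    def compress(): ToySummary =
      copy(sampled = sampled.distinct.sorted, compressCalls = compressCalls + 1)

    // Append the value, then compress only once the threshold is reached.
    def insert(x: Double): ToySummary = {
      val result = copy(sampled = sampled :+ x)
      if (result.sampled.length >= compressThreshold) result.compress() else result
    }
  }

  // Insert all values, then compress once more before reading the result.
  def run(threshold: Int, data: Seq[Double]): ToySummary =
    data.foldLeft(ToySummary(threshold))((s, x) => s.insert(x)).compress()
}
```

With a threshold of 1 the toy compresses on every insert; with a larger threshold it compresses far less often, yet both runs end with the same samples.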

@srowen @thunterdb

I think the compression still needs to be done in QuantileSummaries. I added a compression implementation in the wrapper class ApproximatePercentile because ApproximatePercentile needs to know whether the QuantileSummaries is compressed; otherwise ApproximatePercentile doesn't know whether it is OK to call def query(quantile: Double) of QuantileSummaries.

Maybe QuantileSummaries should expose an API like "isCompressed"? So that the caller can skip calling compress if QuantileSummaries is already compressed.
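A rough sketch of what such a flag could look like follows. Everything here is hypothetical: `TrackedSummary`, `safeQuery` and the sorted-vector lookup are stand-ins for illustration, and `isCompressed` is the API proposed in this comment, not a method that QuantileSummaries is known to expose.

```scala
object IsCompressedSketch {
  // Hypothetical summary that tracks whether it is currently compressed.
  // The "compression" is just a sort, standing in for the real algorithm.
  final class TrackedSummary private (
      val sampled: Vector[Double],
      val isCompressed: Boolean) {

    // Any insert invalidates the compressed state.
    def insert(x: Double): TrackedSummary =
      new TrackedSummary(sampled :+ x, isCompressed = false)

    // Skips the work entirely when the summary is already compressed.
    def compress(): TrackedSummary =
      if (isCompressed) this else new TrackedSummary(sampled.sorted, isCompressed = true)

    // Querying is only valid on a compressed summary in this sketch.
    def query(quantile: Double): Option[Double] = {
      require(quantile >= 0.0 && quantile <= 1.0, "quantile must be in [0, 1]")
      require(isCompressed, "call compress() before query()")
      if (sampled.isEmpty) None
      else Some(sampled(math.min(sampled.length - 1, (quantile * sampled.length).toInt)))
    }
  }

  object TrackedSummary {
    val empty: TrackedSummary = new TrackedSummary(Vector.empty, isCompressed = true)
  }

  // A caller such as a wrapper class can check the flag instead of
  // compressing unconditionally.
  def safeQuery(s: TrackedSummary, quantile: Double): Option[Double] =
    (if (s.isCompressed) s else s.compress()).query(quantile)
}
```

The design choice is that the summary owns the compressed state, so a wrapper does not need to duplicate the compression bookkeeping.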

SparkQA · 2016-09-06T16:05:22Z

Test build #64997 has finished for PR 14976 at commit 7226aa0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

thunterdb · 2016-09-06T23:13:13Z

@srowen LGTM, thanks!

thunterdb · 2016-09-06T23:36:54Z

By the way, you gave some great advice here. Is there a page on the wiki where we collect all this internal knowledge?


SparkQA · 2016-09-07T11:26:40Z

Test build #65032 has finished for PR 14976 at commit 75cb088.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-09T16:36:05Z

Superseded by #15002

…es and adding more tests ## What changes were proposed in this pull request? This PR builds on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. ## How was this patch tested? This PR adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes #15002 from thunterdb/ml-1783.

…es and adding more tests This PR builds on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. This PR adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes #15002 from thunterdb/ml-1783. (cherry picked from commit 180796e) Signed-off-by: Sean Owen <sowen@cloudera.com>


srowen reviewed Sep 6, 2016

Actually call compress() in QuantileSummaries, and avoid expensive ArrayBuffer.prepend (75cb088)

srowen force-pushed the SPARK-17306 branch from 7226aa0 to 75cb088 Compare September 7, 2016 09:14

thunterdb mentioned this pull request Sep 7, 2016

[SQL][SPARK-17439] Fixing compression issues with approximate quantiles and adding more tests #15002

Closed

srowen closed this Sep 9, 2016

srowen deleted the SPARK-17306 branch September 9, 2016 16:36

