Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17306] [SQL] QuantileSummaries doesn't compress #14976

Closed
wants to merge 1 commit into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Sep 6, 2016

What changes were proposed in this pull request?

  • Actually call compress() in insert() when the threshold is exceeded
  • Also, avoid using ArrayBuffer where all data is prepended, because this is O(n)

How was this patch tested?

Existing tests.

this.withHeadBufferInserted
val result = this.withHeadBufferInserted
if (result.sampled.length >= compressThreshold) {
result.compress()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @thunterdb -- is this the right fix?
I also 'adjusted' calls to .append() which is actually a varargs method; += appends an element

Copy link
Contributor

@clockfly clockfly Sep 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen
I think the compression decision need to be related to relative error setting. (The smaller the relative error is, the less frequent we do compression)

When implementing aggregation function percentile_approx, I have implemented compression like this:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala#L214

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, would you mind change ApproximatePercentile altogether, the compression in ApproximatePercentile will not be necessary if we have done compression at QuantileSummaries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the fix, thanks for doing it. I had never realized that .append takes a vararg input, thanks for the hint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clockfly oh, hm, this code just landed recently? it seems to call compress() itself rather than leave it to QuantileSummaries, in which case I'm not clear why there's a compressThreshold in QuantileSummaries It seems like the new class is trying to manage it. What's the right way to rationalize this -- are you saying QuantileSummaries shouldn't manage compression at all? that's fine too (in which case this can just turn into a very small optimization change).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clockfly @srowen the compression threshold is just here to amortize the cost of performing compression. If you wanted to, you could run compression every iteration (it is an idempotent operation). Internally, the compress method uses a merging threshold that indeeds depends on the number of elements seen, but it operates on a number of samples that is bounded by O(1/\epsi).
This patch will work. @clockfly I suspect some of the wrappers done in the Approximate percentile are not required either, once I submit a PR that fixes an off-by-1 error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen @thunterdb

I think the compression still need to be done in QuantileSummaries. I added some compression implementation in wrapper class ApproximatePercentile because ApproximatePercentile need to know whether the QuantileSummaries is compressed or not, otherwise ApproximatePercentile don't know whether it is OK to call def query(quantile: Double) of QuantileSummaries.

Maybe QuantileSummaries should expose an API like "isCompressed"? So that the caller can skip calling compress if QuantileSummaries is already compressed.

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64997 has finished for PR 14976 at commit 7226aa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@thunterdb
Copy link
Contributor

@srowen LGTM, thanks!

@thunterdb
Copy link
Contributor

By the way, you gave some great advice here. Is there a page on the wiki where we collect all this internal knowledge?

@SparkQA
Copy link

SparkQA commented Sep 7, 2016

Test build #65032 has finished for PR 14976 at commit 75cb088.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Sep 9, 2016

Superseded by #15002

@srowen srowen closed this Sep 9, 2016
@srowen srowen deleted the SPARK-17306 branch September 9, 2016 16:36
asfgit pushed a commit that referenced this pull request Sep 11, 2016
…es and adding more tests

## What changes were proposed in this pull request?

This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors.

## How was this patch tested?

This PR adds 8 unit tests that were failing without the fix.

Author: Timothy Hunter <timhunter@databricks.com>
Author: Sean Owen <sowen@cloudera.com>

Closes #15002 from thunterdb/ml-1783.
asfgit pushed a commit that referenced this pull request Sep 11, 2016
…es and adding more tests

This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors.

This PR adds 8 unit tests that were failing without the fix.

Author: Timothy Hunter <timhunter@databricks.com>
Author: Sean Owen <sowen@cloudera.com>

Closes #15002 from thunterdb/ml-1783.

(cherry picked from commit 180796e)
Signed-off-by: Sean Owen <sowen@cloudera.com>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…es and adding more tests

## What changes were proposed in this pull request?

This PR build on apache#14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors.

## How was this patch tested?

This PR adds 8 unit tests that were failing without the fix.

Author: Timothy Hunter <timhunter@databricks.com>
Author: Sean Owen <sowen@cloudera.com>

Closes apache#15002 from thunterdb/ml-1783.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants