New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17306] [SQL] QuantileSummaries doesn't compress #14976
Conversation
this.withHeadBufferInserted | ||
val result = this.withHeadBufferInserted | ||
if (result.sampled.length >= compressThreshold) { | ||
result.compress() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @thunterdb -- is this the right fix?
I also 'adjusted' calls to .append() which is actually a varargs method; += appends an element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen
I think the compression decision need to be related to relative error setting. (The smaller the relative error is, the less frequent we do compression)
When implementing aggregation function percentile_approx, I have implemented compression like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, would you mind change ApproximatePercentile
altogether, the compression in ApproximatePercentile will not be necessary if we have done compression at QuantileSummaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the fix, thanks for doing it. I had never realized that .append
takes a vararg input, thanks for the hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clockfly oh, hm, this code just landed recently? it seems to call compress()
itself rather than leave it to QuantileSummaries
, in which case I'm not clear why there's a compressThreshold
in QuantileSummaries
It seems like the new class is trying to manage it. What's the right way to rationalize this -- are you saying QuantileSummaries
shouldn't manage compression at all? that's fine too (in which case this can just turn into a very small optimization change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clockfly @srowen the compression threshold is just here to amortize the cost of performing compression. If you wanted to, you could run compression every iteration (it is an idempotent operation). Internally, the compress
method uses a merging threshold that indeeds depends on the number of elements seen, but it operates on a number of samples that is bounded by O(1/\epsi).
This patch will work. @clockfly I suspect some of the wrappers done in the Approximate percentile are not required either, once I submit a PR that fixes an off-by-1 error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the compression still need to be done in QuantileSummaries. I added some compression implementation in wrapper class ApproximatePercentile because ApproximatePercentile need to know whether the QuantileSummaries is compressed or not, otherwise ApproximatePercentile don't know whether it is OK to call def query(quantile: Double)
of QuantileSummaries.
Maybe QuantileSummaries should expose an API like "isCompressed"? So that the caller can skip calling compress if QuantileSummaries is already compressed.
Test build #64997 has finished for PR 14976 at commit
|
@srowen LGTM, thanks! |
By the way, you gave some great advice here. Is there a page on the wiki where we collect all this internal knowledge? |
…rayBuffer.prepend
Test build #65032 has finished for PR 14976 at commit
|
Superseded by #15002 |
…es and adding more tests ## What changes were proposed in this pull request? This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. ## How was this patch tested? This PR adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes #15002 from thunterdb/ml-1783.
…es and adding more tests This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. This PR adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes #15002 from thunterdb/ml-1783. (cherry picked from commit 180796e) Signed-off-by: Sean Owen <sowen@cloudera.com>
…es and adding more tests ## What changes were proposed in this pull request? This PR build on apache#14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. ## How was this patch tested? This PR adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes apache#15002 from thunterdb/ml-1783.
What changes were proposed in this pull request?
compress()
ininsert()
when the threshold is exceededArrayBuffer
where all data is prepended, because this is O(n)How was this patch tested?
Existing tests.