Redo Quantiles #1225
Comments
The performance issue with the current implementation seems to come from the last task (the final combine step). We might want to try a tree reduction; a rough sketch is below.
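A minimal sketch of the kind of tree reduction meant here (illustrative only; `merge_two` stands in for whatever pairwise combine step the reduction would use):

```python
# Illustrative tree reduction: combine per-partition results pairwise,
# level by level, so the depth is O(log n) instead of O(n).
def tree_reduce(items, merge_two):
    items = list(items)
    while len(items) > 1:
        merged = []
        for i in range(0, len(items) - 1, 2):
            merged.append(merge_two(items[i], items[i + 1]))
        if len(items) % 2:          # carry the odd item up to the next level
            merged.append(items[-1])
        items = merged
    return items[0]

# Example with plain sums standing in for partition summaries:
print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```
|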
The current performance of the tdigest package:

```
In [1]: from tdigest import TDigest

In [2]: t = TDigest()

In [3]: %time t.batch_update(range(10000))
CPU times: user 6.47 s, sys: 4 ms, total: 6.48 s
Wall time: 6.53 s
```

My immediate need would probably be satisfied by accelerating the reduction step of the current percentile function. cc @eriknw |
It might also be a fun side project for someone to build a faster tdigest. |
Indeed, I did a little experiment with this a few months ago and noticed that tdigest seemed to be very slow (relative to loops in a compiled language). It seems like we need some sort of batch update to achieve reasonable performance. At that point, I decided that this was probably beyond my immediate capabilities. |
I've been wanting to do this. I think a cython implementation could be done fairly cheaply, and also be performant. Might try to put something together this week. |
Percentile is fairly close to the core functionality of dask.dataframe. I'm hesitant to add a cython project dependency to dask.dataframe if it can be avoided. Currently we only depend on Pandas and other pure-python projects. |
I'm skeptical that a performant version can be written without any cython code. From what I remember when I looked at this a while ago the needed methods couldn't be easily spelled with numpy vectorized methods. I'm also not against relying on another cython package - I don't think the code would be very complex, and conda and wheels make this cheap. Would be happy to be proven wrong though. |
My two cents is that percentile should never have been added to dask.dataframe.
|
So, the algorithm does represent itself as approximate. We've seen a failure case for a version of approximate that is something like "size of the bins into which the percentiles partition the original data", but this failure case doesn't yet disprove approximate under the definition of "will give you a value with a rank similar to that of the actual percentile."

```python
s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)

# also holds for all different chunk sizes that I tested other than 20
dd.from_pandas(s, 20).quantile(0.5).compute()  # 1.0
```

This is either great or terrible performance depending on how you define approximate. If you define it as the number of entries away, then yes, it's bad. However, if you define it as the number of y-values away, then we're just one element off :)

I suspect that we're actually talking about two different operations. A lot of internal dask.dataframe operations require a fast and robustly available approximate quantile implementation. We don't mind if it's wildly off (though in practice it performs exceedingly well), so we don't really care about it not being a thoroughly vetted algorithm. We might want to separate this from a user-facing approximate quantile that has good theoretical guarantees. I think that this is the operation that you all are talking about. I honestly don't care much about this algorithm (I think it'd be handy, but it isn't essential to correct dataframe operation). I don't mind this operation being cython-based as long as that dependency is optional.
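For comparison, the exact median of that series, which pandas computes directly:

```python
import pandas as pd

s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
s.quantile(0.5)   # 0.0 -- the exact median
# The approximate answer above (1.0) is many entries away by rank,
# but only one distinct value away.
```
|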
Yes, I completely agree. |
Ah, so what you're looking at here is a faster quantile-like thing, regardless of validity of quantiles. I agree then that something good enough could probably be found without adding a compiled dependency. Perhaps change the title of the issue to better reflect the need (as it's not really quantiles you're after)? |
I agree about separating user-facing quantiles and the approximate quantile function used internally. Regarding improving performance, cheap things to try are to use |
Current performance on the nyctaxi dataset on a cluster looks like a 20ms cost for each of the per-partition percentiles, followed by a lot of data transfer and a 20s cost for the combination. |
It should be relatively straightforward to convert |
I just implemented a faster version of this. Even if (or, rather, when) we add t-digest to dask, we will still want to be able to partition on non-interpolable and non-numeric data, so we should have a generic solution that is good enough. |
There is also the weighted approximate quantile algorithm of XGBoost: https://github.com/dmlc/xgboost/blob/master/src/common/quantile.h It is described in the appendix of the XGBoost paper: https://arxiv.org/abs/1603.02754 It's supposed to be very fast. I don't know how it compares to t-digest (from a speed and statistical accuracy point of view). Maybe @tqchen, @tdunning or @CamDavidsonPilon know better. |
I don't know yet, but I will know before long. One big difference is that recent versions of t-digest use entirely static memory allocation. |
Thanks for your input, Ted.
Is this change purely an implementation detail of the official Java implementation, or a fundamental change in the data structure described in your paper? |
I think the major advantage of what is described in XGBoost is that it works for weighted data naturally, with a theoretical guarantee regardless of the order of the data. The required memory size can be calculated beforehand for the requested accuracy. It can also be plugged into the streaming version of the algorithm described in http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf |
Another interesting thing is that the weighted version of quantiles can become more accurate (than what it is asked to be) when there are duplicates in the data, because the duplicates go into the same buckets and add up their weights. This makes it useful for many skewed real-world data cases.
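To make the weighted-quantile notion concrete, here is a tiny exact reference computation (an illustration only, not the XGBoost sketch itself): the weighted q-quantile is the smallest value whose cumulative weight reaches a fraction q of the total weight, so duplicate values simply pile up weight at one point.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Exact weighted q-quantile: smallest value whose cumulative
    weight reaches a fraction q of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    idx = np.searchsorted(cum, q * cum[-1])
    return v[min(idx, len(v) - 1)]

# Duplicates just add weight to the same point:
print(weighted_quantile([1, 2, 2, 3], [1, 1, 1, 1], 0.5))  # 2.0
print(weighted_quantile([1, 2, 3], [1, 2, 1], 0.5))        # 2.0 -- same result
```
|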
My understanding is that t-digest also has support by default for these. I have not yet read the papers (t-digest and the XGBoost appendix) fully, but I guess that the main difference might be that the approximation error of t-digest is proportional to q(1-q), i.e. smaller near the extreme quantiles. |
I get that. I took a quick look at the walkthrough of t-digest; it seems that in order for t-digest to work, it relies on an assumption of randomly shuffled data, while the quantile summary provably guarantees correctness (such that the approximation error falls within a specified epsilon bound) regardless of the order of the dataset. I do agree that the accuracy at the extreme ends is interesting and not yet supported by the quantile summary. |
Answers:
|
Thank you all for your inputs. |
@tdunning interesting. Just to try to understand it a bit more: is there any paper with a proof of correctness of the t-digest (the two variants you mentioned) with respect to non-random orderings, or are these heuristics that work well in practice? |
At this point there is no completed paper, nor is there a fully developed proof. On the other hand, the special case where all of the points are buffered is simpler to analyze. For a more comprehensive proof, it should suffice that merging t-digests preserves the relevant invariants. In addition, the algorithm in all forms performs very well in practice.
|
This is an old issue, but from what I see it still hasn't been addressed. I happen to need this functionality right now, so I'm interested in knowing the status of this issue and I'm also willing to work on it. So what is the current status? Could this be implemented using crick as mentioned here #3099? FYI, there's also a new paper released for t-digest lately: https://arxiv.org/abs/1902.04023 |
Thanks for your interest here @Dimplexion. It would be good to resolve this issue well.
The current status is what you see in this issue: there are a few different proposals. Someone probably needs to evaluate them and come up with a plan for what to implement that balances performance, maintainability (like adding more dependencies), and ease of implementation. Crick is one suggestion, but there are a few others above. We'll need to come to some conclusion. The first thing to do is probably to summarize the options above and their pros and cons, maybe take a look at the various implementations and see how feasible they are, and then propose a path forward. |
I can summarize t-digest.
In the upcoming release, you can opt for constant error instead of error optimized for the tails. The serialized size of a typical digest is about 200 bytes, and the memory residency requirements are the equivalent of a few hundred floats. Memory allocation is entirely static.
For use with decision trees, an interesting advantage is the ability to probe any quantile desired after the digest is constructed, rather than specifying which quantiles are interesting before construction. There is no sensitivity to the ordering of data, except that accuracy will be slightly better with completely ordered samples. Accuracy with unordered samples is nearly as good as if you had an idealized histogram with optimal boundaries.
Speed of insertion is less than 100 nanoseconds per sample on average. With multiple levels of merging and about a thousand cores working in parallel, you can digest a billion data points in less than a second. This makes use of the fact that merging 1000 digests takes only about a hundred and fifty milliseconds.
The t-digest library is very small (only a few thousand lines of code). There are no external dependencies. All code is Apache licensed.
|
Alright. This week is a little busy for me but I'll start going through the other options and post my summary here by the end of next week along with my suggestion for which path to take. Others can then verify the conclusions and we can go from there. |
I've been researching the different options, and it has proven to be a little more involved than I realized initially, so I'm posting my current findings here for others to comment.

Current implementation:

Options:

Python default tdigest package
Note: the current version seems to be much faster (around twice as fast) than when first tested in this thread.

crick (Cython-based tdigest implementation)

Upcoming t-digest version

XGBoost approximate quantile

Conclusion so far:
From the currently available options for tdigest, crick seems to be the way to go. It is still experimental, but I feel the performance improvement compared to the default tdigest package is well worth updating and maintaining it. Updating it to support the new, still-to-be-released changes to the tdigest algorithm is a good option to bring further benefits. As a side note though, I'm using the simple implementation based on crick from here #3099 and it is considerably slower than the current implementation.

Unfortunately it's not so easy to compare tdigest to the XGBoost approximate quantile algorithm due to the lack of a Python library implementation. From a theoretical point of view, both of them seem to be good enough to cover our use-cases. The way forward is that we must compare the two algorithms by either wrapping XGBoost in a Python module or testing with the C++ implementations in realistic scenarios. If the performances are similar, I would give the edge to tdigest due to the static memory usage and the fact that it seems to be more widely adopted and supported.

Let me know if I missed anything. Also, I tried to understand both algorithms properly and make as objective points as I could, but I'm not an expert on these algorithms, so please let me know if there are inaccuracies or you just have a different opinion. |
FYI, a related paper I saw recently https://arxiv.org/abs/1803.01969 |
As a note, I wrote crick specifically to be used by Dask, but due to priorities never got around to implementing a Dask wrapper. Crick implements t-digest's "Merging Digest" algorithm at the time of writing (static allocation, fast updates and merges), though it looks like there have been algorithm changes since I wrote it. Since I'm one of the core dask developers, I don't see any issue with using it within Dask. It's intended to be a general library of approximate algorithms for Python, so other Cython implementations we want can also go there.
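In case a concrete picture helps, here is roughly what the per-partition/merge/query pattern looks like with crick (a sketch from memory; the exact method names and signatures should be double-checked against crick itself):

```python
import numpy as np
from crick import TDigest

# One digest per partition (this part is embarrassingly parallel).
parts = [np.random.normal(size=10_000) for _ in range(8)]

digests = []
for part in parts:
    t = TDigest()
    t.update(part)          # batch insert of an array
    digests.append(t)

# Merge the per-partition digests into one.
combined = TDigest()
combined.merge(*digests)

print(combined.quantile(0.5))   # approximate median, close to 0
```
|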
Thanks, I'll check that out!
Sounds good to me. Whichever algorithm we choose, implementing it in crick makes a lot of sense. I'll check out the paper and will try to figure out how best to compare the performance of the different algorithms. I should have time to work more on this next week. |
After some more reading, I've come to the conclusion that for the public-facing quantile method, t-digest seems like the best option.

Comparing XGBoost's quantile algorithm to it, I can't really come up with other reasons to use it over t-digest except maybe speed. I believe t-digest will be fast enough to be used in the public-facing method.

I checked out the paper @tqchen linked and while the algorithm (M-sketch) has some interesting properties, I don't think it's a good fit for this use-case. I'll add my summary of it to the end of this post.

It seems that quantiles are used for partitioning internally. For this case the error guarantees aren't very important, especially near the ends, which is t-digest's strength. It would be interesting to benchmark XGBoost's quantile approximation against t-digest and maybe M-sketch. Doing that will require writing implementations of them that are accessible from Python first, so it will be a little laborious. Are there other use-cases in addition to partitioning? Is the current implementation still slow after the latest updates?

Here's the summary of M-sketch:

Moment-Based Quantile Sketches
Based on this paper: https://arxiv.org/abs/1803.01969

Pros
- Uses only 200 bytes
- Easy to parallelize
- Very fast merges compared to other algorithms with similar accuracy
  - Beats for example t-digest by a large margin
- Merge time begins to dominate the performance when the number of merged quantiles is around 10^4

Cons
- Doesn't work well for low cardinality datasets
  - With only 5 unique values the algorithm didn't converge at all
  - Due to this limitation, this algorithm would also fail the test here: #731

Note: the t-digest version compared against didn't include the latest improvements.

Summary:
This algorithm has some really nice properties for handling high cardinality data, but it's most likely not suitable for our use-cases, at least alone, due to how badly it handles low cardinality data. The merge performance is really interesting as an optimization for huge amounts of high cardinality data.
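As a rough illustration of why the merges are so cheap (a toy version of the data structure, not the paper's code): a moment sketch is essentially a handful of running sums, so merging two sketches is element-wise addition.

```python
import numpy as np

class MomentSketch:
    """Toy moment sketch: min, max, count and the first k power sums.
    Quantile estimation (the hard, maximum-entropy part) is omitted."""
    def __init__(self, k=10):
        self.k = k
        self.min = np.inf
        self.max = -np.inf
        self.count = 0
        self.power_sums = np.zeros(k)   # sum(x**1), ..., sum(x**k)

    def update(self, xs):
        xs = np.asarray(xs, dtype=float)
        self.min = min(self.min, xs.min())
        self.max = max(self.max, xs.max())
        self.count += len(xs)
        for i in range(self.k):
            self.power_sums[i] += np.sum(xs ** (i + 1))

    def merge(self, other):
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        self.count += other.count
        self.power_sums += other.power_sums   # merging is just addition

a, b = MomentSketch(), MomentSketch()
a.update(np.random.uniform(0, 1, 10_000))
b.update(np.random.uniform(0, 1, 10_000))
a.merge(b)          # O(k) work, independent of the data size
print(a.count, a.power_sums[0] / a.count)   # 20000, mean ~0.5
```
|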
I am happy to help with advice about updating the current implementation of t-digest, btw.

Regarding the moment sketches, here are a few observations:

> Uses only 200 bytes

With the current updates, 200-byte t-digests are useful and can attain roughly equal accuracy.

> Easy to parallelize
> Very fast merges compared to other algorithms with similar accuracy
> - Beats for example t-digest by a large margin

The margin is now about 2-4x smaller than before. The moment sketches are going to be faster for merging, however.

> Merge time begins to dominate the performance when the number of merged quantiles is around 10^4

Note that all of these comparisons are ignoring I/O time, deserialization and the potential for pre-merging. The I/O costs may well dominate the merge costs except for specialized hardware. The moment sketch is particularly easy to use in an off-heap fashion.

> Doesn't work well for low cardinality datasets
> - With only 5 unique values the algorithm didn't converge at all
> - Due to this limitation, this algorithm would also fail the test here: #731

To be fair, it would be typical to just store the samples directly until the samples take up as much space as the digest. For 32-bit floats, this is 50 samples, which is plenty to allow convergence.

Another question, though, is what happens to accuracy with repeated values.
|
So happy to see this issue receiving attention! The references are interesting--thanks for sharing. I agree that t-digest would be a safe choice for floats. How much consideration should we give to taking quantiles on integer and non-numeric datatypes? I wrote the code in |
I personally think we should only replace the algorithm used for the user-facing quantile method. |
We could also speed up the internal algorithm by using functions (specifically |
Thanks for the comments @tdunning. It would definitely be helpful to get some advice from you when updating the implementation! I created an issue in the crick repository so we can discuss specifics of the implementation there: dask/crick#20. I added a couple of questions there already, if you have time to take a look. I also agree we should only change the user-facing method and leave the internal one as-is, at least for now. Maybe we should make a new issue for speeding up the internal quantile method? There seems to be a consensus that t-digest is a good way to go. I'll start working on updating t-digest in crick and implementing the quantile method using that this week still. I'm also willing to share the work if anyone wants to help with these tasks. :) |
Btw... I just took a look at the moment sketch relative to some other work. The implementation referenced by the VLDB paper (http://www.bailis.org/papers/moments-vldb2018.pdf) is very fragile. If the sample distribution has an offset that is about 10-20x the width of the distribution, it fails to converge. This means, for example, that it would fail if you gave it human body temperatures. Or altitudes of landmarks in Denver. Or disk space used over less than a few days. And by fail, I don't mean it produces inaccurate results. Instead, it just throws a relatively uninformative RuntimeException when trying to compute a quantile.

To demonstrate this, just add a million points from a uniform distribution from 20 to 21 and ask for the median or 99th percentile.

I hate to snipe, but that method isn't ready for prime time.
|
cc @edgan8 who was the author of the moment sketch paper |
Yes, as Ted pointed out, the moments code is more of a prototype proof of concept than something I would recommend for production use. In my experience the t-digest is a reasonable choice, and I would also look at https://datasketches.github.io/ which has quite nice error guarantees @leerho. @tdunning can you elaborate on which method you're calling to run into these issues? For a uniform distribution from 20 to 21 I would expect the moment sketch algorithm to work quite well, and if it doesn't I should fix the bug. |
@edgan8
If you would like comments on the code, I would be happy to make some.
|
Thanks for linking this, it seems interesting indeed. It would be an option to create a Python wrapper for the C++ version if it's better suited for this use-case than t-digest. Based on a quick glance, the error guarantees seem good enough for us. I think the question here is whether it's noticeably faster than t-digest. |
Is this issue considered closed by #4677? |
I think so. @mrocklin is this safe to close with the introduction of the method='tdigest' option in quantile?
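For reference, a minimal usage sketch of that option (assuming the method keyword behaves as described in the dask docs and that crick is installed):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"x": range(10_000)})
ddf = dd.from_pandas(df, npartitions=10)

# t-digest based approximate quantiles (requires the crick package)
print(ddf.x.quantile(0.5, method="tdigest").compute())

# the original dask approximation, for comparison
print(ddf.x.quantile(0.5, method="default").compute())
```
|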
Sure?
|
The current quantiles implementation has served us well, but may need to be redone. It was, I believe, the result of a quick whiteboard session rather than proper research of existing algorithms.
It has some accuracy concerns (see #731) and is also fairly slow when operating on many many partitions (around 1 minute on the nyctaxi data).
People have suggested t-digest in the past. This has all of the operations that we need. It is a bit overkill but also widely implemented and trusted.
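For concreteness, a rough sketch of the operations needed (build a digest per partition, merge, then query), using the pure-Python tdigest package discussed in the comments above; the exact API (batch_update, + for merging, percentile taking 0-100) is an assumption to verify against that package:

```python
from tdigest import TDigest

partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]

# 1. Build a digest per partition (this part is embarrassingly parallel).
digests = []
for part in partitions:
    t = TDigest()
    t.batch_update(part)
    digests.append(t)

# 2. Merge the per-partition digests.
combined = digests[0]
for d in digests[1:]:
    combined = combined + d

# 3. Query any percentile after the fact, e.g. for partition boundaries.
print([combined.percentile(p) for p in (25, 50, 75)])  # roughly 750, 1500, 2250
```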