
Redo Quantiles #1225

Closed

mrocklin opened this issue Jun 5, 2016 · 49 comments

mrocklin (Member) commented Jun 5, 2016

The current quantiles implementation has served us well, but may need to be redone. It was, I believe, the result of a quick whiteboard session rather than proper research of existing algorithms.

It has some accuracy concerns (see #731) and is also fairly slow when operating on very many partitions (around one minute on the nyctaxi data).

People have suggested t-digest in the past. It has all of the operations that we need. It is a bit of overkill, but it is also widely implemented and trusted.

mrocklin added the "good first issue" and "dataframe" labels Jun 5, 2016
mrocklin (Member Author) commented Jun 5, 2016

The performance issue with the current implementation seems to be from the last task. We might want to try a tree reduction.
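
For illustration, a tree reduction over per-partition summaries could look like the following minimal sketch (plain Python with hypothetical names, merging sorted summaries pairwise; not dask's actual code):

import heapq

def merge_two(a, b):
    # Merge two sorted percentile summaries into one sorted summary.
    return list(heapq.merge(a, b))

def tree_reduce(summaries):
    # O(log n) levels of pairwise merges instead of one big final task.
    while len(summaries) > 1:
        merged = [merge_two(a, b) for a, b in zip(summaries[::2], summaries[1::2])]
        if len(summaries) % 2:  # carry the odd summary to the next level
            merged.append(summaries[-1])
        summaries = merged
    return summaries[0]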

mrocklin (Member Author) commented Jun 6, 2016

The current tdigest package on PyPI appears to be quite slow:

In [1]: from tdigest import TDigest

In [2]: t = TDigest()

In [3]: %time t.batch_update(range(10000))
CPU times: user 6.47 s, sys: 4 ms, total: 6.48 s
Wall time: 6.53 s

My immediate need would probably be satisfied by accelerating the reduction step of the current percentile function.

cc @eriknw

mrocklin (Member Author) commented Jun 6, 2016

It might also be a fun side project for someone to build a faster tdigest project

shoyer (Member) commented Jun 6, 2016

Indeed, I did a little experiment with this a few months ago and noticed that tdigest seemed to be very slow (relative to loops in a compiled language). It seems like we need some sort of batch update to achieve reasonable performance. At that point, I decided that this was probably beyond my immediate capabilities.

jcrist (Member) commented Jun 6, 2016

It might also be a fun side project for someone to build a faster tdigest project

I've been wanting to do this. I think a cython implementation could be done fairly cheaply and also be performant. I might try to put something together this week.

mrocklin (Member Author) commented Jun 6, 2016

Percentile is fairly close to the core functionality of dask.dataframe. I'm hesitant to add a cython project dependency to dask.dataframe if it can be avoided. Currently we only depend on Pandas and other pure-python projects.

jcrist (Member) commented Jun 6, 2016

I'm skeptical that a performant version can be written without any cython code. From what I remember when I looked at this a while ago, the needed methods couldn't be easily spelled with numpy vectorized methods. I'm also not against relying on another cython package - I don't think the code would be very complex, and conda and wheels make this cheap. Would be happy to be proven wrong though.

shoyer (Member) commented Jun 6, 2016

My two cents is that percentile should never have been added to dask.dataframe in its current state. It's completely bogus for some unknown fraction of inputs, with no guarantees on performance or correctness. So I think a required Cython dependency is perfectly reasonable.


mrocklin (Member Author) commented Jun 6, 2016

So, the algorithm does represent itself as approximate. We've seen a failure case for a version of "approximate" that is something like "the size of the bins into which the percentiles partition the original data", but this failure case doesn't yet disprove approximate under the definition of "will give you a value with a rank similar to that of the actual percentile."

import pandas as pd
import dask.dataframe as dd

s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
# also holds for all different chunk sizes that I tested other than 20
dd.from_pandas(s, 20).quantile(0.5).compute()  # 1.0

This is either great or terrible performance depending on how you define approximate. If you define it as the number of entries away, then yes, it's bad. However, if you define it as the number of y-values away, then we're just one element off :)

I suspect that we're actually talking about two different operations. A lot of internal dask.dataframe operations require a fast and robustly available approximate quantile implementation. We don't mind if it's wildly off (though in practice it performs exceedingly well) so we don't really care about it not being a thoroughly vetted algorithm. We might want to separate this from a user-facing approximate quantile that has good theoretical guarantees. I think that this is the operation that you all are talking about. I honestly don't care much about this algorithm (I think it'd be handy, but it isn't essential to correct dataframe operation). I don't mind this operation being cython-based as long as that dependency is optional.

shoyer (Member) commented Jun 6, 2016

I suspect that we're actually talking about two different operations. A lot of internal dask.dataframe operations require a fast and robustly available approximate quantile implementation. We don't mind if it's wildly off (though in practice it performs exceedingly well) so we don't really care about it not being a thoroughly vetted algorithm.

Yes, I completely agree.

jcrist (Member) commented Jun 6, 2016

I suspect that we're actually talking about two different operations. A lot of internal dask.dataframe operations require a fast and robustly available approximate quantile implementation. We don't mind if it's wildly off (though in practice it performs exceedingly well) so we don't really care about it not being a thoroughly vetted algorithm.

Ah, so what you're looking for here is a faster quantile-like thing, regardless of the validity of the quantiles. I agree then that something good enough could probably be found without adding a compiled dependency. Perhaps change the title of the issue to better reflect the need (as it's not really quantiles you're after)?

eriknw (Member) commented Jun 6, 2016

I agree about separating user-facing quantiles and the approximate quantile function used internally.

Regarding improving performance, cheap things to try are to use cytoolz.merge_sorted (maybe just to see if it helps), and to calculate fewer percentiles on each partition if there are many partitions. For example, suppose you want 1000 equally-spaced percentiles and have 1000 partitions. Right now, I think we calculate all the percentiles on all the partitions, so we're sorting 1,000,000 pieces of data which get reduced to 1000. Taking as few as 10 percentiles on each partition would probably suffice for "good enough" here in most cases, but it does become more sensitive to pathological cases and seems like very few data points for a single sweep over the data.
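
As a rough sketch of that idea (illustrative code only, not the actual dask internals): take only a few percentiles per partition, weight each summary point by its partition's length, and read global estimates off the merged weighted summary.

import numpy as np

def partition_summary(part, n_points=10):
    # Only a few equally spaced percentiles per partition.
    return np.percentile(part, np.linspace(0, 100, n_points)), len(part)

def merge_summaries(summaries):
    # Each summary point stands in for len(part) / n_points elements.
    values = np.concatenate([v for v, _ in summaries])
    weights = np.concatenate([np.full(len(v), n / len(v)) for v, n in summaries])
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights) / weights.sum()
    return values, cdf

# values, cdf = merge_summaries([partition_summary(p) for p in partitions])
# np.interp(0.5, cdf, values)  # approximate global median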

mrocklin (Member Author) commented Jun 6, 2016

Current performance on the nyctaxi dataset on a cluster looks like a 20 ms cost for each of the per-partition percentile computations, followed by a lot of data transfer and a 20 s cost for the combination.

eriknw (Member) commented Jun 6, 2016

The performance issue with the current implementation seems to be from the last task. We might want to try a tree reduction.

It should be relatively straightforward to convert merge_sorted within merge_percentiles into a reduction. I don't have the time to get to this this week, but I'd be happy to talk about it as necessary. You know where to find me, @jcrist, if you want to tackle this.

eriknw (Member) commented Jun 8, 2016

I just implemented a faster merge_sorted in toolz (pytoolz/toolz#313) using recursion. It should be straightforward to adapt this and build the reduction graph manually for dask, which will distribute the tasks. This should address the immediate performance concerns regarding internal partitioning.

Even if (or, rather, when) we add t-digest to dask, we will still want to be able to partition on non-interpolable and non-numeric data, so we should have a generic solution that is good enough.
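
To make "build the reduction graph manually" concrete, here is a sketch of hand-building such a graph (hypothetical task names and toy data; the real code would live around merge_percentiles):

from dask.threaded import get
from toolz import merge_sorted

def pairwise_merge(a, b):
    return list(merge_sorted(a, b))

def tree_merge_graph(keys):
    # One merge task per pair per level, so dask can schedule levels in parallel.
    dsk, level = {}, 0
    while len(keys) > 1:
        new_keys = []
        for i in range(0, len(keys) - 1, 2):
            key = ('merge', level, i // 2)
            dsk[key] = (pairwise_merge, keys[i], keys[i + 1])
            new_keys.append(key)
        if len(keys) % 2:  # odd key passes through to the next level
            new_keys.append(keys[-1])
        keys, level = new_keys, level + 1
    return dsk, keys[0]

# Toy usage with eight pre-sorted per-partition summaries:
dsk = {('summary', i): [i, i + 10, i + 20] for i in range(8)}
graph, final_key = tree_merge_graph(sorted(dsk))
print(get({**dsk, **graph}, final_key))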

mrocklin removed the "good first issue" label Jun 9, 2016
ogrisel (Contributor) commented Sep 29, 2016

There is also the weighted approximate quantile algorithm of XGBoost:

https://github.com/dmlc/xgboost/blob/master/src/common/quantile.h

It is described in the appendix of the XGBoost paper:

https://arxiv.org/abs/1603.02754

It's supposed to be very fast. I don't know how it compares to t-digest (from a speed and statistical accuracy point of view). Maybe @tqchen, @tdunning or @CamDavidsonPilon know better.

tdunning commented Sep 29, 2016
I don't know yet. But I will know before long.

One big difference is that recent versions of t-digest are entirely static
with respect to memory allocation.

ogrisel (Contributor) commented Sep 29, 2016

Thanks for your input Ted.

One big difference is that recent versions of t-digest are entirely static with respect to memory allocation.

Is this change purely an implementation detail of the official Java implementation, or a fundamental change to the data structure described in your paper?

tqchen commented Sep 29, 2016

I think the major advantage of what is described in XGBoost is that it works naturally for weighted data, with a theoretical guarantee regardless of the order of the data.

The required memory size can be calculated beforehand from the corresponding accuracy requirement.

It can also be plugged into the streaming version of the algorithm described in http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf

tqchen commented Sep 29, 2016

Another interesting thing is that the weighted version of quantiles can become more accurate (than what it is asked to be) when there are duplicates in the data: the duplicates go into the same buckets and their weights add up. This makes it useful for many skewed real-world data cases.
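
To illustrate that point with plain numpy (my example, not code from XGBoost): duplicates collapse into a few weighted points, and quantile queries stay exact.

import numpy as np

data = np.array([1.0] * 900 + [2.0] * 90 + [3.0] * 10)  # heavily skewed

# Duplicates collapse into (value, weight) pairs: 3 points instead of 1000.
values, weights = np.unique(data, return_counts=True)
cdf = np.cumsum(weights) / weights.sum()

def weighted_quantile(q):
    # Smallest value whose cumulative weight covers quantile q.
    return values[np.searchsorted(cdf, q)]

print(weighted_quantile(0.5))  # 1.0, the exact median of `data`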

ogrisel (Contributor) commented Sep 30, 2016

I think the major advantage of what is described in XGBoost is that it works for the weighted data naturally with theoretical guarantee regardless of order of the data.

My understanding is that t-digest also supports these by default. I have not yet read the papers (t-digest and the xgboost appendix) fully, but I guess that the main difference might be that the approximation error of t-digest is proportional to q * (1 - q), making it very accurate for estimating extreme quantiles such as 0.999. This is useful for use cases such as anomaly detection and trimming outliers, but maybe less important for out-of-core or distributed training of boosted decision trees.
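
For reference, here is my gloss of where that q * (1 - q) factor comes from (paraphrasing the 2013 t-digest paper, not something stated in this thread). A centroid covering quantile q is capped at roughly

\[
|C| \;\le\; \frac{4 N \, q (1 - q)}{\delta},
\]

where N is the total count and \delta the compression parameter, so the rank error at quantile q scales like q(1 - q) and shrinks toward the tails.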

tqchen commented Sep 30, 2016

Got it. I took a quick look at the t-digest walkthrough, and it seems that in order for t-digest to work, it requires an assumption of randomly shuffled data.

The quantile summary, by contrast, provably guarantees correctness (the approximation error falls within a specified epsilon bound) regardless of the order of the dataset. I do agree that the skew-adaptive accuracy at the ends is interesting and not yet supported by the quantile summary.

tdunning commented Oct 1, 2016

Answers:

  • Olivier, the change has been implemented with new algorithms in new classes. The basic idea is that we can buffer samples for a time and then do a sort-merge against the existing centroids. There is a bound on the max number of centroids, so this trivially requires no dynamic allocation.
  • Tianqi, there is no requirement for randomized ordering, especially with the new implementations. The old implementations had issues, but even then, they would simply shuffle the existing centroids and re-insert them. With current knowledge, we could do a merge pass in order when this happens, but we might as well use the merging-array algorithm.
  • Tianqi, insertion of weighted data points is trivial for t-digest, especially for the merging-array variants.
  • Olivier, I expect that tail accuracy is not required for decision trees, as you suggest. On the other hand, the merging-array algorithm might still be interesting for constant absolute accuracy since it is very fast. The change from bin boundaries to centroids is also an interesting difference.
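
Very roughly, the buffer-and-merge idea reads like the following much-simplified sketch (my own illustration of the design, with a simplistic scale-function check; not Ted's actual implementation):

import math

def k(q, delta):
    # t-digest scale function: fine resolution at the tails, coarse in the middle.
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def merge_buffer(centroids, buffer, delta=100.0):
    # centroids and buffer are lists of (mean, weight); returns merged centroids.
    pts = sorted(centroids + buffer)           # the sort-merge step
    total = sum(w for _, w in pts)
    out, w_before = [pts[0]], 0.0              # w_before: weight left of out[-1]
    for mean, w in pts[1:]:
        m0, w0 = out[-1]
        q0 = w_before / total                  # left quantile of current centroid
        q1 = (w_before + w0 + w) / total       # right quantile if we merge
        if k(q1, delta) - k(q0, delta) <= 1:   # still within one unit of k-budget
            out[-1] = ((m0 * w0 + mean * w) / (w0 + w), w0 + w)
        else:
            w_before += w0
            out.append((mean, w))
    return out

# digest = merge_buffer([], [(x, 1.0) for x in data])  # build from raw points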


ogrisel (Contributor) commented Oct 1, 2016

Thank you all for your inputs.

tqchen commented Oct 1, 2016

@tdunning interesting. Just to try to understand it a bit more.

Is there any paper with a proof of correctness of the t-digest (the two variants you mentioned) with respect to non-random orders, or are these heuristics that work well in practice?

tdunning commented Oct 2, 2016

At this point there is no completed paper, nor is there a fully developed proof.

On the other hand, the special case where all of the points are buffered before insertion is trivially correct. I think it follows that ordered presentation for the array buffer algorithm will also be provably correct.

For a more comprehensive proof, it should suffice that merging t-digests gives a valid t-digest. I haven't done that yet, but I have roughed out the steps.

In addition, the algorithm in all forms performs very well in practice. The standard test distributions used in the unit tests include a Gamma(0.1, 0.1) distribution, which is very heavily skewed: the 0.1st percentile is 6.07e-30, the median is 0.006, and the 99.9th percentile is 33.6, while the mean is 1. I don't think there is any practical application with such a high degree of skew. The unit tests also cover ordered, reversed, repeated, and piece-wise random insertion orders.


shoyer mentioned this issue Jul 30, 2018
Dimplexion (Contributor) commented:

This is an old issue, but from what I can see it still hasn't been addressed. I happen to need this functionality right now, so I'm interested in knowing the status of this issue and I'm also willing to work on it.

So what is the current status of this? Could this be implemented using crick as mentioned here #3099?

FYI, a new paper on t-digest was also released recently: https://arxiv.org/abs/1902.04023

mrocklin (Member Author) commented Mar 1, 2019

Thanks for your interest here, @Dimplexion. It would be good to resolve this issue well.

So what is the current status of this? Could this be implemented using crick as mentioned here #3099?

The current status is what you see in this issue: there are a few different proposals. Someone probably needs to evaluate them and come up with a plan for what to implement that balances performance, maintainability (like adding more dependencies), and ease of implementation.

Crick is one suggestion, but there are a few others above. We'll need to come to some conclusion. The first thing to do is probably to summarize the options above, their pros and cons, maybe take a look at the various implementations and see how feasible they are, and then propose a path forward.

tdunning commented Mar 1, 2019 via email

Dimplexion (Contributor) commented:

Alright. This week is a little busy for me but I'll start going through the other options and post my summary here by the end of next week along with my suggestion for which path to take. Others can then verify the conclusions and we can go from there.

Dimplexion (Contributor) commented:

I've been researching the different options, and it has proven to be a little more involved than I initially realized, so I'm posting my current findings here for others to comment on.

Current implementation:

Options:

Python default tdigest package

  • https://github.com/CamDavidsonPilon/tdigest
  • Good guarantees on accuracy near the ends
  • Very easy to implement
    • Stable package on PyPI
    • Easy to use API
  • Doesn't contain the new improvements to t-digest
    • Last updated almost a year ago
  • Seems to be quite slow

Note: the current version seems to be much faster (around twice as fast) than when first tested in this thread.

crick (cython based tdigest implementation)

  • https://github.com/jcrist/crick
  • Much faster to update the digest than the default python package (around 10 times in quick testing)
  • Latest version is still experimental
  • Should be easy to implement in dask
    • Available on PyPI
    • API similar to the default tdigest package
  • Will require resources from dask developers to maintain if chosen

Upcoming t-digest version

  • https://github.com/tdunning/t-digest
  • Still unreleased
    • After release still need to either wait for one of the Python libraries to update to the new version or dedicate resources to update ourselves
  • Algorithm will be faster than previous versions
    • However it's not possible to compare the actual Python implementation with others yet
  • Static memory allocation
  • Has the option for constant error guarantee
    • more intuitive to understand
    • very useful to have for some use-cases

XGBoost approximate quantile

Conclusion so far:

From the currently available options for t-digest, crick seems to be the way to go. It is still experimental, but I feel the performance improvement compared to the default tdigest package is well worth updating and maintaining it. Updating it to support the still-to-be-released changes to the t-digest algorithm is a good option to bring further benefits. As a side note, though, I'm using the simple crick-based implementation from #3099, and it is considerably slower than the current .quantile() implementation (although much more accurate). It is therefore too slow to replace the current implementation for the internal use-case.

Unfortunately it's not so easy to compare t-digest to the XGBoost approximate quantile algorithm due to the lack of a Python library implementation. From a theoretical point of view, both of them seem good enough to cover our use-cases. The way forward is to compare the two algorithms in realistic scenarios by either wrapping XGBoost in a Python module or testing with the C++ implementations. If the performance is similar, I would give the edge to t-digest due to the static memory usage and the fact that it seems to be more widely adopted and supported.

Let me know if I missed anything. Also, I tried to understand both algorithms properly and make my points as objective as I could, but I'm not an expert on these algorithms, so please let me know if there are inaccuracies or you just have a different opinion.

tqchen commented Mar 10, 2019

FYI, a related paper I saw recently https://arxiv.org/abs/1803.01969

jcrist (Member) commented Mar 10, 2019

It is still experimental but I feel the performance improvement compared to the default tdigest package is well worth updating and maintaining it.

As a note, I wrote crick specifically to be used by Dask, but due to priorities never got around to implementing a Dask wrapper. Crick implements T-Digest's "Merging Digest" algorithm at the time of writing (static allocation, fast updates and merges), though it looks like there have been algorithm changes since I wrote it. Since I'm one of the core dask developers I don't see any issue with using it within Dask. It's intended to be a general library of approximate algorithms for Python, so other cython implementations we want can also go there.
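
For anyone following along, usage would look roughly like this (a sketch based on crick's README at the time; the exact API is an assumption worth double-checking):

import numpy as np
from crick import TDigest

t1, t2 = TDigest(), TDigest()
t1.update(np.random.normal(size=100000))  # batch update from one partition
t2.update(np.random.normal(size=100000))

t1.merge(t2)               # combine per-partition digests
print(t1.quantile(0.999))  # query an extreme quantile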

Dimplexion (Contributor) commented:

FYI, a related paper I saw recently https://arxiv.org/abs/1803.01969

Thanks, I'll check that out!

It is still experimental but I feel the performance improvement compared to the default tdigest package is well worth updating and maintaining it.

As a note, I wrote crick specifically to be used by Dask, but due to priorities never got around to implementing a Dask wrapper. Crick implements T-Digest's "Merging Digest" algorithm at the time of writing (static allocation, fast updates and merges), though it looks like there have been algorithm changes since I wrote it. Since I'm one of the core dask developers I don't see any issue with using it within Dask. It's intended to be a general library of approximate algorithms for Python, so other cython implementations we want can also go there.

Sounds good to me. Whichever algorithm we choose implementing it in crick makes a lot of sense.

I'll check out the paper and will try to figure out how to best compare the performance of the different algorithms. I should have time to work more on this next week.

Dimplexion (Contributor) commented:

After some more reading, I've come to the conclusion that, for the public-facing DataFrame.quantile() function, using t-digest as implemented in crick is the best option. We need to update crick to include the latest improvements to t-digest, but we can already work on the implementation in Dask, as the changes will be internal to crick. A summary of the reasons why I think this is the way to go:

  • existing fast implementation we can use in crick
  • widely used for this purpose, trusted by many big entities
  • good accuracy guarantees while still being fast

Comparing XGBoost's quantile algorithm to it, I can't really come up with reasons to use it over t-digest except maybe speed. I believe t-digest will be fast enough to be used in the DataFrame.quantile function, though. The advantage t-digest has is that there are already existing implementations we can use. If there are no objections I will start working on the implementation next week.

I checked out the paper @tqchen linked, and while the algorithm (M-sketch) has some interesting properties, I don't think it's a good fit for this use-case. I'll add my summary of it to the end of this post.

It seems that quantiles are used internally for partitioning. For this case the error guarantees aren't very important, especially near the ends, which is t-digest's strength. It would be interesting to benchmark XGBoost's quantile approximation against t-digest and maybe M-sketch. Doing that will require first writing implementations of them that are accessible from Python, so it will be a little laborious. Are there other use-cases in addition to partitioning? Is the current implementation still slow after the latest updates?

Here's the summary of M-sketch:

Moment-Based Quantile Sketches

Based on this paper: https://arxiv.org/abs/1803.01969

Pros

  • Uses only 200 bytes
  • Easy to parallelize
  • Very fast merges compared to other algorithms with similar accuracy
    • Beats t-digest, for example, by a large margin
    • Merge time begins to dominate the performance when the number of merged quantiles is around 10^4

Cons

  • Accuracy bound is limited to around ε = 0.01
  • Slower quantile estimation compared to other algorithms
    • Quantile estimation time dominates when the number of merged quantiles is around 100 or less
  • Works best for floating-point data
  • Doesn't work well for low-cardinality datasets

Note: the t-digest version compared against didn't include the latest improvements.

Summary:

This algorithm has some really nice properties for handling high-cardinality data, but it's most likely not suitable for our use-cases, at least on its own, due to how badly it handles low-cardinality data. The merge performance is really interesting as an optimization for huge amounts of high-cardinality data.

tdunning commented Mar 13, 2019 via email

eriknw (Member) commented Mar 13, 2019

So happy to see this issue receiving attention! The references are interesting--thanks for sharing.

I agree that t-digest would be a safe choice for floats. How much consideration should we give to taking quantiles on integer and non-numeric datatypes?

I wrote the code in dask/dataframe/partitionquantiles.py, btw, and it was specifically tailored for guesstimating suitable partition boundaries for use with set_index. It has a lot of heuristics to try not to do a terrible job at this in a single pass over the data, and it works for any datatype that has comparisons (including categoricals, strings, etc.). I would be interested to learn where it does poorly (generic quantiles notwithstanding, of course!). It seems to do well enough, fast enough, for partitioning, and we expose a knob (upsample= in set_index) in case the user wants to increase the accuracy of partition boundaries (which may be useful if, e.g., they know that only 0.1% of chunks contain data).
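
For concreteness, that knob looks like this (an illustrative sketch; the column name and factor are made up):

# Sample 2x as many points per partition when estimating partition boundaries
df = df.set_index('timestamp', upsample=2.0)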

jcrist (Member) commented Mar 13, 2019

I personally think we should only replace the algorithm used for the user facing quantile method, and leave our internal algorithm for partitioning alone. It's faster in common cases (though we may be able to speed up the cython implementation of tdigest), and has been working fine so far.

eriknw (Member) commented Mar 13, 2019

We could also speed up the internal algorithm by using functions (specifically merge_sorted) from cytoolz instead of toolz ;)

Dimplexion (Contributor) commented:

Thanks for the comments @tdunning. It would definitely be helpful to get some advice from you when updating the implementation! I created an issue in the crick repository so we can discuss specifics of the implementation there: dask/crick#20. I added a couple of questions there already if you have time to take a look.

I also agree we should only change the user facing method and leave the internal one as-is, at least for now. Maybe we should make a new issue for speeding up the internal quantile method?

There seems to be a consensus that t-digest is a good way to go. I'll start working on updating t-digest in crick and implementing the quantile method with it this week. I'm also happy to share the work if anyone wants to help with these tasks. :)

tdunning commented Mar 24, 2019 via email

tqchen commented Mar 24, 2019

cc @edgan8, who was the author of the moment sketch paper

edgan8 commented Mar 25, 2019

Yes, as Ted pointed out, the moments code is more of a prototype proof of concept than something I would recommend for production use. In my experience the t-digest is a reasonable choice, and I would also look at https://datasketches.github.io/, which has quite nice error guarantees. cc @leerho.

@tdunning can you elaborate on which method you're calling to run into these issues? For a uniform distribution from 20 to 21 I would expect the moment sketch algorithm to work quite well, and if it doesn't I should fix the bug.

tdunning commented Mar 25, 2019 via email

Dimplexion (Contributor) commented Mar 26, 2019

Yes as Ted pointed out the moments code is more of a prototype proof of concept than something I would recommend production use, in my experience the t-digest is a reasonable choice and I would also look at https://datasketches.github.io/ which has quite nice error guarantees @leerho.

Thanks for linking this, it does seem interesting. It would be an option to create a Python wrapper for the C++ version if it's better suited to this use-case than t-digest. Based on a quick glance, the error guarantees seem good enough for us. I think the question here is whether it's noticeably faster than t-digest.

rabernat (Contributor) commented:

Is this issue considered closed by #4677?

jrbourbeau (Member) commented:

I think so. @mrocklin is this safe to close with the introduction of the method='tdigest' option in quantile?
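
For reference, that option reads like this (a usage sketch, assuming a dask DataFrame named ddf; the tdigest method requires crick to be installed):

ddf.quantile(0.5, method='tdigest').compute()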

mrocklin (Member Author) commented Sep 24, 2019 via email
