Upgrade t-digest to 3.2 #28305
It would be good to understand why some of these values are different now, especially as the change for the one on line 100 of the tests is relatively big. I think it's OK for the values to change if they are more correct, but we should be able to explain why they needed to change.
Thanks @colings86! I'm not an expert on t-digest, so please correct me if I'm wrong. I wrote a test for both t-digest 3.2 and 3.1 and found that the big change was introduced in 3.2:
The output is:
For 3.1, the output:
Then I found another implementation https://github.com/CamDavidsonPilon/tdigest and made a test:
The output is:
So I tend to think it's more accurate than before.
@liketic I had a look into the differences between t-digest 3.0 and t-digest 3.2 specifically around the AVLTreeDigest that we use in Elasticsearch. The key difference that is causing the tests here to need changing seems to be a different approach to interpolating the value when the requested quantile is between two centroids.
As an example, if we are looking for the 81.25th percentile and we have 8 values, the percentile is located at the 6.5th value. Let's say that we have a centroid that contains the 6th value with a centroid value of 4 and a centroid containing the 7th value with a centroid value of 9.
In 3.0 the interpolation was linear between the two centroids. So it will return the value that is halfway between 4 and 9, since the value we want is halfway between the 6th and 7th values. So the returned value is 6.5.
In 3.2 the interpolation is a weighted average between the two values, which takes into account the count of values in the 6th and 7th centroids. So if the 7th centroid has more values than the 6th centroid, the interpolation is biased more towards the 7th centroid.
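To make the difference concrete, here is a rough Python sketch of the two approaches using the centroid values from the example (4 and 9). It is a simplified illustration only, not the actual t-digest code: the function names are made up, and the centroid counts of 1 and 3 are assumed purely for demonstration.

```python
def percentile_rank(percentile, n_values):
    """Position of the requested percentile among n_values sorted values."""
    return percentile / 100.0 * n_values


def linear_interpolation(left_mean, right_mean, fraction):
    """Interpolate by position only; centroid counts are ignored."""
    return left_mean + fraction * (right_mean - left_mean)


def weighted_interpolation(left_mean, left_count, right_mean, right_count):
    """Average the two centroid means, weighted by how many values each holds.

    Simplified model of a count-weighted average (hypothetical helper,
    not the library's implementation).
    """
    total = left_count + right_count
    return (left_mean * left_count + right_mean * right_count) / total


rank = percentile_rank(81.25, 8)        # 6.5, halfway between the 6th and 7th values
fraction = rank - 6                     # 0.5

linear = linear_interpolation(4, 9, fraction)
# Assumed counts: the left centroid holds 1 value, the right centroid holds 3.
weighted = weighted_interpolation(4, 1, 9, 3)

print(linear, weighted)
```

With equal counts in both centroids the two functions agree at the midpoint; as the 7th centroid's count grows, the weighted result moves towards 9.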
I think the weighted interpolation is a better approach, so I am happy that the change is a positive one. Would you mind updating your branch with the latest master? Then I can set off another CI build to make sure we are still green before I merge this.
Thanks for working on this and also for your patience waiting for my response (sorry for the delay).
@liketic it looks like there are still some test failures in the documentation tests:
I think the expected outputs in the percentiles aggregation page in the docs just need adjusting. Are you able to make those changes?