Upgrade t-digest to 3.3 by Jackie-Jiang · Pull Request #7076 · apache/pinot

Jackie-Jiang · 2021-06-20T20:04:04Z

Also modify the test per the discussion in #7069

Jackie-Jiang · 2021-06-20T20:06:12Z

@tdunning Please take a look at the modified test and see if limiting the error to be within 2% make sense for compression 200?

codecov-commenter · 2021-06-20T21:13:54Z

Codecov Report

Merging #7076 (22ba51e) into master (f773208) will decrease coverage by 0.06%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master    #7076      +/-   ##
============================================
- Coverage     73.64%   73.57%   -0.07%     
  Complexity       91       91              
============================================
  Files          1480     1480              
  Lines         72871    72871              
  Branches      10483    10483              
============================================
- Hits          53664    53614      -50     
- Misses        15734    15781      +47     
- Partials       3473     3476       +3

Flag	Coverage Δ
integration	`41.78% <ø> (-0.14%)`	⬇️
unittests	`65.56% <ø> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...a/manager/realtime/RealtimeSegmentDataManager.java	`50.00% <0.00%> (-50.00%)`	⬇️
...ot/segment/local/customobject/MinMaxRangePair.java	`75.86% <0.00%> (-24.14%)`	⬇️
...cheduler/resources/PolicyBasedResourceManager.java	`75.00% <0.00%> (-18.75%)`	⬇️
...impl/dictionary/DoubleOnHeapMutableDictionary.java	`46.98% <0.00%> (-13.26%)`	⬇️
.../impl/dictionary/FloatOnHeapMutableDictionary.java	`69.87% <0.00%> (-6.03%)`	⬇️
...cal/startree/v2/builder/BaseSingleTreeBuilder.java	`89.23% <0.00%> (-4.49%)`	⬇️
.../impl/dictionary/BaseOffHeapMutableDictionary.java	`84.00% <0.00%> (-3.34%)`	⬇️
...mpl/dictionary/DoubleOffHeapMutableDictionary.java	`59.57% <0.00%> (-3.20%)`	⬇️
...gregation/function/StUnionAggregationFunction.java	`71.42% <0.00%> (-2.86%)`	⬇️
...lix/core/realtime/PinotRealtimeSegmentManager.java	`82.05% <0.00%> (-2.06%)`	⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f773208...22ba51e. Read the comment docs.

tdunning · 2021-06-21T07:45:43Z

@tdunning Please take a look at the modified test and see if limiting the error to be within 2% make sense for compression 200?

I can't comment exactly without understanding the test. I can see that 200,000 random numbers are generated and put pair-by-pair into t-digests that are somehow serialized and combined. Ultimately, the data seems to be grouped using two fields each with 100 values resulting in a total 10,000 digests each with an average of 20 data points in them. Beyond this outline, my understanding fades to nothing. Can you say what the different between the star-index path and non-star-index path is?

Without understanding the details, I think that it is a reasonable assumption to say that the difference between two digests with the same data but constructed differently will be limited by the difference between difference between either digest and the actual empirical distribution. The test, as designed, does not appear to be likely to put more than 45 data points in a single digest which means that any compression parameter greater than 100 will result in zero error because all 45 data points will be preserved.

My guess is that the underlying intent of the test is a bit different from the execution, though, and you are interested in two things over a range of digest sizes:

that the star tree approach will not increase errors relative to the non star tree approach
that the t-digest will not introduce large errors in the first place for some definition of large

I can say that pretty much all quantiles such that q <= 0.1 or q >= 0.9 will have really small errors with compression of 200 for as many samples as you might expect and still quite small errors with compression of 100.

Here are some examples derived from 50 runs for different values of compression (delta). In general, the worst case is when there is a transition to two samples per centroid. Other than this, the accuracy for the median is generally worse than the accuracy for the tails (that is, of course, the point of the t-digest).

In general, for 100 samples or more, you can occasionally have errors of more than 2% with delta=50, and will rarely have errors over 1% with delta=100 or delta=200. I don't have handy data for n < 100, for delta=200, there should never be any errors for such small counts except possibly right at the median.

Jackie-Jiang · 2021-06-21T17:41:01Z

@tdunning Thanks for the explanation.
Yes we randomly generate 100,000 t-digest objects, each constructed from 2 random integers of range [0, 10,000). The difference between star-tree vs non-star-tree is just the merging order of these 100,000 t-digest objects, and eventually we will merge all the t-digest objects into one. Then we compare the result of quantile 0%, 1%, ..., 99%, 100% and check the difference of the results from the 2 merged t-digests.
With compression 200, I can still observe difference over 100 from the 2 t-digests (over 1%). While with the old version (3.2), the difference is always smaller than 50 (0.5%). Is this expected? I feel the new version somehow has worse accuracy than the old version

tdunning · 2021-06-21T18:08:22Z

@Jackie-Jiang

It is not expected that accuracy would be worse if you normalize by number of centroids. On the other hand, compression 200 with new should by at least as many centroids as compression 100 in 3.2.

Can you say more about the merge strategies of interest? I would be happy to build a reference test that mirrors your needs.

Merging through a ser-de step will definitely have an impact on accuracy since more data is transiently kept in memory than is strictly specified by the compression level. Even so, we should be able to find a good trade-off.

Jackie-Jiang · 2021-06-21T18:28:16Z

@tdunning

The random t-digest objects are created as following:

  @Override
  Object getRandomRawValue(Random random) {
    TDigest tDigest = TDigest.createMergingDigest(COMPRESSION);
    tDigest.add(random.nextInt(MAX_VALUE));
    tDigest.add(random.nextInt(MAX_VALUE));
    return ObjectSerDeUtils.TDIGEST_SER_DE.serialize(tDigest);
  }

In non-star-tree approach simply merges the t-digests in sequence without ser-de; the star-tree approach will pre-aggregate t-digests and then stores the serialized merged t-digests (each pre-aggregated t-digest might go through multiple rounds of ser-de).

Non-star-tree:

  TDigest tDigest = deserialize(tDigest1);
  tDigest.add(deserialize(tDigest2));
  tDigest.add(deserialize(tDigest3));
  tDigest.add(deserialize(tDigest4));

Star-tree:

  TDigest tDigest = deserialize(tDigest1);
  tDigest.add(deserialize(tDigest2));
  byte[] mergedTDigest1 = serialize(tDigest);
  
  tDigest = deserialize(tDigest3);
  tDigest.add(deserialize(tDigest4));
  byte[] mergedTDigest2 = serialize(tDigest);

  tDigest = deserialize(mergedTDigest1);
  tDigest.add(deserialize(mergedTDigest2));

Then we compare the result for each quantile from these 2 result t-digest objects.

Hope this can explain the test logic.

tdunning · 2021-06-22T15:07:44Z

Yes. Very helpful. I have a test that does roughly this. I keep the raw data for comparison, a t-digest that is never serialized, one that is serialized and deserialized for each sample and five that each get random subsets of data and then are merged at the end. The final analysis looks at discrepancies against exact answers and the three digested approaches. Not done yet.

…

On Mon, Jun 21, 2021 at 11:28 AM Xiaotian (Jackie) Jiang < ***@***.***> wrote: @tdunning <https://github.com/tdunning> The random t-digest objects are created as following: @OverRide Object getRandomRawValue(Random random) { TDigest tDigest = TDigest.createMergingDigest(COMPRESSION); tDigest.add(random.nextInt(MAX_VALUE)); tDigest.add(random.nextInt(MAX_VALUE)); return ObjectSerDeUtils.TDIGEST_SER_DE.serialize(tDigest); } In non-star-tree approach simply merges the t-digests in sequence without ser-de; the star-tree approach will pre-aggregate t-digests and then stores the serialized merged t-digests (each pre-aggregated t-digest might go through multiple rounds of ser-de). Non-star-tree: TDigest tDigest = deserialize(tDigest1); tDigest.add(deserialize(tDigest2)); tDigest.add(deserialize(tDigest3)); tDigest.add(deserialize(tDigest4)); Star-tree: TDigest tDigest = deserialize(tDigest1); tDigest.add(deserialize(tDigest2)); byte[] mergedTDigest1 = serialize(tDigest); tDigest = deserialize(tDigest3); tDigest.add(deserialize(tDigest4)); byte[] mergedTDigest2 = serialize(tDigest); tDigest = deserialize(mergedTDigest1); tDigest.add(deserialize(mergedTDigest2)); Then we compare the result for each quantile from these 2 result t-digest objects. Hope this can explain the test logic. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#7076 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAB5E6XDZ2GJYY3ZE5JDYL3TT6AE7ANCNFSM47AL7QEA> .

Jackie-Jiang · 2021-07-01T17:42:47Z

@tdunning Any update on this?

tdunning · 2021-07-01T19:14:28Z

Yes. There is an update. I think that there is a fundamental problem with the way that the test is phrased. Any digest that has limited memory will ultimately have to lose some information and will estimate quantiles. That will lead to errors in sample space roughly equal to the separation between adjacent samples. Different digests may keep more samples or fewer and thus adjust when this happens and they may prioritize samples in different parts of the distribution, but there will always be an approximation error when samples are combined. This means that if the samples are more than 0.02 x max apart, you have the risk of error unless the digest is not a digest at all. The max separation for N uniform random samples has a 99% percentile value of 4.6 / N. That means that you will need 230 samples before you have a 99% chance of having no gap bigger than 99%. In turn, that means that your digest will need to keep at least 230 samples before coalescing to be reasonably assured of passing the test. Some digests might do a bit better than this, but not massively. For instance, the t-digest will focus coalescence into a fraction of the interval and thus have a probability of failure of 1/2 to 1/10 of this value. On the other hand, if you recast the error back into quantile space by calculating the empirical CDF of the estimated quantile, you immediately have a well-behaved system. Your unavoidable errors will be bounded by 1/N and a t-digest that keeps 50 samples or more and limits errors to one sample size should have the guarantees that you want.

…

On Thu, Jul 1, 2021 at 10:42 AM Xiaotian (Jackie) Jiang < ***@***.***> wrote: @tdunning <https://github.com/tdunning> Any update on this? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#7076 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAB5E6VIUKHHS5FQAXWXWEDTVSSKDANCNFSM47AL7QEA> .

Jackie-Jiang · 2021-07-02T00:32:49Z

@tdunning Thanks for the explanation.
I understand the error is unavoidable, but IMO the test is legit and just mimics the behavior of utilizing t-digest to pre-grouping the numbers and get the compressed statistics instead of storing all the numbers. From user's perspective, grouping the numbers in different order should give very close result.
The concern here is that with exactly the same way of grouping numbers, we observed larger error rate with the new version compared with the old version. We want to justify this behavior. Is this caused by better/more strict compression in the new version? What is the benefit of upgrading to the new version?

Jackie-Jiang · 2022-01-14T20:24:43Z

@richardstartin Do you want to take a look at this discussion thread? We observed worse accuracy after upgrading the library, but not sure if that is expected because of a more advanced algorithm (e.g. tradeoff between space and accuracy)

Upgrade t-digest to 3.3

22ba51e

Jackie-Jiang requested a review from mayankshriv June 20, 2021 20:04

mayankshriv approved these changes Jun 20, 2021

View reviewed changes

andrewchoi5 approved these changes Jan 11, 2022

View reviewed changes

Jackie-Jiang mentioned this pull request Apr 10, 2024

Bump com.tdunning:t-digest from 3.2 to 3.3 #12882

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade t-digest to 3.3#7076

Upgrade t-digest to 3.3#7076
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:tdigest_upgrade

Jackie-Jiang commented Jun 20, 2021

Uh oh!

Jackie-Jiang commented Jun 20, 2021

Uh oh!

codecov-commenter commented Jun 20, 2021 •

edited

Loading

Uh oh!

tdunning commented Jun 21, 2021

Uh oh!

Jackie-Jiang commented Jun 21, 2021

Uh oh!

tdunning commented Jun 21, 2021

Uh oh!

Jackie-Jiang commented Jun 21, 2021

Uh oh!

tdunning commented Jun 22, 2021 via email

Uh oh!

Jackie-Jiang commented Jul 1, 2021

Uh oh!

tdunning commented Jul 1, 2021 via email

Uh oh!

Jackie-Jiang commented Jul 2, 2021

Uh oh!

Jackie-Jiang commented Jan 14, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

Jackie-Jiang commented Jun 20, 2021

Uh oh!

Jackie-Jiang commented Jun 20, 2021

Uh oh!

codecov-commenter commented Jun 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tdunning commented Jun 21, 2021

Uh oh!

Jackie-Jiang commented Jun 21, 2021

Uh oh!

tdunning commented Jun 21, 2021

Uh oh!

Jackie-Jiang commented Jun 21, 2021

Uh oh!

tdunning commented Jun 22, 2021 via email

Uh oh!

Jackie-Jiang commented Jul 1, 2021

Uh oh!

tdunning commented Jul 1, 2021 via email

Uh oh!

Jackie-Jiang commented Jul 2, 2021

Uh oh!

Jackie-Jiang commented Jan 14, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov-commenter commented Jun 20, 2021 •

edited

Loading