
Upgrade t-digest to 3.3 #7076

Open
Jackie-Jiang wants to merge 1 commit into apache:master from Jackie-Jiang:tdigest_upgrade

Conversation

@Jackie-Jiang
Contributor

Also modify the test per the discussion in #7069

@Jackie-Jiang Jackie-Jiang requested a review from mayankshriv June 20, 2021 20:04
@Jackie-Jiang
Contributor Author

@tdunning Please take a look at the modified test and see if limiting the error to be within 2% makes sense for compression 200?

@codecov-commenter

codecov-commenter commented Jun 20, 2021

Codecov Report

Merging #7076 (22ba51e) into master (f773208) will decrease coverage by 0.06%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7076      +/-   ##
============================================
- Coverage     73.64%   73.57%   -0.07%     
  Complexity       91       91              
============================================
  Files          1480     1480              
  Lines         72871    72871              
  Branches      10483    10483              
============================================
- Hits          53664    53614      -50     
- Misses        15734    15781      +47     
- Partials       3473     3476       +3     
Flag Coverage Δ
integration 41.78% <ø> (-0.14%) ⬇️
unittests 65.56% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...a/manager/realtime/RealtimeSegmentDataManager.java 50.00% <0.00%> (-50.00%) ⬇️
...ot/segment/local/customobject/MinMaxRangePair.java 75.86% <0.00%> (-24.14%) ⬇️
...cheduler/resources/PolicyBasedResourceManager.java 75.00% <0.00%> (-18.75%) ⬇️
...impl/dictionary/DoubleOnHeapMutableDictionary.java 46.98% <0.00%> (-13.26%) ⬇️
.../impl/dictionary/FloatOnHeapMutableDictionary.java 69.87% <0.00%> (-6.03%) ⬇️
...cal/startree/v2/builder/BaseSingleTreeBuilder.java 89.23% <0.00%> (-4.49%) ⬇️
.../impl/dictionary/BaseOffHeapMutableDictionary.java 84.00% <0.00%> (-3.34%) ⬇️
...mpl/dictionary/DoubleOffHeapMutableDictionary.java 59.57% <0.00%> (-3.20%) ⬇️
...gregation/function/StUnionAggregationFunction.java 71.42% <0.00%> (-2.86%) ⬇️
...lix/core/realtime/PinotRealtimeSegmentManager.java 82.05% <0.00%> (-2.06%) ⬇️
... and 8 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f773208...22ba51e.

@tdunning
Contributor

@tdunning Please take a look at the modified test and see if limiting the error to be within 2% makes sense for compression 200?

I can't comment exactly without understanding the test. I can see that 200,000 random numbers are generated and put pair-by-pair into t-digests that are somehow serialized and combined. Ultimately, the data seems to be grouped using two fields, each with 100 values, resulting in a total of 10,000 digests, each with an average of 20 data points in them. Beyond this outline, my understanding fades to nothing. Can you say what the difference between the star-index path and the non-star-index path is?

Without understanding the details, I think it is a reasonable assumption to say that the difference between two digests with the same data but constructed differently will be limited by the difference between either digest and the actual empirical distribution. The test, as designed, does not appear likely to put more than 45 data points in a single digest, which means that any compression parameter greater than 100 will result in zero error because all 45 data points will be preserved.

My guess is that the underlying intent of the test is a bit different from the execution, though, and you are interested in two things over a range of digest sizes:

  • that the star tree approach will not increase errors relative to the non star tree approach

  • that the t-digest will not introduce large errors in the first place for some definition of large

I can say that pretty much all quantiles such that q <= 0.1 or q >= 0.9 will have really small errors with compression of 200 for as many samples as you might expect and still quite small errors with compression of 100.

Here are some examples derived from 50 runs for different values of compression (delta). In general, the worst case is when there is a transition to two samples per centroid. Other than this, the accuracy for the median is generally worse than the accuracy for the tails (that is, of course, the point of the t-digest).

In general, for 100 samples or more, you can occasionally have errors of more than 2% with delta=50, and will rarely have errors over 1% with delta=100 or delta=200. I don't have handy data for n < 100, but for delta=200 there should never be any errors for such small counts except possibly right at the median.

[Error plots for delta=50, delta=100, delta=200, and delta=500 omitted.]
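The bound described above, that the error between two digests of the same data is limited by each digest's error against the empirical distribution, can be illustrated with a toy sketch. This is a hypothetical stand-in using equal-weight chunk means, not the real t-digest's tapered centroids, and the class and method names are invented for illustration.

```java
import java.util.Random;

// Toy illustration (NOT the real t-digest): summarize sorted data as the
// means of equal-weight chunks, then measure the worst-case quantile error
// against the empirical distribution.
public class QuantileErrorSketch {
    // Linear-interpolated quantile of a sorted array.
    static double quantile(double[] sorted, double q) {
        double idx = q * (sorted.length - 1);
        int lo = (int) idx;
        int hi = Math.min(lo + 1, sorted.length - 1);
        double frac = idx - lo;
        return sorted[lo] * (1 - frac) + sorted[hi] * frac;
    }

    // Crude equal-weight summary: one "centroid" (chunk mean) per bucket.
    static double[] compress(double[] sorted, int centroids) {
        double[] out = new double[centroids];
        double chunk = (double) sorted.length / centroids;
        for (int i = 0; i < centroids; i++) {
            int from = (int) (i * chunk), to = (int) ((i + 1) * chunk);
            double sum = 0;
            for (int j = from; j < to; j++) sum += sorted[j];
            out[i] = sum / (to - from);
        }
        return out;
    }

    // Worst absolute error over quantiles 0%, 1%, ..., 100%.
    static double worstError(int n, int centroids, long seed) {
        double[] data = new Random(seed).doubles(n).sorted().toArray();
        double[] summary = compress(data, centroids);
        double worst = 0;
        for (int q = 0; q <= 100; q++) {
            worst = Math.max(worst,
                Math.abs(quantile(data, q / 100.0) - quantile(summary, q / 100.0)));
        }
        return worst;
    }

    public static void main(String[] args) {
        // 100k uniform samples, 200 "centroids": error stays well under 1%.
        System.out.println(worstError(100_000, 200, 42));
    }
}
```

Even this crude summary keeps the worst-case quantile error of uniform data near the granularity of its chunks; the real t-digest does much better at the tails by making centroids smaller near q=0 and q=1.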

@Jackie-Jiang
Contributor Author

@tdunning Thanks for the explanation.
Yes, we randomly generate 100,000 t-digest objects, each constructed from 2 random integers in the range [0, 10,000). The difference between star-tree and non-star-tree is just the merging order of these 100,000 t-digest objects; eventually we merge all of them into one. Then we compare the results at quantiles 0%, 1%, ..., 99%, 100% and check the difference between the 2 merged t-digests.
With compression 200, I can still observe differences over 100 between the 2 t-digests (over 1%), while with the old version (3.2) the difference is always smaller than 50 (0.5%). Is this expected? The new version seems to have worse accuracy than the old version.

@tdunning
Contributor

@Jackie-Jiang

It is not expected that accuracy would be worse if you normalize by the number of centroids. On the other hand, compression 200 with the new version should keep at least as many centroids as compression 100 in 3.2.

Can you say more about the merge strategies of interest? I would be happy to build a reference test that mirrors your needs.

Merging through a ser-de step will definitely have an impact on accuracy since more data is transiently kept in memory than is strictly specified by the compression level. Even so, we should be able to find a good trade-off.

@Jackie-Jiang
Contributor Author

@tdunning

The random t-digest objects are created as follows:

  @Override
  Object getRandomRawValue(Random random) {
    TDigest tDigest = TDigest.createMergingDigest(COMPRESSION);
    tDigest.add(random.nextInt(MAX_VALUE));
    tDigest.add(random.nextInt(MAX_VALUE));
    return ObjectSerDeUtils.TDIGEST_SER_DE.serialize(tDigest);
  }

The non-star-tree approach simply merges the t-digests in sequence without ser-de; the star-tree approach pre-aggregates t-digests and then stores the serialized merged t-digests (each pre-aggregated t-digest might go through multiple rounds of ser-de).

Non-star-tree:

  TDigest tDigest = deserialize(tDigest1);
  tDigest.add(deserialize(tDigest2));
  tDigest.add(deserialize(tDigest3));
  tDigest.add(deserialize(tDigest4));

Star-tree:

  TDigest tDigest = deserialize(tDigest1);
  tDigest.add(deserialize(tDigest2));
  byte[] mergedTDigest1 = serialize(tDigest);
  
  tDigest = deserialize(tDigest3);
  tDigest.add(deserialize(tDigest4));
  byte[] mergedTDigest2 = serialize(tDigest);

  tDigest = deserialize(mergedTDigest1);
  tDigest.add(deserialize(mergedTDigest2));

Then we compare the result for each quantile from these 2 final t-digest objects.

Hope this can explain the test logic.
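The two merge orders and the quantile comparison described above can be sketched with a minimal harness. This is a hypothetical illustration: it substitutes an exact "digest" (a sorted array, where merge = concatenate and sort) for TDigest, so the two orders agree perfectly; the real test instead bounds how far the approximate digests are allowed to drift from each other.

```java
import java.util.Arrays;

// Sketch of the merge-order comparison, with a sorted array standing in
// for TDigest. Exact merging is order-independent, so the difference is 0;
// the approximate t-digest can differ slightly between the two orders.
public class MergeOrderHarness {
    // "Merge" two digests: concatenate and re-sort (exact, lossless).
    static double[] merge(double[] a, double[] b) {
        double[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        Arrays.sort(out);
        return out;
    }

    // Linear-interpolated quantile of a sorted array.
    static double quantile(double[] sorted, double q) {
        double idx = q * (sorted.length - 1);
        int lo = (int) idx, hi = Math.min(lo + 1, sorted.length - 1);
        double frac = idx - lo;
        return sorted[lo] * (1 - frac) + sorted[hi] * frac;
    }

    // Largest difference over quantiles 0%, 1%, ..., 100% between the
    // non-star-tree (sequential) and star-tree (pairwise) merge orders.
    static double maxQuantileDiff(double[][] parts) {
        // Non-star-tree order: fold left d1, d2, d3, d4.
        double[] sequential = parts[0];
        for (int i = 1; i < parts.length; i++) sequential = merge(sequential, parts[i]);
        // Star-tree order: (d1 + d2) + (d3 + d4).
        double[] tree = merge(merge(parts[0], parts[1]), merge(parts[2], parts[3]));
        double worst = 0;
        for (int q = 0; q <= 100; q++) {
            worst = Math.max(worst,
                Math.abs(quantile(sequential, q / 100.0) - quantile(tree, q / 100.0)));
        }
        return worst;
    }

    public static void main(String[] args) {
        double[][] parts = {{1, 5}, {2, 8}, {3, 7}, {4, 6}}; // two values per digest
        System.out.println(maxQuantileDiff(parts)); // exact merge -> 0.0
    }
}
```

In the real test the `merge` step goes through TDigest's compression (and, on the star-tree path, extra ser-de round trips), which is where the observed 1% difference comes from.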

@tdunning
Contributor

tdunning commented Jun 22, 2021 via email

@Jackie-Jiang
Contributor Author

@tdunning Any update on this?

@tdunning
Contributor

tdunning commented Jul 1, 2021 via email

@Jackie-Jiang
Contributor Author

@tdunning Thanks for the explanation.
I understand the error is unavoidable, but IMO the test is legit and just mimics the behavior of using t-digest to pre-group the numbers and get compressed statistics instead of storing all the numbers. From the user's perspective, grouping the numbers in a different order should give very close results.
The concern here is that with exactly the same way of grouping the numbers, we observed a larger error rate with the new version compared with the old version. We want to understand this behavior. Is it caused by better/stricter compression in the new version? What is the benefit of upgrading to the new version?

@Jackie-Jiang
Contributor Author

@richardstartin Do you want to take a look at this discussion thread? We observed worse accuracy after upgrading the library, but not sure if that is expected because of a more advanced algorithm (e.g. tradeoff between space and accuracy)


5 participants