New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade t-digest to 3.2 #28305

Merged
merged 1 commit into from Feb 15, 2018

Conversation

Projects
None yet
3 participants
@liketic
Contributor

liketic commented Jan 19, 2018

Upgrade t-digest to 3.2. Some unit tests are failed seems like the percentile algorithm is different as the current version.

Closes #28295

/cc @colings86

@elasticmachine

This comment has been minimized.

elasticmachine commented Jan 19, 2018

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

1 similar comment
@elasticmachine

This comment has been minimized.

elasticmachine commented Jan 19, 2018

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@liketic liketic changed the title from Upgrade t digest to Upgrade t-digest Jan 19, 2018

@liketic liketic changed the title from Upgrade t-digest to Upgrade t-digest to 3.2 Jan 19, 2018

@colings86

It would be good to understand why some of these values are different now, especially as for the one on line 100 of the tests the change in the values is relatively big. I think its ok for the values to change if they are more correct but we should be able to explain why the values needed to change.

@liketic

This comment has been minimized.

Contributor

liketic commented Jan 24, 2018

Thanks @colings86 ! I'm not a expert on t-digest, please correct me if I'm wrong. I made a test for both t-digest 3.2 and 3.1, I found the big change is introduced in 3.2:

    private static void test32() {
        AVLTreeDigest digest = new AVLTreeDigest(100);
        digest.add(8);
        digest.add(5);
        digest.add(3);
        digest.add(2);
        digest.add(1);
        digest.add(1);
        digest.add(0);
        System.out.println(digest.quantile(0.86));
        System.out.println(digest.quantile(0.75));
        System.out.println(digest.quantile(0.25));
        System.out.println(digest.quantile(0.01));
    }

The output is:

6.559999999999999
4.5
1.0
0.0

For 3.1, the output:

5.48
4.0
1.0
0.06

Then I found another implementation https://github.com/CamDavidsonPilon/tdigest and made a test:

from tdigest import TDigest

digest = TDigest()
x = [8, 5, 3, 2, 1, 1, 0]

for t in x:
    digest.update(t)

print(digest.percentile(86))
print(digest.percentile(75))
print(digest.percentile(25))
print(digest.percentile(1))

The output is:

6.56
4.5
0.833333333333
-0.286666666667

So I trend to think it's more accurate than before.

@colings86

This comment has been minimized.

Member

colings86 commented Feb 2, 2018

@liketic I had a look into the differences between t-digest 3.0 and t-digest 3.2 specifically around the AVLTreeDigest that we use in Elasticsearch. The key difference that is causing the tests here to need changing seems to be a different approach to interpolating the value when the requested quantile is between two centroids.

As an example, if we are looking for the 81.25th percentile and we have 8 values the percentile is located at the 6.5th value. Let say that we have a centroid that contains the 6th value with a centroid value of 4 and a centroid containing the 7th value with centroid value of 9.

In 3.0 the interpolation was linear between the two centroids. So it will return the value that is half way between 5 and 8, since the value we want is half way between the 6th and 7th value. So the returned value is 8.5

In 3.2 the interpolation is a weighted average between the two value which takes into account the count of values in the 6th and 7th centroids. So if the 7th centroid has more values than the 6th centroid, the interpolation is biased more towards the 7th centroid.

I think the weighted interpolation is a better approach so I am happy that the change is a positive one. Would you mind updating your branch with the latest master and then I can set of another CI build to make sure we are still green before I merge this?

Thanks for working on this and also for your patience waiting for my response (sorry for the delay)

@liketic liketic force-pushed the liketic:upgrade-t-digest branch from 7309ac5 to d1890b5 Feb 5, 2018

@liketic

This comment has been minimized.

Contributor

liketic commented Feb 5, 2018

Wow! @colings86 Thank you so much for your professional investigation! I rebased the pull request, please trigger jenkins.

@colings86

This comment has been minimized.

Member

colings86 commented Feb 5, 2018

@liketic no problem, I'll trigger a build now

@colings86

This comment has been minimized.

Member

colings86 commented Feb 5, 2018

jenkins please test this

@colings86

This comment has been minimized.

Member

colings86 commented Feb 5, 2018

jenkins test this please

@colings86

This comment has been minimized.

Member

colings86 commented Feb 6, 2018

@liketic it looks like there are still test failures in TDigestPercentileRanksIT are you able to take a look into them?

@colings86

This comment has been minimized.

Member

colings86 commented Feb 13, 2018

jenkins test this please

@colings86

This comment has been minimized.

Member

colings86 commented Feb 13, 2018

Jenkins retest this please

@colings86

This comment has been minimized.

Member

colings86 commented Feb 14, 2018

@liketic it looks like there are some test failures left in the documentation tests:

- org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/aggregations/metrics/percentile-aggregation/line_28}
- org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/aggregations/metrics/percentile-aggregation/line_103}

I think the expected outputs in the percentiles aggregation page in the docs just need adjusting. Are you able to make those changes?

@liketic

This comment has been minimized.

Contributor

liketic commented Feb 14, 2018

@colings86 That's my mistake. I pushed 6f65266 to fix the tests. Please trigger jenkins again.

@colings86

This comment has been minimized.

Member

colings86 commented Feb 14, 2018

jenkins please retest this

@liketic

This comment has been minimized.

Contributor

liketic commented Feb 14, 2018

@colings86 The testing is still failed, I'll rebase the PR, WDYT?

@colings86

This comment has been minimized.

Member

colings86 commented Feb 14, 2018

@liketic yes I think rebasing the PR would be good to try here as I think the failure is related to some changes that were made overnight. Thanks

@liketic liketic force-pushed the liketic:upgrade-t-digest branch from 6f65266 to a307bdd Feb 14, 2018

@colings86

This comment has been minimized.

Member

colings86 commented Feb 14, 2018

jenkins test this please

@colings86 colings86 merged commit fc406c9 into elastic:master Feb 15, 2018

2 checks passed

CLA Commit author has signed the CLA
Details
elasticsearch-ci Build finished.
Details

@liketic liketic deleted the liketic:upgrade-t-digest branch Feb 15, 2018

colings86 added a commit that referenced this pull request Feb 15, 2018

@colings86

This comment has been minimized.

Member

colings86 commented Feb 15, 2018

Merged to master and backported to 6.x. Thanks for working on this @liketic and for sticking with it through all the build failures 😄

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 15, 2018

Merge branch 'master' into async-errors
* master:
  Add a note to the docs that _cat api `help` option cannot be used if an optional url param is used (elastic#28686)
  Lift error finding utility to exceptions helpers
  Change "tweet" type to "_doc" (elastic#28690)
  [Docs] Add missing word in nested.asciidoc (elastic#28507)
  Simplify the Translog constructor by always expecting an existing translog (elastic#28676)
  Upgrade t-digest to 3.2 (elastic#28295) (elastic#28305)
  Add comment explaining lazy declared versions

@mfussenegger mfussenegger referenced this pull request Oct 24, 2018

Merged

Update t-digest to 3.2 #7804

4 of 4 tasks complete

ruflin added a commit to ruflin/Elastica that referenced this pull request Nov 7, 2018

update docker-compose to 3.4, update ES version & bug on new ES versi…
…on on Percentiles aggregations (#1531)

I've updated the:

- docker-compose to a newer version (3.4)
- updated to ElasticSearch 6.4.x (available on [HERE](https://www.docker.elastic.co/))
- Updated Percentiles Aggregation Tests as the T-Digest algorithm has been updated and so all the tests failed. Have a look at [the PR on ES](elastic/elasticsearch#28305)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment