Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fork tdigest library #95903

Closed
martijnvg opened this issue May 8, 2023 · 5 comments
Closed

Fork tdigest library #95903

martijnvg opened this issue May 8, 2023 · 5 comments
Assignees
Labels
:Analytics/Aggregations Aggregations :Delivery/Build Build or test infrastructure >refactoring Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Delivery Meta label for Delivery team

Comments

@martijnvg
Copy link
Member

martijnvg commented May 8, 2023

We plan to for the tdigest library.

There are two main reasons behind this choice:

  1. We would like to control semantic version and backward compatibility according to our definition. Right now, for instance, TDigest does not match our usage of semantic versioning when changing the library code and that makes upgrading quite challenging because exposes us to backward compatibility issues.
  2. We would like to change those libraries to use some specific Elasticsearch libraries/tools/frameworks such as BigArrays. Right now when running some aggregations (percentiles, boxplot,...) we experience OOMs due to large memory usage. Using BigArrays, for instance, would allow us to deal with OOMs using Circuit Breakers.

The immediate goal of this issue is to fork the library and then at a later stage enhance to forked library to make use of the big arrays infrastructure.

Currently t-digest version 3.2 is used. The current version is 3.3 We have been locked to the 3.2 version because of at least one breaking change (how p50 is computed). The plan is to fork from the latest commit and change the forked library such that the results it produces are similar to that of version 3.2.

@martijnvg martijnvg added :Delivery/Build Build or test infrastructure :Analytics/Aggregations Aggregations >refactoring labels May 8, 2023
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Delivery Meta label for Delivery team labels May 8, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira
Copy link
Contributor

@martijnvg @wchaparro what's the priority on this. Is the upgrade to 3.3 blocking something that is targeting a particular Elasticsearch release?

@martijnvg
Copy link
Member Author

@mark-vieira We're not blocked by other items. The forking has a high priority for us and we like to get this in with the 8.9 release. We plan to fork it as a library under the top-level libs directory, similar to how h3 was forked.

The integration with big arrays is currently not a blocker. We do like to adopt big arrays in the forked library, but that will be done at a later point of time and has a lower priority than the forking itself.

elasticsearchmachine pushed a commit that referenced this issue Jun 19, 2023
The asserts were misfiring in this test:

```
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.search.aggregations.metrics.InternalMedianAbsoluteDeviationTests.testReduceRandom" -Dtests.seed=AA1D81AD056870F0 -Dtests.locale=en-CA -Dtests.timezone=US/Eastern -Druntime.java=20

org.elasticsearch.search.aggregations.metrics.InternalMedianAbsoluteDeviationTests > testReduceRandom FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([AA1D81AD056870F0:6A202DBC2335E9A6]:0)
        at org.elasticsearch.tdigest.MergingDigest.merge(MergingDigest.java:316)
        at org.elasticsearch.tdigest.MergingDigest.mergeNewValues(MergingDigest.java:298)
        at org.elasticsearch.tdigest.MergingDigest.mergeNewValues(MergingDigest.java:288)
        at org.elasticsearch.tdigest.MergingDigest.quantile(MergingDigest.java:485)
        at org.elasticsearch.tdigest.HybridDigest.quantile(HybridDigest.java:141)
        at org.elasticsearch.search.aggregations.metrics.TDigestState.quantile(TDigestState.java:247)
        at org.elasticsearch.search.aggregations.metrics.InternalMedianAbsoluteDeviation.computeMedianAbsoluteDeviation(InternalMedianAbsoluteDeviation.java:38)
        at org.elasticsearch.search.aggregations.metrics.InternalMedianAbsoluteDeviation.<init>(InternalMedianAbsoluteDeviation.java:48)
        at org.elasticsearch.search.aggregations.metrics.InternalMedianAbsoluteDeviationTests.createTestInstance(InternalMedianAbsoluteDeviationTests.java:33)
        at org.elasticsearch.search.aggregations.metrics.InternalMedianAbsoluteDeviationTests.createTestInstance(InternalMedianAbsoluteDeviationTests.java:23)
```

They should have been removed earlier, it's possible for the first and
the last centroid to have weight > 1.

Related to #95903
kkrik-es added a commit to kkrik-es/elasticsearch that referenced this issue Jun 20, 2023
This was replaced by optional serialization in version
V_8_500_018 (elastic#96943), breaking backwards compatibility for serverless.

Related to elastic#95903
kkrik-es added a commit that referenced this issue Jun 20, 2023
This was replaced by optional serialization in version
V_8_500_018 (#96943), breaking backwards compatibility for serverless.

Related to #95903
elasticsearchmachine pushed a commit that referenced this issue Jun 21, 2023
Setting `experiment_hint` to `high_accuracy` leads to using a slower but
more accurate implementation for TDigest.

Add YAML tests to cover for settings this param properly and for parsing
errors.

Related to #95903
@kkrik-es
Copy link
Contributor

This is now complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations :Delivery/Build Build or test infrastructure >refactoring Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Delivery Meta label for Delivery team
Projects
None yet
Development

No branches or pull requests

4 participants