use datasketches-java 4.2.0#15257
Merged
AlexanderSaydakov merged 9 commits intomasterfrom Oct 26, 2023
Merged
Conversation
...-query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollectorFactory.java
Fixed
Show fixed
Hide fixed
...-query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollectorFactory.java
Fixed
Show fixed
Hide fixed
cryptoe
approved these changes
Oct 26, 2023
Contributor
cryptoe
left a comment
There was a problem hiding this comment.
Changes LGTM.
Thanks @AlexanderSaydakov @gianm @adarshsanjeev for pitching in for this druid 28 blocker.
LakshSingla
pushed a commit
to LakshSingla/druid
that referenced
this pull request
Oct 27, 2023
* use datasketches-java 4.2.0 * use exclusive mode * fixed issues raised by CodeQL * fixed issue raised by spotbugs * fixed issues raised by intellij * added missing import * Update QuantilesSketchKeyCollector search mode and adjust tests. * Update sizeOf functions and add unit tests * Add unit tests --------- Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
cryptoe
pushed a commit
that referenced
this pull request
Oct 27, 2023
Backport of : #15267 --------- Co-authored-by: Alexander Saydakov <13126686+AlexanderSaydakov@users.noreply.github.com> Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
CaseyPan
pushed a commit
to CaseyPan/druid
that referenced
this pull request
Nov 17, 2023
* use datasketches-java 4.2.0 * use exclusive mode * fixed issues raised by CodeQL * fixed issue raised by spotbugs * fixed issues raised by intellij * added missing import * Update QuantilesSketchKeyCollector search mode and adjust tests. * Update sizeOf functions and add unit tests * Add unit tests --------- Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is to use the latest datasketches-java version 4.2.0.
This was supposed to be a minor version change, but inadvertently some API changes were introduced. Therefore I had to implement a few new required methods in the custom ArrayOfStringTuplesSerDe. They will need a careful review since I am not entirely sure I understood the serial format correctly.
Also one test is currently failing. I don’t understand the purpose of this test. It is called preservesMinAndMaxWhenAssumeGroupedFalse. I have no idea what does this mean. However it asks a quantile sketch to partition 66 items into 66 partitions and expects exactly one item in each. If we allow even slightest error (and sketches are approximate) we can get some partitions with 2 items and some empty ones. So with deduplication it leads to fewer partitions.
This change in behavior from 4.1.0 to 4.2.0 is unfortunate, but not incorrect. This is a degenerate use case. I would think that a better test could generate, say, 1000 items, ask for 10 partitions and assert that partitions have 100+-2 items or something like that. Perhaps this behavior with very small partitions can be improved in the next version, but for now I would suggest using 4.2.0 and changing this test somehow.