Add UltraLogLog support #11835

Merged (7 commits) on Oct 24, 2023

@andimiller andimiller commented Oct 19, 2023

UltraLogLog (ULL) is a variant of HyperLogLog (HLL) from Dynatrace, with an implementation available in hash4j under the Apache license.

This PR adds support for using it wherever you would use HLL in Pinot.

When it is used against normal Java types, wyhash 4 is the default hashing algorithm. When you bring your own serialized sketches, you can use any hashing algorithm.

  • supports UltraLogLog as a data type
    • the serialization format writes the P value as the first byte of the data stream, so the sketch width is known when reading from a stream
  • adds DistinctCountULL and DistinctCountRawULL for use in SQL
    • the raw variant outputs Base64-encoded bytes that can be fed into UltraLogLog.wrap in other services
  • adds star-tree support
  • adds merge rollup support
  • adds new transform functions
    • toULL turns data into a ULL
    • fromULL lets you import ULLs that were encoded as byte arrays outside Pinot
Release Notes

  • Added UltraLogLog aggregations for Count Distinct (distinctCountULL and distinctCountRawULL)
  • Added UltraLogLog creation via Transform Function
  • Added UltraLogLog merging in MergeRollup
  • Added support for UltraLogLog in Star-Tree indexes


codecov-commenter commented Oct 19, 2023

Codecov Report

Merging #11835 (82387ca) into master (ecac6c9) will decrease coverage by 0.07%.
The diff coverage is 29.37%.

@@             Coverage Diff              @@
##             master   #11835      +/-   ##
============================================
- Coverage     62.87%   62.80%   -0.07%     
+ Complexity     1141     1140       -1     
============================================
  Files          2367     2373       +6     
  Lines        127888   128207     +319     
  Branches      19732    19787      +55     
============================================
+ Hits          80414    80525     +111     
- Misses        41752    41958     +206     
- Partials       5722     5724       +2     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 62.75% <29.37%> (-0.08%) ⬇️ |
| java-21 | 62.68% <29.37%> (-0.07%) ⬇️ |
| skip-bytebuffers-false | 62.77% <29.37%> (-0.09%) ⬇️ |
| skip-bytebuffers-true | 62.66% <29.37%> (-0.08%) ⬇️ |
| temurin | 62.80% <29.37%> (-0.07%) ⬇️ |
| unittests | 62.80% <29.37%> (-0.07%) ⬇️ |
| unittests1 | 66.78% <29.37%> (-0.11%) ⬇️ |
| unittests2 | 14.39% <0.00%> (-0.05%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...he/pinot/core/function/scalar/SketchFunctions.java | 59.33% <100.00%> (+1.69%) ⬆️ |
| ...al/aggregator/DistinctCountULLValueAggregator.java | 100.00% <100.00%> (ø) |
| ...gment/local/aggregator/ValueAggregatorFactory.java | 83.33% <100.00%> (+0.98%) ⬆️ |
| ...he/pinot/segment/local/utils/CustomSerDeUtils.java | 45.26% <100.00%> (+6.43%) ⬆️ |
| ...he/pinot/segment/local/utils/TableConfigUtils.java | 72.01% <ø> (ø) |
| ...he/pinot/segment/local/utils/UltraLogLogUtils.java | 100.00% <100.00%> (ø) |
| ...va/org/apache/pinot/spi/utils/CommonConstants.java | 28.00% <ø> (ø) |
| ...org/apache/pinot/core/common/ObjectSerDeUtils.java | 90.01% <92.85%> (+0.18%) ⬆️ |
| ...gregation/function/AggregationFunctionFactory.java | 81.51% <50.00%> (-0.31%) ⬇️ |
| .../processing/aggregator/ValueAggregatorFactory.java | 30.00% <0.00%> (-3.34%) ⬇️ |

... and 5 more

... and 12 files with indirect coverage changes


@xiangfu0
Contributor

Can you rebase to latest?

@andimiller
Contributor Author

> Can you rebase to latest?

done, updated


@xiangfu0 xiangfu0 left a comment


lgtm, otherwise

pinot-segment-local/pom.xml (outdated, resolved)
pinot-segment-local/pom.xml (outdated, resolved)
@xiangfu0 xiangfu0 merged commit 70ac4b6 into apache:master Oct 24, 2023
21 checks passed
@xiangfu0 xiangfu0 added the feature and release-notes labels Oct 24, 2023
@xiangfu0
Contributor

Thanks for the contribution!
Please add a release notes section in the PR description as well!
