Minion HLL Merge Rollup aggregation function #9971

Closed
t0mpere opened this issue Dec 12, 2022 · 1 comment
Labels
PEP-Request: Pinot Enhancement Proposal request to be reviewed.

Comments


t0mpere commented Dec 12, 2022

Hello everybody, my name is Tommaso and I'm a Data Platform Engineer at Bumble Inc.
I would like to contribute to Apache Pinot by implementing an HLL merge function as an aggregation option for the Minion MergeRollUpTask.

What needs to be done?

Currently the Minion MergeRollUpTask only supports simple aggregation functions such as Min, Max and Sum.
I would like to add HLLMerge as an aggregation function, so that serialised HLL estimator objects can be merged when performing a rollup.

Use cases

At Bumble we rely on HLL for cardinality estimation at large scale, and it would be of great value if Apache Pinot could roll up segments that store serialised HLL counters in BYTES columns. This would let us keep different time granularities in a single table, which allows longer retention and faster queries over large time windows that use HLL cardinality estimation.

Initial Proposal

The aggregation function will be a class implementing the org.apache.pinot.core.segment.processing.aggregator.ValueAggregator interface. This aggregation function will only support BYTES metric columns where com.clearspring.analytics.stream.cardinality.HyperLogLog objects are serialised. The operation will merge two existing HyperLogLog objects into a single resulting HyperLogLog object.
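A minimal sketch of the merge step, under loud assumptions: `BytesAggregator` is a hypothetical stand-in for the two-values-in, one-value-out shape of a rollup aggregator, not the actual Pinot interface, and the byte layout is a toy one (one register per byte), not the clearspring serialisation format. It only illustrates the core idea that merging two HyperLogLog sketches of the same size amounts to keeping the per-register maximum.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a rollup aggregator: two values in, one merged value out. */
interface BytesAggregator {
  byte[] aggregate(byte[] left, byte[] right);
}

/** Toy HLL merge: each byte is one register, and merging keeps the larger register value. */
final class ToyHllMergeAggregator implements BytesAggregator {
  @Override
  public byte[] aggregate(byte[] left, byte[] right) {
    // Sketches built with different register counts cannot be merged register by register.
    if (left.length != right.length) {
      throw new IllegalArgumentException("sketches must have the same number of registers");
    }
    byte[] merged = new byte[left.length];
    for (int i = 0; i < left.length; i++) {
      merged[i] = (byte) Math.max(left[i], right[i]);
    }
    return merged;
  }

  public static void main(String[] args) {
    byte[] hourOne = {0, 3, 1, 5};
    byte[] hourTwo = {2, 1, 1, 4};
    byte[] rolledUp = new ToyHllMergeAggregator().aggregate(hourOne, hourTwo);
    System.out.println(Arrays.toString(rolledUp)); // prints [2, 3, 1, 5]
  }
}
```

Because the per-register maximum is commutative, associative and idempotent, the order in which rows are merged during a rollup does not affect the result, which is what makes this operation a natural fit alongside Min, Max and Sum.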

Limitations and Questions

It could also be useful to roll up columns of other types, such as STRING, INT, LONG, FLOAT, DOUBLE or multi-value (MV) columns, into a BYTES column holding an HLL. This is the kind of rollup we already do at Bumble with Apache Spark before ingesting data into Pinot. Supporting it would let users ingest raw data and leave the rollup to the minions, with the option of having an HLL distinct count as a metric.
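To make the idea concrete, folding raw values into an estimator and emitting bytes could look roughly like the sketch below. Everything here is an assumption for illustration: `ToyHllBuilder`, its 16-register size, the SplitMix64-style mixer used as a hash, and the one-register-per-byte output are all hypothetical, and none of it is the clearspring or Pinot implementation.

```java
import java.util.Arrays;

/** Toy HLL builder: folds raw values of any type into a fixed-size register array. */
final class ToyHllBuilder {
  private static final int P = 4; // 2^4 = 16 registers, a toy size chosen for readability
  private final byte[] registers = new byte[1 << P];

  /** SplitMix64-style mixer, used here as a stand-in for a proper 64-bit hash. */
  private static long mix(long x) {
    long z = x + 0x9E3779B97F4A7C15L;
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  /** Accepts a raw value (String, Integer, Long, Double, ...) and updates one register. */
  void offer(Object raw) {
    long hash = mix(raw.hashCode());
    int index = (int) (hash >>> (64 - P)); // top P bits select the register
    long rest = hash << P;                 // remaining bits, left-aligned
    // Rank is the position of the first 1-bit in the remaining bits, capped at 64 - P + 1.
    int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
    if (rank > registers[index]) {
      registers[index] = (byte) rank;
    }
  }

  /** Returns a copy of the registers, i.e. the bytes a BYTES metric column would hold. */
  byte[] toBytes() {
    return registers.clone();
  }

  public static void main(String[] args) {
    ToyHllBuilder builder = new ToyHllBuilder();
    for (Object raw : new Object[] {"user-1", "user-2", 42, 3.14, "user-1"}) {
      builder.offer(raw);
    }
    System.out.println(Arrays.toString(builder.toBytes()));
  }
}
```

Duplicates and ordering do not change the resulting bytes, so a rollup built this way depends only on the set of distinct raw values it has seen.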

I'm afraid this would not be possible, because MergeRollUpTask can't change the type of a column to BYTES in order to store the HLL estimator. Has anyone ever thought about this?

snleee added the PEP-Request label Dec 12, 2022
t0mpere closed this as completed Apr 17, 2023
t0mpere reopened this Apr 17, 2023

t0mpere commented Apr 17, 2023

This functionality was implemented in #10328 :)

t0mpere closed this as completed Apr 17, 2023