Hello everybody, My name is Tommaso and I'm a Data Platform Engineer at Bumble Inc.
I would like to contribute to Apache Pinot by implementing an HLL merge function as an aggregation option for the minion MergeRollUpTask.
What needs to be done?
Currently the minion MergeRollUpTask only supports aggregation operations such as Min, Max, and Sum.
I would like to add HLLMerge as an aggregation function, to aggregate serialised HLL estimator objects when performing a rollup.
Use cases
At Bumble we rely on HLL for cardinality estimation at large scale, and it would be of great value if Apache Pinot could roll up segments that store serialised HLL counters in BYTES columns. This would allow us to keep different time granularities in a single table, enabling longer retention and faster queries over large time windows that use HLL cardinality estimation.
Initial Proposal
The aggregation function will be a class implementing the org.apache.pinot.core.segment.processing.aggregator.ValueAggregator interface. This aggregation function will only support BYTES metric columns where com.clearspring.analytics.stream.cardinality.HyperLogLog objects are serialised. The operation will merge two existing HyperLogLog objects into a single resulting HyperLogLog object.
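To make the proposal concrete, here is a minimal self-contained sketch of what such an aggregator could look like. The ValueAggregator interface below is a local stand-in for the real org.apache.pinot.core.segment.processing.aggregator.ValueAggregator (whose exact signature may differ across Pinot versions), and the merge itself operates on toy dense register arrays rather than the clearspring wire format; a real implementation would deserialise with HyperLogLog.Builder.build(byte[]) and merge via addAll(). The element-wise max over registers is, however, exactly the semantics of an HLL union.

```java
import java.util.Arrays;

// Stand-in for org.apache.pinot.core.segment.processing.aggregator.ValueAggregator;
// the real interface's exact shape may differ across Pinot versions.
interface ValueAggregator {
    Object aggregate(Object value1, Object value2);
}

// Hypothetical aggregator: merges two serialised dense-HLL register arrays by
// taking the element-wise maximum, which is how a HyperLogLog union works.
// A real implementation would deserialise the BYTES values into
// com.clearspring.analytics.stream.cardinality.HyperLogLog objects instead.
class HllMergeValueAggregator implements ValueAggregator {
    @Override
    public Object aggregate(Object value1, Object value2) {
        byte[] a = (byte[]) value1;
        byte[] b = (byte[]) value2;
        if (a.length != b.length) {
            // Both estimators must use the same precision (same register count).
            throw new IllegalArgumentException("HLL register counts must match");
        }
        byte[] merged = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = (byte) Math.max(a[i], b[i]); // union = per-register max
        }
        return merged;
    }
}

public class HllMergeSketch {
    public static void main(String[] args) {
        ValueAggregator agg = new HllMergeValueAggregator();
        byte[] merged = (byte[]) agg.aggregate(
                new byte[]{1, 4, 0, 2}, new byte[]{3, 2, 5, 2});
        System.out.println(Arrays.toString(merged)); // [3, 4, 5, 2]
    }
}
```

The key property this relies on is that HLL union is lossless: merging two estimators gives exactly the estimator you would have built from the union of the underlying data, so rollup does not degrade accuracy.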
Limitations and Questions
It could also be useful to roll up column types like STRING, INT, DOUBLE, LONG, FLOAT, or MV into a BYTES column containing an HLL. This is the kind of rollup we already do at Bumble with Apache Spark before ingesting data into Pinot. It would let users ingest raw data and leave the rollup to the minions, with the option of having an HLL distinct count as a metric.
I'm afraid this would not be possible, since MergeRollUpTask can't change a column's type to BYTES to store the HLL estimator. Has anyone ever thought about this?