Hello everybody, My name is Tommaso and I'm a Data Platform Engineer at Bumble Inc.
I would like to contribute to Apache Pinot by implementing an HLL merge function as an aggregation option for the minion MergeRollUpTask.
What needs to be done?
Currently the minion MergeRollUpTask only supports aggregation operations such as Min, Max, and Sum.
I would like to add HLLMerge as an aggregation function, to aggregate serialised HLL estimator objects when performing a rollup.
Use cases
At Bumble we rely on HLL for cardinality estimation at large scale, and it would be of great value if Apache Pinot could roll up segments that store serialised HLL counters in BYTES columns. This would allow us to keep different time granularities in a single table, enabling longer retention and faster queries over large time windows that use HLL cardinality estimation.
Initial Proposal
The aggregation function will be a class implementing the org.apache.pinot.core.segment.processing.aggregator.ValueAggregator interface. This aggregation function will only support BYTES metric columns where com.clearspring.analytics.stream.cardinality.HyperLogLog objects are serialised. The operation will merge two existing HyperLogLog objects into a single resulting HyperLogLog object.
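To make the proposal concrete, here is a minimal self-contained sketch of what such an aggregator could look like. The ValueAggregator interface below is a local stand-in for the real org.apache.pinot.core.segment.processing.aggregator.ValueAggregator (whose exact signature may differ across Pinot versions), and the merge itself operates on toy dense register arrays rather than the clearspring wire format; a real implementation would deserialise with HyperLogLog.Builder.build(byte[]) and merge via addAll(). The element-wise max over registers is, however, exactly the semantics of an HLL union.

```java
import java.util.Arrays;

// Stand-in for org.apache.pinot.core.segment.processing.aggregator.ValueAggregator;
// the real interface's exact shape may differ across Pinot versions.
interface ValueAggregator {
    Object aggregate(Object value1, Object value2);
}

// Hypothetical aggregator: merges two serialised dense-HLL register arrays by
// taking the element-wise maximum, which is how a HyperLogLog union works.
// A real implementation would deserialise the BYTES values into
// com.clearspring.analytics.stream.cardinality.HyperLogLog objects instead.
class HllMergeValueAggregator implements ValueAggregator {
    @Override
    public Object aggregate(Object value1, Object value2) {
        byte[] a = (byte[]) value1;
        byte[] b = (byte[]) value2;
        if (a.length != b.length) {
            // Both estimators must use the same precision (same register count).
            throw new IllegalArgumentException("HLL register counts must match");
        }
        byte[] merged = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = (byte) Math.max(a[i], b[i]); // union = per-register max
        }
        return merged;
    }
}

public class HllMergeSketch {
    public static void main(String[] args) {
        ValueAggregator agg = new HllMergeValueAggregator();
        byte[] merged = (byte[]) agg.aggregate(
                new byte[]{1, 4, 0, 2}, new byte[]{3, 2, 5, 2});
        System.out.println(Arrays.toString(merged)); // [3, 4, 5, 2]
    }
}
```

The key property this relies on is that HLL union is lossless: merging two estimators gives exactly the estimator you would have built from the union of the underlying data, so rollup does not degrade accuracy.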
Limitations and Questions
It could also be useful to roll up column types like STRING, INT, DOUBLE, LONG, FLOAT, or MV into a BYTES column containing an HLL. This is the kind of rollup we already do at Bumble with Apache Spark before ingesting data into Pinot. It would let users ingest raw data and leave the rollup to the minions, with the option of having an HLL distinct count as a metric.
I'm afraid this would not be possible, since MergeRollUpTask can't change a column's type to BYTES to store the HLL estimator. Has anyone ever thought about this?