[Feature] Add Apache DataSketches sketch aggregation support #3331

@rahulbsw

Motivation

Fluss aggregation merge-engine tables currently support exact numeric, string, boolean, value-selection, and RoaringBitmap byte aggregations. The aggregation documentation does not expose native approximate distinct-count or approximate percentile/quantile aggregations for continuously updated analytical metrics.

Approximate distinct counts are useful for workloads such as daily active users, unique devices, unique sessions, and feature-store metrics. Approximate percentiles are useful for request latency, model feature distributions, transaction amounts, and service quality metrics. Exact COUNT(DISTINCT ...) and exact percentile computation can be expensive to maintain incrementally because the state grows with the number of values seen, while sketches provide compact, mergeable state.

Apache DataSketches provides sketch implementations with defined serialization and merge semantics. HLL and KLL sketches fit Fluss' aggregation merge-engine model because Fluss can store serialized sketch bytes and merge them per primary key, similar to the existing rbm32 and rbm64 aggregation functions for serialized RoaringBitmaps.

Solution

Add Apache DataSketches-based aggregation merge functions for serialized sketch bytes.

Proposed merge-engine function names:

  • hll_sketch: approximate distinct count via serialized Apache DataSketches HLL sketches.
  • kll_double_sketch: approximate percentile/quantile support via serialized Apache DataSketches KLL double sketches.

Both functions should:

  • Support only BYTES columns.
  • Treat input values as serialized Apache DataSketches sketches.
  • Merge two non-null sketches using the corresponding DataSketches merge/union API.
  • Return compact serialized sketch bytes.
  • Ignore nulls consistently with existing byte aggregations such as rbm32 and rbm64.
  • Validate function options through the existing aggregation function parameter validation path.
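
The null-handling part of this contract can be sketched independently of any sketch library. The class below is a hypothetical illustration, not existing Fluss code: `SketchBytesMerger` and its `merge` method are names invented here, and the union step is passed in as a plain `BinaryOperator<byte[]>` so the example runs with the JDK alone. It assumes "ignore nulls" means that a null input leaves the accumulator unchanged and a null accumulator adopts the input, which is one reading of the rbm32/rbm64 behavior and should be confirmed against the existing aggregators.

```java
import java.util.function.BinaryOperator;

/**
 * Hypothetical null-ignoring wrapper around a sketch union step.
 * Not part of the Fluss API; names are for illustration only.
 */
final class SketchBytesMerger {
    // The union of two non-null serialized sketches, e.g. backed by a DataSketches union.
    private final BinaryOperator<byte[]> union;

    SketchBytesMerger(BinaryOperator<byte[]> union) {
        this.union = union;
    }

    /** Merges an incoming serialized sketch into the accumulated one, skipping nulls. */
    byte[] merge(byte[] accumulated, byte[] incoming) {
        if (incoming == null) {
            return accumulated; // assumption: a null input does not change the accumulator
        }
        if (accumulated == null) {
            return incoming; // assumption: the first non-null input becomes the accumulator
        }
        return union.apply(accumulated, incoming);
    }
}
```

Keeping the null policy separate from the union call would let hll_sketch and kll_double_sketch share it, with only the library-specific union differing between the two.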

Example DDL:

CREATE TABLE daily_metrics (
    metric_day STRING,
    user_hll BYTES,
    latency_kll BYTES,
    PRIMARY KEY (metric_day) NOT ENFORCED
) WITH (
    'table.merge-engine' = 'aggregation',
    'fields.user_hll.agg' = 'hll_sketch',
    'fields.latency_kll.agg' = 'kll_double_sketch'
);

The storage-level merge functions should operate on pre-serialized sketch bytes. Users would still write serialized HLL/KLL sketch bytes from a client or compute engine unless separate SQL helper functions are added.
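
As a rough illustration of what a client would write and what the server-side merge would do with it, the sketch below uses the DataSketches Java HLL classes (`HllSketch` and `Union` in `org.apache.datasketches.hll`). It assumes the datasketches-java library is on the classpath. The class name `HllBytesExample`, its method names, and the choice of lgK = 12 are assumptions made for this example and are not part of the proposal.

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

/** Hypothetical helper showing the HLL byte round trip; not Fluss code. */
final class HllBytesExample {
    // Assumed sketch size: lgK = 12 means 2^12 = 4096 buckets.
    static final int LG_K = 12;

    /** Client side: build a sketch from raw values and serialize it for a BYTES column. */
    static byte[] build(String... userIds) {
        HllSketch sketch = new HllSketch(LG_K);
        for (String id : userIds) {
            sketch.update(id);
        }
        return sketch.toCompactByteArray();
    }

    /** Merge side: union two non-null serialized sketches and return compact bytes. */
    static byte[] merge(byte[] left, byte[] right) {
        Union union = new Union(LG_K);
        union.update(HllSketch.heapify(left));
        union.update(HllSketch.heapify(right));
        return union.getResult().toCompactByteArray();
    }

    /** Read side: estimate the distinct count from serialized bytes. */
    static double estimate(byte[] bytes) {
        return HllSketch.heapify(bytes).getEstimate();
    }
}
```

For example, merging `build("u1", "u2")` with `build("u2", "u3")` and passing the result to `estimate` should yield a value close to 3, since the union counts the shared `u2` once. How lgK is chosen, and whether it becomes a function option, is left open here.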

Suggested implementation order:

  1. Add hll_sketch first as the smallest distinct-count path.
  2. Add kll_double_sketch as a follow-up for percentile/quantile workloads.
  3. Add Flink SQL helper functions such as fluss_hll_build(value), fluss_hll_estimate(bytes), fluss_kll_double_build(value), and fluss_kll_double_quantile(bytes, quantile).
  4. Add Spark SQL helper functions with the same semantics.

Anything else?

The implementation can follow the existing RoaringBitmap aggregation pattern:

  • Add new entries such as HLL_SKETCH and KLL_DOUBLE_SKETCH to AggFunctionType.
  • Add Java API factory methods such as AggFunctions.HLL_SKETCH() and AggFunctions.KLL_DOUBLE_SKETCH().
  • Add server-side field aggregators and SPI factories.
  • Add unit tests in the aggregation row merger test suite.
  • Add Flink parser coverage through the existing fields.<column>.agg path.
  • Document the functions in the aggregation merge-engine docs.

Dependency and packaging notes:

  • DataSketches Java depends on DataSketches Memory, so the build and binary license metadata should be checked.
  • LICENSE-bin already mentions Apache DataSketches, but the actual module dependencies still need to be added and verified.

I searched existing issues for DataSketches, HLL, KLL, approx_count_distinct, and approx_percentile and did not find a matching Fluss feature request.

I am willing to submit a PR for the first hll_sketch implementation after maintainers confirm the API names and scope. I would keep the first PR focused and leave kll_double_sketch plus Flink/Spark SQL helper functions for follow-up PRs under this feature direction.
