[Feature] Add Apache DataSketches sketch aggregation support #3331

### Motivation

Fluss aggregation merge-engine tables currently support exact numeric, string, boolean, value-selection, and RoaringBitmap byte aggregations. The documented aggregation functions do not include native approximate distinct-count or approximate percentile/quantile aggregations for continuously updated analytical metrics.

Approximate distinct counts are useful for workloads such as daily active users, unique devices, unique sessions, and feature-store metrics. Approximate percentiles are useful for request latency, model feature distributions, transaction amounts, and service quality metrics. Exact `COUNT(DISTINCT ...)` and exact percentile computation can be expensive to maintain incrementally because the state grows with the number of values seen, while sketches provide compact, mergeable state with bounded approximation error.

Apache DataSketches provides sketch implementations with defined serialization and merge semantics. HLL and KLL sketches fit Fluss's aggregation merge-engine model because Fluss can store serialized sketch bytes and merge them per primary key, similar to the existing `rbm32` and `rbm64` aggregation functions for serialized RoaringBitmaps.

### Solution

Add Apache DataSketches-based aggregation merge functions for serialized sketch bytes.

Proposed merge-engine function names:

- `hll_sketch`: approximate distinct count via serialized Apache DataSketches HLL sketches.
- `kll_double_sketch`: approximate percentile/quantile support via serialized Apache DataSketches KLL double sketches.

Both functions should:

- Support only `BYTES` columns.
- Treat input values as serialized Apache DataSketches sketches.
- Merge two non-null sketches using the corresponding DataSketches merge/union API.
- Return compact serialized sketch bytes.
- Ignore nulls consistently with existing byte aggregations such as `rbm32` and `rbm64`.
- Validate function options through the existing aggregation function parameter validation path.
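
The null-handling and merge rules above can be sketched in a few lines of Java. This is a minimal, self-contained illustration rather than Fluss or DataSketches code: the class name `NullIgnoringSketchMerge`, the `merge` signature, and the `registerMax` stand-in union are all hypothetical. The example assumes that "ignore nulls" means a null side leaves the other side unchanged. The stand-in union takes a per-register maximum, which is how classic HyperLogLog registers merge, but it does not read or write the DataSketches serialization format. A real aggregator would instead deserialize both inputs, call the DataSketches union/merge API, and return the compact serialized result.

```java
import java.util.Arrays;
import java.util.function.BinaryOperator;

/** Hypothetical illustration of the proposed merge contract; not Fluss or DataSketches code. */
final class NullIgnoringSketchMerge {

    /**
     * Merges the stored value with an incoming value.
     * Assumption: a null side is ignored and the other side is returned unchanged.
     */
    static byte[] merge(byte[] accumulator, byte[] input, BinaryOperator<byte[]> union) {
        if (input == null) {
            return accumulator; // ignore null input; the state may itself still be null
        }
        if (accumulator == null) {
            return input; // the first non-null sketch becomes the state
        }
        return union.apply(accumulator, input); // both non-null: delegate to the sketch union
    }

    /**
     * Stand-in union: per-register maximum over equal-length byte arrays.
     * This only mimics how HyperLogLog registers combine; it is NOT the DataSketches wire format.
     */
    static byte[] registerMax(byte[] left, byte[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("register arrays must have the same length");
        }
        byte[] out = new byte[left.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Math.max(left[i], right[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] a = {1, 4, 0, 2};
        byte[] b = {3, 1, 0, 5};
        BinaryOperator<byte[]> union = NullIgnoringSketchMerge::registerMax;

        System.out.println(Arrays.toString(merge(a, b, union)));    // [3, 4, 0, 5]
        System.out.println(Arrays.toString(merge(a, null, union))); // [1, 4, 0, 2]
        System.out.println(Arrays.toString(merge(null, b, union))); // [3, 1, 0, 5]
    }
}
```

Passing the union as a `BinaryOperator<byte[]>` keeps the null handling identical for `hll_sketch` and `kll_double_sketch`; only the deserialize-merge-serialize step would differ between the two.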

Example DDL:

```sql
CREATE TABLE daily_metrics (
    metric_day STRING,
    user_hll BYTES,
    latency_kll BYTES,
    PRIMARY KEY (metric_day) NOT ENFORCED
) WITH (
    'table.merge-engine' = 'aggregation',
    'fields.user_hll.agg' = 'hll_sketch',
    'fields.latency_kll.agg' = 'kll_double_sketch'
);
```

The storage-level merge functions should operate on pre-serialized sketch bytes. Unless separate SQL helper functions are added, users would still need to build the HLL/KLL sketches in a client or compute engine and write the serialized bytes themselves.

Suggested implementation order:

1. Add `hll_sketch` first, as the smallest self-contained change, to cover distinct counts.
2. Add `kll_double_sketch` as a follow-up for percentile/quantile workloads.
3. Add Flink SQL helper functions such as `fluss_hll_build(value)`, `fluss_hll_estimate(bytes)`, `fluss_kll_double_build(value)`, and `fluss_kll_double_quantile(bytes, quantile)`.
4. Add Spark SQL helper functions with the same semantics.

### Anything else?

The implementation can follow the existing RoaringBitmap aggregation pattern:

- Add new entries such as `HLL_SKETCH` and `KLL_DOUBLE_SKETCH` to `AggFunctionType`.
- Add Java API factory methods such as `AggFunctions.HLL_SKETCH()` and `AggFunctions.KLL_DOUBLE_SKETCH()`.
- Add server-side field aggregators and SPI factories.
- Add unit tests in the aggregation row merger test suite.
- Add Flink parser coverage through the existing `fields.<column>.agg` path.
- Document the functions in the aggregation merge-engine docs.

Dependency and packaging notes:

- DataSketches Java depends on DataSketches Memory, so the build and binary license metadata should be checked.
- `LICENSE-bin` already mentions Apache DataSketches, but the actual module dependencies still need to be added and verified.

I searched existing issues for DataSketches, HLL, KLL, `approx_count_distinct`, and `approx_percentile` and did not find a matching Fluss feature request.

I am willing to submit a PR for the first `hll_sketch` implementation after maintainers confirm the API names and scope. I would keep the first PR focused and leave `kll_double_sketch` plus Flink/Spark SQL helper functions for follow-up PRs under this feature direction.

