Skip to content

Conversation

@Myracle
Copy link
Contributor

@Myracle Myracle commented Feb 9, 2026

What is the purpose of the change

This pull request extends the APPROX_COUNT_DISTINCT aggregate function to support streaming mode, including Window TVF (TUMBLE, HOP, CUMULATE). Previously, this function was only available in batch mode. The implementation uses the HyperLogLog++ algorithm to provide approximate distinct counting with approximately 1% relative standard error while using constant memory (approximately 16KB). This enables efficient cardinality estimation in streaming scenarios where exact counts are not required.

Brief change log

  • Added a new unified ApproxCountDistinctAggFunctions class that supports both batch and streaming modes
  • Deprecated the existing BatchApproxCountDistinctAggFunctions class for backward compatibility
  • Updated AggFunctionFactory to use the new implementation and added validation to reject retraction scenarios in non-windowed streaming aggregations
  • Implemented merge() method required for Window TVF aggregation
  • Optimized timestamp hashing to support high-precision TIMESTAMP values (nanosecond precision)
  • Added SQL function documentation in both English and Chinese
  • Added release notes for Flink 2.2

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change added tests and can be verified as follows:

  • Added ApproxCountDistinctAggFunctionTest with unit tests covering all supported data types (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, DATE, TIME, TIMESTAMP, TIMESTAMP_LTZ, VARCHAR/CHAR) and merge functionality
  • Added StreamApproxCountDistinctITCase with integration tests for:
    • TUMBLE window aggregation
    • HOP window aggregation
    • CUMULATE window aggregation
    • Different data types in streaming mode
    • NULL value handling
    • Precision validation with 10,000 records (error rate < 2%)
  • Added BatchApproxCountDistinctITCase with integration tests for:
    • All supported data types in batch mode
    • GROUP BY aggregation
    • NULL value handling
    • Empty table handling
    • Precision validation with 10,000 records

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs / JavaDocs
    • Added function documentation in docs/data/sql_functions.yml and docs/data/sql_functions_zh.yml
    • Added comprehensive JavaDocs in ApproxCountDistinctAggFunctions class

@flinkbot
Copy link
Collaborator

flinkbot commented Feb 9, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

…ate function in streaming mode with Window TVF
@Myracle Myracle force-pushed the FLINK-39051-streaming-APPROX_COUNT_DISTINCT branch from cce7639 to 7ea8bd5 Compare February 9, 2026 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants