[SPARK-55939][SQL] Add built-in DataSketches ItemsSketch (Frequent Items) functions #54745

Open

xiongbo-sjtu wants to merge 1 commit into apache:master from xiongbo-sjtu:master

Conversation

@xiongbo-sjtu

What changes were proposed in this pull request?

This PR adds 6 built-in SQL functions for the Apache DataSketches ItemsSketch (Frequent Items) algorithm, following the same architectural patterns established by the existing Theta, Tuple, and KLL sketch implementations.

Functions added:

| Function | Type | Description |
| --- | --- | --- |
| `items_sketch_agg` | Aggregate | Builds an ItemsSketch from input values |
| `items_sketch_merge_agg` | Aggregate | Merges pre-built ItemsSketch binaries |
| `items_sketch_get_frequent_items` | Scalar | Returns frequent items with frequency bounds |
| `items_sketch_get_estimate` | Scalar | Returns estimated frequency of a specific item |
| `items_sketch_merge` | Scalar | Merges two ItemsSketch binaries |
| `items_sketch_to_string` | Scalar | Returns human-readable sketch summary |

Files changed:

New files (3):

  • sql/catalyst/.../util/ItemsSketchUtils.scala — Utility class for validation, sketch creation, SerDe dispatch
  • sql/catalyst/.../expressions/aggregate/itemsSketchAggregates.scala — ItemsSketchAgg and ItemsSketchMergeAgg aggregate expressions with ExpressionBuilder companions
  • sql/catalyst/.../expressions/itemsSketchExpressions.scala — ItemsSketchSerDeHelper (wire format) plus the ItemsSketchGetFrequentItems, ItemsSketchGetEstimate, ItemsSketchMerge, and ItemsSketchToString scalar expressions

Modified files:

  • sql/catalyst/.../analysis/FunctionRegistry.scala — Register 6 functions (2 via expressionBuilder, 4 via expression)
  • sql/api/.../sql/functions.scala — Public Scala/Java API methods (since 4.2.0)
  • common/utils/.../error/error-conditions.json — 6 new error conditions
  • sql/catalyst/.../errors/QueryExecutionErrors.scala — 6 error factory methods
  • docs/sql-ref-sketch-aggregates.md — Full documentation for all 6 functions
  • python/pyspark/sql/functions/builtin.py — PySpark function wrappers
  • python/pyspark/sql/connect/functions/builtin.py — Spark Connect PySpark wrappers
  • python/pyspark/sql/functions/__init__.py — PySpark exports
  • python/docs/source/reference/pyspark.sql/functions.rst — PySpark docs entries

Test files:

  • sql/core/.../DataFrameAggregateSuite.scala — 9 test cases
  • python/pyspark/sql/tests/test_functions.py — 3 test methods

Auto-generated:

  • sql/core/.../sql-functions/sql-expression-schema.md — Regenerated

Why are the changes needed?

Spark already provides built-in support for HLL, Theta, Tuple, and KLL sketches. The ItemsSketch fills a gap by providing frequency estimation with:

  1. Frequency estimates for any item — ItemsSketch can estimate the frequency of any queried item.
  2. Configurable error guarantees — Users choose between NO_FALSE_POSITIVES (every returned item is truly frequent) and NO_FALSE_NEGATIVES (all truly frequent items are returned).
  3. Mergeable binary representations — Enables multi-level rollup aggregation (e.g., daily sketches merged into weekly/monthly) without re-scanning raw data.
  4. Wide data type support — Boolean, all numeric types, String, Decimal, Date, Timestamp, TimestampNTZ.
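The guarantees above trace back to the Misra-Gries family of frequent-items algorithms on which the DataSketches ItemsSketch is based. The following is a deliberately minimal, illustrative Python model of that idea, an assumption-laden simplification and not the DataSketches implementation (class and field names here are invented for illustration):

```python
# Minimal Misra-Gries-style frequent-items sketch. Illustrative only:
# the real DataSketches ItemsSketch is a more sophisticated relative.
class FrequentItemsSketch:
    def __init__(self, max_map_size):
        self.max_map_size = max_map_size  # analogous to the maxMapSize parameter
        self.counts = {}
        self.offset = 0  # total weight subtracted so far; bounds the error

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.max_map_size:
            self.counts[item] = 1
        else:
            # Map is full: decrement every counter and drop zeros.
            self.offset += 1
            self.counts = {k: v - 1 for k, v in self.counts.items() if v > 1}

    def estimate(self, item):
        # Lower bound on the true count; the true count never
        # exceeds estimate(item) + offset.
        return self.counts.get(item, 0)
```

The point is the bounded-error structure: `estimate` is a guaranteed lower bound, and `estimate + offset` is a guaranteed upper bound, which is what makes point queries for any item (property 1) and principled error guarantees (property 2) possible.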

Does this PR introduce any user-facing change?

Yes. 6 new SQL functions are available:

```sql
-- Build a sketch and get frequent items
SELECT items_sketch_get_frequent_items(
  items_sketch_agg(visitor_id))
FROM sales;

-- Get estimated frequency of a specific item
SELECT items_sketch_get_estimate(
  items_sketch_agg(visitor_id), 'visitor_8')
FROM sales;

-- Merge pre-aggregated sketches across time periods
SELECT items_sketch_get_frequent_items(
  items_sketch_merge_agg(daily_sketch), 'NO_FALSE_POSITIVES')
FROM daily_visitor_sketches;
```

How was this patch tested?

  • 9 Scala test cases in DataFrameAggregateSuite:

    1. Basic items_sketch_agg + items_sketch_get_frequent_items
    2. items_sketch_get_estimate for existing and non-existent items
    3. items_sketch_merge (scalar merge of two sketches)
    4. items_sketch_merge_agg (aggregate merge)
    5. Custom maxMapSize parameter
    6. Numeric data types (Int, Long, Double)
    7. Null handling (all nulls, mixed nulls, single row)
    8. Small capacity approximation behavior + error type comparison
    9. Comprehensive rollup test: 7-day visitor data with dimension hierarchy, verifying daily vs weekly top visitors diverge as expected
  • 3 Python test methods covering all functions, small capacity, and null handling.

  • Golden file regenerated via SPARK_GENERATE_GOLDEN_FILES=1.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: claude-sonnet-4

@xiongbo-sjtu force-pushed the master branch 8 times, most recently from 9981750 to 24f60f7 on March 12, 2026 at 05:53
…ems) functions

```bash
build/mvn install -DskipTests -am -pl core

SPARK_GENERATE_GOLDEN_FILES=1 build/mvn test -pl core \
  -Dtest=none \
  -Dsuites=org.apache.spark.SparkThrowableSuite \
  -Dtests="Error conditions are correctly formatted" \
  -DfailIfNoTests=false
```

```bash
build/mvn install -DskipTests -am -pl sql/core

SPARK_GENERATE_GOLDEN_FILES=1 build/mvn test -pl sql/core \
  -Dtest=none \
  -Dsuites=org.apache.spark.sql.ExpressionsSchemaSuite \
  -DfailIfNoTests=false
```
@xiongbo-sjtu
Author

@dtenedor @cboumalh Please help review this PR when you get a chance. Thanks!

@cboumalh
Contributor

Hi @xiongbo-sjtu, thanks for working on this! I believe Spark already supports ItemsSketch through approx_top_k, approx_top_k_accumulate, approx_top_k_combine. Do those serve your use case?

@xiongbo-sjtu
Author

xiongbo-sjtu commented Mar 12, 2026

> Hi @xiongbo-sjtu, thanks for working on this! I believe Spark already supports ItemsSketch through approx_top_k, approx_top_k_accumulate, approx_top_k_combine. Do those serve your use case?

Good question! I did look into the approx_top_k family of functions, which is great for the common case where users simply want a quick, easy answer to "what are the most frequent items?" with minimal configuration. It's well-designed for that workflow.

However, it doesn't fully serve our use cases. The example here illustrates a simplified version of our real-world usage. The key gaps we see are:

  • Full sketch introspection: We need to retrieve all frequent items tracked by the sketch with their estimated frequencies and error bounds, as well as query the estimated frequency of a specific item, rather than only extracting a top-k list.
  • Configurable error guarantees: Users can choose between NO_FALSE_POSITIVES (every returned item is truly frequent) and NO_FALSE_NEGATIVES (all truly frequent items are returned), which is important for different analytical scenarios.

We see these two sets of functions as complementary: approx_top_k for simple top-k queries, and the new items_sketch functions for more advanced analytical workflows that need finer control over accuracy, error semantics, and data types.
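One way to picture the two error modes: each tracked item carries a lower and upper bound on its true frequency, and the error type decides which bound is compared against the threshold. The following filter is a hypothetical illustration of that semantics (the tuples and function name are invented; this is not the DataSketches API):

```python
# Illustrative filter over items mapped to hypothetical
# (estimate, lower_bound, upper_bound) tuples on the true frequency.
def frequent_items(items, threshold, error_type):
    if error_type == "NO_FALSE_POSITIVES":
        # Only items whose LOWER bound clears the threshold are returned,
        # so every returned item is guaranteed to be truly frequent.
        return [name for name, (est, lb, ub) in items.items() if lb > threshold]
    if error_type == "NO_FALSE_NEGATIVES":
        # Items whose UPPER bound clears the threshold are returned,
        # so no truly frequent item can be missed.
        return [name for name, (est, lb, ub) in items.items() if ub > threshold]
    raise ValueError(f"unknown error type: {error_type}")
```

The same sketch state supports both answers; the error type only changes which side of the uncertainty interval the caller is willing to tolerate.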

@xiongbo-sjtu
Author

The main motivation is architectural consistency. Every other sketch family in Spark (HLL, Theta, Tuple, KLL) follows the *_sketch_agg / *_merge_agg / scalar query pattern with opaque BinaryType output designed for table storage and multi-level rollup. The approx_top_k family uses a different pattern (struct-based state, separate accumulate/combine/estimate) that predates this convention. The items_sketch functions bring frequent-items into the same consistent API shape, with a self-describing binary wire format suitable for persisting in tables and merging across time horizons without re-scanning raw data.
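The merge step behind such rollups can be pictured with a simplified counter model (an illustration under stated assumptions, not Spark's ItemsSketch wire format or merge logic): counters add item-wise, error offsets add, and the result is pruned back to capacity.

```python
# Simplified model of merging two frequent-items states. The (counts, offset)
# representation is an assumption for illustration, not the real wire format.
def merge(counts_a, offset_a, counts_b, offset_b, max_map_size):
    # Counters add item-wise; error offsets add under merge.
    merged = dict(counts_a)
    for item, c in counts_b.items():
        merged[item] = merged.get(item, 0) + c
    offset = offset_a + offset_b
    # Prune back to capacity by repeatedly subtracting the smallest count,
    # folding the subtracted weight into the error offset.
    while len(merged) > max_map_size:
        m = min(merged.values())
        offset += m
        merged = {k: v - m for k, v in merged.items() if v > m}
    return merged, offset
```

Because merging never needs the raw data, daily states can be persisted and later combined into weekly or monthly views, which is the rollup pattern the binary `*_merge_agg` functions are designed for.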

That said, if reviewers feel strongly that we should enhance the existing approx_top_k functions instead (e.g., adding point frequency queries, binary output format, etc.), I'm open to that direction. Happy to discuss which approach the community prefers.

@xiongbo-sjtu
Author

@dtenedor

Hi Daniel, I'd appreciate your review on this PR whenever you have a moment. Thanks!
