[SPARK-55939][SQL] Add built-in DataSketches ItemsSketch (Frequent Items) functions #54745

Open

xiongbo-sjtu wants to merge 1 commit into apache:master from xiongbo-sjtu:master

Conversation

@xiongbo-sjtu

What changes were proposed in this pull request?

This PR adds 6 built-in SQL functions for the Apache DataSketches ItemsSketch (Frequent Items) algorithm, following the same architectural patterns established by the existing Theta, Tuple, and KLL sketch implementations.

Functions added:

| Function | Type | Description |
| --- | --- | --- |
| `items_sketch_agg` | Aggregate | Builds an ItemsSketch from input values |
| `items_sketch_merge_agg` | Aggregate | Merges pre-built ItemsSketch binaries |
| `items_sketch_get_frequent_items` | Scalar | Returns frequent items with frequency bounds |
| `items_sketch_get_estimate` | Scalar | Returns estimated frequency of a specific item |
| `items_sketch_merge` | Scalar | Merges two ItemsSketch binaries |
| `items_sketch_to_string` | Scalar | Returns human-readable sketch summary |

Files changed:

New files (3):

  • sql/catalyst/.../util/ItemsSketchUtils.scala — Utility class for validation, sketch creation, SerDe dispatch
  • sql/catalyst/.../expressions/aggregate/itemsSketchAggregates.scala — ItemsSketchAgg and ItemsSketchMergeAgg aggregate expressions with ExpressionBuilder companions
  • sql/catalyst/.../expressions/itemsSketchExpressions.scala — ItemsSketchSerDeHelper (wire format) plus the ItemsSketchGetFrequentItems, ItemsSketchGetEstimate, ItemsSketchMerge, and ItemsSketchToString scalar expressions

Modified files:

  • sql/catalyst/.../analysis/FunctionRegistry.scala — Register 6 functions (2 via expressionBuilder, 4 via expression)
  • sql/api/.../sql/functions.scala — Public Scala/Java API methods (since 4.2.0)
  • common/utils/.../error/error-conditions.json — 6 new error conditions
  • sql/catalyst/.../errors/QueryExecutionErrors.scala — 6 error factory methods
  • docs/sql-ref-sketch-aggregates.md — Full documentation for all 6 functions
  • python/pyspark/sql/functions/builtin.py — PySpark function wrappers
  • python/pyspark/sql/connect/functions/builtin.py — Spark Connect PySpark wrappers
  • python/pyspark/sql/functions/__init__.py — PySpark exports
  • python/docs/source/reference/pyspark.sql/functions.rst — PySpark docs entries

Test files:

  • sql/core/.../DataFrameAggregateSuite.scala — 9 test cases
  • python/pyspark/sql/tests/test_functions.py — 3 test methods

Auto-generated:

  • sql/core/.../sql-functions/sql-expression-schema.md — Regenerated

Why are the changes needed?

Spark already provides built-in support for HLL, Theta, Tuple, and KLL sketches. The ItemsSketch fills a gap by providing frequency estimation with:

  1. Frequency estimates for any item — ItemsSketch can estimate the frequency of any queried item.
  2. Configurable error guarantees — Users choose between NO_FALSE_POSITIVES (every returned item is truly frequent) and NO_FALSE_NEGATIVES (all truly frequent items are returned).
  3. Mergeable binary representations — Enables multi-level rollup aggregation (e.g., daily sketches merged into weekly/monthly) without re-scanning raw data.
  4. Wide data type support — Boolean, all numeric types, String, Decimal, Date, Timestamp, TimestampNTZ.
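The guarantees above trace back to the Misra-Gries family of frequent-items algorithms on which the DataSketches ItemsSketch is based. The following is a deliberately minimal, illustrative Python model of that idea, an assumption-laden simplification and not the DataSketches implementation (class and field names here are invented for illustration):

```python
# Minimal Misra-Gries-style frequent-items sketch. Illustrative only:
# the real DataSketches ItemsSketch is a more sophisticated relative.
class FrequentItemsSketch:
    def __init__(self, max_map_size):
        self.max_map_size = max_map_size  # analogous to the maxMapSize parameter
        self.counts = {}
        self.offset = 0  # total weight subtracted so far; bounds the error

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.max_map_size:
            self.counts[item] = 1
        else:
            # Map is full: decrement every counter and drop zeros.
            self.offset += 1
            self.counts = {k: v - 1 for k, v in self.counts.items() if v > 1}

    def estimate(self, item):
        # Lower bound on the true count; the true count never
        # exceeds estimate(item) + offset.
        return self.counts.get(item, 0)
```

The point is the bounded-error structure: `estimate` is a guaranteed lower bound, and `estimate + offset` is a guaranteed upper bound, which is what makes point queries for any item (property 1) and principled error guarantees (property 2) possible.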

Does this PR introduce any user-facing change?

Yes. 6 new SQL functions are available:

```sql
-- Build a sketch and get frequent items
SELECT items_sketch_get_frequent_items(
  items_sketch_agg(visitor_id))
FROM sales;

-- Get estimated frequency of a specific item
SELECT items_sketch_get_estimate(
  items_sketch_agg(visitor_id), 'visitor_8')
FROM sales;

-- Merge pre-aggregated sketches across time periods
SELECT items_sketch_get_frequent_items(
  items_sketch_merge_agg(daily_sketch), 'NO_FALSE_POSITIVES')
FROM daily_visitor_sketches;
```

How was this patch tested?

  • 9 Scala test cases in DataFrameAggregateSuite:

    1. Basic items_sketch_agg + items_sketch_get_frequent_items
    2. items_sketch_get_estimate for existing and non-existent items
    3. items_sketch_merge (scalar merge of two sketches)
    4. items_sketch_merge_agg (aggregate merge)
    5. Custom maxMapSize parameter
    6. Numeric data types (Int, Long, Double)
    7. Null handling (all nulls, mixed nulls, single row)
    8. Small capacity approximation behavior + error type comparison
    9. Comprehensive rollup test: 7-day visitor data with dimension hierarchy, verifying daily vs weekly top visitors diverge as expected
  • 3 Python test methods covering all functions, small capacity, and null handling.

  • Golden file regenerated via SPARK_GENERATE_GOLDEN_FILES=1.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: claude-sonnet-4

@xiongbo-sjtu force-pushed the master branch 8 times, most recently from 9981750 to 24f60f7 on March 12, 2026 at 05:53
…ems) functions

```bash
build/mvn install -DskipTests -am -pl core

SPARK_GENERATE_GOLDEN_FILES=1 build/mvn test -pl core \
  -Dtest=none \
  -Dsuites=org.apache.spark.SparkThrowableSuite \
  -Dtests="Error conditions are correctly formatted" \
  -DfailIfNoTests=false
```

```bash
build/mvn install -DskipTests -am -pl sql/core

SPARK_GENERATE_GOLDEN_FILES=1 build/mvn test -pl sql/core \
  -Dtest=none \
  -Dsuites=org.apache.spark.sql.ExpressionsSchemaSuite \
  -DfailIfNoTests=false
```
@xiongbo-sjtu
Author

@dtenedor @cboumalh Please help review this PR when you get a chance. Thanks!

@cboumalh
Contributor

Hi @xiongbo-sjtu, thanks for working on this! I believe Spark already supports ItemsSketch through approx_top_k, approx_top_k_accumulate, approx_top_k_combine. Do those serve your use case?

@xiongbo-sjtu
Author

xiongbo-sjtu commented Mar 12, 2026

> Hi @xiongbo-sjtu, thanks for working on this! I believe Spark already supports ItemsSketch through approx_top_k, approx_top_k_accumulate, approx_top_k_combine. Do those serve your use case?

Good question! I did look into the approx_top_k family of functions, which is great for the common case where users simply want a quick, easy answer to "what are the most frequent items?" with minimal configuration. It's well-designed for that workflow.

However, it doesn't fully serve our use cases. The example here illustrates a simplified version of our real-world usage. The key gaps we see are:

  • Full sketch introspection: We need to retrieve all frequent items tracked by the sketch with their estimated frequencies and error bounds, as well as query the estimated frequency of a specific item, rather than only extracting a top-k list.
  • Configurable error guarantees: Users can choose between NO_FALSE_POSITIVES (every returned item is truly frequent) and NO_FALSE_NEGATIVES (all truly frequent items are returned), which is important for different analytical scenarios.

We see these two sets of functions as complementary: approx_top_k for simple top-k queries, and the new items_sketch functions for more advanced analytical workflows that need finer control over accuracy, error semantics, and data types.
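One way to picture the two error modes: each tracked item carries a lower and upper bound on its true frequency, and the error type decides which bound is compared against the threshold. The following filter is a hypothetical illustration of that semantics (the tuples and function name are invented; this is not the DataSketches API):

```python
# Illustrative filter over items mapped to hypothetical
# (estimate, lower_bound, upper_bound) tuples on the true frequency.
def frequent_items(items, threshold, error_type):
    if error_type == "NO_FALSE_POSITIVES":
        # Only items whose LOWER bound clears the threshold are returned,
        # so every returned item is guaranteed to be truly frequent.
        return [name for name, (est, lb, ub) in items.items() if lb > threshold]
    if error_type == "NO_FALSE_NEGATIVES":
        # Items whose UPPER bound clears the threshold are returned,
        # so no truly frequent item can be missed.
        return [name for name, (est, lb, ub) in items.items() if ub > threshold]
    raise ValueError(f"unknown error type: {error_type}")
```

The same sketch state supports both answers; the error type only changes which side of the uncertainty interval the caller is willing to tolerate.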

@xiongbo-sjtu
Author

The main motivation is architectural consistency. Every other sketch family in Spark (HLL, Theta, Tuple, KLL) follows the *_sketch_agg / *_merge_agg / scalar query pattern with opaque BinaryType output designed for table storage and multi-level rollup. The approx_top_k family uses a different pattern (struct-based state, separate accumulate/combine/estimate) that predates this convention. The items_sketch functions bring frequent-items into the same consistent API shape, with a self-describing binary wire format suitable for persisting in tables and merging across time horizons without re-scanning raw data.
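The merge step behind such rollups can be pictured with a simplified counter model (an illustration under stated assumptions, not Spark's ItemsSketch wire format or merge logic): counters add item-wise, error offsets add, and the result is pruned back to capacity.

```python
# Simplified model of merging two frequent-items states. The (counts, offset)
# representation is an assumption for illustration, not the real wire format.
def merge(counts_a, offset_a, counts_b, offset_b, max_map_size):
    # Counters add item-wise; error offsets add under merge.
    merged = dict(counts_a)
    for item, c in counts_b.items():
        merged[item] = merged.get(item, 0) + c
    offset = offset_a + offset_b
    # Prune back to capacity by repeatedly subtracting the smallest count,
    # folding the subtracted weight into the error offset.
    while len(merged) > max_map_size:
        m = min(merged.values())
        offset += m
        merged = {k: v - m for k, v in merged.items() if v > m}
    return merged, offset
```

Because merging never needs the raw data, daily states can be persisted and later combined into weekly or monthly views, which is the rollup pattern the binary `*_merge_agg` functions are designed for.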

That said, if reviewers feel strongly that we should enhance the existing approx_top_k functions instead (e.g., adding point frequency queries, binary output format, etc.), I'm open to that direction. Happy to discuss which approach the community prefers.

@xiongbo-sjtu
Author

@dtenedor

Hi Daniel, I'd appreciate your review on this PR whenever you have a moment. Thanks!
