[SPARK-55939][SQL] Add built-in DataSketches ItemsSketch (Frequent Items) functions #54745
xiongbo-sjtu wants to merge 1 commit into apache:master
Conversation
…ems) functions

```
build/mvn install -DskipTests -am -pl core
SPARK_GENERATE_GOLDEN_FILES=1 build/mvn test -pl core \
  -Dtest=none \
  -Dsuites=org.apache.spark.SparkThrowableSuite \
  -Dtests="Error conditions are correctly formatted" \
  -DfailIfNoTests=false
```

```
build/mvn install -DskipTests -am -pl sql/core
SPARK_GENERATE_GOLDEN_FILES=1 build/mvn test -pl sql/core \
  -Dtest=none \
  -Dsuites=org.apache.spark.sql.ExpressionsSchemaSuite \
  -DfailIfNoTests=false
```
Hi @xiongbo-sjtu, thanks for working on this! I believe Spark already supports this use case through the existing `approx_top_k` functions.
Good question! I did look into the `approx_top_k` functions. However, they don't fully serve our use cases. The example here illustrates a simplified version of our real-world usage. The key gaps we see are:
We see these two sets of functions as complementary:
The main motivation is architectural consistency. Every other sketch family in Spark (HLL, Theta, Tuple, KLL) follows the `*_sketch_agg` / `*_merge_agg` / scalar query pattern with opaque BinaryType output designed for table storage and multi-level rollup. The `approx_top_k` family uses a different pattern (struct-based state, separate accumulate/combine/estimate) that predates this convention. The `items_sketch` functions bring frequent items into the same consistent API shape, with a self-describing binary wire format suitable for persisting in tables and merging across time horizons without re-scanning raw data.

That said, if reviewers feel strongly that we should enhance the existing `approx_top_k` functions instead (e.g., adding point frequency queries, a binary output format, etc.), I'm open to that direction. Happy to discuss which approach the community prefers.
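To make the rollup pattern concrete, here is a self-contained Python sketch of a mergeable Misra-Gries frequent-items summary, the algorithmic family ItemsSketch belongs to. This is an illustration only, not the DataSketches implementation or the API in this PR; the class name, the `max_map_size` capacity, and the merge rule are simplifications.

```python
from collections import Counter

class FrequentItemsSketch:
    """Toy mergeable Misra-Gries summary (illustration only; not the
    DataSketches ItemsSketch implementation or the API in this PR)."""

    def __init__(self, max_map_size):
        self.max_map_size = max_map_size   # counter-table capacity
        self.counters = {}                 # item -> conservative count
        self.error = 0                     # max undercount of any estimate

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.max_map_size:
            self.counters[item] = 1
        else:
            # Table full: decrement every counter and drop zeros.
            for k in list(self.counters):
                self.counters[k] -= 1
                if self.counters[k] == 0:
                    del self.counters[k]
            self.error += 1

    def get_estimate(self, item):
        # True count lies within [estimate, estimate + self.error].
        return self.counters.get(item, 0)

    def merge(self, other):
        merged = FrequentItemsSketch(self.max_map_size)
        combined = Counter(self.counters) + Counter(other.counters)
        ranked = combined.most_common()
        # Keep only the heaviest entries; subtract the weight of the first
        # evicted entry from the survivors so the undercount bound holds.
        cut = ranked[self.max_map_size][1] if len(ranked) > self.max_map_size else 0
        merged.counters = {k: v - cut for k, v in ranked[:self.max_map_size] if v > cut}
        merged.error = self.error + other.error + cut
        return merged

# Per-hour sketches rolled up into one, with no re-scan of raw data:
hour1, hour2 = FrequentItemsSketch(4), FrequentItemsSketch(4)
for x in ["a"] * 50 + ["b"] * 30 + list("cdefg"):
    hour1.update(x)
for x in ["a"] * 20 + ["h"] * 25:
    hour2.update(x)
day = hour1.merge(hour2)
print(day.get_estimate("a"), day.error)  # → 68 2 (true count of "a" is 70)
```

The point of the exercise: the merged summary still bounds every true count by `estimate + error`, which is what lets per-hour sketches roll up into per-day sketches and beyond.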
Hi Daniel, I'd appreciate your review on this PR whenever you have a moment. Thanks!
### What changes were proposed in this pull request?
This PR adds 6 built-in SQL functions for the Apache DataSketches ItemsSketch (Frequent Items) algorithm, following the same architectural patterns established by the existing Theta, Tuple, and KLL sketch implementations.
Functions added:
- `items_sketch_agg`
- `items_sketch_merge_agg`
- `items_sketch_get_frequent_items`
- `items_sketch_get_estimate`
- `items_sketch_merge`
- `items_sketch_to_string`

Files changed:
New files (3):
- `sql/catalyst/.../util/ItemsSketchUtils.scala` — Utility class for validation, sketch creation, SerDe dispatch
- `sql/catalyst/.../expressions/aggregate/itemsSketchAggregates.scala` — `ItemsSketchAgg` and `ItemsSketchMergeAgg` aggregate expressions with ExpressionBuilder companions
- `sql/catalyst/.../expressions/itemsSketchExpressions.scala` — `ItemsSketchSerDeHelper` (wire format), `ItemsSketchGetFrequentItems`, `ItemsSketchGetEstimate`, `ItemsSketchMerge`, `ItemsSketchToString` scalar expressions

Modified files:
- `sql/catalyst/.../analysis/FunctionRegistry.scala` — Register 6 functions (2 via `expressionBuilder`, 4 via `expression`)
- `sql/api/.../sql/functions.scala` — Public Scala/Java API methods (since 4.2.0)
- `common/utils/.../error/error-conditions.json` — 6 new error conditions
- `sql/catalyst/.../errors/QueryExecutionErrors.scala` — 6 error factory methods
- `docs/sql-ref-sketch-aggregates.md` — Full documentation for all 6 functions
- `python/pyspark/sql/functions/builtin.py` — PySpark function wrappers
- `python/pyspark/sql/connect/functions/builtin.py` — Spark Connect PySpark wrappers
- `python/pyspark/sql/functions/__init__.py` — PySpark exports
- `python/docs/source/reference/pyspark.sql/functions.rst` — PySpark docs entries

Test files:
- `sql/core/.../DataFrameAggregateSuite.scala` — 9 test cases
- `python/pyspark/sql/tests/test_functions.py` — 3 test methods

Auto-generated:
- `sql/core/.../sql-functions/sql-expression-schema.md` — Regenerated

### Why are the changes needed?
Spark already provides built-in support for HLL, Theta, Tuple, and KLL sketches. The ItemsSketch fills a gap by providing frequency estimation with two error-type guarantees: `NO_FALSE_POSITIVES` (every returned item is truly frequent) and `NO_FALSE_NEGATIVES` (all truly frequent items are returned).

### Does this PR introduce any user-facing change?
Yes. The 6 new SQL functions listed above become available, together with the corresponding Scala/Java and PySpark APIs.
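To illustrate the two error-type guarantees named in the motivation section, here is a plain-Python toy (not the DataSketches API): assume a conservative sketch where each stored estimate may undercount the true frequency by at most `error`, so the true count lies in `[est, est + error]`. Filtering on the lower bound admits no false positives; filtering on the upper bound admits no false negatives. The function name and sample values below are made up for illustration.

```python
def frequent_items(estimates, error, threshold, error_type):
    # Toy model: the true count of each item lies in [est, est + error].
    if error_type == "NO_FALSE_POSITIVES":
        # Lower bound above threshold: every returned item is truly frequent.
        return [item for item, est in estimates.items() if est > threshold]
    # NO_FALSE_NEGATIVES: upper bound above threshold, so no truly frequent
    # item can be missed (though some infrequent ones may slip in).
    return [item for item, est in estimates.items() if est + error > threshold]

ests = {"alpha": 68, "beta": 28, "gamma": 24}
print(frequent_items(ests, 2, 25, "NO_FALSE_POSITIVES"))   # → ['alpha', 'beta']
print(frequent_items(ests, 2, 25, "NO_FALSE_NEGATIVES"))   # → ['alpha', 'beta', 'gamma']
```

Note how `gamma` (estimate 24, upper bound 26) is excluded under `NO_FALSE_POSITIVES` but included under `NO_FALSE_NEGATIVES`: the caller chooses which side of the threshold to be strict about.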
### How was this patch tested?
9 Scala test cases in `DataFrameAggregateSuite`:

- `items_sketch_agg` + `items_sketch_get_frequent_items`
- `items_sketch_get_estimate` for existing and non-existent items
- `items_sketch_merge` (scalar merge of two sketches)
- `items_sketch_merge_agg` (aggregate merge)
- `maxMapSize` parameter

3 Python test methods covering all functions, small capacity, and null handling.
Golden file regenerated via `SPARK_GENERATE_GOLDEN_FILES=1`.

### Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: claude-sonnet-4