[SPARK-55279][SQL] Add sketch_funcs group for DataSketches SQL functions
#54061
JIRA Issue Information: Improvement SPARK-55279 (this comment was automatically generated by GitHub Actions)
allisonwang-db
left a comment
cc @dtenedor
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @yaooqinn and @allisonwang-db .
cc @peter-toth
Merged to master, thank you @allisonwang-db @dongjoon-hyun @peter-toth. BTW, the test failure is unrelated to this change, and I'm trying to fix it in #54072.
What changes were proposed in this pull request?
All DataSketches-related expression functions should have their own sketch_funcs group instead of being grouped under misc_funcs. This improves consistency with how other specialized function categories are organized and makes the documentation clearer for users.
Move all sketch-related expression functions from misc_funcs to sketch_funcs:
- HLL sketch functions: hll_sketch_estimate, hll_union
- Theta sketch functions: theta_sketch_estimate, theta_union, theta_difference, theta_intersection
- KLL sketch functions: kll_sketch_to_string_*, kll_sketch_get_n_*, kll_sketch_get_rank_*, kll_sketch_get_quantile_*, kll_sketch_get_pmf_*, kll_sketch_get_cdf_*, kll_sketch_merge_*
- Tuple sketch functions: tuple_sketch_* expression functions
- ApproxTopK: approx_top_k_estimate
Add sketch_funcs to the groups set in gen-sql-functions-docs.py.
Note: Aggregate functions (like hll_sketch_agg, theta_sketch_agg, kll_sketch_agg_*, etc.) remain in agg_funcs.
Why are the changes needed?
This PR moves 34 DataSketches-related expression functions from misc_funcs to a dedicated sketch_funcs group. These 34 functions represent over 60% of all misc_funcs, making misc_funcs a catch-all bucket that reduces documentation clarity. By creating sketch_funcs, we achieve consistency with other specialized function groups (avro_funcs, json_funcs, csv_funcs, xml_funcs, etc.) and make it easier for users to discover and understand DataSketches functionality in Spark SQL.
Does this PR introduce any user-facing change?
No functional changes. The only difference is in how functions are grouped in documentation.
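For reviewers unfamiliar with the docs generator, here is a rough, self-contained sketch of why the new group also has to be added to the groups set. It is not the actual code in gen-sql-functions-docs.py: the helper name bucket_by_group, the shortened groups set, the sample function list, and the "skip groups that are not registered" behavior are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical, shortened stand-in for the set of documented groups.
# The real set in gen-sql-functions-docs.py contains more entries.
groups = {"agg_funcs", "misc_funcs", "sketch_funcs"}

# Illustrative (function name, group) pairs using names from this PR.
functions = [
    ("hll_sketch_agg", "agg_funcs"),        # aggregates stay in agg_funcs
    ("hll_sketch_estimate", "sketch_funcs"),
    ("hll_union", "sketch_funcs"),
    ("theta_union", "sketch_funcs"),
]


def bucket_by_group(functions, groups):
    """Bucket function names by declared group, skipping unregistered groups.

    Assumption: a function whose group is not in `groups` is left out of
    the generated docs, which is why the new group must be registered.
    """
    buckets = defaultdict(list)
    for name, group in functions:
        if group in groups:
            buckets[group].append(name)
    return {group: sorted(names) for group, names in buckets.items()}


docs = bucket_by_group(functions, groups)
```

Under these assumptions, removing "sketch_funcs" from the set would make the three sketch functions disappear from the result, while the aggregate function is unaffected.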
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
Yes, GitHub Copilot was used to assist with this change.