@Yicong-Huang
Contributor

What changes were proposed in this pull request?

This PR adds SQL_GROUPED_AGG_PANDAS_ITER_UDF to the list of supported eval types in the UDFRegistration.register() method, so that users can register Pandas Grouped Iter Aggregate UDFs for SQL usage.

Why are the changes needed?

Currently, UDFs written with the iterator API for grouped aggregate Pandas UDFs cannot be registered for SQL usage via spark.udf.register(). This is inconsistent with other UDF types such as SQL_GROUPED_AGG_ARROW_ITER_UDF, which is already supported.

With this change, users can now register iterator-based grouped aggregate UDFs and use them in SQL queries:

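from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Grouped aggregate UDF: receives one group's values as an iterator of batches.
@pandas_udf("double")
def sum_iter_udf(it: Iterator[pd.Series]) -> float:
    total = 0.0
    for series in it:
        total += series.sum()
    return total

# Register under a SQL-visible name, then call it from a query.
spark.udf.register("sum_iter_udf", sum_iter_udf)
spark.sql("SELECT sum_iter_udf(v) FROM table GROUP BY id")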

Does this PR introduce any user-facing change?

Yes. Users can now register Pandas Grouped Iter Aggregate UDFs (Iterator[pd.Series] -> scalar) for SQL usage.
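The Iterator[pd.Series] -> scalar contract can be tried out locally without a Spark session. The sketch below reuses the body of sum_iter_udf without the @pandas_udf decorator; the name sum_iter and the hand-built batches are illustrative assumptions, standing in for whatever chunks Spark would actually pass for a single group.

```python
from typing import Iterator

import pandas as pd


def sum_iter(it: Iterator[pd.Series]) -> float:
    """Plain-Python version of sum_iter_udf (no @pandas_udf decorator)."""
    total = 0.0
    for series in it:
        # Each element is one batch of the group's values.
        total += series.sum()
    return total


# Hypothetical batches for a single group; the real batching is decided by Spark.
batches = [pd.Series([1.0, 2.0]), pd.Series([3.0]), pd.Series([4.0, 5.0])]
result = sum_iter(iter(batches))
print(result)  # 15.0
```

Because the function only keeps a running total, it never needs all of a group's values in memory at once, which is the usual motivation for an iterator-style aggregate.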

How was this patch tested?

Added a new test case test_register_grouped_agg_iter_udf in python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py.

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang changed the title from "[SPARK-54722][PYTHON] Register Pandas Grouped Iter Aggregate UDF for SQL usage" to "[SPARK-54722][PYTHON][SQL] Register Pandas Grouped Iter Aggregate UDF for SQL usage" on Dec 17, 2025
@zhengruifeng
Contributor

merged to master
