[SPARK-56691][PYTHON] Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF by Yicong-Huang · Pull Request #55675 · apache/spark

Yicong-Huang · 2026-05-04T21:09:51Z

What changes were proposed in this pull request?

Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF to be self-contained in read_udfs().

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in read_udfs() improves readability and makes it easier to reason about the data flow for each eval type independently.
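To illustrate what "self-contained" means here, the following is a minimal sketch only, not PySpark's actual code. Every name in it (`read_udfs_sketch`, `GROUPED_MAP_ITER`, the list-based batches) is a hypothetical stand-in. It shows one eval type branch owning its whole data flow (iterating groups, calling the UDF, validating output) rather than spreading those steps across shared helpers.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins: NOT PySpark's real constants, types, or signatures.
GROUPED_MAP_ITER = "grouped_map_iter"

Batch = List[int]                 # stand-in for a pandas DataFrame batch
Group = Tuple[str, List[Batch]]   # (grouping key, batches belonging to that key)
IterUDF = Callable[[str, Iterator[Batch]], Iterator[Batch]]


def read_udfs_sketch(eval_type: str, udf: IterUDF) -> Callable[[Iterable[Group]], Iterator[Batch]]:
    """Return a function that maps a stream of groups to a stream of output batches."""
    if eval_type == GROUPED_MAP_ITER:
        # Everything this eval type needs lives in this one branch, so the
        # data flow can be read top to bottom without jumping between helpers.
        def run(groups: Iterable[Group]) -> Iterator[Batch]:
            for key, batches in groups:
                # The UDF receives an iterator of batches and yields batches.
                for out in udf(key, iter(batches)):
                    if not isinstance(out, list):
                        raise TypeError(f"UDF must yield batches, got {type(out).__name__}")
                    yield out

        return run
    raise ValueError(f"unsupported eval type: {eval_type}")


def double_each(key: str, batches: Iterator[Batch]) -> Iterator[Batch]:
    """Example UDF: doubles every value, one output batch per input batch."""
    for batch in batches:
        yield [x * 2 for x in batch]


run = read_udfs_sketch(GROUPED_MAP_ITER, double_each)
result = list(run([("a", [[1, 2], [3]]), ("b", [[4]])]))
print(result)
```

The design point is locality: a reader following one eval type does not need to trace logic shared with other eval types.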

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV benchmark (GroupedMapPandasIterUDFTimeBench, single run with -a repeat=5):

master: 4b3f8c3796e  vs  PR: 29538fd7980

Time (ms, lower = better)
scenario           udf                   master       PR       diff
sm_grp_few_col     identity_udf            447.4    441.0    -1.43%
sm_grp_few_col     sort_udf                499.5    498.8    -0.14%
sm_grp_few_col     key_identity_udf        449.9    411.8    -8.46%
sm_grp_many_col    identity_udf            358.3    375.5    +4.79%
sm_grp_many_col    sort_udf                378.5    388.7    +2.70%
sm_grp_many_col    key_identity_udf        371.3    341.1    -8.14%
lg_grp_few_col     identity_udf            802.7    791.6    -1.39%
lg_grp_few_col     sort_udf                993.7    949.8    -4.42%
lg_grp_few_col     key_identity_udf        682.4    691.2    +1.30%
lg_grp_many_col    identity_udf            928.7    911.1    -1.89%
lg_grp_many_col    sort_udf               1010.4    963.1    -4.69%
lg_grp_many_col    key_identity_udf        897.8    919.7    +2.44%
mixed_types        identity_udf            446.2    431.3    -3.34%
mixed_types        sort_udf                471.2    450.0    -4.50%
mixed_types        key_identity_udf        399.8    383.4    -4.10%
SUM                                       9137.8   8948.1    -2.08%

The aggregate time improved slightly (-2.08%); per-scenario differences (-8.46% to +4.79%) are within run-to-run noise.
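The aggregate figure can be re-derived from the timings in the table. The sketch below is not part of the ASV tooling; it assumes diff = (PR - master) / master, which reproduces the SUM row. Per-row values recomputed from the rounded timings can differ from the table by about 0.01 percentage points, presumably because the table was computed from unrounded measurements.

```python
# Timings in ms, copied from the table: (scenario, udf, master, PR).
ROWS = [
    ("sm_grp_few_col", "identity_udf", 447.4, 441.0),
    ("sm_grp_few_col", "sort_udf", 499.5, 498.8),
    ("sm_grp_few_col", "key_identity_udf", 449.9, 411.8),
    ("sm_grp_many_col", "identity_udf", 358.3, 375.5),
    ("sm_grp_many_col", "sort_udf", 378.5, 388.7),
    ("sm_grp_many_col", "key_identity_udf", 371.3, 341.1),
    ("lg_grp_few_col", "identity_udf", 802.7, 791.6),
    ("lg_grp_few_col", "sort_udf", 993.7, 949.8),
    ("lg_grp_few_col", "key_identity_udf", 682.4, 691.2),
    ("lg_grp_many_col", "identity_udf", 928.7, 911.1),
    ("lg_grp_many_col", "sort_udf", 1010.4, 963.1),
    ("lg_grp_many_col", "key_identity_udf", 897.8, 919.7),
    ("mixed_types", "identity_udf", 446.2, 431.3),
    ("mixed_types", "sort_udf", 471.2, 450.0),
    ("mixed_types", "key_identity_udf", 399.8, 383.4),
]


def pct_diff(master: float, pr: float) -> float:
    """Relative change of PR versus master, in percent (negative = faster)."""
    return (pr - master) / master * 100.0


master_total = sum(row[2] for row in ROWS)
pr_total = sum(row[3] for row in ROWS)
total_diff = pct_diff(master_total, pr_total)

# Count how many scenario/UDF pairs got faster or slower on the PR commit.
faster = sum(1 for row in ROWS if row[3] < row[2])
slower = sum(1 for row in ROWS if row[3] > row[2])

print(f"SUM master={master_total:.1f} PR={pr_total:.1f} diff={total_diff:+.2f}%")
print(f"faster in {faster} cases, slower in {slower} cases")
```

With a single run, the mixed signs across rows are consistent with the noise interpretation rather than a systematic change.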

Peakmem benchmark (GroupedMapPandasIterUDFPeakmemBench) was essentially flat (SUM -0.02%).

Was this patch authored or co-authored using generative AI tooling?

No.

devin-petersohn

LGTM

Closes #55675 from Yicong-Huang/SPARK-56691.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 5126054)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-05-06T01:38:56Z

merged to master and 4.x

refactor: extract grouped map pandas iter UDF logic into read_udfs

29538fd

devin-petersohn approved these changes May 5, 2026


zhengruifeng closed this in 5126054 May 6, 2026

Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56691