[SPARK-56691][PYTHON] Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF #55675

Closed
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56691

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF to be self-contained in read_udfs().

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in read_udfs() improves readability and makes it easier to reason about the data flow for each eval type independently.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV benchmark (GroupedMapPandasIterUDFTimeBench, single run with -a repeat=5):

master: 4b3f8c3796e  vs  PR: 29538fd7980

Time (ms, lower = better)
scenario           udf                   master       PR       diff
sm_grp_few_col     identity_udf            447.4    441.0    -1.43%
sm_grp_few_col     sort_udf                499.5    498.8    -0.14%
sm_grp_few_col     key_identity_udf        449.9    411.8    -8.46%
sm_grp_many_col    identity_udf            358.3    375.5    +4.79%
sm_grp_many_col    sort_udf                378.5    388.7    +2.70%
sm_grp_many_col    key_identity_udf        371.3    341.1    -8.14%
lg_grp_few_col     identity_udf            802.7    791.6    -1.39%
lg_grp_few_col     sort_udf                993.7    949.8    -4.42%
lg_grp_few_col     key_identity_udf        682.4    691.2    +1.30%
lg_grp_many_col    identity_udf            928.7    911.1    -1.89%
lg_grp_many_col    sort_udf               1010.4    963.1    -4.69%
lg_grp_many_col    key_identity_udf        897.8    919.7    +2.44%
mixed_types        identity_udf            446.2    431.3    -3.34%
mixed_types        sort_udf                471.2    450.0    -4.50%
mixed_types        key_identity_udf        399.8    383.4    -4.10%
SUM                                       9137.8   8948.1    -2.08%

The aggregate time improved slightly (-2.08%); per-scenario differences, which range from -8.46% to +4.79%, are within run-to-run noise.
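
For readers who want to check the aggregate row, the following is a minimal sketch that recomputes the SUM line from the per-row timings in the table above. The helper name `pct_diff` and the variable names are illustrative and not part of the benchmark suite. Because the inputs here are the rounded values as printed, an individual row recomputed this way may differ from the table in the last digit.

```python
# Timings in milliseconds, copied row by row from the benchmark table above.
master_ms = [447.4, 499.5, 449.9, 358.3, 378.5, 371.3, 802.7, 993.7, 682.4,
             928.7, 1010.4, 897.8, 446.2, 471.2, 399.8]
pr_ms = [441.0, 498.8, 411.8, 375.5, 388.7, 341.1, 791.6, 949.8, 691.2,
         911.1, 963.1, 919.7, 431.3, 450.0, 383.4]


def pct_diff(base, new):
    """Relative change of `new` versus `base`, in percent (negative = faster)."""
    return (new - base) / base * 100.0


# Totals rounded to the table's one-decimal precision.
total_master = round(sum(master_ms), 1)
total_pr = round(sum(pr_ms), 1)
total_diff = round(pct_diff(total_master, total_pr), 2)

# Number of scenario/UDF combinations where the PR was faster than master.
faster_rows = sum(1 for m, p in zip(master_ms, pr_ms) if p < m)

print(total_master, total_pr, total_diff, faster_rows)
```

Under these inputs the totals come out to 9137.8 ms and 8948.1 ms, a change of -2.08%, with the PR faster in 11 of the 15 rows.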

Peakmem benchmark (GroupedMapPandasIterUDFPeakmemBench) was essentially flat (SUM -0.02%).

Was this patch authored or co-authored using generative AI tooling?

No.

Contributor

@devin-petersohn devin-petersohn left a comment

LGTM

zhengruifeng pushed a commit that referenced this pull request May 6, 2026
Closes #55675 from Yicong-Huang/SPARK-56691.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 5126054)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Contributor

Merged to master and 4.x.
