[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench by viirya · Pull Request #56171 · apache/spark

viirya · 2026-05-28T03:08:18Z

What changes were proposed in this pull request?

Rename the wide_values scenario key to wide_cols in three mixins of python/benchmarks/bench_eval_type.py:

_CogroupedMapArrowBenchMixin
_CogroupedMapPandasBenchMixin
_GroupedMapArrowBenchMixin

Why are the changes needed?

The umbrella SPARK-55724 introduced "wide-column" scenarios under two different keys: wide_values (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and wide_cols (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns.

The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. wide_cols is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling.

Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier.

Does this PR introduce any user-facing change?

No. Test-only change in the benchmark module.

How was this patch tested?

grep confirmed no remaining wide_values references in the file.
Ran setup + time_worker on wide_cols for the three renamed *TimeBench classes (CogroupedMapArrowUDFTimeBench, CogroupedMapPandasUDFTimeBench, GroupedMapArrowUDFTimeBench); all passed.
Ran the same on the four pre-existing wide_cols *TimeBench classes (GroupedAggArrowUDFTimeBench, GroupedAggPandasUDFTimeBench, WindowAggArrowUDFTimeBench, WindowAggPandasUDFTimeBench) to confirm no regression.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

… in bench ### What changes were proposed in this pull request? Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`: - `_CogroupedMapArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` - `_GroupedMapArrowBenchMixin` ### Why are the changes needed? The umbrella SPARK-55724 introduced "wide-column" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns. The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling. Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - `grep` confirmed no remaining `wide_values` references in the file. - Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed. - Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7)

dongjoon-hyun

+1, LGTM.

Yicong-Huang

LGTM!

…in bench ### What changes were proposed in this pull request? Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`: - `_CogroupedMapArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` - `_GroupedMapArrowBenchMixin` ### Why are the changes needed? The umbrella SPARK-55724 introduced \"wide-column\" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns. The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling. Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - `grep` confirmed no remaining `wide_values` references in the file. - Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed. - Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56171 from viirya/SPARK-55724-wide-cols-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit b76b6a8) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

viirya · 2026-05-28T19:57:00Z

Merged to master/branch-4.x/branch-4.2.

Thanks @dongjoon-hyun @Yicong-Huang

…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 40e9a14) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 481f8a9) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

viirya mentioned this pull request May 28, 2026

[SPARK-57137][PYTHON][TESTS] Share base mixin across Arrow/Pandas siblings #56173

Closed

dongjoon-hyun approved these changes May 28, 2026

View reviewed changes

Yicong-Huang approved these changes May 28, 2026

View reviewed changes

viirya closed this in b76b6a8 May 28, 2026

viirya deleted the SPARK-55724-wide-cols-followup branch May 28, 2026 19:57

viirya mentioned this pull request May 29, 2026

[SPARK-57138][PYTHON][TEST] Share base mixin across Window and Cogroup Arrow/Pandas siblings #56194

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench#56171

[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench#56171
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-55724-wide-cols-followup

viirya commented May 28, 2026

Uh oh!

dongjoon-hyun left a comment

Uh oh!

Yicong-Huang left a comment

Uh oh!

viirya commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

viirya commented May 28, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

Yicong-Huang left a comment

Choose a reason for hiding this comment

Uh oh!

viirya commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants