[SPARK-57138][PYTHON][TEST] Share base mixin across Window and Cogroup Arrow/Pandas siblings by viirya · Pull Request #56194 · apache/spark

viirya · 2026-05-29T00:20:15Z

What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in python/benchmarks/bench_eval_type.py:

_WindowAggPandasBenchMixin now subclasses _WindowAggArrowBenchMixin
_CogroupedMapPandasBenchMixin now subclasses _CogroupedMapArrowBenchMixin

The shared _build_scenario / _write_scenario are pulled up into the Arrow base, with the eval type parameterized via the _eval_type class attribute and _build_scenario converted from @staticmethod to @classmethod so subclasses read their own _scenario_configs (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

Window: the Pandas half drops _scenario_configs entirely (identical to the Arrow variant) and keeps only _eval_type, its UDFs, and params.
Cogroup: the Arrow _udfs values are normalized to (func, n_args) tuples to match the Pandas sibling, so the inherited _write_scenario works unchanged. The Pandas half keeps its own scaled-down _scenario_configs and the extra 3-arg key_identity_udf variant.

Net diff: +39 / -102 lines.

Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed wide_values to wide_cols in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical _write_scenario bodies differing only in the PythonEvalType.SQL_... constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

Does this PR introduce any user-facing change?

No. Test-only change in the benchmark module.

How was this patch tested?

Structural: confirmed _eval_type resolves correctly via MRO for all four affected mixins; confirmed _CogroupedMapPandasBenchMixin._scenario_configs still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching _build_scenario to @classmethod.
Ran setup + time_worker end-to-end for all four affected *TimeBench classes across every UDF (including the Pandas-only 3-arg key_identity_udf).
Ran peakmem_worker (disk-replay path) for the Window and Cogroup Pandas classes.
Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `@staticmethod` to `@classmethod` so subclasses read their own `_scenario_configs` (same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (apache#56167 made the Window Arrow mixin scenarios lazy; apache#56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set), a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `@classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Co-authored-by: Claude Code

dongjoon-hyun

+1, LGTM.

…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 481f8a9) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

viirya · 2026-05-29T07:35:08Z

Merged to master/branch-4.x/branch-4.2.

Thanks @dongjoon-hyun

dongjoon-hyun approved these changes May 29, 2026

View reviewed changes

viirya closed this in 481f8a9 May 29, 2026

viirya deleted the SPARK-55724-window-cogroup-base branch May 29, 2026 07:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57138][PYTHON][TEST] Share base mixin across Window and Cogroup Arrow/Pandas siblings#56194

[SPARK-57138][PYTHON][TEST] Share base mixin across Window and Cogroup Arrow/Pandas siblings#56194
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-55724-window-cogroup-base

viirya commented May 29, 2026

Uh oh!

dongjoon-hyun left a comment

Uh oh!

viirya commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

viirya commented May 29, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

viirya commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants