Skip to content

[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench#56171

Closed
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-55724-wide-cols-followup
Closed

[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench#56171
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-55724-wide-cols-followup

Conversation

@viirya
Copy link
Copy Markdown
Member

@viirya viirya commented May 28, 2026

What changes were proposed in this pull request?

Rename the wide_values scenario key to wide_cols in three mixins of python/benchmarks/bench_eval_type.py:

  • _CogroupedMapArrowBenchMixin
  • _CogroupedMapPandasBenchMixin
  • _GroupedMapArrowBenchMixin

Why are the changes needed?

The umbrella SPARK-55724 introduced "wide-column" scenarios under two different keys: wide_values (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and wide_cols (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns.

The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. wide_cols is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling.

Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier.

Does this PR introduce any user-facing change?

No. Test-only change in the benchmark module.

How was this patch tested?

  • grep confirmed no remaining wide_values references in the file.
  • Ran setup + time_worker on wide_cols for the three renamed *TimeBench classes (CogroupedMapArrowUDFTimeBench, CogroupedMapPandasUDFTimeBench, GroupedMapArrowUDFTimeBench); all passed.
  • Ran the same on the four pre-existing wide_cols *TimeBench classes (GroupedAggArrowUDFTimeBench, GroupedAggPandasUDFTimeBench, WindowAggArrowUDFTimeBench, WindowAggPandasUDFTimeBench) to confirm no regression.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

… in bench

### What changes were proposed in this pull request?

Rename the `wide_values` scenario key to `wide_cols` in three mixins of
`python/benchmarks/bench_eval_type.py`:

- `_CogroupedMapArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin`
- `_GroupedMapArrowBenchMixin`

### Why are the changes needed?

The umbrella SPARK-55724 introduced "wide-column" scenarios under two
different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381
cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols`
(SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562
grouped agg pandas, SPARK-56658 window agg pandas). All seven describe
the same shape: few rows per group, many columns.

The drift makes the ASV summary harder to read across eval types and
adds friction when wiring shared helpers across siblings. `wide_cols`
is the more descriptive name (the scenario varies the column count,
not the value semantics) and is already the majority spelling.

Since these benchmarks have no nightly CI consumer yet, renaming now
costs nothing in ASV history continuity; deferring the rename only
makes the eventual cleanup costlier.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- `grep` confirmed no remaining `wide_values` references in the file.
- Ran `setup` + `time_worker` on `wide_cols` for the three renamed
  `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`,
  `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all
  passed.
- Ran the same on the four pre-existing `wide_cols` `*TimeBench`
  classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`,
  `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to
  confirm no regression.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

Copy link
Copy Markdown
Contributor

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@viirya viirya closed this in b76b6a8 May 28, 2026
viirya added a commit that referenced this pull request May 28, 2026
…in bench

### What changes were proposed in this pull request?

Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`:

- `_CogroupedMapArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin`
- `_GroupedMapArrowBenchMixin`

### Why are the changes needed?

The umbrella SPARK-55724 introduced \"wide-column\" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns.

The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling.

Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- `grep` confirmed no remaining `wide_values` references in the file.
- Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed.
- Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

Closes #56171 from viirya/SPARK-55724-wide-cols-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit b76b6a8)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 28, 2026
…in bench

### What changes were proposed in this pull request?

Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`:

- `_CogroupedMapArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin`
- `_GroupedMapArrowBenchMixin`

### Why are the changes needed?

The umbrella SPARK-55724 introduced \"wide-column\" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns.

The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling.

Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- `grep` confirmed no remaining `wide_values` references in the file.
- Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed.
- Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

Closes #56171 from viirya/SPARK-55724-wide-cols-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit b76b6a8)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 28, 2026

Merged to master/branch-4.x/branch-4.2.

Thanks @dongjoon-hyun @Yicong-Huang

@viirya viirya deleted the SPARK-55724-wide-cols-followup branch May 28, 2026 19:57
viirya added a commit that referenced this pull request May 28, 2026
…ings

### What changes were proposed in this pull request?

Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs:

- `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin`
- `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin`

The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`.

As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`.

Net diff: +27 / -102 lines.

### Why are the changes needed?

Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis.

The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas).
- Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants).
- Ran `peakmem_worker` for one bench class per pair.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

Closes #56173 from viirya/SPARK-55724-sibling-base-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 28, 2026
…ings

### What changes were proposed in this pull request?

Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs:

- `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin`
- `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin`

The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`.

As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`.

Net diff: +27 / -102 lines.

### Why are the changes needed?

Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis.

The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas).
- Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants).
- Ran `peakmem_worker` for one bench class per pair.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

Closes #56173 from viirya/SPARK-55724-sibling-base-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 40e9a14)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 28, 2026
…ings

### What changes were proposed in this pull request?

Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs:

- `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin`
- `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin`

The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`.

As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`.

Net diff: +27 / -102 lines.

### Why are the changes needed?

Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis.

The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas).
- Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants).
- Ran `peakmem_worker` for one bench class per pair.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

Closes #56173 from viirya/SPARK-55724-sibling-base-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 40e9a14)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 29, 2026
…p Arrow/Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`:

- `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin`

The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

- Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`.
- Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant.

Net diff: +39 / -102 lines.

### Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
- Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes.
- Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Closes #56194 from viirya/SPARK-55724-window-cogroup-base.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 29, 2026
…p Arrow/Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`:

- `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin`

The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

- Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`.
- Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant.

Net diff: +39 / -102 lines.

### Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
- Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes.
- Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Closes #56194 from viirya/SPARK-55724-window-cogroup-base.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 481f8a9)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 29, 2026
…p Arrow/Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`:

- `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin`

The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

- Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`.
- Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant.

Net diff: +39 / -102 lines.

### Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
- Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes.
- Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Closes #56194 from viirya/SPARK-55724-window-cogroup-base.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 481f8a9)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants