Skip to content

[SPARK-57138][PYTHON][TEST] Share base mixin across Window and Cogroup Arrow/Pandas siblings#56194

Closed
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-55724-window-cogroup-base
Closed

[SPARK-57138][PYTHON][TEST] Share base mixin across Window and Cogroup Arrow/Pandas siblings#56194
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-55724-window-cogroup-base

Conversation

@viirya
Copy link
Copy Markdown
Member

@viirya viirya commented May 29, 2026

What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in python/benchmarks/bench_eval_type.py:

  • _WindowAggPandasBenchMixin now subclasses _WindowAggArrowBenchMixin
  • _CogroupedMapPandasBenchMixin now subclasses _CogroupedMapArrowBenchMixin

The shared _build_scenario / _write_scenario are pulled up into the Arrow base, with the eval type parameterized via the _eval_type class attribute and _build_scenario converted from @staticmethod to @classmethod so subclasses read their own _scenario_configs (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

  • Window: the Pandas half drops _scenario_configs entirely (identical to the Arrow variant) and keeps only _eval_type, its UDFs, and params.
  • Cogroup: the Arrow _udfs values are normalized to (func, n_args) tuples to match the Pandas sibling, so the inherited _write_scenario works unchanged. The Pandas half keeps its own scaled-down _scenario_configs and the extra 3-arg key_identity_udf variant.

Net diff: +39 / -102 lines.

Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed wide_values to wide_cols in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical _write_scenario bodies differing only in the PythonEvalType.SQL_... constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

Does this PR introduce any user-facing change?

No. Test-only change in the benchmark module.

How was this patch tested?

  • Structural: confirmed _eval_type resolves correctly via MRO for all four affected mixins; confirmed _CogroupedMapPandasBenchMixin._scenario_configs still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching _build_scenario to @classmethod.
  • Ran setup + time_worker end-to-end for all four affected *TimeBench classes across every UDF (including the Pandas-only 3-arg key_identity_udf).
  • Ran peakmem_worker (disk-replay path) for the Window and Cogroup Pandas classes.
  • Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

…p Arrow/Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the
remaining two pairs in `python/benchmarks/bench_eval_type.py`:

- `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin`

The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base,
with the eval type parameterized via the `_eval_type` class attribute and
`_build_scenario` converted from `@staticmethod` to `@classmethod` so subclasses read
their own `_scenario_configs` (same mechanism SPARK-57137 used for the Scalar and
GroupedAgg pairs).

- Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow
  variant) and keeps only `_eval_type`, its UDFs, and `params`.
- Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match
  the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas
  half keeps its own scaled-down `_scenario_configs` and the extra 3-arg
  `key_identity_udf` variant.

Net diff: +39 / -102 lines.

### Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with
two in-flight PRs (apache#56167 made the Window Arrow mixin scenarios lazy; apache#56171 renamed
`wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs
can now be folded into the same base-class pattern. Before this change, each sibling
pair carried two near-identical `_write_scenario` bodies differing only in the
`PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set), a known footgun where
any protocol-writing change had to be applied in lock-step across both halves.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected
  mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the
  scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of
  switching `_build_scenario` to `@classmethod`.
- Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes
  across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
- Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes.
- Confirmed the generated wire bytes are byte-identical to the pre-refactor output for
  Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded
  class-hierarchy reference (same length, identical execution), matching what SPARK-57137
  produced for the Scalar/GroupedAgg pairs.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Co-authored-by: Claude Code
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@viirya viirya closed this in 481f8a9 May 29, 2026
viirya added a commit that referenced this pull request May 29, 2026
…p Arrow/Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`:

- `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin`

The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

- Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`.
- Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant.

Net diff: +39 / -102 lines.

### Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
- Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes.
- Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Closes #56194 from viirya/SPARK-55724-window-cogroup-base.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 481f8a9)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request May 29, 2026
…p Arrow/Pandas siblings

### What changes were proposed in this pull request?

Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`:

- `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
- `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin`

The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).

- Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`.
- Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant.

Net diff: +39 / -102 lines.

### Why are the changes needed?

These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern.

Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`.
- Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
- Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes.
- Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-8)

Closes #56194 from viirya/SPARK-55724-window-cogroup-base.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 481f8a9)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya viirya deleted the SPARK-55724-window-cogroup-base branch May 29, 2026 07:34
@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 29, 2026

Merged to master/branch-4.x/branch-4.2.

Thanks @dongjoon-hyun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants