[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench#56171
Closed
viirya wants to merge 1 commit into
Closed
[SPARK-55724][PYTHON][TESTS][FOLLOWUP] Unify wide_values to wide_cols in bench#56171viirya wants to merge 1 commit into
viirya wants to merge 1 commit into
Conversation
… in bench ### What changes were proposed in this pull request? Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`: - `_CogroupedMapArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` - `_GroupedMapArrowBenchMixin` ### Why are the changes needed? The umbrella SPARK-55724 introduced "wide-column" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns. The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling. Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - `grep` confirmed no remaining `wide_values` references in the file. - Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed. - Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7)
viirya
added a commit
that referenced
this pull request
May 28, 2026
…in bench ### What changes were proposed in this pull request? Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`: - `_CogroupedMapArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` - `_GroupedMapArrowBenchMixin` ### Why are the changes needed? The umbrella SPARK-55724 introduced \"wide-column\" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns. The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling. Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - `grep` confirmed no remaining `wide_values` references in the file. - Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed. - Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56171 from viirya/SPARK-55724-wide-cols-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit b76b6a8) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 28, 2026
…in bench ### What changes were proposed in this pull request? Rename the `wide_values` scenario key to `wide_cols` in three mixins of `python/benchmarks/bench_eval_type.py`: - `_CogroupedMapArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` - `_GroupedMapArrowBenchMixin` ### Why are the changes needed? The umbrella SPARK-55724 introduced \"wide-column\" scenarios under two different keys: `wide_values` (SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) and `wide_cols` (SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns. The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings. `wide_cols` is the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling. Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - `grep` confirmed no remaining `wide_values` references in the file. - Ran `setup` + `time_worker` on `wide_cols` for the three renamed `*TimeBench` classes (`CogroupedMapArrowUDFTimeBench`, `CogroupedMapPandasUDFTimeBench`, `GroupedMapArrowUDFTimeBench`); all passed. - Ran the same on the four pre-existing `wide_cols` `*TimeBench` classes (`GroupedAggArrowUDFTimeBench`, `GroupedAggPandasUDFTimeBench`, `WindowAggArrowUDFTimeBench`, `WindowAggPandasUDFTimeBench`) to confirm no regression. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56171 from viirya/SPARK-55724-wide-cols-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit b76b6a8) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Member
Author
|
Merged to master/branch-4.x/branch-4.2. Thanks @dongjoon-hyun @Yicong-Huang |
viirya
added a commit
that referenced
this pull request
May 28, 2026
…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 28, 2026
…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 40e9a14) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 28, 2026
…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 40e9a14) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 29, 2026
…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 29, 2026
…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 481f8a9) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 29, 2026
…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 481f8a9) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Rename the
wide_valuesscenario key towide_colsin three mixins ofpython/benchmarks/bench_eval_type.py:_CogroupedMapArrowBenchMixin_CogroupedMapPandasBenchMixin_GroupedMapArrowBenchMixinWhy are the changes needed?
The umbrella SPARK-55724 introduced "wide-column" scenarios under two different keys:
wide_values(SPARK-55947 grouped map Arrow, SPARK-56381 cogrouped map Arrow, SPARK-56629 cogrouped map pandas) andwide_cols(SPARK-56085 grouped agg, SPARK-56120 window agg Arrow, SPARK-56562 grouped agg pandas, SPARK-56658 window agg pandas). All seven describe the same shape: few rows per group, many columns.The drift makes the ASV summary harder to read across eval types and adds friction when wiring shared helpers across siblings.
wide_colsis the more descriptive name (the scenario varies the column count, not the value semantics) and is already the majority spelling.Since these benchmarks have no nightly CI consumer yet, renaming now costs nothing in ASV history continuity; deferring the rename only makes the eventual cleanup costlier.
Does this PR introduce any user-facing change?
No. Test-only change in the benchmark module.
How was this patch tested?
grepconfirmed no remainingwide_valuesreferences in the file.setup+time_workeronwide_colsfor the three renamed*TimeBenchclasses (CogroupedMapArrowUDFTimeBench,CogroupedMapPandasUDFTimeBench,GroupedMapArrowUDFTimeBench); all passed.wide_cols*TimeBenchclasses (GroupedAggArrowUDFTimeBench,GroupedAggPandasUDFTimeBench,WindowAggArrowUDFTimeBench,WindowAggPandasUDFTimeBench) to confirm no regression.Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: Claude Code (claude-opus-4-7)