[SPARK-56120][PYTHON][TESTS][FOLLOWUP] Make _WindowAggArrowBenchMixin scenarios lazy#56167
Closed
viirya wants to merge 1 commit into
Closed
[SPARK-56120][PYTHON][TESTS][FOLLOWUP] Make _WindowAggArrowBenchMixin scenarios lazy#56167viirya wants to merge 1 commit into
viirya wants to merge 1 commit into
Conversation
… scenarios lazy ### What changes were proposed in this pull request? Convert `_WindowAggArrowBenchMixin` in `python/benchmarks/bench_eval_type.py` to the lazy `_scenario_configs` + `@staticmethod _build_scenario(name)` pattern used by every other mixin in the file, matching the immediately-following `_WindowAggPandasBenchMixin`. ### Why are the changes needed? SPARK-56244 follow-up (commit 1c807ad) removed eager `_scenarios = _build_scenarios()` from all mixins so that importing the benchmark module no longer materializes every scenario's Arrow data -- a prerequisite for accurate per-scenario `peakmem_*` readings under ASV (ASV reports the max RSS observed in the worker process, so any import-time allocation inflates every subsequent peakmem result). SPARK-56120 (`78aaf11728b`, merged the day after the follow-up) reintroduced the eager pattern in `_WindowAggArrowBenchMixin`, leaving it as the only mixin in the file still doing class-body data construction. As a result, `WindowAggArrowUDFPeakmemBench` readings are dominated by the global import-time allocation rather than the per-scenario footprint. Measured locally with `tracemalloc`: - before: import peak = 394.54 MiB - after: import peak = 29.17 MiB ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Imported `python.benchmarks.bench_eval_type` and asserted the lazy structure is in place (`_scenario_configs` present, `_scenarios` absent, `_build_scenario` is a staticmethod). - Ran `WindowAggArrowUDFTimeBench.setup` + `time_worker` for `(many_groups_sm, few_groups_sm) x (sum_udf, mean_multi_udf)`. - Ran `WindowAggArrowUDFPeakmemBench.setup` + `peakmem_worker` for `many_groups_sm/sum_udf`. - Compared import-time peak memory before/after (numbers above). ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7)
viirya
added a commit
that referenced
this pull request
May 28, 2026
…scenarios lazy ### What changes were proposed in this pull request? Convert `_WindowAggArrowBenchMixin` in `python/benchmarks/bench_eval_type.py` to the lazy `_scenario_configs` + `staticmethod _build_scenario(name)` pattern used by every other mixin in the file, matching the immediately-following `_WindowAggPandasBenchMixin`. ### Why are the changes needed? SPARK-56244 follow-up (commit 1c807ad) removed eager `_scenarios = _build_scenarios()` from all mixins so that importing the benchmark module no longer materializes every scenario's Arrow data -- a prerequisite for accurate per-scenario `peakmem_*` readings under ASV (ASV reports the max RSS observed in the worker process, so any import-time allocation inflates every subsequent peakmem result). SPARK-56120 (`78aaf11728b`, merged the day after the follow-up) reintroduced the eager pattern in `_WindowAggArrowBenchMixin`, leaving it as the only mixin in the file still doing class-body data construction. As a result, `WindowAggArrowUDFPeakmemBench` readings are dominated by the global import-time allocation rather than the per-scenario footprint. Measured locally with `tracemalloc`: - before: import peak = 394.54 MiB - after: import peak = 29.17 MiB ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Imported `python.benchmarks.bench_eval_type` and asserted the lazy structure is in place (`_scenario_configs` present, `_scenarios` absent, `_build_scenario` is a staticmethod). - Ran `WindowAggArrowUDFTimeBench.setup` + `time_worker` for `(many_groups_sm, few_groups_sm) x (sum_udf, mean_multi_udf)`. - Ran `WindowAggArrowUDFPeakmemBench.setup` + `peakmem_worker` for `many_groups_sm/sum_udf`. - Compared import-time peak memory before/after (numbers above). ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56167 from viirya/SPARK-56120-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 7da33f3) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 28, 2026
…scenarios lazy ### What changes were proposed in this pull request? Convert `_WindowAggArrowBenchMixin` in `python/benchmarks/bench_eval_type.py` to the lazy `_scenario_configs` + `staticmethod _build_scenario(name)` pattern used by every other mixin in the file, matching the immediately-following `_WindowAggPandasBenchMixin`. ### Why are the changes needed? SPARK-56244 follow-up (commit 1c807ad) removed eager `_scenarios = _build_scenarios()` from all mixins so that importing the benchmark module no longer materializes every scenario's Arrow data -- a prerequisite for accurate per-scenario `peakmem_*` readings under ASV (ASV reports the max RSS observed in the worker process, so any import-time allocation inflates every subsequent peakmem result). SPARK-56120 (`78aaf11728b`, merged the day after the follow-up) reintroduced the eager pattern in `_WindowAggArrowBenchMixin`, leaving it as the only mixin in the file still doing class-body data construction. As a result, `WindowAggArrowUDFPeakmemBench` readings are dominated by the global import-time allocation rather than the per-scenario footprint. Measured locally with `tracemalloc`: - before: import peak = 394.54 MiB - after: import peak = 29.17 MiB ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Imported `python.benchmarks.bench_eval_type` and asserted the lazy structure is in place (`_scenario_configs` present, `_scenarios` absent, `_build_scenario` is a staticmethod). - Ran `WindowAggArrowUDFTimeBench.setup` + `time_worker` for `(many_groups_sm, few_groups_sm) x (sum_udf, mean_multi_udf)`. - Ran `WindowAggArrowUDFPeakmemBench.setup` + `peakmem_worker` for `many_groups_sm/sum_udf`. - Compared import-time peak memory before/after (numbers above). ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56167 from viirya/SPARK-56120-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 7da33f3) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Member
Author
|
Merged to master/branch-4.x/branch-4.2. Thanks @dongjoon-hyun |
viirya
added a commit
that referenced
this pull request
May 28, 2026
…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 28, 2026
…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 40e9a14) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 28, 2026
…ings ### What changes were proposed in this pull request? Reduce duplication between Arrow and Pandas sibling mixins in `python/benchmarks/bench_eval_type.py` by making the Pandas variant subclass the Arrow variant, mirroring the existing iter-subclasses-noniter pattern. Applies to two pairs: - `_ScalarPandasBenchMixin` now subclasses `_ScalarArrowBenchMixin` - `_GroupedAggPandasBenchMixin` now subclasses `_GroupedAggArrowBenchMixin` The shared `_build_scenario` and `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute (the Scalar pair already used this; this PR extends the same pattern to GroupedAgg). `_build_scenario` is converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs`. As a follow-on benefit, `_GroupedAggArrowIterBenchMixin` (already a subclass of the Arrow base) drops its now-redundant copy of `_write_scenario`. Net diff: +27 / -102 lines. ### Why are the changes needed? Before: each sibling pair had two near-identical `_write_scenario` bodies, differing only in the hard-coded `PythonEvalType.SQL_...` constant and the UDF set. Any change to the protocol-writing logic (runner_conf, eval_conf, payload framing) had to be applied in lock-step across both sibling halves -- a known footgun. The pattern already used by the iter subclasses (override `_eval_type` + `_udfs`, inherit everything else) generalizes cleanly to the Arrow/Pandas axis. The Window and CogroupedMap pairs are intentionally left out of this PR to avoid conflicting with two in-flight PRs (#56167 makes the Window Arrow mixin lazy; #56171 renames `wide_values` to `wide_cols` in the Cogroup pair). Both pairs can be folded into the same base-class pattern in a follow-up once those land. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all seven affected mixins (Scalar Arrow / Arrow-Iter / Pandas / Pandas-Iter, GroupedAgg Arrow / Arrow-Iter / Pandas). - Confirmed `_ScalarPandasBenchMixin._scenario_configs` still holds the pandas-sized row counts (1M `pure_ints`), not the Arrow base's 5M -- the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all 7 `*TimeBench` classes (including both UDF arities for the GroupedAgg variants). - Ran `peakmem_worker` for one bench class per pair. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes #56173 from viirya/SPARK-55724-sibling-base-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 40e9a14) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 29, 2026
…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 29, 2026
…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 481f8a9) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya
added a commit
that referenced
this pull request
May 29, 2026
…p Arrow/Pandas siblings ### What changes were proposed in this pull request? Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern to the remaining two pairs in `python/benchmarks/bench_eval_type.py`: - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin` - `_CogroupedMapPandasBenchMixin` now subclasses `_CogroupedMapArrowBenchMixin` The shared `_build_scenario` / `_write_scenario` are pulled up into the Arrow base, with the eval type parameterized via the `_eval_type` class attribute and `_build_scenario` converted from `staticmethod` to `classmethod` so subclasses read their own `_scenario_configs` (the same mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs). - Window: the Pandas half drops `_scenario_configs` entirely (identical to the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`. - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` tuples to match the Pandas sibling, so the inherited `_write_scenario` works unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and the extra 3-arg `key_identity_udf` variant. Net diff: +39 / -102 lines. ### Why are the changes needed? These two pairs were intentionally left out of SPARK-57137 to avoid conflicting with two in-flight PRs (#56167 made the Window Arrow mixin scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup pair). Both have since merged, so the pairs can now be folded into the same base-class pattern. Before this change, each sibling pair carried two near-identical `_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` constant (and, for Cogroup, the UDF set) -- a known footgun where any protocol-writing change had to be applied in lock-step across both halves. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Structural: confirmed `_eval_type` resolves correctly via MRO for all four affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` still holds the scaled-down pandas row counts (not the Arrow base's), the main MRO-resolution risk of switching `_build_scenario` to `classmethod`. - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`). - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas classes. - Confirmed the generated wire bytes are byte-identical to the pre-refactor output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only in the embedded class-hierarchy reference (same length, identical execution), matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-8) Closes #56194 from viirya/SPARK-55724-window-cogroup-base. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 481f8a9) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Convert
_WindowAggArrowBenchMixininpython/benchmarks/bench_eval_type.pyto the lazy
_scenario_configs+@staticmethod _build_scenario(name)patternused by every other mixin in the file, matching the immediately-following
_WindowAggPandasBenchMixin.Why are the changes needed?
SPARK-56244 follow-up (commit 1c807ad) removed eager
_scenarios = _build_scenarios()from all mixins so that importing the benchmark module no longer materializes every scenario's Arrow data -- a prerequisite for accurate per-scenariopeakmem_*readings under ASV (ASV reports the max RSS observed in the worker process, so any import-time allocation inflates every subsequent peakmem result).SPARK-56120 (
78aaf11728b, merged the day after the follow-up) reintroduced the eager pattern in_WindowAggArrowBenchMixin, leaving it as the only mixin in the file still doing class-body data construction. As a result,WindowAggArrowUDFPeakmemBenchreadings are dominated by the global import-time allocation rather than the per-scenario footprint.Measured locally with
tracemalloc:Does this PR introduce any user-facing change?
No. Test-only change in the benchmark module.
How was this patch tested?
python.benchmarks.bench_eval_typeand asserted the lazy structure is in place (_scenario_configspresent,_scenariosabsent,_build_scenariois a staticmethod).WindowAggArrowUDFTimeBench.setup+time_workerfor(many_groups_sm, few_groups_sm) x (sum_udf, mean_multi_udf).WindowAggArrowUDFPeakmemBench.setup+peakmem_workerformany_groups_sm/sum_udf.Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: Claude Code (claude-opus-4-7)