[SPARK-55726][PYTHON][TEST][FOLLOW-UP] Write grouped benchmark data as one Arrow IPC stream per DataFrame by Yicong-Huang · Pull Request #55527 · apache/spark

Yicong-Huang · 2026-04-24T04:50:32Z

What changes were proposed in this pull request?

Fix MockProtocolWriter.write_grouped_data_payload in python/benchmarks/bench_eval_type.py to write each DataFrame as one Arrow IPC stream (multiple batches per stream), matching the real worker wire protocol. num_dfs is now inferred from the group tuple length. Grouped/cogrouped data factories return nested-list shapes accordingly.
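The framing described above can be sketched with a toy protocol. This is a stdlib-only illustration, not the code in bench_eval_type.py. The length-prefixed `write_stream` / `read_stream` helpers are hypothetical stand-ins for a real Arrow IPC stream, and both the big-endian int framing and the use of 0 as the end-of-groups marker are assumptions made for the sketch.

```python
import io
import struct


def write_int(value, out):
    # Assumption: counts are framed as big-endian int32 values.
    out.write(struct.pack("!i", value))


def read_int(inp):
    return struct.unpack("!i", inp.read(4))[0]


def write_stream(batches, out):
    """Toy stand-in for ONE IPC stream: length-prefixed batches, then a 0 end marker.

    Batches must be non-empty, because length 0 is reserved as the end marker.
    """
    for batch in batches:
        write_int(len(batch), out)
        out.write(batch)
    write_int(0, out)


def read_stream(inp):
    batches = []
    while True:
        size = read_int(inp)
        if size == 0:
            return batches
        batches.append(inp.read(size))


def write_grouped(groups, out):
    """groups: list of tuples; each tuple holds one list of batches per DataFrame."""
    for group in groups:
        write_int(len(group), out)  # num_dfs inferred from the tuple length
        for df_batches in group:
            write_stream(df_batches, out)  # one stream per DataFrame, many batches
    write_int(0, out)  # hypothetical "no more groups" marker


def read_grouped(inp):
    groups = []
    while True:
        num_dfs = read_int(inp)
        if num_dfs == 0:
            return groups
        groups.append(tuple(read_stream(inp) for _ in range(num_dfs)))


# One grouped group (1 DataFrame, 2 batches) and one cogrouped group (2 DataFrames).
groups = [([b"a1", b"a2"],), ([b"l1"], [b"r1", b"r2", b"r3"])]
buf = io.BytesIO()
write_grouped(groups, buf)
buf.seek(0)
roundtrip = read_grouped(buf)
```

Because the reader consumes exactly `num_dfs` streams before reading the next count, a writer that emits one stream per batch while declaring a single DataFrame leaves extra streams on the wire, and the next count is read from their bytes.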

Why are the changes needed?

The old writer emitted one stream per RecordBatch while declaring num_dfs=1 upfront. When a group spanned more than one batch, the worker consumed only the first stream and then read the next stream's bytes as num_dfs for the next group. The lg_grp_few_col / lg_grp_many_col scenarios in GroupedMapPandasUDF{Time,Peakmem}Bench (100K rows/group with the default MAX_RECORDS_PER_BATCH=10_000, i.e. about 10 batches per group) hit this:

pyspark.errors.exceptions.base.PySparkValueError:
  [INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP] Invalid number of dataframes in group 1208025088.
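The value in the error message is consistent with stream bytes being misread as a count. As a rough check, and assuming the worker reads num_dfs as a big-endian int32 (an assumption here, not something stated in this PR), the same four bytes can be reinterpreted as little-endian, which is the byte order Arrow IPC uses for its message length prefixes:

```python
import struct

bad_num_dfs = 1208025088

# The four bytes that a big-endian int32 read would have consumed.
raw = struct.pack("!i", bad_num_dfs)

# The same bytes reinterpreted as a little-endian int32.
as_little_endian = struct.unpack("<i", raw)[0]

print(raw.hex(), as_little_endian)  # 48010000 328
```

A value of 328 is a plausible size for an Arrow message metadata block, which fits the explanation that a leftover stream was parsed as the next group's header. This is only a sanity check on the number, not a trace of the actual worker.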

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran all 15 _GroupedMapPandasBenchMixin scenario × UDF combinations, plus one scenario for each of the other grouped/cogrouped bench classes, locally. All pass. The lg_grp_* scenarios fail on master before this patch and pass with it.

Was this patch authored or co-authored using generative AI tooling?

No.

zhengruifeng · 2026-04-24T06:30:36Z

merged to master

Yicong-Huang added 3 commits April 24, 2026 04:49

fix: write grouped benchmark data as one Arrow stream per DataFrame

516a24a

refactor: tighten grouped benchmark type hints (all list[])

015ca9e

refactor: add GroupedBatches type alias for benchmark wire shape

68ad5a2

zhengruifeng approved these changes Apr 24, 2026

zhengruifeng closed this in d0edf1a Apr 24, 2026

Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-55726-followup