[SPARK-55726][PYTHON][TEST][FOLLOW-UP] Write grouped benchmark data as one Arrow IPC stream per DataFrame #55527

Closed
Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-55726-followup


@Yicong-Huang
Contributor

Yicong-Huang commented Apr 24, 2026

What changes were proposed in this pull request?

Fix MockProtocolWriter.write_grouped_data_payload in python/benchmarks/bench_eval_type.py to write each DataFrame as one Arrow IPC stream (multiple batches per stream), matching the real worker wire protocol. num_dfs is now inferred from the group tuple length. Grouped/cogrouped data factories return nested-list shapes accordingly.

Why are the changes needed?

The old writer emitted one Arrow IPC stream per RecordBatch while declaring num_dfs=1 upfront. When a group spanned more than one batch, the worker consumed only the first stream and then read the leading bytes of the next stream as num_dfs for the next group. The lg_grp_few_col / lg_grp_many_col scenarios in GroupedMapPandasUDF{Time,Peakmem}Bench hit this, because 100K rows/group with the default MAX_RECORDS_PER_BATCH=10_000 splits every group into 10 batches:

pyspark.errors.exceptions.base.PySparkValueError:
  [INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP] Invalid number of dataframes in group 1208025088.
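
The desync can be illustrated with a small standard-library model. This is only a sketch, not the code in bench_eval_type.py or the PySpark worker: write_toy_stream / read_toy_stream are hypothetical stand-ins for an Arrow IPC stream (framed payloads followed by an end marker), and write_groups_old / write_groups_fixed are hypothetical names for the two writer behaviours described above. Because the framing is simplified, the bogus num_dfs it produces differs from the value in the real traceback.

```python
import io
import struct

CONTINUATION = b"\xff\xff\xff\xff"


def write_int(value, out):
    # Big-endian 32-bit int, used for the num_dfs prefix and the final 0.
    out.write(struct.pack("!i", value))


def read_int(inp):
    return struct.unpack("!i", inp.read(4))[0]


def write_toy_stream(batches, out):
    # Toy stand-in for one Arrow IPC stream: framed payloads, then an end marker.
    for payload in batches:
        out.write(CONTINUATION + struct.pack("<i", len(payload)) + payload)
    out.write(CONTINUATION + struct.pack("<i", 0))


def read_toy_stream(inp):
    batches = []
    while True:
        if inp.read(4) != CONTINUATION:
            raise ValueError("corrupt toy stream")
        (size,) = struct.unpack("<i", inp.read(4))
        if size == 0:  # end-of-stream marker
            return batches
        batches.append(inp.read(size))


def read_groups(inp):
    # Reader loop: num_dfs, then exactly num_dfs streams, until num_dfs == 0.
    groups = []
    while True:
        num_dfs = read_int(inp)
        if num_dfs == 0:
            return groups
        if num_dfs not in (1, 2):
            raise ValueError(f"Invalid number of dataframes in group {num_dfs}")
        groups.append(tuple(read_toy_stream(inp) for _ in range(num_dfs)))


def write_groups_old(groups, out):
    # Old behaviour: declare num_dfs=1, then emit one stream per batch.
    for (batches,) in groups:
        write_int(1, out)
        for batch in batches:
            write_toy_stream([batch], out)
    write_int(0, out)


def write_groups_fixed(groups, out):
    # Fixed behaviour: num_dfs from the tuple length, one stream per DataFrame.
    for group in groups:
        write_int(len(group), out)
        for batches in group:
            write_toy_stream(batches, out)
    write_int(0, out)


def roundtrip(writer, groups):
    buf = io.BytesIO()
    writer(groups, buf)
    buf.seek(0)
    return read_groups(buf)


# One group holding one DataFrame that is split into two batches.
groups = [([b"batch-0", b"batch-1"],)]
assert roundtrip(write_groups_fixed, groups) == groups

try:
    roundtrip(write_groups_old, groups)
except ValueError as exc:
    print("old writer:", exc)
```

In this model the old writer still round-trips correctly when every group fits in a single batch, which is consistent with only the large-group scenarios failing.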

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran all 15 _GroupedMapPandasBenchMixin scenario x UDF combinations, plus one scenario for each of the other grouped/cogrouped bench classes, locally. All pass. The lg_grp_* scenarios fail on master before this patch and pass after.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng
Contributor

merged to master

2 participants