[SPARK-56837][PYTHON][TESTS] Pass ArrowBatchedUDF benchmark input_type via EvalConf #55834
Closed
Yicong-Huang wants to merge 1 commit into
Conversation
zhengruifeng approved these changes on May 13, 2026
zhengruifeng pushed a commit that referenced this pull request on May 13, 2026:
…e via EvalConf

### What changes were proposed in this pull request?

`_ArrowBatchedBenchMixin._write_scenario` in `python/benchmarks/bench_eval_type.py` wrote the `input_type` schema JSON as a length-prefixed UTF-8 string before the UDF payload. This was the old wire-protocol shape. Since [SPARK-56340](https://issues.apache.org/jira/browse/SPARK-56340) (move input_type schema to eval conf), the worker reads `input_type` via `EvalConf` instead, so the extra prefix gets parsed as the UDF count and the worker exits with `UnicodeDecodeError` while reading subsequent UTF-8 fields.

This PR moves the schema to `eval_conf={"input_type": schema.json()}`, matching the pattern already used by `_ArrowTableUDFBenchMixin`.

### Why are the changes needed?

Running any `ArrowBatchedUDFTimeBench` / `ArrowBatchedUDFPeakmemBench` ASV benchmark currently fails with:

```
File "pyspark/worker.py", line 3581, in main
    init_info = WorkerInitInfo.from_stream(infile)
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 353: invalid start byte
```

The bench file is the only `SQL_ARROW_BATCHED_UDF` mock writer in the tree and was missed when the worker protocol changed.

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

Running both bench classes locally now succeeds. Numbers from one run:

```text
=== bench_eval_type.ArrowBatchedUDFTimeBench.time_worker ===
scenario           identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col   44.3+/-0.3ms   46.9+/-0.3ms    45.0+/-0.4ms
sm_batch_many_col  112+/-0.7ms    113+/-1ms       112+/-0.5ms
lg_batch_few_col   106+/-0.7ms    113+/-2ms       106+/-0.4ms
lg_batch_many_col  448+/-1ms      449+/-0.3ms     447+/-3ms
pure_ints          157+/-1ms      162+/-1ms       156+/-2ms
pure_floats        148+/-0.2ms    170+/-1ms       149+/-2ms
pure_strings       302+/-0.5ms    305+/-3ms       295+/-0.7ms
mixed_types        226+/-0.9ms    230+/-1ms       222+/-0.9ms

=== bench_eval_type.ArrowBatchedUDFPeakmemBench.peakmem_worker ===
scenario           identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col   464M           464M            464M
sm_batch_many_col  469M           469M            469M
lg_batch_few_col   469M           470M            469M
lg_batch_many_col  509M           510M            509M
pure_ints          469M           470M            469M
pure_floats        469M           470M            469M
pure_strings       473M           473M            473M
mixed_types        471M           471M            470M
```

Run commands:

```bash
COLUMNS=120 asv run --bench ArrowBatchedUDFTimeBench -a repeat=3 --python=same
COLUMNS=120 asv run --bench ArrowBatchedUDFPeakmemBench -a repeat=3 --python=same
```

Smoke-tested all 40 benchmark classes in the file (every other class still passes; only the two ArrowBatched classes were broken).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #55834 from Yicong-Huang/SPARK-56837/fix/bench-arrow-batched-input-type.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit f40eccd)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
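The misparse described above can be reproduced with plain `struct` framing, independent of Spark. This is a minimal sketch under assumed conventions (4-byte big-endian length prefixes); the helpers `write_utf8`, `read_int`, and `read_utf8` are illustrative stand-ins, not pyspark's actual serializer API:

```python
import struct
from io import BytesIO

def write_utf8(out, s):
    # Length-prefixed UTF-8 string: 4-byte big-endian length, then the bytes.
    b = s.encode("utf-8")
    out.write(struct.pack(">i", len(b)))
    out.write(b)

def read_int(inp):
    return struct.unpack(">i", inp.read(4))[0]

def read_utf8(inp):
    return inp.read(read_int(inp)).decode("utf-8")

schema_json = '{"type":"struct","fields":[]}'

# Old mock-writer shape: a stray framed schema before the real payload.
buf = BytesIO()
write_utf8(buf, schema_json)     # extra prefix the worker no longer expects
buf.write(struct.pack(">i", 1))  # real payload: the UDF count...
buf.write(b"\x80\x04\x95\xa3")   # ...followed by opaque pickled bytes (stand-in)

# A reader that expects [int udf_count][utf8 field ...] misparses the stream:
buf.seek(0)
udf_count = read_int(buf)  # 29 -- the schema's length prefix, not the UDF count
try:
    read_utf8(buf)  # next "length" is 4 bytes of JSON text; the read then
                    # spans the pickle bytes, which are not valid UTF-8
except UnicodeDecodeError as e:
    print("misparse:", e)

# The fix sidesteps framing entirely: ship the schema out of band, e.g.
# eval_conf={"input_type": schema.json()} in the benchmark scenario writer.
```

Because the stray prefix shifts every subsequent read by one frame, the failure surfaces only when the reader finally crosses non-UTF-8 payload bytes, which is why the traceback points at a field read deep inside `WorkerInitInfo.from_stream` rather than at the writer.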
Contributor comment: merged to master/4.x