[SPARK-56716][PYTHON][TESTS][4.x] Add ASV microbenchmark for SQL_ARROW_UDTF by Yicong-Huang · Pull Request #55704 · apache/spark

Yicong-Huang · 2026-05-06T09:05:06Z

What changes were proposed in this pull request?

Backport of #55691 to branch-4.x.

This PR adds ASV microbenchmarks for SQL_ARROW_UDTF (PyArrow-native Python UDTFs created via arrow_udtf) to python/benchmarks/bench_eval_type.py.

The new _ArrowUDTFBenchMixin produces ArrowUDTFTimeBench and ArrowUDTFPeakmemBench, parametrized by scenario (batch size, column count, type pool) and three handler variants:

identity_udtf - yields the input batch as a pa.Table
filter_udtf - keeps rows whose first column is non-null (vectorized)
count_udtf - aggregates each batch into a single-row count table
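As a rough illustration of how one mixin can yield both a timing and a peak-memory benchmark, the sketch below follows ASV's documented conventions (`params`, `param_names`, `setup`, and the `time_`/`peakmem_` method prefixes). The scenario sizes and the `_run_worker` placeholder are assumptions made up for this example; the real definitions live in `bench_eval_type.py` and are not shown in this PR text.

```python
# Hypothetical scenario table: name -> (rows per batch, number of columns).
# The scenario names come from the PR's results; the sizes are invented here.
_SCENARIOS = {
    "sm_batch_few_col": (1_000, 5),
    "lg_batch_many_col": (10_000, 50),
}
_UDTFS = ["identity_udtf", "filter_udtf", "count_udtf"]


class _ArrowUDTFBenchMixin:
    # ASV runs the benchmark once per combination of these parameter lists.
    params = [list(_SCENARIOS), _UDTFS]
    param_names = ["scenario", "udtf"]

    def setup(self, scenario, udtf):
        # ASV calls setup() before measuring, so building the input is not timed.
        rows, cols = _SCENARIOS[scenario]
        # Placeholder for the serialized worker input (assumption).
        self._payload = bytes(rows * cols)

    def _run_worker(self):
        # Placeholder for feeding the payload through the Python worker (assumption).
        return len(self._payload)


class ArrowUDTFTimeBench(_ArrowUDTFBenchMixin):
    def time_worker(self, scenario, udtf):
        self._run_worker()


class ArrowUDTFPeakmemBench(_ArrowUDTFBenchMixin):
    def peakmem_worker(self, scenario, udtf):
        self._run_worker()
```

Keeping the parameters and setup in the mixin means the time and peak-memory classes measure exactly the same workload and differ only in the metric ASV collects.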

To support this, two helpers are added to MockProtocolWriter:

write_worker_input gains an optional eval_conf parameter (UDTF needs table_arg_offsets in EvalConf)
write_arrow_udtf_payload mirrors PythonUDTFRunner.writeUDTF on the JVM side: argument offsets, partition-child indexes, optional pickled AnalyzeResult, the cloudpickled UDTF class, the result schema, and the UDTF name
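To make the field order above concrete, here is a minimal sketch of a writer and a matching reader. Only the field order (taken from the list above) and the big-endian 4-byte integers (what Java's `DataOutputStream.writeInt` emits) are grounded; the length-prefixed framing, the one-byte presence flag for the `AnalyzeResult`, and the function names `sketch_write_udtf_payload` and `sketch_read_udtf_payload` are assumptions for illustration, not the actual Spark wire format.

```python
import io
import struct
from typing import BinaryIO, List, Optional


def _write_int(out: BinaryIO, value: int) -> None:
    # 4-byte big-endian signed int, as Java's DataOutputStream.writeInt writes it.
    out.write(struct.pack(">i", value))


def _write_blob(out: BinaryIO, blob: bytes) -> None:
    # Assumed framing: length prefix followed by the raw bytes.
    _write_int(out, len(blob))
    out.write(blob)


def sketch_write_udtf_payload(
    out: BinaryIO,
    arg_offsets: List[int],
    partition_child_indexes: List[int],
    analyze_result: Optional[bytes],  # already-pickled AnalyzeResult, or None
    udtf_blob: bytes,                 # already-cloudpickled UDTF class
    schema_json: str,
    name: str,
) -> None:
    _write_int(out, len(arg_offsets))
    for offset in arg_offsets:
        _write_int(out, offset)
    _write_int(out, len(partition_child_indexes))
    for index in partition_child_indexes:
        _write_int(out, index)
    # Assumed encoding of "optional": a one-byte flag, then the blob if present.
    out.write(struct.pack(">?", analyze_result is not None))
    if analyze_result is not None:
        _write_blob(out, analyze_result)
    _write_blob(out, udtf_blob)
    _write_blob(out, schema_json.encode("utf-8"))
    _write_blob(out, name.encode("utf-8"))


def sketch_read_udtf_payload(data: bytes) -> dict:
    inp = io.BytesIO(data)

    def read_int() -> int:
        return struct.unpack(">i", inp.read(4))[0]

    def read_blob() -> bytes:
        size = read_int()
        return inp.read(size)

    n_args = read_int()
    arg_offsets = [read_int() for _ in range(n_args)]
    n_parts = read_int()
    partition_child_indexes = [read_int() for _ in range(n_parts)]
    has_analyze = struct.unpack(">?", inp.read(1))[0]
    analyze_result = read_blob() if has_analyze else None
    return {
        "arg_offsets": arg_offsets,
        "partition_child_indexes": partition_child_indexes,
        "analyze_result": analyze_result,
        "udtf_blob": read_blob(),
        "schema_json": read_blob().decode("utf-8"),
        "name": read_blob().decode("utf-8"),
    }


buf = io.BytesIO()
sketch_write_udtf_payload(
    buf, [0], [], None, b"<cloudpickled UDTF class>",
    '{"type":"struct","fields":[]}', "identity_udtf",
)
decoded = sketch_read_udtf_payload(buf.getvalue())
```

Round-tripping through a reader like this is a cheap way to check that a mock writer and the consumer agree on field order before running a full benchmark.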

The wire batch carries one struct column _0 whose fields are the table's schema; table_arg_offsets=[0] tells ArrowStreamArrowUDTFSerializer to flatten that struct into a pa.RecordBatch for the UDTF's eval(batch) method.

Why are the changes needed?

This is part of SPARK-55724 (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for SQL_ARROW_UDTF is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Clean cherry-pick from master (commit 257c99d). New ASV microbenchmarks; same as the master PR.

Was this patch authored or co-authored using generative AI tooling?

No.

### What changes were proposed in this pull request?

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:

- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:

- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New ASV microbenchmarks. Two stable runs on the same machine produced consistent numbers (one run shown):

```text
=== bench_eval_type.ArrowUDTFTimeBench.time_worker ===
scenario            udtf            value
sm_batch_few_col    identity_udtf   1.23 ms
sm_batch_few_col    filter_udtf     1.62 ms
sm_batch_few_col    count_udtf      0.97 ms
sm_batch_many_col   identity_udtf   2.23 ms
sm_batch_many_col   filter_udtf     3.13 ms
sm_batch_many_col   count_udtf      1.01 ms
lg_batch_few_col    identity_udtf   1.66 ms
lg_batch_few_col    filter_udtf     3.57 ms
lg_batch_few_col    count_udtf      1.14 ms
lg_batch_many_col   identity_udtf   9.75 ms
lg_batch_many_col   filter_udtf     15.67 ms
lg_batch_many_col   count_udtf      3.06 ms
pure_ints           identity_udtf   5.73 ms
pure_ints           filter_udtf     7.17 ms
pure_ints           count_udtf      2.02 ms
pure_strings        identity_udtf   5.06 ms
pure_strings        filter_udtf     9.55 ms
pure_strings        count_udtf      2.19 ms

=== bench_eval_type.ArrowUDTFPeakmemBench.peakmem_worker ===
scenario            udtf            value
sm_batch_few_col    identity_udtf   442.2 MB
sm_batch_few_col    filter_udtf     443.8 MB
sm_batch_few_col    count_udtf      442.2 MB
sm_batch_many_col   identity_udtf   443.2 MB
sm_batch_many_col   filter_udtf     445.6 MB
sm_batch_many_col   count_udtf      442.6 MB
lg_batch_few_col    identity_udtf   446.0 MB
lg_batch_few_col    filter_udtf     447.7 MB
lg_batch_few_col    count_udtf      443.5 MB
lg_batch_many_col   identity_udtf   458.7 MB
lg_batch_many_col   filter_udtf     460.8 MB
lg_batch_many_col   count_udtf      452.1 MB
pure_ints           identity_udtf   447.5 MB
pure_ints           filter_udtf     449.0 MB
pure_ints           count_udtf      444.8 MB
pure_strings        identity_udtf   451.1 MB
pure_strings        filter_udtf     453.6 MB
pure_strings        count_udtf      445.2 MB
```

Run command:

```bash
COLUMNS=120 asv run --bench ArrowUDTF -a repeat=3 --python=same
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#55691 from Yicong-Huang/SPARK-56716/bench/arrow-udtf.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 257c99d)

…W_UDTF

### What changes were proposed in this pull request?

Backport of #55691 to `branch-4.x`.

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:

- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:

- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Clean cherry-pick from master (commit 257c99d). New ASV microbenchmarks; same as the master PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #55704 from Yicong-Huang/SPARK-56716-4.x.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-05-06T10:03:52Z

thanks, merged to 4.x

Yicong-Huang mentioned this pull request May 6, 2026

[SPARK-56716][PYTHON][TESTS] Add ASV microbenchmark for SQL_ARROW_UDTF #55691

Closed

zhengruifeng closed this May 6, 2026

Yicong-Huang wants to merge 1 commit into apache:branch-4.x from Yicong-Huang:SPARK-56716-4.x
