[SPARK-56716][PYTHON][TESTS][4.x] Add ASV microbenchmark for SQL_ARROW_UDTF #55704
Closed
Yicong-Huang wants to merge 1 commit into
### What changes were proposed in this pull request?

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:

- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:

- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New ASV microbenchmarks.
Two stable runs on the same machine produced consistent numbers (one run shown):

```text
=== bench_eval_type.ArrowUDTFTimeBench.time_worker ===
scenario            udtf            value
sm_batch_few_col    identity_udtf    1.23 ms
sm_batch_few_col    filter_udtf      1.62 ms
sm_batch_few_col    count_udtf       0.97 ms
sm_batch_many_col   identity_udtf    2.23 ms
sm_batch_many_col   filter_udtf      3.13 ms
sm_batch_many_col   count_udtf       1.01 ms
lg_batch_few_col    identity_udtf    1.66 ms
lg_batch_few_col    filter_udtf      3.57 ms
lg_batch_few_col    count_udtf       1.14 ms
lg_batch_many_col   identity_udtf    9.75 ms
lg_batch_many_col   filter_udtf     15.67 ms
lg_batch_many_col   count_udtf       3.06 ms
pure_ints           identity_udtf    5.73 ms
pure_ints           filter_udtf      7.17 ms
pure_ints           count_udtf       2.02 ms
pure_strings        identity_udtf    5.06 ms
pure_strings        filter_udtf      9.55 ms
pure_strings        count_udtf       2.19 ms

=== bench_eval_type.ArrowUDTFPeakmemBench.peakmem_worker ===
scenario            udtf            value
sm_batch_few_col    identity_udtf   442.2 MB
sm_batch_few_col    filter_udtf     443.8 MB
sm_batch_few_col    count_udtf      442.2 MB
sm_batch_many_col   identity_udtf   443.2 MB
sm_batch_many_col   filter_udtf     445.6 MB
sm_batch_many_col   count_udtf      442.6 MB
lg_batch_few_col    identity_udtf   446.0 MB
lg_batch_few_col    filter_udtf     447.7 MB
lg_batch_few_col    count_udtf      443.5 MB
lg_batch_many_col   identity_udtf   458.7 MB
lg_batch_many_col   filter_udtf     460.8 MB
lg_batch_many_col   count_udtf      452.1 MB
pure_ints           identity_udtf   447.5 MB
pure_ints           filter_udtf     449.0 MB
pure_ints           count_udtf      444.8 MB
pure_strings        identity_udtf   451.1 MB
pure_strings        filter_udtf     453.6 MB
pure_strings        count_udtf      445.2 MB
```

Run command:

```bash
COLUMNS=120 asv run --bench ArrowUDTF -a repeat=3 --python=same
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#55691 from Yicong-Huang/SPARK-56716/bench/arrow-udtf.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 257c99d)
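For readers less familiar with ASV, the pairing of a time benchmark and a peak-memory benchmark over a shared mixin can be sketched roughly as follows. This is a minimal toy, not the code in `bench_eval_type.py`: the class names `_ToyWorkerBenchMixin`, `ToyTimeBench`, and `ToyPeakmemBench`, the `_SCENARIOS` table, and the pickled stand-in payload are all assumptions made for illustration. Only the ASV conventions (`params`, `param_names`, `setup`, and the `time_` / `peakmem_` method prefixes) and the method names `time_worker` / `peakmem_worker` come from ASV and from the output shown above.

```python
import pickle


class _ToyWorkerBenchMixin:
    # ASV runs every combination of these parameter lists and passes one
    # value per entry of `param_names` to setup() and to each benchmark.
    params = [
        ["sm_batch_few_col", "lg_batch_many_col"],
        ["identity_udtf", "count_udtf"],
    ]
    param_names = ["scenario", "udtf"]

    # Hypothetical scenario table: (rows per batch, number of columns).
    _SCENARIOS = {
        "sm_batch_few_col": (100, 3),
        "lg_batch_many_col": (1000, 20),
    }

    def setup(self, scenario, udtf):
        # Build the input once, outside the measured region.
        rows, cols = self._SCENARIOS[scenario]
        # Stand-in for the serialized worker input (assumption: the real
        # benchmark writes an Arrow stream, not a pickle).
        self._payload = pickle.dumps([[i] * cols for i in range(rows)])

    def _run_worker(self, scenario, udtf):
        # Stand-in for driving the Python worker over the payload.
        data = pickle.loads(self._payload)
        if udtf == "count_udtf":
            return len(data)
        return data


class ToyTimeBench(_ToyWorkerBenchMixin):
    # Methods prefixed with `time_` are measured for wall-clock time.
    def time_worker(self, scenario, udtf):
        self._run_worker(scenario, udtf)


class ToyPeakmemBench(_ToyWorkerBenchMixin):
    # Methods prefixed with `peakmem_` are measured for peak process memory.
    def peakmem_worker(self, scenario, udtf):
        self._run_worker(scenario, udtf)
```

Sharing the body through a mixin keeps the timed and memory-measured code paths identical, so differences between the two tables reflect the metric rather than the workload.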
zhengruifeng pushed a commit that referenced this pull request on May 6, 2026
…W_UDTF

### What changes were proposed in this pull request?

Backport of #55691 to `branch-4.x`.

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:

- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:

- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Clean cherry-pick from master (commit 257c99d). New ASV microbenchmarks; same as the master PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #55704 from Yicong-Huang/SPARK-56716-4.x.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
Contributor

thanks, merged to 4.x
### What changes were proposed in this pull request?

Backport of #55691 to `branch-4.x`.

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:

- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:

- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Clean cherry-pick from master (commit 257c99d). New ASV microbenchmarks; same as the master PR.

### Was this patch authored or co-authored using generative AI tooling?

No.