[SPARK-56716][PYTHON][TESTS][4.x] Add ASV microbenchmark for SQL_ARROW_UDTF #55704

Closed
Yicong-Huang wants to merge 1 commit into apache:branch-4.x from Yicong-Huang:SPARK-56716-4.x

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

Backport of #55691 to branch-4.x.

This PR adds ASV microbenchmarks for SQL_ARROW_UDTF (PyArrow-native Python UDTFs created via arrow_udtf) to python/benchmarks/bench_eval_type.py.

The new _ArrowUDTFBenchMixin produces ArrowUDTFTimeBench and ArrowUDTFPeakmemBench, parametrized by scenario (batch size, column count, type pool) and three handler variants:

  • identity_udtf - yields the input batch as a pa.Table
  • filter_udtf - keeps rows whose first column is non-null (vectorized)
  • count_udtf - aggregates each batch into a single-row count table

To support this, MockProtocolWriter is extended in two places (one existing helper is changed and one new helper is added):

  • write_worker_input gains an optional eval_conf parameter (UDTF needs table_arg_offsets in EvalConf)
  • write_arrow_udtf_payload mirrors PythonUDTFRunner.writeUDTF on the JVM side: argument offsets, partition-child indexes, optional pickled AnalyzeResult, the cloudpickled UDTF class, the result schema, and the UDTF name
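
The field order listed for `write_arrow_udtf_payload` can be made concrete with a small standard-library sketch. The framing used here (big-endian 4-byte ints, a one-byte boolean flag, length-prefixed byte strings) and the function name `write_udtf_payload_sketch` are assumptions for illustration. The real helper and `PythonUDTFRunner.writeUDTF` may write additional or differently encoded fields.

```python
import io
import json
import struct
from typing import BinaryIO, Optional, Sequence


def write_int(value: int, out: BinaryIO) -> None:
    """Write a 4-byte big-endian signed int (assumed framing)."""
    out.write(struct.pack("!i", value))


def write_bytes(data: bytes, out: BinaryIO) -> None:
    """Write a length prefix followed by the raw bytes (assumed framing)."""
    write_int(len(data), out)
    out.write(data)


def write_udtf_payload_sketch(
    out: BinaryIO,
    arg_offsets: Sequence[int],
    partition_child_indexes: Sequence[int],
    pickled_analyze_result: Optional[bytes],
    pickled_udtf: bytes,
    result_schema_json: str,
    udtf_name: str,
) -> None:
    # 1. Argument offsets.
    write_int(len(arg_offsets), out)
    for offset in arg_offsets:
        write_int(offset, out)
    # 2. Partition-child indexes.
    write_int(len(partition_child_indexes), out)
    for index in partition_child_indexes:
        write_int(index, out)
    # 3. Optional pickled AnalyzeResult, preceded by a presence flag.
    out.write(struct.pack("!?", pickled_analyze_result is not None))
    if pickled_analyze_result is not None:
        write_bytes(pickled_analyze_result, out)
    # 4. The serialized UDTF class (cloudpickled in the real benchmark).
    write_bytes(pickled_udtf, out)
    # 5. Result schema, then 6. the UDTF name, both as UTF-8.
    write_bytes(result_schema_json.encode("utf-8"), out)
    write_bytes(udtf_name.encode("utf-8"), out)


buf = io.BytesIO()
write_udtf_payload_sketch(
    buf,
    arg_offsets=[0],
    partition_child_indexes=[],
    pickled_analyze_result=None,
    pickled_udtf=b"<serialized UDTF class>",  # placeholder bytes, not a real pickle
    result_schema_json=json.dumps({"type": "struct", "fields": []}),
    udtf_name="identity_udtf",
)
payload = buf.getvalue()
```

Writing into an in-memory buffer like this lets a benchmark prepare the worker input once in `setup` and replay it on every timed iteration.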

The wire batch carries one struct column _0 whose fields are the table's schema; table_arg_offsets=[0] tells ArrowStreamArrowUDTFSerializer to flatten that struct into a pa.RecordBatch for the UDTF's eval(batch) method.

Why are the changes needed?

This is part of SPARK-55724 (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for SQL_ARROW_UDTF is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This is a clean cherry-pick of commit 257c99d from master. It adds only the new ASV microbenchmarks, identical to those in the master PR (#55691).

Was this patch authored or co-authored using generative AI tooling?

No.

### What changes were proposed in this pull request?

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:
- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:
- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New ASV microbenchmarks. Two stable runs on the same machine produced consistent numbers (one run shown):

```text
=== bench_eval_type.ArrowUDTFTimeBench.time_worker ===
scenario            udtf                   value
sm_batch_few_col    identity_udtf        1.23 ms
sm_batch_few_col    filter_udtf          1.62 ms
sm_batch_few_col    count_udtf           0.97 ms
sm_batch_many_col   identity_udtf        2.23 ms
sm_batch_many_col   filter_udtf          3.13 ms
sm_batch_many_col   count_udtf           1.01 ms
lg_batch_few_col    identity_udtf        1.66 ms
lg_batch_few_col    filter_udtf          3.57 ms
lg_batch_few_col    count_udtf           1.14 ms
lg_batch_many_col   identity_udtf        9.75 ms
lg_batch_many_col   filter_udtf         15.67 ms
lg_batch_many_col   count_udtf           3.06 ms
pure_ints           identity_udtf        5.73 ms
pure_ints           filter_udtf          7.17 ms
pure_ints           count_udtf           2.02 ms
pure_strings        identity_udtf        5.06 ms
pure_strings        filter_udtf          9.55 ms
pure_strings        count_udtf           2.19 ms

=== bench_eval_type.ArrowUDTFPeakmemBench.peakmem_worker ===
scenario            udtf                   value
sm_batch_few_col    identity_udtf       442.2 MB
sm_batch_few_col    filter_udtf         443.8 MB
sm_batch_few_col    count_udtf          442.2 MB
sm_batch_many_col   identity_udtf       443.2 MB
sm_batch_many_col   filter_udtf         445.6 MB
sm_batch_many_col   count_udtf          442.6 MB
lg_batch_few_col    identity_udtf       446.0 MB
lg_batch_few_col    filter_udtf         447.7 MB
lg_batch_few_col    count_udtf          443.5 MB
lg_batch_many_col   identity_udtf       458.7 MB
lg_batch_many_col   filter_udtf         460.8 MB
lg_batch_many_col   count_udtf          452.1 MB
pure_ints           identity_udtf       447.5 MB
pure_ints           filter_udtf         449.0 MB
pure_ints           count_udtf          444.8 MB
pure_strings        identity_udtf       451.1 MB
pure_strings        filter_udtf         453.6 MB
pure_strings        count_udtf          445.2 MB
```

Run command:

```bash
COLUMNS=120 asv run --bench ArrowUDTF -a repeat=3 --python=same
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#55691 from Yicong-Huang/SPARK-56716/bench/arrow-udtf.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 257c99d)
zhengruifeng pushed a commit that referenced this pull request May 6, 2026
…W_UDTF

Closes #55704 from Yicong-Huang/SPARK-56716-4.x.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Contributor

thanks, merged to 4.x
