What is the problem the feature request solves?
PR #4234 introduces CometColumnarPythonInput.copyVector, which walks Comet's FieldVector tree in lockstep with a destination IPC root and setBytes-copies each Arrow buffer, then sets value counts bottom-up so offset bytes aren't rewritten. This is exactly the kind of recursive bytewise walk where hand-written cases catch the obvious bugs and miss the long tail.
CometCodegenFuzzSuite (#4267) already does the equivalent job for the codegen path: random schema, random data, run an identity UDF over every column, assert row-equivalence. The same pattern translates directly to the pyarrow UDF path.
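As a rough sketch of that pattern on the Python side (not code from either PR): an identity function of the shape mapInArrow expects, plus an order-insensitive row comparison that treats NaN as equal to NaN. The helper names `passthrough`, `canon`, and `rows_equivalent` are hypothetical, and rows are assumed to have been collected into plain tuples (a PySpark Row is a tuple subclass).

```python
import math
from collections import Counter


def passthrough(batches):
    """Identity function for mapInArrow: yield every batch unchanged."""
    for batch in batches:
        yield batch


def canon(value):
    """Convert a collected value to a hashable form where NaN equals NaN."""
    if isinstance(value, float) and math.isnan(value):
        return ("nan",)
    if isinstance(value, dict):
        # Map entries keep their original order; a passthrough should not reorder them.
        return ("map", tuple((canon(k), canon(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return ("seq", tuple(canon(v) for v in value))
    if isinstance(value, (bytes, bytearray)):
        return ("bytes", bytes(value))
    return value


def rows_equivalent(expected, actual):
    """Compare two collections of rows as multisets, ignoring row order."""
    return Counter(canon(r) for r in expected) == Counter(canon(r) for r in actual)
```

In a test, something like `df.mapInArrow(passthrough, df.schema).collect()` would be run once per mode and the two results handed to `rows_equivalent`.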
Describe the potential solution
A pytest module under spark/src/test/resources/pyspark/ (alongside test_pyarrow_udf.py) that:
- Generates random schemas. A type generator that draws from primitives (Long, Int, Short, Byte, Boolean, Float, Double, Date, Timestamp, TimestampNTZ, String, Binary, Decimal with varied precision/scale) and recurses into ArrayType / StructType / MapType with bounded depth. Seeded for reproducibility.
- Generates random data. Per column, draw a null fraction from {0, 0.01, 0.5, 0.99, 1.0} and fill with random values matching the type. The length distribution for variable-width and list types should cover both the small and large extremes.
- Writes parquet, reads it back through mapInArrow(passthrough), and asserts row-equivalence between accelerated and fallback modes.
- Forces multiple Arrow IPC batches per partition by setting spark.sql.execution.arrow.maxRecordsPerBatch low, so the persistent destination IPC root is exercised across batches in every run.
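The schema-generation step could look roughly like the following. This is a minimal sketch, not the proposed module: it emits Spark DDL type strings so it needs only the standard library, and the names `random_type`, `random_schema`, `PRIMITIVES`, and `MAP_KEYS` are hypothetical. Restricting map keys to a few simple types is an assumption made here, not something the proposal specifies.

```python
import random

# DDL spellings of the primitive types listed above.
PRIMITIVES = [
    "bigint", "int", "smallint", "tinyint", "boolean", "float", "double",
    "date", "timestamp", "timestamp_ntz", "string", "binary",
]
MAP_KEYS = ["int", "bigint", "string"]  # assumption: keep map keys simple


def random_decimal(rng):
    """Decimal with varied precision (1..38) and scale (0..precision)."""
    precision = rng.randint(1, 38)
    scale = rng.randint(0, precision)
    return f"decimal({precision},{scale})"


def random_type(rng, depth):
    """Return a DDL type string, nesting at most `depth` levels deep."""
    choices = ["primitive", "decimal"]
    if depth > 0:
        choices += ["array", "struct", "map"]
    kind = rng.choice(choices)
    if kind == "primitive":
        return rng.choice(PRIMITIVES)
    if kind == "decimal":
        return random_decimal(rng)
    if kind == "array":
        return f"array<{random_type(rng, depth - 1)}>"
    if kind == "map":
        return f"map<{rng.choice(MAP_KEYS)},{random_type(rng, depth - 1)}>"
    fields = [f"f{i}:{random_type(rng, depth - 1)}" for i in range(rng.randint(1, 3))]
    return f"struct<{','.join(fields)}>"


def random_schema(seed, num_columns=4, max_depth=2):
    """Seeded, so a failing schema can be regenerated from its seed."""
    rng = random.Random(seed)
    return ", ".join(f"c{i} {random_type(rng, max_depth)}" for i in range(num_columns))
```

Printing the seed on failure is what makes a fuzz failure reproducible; the same seed always yields the same schema string.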
This closes three of the test gaps from @mbutrovich's review on #4234 (items 1, 3, 4 from the test section) in one move: the harness exercises recursive vector-tree walks, validity-bit handling across null densities, and multi-batch per partition without needing per-shape hand-written cases.
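The null-density coverage could be driven by something like the sketch below. It is illustrative only: `random_column`, `random_string`, and the `LENGTHS` values are hypothetical, and the specific lengths chosen for the "small + large extremes" are an assumption.

```python
import random

NULL_FRACTIONS = [0.0, 0.01, 0.5, 0.99, 1.0]
LENGTHS = [0, 1, 2, 1000]  # assumption: stand-ins for the small and large extremes


def random_string(rng):
    """A string whose length is drawn from the extremes in LENGTHS."""
    n = rng.choice(LENGTHS)
    return "".join(rng.choice("abcxyz") for _ in range(n))


def random_column(rng, num_rows, make_value):
    """Draw one null fraction for the column, then fill it row by row."""
    null_fraction = rng.choice(NULL_FRACTIONS)
    values = [
        None if rng.random() < null_fraction else make_value(rng)
        for _ in range(num_rows)
    ]
    return null_fraction, values
```

Because `rng.random()` is always below 1.0 and never below 0.0, the two end points produce all-null and no-null columns exactly. Choosing `num_rows` larger than the spark.sql.execution.arrow.maxRecordsPerBatch setting would make each partition span several batches.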
The targeted hand-written cases already added in PR #4234 (decimal precision sweep, null density sweep, multi-batch, wide schema, mid-stream empty batch, transforming array UDF) stay. They're cheap to keep, and they pin the boundaries that fuzz won't find reliably: the precision sweep, for example, deterministically hits the 18/19 precision boundary that a fuzz run might happen to skip.
Additional context
- spark/src/test/resources/pyspark/test_pyarrow_udf.py (PR #4234, "feat: Add experimental support for accelerated PyArrow UDFs"): the accelerated/fallback parametrisation pattern, jar resolution, and session setup are all reusable.
- CometCodegenFuzzSuite (#4267, "feat(experimental): ScalaUDF and Java UDF support via Janino codegen"): harness shape to borrow from.