Add randomised fuzz harness for pyarrow UDF vector-copy path

## What is the problem the feature request solves?

PR #4234 introduces `CometColumnarPythonInput.copyVector`, which walks Comet's `FieldVector` tree in lockstep with a destination IPC root and `setBytes`-copies each Arrow buffer, then sets value counts bottom-up so offset bytes aren't rewritten. This is exactly the kind of recursive bytewise walk where hand-written cases catch the obvious bugs and miss the long tail.

`CometCodegenFuzzSuite` (#4267) already does the equivalent job for the codegen path: random schema, random data, run an identity UDF over every column, assert row-equivalence. The same pattern translates directly to the pyarrow UDF path.

## Describe the potential solution

A pytest module under `spark/src/test/resources/pyspark/` (alongside `test_pyarrow_udf.py`) that:

1. **Generates random schemas.** A type generator that draws from primitives (Long, Int, Short, Byte, Boolean, Float, Double, Date, Timestamp, TimestampNTZ, String, Binary, Decimal with varied precision/scale) and recurses into ArrayType / StructType / MapType with bounded depth. Seeded for reproducibility.
2. **Generates random data.** Per column, draw a null fraction from `{0, 0.01, 0.5, 0.99, 1.0}` and fill with random values matching the type. Length distribution for variable-width and list types should cover the small + large extremes.
3. **Writes parquet, reads it back through `mapInArrow(passthrough)`**, asserts row-equivalence between accelerated and fallback modes.
4. **Forces multiple Arrow IPC batches per partition** by setting `spark.sql.execution.arrow.maxRecordsPerBatch` low, so the persistent destination IPC root is exercised across batches in every run.

This closes three of the test gaps from @mbutrovich's review on #4234 (items 1, 3, 4 from the test section) in one move: the harness exercises recursive vector-tree walks, validity-bit handling across null densities, and multi-batch per partition without needing per-shape hand-written cases.

The targeted hand-written cases already added in PR #4234 (decimal precision sweep, null density sweep, multi-batch, wide schema, mid-stream empty batch, transforming array UDF) stay — they're cheap to keep and they pin the boundaries that fuzz won't find reliably (the precision sweep, for example, deterministically hits the 18/19 boundary that fuzz might happen to skip).

## Additional context

- Reference test fixture: `spark/src/test/resources/pyspark/test_pyarrow_udf.py` (PR #4234) — accelerated/fallback parametrisation pattern, jar resolution, session setup all reusable.
- Related: PR #4234 (introduces the operator), #4383 (drops the per-batch buffer copy on the same code path — fuzz harness gains value once that's landed, since the new direct-from-Comet path has its own walk to validate), `CometCodegenFuzzSuite` (#4267, harness shape to borrow from).
- Originally surfaced as the first item in @mbutrovich's "Tests" section on #4234.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add randomised fuzz harness for pyarrow UDF vector-copy path #4384

What is the problem the feature request solves?

Describe the potential solution

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add randomised fuzz harness for pyarrow UDF vector-copy path #4384

Description

What is the problem the feature request solves?

Describe the potential solution

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions