Add randomised fuzz harness for pyarrow UDF vector-copy path #4384

@andygrove

What is the problem the feature request solves?

PR #4234 introduces CometColumnarPythonInput.copyVector, which walks Comet's FieldVector tree in lockstep with a destination IPC root and setBytes-copies each Arrow buffer, then sets value counts bottom-up so offset bytes aren't rewritten. This is exactly the kind of recursive bytewise walk where hand-written cases catch the obvious bugs and miss the long tail.

CometCodegenFuzzSuite (#4267) already does the equivalent job for the codegen path: random schema, random data, run an identity UDF over every column, assert row-equivalence. The same pattern translates directly to the pyarrow UDF path.

Describe the potential solution

A pytest module under spark/src/test/resources/pyspark/ (alongside test_pyarrow_udf.py) that:

  1. Generates random schemas. A type generator that draws from primitives (Long, Int, Short, Byte, Boolean, Float, Double, Date, Timestamp, TimestampNTZ, String, Binary, Decimal with varied precision/scale) and recurses into ArrayType / StructType / MapType with bounded depth. Seeded for reproducibility.
  2. Generates random data. Per column, draw a null fraction from {0, 0.01, 0.5, 0.99, 1.0} and fill the non-null slots with random values matching the type. The length distribution for variable-width and list types should cover both the small and the large extremes.
  3. Writes parquet, reads it back through mapInArrow(passthrough), asserts row-equivalence between accelerated and fallback modes.
  4. Forces multiple Arrow IPC batches per partition by setting spark.sql.execution.arrow.maxRecordsPerBatch low, so the persistent destination IPC root is exercised across batches in every run.
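
Steps 1 and 2 could be sketched roughly as below. This is only an illustration using the standard library: the helper names (`random_type`, `random_schema`, `random_length`, `random_string_column`), the depth limit, the field counts, and the restriction of map keys to a few primitives are all assumptions, not part of #4234 or the existing test suite. The schema is emitted as a Spark DDL string so it can be handed to whatever builds the DataFrame.

```python
import random

# Primitive Spark SQL type names corresponding to the list in step 1.
PRIMITIVES = ["bigint", "int", "smallint", "tinyint", "boolean", "float",
              "double", "date", "timestamp", "timestamp_ntz", "string", "binary"]

# Null fractions from step 2.
NULL_FRACTIONS = (0.0, 0.01, 0.5, 0.99, 1.0)


def random_type(rng, depth=0, max_depth=3):
    """Return a random Spark DDL type string with bounded nesting."""
    # At the depth limit only leaf types are offered, so recursion terminates.
    kinds = ["primitive", "decimal"]
    if depth < max_depth:
        kinds += ["array", "struct", "map"]
    kind = rng.choice(kinds)
    if kind == "primitive":
        return rng.choice(PRIMITIVES)
    if kind == "decimal":
        precision = rng.randint(1, 38)
        scale = rng.randint(0, precision)
        return f"decimal({precision},{scale})"
    if kind == "array":
        return f"array<{random_type(rng, depth + 1, max_depth)}>"
    if kind == "map":
        # Assumption: keys are limited to a few primitives to keep the sketch simple.
        key = rng.choice(["bigint", "int", "string"])
        return f"map<{key},{random_type(rng, depth + 1, max_depth)}>"
    fields = [f"f{i}:{random_type(rng, depth + 1, max_depth)}"
              for i in range(rng.randint(1, 3))]
    return f"struct<{','.join(fields)}>"


def random_schema(seed, num_columns=5, max_depth=3):
    """Build a DDL schema string; the same seed always yields the same schema."""
    rng = random.Random(seed)
    return ", ".join(f"c{i} {random_type(rng, 0, max_depth)}"
                     for i in range(num_columns))


def random_length(rng, large=1000):
    """Bias lengths toward the extremes: empty, single element, short, or large."""
    return rng.choice([0, 1, rng.randint(2, 16), large])


def random_string_column(rng, num_rows, null_fraction):
    """Fill a string column, replacing roughly null_fraction of the rows with None."""
    return [None if rng.random() < null_fraction
            else "".join(rng.choices("abc", k=random_length(rng)))
            for _ in range(num_rows)]
```

Logging the seed on failure would let a failing schema be regenerated with `random_schema(seed)`. Generators for the other types would follow the same shape as `random_string_column`, with `null_fraction` drawn from `NULL_FRACTIONS` per column.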

This closes three of the test gaps from @mbutrovich's review on #4234 (items 1, 3, 4 from the test section) in one move: the harness exercises recursive vector-tree walks, validity-bit handling across null densities, and multiple batches per partition, without needing per-shape hand-written cases.
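
A minimal sketch of the comparison itself, under stated assumptions: `passthrough`, `canonical`, `rows_equivalent`, and `collect_passthrough` are hypothetical helpers, and `accel_conf_key` stands in for whichever setting switches between the accelerated and fallback paths (its actual name is not given here). The only Spark pieces used are `DataFrame.mapInArrow`, `spark.read.parquet`, and the `spark.sql.execution.arrow.maxRecordsPerBatch` setting already mentioned above. The comparison treats NaN as equal to NaN and ignores row order, since random float data and multi-partition reads would otherwise produce spurious failures.

```python
import math


def passthrough(batches):
    """Identity function for mapInArrow: yield every Arrow batch unchanged."""
    for batch in batches:
        yield batch


def canonical(value):
    """Normalise a collected value so equal data always compares equal."""
    if isinstance(value, float) and math.isnan(value):
        return ("float", "nan")  # NaN != NaN in Python, so use a marker instead
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    if isinstance(value, dict):
        pairs = ((canonical(k), canonical(v)) for k, v in value.items())
        return tuple(sorted(pairs, key=repr))
    if isinstance(value, (list, tuple)):  # pyspark Row is a tuple subclass
        return tuple(canonical(v) for v in value)
    return value


def rows_equivalent(left, right):
    """Compare two collections of rows as multisets, ignoring row order."""
    return (sorted(repr(canonical(r)) for r in left)
            == sorted(repr(canonical(r)) for r in right))


def collect_passthrough(spark, parquet_path, accel_conf_key, accelerated,
                        records_per_batch=7):
    """Read the parquet file back through mapInArrow(passthrough) and collect it.

    accel_conf_key is a placeholder for the real accelerated/fallback toggle.
    """
    spark.conf.set(accel_conf_key, str(accelerated).lower())
    # A low batch size forces several Arrow IPC batches per partition.
    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch",
                   str(records_per_batch))
    df = spark.read.parquet(parquet_path)
    return df.mapInArrow(passthrough, df.schema).collect()
```

A test body would then call `collect_passthrough` twice, once per mode, and assert `rows_equivalent` on the two results.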

The targeted hand-written cases already added in PR #4234 (decimal precision sweep, null density sweep, multi-batch, wide schema, mid-stream empty batch, transforming array UDF) stay — they're cheap to keep and they pin the boundaries that fuzz won't find reliably (the precision sweep, for example, deterministically hits the 18/19 boundary that fuzz might happen to skip).
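
To put a rough number on "might happen to skip": assuming, purely for illustration, that the fuzzer draws each decimal precision uniformly from 1 to 38, the chance that a run never lands on precision 18 or 19 can be computed directly. The function name and the uniform-draw assumption are hypothetical.

```python
def miss_probability(num_decimal_columns, boundary=(18, 19), max_precision=38):
    """Chance that no decimal column lands on a boundary precision.

    Assumes each precision is drawn independently and uniformly
    from 1..max_precision.
    """
    per_draw_miss = 1 - len(boundary) / max_precision
    return per_draw_miss ** num_decimal_columns
```

Under that assumption, a run with five decimal columns misses both precisions about 76% of the time (`miss_probability(5)` is roughly 0.76), and hitting both sides of the boundary in the same run is rarer still, which is why the deterministic sweep stays.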

Additional context
