[SPARK-56312][PYTHON] Refactor SQL_COGROUPED_MAP_ARROW_UDF #55377
Yicong-Huang wants to merge 2 commits into apache:master
Conversation
zhengruifeng left a comment
Same SPARK-55388 effort as #55495, this time migrating SQL_COGROUPED_MAP_ARROW_UDF:

- Drops `wrap_cogrouped_map_arrow_udf` and the bottom-of-`read_udfs` mapper case.
- Switches the serializer from `CogroupArrowUDFSerializer` (which handled struct wrap + by-name reorder) to `ArrowStreamCoGroupSerializer(write_start_stream=True)`.
- Inlines `select_columns` for keys/values, UDF invocation, strict result validation, by-name reorder, and `wrap_struct` inside the new `cogrouped_func`.
- Adds a small `ArrowBatchTransformer.select_columns` helper in conversion.py, and factors `verify_result_type` out of `verify_arrow_table`/`verify_arrow_batch`.
The fix commit (444d6fb) is a worthwhile correction: it walks back from a permissive `verify_result_type(result, pa.Table)` + `enforce_schema` (which silently coerces types) to the original strict `verify_arrow_table` with `expected_cols_and_types`. The "no silent coercion" comment captures the intent well.
Unlike #55495, this PR does not introduce a prefers_large_types behavior change — the path was Arrow-in/Arrow-out before, output types pass through from the user UDF, and the by-name reorder is type-agnostic.
Architecture is clean. Two minor items below.
```python
def cogrouped_func(
    split_index: int,
    data: Iterator[Tuple[list[pa.RecordBatch], list[pa.RecordBatch]]],
) -> Iterator[pa.RecordBatch]:
```
Suggest adding a docstring to match the peer pattern. The Arrow analogue at worker.py:2802 has """Apply groupBy Arrow UDF (non-iterator variant).""", and the new grouped_func in #55495 has a longer one. Without it cogrouped_func reads a bit terse compared to its peers.
Suggested change:

```diff
 def cogrouped_func(
     split_index: int,
     data: Iterator[Tuple[list[pa.RecordBatch], list[pa.RecordBatch]]],
 ) -> Iterator[pa.RecordBatch]:
+    """Apply cogroupBy Arrow UDF."""
```
```python
    ArrowStreamPandasUDTFSerializer,
    GroupPandasUDFSerializer,
    CogroupArrowUDFSerializer,
    ArrowStreamCoGroupSerializer,
```
After this PR lands, `CogroupArrowUDFSerializer` (python/pyspark/sql/pandas/serializers.py:704) loses its only user. Its parent `ArrowStreamGroupUDFSerializer` (line 301) was already orphaned by SPARK-55608; `CogroupArrowUDFSerializer` was its only remaining subclass. After this PR, both classes are unreachable: no imports in worker.py, no other subclasses, no public re-exports in `__init__.py`. Suggest removing both in this PR (or as a small follow-up) so the dead code doesn't accumulate across the SPARK-55388 series.
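Before deleting, a quick reference scan can confirm both classes are truly orphaned. A toy illustration over in-memory stand-ins (not tied to the actual repository layout):

```python
def find_references(name, files):
    # Report lines mentioning `name` outside its own class definition;
    # an empty result suggests the class is safe to delete.
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if name in line and not line.lstrip().startswith(f"class {name}"):
                hits.append((path, lineno, line.strip()))
    return hits

# Toy stand-ins for the real source tree:
files = {
    "serializers.py": (
        "class CogroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):\n"
        "    pass\n"
    ),
    "worker.py": (
        "from pyspark.sql.pandas.serializers import ArrowStreamCoGroupSerializer\n"
    ),
}
orphaned = find_references("CogroupArrowUDFSerializer", files) == []
```

In practice `git grep` over the repository does the same job; the point is checking for uses outside the definition site before removal.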
### What changes were proposed in this pull request?

Refactor `SQL_COGROUPED_MAP_ARROW_UDF` to be self-contained in `read_udfs()`.

### Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in `read_udfs()` improves readability and makes it easier to reason about the data flow for each eval type independently.

### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests. No behavior change.
ASV benchmark comparison (`CogroupedMapArrowUDFTimeBench`, average of 5 runs): 17 of 18 scenarios show improvement or no change. The `few_groups_lg/identity_udf` +12.5% is a benchmark ordering artifact -- when run in isolation (54.25 -> 54.02 ms, -0.4%), no regression is observed. The effect comes from prior scenarios polluting the Python process memory state, which does not occur in production where each Spark task runs in a fresh Python worker.

### Was this patch authored or co-authored using generative AI tooling?
No.