[SPARK-56312][PYTHON] Refactor SQL_COGROUPED_MAP_ARROW_UDF #55377

Open
Yicong-Huang wants to merge 2 commits into apache:master from Yicong-Huang:refactor/cogrouped-map-arrow-udf

@Yicong-Huang Yicong-Huang commented Apr 16, 2026

What changes were proposed in this pull request?

Refactor SQL_COGROUPED_MAP_ARROW_UDF to be self-contained in read_udfs().

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in read_udfs() improves readability and makes it easier to reason about the data flow for each eval type independently.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV benchmark comparison (CogroupedMapArrowUDFTimeBench, average of 5 runs):

scenario             udf                   before (ms)   after (ms)     diff
----------------------------------------------------------------------------
few_groups_sm        identity_udf                12.36         9.91  -19.8%
few_groups_sm        concat_udf                  15.81        12.41  -21.5%
few_groups_sm        left_semi_udf               70.58        67.06   -5.0%
few_groups_lg        identity_udf                58.67        66.00  +12.5%
few_groups_lg        concat_udf                  87.44        84.23   -3.7%
few_groups_lg        left_semi_udf              242.04       226.07   -6.6%
many_groups_sm       identity_udf               393.54       323.73  -17.7%
many_groups_sm       concat_udf                 518.67       413.64  -20.2%
many_groups_sm       left_semi_udf             1581.58      1489.98   -5.8%
many_groups_lg       identity_udf               208.82       184.77  -11.5%
many_groups_lg       concat_udf                 291.11       257.13  -11.7%
many_groups_lg       left_semi_udf              945.73       930.55   -1.6%
wide_values          identity_udf               306.41       293.69   -4.2%
wide_values          concat_udf                 399.99       365.39   -8.7%
wide_values          left_semi_udf              699.24       599.22  -14.3%
multi_key            identity_udf                76.26        71.23   -6.6%
multi_key            concat_udf                 116.50       105.90   -9.1%
multi_key            left_semi_udf              221.64       210.75   -4.9%

The +12.5% on few_groups_lg/identity_udf is a benchmark ordering artifact: run in isolation, the scenario shows no regression (54.25 -> 54.02 ms, -0.4%). The effect comes from earlier scenarios polluting the Python process memory state, which does not occur in production, where each Spark task runs in a fresh Python worker. The other 17 of the 18 scenarios show an improvement.
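
For reference, the diff column and the single-regression tally follow directly from the before/after timings. A minimal sketch of that arithmetic, using a handful of rows copied from the table above (the helper name `pct_diff` is illustrative and not part of the ASV benchmark suite):

```python
def pct_diff(before_ms: float, after_ms: float) -> float:
    """Relative change in percent; negative means the refactored path is faster."""
    return round((after_ms - before_ms) / before_ms * 100, 1)


# (scenario, udf, before_ms, after_ms), copied from the table above.
rows = [
    ("few_groups_sm", "identity_udf", 12.36, 9.91),
    ("few_groups_lg", "identity_udf", 58.67, 66.00),
    ("many_groups_lg", "left_semi_udf", 945.73, 930.55),
    ("wide_values", "left_semi_udf", 699.24, 599.22),
]

diffs = {(scenario, udf): pct_diff(before, after) for scenario, udf, before, after in rows}

# Any positive diff is a slowdown relative to master.
regressions = [key for key, diff in diffs.items() if diff > 0]

# The isolated rerun of the one regressing scenario, as quoted in the text.
isolated = pct_diff(54.25, 54.02)
```

In this sample only few_groups_lg/identity_udf lands in `regressions`, and the isolated rerun rounds to -0.4%, matching the numbers quoted above.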

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang force-pushed the refactor/cogrouped-map-arrow-udf branch 3 times, most recently from 793e8cf to 0e5992c, April 17, 2026 22:05
@Yicong-Huang Yicong-Huang changed the title from "[SPARK-56312][PYTHON] Refactor SQL_COGROUPED_MAP_ARROW_UDF to be self-contained in read_udfs" to "[SPARK-56312][PYTHON] Refactor SQL_COGROUPED_MAP_ARROW_UDF", Apr 22, 2026
@Yicong-Huang Yicong-Huang force-pushed the refactor/cogrouped-map-arrow-udf branch 11 times, most recently from 9ef1a2b to 43b90ad, April 22, 2026 06:51
@Yicong-Huang Yicong-Huang force-pushed the refactor/cogrouped-map-arrow-udf branch from 43b90ad to 4b101f8, April 22, 2026 06:54

@zhengruifeng zhengruifeng left a comment

Same SPARK-55388 effort as #55495, this time migrating SQL_COGROUPED_MAP_ARROW_UDF. Drops wrap_cogrouped_map_arrow_udf and the bottom-of-read_udfs mapper case; switches the serializer from CogroupArrowUDFSerializer (which handled struct wrap + by-name reorder) to ArrowStreamCoGroupSerializer(write_start_stream=True); inlines select_columns for keys/values, UDF invocation, strict result validation, by-name reorder, and wrap_struct inside the new cogrouped_func. A small ArrowBatchTransformer.select_columns helper is added in conversion.py, and verify_result_type is factored out of verify_arrow_table/verify_arrow_batch.

The fix commit (444d6fb) is a worthwhile correction: it walks back from a permissive verify_result_type(result, pa.Table) + enforce_schema (which silently coerces types) to the original strict verify_arrow_table with expected_cols_and_types. The "no silent coercion" comment captures the intent well.

Unlike #55495, this PR does not introduce a prefers_large_types behavior change — the path was Arrow-in/Arrow-out before, output types pass through from the user UDF, and the by-name reorder is type-agnostic.

Architecture is clean. Two minor items below.

Comment thread python/pyspark/worker.py
Comment on lines +2947 to +2950
def cogrouped_func(
    split_index: int,
    data: Iterator[Tuple[list[pa.RecordBatch], list[pa.RecordBatch]]],
) -> Iterator[pa.RecordBatch]:

Suggest adding a docstring to match the peer pattern. The Arrow analogue at worker.py:2802 has """Apply groupBy Arrow UDF (non-iterator variant).""", and the new grouped_func in #55495 has a longer one. Without it cogrouped_func reads a bit terse compared to its peers.

Suggested change
def cogrouped_func(
    split_index: int,
    data: Iterator[Tuple[list[pa.RecordBatch], list[pa.RecordBatch]]],
) -> Iterator[pa.RecordBatch]:
    """Apply cogroupBy Arrow UDF."""

Comment thread python/pyspark/worker.py
      ArrowStreamPandasUDTFSerializer,
      GroupPandasUDFSerializer,
-     CogroupArrowUDFSerializer,
+     ArrowStreamCoGroupSerializer,

After this PR lands, CogroupArrowUDFSerializer (python/pyspark/sql/pandas/serializers.py:704) loses its only user. Its parent ArrowStreamGroupUDFSerializer (line 301) was already orphaned by SPARK-55608 — its only remaining subclass was CogroupArrowUDFSerializer. After this PR, both classes are unreachable: no imports in worker.py, no other subclasses, no public re-exports in __init__.py. Suggest removing both in this PR (or as a small follow-up) so the dead code doesn't accumulate across the SPARK-55388 series.
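
One way to double-check a claim like this before deleting a class is to scan the relevant modules for imports, subclasses, and uses of the name. Below is a minimal sketch using the standard `ast` module. The helper `count_references` and the three inline source snippets are made up for illustration and are not the real contents of serializers.py or worker.py.

```python
import ast


def count_references(name: str, source: str) -> int:
    """Count imports, base-class uses, and other uses of `name` in a module's source."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id == name:
            count += 1  # plain use, including use as a base class
        elif isinstance(node, ast.Attribute) and node.attr == name:
            count += 1  # qualified use such as module.Name
        elif isinstance(node, ast.alias) and node.name.split(".")[-1] == name:
            count += 1  # import or from-import of the name
    return count


# Simplified, hypothetical module contents for illustration only.
serializers_src = """
class ArrowStreamGroupUDFSerializer:
    pass

class CogroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):
    pass
"""

worker_before = """
from serializers import CogroupArrowUDFSerializer
ser = CogroupArrowUDFSerializer()
"""

worker_after = """
from serializers import ArrowStreamCoGroupSerializer
ser = ArrowStreamCoGroupSerializer(write_start_stream=True)
"""

child_uses_before = count_references("CogroupArrowUDFSerializer", worker_before)
child_uses_after = count_references("CogroupArrowUDFSerializer", worker_after)
parent_uses = count_references("ArrowStreamGroupUDFSerializer", serializers_src)
```

In this toy setup the child class drops to zero references once the worker switches serializers, and the parent's only remaining reference is the child's base-class list, so removing the child leaves the parent unreferenced as well.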
