[SPARK-56929][PYTHON] Pass prefers_large_types when building expected schema for Arrow grouped/cogrouped map UDFs by Yicong-Huang · Pull Request #55961 · apache/spark

Yicong-Huang · 2026-05-18T22:02:04Z

What changes were proposed in this pull request?

Forward prefers_large_types=runner_conf.use_large_var_types when building expected_cols_and_types in python/pyspark/worker.py for SQL_GROUPED_MAP_ARROW_UDF, SQL_GROUPED_MAP_ARROW_ITER_UDF, and SQL_COGROUPED_MAP_ARROW_UDF. The matching arrow_return_type already forwards the flag; the per-field expected schema was missing it.
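As a rough illustration of the change, the sketch below models the old and new code paths with a toy type mapper. `RunnerConf`, `to_arrow_type_name`, and `build_expected_cols_and_types` are hypothetical stand-ins written for this example and are not the real pyspark helpers. Only the names `prefers_large_types` and `use_large_var_types` come from the PR.

```python
from dataclasses import dataclass


@dataclass
class RunnerConf:
    """Hypothetical stand-in for the worker's runner configuration."""
    use_large_var_types: bool = False


def to_arrow_type_name(spark_type: str, prefers_large_types: bool = False) -> str:
    """Toy mapping from a Spark SQL type name to an Arrow type name."""
    if prefers_large_types and spark_type in ("string", "binary"):
        return "large_" + spark_type
    return {"long": "int64"}.get(spark_type, spark_type)


def build_expected_cols_and_types(fields, runner_conf, forward_flag=True):
    """Build (name, arrow type) pairs.

    forward_flag=False models the behavior before the fix, where the
    flag was not passed and the default (False) was used instead.
    """
    prefers = runner_conf.use_large_var_types if forward_flag else False
    return [
        (name, to_arrow_type_name(t, prefers_large_types=prefers))
        for name, t in fields
    ]


fields = [("id", "long"), ("s", "string"), ("b", "binary")]
conf = RunnerConf(use_large_var_types=True)

before_fix = build_expected_cols_and_types(fields, conf, forward_flag=False)
after_fix = build_expected_cols_and_types(fields, conf, forward_flag=True)
print(before_fix)
print(after_fix)
```

With the flag forwarded, the expected per-field types in this toy model line up with the large variants that the return type already uses.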

Why are the changes needed?

With spark.sql.execution.arrow.useLargeVarTypes=true, the result table contains large_string/large_binary (per arrow_return_type) while the expected schema contains plain string/binary, so verify_arrow_result raises a spurious RESULT_COLUMN_TYPES_MISMATCH:

spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", True)
df = spark.createDataFrame([(0, "foo", b"foo")], "id long, s string, b binary")
df.groupBy("id").applyInArrow(lambda t: t, "id long, s string, b binary").collect()
# [RESULT_COLUMN_TYPES_MISMATCH] column 's' (expected string, actual large_string), ...
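The mismatch can be reproduced without a Spark session using a simplified model of the check. The `verify_result_types` function below is a hypothetical sketch, not the real `verify_arrow_result`, and the error text only imitates the message shown above.

```python
def verify_result_types(actual, expected):
    """Toy check: raise if any column's actual type differs from the expected one.

    Both arguments are lists of (column name, Arrow type name) pairs
    in the same column order.
    """
    mismatches = [
        f"column '{name}' (expected {exp}, actual {act})"
        for (name, act), (_, exp) in zip(actual, expected)
        if act != exp
    ]
    if mismatches:
        raise TypeError("[RESULT_COLUMN_TYPES_MISMATCH] " + ", ".join(mismatches))


# Result types when large var types are enabled.
actual = [("id", "int64"), ("s", "large_string"), ("b", "large_binary")]
# Expected schema built without the flag (old behavior) and with it (fixed).
expected_old = [("id", "int64"), ("s", "string"), ("b", "binary")]
expected_fixed = [("id", "int64"), ("s", "large_string"), ("b", "large_binary")]

try:
    verify_result_types(actual, expected_old)
    old_error = None
except TypeError as exc:
    old_error = str(exc)

# Raises nothing once both sides agree on the large variants.
verify_result_types(actual, expected_fixed)
print(old_error)
```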

Pre-requisite for SPARK-56608.

Does this PR introduce any user-facing change?

Yes. applyInArrow (grouped and cogrouped, iterator and non-iterator) no longer raises a spurious RESULT_COLUMN_TYPES_MISMATCH under useLargeVarTypes=true. Default behavior unchanged.

How was this patch tested?

Added test_apply_in_arrow_large_var_types to test_arrow_grouped_map.py and test_arrow_cogrouped_map.py, covering name-based and positional assignment for all three eval types (Spark Connect parity tests pick them up via the mixins). Confirmed the new tests fail on master without the worker.py change and pass with it.

Was this patch authored or co-authored using generative AI tooling?

No

gaogaotiantian · 2026-05-18T22:30:19Z

+        schema = "id long, s string, b binary"
+
+        def func(table):
+            assert table.schema.field("s").type == pa.large_string()


Let's use `is` here.

Yicong-Huang · 2026-05-18

`is` does not work here because the type read from the table is not the same instance as the one returned by the factory function. Changed to use pa.types.is_large_binary/is_large_string per offline discussion.
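The difference between identity, equality, and predicate checks can be shown with a small stand-in class. `ToyType` and the local `is_large_string` below are hypothetical and only mimic the situation described in the reply. They are not pyarrow's classes or its `pa.types.is_large_string`.

```python
class ToyType:
    """Hypothetical stand-in for an Arrow data type wrapper (not pyarrow's class)."""

    def __init__(self, type_id: str):
        self.type_id = type_id

    def __eq__(self, other):
        return isinstance(other, ToyType) and self.type_id == other.type_id

    def __hash__(self):
        return hash(self.type_id)


def is_large_string(t: ToyType) -> bool:
    """Predicate in the spirit of pa.types.is_large_string: inspects the type id."""
    return t.type_id == "large_string"


from_factory = ToyType("large_string")  # as if returned by a factory function
from_schema = ToyType("large_string")   # as if read back from a table schema

same_object = from_schema is from_factory    # identity: two distinct objects
equal_value = from_schema == from_factory    # equality: same logical type
predicate_ok = is_large_string(from_schema)  # predicate: depends only on the type id
print(same_object, equal_value, predicate_ok)
```

A predicate check depends only on the logical type, so it holds regardless of which wrapper object the schema happens to return.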

… schema for Arrow grouped/cogrouped map UDFs

### What changes were proposed in this pull request?

Forward `prefers_large_types=runner_conf.use_large_var_types` when building `expected_cols_and_types` in `python/pyspark/worker.py` for `SQL_GROUPED_MAP_ARROW_UDF`, `SQL_GROUPED_MAP_ARROW_ITER_UDF`, and `SQL_COGROUPED_MAP_ARROW_UDF`. The matching `arrow_return_type` already forwards the flag; the per-field expected schema was missing it.

### Why are the changes needed?

With `spark.sql.execution.arrow.useLargeVarTypes=true`, the result table contains `large_string`/`large_binary` (per `arrow_return_type`) while the expected schema contains plain `string`/`binary`, so `verify_arrow_result` raises a spurious `RESULT_COLUMN_TYPES_MISMATCH`:

```python
spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", True)
df = spark.createDataFrame([(0, "foo", b"foo")], "id long, s string, b binary")
df.groupBy("id").applyInArrow(lambda t: t, "id long, s string, b binary").collect()
# [RESULT_COLUMN_TYPES_MISMATCH] column 's' (expected string, actual large_string), ...
```

Pre-requisite for SPARK-56608.

### Does this PR introduce _any_ user-facing change?

Yes. `applyInArrow` (grouped and cogrouped, iterator and non-iterator) no longer raises a spurious `RESULT_COLUMN_TYPES_MISMATCH` under `useLargeVarTypes=true`. Default behavior unchanged.

### How was this patch tested?

Added `test_apply_in_arrow_large_var_types` to `test_arrow_grouped_map.py` and `test_arrow_cogrouped_map.py`, covering name-based and positional assignment for all three eval types (Spark Connect parity tests pick them up via the mixins). Confirmed the new tests fail on master without the worker.py change and pass with it.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #55961 from Yicong-Huang/fix-arrow-map-large-var-types.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 31fe6dd)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-05-19T01:35:50Z

merged to master/4.x

fix: forward prefers_large_types to expected schema for Arrow grouped…

42a7a02

…/cogrouped map UDFs

gaogaotiantian reviewed May 18, 2026

test: use pa.types.is_large_string/binary predicates for type checks

ceb4caa

gaogaotiantian approved these changes May 18, 2026

zhengruifeng approved these changes May 19, 2026

zhengruifeng closed this in 31fe6dd May 19, 2026

zhengruifeng mentioned this pull request May 19, 2026

[SPARK-56608][PYTHON] Migrate grouped/cogrouped map Arrow UDF verify checks into enforce_schema #55530

Closed

Yicong-Huang wants to merge 2 commits into apache:master from Yicong-Huang:fix-arrow-map-large-var-types
