Skip to content

[SPARK-56929][PYTHON] Pass prefers_large_types when building expected schema for Arrow grouped/cogrouped map UDFs#55961

Closed
Yicong-Huang wants to merge 2 commits into
apache:masterfrom
Yicong-Huang:fix-arrow-map-large-var-types
Closed

[SPARK-56929][PYTHON] Pass prefers_large_types when building expected schema for Arrow grouped/cogrouped map UDFs#55961
Yicong-Huang wants to merge 2 commits into
apache:masterfrom
Yicong-Huang:fix-arrow-map-large-var-types

Conversation

@Yicong-Huang
Copy link
Copy Markdown
Contributor

@Yicong-Huang Yicong-Huang commented May 18, 2026

What changes were proposed in this pull request?

Forward prefers_large_types=runner_conf.use_large_var_types when building expected_cols_and_types in python/pyspark/worker.py for SQL_GROUPED_MAP_ARROW_UDF, SQL_GROUPED_MAP_ARROW_ITER_UDF, and SQL_COGROUPED_MAP_ARROW_UDF. The matching arrow_return_type already forwards the flag; the per-field expected schema was missing it.

Why are the changes needed?

With spark.sql.execution.arrow.useLargeVarTypes=true, the result table contains large_string/large_binary (per arrow_return_type) while the expected schema contains plain string/binary, so verify_arrow_result raises a spurious RESULT_COLUMN_TYPES_MISMATCH:

spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", True)
df = spark.createDataFrame([(0, "foo", b"foo")], "id long, s string, b binary")
df.groupBy("id").applyInArrow(lambda t: t, "id long, s string, b binary").collect()
# [RESULT_COLUMN_TYPES_MISMATCH] column 's' (expected string, actual large_string), ...

Pre-requisite for SPARK-56608.

Does this PR introduce any user-facing change?

Yes. applyInArrow (grouped and cogrouped, iterator and non-iterator) no longer raises a spurious RESULT_COLUMN_TYPES_MISMATCH under useLargeVarTypes=true. Default behavior unchanged.

How was this patch tested?

Added test_apply_in_arrow_large_var_types to test_arrow_grouped_map.py and test_arrow_cogrouped_map.py, covering name-based and positional assignment for all three eval types (Spark Connect parity tests pick them up via the mixins). Confirmed the new tests fail on master without the worker.py change and pass with it.

Was this patch authored or co-authored using generative AI tooling?

No

schema = "id long, s string, b binary"

def func(table):
assert table.schema.field("s").type == pa.large_string()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use is here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is does not work here as the type from table is not the same instance from the factory function. changed to use pa.types.is_large_binary/is_large_string per offline discussion.

zhengruifeng pushed a commit that referenced this pull request May 19, 2026
… schema for Arrow grouped/cogrouped map UDFs

### What changes were proposed in this pull request?

Forward `prefers_large_types=runner_conf.use_large_var_types` when building `expected_cols_and_types` in `python/pyspark/worker.py` for `SQL_GROUPED_MAP_ARROW_UDF`, `SQL_GROUPED_MAP_ARROW_ITER_UDF`, and `SQL_COGROUPED_MAP_ARROW_UDF`. The matching `arrow_return_type` already forwards the flag; the per-field expected schema was missing it.

### Why are the changes needed?

With `spark.sql.execution.arrow.useLargeVarTypes=true`, the result table contains `large_string`/`large_binary` (per `arrow_return_type`) while the expected schema contains plain `string`/`binary`, so `verify_arrow_result` raises a spurious `RESULT_COLUMN_TYPES_MISMATCH`:

```python
spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", True)
df = spark.createDataFrame([(0, "foo", b"foo")], "id long, s string, b binary")
df.groupBy("id").applyInArrow(lambda t: t, "id long, s string, b binary").collect()
# [RESULT_COLUMN_TYPES_MISMATCH] column 's' (expected string, actual large_string), ...
```

Pre-requisite for SPARK-56608.

### Does this PR introduce _any_ user-facing change?

Yes. `applyInArrow` (grouped and cogrouped, iterator and non-iterator) no longer raises a spurious `RESULT_COLUMN_TYPES_MISMATCH` under `useLargeVarTypes=true`. Default behavior unchanged.

### How was this patch tested?

Added `test_apply_in_arrow_large_var_types` to `test_arrow_grouped_map.py` and `test_arrow_cogrouped_map.py`, covering name-based and positional assignment for all three eval types (Spark Connect parity tests pick them up via the mixins). Confirmed the new tests fail on master without the worker.py change and pass with it.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #55961 from Yicong-Huang/fix-arrow-map-large-var-types.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 31fe6dd)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Copy link
Copy Markdown
Contributor

merged to master/4.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants