[SPARK-46776][PYTHON][FOLLOWUP] Cast large_string/large_binary to the requested type on pyarrow < 19 by Yicong-Huang · Pull Request #56380 · apache/spark

Yicong-Huang · 2026-06-09T00:25:02Z

What changes were proposed in this pull request?

This is a follow-up to #56157 (commit 2f4ed64204e, [SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas), which introduced a regression in the minimum-dependencies CI job (pyarrow==18.0.0).

In create_arrow_array_from_pandas (python/pyspark/sql/pandas/conversion.py), after pa.Array.from_pandas(...), cast the result back to the requested arrow_type only when pyarrow is older than 19.0.0 and the produced type is the large_string/large_binary counterpart of the requested string/binary type. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before.
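The two-guard decision described above can be sketched in plain Python. This is only an illustrative model, not the code in conversion.py: the names `release_tuple`, `needs_cast_back`, and `LARGE_COUNTERPART` are hypothetical, and type names are plain strings standing in for pyarrow DataType objects so the sketch runs without pyarrow installed.

```python
from itertools import takewhile

# Hypothetical mapping: requested 32-bit-offset type -> its 64-bit-offset counterpart.
LARGE_COUNTERPART = {"string": "large_string", "binary": "large_binary"}


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric release components, e.g. '18.0.0' -> (18, 0, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad short versions such as '19' to three components
        parts.append(0)
    return tuple(parts)


def needs_cast_back(pyarrow_version: str, requested: str, produced: str) -> bool:
    """Return True only when both guards hold.

    Guard 1: pyarrow is older than 19.0.0.
    Guard 2: the produced type is exactly the large_* counterpart of the
             requested string/binary type.
    """
    if release_tuple(pyarrow_version) >= (19, 0, 0):
        return False
    return LARGE_COUNTERPART.get(requested) == produced


print(needs_cast_back("18.0.0", "string", "large_string"))  # both guards hold
print(needs_cast_back("19.0.0", "string", "large_string"))  # version guard fails
print(needs_cast_back("18.0.0", "int64", "large_string"))   # type guard fails
```

Keeping the type guard exact means any other mismatch is left alone and surfaces as before, which matches the intent stated above.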

Why are the changes needed?

Since pandas 2.2.0, the string[pyarrow] extension dtype is backed by large_string (64-bit offsets). When such a series is converted via the __arrow_array__ protocol, pyarrow < 19.0.0 ignores the requested type argument, so pa.Array.from_pandas(series, type=pa.string()) returns a large_string array even though string was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data.
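To illustrate why the offset-width mismatch corrupts data without raising an error, here is a simplified, stdlib-only model of a variable-length string column (a values buffer plus an offsets buffer). It is not the JVM reader or pyarrow itself; `read_strings` is a hypothetical helper, and the buffers are assumed to be little-endian.

```python
import struct

# Values buffer and offsets for the two strings ["foo", "bar"].
values = b"foobar"
offsets = [0, 3, 6]

# large_string stores offsets as 64-bit integers (8 bytes each).
large_offsets = struct.pack("<3q", *offsets)


def read_strings(offset_bytes: bytes, fmt_char: str, count: int, data: bytes) -> list:
    """Decode `count` strings, interpreting the offsets with the given width.

    fmt_char is 'q' for 64-bit offsets or 'i' for 32-bit offsets.
    """
    offs = struct.unpack_from(f"<{count + 1}{fmt_char}", offset_bytes)
    return [data[offs[i]:offs[i + 1]].decode() for i in range(count)]


# Reading with the matching 64-bit width recovers the original strings.
correct = read_strings(large_offsets, "q", 2, values)

# Reading the same bytes as 32-bit offsets sees (0, 0, 3): no error is
# raised, but the decoded values are wrong.
misread = read_strings(large_offsets, "i", 2, values)

print(correct)  # ['foo', 'bar']
print(misread)  # ['', 'foo']
```

In this model the 32-bit reader picks up the zeroed upper halves of the 64-bit offsets as if they were offsets of their own, so rows shift and empty values appear without any exception.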

pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (pyarrow==18.0.0) produces correct data instead of corrupt rows.

Does this PR introduce any user-facing change?

Yes. On pyarrow < 19.0.0, createDataFrame from a pandas object containing string[pyarrow]/large_string-backed columns (and the binary equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change.

How was this patch tested?

The existing pyspark.sql.tests.test_conversion suite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch with pyarrow==18.0.0; the pyspark-sql, pyspark-pandas, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810

Was this patch authored or co-authored using generative AI tooling?

No.

Yicong-Huang · 2026-06-09T22:55:12Z

Running CI on my fork.

Closes #56380 from Yicong-Huang/SPARK-46776-followup/fix/cast-large-string-pyarrow18.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit 3b5cbeb)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>

Yicong-Huang · 2026-06-12T19:01:04Z

Thanks, @zhengruifeng and @gaogaotiantian, merged to master/4.x


Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 8394faf to 55708b6 Compare June 9, 2026 00:31

zhengruifeng approved these changes Jun 9, 2026

gaogaotiantian reviewed Jun 9, 2026

Comment thread python/pyspark/sql/pandas/conversion.py

Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 55708b6 to 60ec02a Compare June 9, 2026 22:10

Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 5db9faa to 60ec02a Compare June 10, 2026 23:19

fix: cast large_string/large_binary to requested type on pyarrow < 19

b66cc63

Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 60ec02a to b66cc63 Compare June 11, 2026 22:59

Yicong-Huang closed this in 3b5cbeb Jun 12, 2026

Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-46776-followup/fix/cast-large-string-pyarrow18
