[SPARK-46776][PYTHON][FOLLOWUP] Cast large_string/large_binary to the requested type on pyarrow < 19#56380
Closed
Yicong-Huang wants to merge 1 commit into
Conversation
8394faf to
55708b6
Compare
zhengruifeng
approved these changes
Jun 9, 2026
55708b6 to
60ec02a
Compare
Contributor
Author
|
Running CI on my fork. |
5db9faa to
60ec02a
Compare
60ec02a to
b66cc63
Compare
Yicong-Huang
added a commit
that referenced
this pull request
Jun 12, 2026
… requested type on pyarrow < 19 ### What changes were proposed in this pull request? This is a follow-up to #56157 (commit `2f4ed64204e`, `[SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas`), which introduced a regression in the minimum-dependencies CI job (`pyarrow==18.0.0`). In `create_arrow_array_from_pandas` (`python/pyspark/sql/pandas/conversion.py`), after `pa.Array.from_pandas(...)`, cast the result back to the requested `arrow_type` only when pyarrow is older than 19.0.0 and the produced type is the `large_string`/`large_binary` counterpart of the requested `string`/`binary` type. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before. ### Why are the changes needed? Since pandas 2.2.0, the `string[pyarrow]` extension dtype is backed by `large_string` (64-bit offsets). When such a series is converted via the `__arrow_array__` protocol, pyarrow < 19.0.0 ignores the requested `type` argument, so `pa.Array.from_pandas(series, type=pa.string())` returns a `large_string` array even though `string` was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data. pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (`pyarrow==18.0.0`) produces correct data instead of corrupt rows. ### Does this PR introduce _any_ user-facing change? Yes. On pyarrow < 19.0.0, `createDataFrame` from a pandas object containing `string[pyarrow]`/`large_string`-backed columns (and the `binary` equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change. ### How was this patch tested? Existing `pyspark.sql.tests.test_conversion` suite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch with `pyarrow==18.0.0`; the `pyspark-sql`, `pyspark-pandas`, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #56380 from Yicong-Huang/SPARK-46776-followup/fix/cast-large-string-pyarrow18. Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com> Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com> (cherry picked from commit 3b5cbeb) Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Contributor
Author
|
Thanks, @zhengruifeng and @gaogaotiantian, merged to master/4.x |
iemejia
pushed a commit
to iemejia/spark
that referenced
this pull request
Jun 17, 2026
… requested type on pyarrow < 19 ### What changes were proposed in this pull request? This is a follow-up to apache#56157 (commit `2f4ed64204e`, `[SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas`), which introduced a regression in the minimum-dependencies CI job (`pyarrow==18.0.0`). In `create_arrow_array_from_pandas` (`python/pyspark/sql/pandas/conversion.py`), after `pa.Array.from_pandas(...)`, cast the result back to the requested `arrow_type` only when pyarrow is older than 19.0.0 and the produced type is the `large_string`/`large_binary` counterpart of the requested `string`/`binary` type. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before. ### Why are the changes needed? Since pandas 2.2.0, the `string[pyarrow]` extension dtype is backed by `large_string` (64-bit offsets). When such a series is converted via the `__arrow_array__` protocol, pyarrow < 19.0.0 ignores the requested `type` argument, so `pa.Array.from_pandas(series, type=pa.string())` returns a `large_string` array even though `string` was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data. pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (`pyarrow==18.0.0`) produces correct data instead of corrupt rows. ### Does this PR introduce _any_ user-facing change? Yes. On pyarrow < 19.0.0, `createDataFrame` from a pandas object containing `string[pyarrow]`/`large_string`-backed columns (and the `binary` equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change. ### How was this patch tested? Existing `pyspark.sql.tests.test_conversion` suite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch with `pyarrow==18.0.0`; the `pyspark-sql`, `pyspark-pandas`, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810 ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#56380 from Yicong-Huang/SPARK-46776-followup/fix/cast-large-string-pyarrow18. Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com> Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This is a follow-up to #56157 (commit
2f4ed64204e,[SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas), which introduced a regression in the minimum-dependencies CI job (pyarrow==18.0.0).In
create_arrow_array_from_pandas(python/pyspark/sql/pandas/conversion.py), afterpa.Array.from_pandas(...), cast the result back to the requestedarrow_typeonly when pyarrow is older than 19.0.0 and the produced type is thelarge_string/large_binarycounterpart of the requestedstring/binarytype. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before.Why are the changes needed?
Since pandas 2.2.0, the
string[pyarrow]extension dtype is backed bylarge_string(64-bit offsets). When such a series is converted via the__arrow_array__protocol, pyarrow < 19.0.0 ignores the requestedtypeargument, sopa.Array.from_pandas(series, type=pa.string())returns alarge_stringarray even thoughstringwas requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data.pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (
pyarrow==18.0.0) produces correct data instead of corrupt rows.Does this PR introduce any user-facing change?
Yes. On pyarrow < 19.0.0,
createDataFramefrom a pandas object containingstring[pyarrow]/large_string-backed columns (and thebinaryequivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change.How was this patch tested?
Existing
pyspark.sql.tests.test_conversionsuite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch withpyarrow==18.0.0; thepyspark-sql,pyspark-pandas, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810Was this patch authored or co-authored using generative AI tooling?
No.