Skip to content

[SPARK-46776][PYTHON][FOLLOWUP] Cast large_string/large_binary to the requested type on pyarrow < 19#56380

Closed
Yicong-Huang wants to merge 1 commit into
apache:masterfrom
Yicong-Huang:SPARK-46776-followup/fix/cast-large-string-pyarrow18
Closed

[SPARK-46776][PYTHON][FOLLOWUP] Cast large_string/large_binary to the requested type on pyarrow < 19#56380
Yicong-Huang wants to merge 1 commit into
apache:masterfrom
Yicong-Huang:SPARK-46776-followup/fix/cast-large-string-pyarrow18

Conversation

@Yicong-Huang

@Yicong-Huang Yicong-Huang commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This is a follow-up to #56157 (commit 2f4ed64204e, [SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas), which introduced a regression in the minimum-dependencies CI job (pyarrow==18.0.0).

In create_arrow_array_from_pandas (python/pyspark/sql/pandas/conversion.py), after pa.Array.from_pandas(...), cast the result back to the requested arrow_type only when pyarrow is older than 19.0.0 and the produced type is the large_string/large_binary counterpart of the requested string/binary type. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before.

Why are the changes needed?

Since pandas 2.2.0, the string[pyarrow] extension dtype is backed by large_string (64-bit offsets). When such a series is converted via the __arrow_array__ protocol, pyarrow < 19.0.0 ignores the requested type argument, so pa.Array.from_pandas(series, type=pa.string()) returns a large_string array even though string was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data.

pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (pyarrow==18.0.0) produces correct data instead of corrupt rows.

Does this PR introduce any user-facing change?

Yes. On pyarrow < 19.0.0, createDataFrame from a pandas object containing string[pyarrow]/large_string-backed columns (and the binary equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change.

How was this patch tested?

Existing pyspark.sql.tests.test_conversion suite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch with pyarrow==18.0.0; the pyspark-sql, pyspark-pandas, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 8394faf to 55708b6 Compare June 9, 2026 00:31
Comment thread python/pyspark/sql/pandas/conversion.py
@Yicong-Huang Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 55708b6 to 60ec02a Compare June 9, 2026 22:10
@Yicong-Huang

Copy link
Copy Markdown
Contributor Author

Running CI on my fork.

@Yicong-Huang Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 5db9faa to 60ec02a Compare June 10, 2026 23:19
@Yicong-Huang Yicong-Huang force-pushed the SPARK-46776-followup/fix/cast-large-string-pyarrow18 branch from 60ec02a to b66cc63 Compare June 11, 2026 22:59
Yicong-Huang added a commit that referenced this pull request Jun 12, 2026
… requested type on pyarrow < 19

### What changes were proposed in this pull request?

This is a follow-up to #56157 (commit `2f4ed64204e`, `[SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas`), which introduced a regression in the minimum-dependencies CI job (`pyarrow==18.0.0`).

In `create_arrow_array_from_pandas` (`python/pyspark/sql/pandas/conversion.py`), after `pa.Array.from_pandas(...)`, cast the result back to the requested `arrow_type` only when pyarrow is older than 19.0.0 and the produced type is the `large_string`/`large_binary` counterpart of the requested `string`/`binary` type. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before.

### Why are the changes needed?

Since pandas 2.2.0, the `string[pyarrow]` extension dtype is backed by `large_string` (64-bit offsets). When such a series is converted via the `__arrow_array__` protocol, pyarrow < 19.0.0 ignores the requested `type` argument, so `pa.Array.from_pandas(series, type=pa.string())` returns a `large_string` array even though `string` was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data.

pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (`pyarrow==18.0.0`) produces correct data instead of corrupt rows.

### Does this PR introduce _any_ user-facing change?

Yes. On pyarrow < 19.0.0, `createDataFrame` from a pandas object containing `string[pyarrow]`/`large_string`-backed columns (and the `binary` equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change.

### How was this patch tested?

Existing `pyspark.sql.tests.test_conversion` suite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch with `pyarrow==18.0.0`; the `pyspark-sql`, `pyspark-pandas`, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #56380 from Yicong-Huang/SPARK-46776-followup/fix/cast-large-string-pyarrow18.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit 3b5cbeb)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
@Yicong-Huang

Copy link
Copy Markdown
Contributor Author

Thanks, @zhengruifeng and @gaogaotiantian, merged to master/4.x

iemejia pushed a commit to iemejia/spark that referenced this pull request Jun 17, 2026
… requested type on pyarrow < 19

### What changes were proposed in this pull request?

This is a follow-up to apache#56157 (commit `2f4ed64204e`, `[SPARK-46776][PYTHON] Support pa.ChunkedArray columns in createDataFrame from pandas`), which introduced a regression in the minimum-dependencies CI job (`pyarrow==18.0.0`).

In `create_arrow_array_from_pandas` (`python/pyspark/sql/pandas/conversion.py`), after `pa.Array.from_pandas(...)`, cast the result back to the requested `arrow_type` only when pyarrow is older than 19.0.0 and the produced type is the `large_string`/`large_binary` counterpart of the requested `string`/`binary` type. Both guards are required: the version check avoids touching pyarrow >= 19.0.0 (which already honors the requested type), and the exact type check ensures only this offset-width mismatch is corrected, leaving any other type mismatch to surface as before.

### Why are the changes needed?

Since pandas 2.2.0, the `string[pyarrow]` extension dtype is backed by `large_string` (64-bit offsets). When such a series is converted via the `__arrow_array__` protocol, pyarrow < 19.0.0 ignores the requested `type` argument, so `pa.Array.from_pandas(series, type=pa.string())` returns a `large_string` array even though `string` was requested. The JVM then reads the 64-bit offset buffers against the 32-bit schema, silently corrupting the data.

pyarrow 19.0.0 fixed this in the protocol path (apache/arrow#44195) by casting the result to the requested type. This change replicates that behavior for older pyarrow so the minimum-dependencies build (`pyarrow==18.0.0`) produces correct data instead of corrupt rows.

### Does this PR introduce _any_ user-facing change?

Yes. On pyarrow < 19.0.0, `createDataFrame` from a pandas object containing `string[pyarrow]`/`large_string`-backed columns (and the `binary` equivalent) previously produced silently corrupted data; it now produces correct data. On pyarrow >= 19.0.0 there is no behavior change.

### How was this patch tested?

Existing `pyspark.sql.tests.test_conversion` suite passes locally. The regression environment was validated by manually dispatching the "Build / Python-only (Minimum dependencies of PySpark)" workflow on my fork against this branch with `pyarrow==18.0.0`; the `pyspark-sql`, `pyspark-pandas`, and related modules all passed: https://github.com/Yicong-Huang/spark/actions/runs/27244934810

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#56380 from Yicong-Huang/SPARK-46776-followup/fix/cast-large-string-pyarrow18.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants