[SPARK-55989][PS] Preserve non-int64 index dtypes in restore_index #54789
Closed
ueshin wants to merge 2 commits into apache:master from
Conversation
Member
Author
python/pyspark/pandas/internal.py
Outdated
for col, field in zip(pdf.columns, fields):
    pdf[col] = DataTypeOps(field.dtype, field.spark_type).restore(pdf[col])

restored_index_columns = {
Contributor
Since this is made to fix RangeIndex issue, can we just do the copy when there's only one column? I'm mainly concerned with the copies on large MultiIndex dataframes. It also has the benefit of being simpler.
Contributor
if len(index_columns) == 1:
    original_index_values = pdf[index_columns[0]].copy()
Member
Author
That makes a lot of sense! Fixed.
HyukjinKwon
approved these changes
Mar 15, 2026
Member
Merged to master.
What changes were proposed in this pull request?

This PR updates `InternalFrame.restore_index` to preserve non-`int64` integer index dtypes when a pandas index is restored from columns.

The restored pandas index is currently built with `set_index`. For sequential integer values, pandas can normalize the result to `RangeIndex`, which changes the concrete pandas dtype to `int64`. This can happen even when pandas-on-Spark is already tracking a narrower integer dtype such as `int32`.

This PR keeps a copy of the original index column values before `set_index`, and if pandas restores a single-level integer index as `RangeIndex`, it rebuilds the pandas index as `pd.Index(..., dtype=...)` using the original dtype metadata for non-`int64` integer indexes.

This PR also removes the default `None` from the `fields` argument in `restore_index`, which matches the current in-repo callers.

Why are the changes needed?

The pandas-on-Spark index can already carry the correct logical dtype, but `to_pandas()` can still lose it during pandas index restoration.

For example, with pandas 3:

In this case, the accessor result has the expected logical dtype on the pandas-on-Spark side, but pandas materialization changes it to `RangeIndex`, which is always `int64`.

Does this PR introduce any user-facing change?

Yes.

For single-level non-`int64` integer indexes, `to_pandas()` can now preserve the original integer index dtype instead of materializing the result as `RangeIndex` with `int64`.

How was this patch tested?

This test passed locally in both pandas 3.0.0 and pandas 2.3.3 environments.
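
For readers who want to see the rebuild step from the description in isolation, here is a minimal sketch using plain pandas. It is not the code in `python/pyspark/pandas/internal.py`: the helper name `rebuild_narrow_int_index` and its signature are hypothetical, it handles only NumPy integer dtypes, and it assumes pandas 2.0 or later, where `pd.Index` can hold non-64-bit integer dtypes.

```python
import numpy as np
import pandas as pd


def rebuild_narrow_int_index(index, original_values, dtype):
    """Hypothetical helper: undo a RangeIndex normalization for narrow ints.

    If `index` is a RangeIndex but the tracked dtype is a non-int64 NumPy
    integer dtype, rebuild it from the saved column values; otherwise
    return `index` unchanged.
    """
    dtype = np.dtype(dtype)  # assumption: NumPy dtypes only, no extension dtypes
    is_narrow_int = np.issubdtype(dtype, np.integer) and dtype != np.dtype("int64")
    if isinstance(index, pd.RangeIndex) and is_narrow_int:
        return pd.Index(np.asarray(original_values), dtype=dtype, name=index.name)
    return index


pdf = pd.DataFrame({"id": np.array([0, 1, 2], dtype="int32"), "v": ["a", "b", "c"]})

# Single-level case only: copy the index column before set_index consumes it.
original_values = pdf["id"].copy()
restored = pdf.set_index("id")
restored.index = rebuild_narrow_int_index(restored.index, original_values, "int32")
print(restored.index.dtype)
```

The copy is taken only for the one index column, in line with the review suggestion to avoid extra copies on large MultiIndex frames. When the index is not a `RangeIndex`, or the tracked dtype is `int64`, the helper returns the index unchanged.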
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex GPT-5