[SPARK-55989][PS] Preserve non-int64 index dtypes in restore_index #54789

Closed
ueshin wants to merge 2 commits into apache:master from ueshin:issues/SPARK-55989/rangeindex

ueshin (Member) commented Mar 13, 2026

What changes were proposed in this pull request?

This PR updates InternalFrame.restore_index to preserve non-int64 integer index dtypes when a pandas index is restored from columns.

The restored pandas index is currently built with set_index. For sequential integer values, pandas can normalize the result to RangeIndex, which changes the concrete pandas dtype to int64. This can happen even when pandas-on-Spark is already tracking a narrower integer dtype such as int32.

This PR keeps a copy of the original index column values before set_index, and if pandas restores a single-level integer index as RangeIndex, it rebuilds the pandas index as pd.Index(..., dtype=...) using the original dtype metadata for non-int64 integer indexes.
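The rebuild step described above can be sketched in plain pandas. This is only an illustration under stated assumptions, not the code in this PR: the helper name `rebuild_narrow_integer_index` and its arguments are hypothetical, the dtype is assumed to be a NumPy integer dtype, and pandas 2.0 or later is assumed so that `pd.Index` can hold a non-64-bit integer dtype.

```python
import numpy as np
import pandas as pd


def rebuild_narrow_integer_index(restored, original_values, dtype):
    """Hypothetical helper (not the PR's code): undo RangeIndex normalization.

    restored        -- the index pandas produced after set_index
    original_values -- the index column values saved before set_index
    dtype           -- the integer dtype tracked on the pandas-on-Spark side
    """
    dtype = np.dtype(dtype)  # assumption: a NumPy dtype, not an extension dtype
    needs_rebuild = (
        isinstance(restored, pd.RangeIndex)
        and np.issubdtype(dtype, np.integer)
        and dtype != np.dtype("int64")
    )
    if not needs_rebuild:
        # int64 (or a non-RangeIndex result) is already what pandas gave us.
        return restored
    # Rebuild a plain Index from the saved values with the tracked dtype.
    return pd.Index(np.asarray(original_values), dtype=dtype, name=restored.name)


restored = pd.RangeIndex(start=0, stop=3, name="year_offset")
original = np.array([0, 1, 2], dtype="int32")

rebuilt = rebuild_narrow_integer_index(restored, original, "int32")
unchanged = rebuild_narrow_integer_index(restored, original, "int64")
```

In this sketch the int64 case returns the `RangeIndex` untouched, so the rebuild cost is paid only when the tracked dtype is narrower than what `RangeIndex` can represent.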

This PR also removes the default None from the fields argument in restore_index, which matches the current in-repo callers.

Why are the changes needed?

The pandas-on-Spark index can already carry the correct logical dtype, but to_pandas() can still lose it during pandas index restoration.

For example, with pandas 3:

import pandas as pd
import pyspark.pandas as ps

pidx = pd.DatetimeIndex(["2004-01-01", "2002-12-31", "2000-04-01"])
psidx = ps.from_pandas(pidx)

psidx.year.dtype
# dtype('int32')

psidx.year.to_pandas().dtype
# dtype('int64')

In this case, the accessor result has the expected logical dtype on the pandas-on-Spark side, but pandas materialization changes it to RangeIndex, which is always int64.
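The dtype loss can be seen without Spark. The following is a small pandas-only sketch, assuming pandas 2.0 or later (where a plain `pd.Index` may hold int32); the variable names are illustrative only. It shows that a `RangeIndex` and an int32 `Index` can hold the same values while reporting different dtypes.

```python
import numpy as np
import pandas as pd

# RangeIndex stores only start/stop/step and reports int64.
ranged = pd.RangeIndex(start=0, stop=3)

# A plain Index built from an int32 array keeps int32 (assumes pandas >= 2.0).
narrow = pd.Index(np.array([0, 1, 2], dtype="int32"))

same_values = list(ranged) == list(narrow)
same_dtype = ranged.dtype == narrow.dtype
```

Because the values match, a comparison that checks values alone would not notice the change; only a dtype check reveals it.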

Does this PR introduce any user-facing change?

Yes.

For single-level non-int64 integer indexes, to_pandas() can now preserve the original integer index dtype instead of materializing the result as RangeIndex with int64.

How was this patch tested?

./python/run-tests.py --testnames pyspark.pandas.tests.indexes.test_datetime_property

This test passed locally in both pandas 3.0.0 and pandas 2.3.3 environments.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

ueshin (Member, Author) commented Mar 13, 2026

for col, field in zip(pdf.columns, fields):
    pdf[col] = DataTypeOps(field.dtype, field.spark_type).restore(pdf[col])

restored_index_columns = {

Contributor

Since this is meant to fix the RangeIndex issue, can we do the copy only when there is a single index column? I'm mainly concerned about the copies on large MultiIndex DataFrames. It also has the benefit of being simpler.

Contributor

if len(index_columns) == 1:
    original_index_values = pdf[index_columns[0]].copy()
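To make the suggestion concrete, here is a self-contained sketch of gating the copy on a single index column. It is not the merged implementation: the function name `set_index_keeping_dtype` is hypothetical, the rebuild condition is simplified to a dtype mismatch rather than the PR's `RangeIndex` check against field metadata, and pandas 2.0 or later is assumed.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def set_index_keeping_dtype(pdf, index_columns):
    """Hypothetical sketch: copy the index column only for single-level indexes."""
    single_level = len(index_columns) == 1
    # MultiIndex frames skip the copy entirely, which avoids the extra memory.
    original = pdf[index_columns[0]].copy() if single_level else None

    out = pdf.set_index(index_columns)

    if (
        original is not None
        and is_integer_dtype(original.dtype)
        and out.index.dtype != original.dtype
    ):
        # Simplified condition (an assumption): rebuild whenever the dtype drifted.
        out.index = pd.Index(
            original.to_numpy(), dtype=original.dtype, name=out.index.name
        )
    return out


pdf = pd.DataFrame(
    {
        "k": np.array([0, 1, 2], dtype="int32"),
        "g": ["a", "b", "c"],
        "v": [1.0, 2.0, 3.0],
    }
)
single = set_index_keeping_dtype(pdf, ["k"])
multi = set_index_keeping_dtype(pdf, ["k", "g"])
```

The multi-column call never touches the saved-copy path, which is the memory concern raised in the review.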

ueshin (Member, Author)

That makes a lot of sense! Fixed.

devin-petersohn (Contributor) left a comment

LGTM

HyukjinKwon (Member)

Merged to master.
