[SPARK-55989][PS] Preserve non-int64 index dtypes in `restore_index` by ueshin · Pull Request #54789 · apache/spark

ueshin · 2026-03-13T21:52:44Z

What changes were proposed in this pull request?

This PR updates InternalFrame.restore_index to preserve non-int64 integer index dtypes when a pandas index is restored from columns.

The restored pandas index is currently built with set_index. For sequential integer values, pandas can normalize the result to RangeIndex, which changes the concrete pandas dtype to int64. This can happen even when pandas-on-Spark is already tracking a narrower integer dtype such as int32.

This PR keeps a copy of the original index column values before calling set_index. If pandas restores a single-level integer index as RangeIndex and the tracked dtype is a non-int64 integer dtype, it rebuilds the pandas index as pd.Index(..., dtype=...) from that copy, using the original dtype metadata.
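The rebuild step described above could look roughly like the following pandas-only sketch. The helper name `restore_index_preserving_dtype` and its parameters are hypothetical and are not the actual `InternalFrame.restore_index` code; the sketch also assumes pandas 2.0 or later, where `pd.Index` can hold narrow NumPy integer dtypes.

```python
import numpy as np
import pandas as pd


def restore_index_preserving_dtype(pdf, index_column, tracked_dtype):
    # Hypothetical helper for illustration; not the real InternalFrame.restore_index.
    tracked_dtype = np.dtype(tracked_dtype)

    # Keep the original values, since set_index may normalize the index.
    original_values = pdf[index_column].copy()
    restored = pdf.set_index(index_column)

    # Rebuild only when pandas produced a RangeIndex and the tracked dtype
    # is an integer dtype other than int64.
    needs_rebuild = (
        isinstance(restored.index, pd.RangeIndex)
        and tracked_dtype.kind in "iu"
        and tracked_dtype != np.dtype("int64")
    )
    if needs_rebuild:
        restored.index = pd.Index(
            original_values.to_numpy(), dtype=tracked_dtype, name=index_column
        )
    return restored


pdf = pd.DataFrame(
    {"year": np.array([0, 1, 2], dtype="int32"), "value": [10, 20, 30]}
)
restored = restore_index_preserving_dtype(pdf, "year", "int32")
print(type(restored.index).__name__, restored.index.dtype)
```

Whether `set_index` actually returns a RangeIndex here depends on the pandas version, so the sketch checks the restored index type instead of assuming it.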

This PR also removes the default None from the fields argument in restore_index, which matches the current in-repo callers.

Why are the changes needed?

The pandas-on-Spark index can already carry the correct logical dtype, but to_pandas() can still lose it during pandas index restoration.

For example, with pandas 3:

import pandas as pd
import pyspark.pandas as ps

pidx = pd.DatetimeIndex(["2004-01-01", "2002-12-31", "2000-04-01"])
psidx = ps.from_pandas(pidx)

psidx.year.dtype
# dtype('int32')

psidx.year.to_pandas().dtype
# dtype('int64')

In this case, the accessor result has the expected logical dtype on the pandas-on-Spark side, but materializing it in pandas turns the index into a RangeIndex, whose dtype is always int64.
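The dtype difference can be seen with pandas alone. The snippet below is only an illustration (the variable names are made up, and pandas 2.0 or later is assumed): a RangeIndex reports int64, while an explicit `pd.Index(..., dtype="int32")` holding the same values keeps the narrower dtype.

```python
import pandas as pd

# RangeIndex has a fixed int64 dtype and accepts no other integer width.
range_index = pd.RangeIndex(3)

# The same values with an explicit narrower dtype (assumes pandas >= 2.0).
narrow_index = pd.Index([0, 1, 2], dtype="int32")

print(range_index.dtype)
print(narrow_index.dtype)
print(list(range_index) == list(narrow_index))
```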

Does this PR introduce any user-facing change?

Yes.

For single-level non-int64 integer indexes, to_pandas() can now preserve the original integer index dtype instead of materializing the result as RangeIndex with int64.

How was this patch tested?

./python/run-tests.py --testnames pyspark.pandas.tests.indexes.test_datetime_property

This test passed locally in both pandas 3.0.0 and pandas 2.3.3 environments.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

ueshin · 2026-03-13T21:53:03Z

cc @gaogaotiantian @HyukjinKwon @zhengruifeng

devin-petersohn · 2026-03-13T22:00:43Z

python/pyspark/pandas/internal.py

        for col, field in zip(pdf.columns, fields):
            pdf[col] = DataTypeOps(field.dtype, field.spark_type).restore(pdf[col])

+        restored_index_columns = {


Since this is meant to fix the RangeIndex issue, can we make the copy only when there is a single index column? I'm mainly concerned about the copies on large MultiIndex DataFrames. It also has the benefit of being simpler.

if len(index_columns) == 1:
    original_index_values = pdf[index_columns[0]].copy()

That makes a lot of sense! Fixed.
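A minimal sketch of the guard suggested in this thread is shown below. The function `copy_single_index_values` is a hypothetical name used only for illustration; the merged code may be structured differently.

```python
import pandas as pd


def copy_single_index_values(pdf, index_columns):
    # Hypothetical helper: copy the index column only in the single-level
    # case, so MultiIndex frames pay no extra copy cost.
    if len(index_columns) == 1:
        return pdf[index_columns[0]].copy()
    return None


pdf = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5], "value": [10, 20, 30]})

single = copy_single_index_values(pdf, ["a"])      # a copied Series
multi = copy_single_index_values(pdf, ["a", "b"])  # None, nothing copied
print(single is not None, multi is None)
```

Returning None for the multi-level case keeps the later RangeIndex check simple: without saved values there is nothing to rebuild.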

devin-petersohn

LGTM

HyukjinKwon · 2026-03-15T05:37:58Z

Merged to master.

Preserve non-int64 index dtypes in restore_index

45ed7f8

devin-petersohn reviewed Mar 13, 2026

Fix.

cfceb39

devin-petersohn approved these changes Mar 14, 2026

HyukjinKwon approved these changes Mar 15, 2026

HyukjinKwon closed this in 47424a3 Mar 15, 2026

ueshin wants to merge 2 commits into apache:master from ueshin:issues/SPARK-55989/rangeindex

3 participants
