Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Feb 3, 2026

What changes were proposed in this pull request?

Implement a new Arrow-to-pandas conversion.

Why are the changes needed?

To replace the existing conversion. In my local test, converting integers with nulls is more than 3x faster:

conversion used in convert_legacy

        # %timeit a.to_pandas(integer_object_nulls=True).astype(pd.Int64Dtype())
        # 2.05 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

new conversion

        # %timeit a.to_pandas(types_mapper=pd.ArrowDtype).astype(pd.Int64Dtype())
        # 589 ms ± 9.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Does this PR introduce any user-facing change?

no

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions

github-actions bot commented Feb 3, 2026

JIRA Issue Information

=== Sub-task SPARK-54969 ===
Summary: Implement new arrow->pandas conversion
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

fix

fix
Converted pandas Series. If df_for_struct is True and the type is StructType,
returns a DataFrame with columns corresponding to struct fields.
"""
if cls._prefer_convert_numpy(target_type, df_for_struct):
Contributor Author

Our goal is to replace convert_legacy with convert_numpy.

@zhengruifeng
Contributor Author

cc @fangchenli and @Yicong-Huang

The new conversion is meant to eventually replace convert_legacy, backed by benchmarking.
We will also add a convert_pyarrow and switch between the conversions via runtime configs.
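
One way to picture switching the conversion via a runtime config is a small dispatch table. This is only a sketch under assumptions: the `mode` argument, the `_CONVERTERS` table, and the `convert` helper are hypothetical, and the three functions are stand-ins that tag their input rather than the real Arrow-to-pandas implementations.

```python
from typing import Callable, Dict, List, Optional, Tuple

Values = List[Optional[int]]
Tagged = Tuple[str, Values]


# Hypothetical stand-ins for the conversions named in this thread. They only
# tag their input so the dispatch is observable; they do no real conversion.
def convert_legacy(values: Values) -> Tagged:
    return ("legacy", list(values))


def convert_numpy(values: Values) -> Tagged:
    return ("numpy", list(values))


def convert_pyarrow(values: Values) -> Tagged:
    return ("pyarrow", list(values))


# Hypothetical mapping from a config value to a conversion function.
_CONVERTERS: Dict[str, Callable[[Values], Tagged]] = {
    "legacy": convert_legacy,
    "numpy": convert_numpy,
    "pyarrow": convert_pyarrow,
}


def convert(values: Values, mode: str = "legacy") -> Tagged:
    """Pick a conversion by config value, rejecting unknown modes."""
    try:
        converter = _CONVERTERS[mode]
    except KeyError:
        raise ValueError(f"unknown conversion mode: {mode!r}") from None
    return converter(values)


print(convert([1, None, 3], mode="numpy"))
```

Keeping the legacy path as the default in such a table would let the new conversions be enabled selectively while benchmarks are collected.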

),
):
# TODO(ruifeng): revisit date_as_object
# TODO(ruifeng): implement coerce_temporal_nanoseconds
Member

Could you convert this to the community style, TODO(SPARK-XXX), instead of the personal TODO style?

Contributor Author

sure!

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you!

@zhengruifeng
Contributor Author

thanks, merged to master

@zhengruifeng zhengruifeng deleted the convert_np branch February 3, 2026 07:45
3 participants