[SPARK-46931][PS] Implement {Frame, Series}.to_hdf #44966

Closed
wants to merge 7 commits into from

zhengruifeng
Contributor

What changes were proposed in this pull request?

Implement {Frame, Series}.to_hdf

Why are the changes needed?

pandas parity

Does this PR introduce any user-facing change?

yes

In [3]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

In [4]: df.to_hdf('/tmp/data.h5', key='df', mode='w')

In [5]: psdf = ps.from_pandas(df)

In [6]: psdf.to_hdf('/tmp/data2.h5', key='df', mode='w')
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1015: PandasAPIOnSparkAdviceWarning: `to_hdf` loads all data into the driver's memory. It should only be used if the resulting DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
                                                                                
In [7]: !ls /tmp/*h5
/tmp/data.h5    /tmp/data2.h5

In [8]: !ls -lh /tmp/*h5
-rw-r--r--@ 1 ruifeng.zheng  wheel   6.9K Jan 31 12:21 /tmp/data.h5
-rw-r--r--@ 1 ruifeng.zheng  wheel   6.9K Jan 31 12:21 /tmp/data2.h5

How was this patch tested?

Manually tested. HDF support requires the additional library PyTables, which in turn needs many prerequisites.

Since PyTables is just an optional dependency of pandas, I think we can avoid adding it to CI for now.
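
If an automated test is wanted later, one possible shape is a test that is skipped whenever the optional dependency is missing, so CI does not need PyTables installed. The following is only a sketch under stated assumptions: the class name `ToHDFRoundTripTest` and the helper `run_suite` are hypothetical, and the round trip uses plain pandas as a stand-in for the pandas-on-Spark call.

```python
import importlib.util
import os
import tempfile
import unittest

# Detect the optional dependencies without importing them.
have_pandas = importlib.util.find_spec("pandas") is not None
have_tables = importlib.util.find_spec("tables") is not None  # PyTables


@unittest.skipIf(
    not (have_pandas and have_tables),
    "pandas and PyTables are required for HDF5 round-trip tests",
)
class ToHDFRoundTripTest(unittest.TestCase):  # hypothetical test class
    def test_round_trip(self):
        import pandas as pd

        pdf = pd.DataFrame(
            {"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"]
        )
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "data.h5")
            # Assumption: a pandas-on-Spark test would call
            # ps.from_pandas(pdf).to_hdf(...) here instead.
            pdf.to_hdf(path, key="df", mode="w")
            restored = pd.read_hdf(path, key="df")

        # Compare plain Python values so the check does not depend on dtypes.
        self.assertEqual(restored.to_dict("list"), pdf.to_dict("list"))
        self.assertEqual(list(restored.index), ["a", "b", "c"])


def run_suite():
    """Run the hypothetical test case and return the unittest result."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(ToHDFRoundTripTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When PyTables is absent, the test is reported as skipped rather than failed, which keeps the dependency optional for CI.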

Was this patch authored or co-authored using generative AI tooling?

no

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Thank you so much for adding this, @zhengruifeng and @HyukjinKwon.

@zhengruifeng zhengruifeng deleted the ps_to_hdf branch February 1, 2024 00:43
@zhengruifeng
Contributor Author

Thanks @dongjoon-hyun and @HyukjinKwon for the reviews.

3 participants