[Python] supporting pandas sparse series in pyarrow #24837

Open
asfimport opened this issue May 3, 2020 · 5 comments

Comments

@asfimport

I've seen that pandas sparse Series were not supported in pyarrow because they were planned to be deprecated. pandas 1.0.1 released a stable version of the sparse array, and as far as I know it is no longer planned to be deprecated. Are you planning to support sparse Series in upcoming versions of pyarrow?

Environment: ubuntu 16/18
Reporter: Michael Novitsky

Note: This issue was originally created as ARROW-8679. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
You're welcome to submit a PR

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~michael.novitsky] how do you envision "support"? Or in which part of pyarrow do you want to see it supported?

E.g., when converting a pandas DataFrame with sparse columns to a pyarrow Table, we could densify the sparse array (since Arrow has no support for sparse arrays in its columnar format), but I am not sure this is what users would expect.

Support for conversion to one of the sparse tensors in pyarrow could indeed be added.

@asfimport

Michael Novitsky:
@jorisvandenbossche Hi Joris, we are dealing with data that is sparse by nature (it contains many NaNs), and we currently run into memory problems with a big DataFrame. We can't use scipy sparse matrices, since they only compress zeros and not NaNs, and we want the data to stay sparse through the whole flow: DataFrame -> pyarrow -> plasma store.
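
To make the memory argument concrete, a small sketch of a NaN-filled pandas sparse column next to its dense equivalent. The array size and the one-in-a-thousand density are made-up values for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative data: one million floats, only every 1000th one is not NaN.
values = np.full(1_000_000, np.nan)
values[::1000] = 1.0

dense = pd.Series(values)
# Sparse dtype whose fill value is NaN, so only the non-NaN points are stored.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
stored_points = sparse.sparse.npoints
```

Here `stored_points` counts only the non-NaN entries, and `sparse_bytes` is correspondingly much smaller than `dense_bytes`.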

Regarding "Support for conversion to one of the sparse tensors in pyarrow could indeed be added": can you please point me to the part of the code where this conversion happens?

@asfimport

Prabhant Singh:
Hi,
Is there any progress on this issue? Or will it ever be supported?

Otherwise, is there any recommended way to deal with sparse data in the meantime, i.e. convert to dense data or something else?

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
With the current Arrow data types, we don't really have support for sparse data, so there is no direct way to support conversion from/to a pandas sparse Series (other than converting to dense).

There has been some discussion in the past about extending the Arrow spec to cover sparse/compressed data (e.g. RLE), but no one has started on a full proposal yet.
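
For readers unfamiliar with RLE (run-length encoding), a toy NumPy sketch of the idea follows. It is only an illustration of the technique, not a proposed Arrow layout; the function names `rle_encode` and `rle_decode` are made up for this example.

```python
import numpy as np


def rle_encode(values):
    """Encode a 1-D float array as (run_values, run_lengths).

    Consecutive NaNs are treated as one run, which is what makes
    this effective for mostly-NaN columns.
    """
    values = np.asarray(values, dtype="float64")
    if values.size == 0:
        return values, np.array([], dtype=np.int64)
    # True where an element continues the run of its predecessor.
    same = (values[1:] == values[:-1]) | (
        np.isnan(values[1:]) & np.isnan(values[:-1])
    )
    starts = np.concatenate(([0], np.flatnonzero(~same) + 1))
    lengths = np.diff(np.concatenate((starts, [values.size])))
    return values[starts], lengths


def rle_decode(run_values, run_lengths):
    """Expand (run_values, run_lengths) back into the full array."""
    return np.repeat(run_values, run_lengths)


data = np.array([np.nan, np.nan, np.nan, 1.0, 1.0, np.nan, np.nan, 2.0])
run_values, run_lengths = rle_encode(data)
decoded = rle_decode(run_values, run_lengths)
```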

@asfimport asfimport added this to the 0.17.0 milestone Jan 11, 2023