[Python] Create UnionArray from mixed-type pandas categorical #19391

Closed
asfimport opened this issue Aug 9, 2018 · 9 comments

Comments

asfimport commented Aug 9, 2018

While troubleshooting ARROW-2966 I updated my pandas DataFrame with more type information. Specifically, I changed some mixed-type columns from object to categorical. I assumed that Table.from_pandas() would inspect the pandas type information and respect it when converting the DataFrame to a table. It doesn't seem to.

For instance, I expected this code to work, but it throws the same ArrowTypeError as ARROW-2966.

import pandas as pd
import pyarrow as pa
import numpy as np
df = pd.DataFrame.from_dict({"col": [0, 1, 2, 3, ""]}, dtype="category")
tb = pa.Table.from_pandas(df, columns=["col"])

Reporter: Christopher Brooks

Related issues:

Note: This issue was originally created as ARROW-3030. Please see the migration documentation for further details.

Wes McKinney / @wesm:
Thought we had implemented this. Confirmed it's an issue; fixing should not be too difficult

Wes McKinney / @wesm:
PRs welcome of course

Krisztian Szucs / @kszucs:
The example can be reduced to:

pa.array([0, 1, 2, 3, ''])

Creating unions is not implemented yet; see https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/inference.cc#L377

Krisztian Szucs / @kszucs:
Type inference could optionally check whether the candidate types can all be cast to a single type, rather than creating a union of them.
On the other hand, in certain cases the cast check needs the value as well as the type, which can make inference expensive.

What would the expected type outcome be for the following cases?

import datetime

a1 = [1.2, 3.4, "4.6"]
a2 = [b"binary", "string"]
a3 = [[1, 2, 3], [4.0]]
a4 = ["2018-01-05", datetime.date(2018, 1, 5)]
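To make the "castable to a single type" idea concrete, here is a minimal pure-Python sketch. It is not how Arrow's inference is implemented: `PROMOTIONS` and `infer_common_type` are hypothetical names invented for this illustration, and the check looks only at the top-level Python type of each value, never at the value itself (which is the cheap variant; value-based checks such as parsing "4.6" as a float are the expensive case mentioned above).

```python
import datetime

# Hypothetical promotion table for this sketch only: sets of Python types
# that we are willing to collapse into one type without looking at values.
PROMOTIONS = {
    frozenset({int, float}): float,
}


def infer_common_type(values):
    """Return one Python type for all non-null values, or None if they
    cannot be reduced to a single type using the table above."""
    seen = {type(v) for v in values if v is not None}
    if len(seen) == 1:
        return next(iter(seen))
    # Either no values at all, or a mix: look for an allowed promotion.
    return PROMOTIONS.get(frozenset(seen))


cases = {
    "a1": [1.2, 3.4, "4.6"],
    "a2": [b"binary", "string"],
    "a3": [[1, 2, 3], [4.0]],  # only the outer list type is inspected here
    "a4": ["2018-01-05", datetime.date(2018, 1, 5)],
    "ints_and_floats": [1, 2.5],
}
for name, values in cases.items():
    print(name, infer_common_type(values))
```

A result of `None` marks the cases where a type-only check finds no single type, so the choice would be between raising an error, inspecting values, or building a union.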

Wes McKinney / @wesm:
Moved to 0.12. Dealing with unions is out of scope for 0.11, but maybe we can look into it for 0.12 or 0.13

Wes McKinney / @wesm:
Moved out of 0.12. We don't have much support for unions yet, which is what would be needed here. That's a bigger project

In the meantime I suggest homogenizing the types of your data to not have a mix of integers and strings
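
One way to homogenize the values from the original report is to turn every non-null value into a string before handing the data to Arrow. The sketch below is only an illustration: `homogenize_to_strings` is a hypothetical helper, not part of pandas or pyarrow, and converting to strings is an assumption about what suits the data (casting everything to integers would fail on the empty string). The pyarrow call is guarded so the snippet also runs where pyarrow is not installed.

```python
def homogenize_to_strings(values):
    """Convert every non-null value to str so the column has one type."""
    return [None if v is None else str(v) for v in values]


mixed = [0, 1, 2, 3, ""]
uniform = homogenize_to_strings(mixed)
print(uniform)

try:
    import pyarrow as pa
except ImportError:
    pa = None  # pyarrow is optional for this sketch

if pa is not None:
    # With a single Python type present, no union type is needed.
    arr = pa.array(uniform)
    print(arr.type)
```

For the DataFrame in the report, the same idea would be to convert the column with `astype(str)` before `astype("category")`, so that the categories are all strings.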

Antoine Pitrou / @pitrou:
This sounds undesirable to do by default for the same reason as for ARROW-2774.

Joris Van den Bossche / @jorisvandenbossche:
I think the same discussion as in ARROW-2774 (supporting or not? optional?) for general conversion applies to Categorical as well, so let's close this referring to ARROW-2774
