-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-7168: [Python] Respect the specified dictionary type for pd.Categorical conversion #5866
ARROW-7168: [Python] Respect the specified dictionary type for pd.Categorical conversion #5866
Conversation
…rting pd.Categorical
# (the type of the 'codes' in pandas is not part of the data type) | ||
typ = pa.dictionary(index_type=pa.int16(), value_type=pa.string()) | ||
result = pa.array(cat, type=typ) | ||
assert result.type.equals(typ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you check also the array value? (for example using result.to_pylist()
)
typ = pa.dictionary(index_type=pa.int16(), value_type=pa.string()) | ||
result = pa.array(cat, type=typ) | ||
assert result.type.equals(typ) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should you perhaps check passing a mask
argument to pa.array()
?
@pitrou thanks! Added some additional tests |
Also added some additional code to do it with a warning for now, falling back to the previous behaviour if the new behaviour fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the warning is necessary, and it's more maintenance work in the future, but I'll let you make the final choice. +1 otherwise.
We had some regressions with the changes in 0.15 related to following the types in the schema and the pandas index etc, so thought to be more cautious this time. But certainly not attached to it (I can't really judge if it will be something easy to run into or not). I think the extra maintenance work is limited, we will only need to remove the warnings and replace with an error at some point. |
Ok, merged as is :-) |
I guess this should be in #5866 Closes #5952 from mrkn/ARROW-7306 and squashes the following commits: e0793a3 <Kenta Murata> Add Result-returning version of FileSystemFromUri Authored-by: Kenta Murata <mrkn@mrkn.jp> Signed-off-by: Antoine Pitrou <antoine@python.org>
…pe (int/string) Pandas dataframe to pyarrow Table This PR homogenizes error messages for mixed-type `Pandas` inputs to `pa.Table`. The message for `Pandas` column with `int` followed by `string` is now ``` In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']})) (... traceback...) ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object') ``` the same as for `double` followed by `string`: ``` In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']})) (... traceback...) ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object') ``` As a side effect, this snippet [xref #5866, ARROW-7168] now throws an `ArrowInvalid` (has been `FutureWarning` since 0.16): ``` In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object)) ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64()) ...: result = pa.array(cat, type=typ) (... traceback...) ArrowInvalid: Could not convert a with type str: tried to convert to int ``` Finally, this *does* break a test [xref #4484, ARROW-4036] - see code comment Closes #8044 from arw2019/ARROW-7663 Authored-by: arw2019 <andrew.r.wieteska@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…pe (int/string) Pandas dataframe to pyarrow Table This PR homogenizes error messages for mixed-type `Pandas` inputs to `pa.Table`. The message for `Pandas` column with `int` followed by `string` is now ``` In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']})) (... traceback...) ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object') ``` the same as for `double` followed by `string`: ``` In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']})) (... traceback...) ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object') ``` As a side effect, this snippet [xref apache#5866, ARROW-7168] now throws an `ArrowInvalid` (has been `FutureWarning` since 0.16): ``` In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object)) ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64()) ...: result = pa.array(cat, type=typ) (... traceback...) ArrowInvalid: Could not convert a with type str: tried to convert to int ``` Finally, this *does* break a test [xref apache#4484, ARROW-4036] - see code comment Closes apache#8044 from arw2019/ARROW-7663 Authored-by: arw2019 <andrew.r.wieteska@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…pe (int/string) Pandas dataframe to pyarrow Table This PR homogenizes error messages for mixed-type `Pandas` inputs to `pa.Table`. The message for `Pandas` column with `int` followed by `string` is now ``` In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']})) (... traceback...) ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object') ``` the same as for `double` followed by `string`: ``` In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']})) (... traceback...) ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object') ``` As a side effect, this snippet [xref apache#5866, ARROW-7168] now throws an `ArrowInvalid` (has been `FutureWarning` since 0.16): ``` In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object)) ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64()) ...: result = pa.array(cat, type=typ) (... traceback...) ArrowInvalid: Could not convert a with type str: tried to convert to int ``` Finally, this *does* break a test [xref apache#4484, ARROW-4036] - see code comment Closes apache#8044 from arw2019/ARROW-7663 Authored-by: arw2019 <andrew.r.wieteska@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
https://issues.apache.org/jira/browse/ARROW-7168
This change ensures that if you specify a
type
inpa.array
, we ensure the output actually has this type when converting to dictionary array (as we also do for other types).The PR now implements this change, but we might want to do this with a deprecation first, as this can break people's code.