Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7168: [Python] Respect the specified dictionary type for pd.Categorical conversion #5866

Conversation

jorisvandenbossche
Copy link
Member

https://issues.apache.org/jira/browse/ARROW-7168

This change ensures that if you specify a type in pa.array, we ensure the output actually has this type when converting to dictionary array (as we also do for other types).

The PR now implements this change, but we might want to do this with a deprecation first, as this can break people's code.

@github-actions
Copy link

# (the type of the 'codes' in pandas is not part of the data type)
typ = pa.dictionary(index_type=pa.int16(), value_type=pa.string())
result = pa.array(cat, type=typ)
assert result.type.equals(typ)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you check also the array value? (for example using result.to_pylist())

typ = pa.dictionary(index_type=pa.int16(), value_type=pa.string())
result = pa.array(cat, type=typ)
assert result.type.equals(typ)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should you perhaps check passing a mask argument to pa.array()?

@jorisvandenbossche
Copy link
Member Author

@pitrou thanks! Added some additional tests

@jorisvandenbossche
Copy link
Member Author

Also added some additional code to do it with a warning for now, falling back to the previous behaviour if the new behaviour fails.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the warning is necessary, and it's more maintenance work in the future, but I'll let you make the final choice. +1 otherwise.

@jorisvandenbossche
Copy link
Member Author

We had some regressions with the changes in 0.15 related to following the types in the schema and the pandas index etc, so thought to be more cautious this time. But certainly not attached to it (I can't really judge if it will be something easy to run into or not).

I think the extra maintenance work is limited, we will only need to remove the warnings and replace with an error at some point.

@pitrou pitrou closed this in 09eac5d Nov 21, 2019
@pitrou
Copy link
Member

pitrou commented Nov 21, 2019

Ok, merged as is :-)

pitrou pushed a commit that referenced this pull request Dec 4, 2019
I guess this should be in #5866

Closes #5952 from mrkn/ARROW-7306 and squashes the following commits:

e0793a3 <Kenta Murata> Add Result-returning version of FileSystemFromUri

Authored-by: Kenta Murata <mrkn@mrkn.jp>
Signed-off-by: Antoine Pitrou <antoine@python.org>
jorisvandenbossche pushed a commit that referenced this pull request Sep 15, 2020
…pe (int/string) Pandas dataframe to pyarrow Table

This PR homogenizes error messages for mixed-type `Pandas` inputs to `pa.Table`.

The message for `Pandas` column with `int` followed by `string`  is now
```
In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']}))
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object')
```
the same as for `double` followed by `string`:
```
In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']}))
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object')
```

As a side effect, this snippet [xref #5866, ARROW-7168] now throws an `ArrowInvalid` (has been `FutureWarning` since 0.16):
```
In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object))
   ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64())
   ...: result = pa.array(cat, type=typ)
(... traceback...)
ArrowInvalid: Could not convert a with type str: tried to convert to int
```
Finally, this *does* break a test [xref #4484, ARROW-4036] - see code comment

Closes #8044 from arw2019/ARROW-7663

Authored-by: arw2019 <andrew.r.wieteska@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
…pe (int/string) Pandas dataframe to pyarrow Table

This PR homogenizes error messages for mixed-type `Pandas` inputs to `pa.Table`.

The message for `Pandas` column with `int` followed by `string`  is now
```
In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']}))
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object')
```
the same as for `double` followed by `string`:
```
In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']}))
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object')
```

As a side effect, this snippet [xref apache#5866, ARROW-7168] now throws an `ArrowInvalid` (has been `FutureWarning` since 0.16):
```
In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object))
   ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64())
   ...: result = pa.array(cat, type=typ)
(... traceback...)
ArrowInvalid: Could not convert a with type str: tried to convert to int
```
Finally, this *does* break a test [xref apache#4484, ARROW-4036] - see code comment

Closes apache#8044 from arw2019/ARROW-7663

Authored-by: arw2019 <andrew.r.wieteska@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…pe (int/string) Pandas dataframe to pyarrow Table

This PR homogenizes error messages for mixed-type `Pandas` inputs to `pa.Table`.

The message for `Pandas` column with `int` followed by `string`  is now
```
In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']}))
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object')
```
the same as for `double` followed by `string`:
```
In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']}))
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object')
```

As a side effect, this snippet [xref apache#5866, ARROW-7168] now throws an `ArrowInvalid` (has been `FutureWarning` since 0.16):
```
In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object))
   ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64())
   ...: result = pa.array(cat, type=typ)
(... traceback...)
ArrowInvalid: Could not convert a with type str: tried to convert to int
```
Finally, this *does* break a test [xref apache#4484, ARROW-4036] - see code comment

Closes apache#8044 from arw2019/ARROW-7663

Authored-by: arw2019 <andrew.r.wieteska@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants