Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1997: [C++/Python] Ignore zero-copy-option in to_pandas when strings_to_categorical is True #1480

Closed

Conversation

Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Jan 16, 2018

This closes ARROW-1997.

The problem is

>>> import pandas as pd
>>> import pyarrow as pa
>>> 
>>> df = pd.DataFrame({
... 'Foo': ['A', 'A', 'B', 'B']
... })
>>> table = pa.Table.from_pandas(df)
>>> df = table.to_pandas(strings_to_categorical=True)
<class 'pandas.core.categorical.Categorical'>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 1043, in pyarrow.lib.Table.to_pandas
  File "pyarrow/pandas_compat.py", line 535, in table_to_blockmanager
    blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
  File "pyarrow/pandas_compat.py", line 629, in _table_to_blocks
    return [_reconstruct_block(item) for item in result]
  File "pyarrow/pandas_compat.py", line 436, in _reconstruct_block
    ordered=item['ordered'])
  File "/home/rito/miniconda3/envs/pyarrow-dev-27/lib/python2.7/site-packages/pandas/core/categorical.py", line 624, in from_codes
    raise ValueError("codes need to be between -1 and "
ValueError: codes need to be between -1 and len(categories)-1

When strings_to_categorical=True, the categorical index is newly created in to_pandas procedure.
But, this passes data to Python by zero-copy, so the array is deallocated.
https://github.com/Licht-T/arrow/blob/be58af6dd0333652abbe2333ee5968df3f2e371f/cpp/src/arrow/python/arrow_to_pandas.cc#L1040

@Licht-T Licht-T changed the title ARROW-1997: [C++/Python] Avoid zero-copy-option in to_pandas when strings_to_categorical is True ARROW-1997: [C++/Python] Ignore zero-copy-option in to_pandas when strings_to_categorical is True Jan 17, 2018
@xhochy
Copy link
Member

xhochy commented Jan 17, 2018

@Licht-T The code looks good but I fail to understand the initial problem and thus cannot really understand what the change should actually do. Can you explain it a bit more?

@Licht-T
Copy link
Contributor Author

Licht-T commented Jan 18, 2018

Thanks @xhochy, added the PR comment!

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I get it 👍

ss << "Needed to copy " << data.num_chunks() << " chunks with "
<< indices_first.null_count() << " indices nulls, but zero_copy_only was True";
if (needs_copy_) {
ss << "Zero-copy is not allowed, but zero_copy_only was True";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a unit test that hits this code path, or is there a test already that does?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is a little bit unclear, I will tweak

@Licht-T Licht-T force-pushed the fix-to_pandas-with-strings_to_categorical branch from be58af6 to c1bc353 Compare January 21, 2018 23:13
@Licht-T
Copy link
Contributor Author

Licht-T commented Jan 21, 2018

Thanks @wesm, added tests!

Change-Id: I30da415b6685795994ffe280cf531bf5110de9a7
@wesm
Copy link
Member

wesm commented Jan 23, 2018

+1. Will merge once the build is passing

@wesm wesm closed this in 72dea17 Jan 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants