New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1689: [Python] Allow user to request no data copies #1233
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea of introducing such a flag. I guess due to Pandas nature the to_pandas
will mostly have copies inside. I'm really looking forward to having this option on the inverese (form_pandas
) in addition to a flag nan_is_null
which can be set to False
.
@@ -159,7 +159,7 @@ cdef class Column: | |||
sp_column.reset(new CColumn(boxed_field.sp_field, arr.sp_array)) | |||
return pyarrow_wrap_column(sp_column) | |||
|
|||
def to_pandas(self, strings_to_categorical=False): | |||
def to_pandas(self, strings_to_categorical=False, zero_copy_only=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would guess that this nearly always fails to the BlockManager. It will work for Arrays/Series but as Pandas will always allocate a large matrix for columns of the same type, you most likely get near to no zero-copies. (e.g. a DataFrame with two float64 columns will need a copy.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy That makes sense - but why wouldn't you just make one PandasBlock
per column (without copying), instead of created a PandasBlock
per type and copying the various columns of that type into it?
Also, I've pushed a fix for the AppVeyor warning but the Travis errors seem unrelated (in the Go & JS codebases)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy Makes sense, definitely want to keep this change focussed so should I just remove the flag from pa.Table
? I'll fix the AppVeyor complaints tonight, & rebase off master to see if that fixes the other issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, keep it. I just wanted to give you the hint that it will likely fail to do zero-copy in >90% of all cases.
pandas aggressively consolidates blocks. For non-categorical types, the maximum performance, most memory efficient choice for nearly all pandas users is to create pre-consolidated blocks. I’m a bit worried about doing significant work on this code without having ASV benchmarks set up |
68b4b1b
to
67fc614
Compare
@xhochy OK - Travis, &c. are all green now. @wesm thanks - that makes sense. I'm definitely not looking to do significant work! That said, MAP_FIXED looks like a path to mapping different columns in the Arrow file to a contiguous address space (i.e. a pre-consolidated block) without physically copying the data...:) |
cpp/src/arrow/status.h
Outdated
@@ -95,7 +95,8 @@ enum class StatusCode : char { | |||
PythonError = 12, | |||
PlasmaObjectExists = 20, | |||
PlasmaObjectNonexistent = 21, | |||
PlasmaStoreFull = 22 | |||
PlasmaStoreFull = 22, | |||
CopyRequired = 23 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a new status code is needed. Can you return Invalid
instead? As a matter of style, we should try not to use Status for routine error handling in C++. I think having this bubble up as an exception in Python is OK for now, but if we need to do zero-copy detection in C++, we are going to want to handle that in a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
RETURN_NOT_OK( | ||
ConvertArrayToPandas(options_, dict_type->dictionary(), nullptr, &dictionary)); | ||
lock.acquire(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpp/src/arrow/python/python-test.cc
Outdated
@@ -86,7 +86,7 @@ TEST(PandasConversionTest, TestObjectBlockWriteFails) { | |||
|
|||
PyObject* out; | |||
Py_BEGIN_ALLOW_THREADS; | |||
PandasOptions options; | |||
PandasOptions options = {false, false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you instead add a default ctor to PandasOptions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
||
def test_zero_copy_failure_when_nulls(self): | ||
with self.assertRaises(pa.ArrowException): | ||
pa.array([0, 1, None]).to_pandas(zero_copy_only=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 7 places where a zero copy error is being returned. Let's make sure we have unit tests that hit each of them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
This makes performance debugging much easier, as it allows you to track down what (arrow) data is causing unexpected delays in loading.
67fc614
to
b06f50d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks @njwhite!
This PR closes [ARROW-1689](https://issues.apache.org/jira/browse/ARROW-1689). I want to add the zero-copy option after #1233 merged. Author: Licht-T <licht-t@outlook.jp> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #1237 from Licht-T/feature-categorical-index-zerocopy and squashes the following commits: 53342e8 [Wes McKinney] Use the PyCapsule API to preserve base references to C++ objects when no PyObject* is available to set as zero-copy ndarray base 0b847d1 [Wes McKinney] Fix flakes 4270b5d [Licht-T] Fix C++ lint issues ddc6b84 [Licht-T] TST: Add test_zero_copy_dictionaries e0561dc [Licht-T] ENH: Add zero_copy_only option check de4ed3e [Licht-T] ENH: Implement Categorical Block Zero-Copy
This makes performance debugging much easier, as it allows you to track down what (Arrow) data is causing unexpected delays in loading. It also makes testing features like ARROW-1689 easier as you can prove (via unit tests) that copies are not being made.