Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1689: [Python] Allow user to request no data copies #1233

Closed
wants to merge 2 commits into from

Conversation

njwhite
Copy link
Contributor

@njwhite njwhite commented Oct 21, 2017

This makes performance debugging much easier, as it allows you to track down what (Arrow) data is causing unexpected delays in loading. It also makes testing features like ARROW-1689 easier as you can prove (via unit tests) that copies are not being made.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea of introducing such a flag. I guess due to Pandas nature the to_pandas will mostly have copies inside. I'm really looking forward to having this option on the inverese (form_pandas) in addition to a flag nan_is_null which can be set to False.

@@ -159,7 +159,7 @@ cdef class Column:
sp_column.reset(new CColumn(boxed_field.sp_field, arr.sp_array))
return pyarrow_wrap_column(sp_column)

def to_pandas(self, strings_to_categorical=False):
def to_pandas(self, strings_to_categorical=False, zero_copy_only=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess that this nearly always fails to the BlockManager. It will work for Arrays/Series but as Pandas will always allocate a large matrix for columns of the same type, you most likely get near to no zero-copies. (e.g. a DataFrame with two float64 columns will need a copy.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy That makes sense - but why wouldn't you just make one PandasBlock per column (without copying), instead of created a PandasBlock per type and copying the various columns of that type into it?

Also, I've pushed a fix for the AppVeyor warning but the Travis errors seem unrelated (in the Go & JS codebases)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njwhite Because that is the underlying assumption based upon which Pandas 0.x DataFrames work on. There are several functions that use this assumptions to provide certain (slicing) features. @wesm might be able to go into detail/be more concrete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy Makes sense, definitely want to keep this change focussed so should I just remove the flag from pa.Table? I'll fix the AppVeyor complaints tonight, & rebase off master to see if that fixes the other issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, keep it. I just wanted to give you the hint that it will likely fail to do zero-copy in >90% of all cases.

@wesm
Copy link
Member

wesm commented Oct 22, 2017

pandas aggressively consolidates blocks. For non-categorical types, the maximum performance, most memory efficient choice for nearly all pandas users is to create pre-consolidated blocks.

I’m a bit worried about doing significant work on this code without having ASV benchmarks set up

@njwhite
Copy link
Contributor Author

njwhite commented Oct 24, 2017

@xhochy OK - Travis, &c. are all green now. @wesm thanks - that makes sense. I'm definitely not looking to do significant work! That said, MAP_FIXED looks like a path to mapping different columns in the Arrow file to a contiguous address space (i.e. a pre-consolidated block) without physically copying the data...:)

@@ -95,7 +95,8 @@ enum class StatusCode : char {
PythonError = 12,
PlasmaObjectExists = 20,
PlasmaObjectNonexistent = 21,
PlasmaStoreFull = 22
PlasmaStoreFull = 22,
CopyRequired = 23
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a new status code is needed. Can you return Invalid instead? As a matter of style, we should try not to use Status for routine error handling in C++. I think having this bubble up as an exception in Python is OK for now, but if we need to do zero-copy detection in C++, we are going to want to handle that in a different way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

RETURN_NOT_OK(
ConvertArrayToPandas(options_, dict_type->dictionary(), nullptr, &dictionary));
lock.acquire();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already been run by this call to Write here - so this change just reuses the save dictionary instead of building it again.

@@ -86,7 +86,7 @@ TEST(PandasConversionTest, TestObjectBlockWriteFails) {

PyObject* out;
Py_BEGIN_ALLOW_THREADS;
PandasOptions options;
PandasOptions options = {false, false};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you instead add a default ctor to PandasOptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


def test_zero_copy_failure_when_nulls(self):
with self.assertRaises(pa.ArrowException):
pa.array([0, 1, None]).to_pandas(zero_copy_only=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 7 places where a zero copy error is being returned. Let's make sure we have unit tests that hit each of them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

@wesm wesm changed the title ARROW-1689 Allow User To Request No Data Copies ARROW-1689: [Python] Allow user to request no data copies Oct 24, 2017
This makes performance debugging much easier, as it
allows you to track down what (arrow) data is causing
unexpected delays in loading.
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @njwhite!

@wesm wesm closed this in 6b16cca Oct 26, 2017
xhochy pushed a commit that referenced this pull request Oct 28, 2017
This PR closes [ARROW-1689](https://issues.apache.org/jira/browse/ARROW-1689).
I want to add the zero-copy option after #1233 merged.

Author: Licht-T <licht-t@outlook.jp>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1237 from Licht-T/feature-categorical-index-zerocopy and squashes the following commits:

53342e8 [Wes McKinney] Use the PyCapsule API to preserve base references to C++ objects when no PyObject* is available to set as zero-copy ndarray base
0b847d1 [Wes McKinney] Fix flakes
4270b5d [Licht-T] Fix C++ lint issues
ddc6b84 [Licht-T] TST: Add test_zero_copy_dictionaries
e0561dc [Licht-T] ENH: Add zero_copy_only option check
de4ed3e [Licht-T] ENH: Implement Categorical Block Zero-Copy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants