ARROW-1689: [Python] Allow user to request no data copies #1233

njwhite · 2017-10-21T20:53:44Z

This makes performance debugging much easier, as it allows you to track down what (Arrow) data is causing unexpected delays in loading. It also makes testing features like ARROW-1689 easier as you can prove (via unit tests) that copies are not being made.

xhochy

I really like the idea of introducing such a flag. I guess that, due to pandas' nature, to_pandas will mostly involve copies. I'm really looking forward to having this option on the inverse (from_pandas), in addition to a flag nan_is_null which can be set to False.

xhochy · 2017-10-22T12:05:46Z

python/pyarrow/table.pxi

@@ -159,7 +159,7 @@ cdef class Column:
        sp_column.reset(new CColumn(boxed_field.sp_field, arr.sp_array))
        return pyarrow_wrap_column(sp_column)

-    def to_pandas(self, strings_to_categorical=False):
+    def to_pandas(self, strings_to_categorical=False, zero_copy_only=False):


I would guess that this nearly always fails due to the BlockManager. It will work for Arrays/Series, but as pandas always allocates one large matrix for columns of the same type, you will most likely get next to no zero-copy conversions (e.g. a DataFrame with two float64 columns will need a copy).

@xhochy That makes sense - but why wouldn't you just make one PandasBlock per column (without copying), instead of creating a PandasBlock per type and copying the various columns of that type into it?

Also, I've pushed a fix for the AppVeyor warning but the Travis errors seem unrelated (in the Go & JS codebases)?

@njwhite Because that is the underlying assumption on which pandas 0.x DataFrames are built. Several functions rely on this assumption to provide certain (slicing) features. @wesm might be able to go into detail/be more concrete.

@xhochy Makes sense, definitely want to keep this change focussed so should I just remove the flag from pa.Table? I'll fix the AppVeyor complaints tonight, & rebase off master to see if that fixes the other issues.

No, keep it. I just wanted to give you the hint that it will likely fail to do zero-copy in >90% of all cases.
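
To make the consolidation point in this thread concrete, here is a minimal stdlib-only sketch. It uses `array` and `memoryview` as stand-ins for Arrow buffers and NumPy arrays (an assumption for illustration only; neither pandas nor pyarrow is involved). A view over a single column shares memory with it, whereas packing two same-typed columns into one contiguous block means allocating new storage and copying.

```python
from array import array

# Two separately allocated float64 "columns" (stand-ins for Arrow buffers).
col_a = array("d", [1.0, 2.0, 3.0])
col_b = array("d", [4.0, 5.0, 6.0])

# Single column: a memoryview shares the column's memory, so no copy is made.
view_a = memoryview(col_a)
col_a[0] = 99.0
shares_memory = view_a[0] == 99.0  # the view observes the write

# "Consolidated" block: both columns must live in one contiguous buffer,
# which means allocating new storage and copying each column into it.
block = array("d", col_a)
block.extend(col_b)
col_a[1] = -1.0
block_is_copy = block[1] == 2.0  # the block does not observe the write
```

This is roughly why a single Array/Series can be converted without copying, while a DataFrame with two float64 columns cannot, at least under the block layout described above.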

wesm · 2017-10-22T21:56:58Z

pandas aggressively consolidates blocks. For non-categorical types, the highest-performance, most memory-efficient choice for nearly all pandas users is to create pre-consolidated blocks.

I’m a bit worried about doing significant work on this code without having ASV benchmarks set up.

njwhite · 2017-10-24T08:26:51Z

@xhochy OK - Travis, &c. are all green now. @wesm thanks - that makes sense. I'm definitely not looking to do significant work! That said, MAP_FIXED looks like a path to mapping different columns in the Arrow file to a contiguous address space (i.e. a pre-consolidated block) without physically copying the data...:)

wesm · 2017-10-24T17:06:18Z

cpp/src/arrow/status.h

@@ -95,7 +95,8 @@ enum class StatusCode : char {
  PythonError = 12,
  PlasmaObjectExists = 20,
  PlasmaObjectNonexistent = 21,
-  PlasmaStoreFull = 22
+  PlasmaStoreFull = 22,
+  CopyRequired = 23


I don't think a new status code is needed. Can you return Invalid instead? As a matter of style, we should try not to use Status for routine error handling in C++. I think having this bubble up as an exception in Python is OK for now, but if we need to do zero-copy detection in C++, we are going to want to handle that in a different way.

wesm · 2017-10-24T17:07:50Z

cpp/src/arrow/python/arrow_to_pandas.cc

-    RETURN_NOT_OK(
-        ConvertArrayToPandas(options_, dict_type->dictionary(), nullptr, &dictionary));
-    lock.acquire();
-


Why is this deleted?

It's already been run by this call to Write here - so this change just reuses the saved dictionary instead of building it again.

wesm · 2017-10-24T17:08:08Z

cpp/src/arrow/python/python-test.cc

@@ -86,7 +86,7 @@ TEST(PandasConversionTest, TestObjectBlockWriteFails) {

  PyObject* out;
  Py_BEGIN_ALLOW_THREADS;
-  PandasOptions options;
+  PandasOptions options = {false, false};


Can you instead add a default ctor to PandasOptions?

wesm · 2017-10-24T17:11:39Z

python/pyarrow/tests/test_convert_pandas.py

+
+    def test_zero_copy_failure_when_nulls(self):
+        with self.assertRaises(pa.ArrowException):
+            pa.array([0, 1, None]).to_pandas(zero_copy_only=True)


There are 7 places where a zero-copy error is being returned. Let's make sure we have unit tests that hit each of them.
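
As a rough illustration of the behaviour these tests are meant to pin down, here is a toy, stdlib-only model of a `zero_copy_only` flag. The names `to_values` and `CopyRequiredError` and the `data`/`valid` representation are hypothetical and are not part of pyarrow; the test in the diff above expects `pa.ArrowException` instead. The sketch covers only the null case, one of the several failure paths mentioned.

```python
import math
from array import array


class CopyRequiredError(ValueError):
    """Hypothetical error type, used here only for illustration."""


def to_values(data, valid, zero_copy_only=False):
    """Toy conversion: return a zero-copy view when possible, else a copy.

    `data` is an array('d') and `valid` a list of bools marking non-null
    slots; both are illustrative stand-ins, not pyarrow structures.
    """
    if all(valid):
        return memoryview(data)  # shares memory with `data`
    if zero_copy_only:
        raise CopyRequiredError("nulls present: conversion needs a copy")
    # Nulls are materialised as NaN, which requires new storage.
    return array("d", [x if ok else math.nan for x, ok in zip(data, valid)])


data = array("d", [0.0, 1.0, 0.0])

# No nulls: the zero-copy path succeeds.
view = to_values(data, [True, True, True], zero_copy_only=True)

# A null in the last slot: the zero-copy request is rejected.
try:
    to_values(data, [True, True, False], zero_copy_only=True)
    rejected = False
except CopyRequiredError:
    rejected = True

# Without the flag, the same input is converted by copying.
copied = to_values(data, [True, True, False])
```

A test per failure path would follow the same pattern: build an input that forces that particular copy, request zero-copy, and assert that the conversion raises.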

wesm

+1, thanks @njwhite!

This PR closes [ARROW-1689](https://issues.apache.org/jira/browse/ARROW-1689). I want to add the zero-copy option after #1233 merged.

Author: Licht-T <licht-t@outlook.jp>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1237 from Licht-T/feature-categorical-index-zerocopy and squashes the following commits:

53342e8 [Wes McKinney] Use the PyCapsule API to preserve base references to C++ objects when no PyObject* is available to set as zero-copy ndarray base
0b847d1 [Wes McKinney] Fix flakes
4270b5d [Licht-T] Fix C++ lint issues
ddc6b84 [Licht-T] TST: Add test_zero_copy_dictionaries
e0561dc [Licht-T] ENH: Add zero_copy_only option check
de4ed3e [Licht-T] ENH: Implement Categorical Block Zero-Copy

xhochy reviewed Oct 22, 2017

Licht-T mentioned this pull request Oct 22, 2017

ARROW-1689: [Python] Implement zero-copy conversions for DictionaryArray #1237

Closed

njwhite force-pushed the feature/zerocopycategories branch from 68b4b1b to 67fc614 Compare October 23, 2017 22:12

wesm requested changes Oct 24, 2017

wesm changed the title ~~ARROW-1689 Allow User To Request No Data Copies~~ ARROW-1689: [Python] Allow user to request no data copies Oct 24, 2017

njwhite added 2 commits October 24, 2017 23:25

ARROW-1689 Allow User To Request No Data Copies

a968b0b

This makes performance debugging much easier, as it allows you to track down what (arrow) data is causing unexpected delays in loading.

ARROW-1689 Don't Deserialize the Dictionary Twice

b06f50d

njwhite force-pushed the feature/zerocopycategories branch from 67fc614 to b06f50d Compare October 24, 2017 22:25

wesm approved these changes Oct 26, 2017

wesm closed this in 6b16cca Oct 26, 2017
