ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame by wesm · Pull Request #252 · apache/arrow

wesm · 2016-12-26T20:37:06Z

This yields a substantial speedup on my laptop. On a 1GB numeric dataset, with 1 thread (the default prior to this patch):

>>> %timeit df2 = table.to_pandas(nthreads=1)
1 loop, best of 3: 498 ms per loop

With 4 threads (this is a true quad-core machine)

>>> %timeit df2 = table.to_pandas(nthreads=4)
1 loop, best of 3: 151 ms per loop

The default number of cores used is the os.cpu_count divided by 2 (since hyperthreading doesn't help with this largely memory-bound operation).

…me. Default to multiprocessing.cpu_count for now Change-Id: If00238db7460b6eed0347c5392b3b7d6afc2b43b

…used Change-Id: I5f34c800ab0f83bb5a7c613aa0031e5cf0a9805b

…umns Change-Id: Ib2ec14278ee66d7b3c333bd3e388587c3a30f07c

Change-Id: Idc51bcfdbcc332a2bb716eee9708b88ea53f100a

xhochy · 2016-12-27T18:12:47Z

python/CMakeLists.txt

 ############################################################

 # compiler flags that are common across debug/release builds
-set(CXX_COMMON_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall")


That will make adding additional compiler flags via CMake harder?

Factored out common cmake code between C++/Python libraries, so you can set CMAKE_CXX_FLAGS to modify externally

xhochy · 2016-12-27T18:13:35Z

python/pyarrow/table.pyx

-                data.append(PyObject_to_object(arr))
-
-            return pd.DataFrame(dict(zip(names, data)), columns=names)
+        if nthreads is None:


As a good practice, I would also limit here to environ['OMP_NUM_THREADS'].

done (it uses OMP_NUM_THREADS by default if it's set)

xhochy · 2016-12-27T18:14:45Z

python/pyarrow/tests/test_convert_pandas.py

+        'float32': np.arange(size, dtype=np.float32),
+        'float64': np.arange(size, dtype=np.float64),
+        'bool': np.random.randn(size) > 0,
+        # Pandas only support ns resolution, Arrow at the moment only ms


Arrow also support s, us, ns. Just Parquet is limited to ms and us.

Added a TODO. After the changes in this PR we should return to the timestamp resolution stuff in a separate patch

xhochy · 2016-12-27T18:16:39Z

python/src/pyarrow/adapters/pandas.cc


 // Functions for pandas conversion via NumPy

-#include <Python.h>


I think this include was needed for older Numpies, e.g. 1.8 and 1.9. (We should still support them as those are the default ones you should build manylinux1 packages against.) Will re-add (with an explanative comment) if this was a problematic change.

It's the first include in pyarrow/adapters/pandas.h, so this should have no effect

I'll add it back in IWYU spirit

…ke files. Add pyarrow.cpu_count/set_cpu_count functions per feedback Change-Id: I84a24335856bc855c9959a41f706a3764b35fb7e

wesm · 2016-12-28T12:48:51Z

+1

wesm added 4 commits December 26, 2016 15:26

Implement multithreaded conversion from Arrow table to pandas.DataFra…

79f5fd9

…me. Default to multiprocessing.cpu_count for now Change-Id: If00238db7460b6eed0347c5392b3b7d6afc2b43b

Return errors from threaded conversion. Add doc about number of cpus …

bc4dff7

…used Change-Id: I5f34c800ab0f83bb5a7c613aa0031e5cf0a9805b

Add missing GIL acquisition. Do not spawn too many threads if few col…

e70f16d

…umns Change-Id: Ib2ec14278ee66d7b3c333bd3e388587c3a30f07c

Tweak pyarrow cmake flags

cad89e9

Change-Id: Idc51bcfdbcc332a2bb716eee9708b88ea53f100a

xhochy reviewed Dec 27, 2016

View reviewed changes

Factor out common compiler flag code between Arrow C++ and Python CMa…

da929bf

…ke files. Add pyarrow.cpu_count/set_cpu_count functions per feedback Change-Id: I84a24335856bc855c9959a41f706a3764b35fb7e

wesm changed the title ~~ARROW-428: Multithreaded conversion from Arrow table to pandas.DataFrame~~ ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame Dec 28, 2016

asfgit closed this in ab5f66a Dec 28, 2016

wesm deleted the ARROW-428 branch December 28, 2016 12:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame#252

ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame#252
wesm wants to merge 5 commits intoapache:masterfrom
wesm:ARROW-428

wesm commented Dec 26, 2016

Uh oh!

xhochy Dec 27, 2016

Uh oh!

wesm Dec 27, 2016

Uh oh!

xhochy Dec 27, 2016

Uh oh!

wesm Dec 27, 2016

Uh oh!

xhochy Dec 27, 2016

Uh oh!

wesm Dec 27, 2016

Uh oh!

xhochy Dec 27, 2016

Uh oh!

wesm Dec 27, 2016

Uh oh!

wesm Dec 27, 2016

Uh oh!

wesm commented Dec 28, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		// Functions for pandas conversion via NumPy

		#include <Python.h>

Conversation

wesm commented Dec 26, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wesm commented Dec 28, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants