ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame#252
wesm wants to merge 5 commits into apache:master
…me. Default to multiprocessing.cpu_count for now Change-Id: If00238db7460b6eed0347c5392b3b7d6afc2b43b
…used Change-Id: I5f34c800ab0f83bb5a7c613aa0031e5cf0a9805b
…umns Change-Id: Ib2ec14278ee66d7b3c333bd3e388587c3a30f07c
Change-Id: Idc51bcfdbcc332a2bb716eee9708b88ea53f100a
############################################################
# compiler flags that are common across debug/release builds
set(CXX_COMMON_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall")
That will make adding additional compiler flags via CMake harder?
I factored out the CMake code common to the C++ and Python libraries, so you can set CMAKE_CXX_FLAGS externally to modify the flags.
data.append(PyObject_to_object(arr))
return pd.DataFrame(dict(zip(names, data)), columns=names)
if nthreads is None:
As good practice, I would also cap the thread count here at os.environ['OMP_NUM_THREADS'].
done (it uses OMP_NUM_THREADS by default if it's set)
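For readers following the thread, here is a minimal sketch of how such a default could be resolved. It is not the code from this patch: the helper name `default_nthreads` and its parameters are hypothetical, and the fallback to half the logical core count is taken from the PR description rather than from the diff.

```python
import os


def default_nthreads(env=None, cpu_count=None):
    """Hypothetical helper: choose a default thread count.

    Prefers OMP_NUM_THREADS when it is set to a positive integer,
    otherwise falls back to half of the logical core count.
    """
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS")
    if raw is not None:
        try:
            n = int(raw)
        except ValueError:
            n = 0  # unparsable values (e.g. "4,2") fall through to the default
        if n > 0:
            return n
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpu_count // 2)
```

Passing `env` and `cpu_count` explicitly is only there to make the sketch easy to test; a real implementation would likely read both from the process directly.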
'float32': np.arange(size, dtype=np.float32),
'float64': np.arange(size, dtype=np.float64),
'bool': np.random.randn(size) > 0,
# Pandas only support ns resolution, Arrow at the moment only ms
Arrow also supports s, us, and ns. Only Parquet is limited to ms and us.
Added a TODO. After the changes in this PR we should return to the timestamp resolution stuff in a separate patch
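To make the resolution mismatch concrete, the sketch below rescales an integer timestamp between the units mentioned above (s, ms, us, ns). The function `rescale_timestamp` is a hypothetical illustration and not part of pyarrow or pandas; it only shows the arithmetic and that converting to a coarser unit discards precision.

```python
# Ticks per second for each unit discussed in this thread.
_UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def rescale_timestamp(value, from_unit, to_unit):
    """Hypothetical helper: rescale an integer epoch timestamp between units."""
    src = _UNITS_PER_SECOND[from_unit]
    dst = _UNITS_PER_SECOND[to_unit]
    if dst >= src:
        # Finer (or equal) resolution: exact multiplication.
        return value * (dst // src)
    # Coarser resolution: floor division drops the sub-unit remainder.
    return value // (src // dst)
```

For example, a millisecond value has to be multiplied by 1,000,000 before it can be read as nanoseconds, while going from nanoseconds back to milliseconds loses anything below one millisecond.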
// Functions for pandas conversion via NumPy

#include <Python.h>
I think this include was needed for older NumPy versions, e.g. 1.8 and 1.9. (We should still support them, as those are the default versions you should build manylinux1 packages against.) I will re-add it (with an explanatory comment) if this was a problematic change.
It's the first include in pyarrow/adapters/pandas.h, so this should have no effect
I'll add it back in the spirit of IWYU (include what you use).
…ke files. Add pyarrow.cpu_count/set_cpu_count functions per feedback Change-Id: I84a24335856bc855c9959a41f706a3764b35fb7e
+1
This yields a substantial speedup on my laptop. On a 1GB numeric dataset, with 1 thread (the default prior to this patch):
With 4 threads (this is a true quad-core machine)
The default number of cores used is os.cpu_count divided by 2 (since hyperthreading doesn't help with this largely memory-bound operation).
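As a rough illustration of the idea (not the implementation in this patch), the per-column fan-out can be sketched in pure Python. The names `convert_column` and `convert_table` are hypothetical stand-ins, and plain lists stand in for Arrow columns. Python threads will only give a real speedup if the per-column work releases the GIL, which this toy conversion does not.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_column(values):
    # Stand-in for the real per-column conversion work.
    return [float(v) for v in values]


def convert_table(columns, nthreads=1):
    """Convert each column independently, optionally on a thread pool.

    columns: dict mapping column name -> sequence of values.
    Returns a dict with the same column order.
    """
    names = list(columns)
    if nthreads <= 1:
        data = [convert_column(columns[name]) for name in names]
    else:
        with ThreadPoolExecutor(max_workers=nthreads) as pool:
            # pool.map yields results in input order, so names stay aligned.
            data = list(pool.map(convert_column, (columns[name] for name in names)))
    return dict(zip(names, data))
```

Because columns are converted independently, the threaded and single-threaded paths should produce identical results; only the wall-clock time is expected to differ.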