Skip to content

ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame#252

Closed
wesm wants to merge 5 commits intoapache:masterfrom
wesm:ARROW-428
Closed

ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame#252
wesm wants to merge 5 commits intoapache:masterfrom
wesm:ARROW-428

Conversation

@wesm
Copy link
Member

@wesm wesm commented Dec 26, 2016

This yields a substantial speedup on my laptop. On a 1GB numeric dataset, with 1 thread (the default prior to this patch):

>>> %timeit df2 = table.to_pandas(nthreads=1)
1 loop, best of 3: 498 ms per loop

With 4 threads (this is a true quad-core machine)

>>> %timeit df2 = table.to_pandas(nthreads=4)
1 loop, best of 3: 151 ms per loop

The default number of cores used is the os.cpu_count divided by 2 (since hyperthreading doesn't help with this largely memory-bound operation).

wesm added 4 commits December 26, 2016 15:26
…me. Default to multiprocessing.cpu_count for now

Change-Id: If00238db7460b6eed0347c5392b3b7d6afc2b43b
…used

Change-Id: I5f34c800ab0f83bb5a7c613aa0031e5cf0a9805b
…umns

Change-Id: Ib2ec14278ee66d7b3c333bd3e388587c3a30f07c
Change-Id: Idc51bcfdbcc332a2bb716eee9708b88ea53f100a
############################################################

# compiler flags that are common across debug/release builds
set(CXX_COMMON_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will make adding additional compiler flags via CMake harder?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored out common cmake code between C++/Python libraries, so you can set CMAKE_CXX_FLAGS to modify externally

data.append(PyObject_to_object(arr))

return pd.DataFrame(dict(zip(names, data)), columns=names)
if nthreads is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a good practice, I would also limit here to environ['OMP_NUM_THREADS'].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (it uses OMP_NUM_THREADS by default if it's set)

'float32': np.arange(size, dtype=np.float32),
'float64': np.arange(size, dtype=np.float64),
'bool': np.random.randn(size) > 0,
# Pandas only support ns resolution, Arrow at the moment only ms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow also support s, us, ns. Just Parquet is limited to ms and us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO. After the changes in this PR we should return to the timestamp resolution stuff in a separate patch


// Functions for pandas conversion via NumPy

#include <Python.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this include was needed for older Numpies, e.g. 1.8 and 1.9. (We should still support them as those are the default ones you should build manylinux1 packages against.) Will re-add (with an explanative comment) if this was a problematic change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the first include in pyarrow/adapters/pandas.h, so this should have no effect

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it back in IWYU spirit

…ke files. Add pyarrow.cpu_count/set_cpu_count functions per feedback

Change-Id: I84a24335856bc855c9959a41f706a3764b35fb7e
@wesm wesm changed the title ARROW-428: Multithreaded conversion from Arrow table to pandas.DataFrame ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame Dec 28, 2016
@wesm
Copy link
Member Author

wesm commented Dec 28, 2016

+1

@asfgit asfgit closed this in ab5f66a Dec 28, 2016
@wesm wesm deleted the ARROW-428 branch December 28, 2016 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants