ARROW-1594: [Python] Multithreaded conversions to Arrow in from_pandas #1186

wesm · 2017-10-08T02:29:17Z

This results in nice speedups when column conversions do not require the GIL to be held:

In [5]: import numpy as np

In [6]: import pandas as pd

In [7]: import pyarrow as pa

In [8]: NROWS = 1000000

In [9]: NCOLS = 50

In [10]: arr = np.random.randn(NCOLS, NROWS).T

In [11]: arr[::5] = np.nan

In [12]: df = pd.DataFrame(arr)

In [13]: %timeit rb = pa.RecordBatch.from_pandas(df, nthreads=1)
10 loops, best of 3: 179 ms per loop

In [14]: %timeit rb = pa.RecordBatch.from_pandas(df, nthreads=4)
10 loops, best of 3: 59.7 ms per loop

This introduces a dependency on futures, the Python 2.7 backport of concurrent.futures (PSF license).
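A minimal sketch of the per-column pattern described above, assuming each column can be converted independently. `convert_column` and `convert_columns` are hypothetical names for illustration, not pyarrow functions, and plain Python lists stand in for NumPy columns.

```python
from concurrent import futures


def convert_column(values):
    # Hypothetical stand-in for the real per-column conversion:
    # map NaN (the only value not equal to itself) to None.
    return [None if v != v else v for v in values]


def convert_columns(columns, nthreads=1):
    """Convert every column, on a thread pool when nthreads > 1."""
    if nthreads == 1:
        # Single-threaded path: no executor is created at all.
        return [convert_column(col) for col in columns]
    with futures.ThreadPoolExecutor(nthreads) as executor:
        # executor.map yields results in input order, so column
        # order is preserved regardless of completion order.
        return list(executor.map(convert_column, columns))
```

Because this stand-in is pure Python, it holds the GIL and will not show a speedup; the timings above depend on the real conversion running without the GIL.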


cpcloud

Nice perf improvement! LGTM, small comments.

cpcloud · 2017-10-08T02:30:28Z

python/pyarrow/pandas_compat.py

+                                  convert_types)]
+    else:
+        from concurrent import futures
+        with futures.ThreadPoolExecutor(nthreads) as executor:


Is it much slower to just use this code path?

I doubt it; it was more about the principle of not starting a thread pool for no reason. I'd be fine with this being the only code path.

The thread pool seems to have some non-trivial fixed overhead, at least 20 microseconds per task on my machine. In some quick testing, when the number of columns is large relative to the number of rows, using a thread pool is slower than a single thread. This suggests we should use a heuristic to decide whether to use the thread pool, to avoid bad performance on wide tables without many rows.

For example:

NROWS = 10000

NCOLS = 500

arr = np.random.randn(NCOLS, NROWS).T

arr[::5] = np.nan

df = pd.DataFrame(arr)

%timeit sdf = pa.serialize_pandas(df, nthreads=1)
10 loops, best of 3: 62.8 ms per loop

%timeit sdf = pa.serialize_pandas(df, nthreads=2)
10 loops, best of 3: 83.4 ms per loop

Added a rough heuristic to turn off the thread pool if the number of rows is less than 100 times the number of columns.
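A rough sketch of that heuristic, following the commit message's condition (nrows > ncols * 100). The function name `use_thread_pool` is an assumption for illustration; the actual check lives inline in the conversion code and may be written differently.

```python
def use_thread_pool(nrows, ncols, nthreads):
    """Decide whether a thread pool is likely to pay off."""
    if nthreads <= 1:
        # Nothing to parallelize with a single thread.
        return False
    # Each column is one task; the fixed per-task overhead is only
    # worth paying when every column carries enough rows.
    return nrows > ncols * 100
```

With the shapes from the two benchmarks in this thread, `use_thread_pool(1000000, 50, 4)` selects the pool, while `use_thread_pool(10000, 500, 2)` falls back to the single-threaded path.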

cpcloud · 2017-10-08T02:33:29Z

python/requirements.txt

 cloudpickle
 numpy>=1.10.0
 six
+futures


What does this do when running under Python 3?

It's basically a no-op because concurrent.futures is part of the py3 standard library

Cool, just curious.

I was wrong; this should not be installed on py3. Sorting out a fix.
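One common way to restrict a backport to Python 2 is to build the dependency list conditionally in setup.py. The sketch below is an assumption about the general approach, not a copy of the fix that landed here; `build_install_requires` is a hypothetical helper name.

```python
import sys


def build_install_requires(version_info=sys.version_info):
    """Return the dependency list for the given interpreter version."""
    requires = ['numpy>=1.10.0', 'six']
    if version_info[0] < 3:
        # concurrent.futures is in the standard library from Python 3.2,
        # so the 'futures' backport is only needed on Python 2.
        requires.append('futures')
    return requires
```

In a requirements file, the equivalent is a PEP 508 environment marker such as `futures; python_version < "3"`.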


wesm · 2017-10-08T02:46:38Z

I threaded (... sorry) this parameter through serialize_pandas, too. I also changed deserialize_pandas to use the same default for nthreads as to_pandas, for consistency.

wesm · 2017-10-08T02:47:03Z

I can add a unit test for this tomorrow


wesm · 2017-10-08T15:44:51Z

+1

wesm added 2 commits October 7, 2017 22:10

Add nthreads argument to RecordBatch/Table.from_pandas. Use concurren…

6a58c03

…t.futures for parallel processing Change-Id: Ic2a0232fbf2a7eca21fe8624099b2fc3ec49bfee

Default to cpu_count() for nthreads in from_pandas to conform with to…

15841d1

…_pandas default Change-Id: Ib73b1a6307997337f238d709664d5a716a724dcf

wesm changed the title ~~[Python] Multithreaded conversions to Arrow in from_pandas~~ ARROW-1594: [Python] Multithreaded conversions to Arrow in from_pandas Oct 8, 2017

cpcloud approved these changes Oct 8, 2017


Add nthreads argument to serialize_pandas, make default for serialize…

0afab34

…/deserialize consistent Change-Id: I0d717d20c754df3f57e2f23f3bca5ed752af51c1

wesm added 3 commits October 8, 2017 08:22

Only install concurrent.futures backport on py2, test serialize_panda…

5a69208

…s with nthreads Change-Id: Icbaaec800b9a84bb4ae47934d99fe0b092d2459d

Add heuristic to use threadpool conversion only if nrows > ncols * 100

c30e473

Change-Id: I5b6adf842548b2e676fd3499dd0b7f47f7a99b25

Only install futures on py2

a3072f0

Change-Id: I5040150b5407a76fecdf8d240ba86bd5ab8ac15f

asfgit closed this in 208e798 Oct 8, 2017

wesm deleted the multithreaded-from-pandas branch October 8, 2017 17:01

asfimport mentioned this pull request Oct 8, 2017

[Python] Enable multi-threaded conversions in Table.from_pandas #17607

Closed
