@wesm wesm commented Oct 8, 2017

This results in nice speedups when column conversions do not require the GIL to be held:

In [5]: import numpy as np

In [6]: import pandas as pd

In [7]: import pyarrow as pa

In [8]: NROWS = 1000000

In [9]: NCOLS = 50

In [10]: arr = np.random.randn(NCOLS, NROWS).T

In [11]: arr[::5] = np.nan

In [12]: df = pd.DataFrame(arr)

In [13]: %timeit rb = pa.RecordBatch.from_pandas(df, nthreads=1)
10 loops, best of 3: 179 ms per loop

In [14]: %timeit rb = pa.RecordBatch.from_pandas(df, nthreads=4)
10 loops, best of 3: 59.7 ms per loop

This introduces a dependency on futures, the Python 2.7 backport of concurrent.futures (PSF license).

wesm added 2 commits October 7, 2017 22:10
…t.futures for parallel processing

Change-Id: Ic2a0232fbf2a7eca21fe8624099b2fc3ec49bfee
…_pandas default

Change-Id: Ib73b1a6307997337f238d709664d5a716a724dcf
@wesm wesm changed the title [Python] Multithreaded conversions to Arrow in from_pandas ARROW-1594: [Python] Multithreaded conversions to Arrow in from_pandas Oct 8, 2017
Contributor

@cpcloud cpcloud left a comment

Nice perf improvement! LGTM, small comments.

convert_types)]
else:
from concurrent import futures
with futures.ThreadPoolExecutor(nthreads) as executor:
Contributor

Is it much slower to just use this code path?

Member Author

@wesm wesm Oct 8, 2017

I doubt it; it was more the principle of not starting a thread pool for no reason. I'd be fine with this being the only code path.
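
For readers following along, the two code paths under discussion can be sketched roughly as below. This is a minimal illustration and not the code in this PR: `convert_column` and `convert_columns` are hypothetical names, and the pure-Python stand-in conversion holds the GIL, so it shows the control flow only, not the speedup.

```python
from concurrent import futures


def convert_column(values):
    # Hypothetical stand-in for the real per-column conversion.
    # The real conversion only benefits from threads when it releases the GIL.
    return [float(v) for v in values]


def convert_columns(columns, nthreads=1):
    if nthreads == 1:
        # Serial path: no thread pool is created at all.
        return [convert_column(col) for col in columns]
    # Parallel path: one task per column; map() preserves input order.
    with futures.ThreadPoolExecutor(nthreads) as executor:
        return list(executor.map(convert_column, columns))
```

Both paths return the converted columns in the same order, so making the thread pool the only code path would change performance characteristics but not results.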

Member Author

The thread pool seems to have some non-trivial fixed overhead, at least 20 microseconds per task on my machine. I did some quick testing, and when the number of columns is "large" relative to the number of rows, using a thread pool is slower than running single-threaded. This suggests we should use a heuristic to decide whether to use the thread pool, to avoid bad performance on wide tables without a ton of rows.

For example

NROWS = 10000
NCOLS = 500

arr = np.random.randn(NCOLS, NROWS).T
arr[::5] = np.nan

df = pd.DataFrame(arr)

%timeit sdf = pa.serialize_pandas(df, nthreads=1)
10 loops, best of 3: 62.8 ms per loop

%timeit sdf = pa.serialize_pandas(df, nthreads=2)
10 loops, best of 3: 83.4 ms per loop
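
One way to estimate the fixed per-task cost mentioned above is to push no-op tasks through a pool and divide the elapsed time by the task count. The sketch below assumes Python 3 (for `time.perf_counter`); `noop` and `per_task_overhead` are hypothetical helper names, and the number it reports will vary by machine and interpreter.

```python
import time
from concurrent import futures


def noop():
    # Does no work, so the measured time is almost entirely pool overhead.
    return None


def per_task_overhead(ntasks=10000, nthreads=2):
    """Return the approximate seconds of thread pool overhead per task."""
    with futures.ThreadPoolExecutor(nthreads) as executor:
        start = time.perf_counter()
        pending = [executor.submit(noop) for _ in range(ntasks)]
        for future in pending:
            future.result()  # wait for every task to finish
        elapsed = time.perf_counter() - start
    return elapsed / ntasks


if __name__ == "__main__":
    print("~%.1f microseconds per task" % (per_task_overhead() * 1e6))
```

With one task per column, a table with many columns and few rows pays this cost many times while each task does little real work.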

Member Author

Added a rough heuristic to turn off the thread pool if the number of rows is less than 100 times the number of columns.
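
As a rough illustration of that rule (not the exact code added in this PR), the check could look like the sketch below. The function name `should_use_thread_pool` and the constant name are hypothetical; only the factor of 100 comes from the comment above.

```python
# Hypothetical name; the factor of 100 is the one quoted in the comment above.
MIN_ROWS_PER_COLUMN = 100


def should_use_thread_pool(nrows, ncols, nthreads):
    """Decide whether a thread pool is likely to pay for its overhead."""
    if nthreads <= 1:
        return False
    # Wide tables with few rows spend more time on per-task overhead
    # than on conversion work, so fall back to the serial path.
    return nrows >= MIN_ROWS_PER_COLUMN * ncols
```

Applied to the two benchmarks in this thread, the 1,000,000 x 50 frame would use the pool, while the 10,000 x 500 frame would fall back to a single thread.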

cloudpickle
numpy>=1.10.0
six
futures
Contributor

What does this do when running under Python 3?

Member Author

It's basically a no-op because concurrent.futures is part of the py3 standard library.

Contributor

Cool, just curious.

Member Author

I was wrong; this should not be installed on py3. Sorting out a fix.
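
The thread does not show how this was fixed. One common approach, sketched here under that caveat, is to add the backport to the dependency list only when running on Python 2. The helper name `compute_install_requires` is hypothetical, and the base package list simply mirrors the requirements snippet quoted above.

```python
import sys


def compute_install_requires(version_info=sys.version_info):
    """Build a dependency list that includes 'futures' only on Python 2."""
    # Illustrative base list, copied from the requirements snippet above.
    requires = ['cloudpickle', 'numpy>=1.10.0', 'six']
    if version_info[0] < 3:
        # concurrent.futures is in the standard library from Python 3.2 on,
        # so the backport is only needed (and only appropriate) on Python 2.
        requires.append('futures')
    return requires
```

In a plain requirements file, an environment marker such as `futures; python_version < "3"` expresses the same condition.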

…/deserialize consistent

Change-Id: I0d717d20c754df3f57e2f23f3bca5ed752af51c1

wesm commented Oct 8, 2017

I threaded (... sorry) the nthreads parameter through serialize_pandas, too. I also changed deserialize_pandas to use the same default for nthreads as to_pandas, for consistency.

wesm commented Oct 8, 2017

I can add a unit test for this tomorrow

wesm added 3 commits October 8, 2017 08:22
…s with nthreads

Change-Id: Icbaaec800b9a84bb4ae47934d99fe0b092d2459d
Change-Id: I5b6adf842548b2e676fd3499dd0b7f47f7a99b25
Change-Id: I5040150b5407a76fecdf8d240ba86bd5ab8ac15f

wesm commented Oct 8, 2017

+1

@asfgit asfgit closed this in 208e798 Oct 8, 2017
@wesm wesm deleted the multithreaded-from-pandas branch October 8, 2017 17:01