-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1594: [Python] Multithreaded conversions to Arrow in from_pandas #1186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…t.futures for parallel processing Change-Id: Ic2a0232fbf2a7eca21fe8624099b2fc3ec49bfee
…_pandas default Change-Id: Ib73b1a6307997337f238d709664d5a716a724dcf
cpcloud
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice perf improvement! LGTM, small comments.
| convert_types)] | ||
| else: | ||
| from concurrent import futures | ||
| with futures.ThreadPoolExecutor(nthreads) as executor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it much slower to just use this code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt it, it was more the principle of starting a thread pool for no reason. I'd be fine with this being the only code path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thread pool seems to have some non-trivial fixed overhead, at least 20 microseconds per task on my machine. I did some quick testing and when the number of columns is "large" relative to the number of rows, then using a thread pool is slower than single thread. This suggests we should use some heuristic to decide whether to use the thread pool to avoid bad performance in wide tables without a ton of rows.
For example
NROWS = 10000
NCOLS = 500
arr = np.random.randn(NCOLS, NROWS).T
arr[::5] = np.nan
df = pd.DataFrame(arr)
%timeit sdf = pa.serialize_pandas(df, nthreads=1)
10 loops, best of 3: 62.8 ms per loop
%timeit sdf = pa.serialize_pandas(df, nthreads=2)
10 loops, best of 3: 83.4 ms per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a rough heuristic to turn off the thread pool if number of rows is less than 100 times the number of columns
python/requirements.txt
Outdated
| cloudpickle | ||
| numpy>=1.10.0 | ||
| six | ||
| futures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do when running under Python 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's basically a no-op because concurrent.futures is part of the py3 standard library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, this should not be installed in py3. Sorting out a fix
…/deserialize consistent Change-Id: I0d717d20c754df3f57e2f23f3bca5ed752af51c1
|
I threaded (... sorry) this parameter through |
|
I can add a unit test for this tomorrow |
…s with nthreads Change-Id: Icbaaec800b9a84bb4ae47934d99fe0b092d2459d
Change-Id: I5b6adf842548b2e676fd3499dd0b7f47f7a99b25
Change-Id: I5040150b5407a76fecdf8d240ba86bd5ab8ac15f
|
+1 |
This results in nice speedups when column conversions do not require GIL to be held:
This introduces a dependency on the
futuresPython 2.7 backport of concurrent.futures (PSF license)