[python legacy] use process pool instead of thread pool for files planning #4745
rdblue merged 1 commit into apache:master
Conversation
Looking at the function passed to the pool (get_scans_for_manifest), I can understand how this improves the parsing and evaluator logic. For the Avro file reads I'm not sure; I'd expect those to scale better with multithreading since they're I/O-bound. @Fokko is it possible that threading may be better here beyond a certain number of files?
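To illustrate the trade-off raised here (a sketch, not code from the PR): CPython releases the GIL during blocking I/O, so a thread pool can overlap many I/O waits, while pure-Python CPU work in threads stays serialized by the GIL. The `fake_read` function below is a hypothetical stand-in for a blocking Avro read.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool

def fake_read(_):
    # Stand-in for a blocking Avro read; sleep releases the GIL like real I/O.
    time.sleep(0.05)
    return "ok"

start = time.perf_counter()
with ThreadPool(4) as pool:
    results = pool.map(fake_read, range(4))
elapsed = time.perf_counter() - start

assert results == ["ok"] * 4
# Four 50 ms waits overlap across threads instead of summing to 200 ms.
assert elapsed < 0.15
```

Swapping `ThreadPool` for `multiprocessing.Pool` would not make this I/O-bound demo faster, which is why the choice depends on whether planning time is dominated by parsing (CPU) or reads (I/O).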
Sorry, I didn't mean to close this. That was an accident. |
Fokko left a comment
Looks good. If we run into memory issues, we could also replace the current map function:
iceberg/python_legacy/iceberg/core/data_table_scan.py
Lines 65 to 73 in 93111e8
Replace map with its lazy sibling, imap:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap
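A minimal sketch of that suggestion (the names here are illustrative, not the Iceberg code): `imap` returns an iterator, so per-manifest results are yielded as they complete instead of being materialized as one big list, which caps peak memory when there are many manifests.

```python
from multiprocessing.dummy import Pool  # thread pool keeps the demo simple (no pickling)

def scans_for(manifest_id):
    # Hypothetical stand-in for get_scans_for_manifest: one list of scans per manifest.
    return [manifest_id] * 3

with Pool(2) as pool:
    lazy = pool.imap(scans_for, range(4))  # an iterator, not a list
    first = next(lazy)                     # results are consumed incrementally
    rest = list(lazy)

assert first == [0, 0, 0]
assert rest == [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
```

`imap` still yields results in input order; `imap_unordered` would additionally relax ordering for slightly better pipelining.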
import itertools
import logging
from multiprocessing import cpu_count
from multiprocessing.dummy import Pool
nit: we could combine the imports:
from multiprocessing import cpu_count, Pool
I think it would make sense to parallelize that as well to reduce the function's runtime. Especially when we need to fetch files from external storage, which introduces latency, parallelizing those calls would improve overall throughput.
Overall I think this looks good as well.
I don't know if scrutinizing it super closely for corner cases is necessary given that we've seen the performance gains and it is the legacy python project.
I've put out some calls for testing in the community to people who I know actively use the python_legacy project. If there's any negative feedback, we can always revert if need be. 👍
Good find @puchengy!
In general, in Python it's hard to deal with libraries that make this choice for you. Parallelism via threads and parallelism via processes don't mix, so the choice is better left to the application rather than the library. Can we make the choice configurable?
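A hypothetical sketch of what such configurability could look like (`make_pool` and the `"thread"`/`"process"` values are invented here for illustration, not an Iceberg API). It works because both pool classes expose the same `map`/`imap` interface:

```python
from multiprocessing import Pool, cpu_count
from multiprocessing.dummy import Pool as ThreadPool

def make_pool(pool_type, workers=None):
    # Let the application pick the parallelism model instead of the library.
    workers = workers or cpu_count()
    if pool_type == "process":
        return Pool(workers)        # CPU-bound work (parsing, evaluators)
    if pool_type == "thread":
        return ThreadPool(workers)  # I/O-bound work (remote file reads)
    raise ValueError(f"unknown pool_type: {pool_type}")

with make_pool("thread", 2) as pool:
    squares = pool.map(lambda n: n * n, range(5))
assert squares == [0, 1, 4, 9, 16]
```

Note the thread pool can map a lambda, while a process pool requires a picklable, module-level function; that asymmetry is another reason to surface the choice to the caller.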
@chyzzqo2 this is a great point. We already have the |
I think we want to follow up and implement @samredai's suggestion to add a pool type that controls this. We can do that in a follow-up, though. In the meantime, I'll merge this.
Thanks, @puchengy!
@rdblue Thanks, I will have a follow-up PR for that.
Based on the analysis, plan_files() spent most of its time on copying and file reading, which are both CPU-bound operations. However, the current implementation uses a thread pool instead of a process pool, which does not speed up the operation at all. This diff proposes switching to a process pool.
Flame graph

Command
py-spy record -o profile.svg -- python myprogram.py
Code