
[python legacy] use process pool instead of thread pool for files planning #4745

Merged
rdblue merged 1 commit into apache:master from puchengy:legacy-py-process-pool
May 18, 2022

Conversation

@puchengy (Contributor) commented May 11, 2022

Based on profiling, plan_files() spends most of its time on copying and file reading, both of which are CPU-bound operations. However, the current implementation uses a thread pool rather than a process pool, which does not speed up the operation at all.

This diff proposes switching to a process pool instead of a thread pool.

Flame graph: profile.svg (attached image)

Command
py-spy record -o profile.svg -- python myprogram.py

Code

from iceberg.hive import HiveTables

# Connect to the Hive metastore and enable worker-pool scan planning.
conf = {
    "hive.metastore.uris": "xxxx",              # metastore URI elided
    "iceberg.scan.plan-in-worker-pool": True,   # run plan_files() in a pool
    "iceberg.worker.num-threads": 4,            # pool size
}
tables = HiveTables(conf)
table = tables.load("abc.xyz")  # database.table
scan = table.new_scan()
files = scan.plan_files()       # the operation being profiled

@samredai (Contributor) commented May 11, 2022

Looking at the function passed to the pool (get_scans_for_manifest), I can understand how this improves the parsing and evaluator logic. For the Avro file reads I'm not sure; I'd expect those to scale better with multithreading since they're I/O bound. @Fokko, is it possible that threading may be better here beyond a certain number of files?

rdblue closed this May 11, 2022
rdblue reopened this May 11, 2022
@rdblue (Contributor) commented May 11, 2022

Sorry, I didn't mean to close this. That was an accident.

@Fokko (Contributor) left a comment

Looks good. If we run into memory issues, we could also replace the current map function:

if self.ops.conf.get(SCAN_THREAD_POOL_ENABLED):
    with Pool(self.ops.conf.get(WORKER_THREAD_POOL_SIZE_PROP,
                                cpu_count())) as reader_scan_pool:
        return itertools.chain.from_iterable([scan for scan
                                              in reader_scan_pool.map(self.get_scans_for_manifest,
                                                                      matching_manifests)])
else:
    return itertools.chain.from_iterable([self.get_scans_for_manifest(manifest)
                                          for manifest in matching_manifests])

Replace map with its lazy sibling imap:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap

import itertools
import logging
from multiprocessing import cpu_count
from multiprocessing.dummy import Pool

nit: we could combine the imports:

from multiprocessing import cpu_count, Pool

@Fokko (Contributor) commented May 12, 2022

@samredai

Looking at the function passed to the pool (get_scans_for_manifest) I can understand how this improves the parsing and evaluator logic. For the avro file reads, I'm not sure but I'd expect this would scale better with multithreading since that's I/O bound. @Fokko is it possible that threading may be better here beyond a certain number of files?

I think it would make sense to parallelize that as well to reduce the runtime of the function. Especially if we need to fetch files from external storage, which introduces some latency, I would parallelize those kinds of calls as well to improve overall throughput.

@kbendick (Contributor) left a comment

Overall I think this looks good as well.

I don't know if scrutinizing it super closely for corner cases is necessary, given that we've seen the performance gains and this is the legacy Python project.

I've put out some calls for testing in the community to people who I know actively use the python_legacy project a lot. If there's any negative feedback, we can always revert if need be. 👍

Good find @puchengy!

@chyzzqo2 (Contributor) commented:

In general, in Python it's hard to deal with libraries that make this choice for you. Parallelism via threads and parallelism via processes don't mix well, so the choice is better left to the application rather than the library. Can we make the choice configurable?

@samredai (Contributor) commented:

@chyzzqo2 this is a great point. We already have the iceberg.scan.plan-in-worker-pool config property that enables this code path. Maybe we can add a config along the lines of iceberg.scan.plan-in-worker-pool.type with values of "threads" or "processes". Also, I just noticed that "threads" is in the config property name for pool sizes (here), which can be confusing if processes are actually being used.

@rdblue (Contributor) commented May 18, 2022

I think we want to follow up and implement @samredai's suggestion to add a pool type that controls this. We can do that in a follow-up, though. In the meantime, I'll merge this.

rdblue merged commit 3586e14 into apache:master May 18, 2022
@rdblue (Contributor) commented May 18, 2022

Thanks, @puchengy!

puchengy deleted the legacy-py-process-pool branch May 18, 2022 17:18
@puchengy (Contributor, Author) commented:

@rdblue Thanks, I will have a follow-up PR for that.

8 participants