ARROW-9458: [Python] Release GIL in ScanTask.execute #7756

Closed

maartenbreddels
Contributor

@jorisvandenbossche FYI I made you coauthor

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member

Thanks for the PR! (OK, I was missing the with gil inside the for loop in my attempt ..)

(no worries about the authorship, I have my commits in arrow ;))

@jorisvandenbossche
Member

And I can confirm using scan tasks now gives a similar parallelization speed up as using fragments.
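
For readers less familiar with why the GIL matters here, a minimal stdlib-only sketch of the general effect (not the Arrow code itself). The assumption is that `time.sleep` stands in for a native call that releases the GIL while it runs; `gil_releasing_work`, `run_serial`, and `run_threaded` are hypothetical names used only for this illustration.

```python
import concurrent.futures
import time


def gil_releasing_work(seconds):
    # Stand-in (assumption) for a native call that drops the GIL while it runs;
    # time.sleep releases the GIL for the duration of the sleep.
    time.sleep(seconds)
    return 1


def run_serial(n_tasks, seconds):
    # Run the tasks one after another on the calling thread.
    start = time.perf_counter()
    total = sum(gil_releasing_work(seconds) for _ in range(n_tasks))
    return total, time.perf_counter() - start


def run_threaded(n_tasks, seconds):
    # Run the same tasks on a thread pool with one worker per task.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_tasks) as pool:
        total = sum(pool.map(gil_releasing_work, [seconds] * n_tasks))
    return total, time.perf_counter() - start


serial_total, serial_elapsed = run_serial(4, 0.1)
threaded_total, threaded_elapsed = run_threaded(4, 0.1)
print(f"serial: {serial_elapsed:.2f}s, threaded: {threaded_elapsed:.2f}s")
```

If the work held the GIL for its whole duration instead, the threaded run would be expected to take roughly as long as the serial one.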

@maartenbreddels
Contributor Author

FYI, this:

## common code ##
import pyarrow as pa
import pyarrow.dataset as ds
import concurrent.futures
import glob
pool = concurrent.futures.ThreadPoolExecutor()
# Note: this rebinds `ds` from the pyarrow.dataset module to the Dataset object
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
## end common code ##

def process(f):
    # Materialize the whole fragment as a table and count its rows
    return len(f.to_table(use_threads=False))
sum(pool.map(process, ds.get_fragments()))

For me, this takes between 10 and 16 seconds (very irregular).

While this:

def process(fragment):
    scanned = 0
    for scan_task in fragment.scan(use_threads=False):
        for record_batch in scan_task.execute():
            scanned += record_batch.num_rows
    return scanned
sum(pool.map(process, ds.get_fragments()))

takes 7-9 seconds, more consistently.

Each file (fragment) has 1 million rows (1 row group). These may be uninteresting details, but I thought I'd share them.
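
Since the timings above vary from run to run, here is a rough sketch of how the spread could be measured over several repeats. It uses only the standard library; `time_parallel` and `fake_process` are hypothetical names, and `fake_process` is a placeholder for either `process` variant above (it just sleeps briefly), so the numbers it prints say nothing about Arrow.

```python
import concurrent.futures
import time


def time_parallel(process, items, repeats=5, max_workers=None):
    """Run sum(pool.map(process, items)) `repeats` times.

    Returns (results, timings), one entry per repeat.
    """
    items = list(items)  # materialize so every repeat sees the same work
    results, timings = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(repeats):
            start = time.perf_counter()
            results.append(sum(pool.map(process, items)))
            timings.append(time.perf_counter() - start)
    return results, timings


def fake_process(n):
    # Hypothetical stand-in for the per-fragment work in the snippets above.
    time.sleep(0.01)
    return n


results, timings = time_parallel(fake_process, range(8), repeats=3)
print(f"total={results[0]} min={min(timings):.3f}s max={max(timings):.3f}s")
```

With a real dataset, one would pass `process` and a list of the fragments instead of `fake_process` and `range(8)`, and compare the min/max spread of the two variants.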

Member

@wesm left a comment

+1
