ARROW-9458: [Python] Release GIL in ScanTask.execute#7756

Closed
maartenbreddels wants to merge 2 commits into apache:master from maartenbreddels:ARROW-9458

@maartenbreddels

Contributor

@jorisvandenbossche FYI I made you coauthor

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche

Member

Thanks for the PR! (OK, I was missing the with gil inside the for loop in my attempt ..)

(no worries about the authorship, I have my commits in arrow ;))

@jorisvandenbossche

Member

And I can confirm that using scan tasks now gives a parallelization speedup similar to using fragments.
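As a rough illustration of why releasing the GIL matters for thread-based parallelism, here is a minimal sketch that does not use pyarrow at all. It assumes `time.sleep` as a stand-in for a native call that releases the GIL (sleep does release it); `blocking_call` and `run_in_pool` are hypothetical helper names, not part of this PR.

```python
import concurrent.futures
import time


def blocking_call(seconds):
    # time.sleep releases the GIL while waiting. Here it stands in for a
    # native call that drops the GIL (an assumption for illustration only).
    time.sleep(seconds)
    return seconds


def run_in_pool(n_tasks, seconds, max_workers):
    # Run n_tasks blocking calls on a thread pool and measure wall time.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(blocking_call, [seconds] * n_tasks))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_in_pool(n_tasks=4, seconds=0.2, max_workers=4)
print(f"{len(results)} tasks finished in {elapsed:.2f}s")
```

Because the GIL is released during each call, the four tasks should overlap and finish in roughly the time of one call rather than four. A call that held the GIL for its whole duration would instead run the tasks one after another.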

@maartenbreddels

Contributor Author

FYI, this:

## common code ##
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available; no `ds` alias, since `ds` is rebound to the dataset below
import concurrent.futures
import glob
pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
## end common code ##

def process(f):
    # Materialize the whole fragment as a table and count its rows.
    return len(f.to_table(use_threads=False))
sum(pool.map(process, ds.get_fragments()))

For me this takes between 10 and 16 seconds (very irregular).

While this:

def process(fragment):
    scanned = 0
    for scan_task in fragment.scan(use_threads=False):
        for record_batch in scan_task.execute():
            scanned += record_batch.num_rows
    return scanned
sum(pool.map(process, ds.get_fragments()))

takes 7-9 seconds, more consistently.

Each file (fragment) has 1 million rows (1 row group). These details may be uninteresting, but I thought I'd share them.
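To put numbers on "very irregular" versus "more consistently", something like the following could be used to repeat a run and report the spread. This is only a sketch: `time_repeated` and `placeholder_workload` are hypothetical names, and the placeholder would be swapped for the `sum(pool.map(process, ds.get_fragments()))` call being measured.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    # Call func several times and collect wall-clock durations in seconds.
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        # stdev needs at least two samples
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


def placeholder_workload():
    # Stand-in for the real benchmark body (an assumption for illustration).
    sum(range(100_000))


stats = time_repeated(placeholder_workload, repeats=5)
print({name: round(value, 4) for name, value in stats.items()})
```

A large gap between `min` and `max`, or a high `stdev` relative to `mean`, would correspond to the irregular timings described for the `to_table` variant.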

@wesm left a comment

Member

+1

3 participants