Python: Parallelize IO #6645
Conversation
Alternative for apache#6590. This uses the ThreadPool approach instead of ThreadPoolExecutor.

The ThreadPoolExecutor is more flexible and works well with heterogeneous tasks: it allows the user to handle exceptions per task and to cancel individual tasks. But the ThreadPoolExecutor also has some limitations, such as not being able to forcefully terminate all the tasks. For reading tasks I think the ThreadPool might be more appropriate, but for writing the ThreadPoolExecutor might be more applicable. A very nice writeup of the differences is available in this blog: https://superfastpython.com/threadpool-vs-threadpoolexecutor/

Before:

```
➜  python git:(fd-threadpool) time python3 /tmp/test.py
python3 /tmp/test.py  3.45s user 2.84s system 2% cpu 3:34.19 total
```

After:

```
➜  python git:(fd-threadpool) ✗ time python3 /tmp/test.py
python3 /tmp/test.py  3.13s user 2.83s system 19% cpu 31.369 total
➜  python git:(fd-threadpool) ✗ time python3 /tmp/test.py
python3 /tmp/test.py  2.94s user 3.08s system 18% cpu 32.538 total
➜  python git:(fd-threadpool) ✗ time python3 /tmp/test.py
python3 /tmp/test.py  2.84s user 3.14s system 20% cpu 29.033 total
```

The requests are long-lining from the EU to the USA, which might skew the absolute numbers a bit, but it makes IO more dominant.
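To make the trade-off concrete, here is a minimal, self-contained sketch contrasting the two stdlib APIs discussed above. The `fetch` function and the task list are illustrative stand-ins for IO-bound file reads, not code from this PR.

```python
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor

def fetch(task):
    # Stand-in for an IO-bound read of a single file task
    return task * 2

tasks = list(range(8))

# ThreadPool: one map call over homogeneous tasks; results come back
# in task order, but there is no per-task error handling or cancellation
with ThreadPool(processes=4) as pool:
    pool_results = pool.map(fetch, tasks)

# ThreadPoolExecutor: one Future per task, so each task's exception can
# be caught individually and pending futures can be cancelled
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch, t) for t in tasks]
    executor_results = [f.result() for f in futures]

print(pool_results == executor_results)  # True
```

For a homogeneous batch of read tasks where only the combined result matters, the single `pool.map` call is the simpler fit; the per-future control of the executor pays off when tasks can fail or be cancelled independently.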
```python
# Prune the stuff that we don't need anyway
file_project_schema_arrow = schema_to_pyarrow(file_project_schema)

arrow_table = ds.dataset(
```
When I was running tests, I noticed that reading into PyArrow became the bottleneck. I think we will probably want to configure the format at least with pre_buffer. That doesn't need to be done here though.
Nice, let me add that! Looking at the docs that makes a lot of sense.
What's also on my list is to test reading tables instead of datasets: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow-parquet-read-table
A dataset is a high-level construct for reading data in a lazy fashion. Since we create a dataset per file, this might not be optimal, and a table read might be more efficient. I haven't done this yet because the dataset allows you to pass in an expression, while for the table you need to pass the filter in DNF format, but we have that conversion available now.
```python
    return boolean_expression_visit(expr, _ConvertToArrowExpression())


def _file_to_table(
```
The reason why I passed row_filter and table to this method in my branch was so that we could make the method public. I think we are going to need that so that tasks can be distributed to other nodes or processes and read independently. Having a method that reads a task and another that reads an iterable of tasks in a threadpool seems like a good API to provide.
Shall we do that in a separate PR? I also gave multiprocessing a shot, but I ran into a lot of serialization issues. Keeping this private until we fix all of that means we won't have to change public APIs later on.
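The API shape being discussed (one public function that reads a single task, another that fans an iterable of tasks out over a thread pool) could be sketched as below. The names `read_task`/`read_tasks` and all the logic are hypothetical placeholders, not the PR's actual code.

```python
from multiprocessing.pool import ThreadPool
from typing import Any, Iterable, List

def read_task(task: Any) -> List[Any]:
    # Hypothetical: read one file scan task into an in-memory chunk.
    # A real implementation would open the file and return a table.
    return [task]

def read_tasks(tasks: Iterable[Any], max_workers: int = 4) -> List[Any]:
    # Fan the tasks out over a thread pool; map preserves task order
    with ThreadPool(processes=max_workers) as pool:
        chunks = pool.map(read_task, list(tasks))
    # Concatenate the per-task chunks into one result
    return [row for chunk in chunks for row in chunk]

print(read_tasks(range(3)))  # [0, 1, 2]
```

Splitting the API this way would let a distributed engine call `read_task` per node while local readers keep the convenience of `read_tasks`.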
Thanks for the review @rdblue 🙌🏻