
[Parquet][Python] Potential regression in Parquet parallel reading #38591

Closed
eeroel opened this issue Nov 5, 2023 · 15 comments · Fixed by #38621


eeroel commented Nov 5, 2023

Describe the enhancement requested

UPDATE: on closer look, this is looking more like a bug. What happens:

When calling to_table() on a FileSystemDataset in Python using pyarrow.fs.S3FileSystem (a minimal sketch of such a call follows this list),

  • Using 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.
  • With current main, there are two HEAD requests and three GET requests for each file. Also, the first HEAD request is made from the main thread, so the downloads are started sequentially. I would expect to see only one HEAD request; I'm not sure whether the three GETs are expected due to some change.
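
For context, here is a minimal sketch of the kind of dataset read being measured. The bucket path, fragment_readahead value, and I/O thread count below are illustrative placeholders rather than the exact script used; see the reproducible example further down.

import pyarrow
import pyarrow.dataset as ds
import pyarrow.fs

# Raise Arrow's I/O thread pool size so many requests can be in flight at once.
pyarrow.set_io_thread_count(100)

fs = pyarrow.fs.S3FileSystem(region="us-east-2")

# Build a FileSystemDataset from a prefix of Parquet files and materialize it.
dataset = ds.dataset("some-bucket/some/prefix/", format="parquet", filesystem=fs)

# fragment_readahead controls how many files are read ahead of the consumer.
table = dataset.to_table(fragment_readahead=100)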

Here's an example using 02de3c1, reading a FileSystemDataset with fragment_readahead = 100 and I/O concurrency set to 100; the Y-axis represents files, the X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET):
[screenshot: request start times per file with 02de3c1]

With the current main (fc8c6b7) it seems that the first request for each file is made from the same thread (blue), and notably there are five requests per file.

[screenshot: request start times per file with current main]

See the comment below for a reproducible example.

I'm running on macOS 14.1.

Component(s)

Parquet


eeroel commented Nov 5, 2023

As a side note, I was a bit disappointed with the total download times I achieved with maximum parallelism using 02de3c1 on my system (MacBook to S3). The requests start with nice concurrency, but a few of the GET requests end up taking quite long to finish, so fragment_readahead > 4 doesn't speed up the total completion time at all. I'm not sure whether this is related to my network or something to optimize in Arrow.


eeroel commented Nov 5, 2023

I noticed that there is an extra HEAD request coming from somewhere, so I think what I wrote below is unlikely to be related.

Just a hunch (I haven't followed the whole code path), but could it be that, e.g. from here https://github.com/apache/arrow/blob/fc8c6b7dc8287c672b62c62f3a2bd724b3835063/cpp/src/arrow/dataset/file_parquet.cc#L606C72-L606C72 or from EnsureCompleteMetadata, the synchronous version of GetReader gets called instead of GetReaderAsync? Introduced in 76c4a6e.

EDIT: or maybe from this call inside AsyncScanner::ScanBatchesUnorderedAsync, as I think that function gets called when invoking the Python Dataset.to_table:

Result<EnumeratedRecordBatchGenerator> AsyncScanner::ScanBatchesUnorderedAsync(
    ...
    FragmentsToBatches(std::move(fragment_gen), scan_options));

eeroel changed the title from "Potential regression in Parquet parallel reading" to "[Parquet] Potential regression in Parquet parallel reading" Nov 6, 2023
eeroel changed the title from "[Parquet] Potential regression in Parquet parallel reading" to "[Parquet][Python] Potential regression in Parquet parallel reading" Nov 6, 2023

eeroel commented Nov 7, 2023

Here's a reproducible example that doesn't use FileSystemDataset but parquet.read_table:

import pyarrow._s3fs
# Enable S3 trace logging so every HTTP request shows up in the log.
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)

from pyarrow.dataset import (
    ParquetFileFormat,
    ParquetFragmentScanOptions
)

import pyarrow
import pyarrow.fs
import pyarrow.parquet

# Read a single file from the public ursa-labs-taxi-data bucket.
fs = pyarrow.fs.S3FileSystem(region="us-east-2")
fs = pyarrow.fs.SubTreeFileSystem("ursa-labs-taxi-data", fs)

pyarrow.parquet.read_table("2019/01/data.parquet", filesystem=fs)

This makes not only two HEAD requests, but four of them in total (the first two while getting the schema?).

cat /tmp/foo2.log | grep HeaderOut
[DEBUG] 2023-11-07 06:30:22.913 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.059 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.190 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.613 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.904 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.028 CURL [0x17017f000] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.158 CURL [0x17020b000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.552 CURL [0x170297000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:26.542 CURL [0x17017f000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:29.142 CURL [0x17020b000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
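
For quick counting, the request methods can also be tallied from the same trace log with a few lines of Python (this assumes the /tmp/foo2.log path used above):

from collections import Counter

counts = Counter()
with open("/tmp/foo2.log") as f:
    for line in f:
        if "(HeaderOut)" in line:
            # The HTTP method is the first token after the "(HeaderOut)" marker.
            method = line.split("(HeaderOut)", 1)[1].split()[0]
            counts[method] += 1

print(counts)  # e.g. Counter({'GET': 6, 'HEAD': 4})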


mapleFU commented Nov 7, 2023

Using 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.

Since this was merged recently, is the same API being used? I'm a bit confused because this part of the code hasn't changed much recently...


eeroel commented Nov 7, 2023

Using 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.

Since this was merged recently, is the same API being used? I'm a bit confused because this part of the code hasn't changed much recently...

I'm doing a bisect at the moment, I'll update here when I'm done.


mapleFU commented Nov 7, 2023

#36765

The patch related to that issue changes the prefetch default mode from Default to LazyDefault, but maybe it doesn't increase the request count...


eeroel commented Nov 7, 2023

#36765

The patch related to that issue changes the prefetch default mode from Default to LazyDefault, but maybe it doesn't increase the request count...

This is the commit after which there is an extra HEAD request:
0793432


eeroel commented Nov 7, 2023

This is probably where the extra requests come from. I wonder whether these source.Open() and parquet::ParquetFileReader::OpenAsync calls are necessary:

ARROW_ASSIGN_OR_RAISE(auto input, source.Open());


mapleFU commented Nov 7, 2023

Very nice catch! It seems that this commit ignored your change and re-opens the file. That part was probably written before your change was checked in 😅, and after your change and the rebase it didn't get updated.


eeroel commented Nov 7, 2023

Very nice catch! It seems that this commit ignored your change and re-opens the file. That part was probably written before your change was checked in 😅, and after your change and the rebase it didn't get updated.

Yeah, that was my reading as well. I can try fixing it later.


mapleFU commented Nov 7, 2023

Yeah. Maybe we can add a test that counts I/O operations to prevent a regression later... This is so awkward :-(


eeroel commented Nov 7, 2023

Yeah. Maybe we can add a test that counts I/O operations to prevent a regression later... This is so awkward :-(

Do you have an idea how to do it? I currently just tested by setting the S3 filesystem logging to trace level and filtering on the command line, but that's not really elegant, and the problem isn't related to S3 per se (though I'm not sure whether it will manifest with all file system implementations?).


mapleFU commented Nov 7, 2023

I think we can provide a MockInputStream (with ReadAsync and Read counting) and hard-code an expected I/O count there. Any change that alters the I/O count would then be reported by the test.

Also cc @pitrou for any more ideas...

But you can go ahead with a quick fix for this issue.
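
As a rough illustration of the counting idea from the Python side (the suggestion above is about a C++ MockInputStream, so this is only a sketch; the CountingLocalFileSystem class, file path, and counts here are made up for the example), an fsspec filesystem can be wrapped so that file opens are counted and then handed to pyarrow through FSSpecHandler:

from fsspec.implementations.local import LocalFileSystem

import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq


class CountingLocalFileSystem(LocalFileSystem):
    """Local filesystem that counts how many times a file is opened for reading."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.open_calls = 0

    def _open(self, path, mode="rb", **kwargs):
        self.open_calls += 1
        return super()._open(path, mode=mode, **kwargs)


# Write a small test file, then read it back through the counting wrapper.
pq.write_table(pa.table({"x": [1, 2, 3]}), "/tmp/count_io_example.parquet")

counting_fs = CountingLocalFileSystem()
arrow_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(counting_fs))
pq.read_table("/tmp/count_io_example.parquet", filesystem=arrow_fs)

# A regression like the one in this issue would show up as a jump in this number.
print("open calls:", counting_fs.open_calls)

A real test in Arrow would assert a fixed expected count, so that any change in the number of I/O calls fails CI.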

mapleFU pushed a commit that referenced this issue Nov 8, 2023
…ormat::GetReaderAsync` (#38621)

### Rationale for this change
There were duplicate method calls causing extra I/O operations, apparently unintentional from 0793432.

### What changes are included in this PR?
Remove the extra method calls.

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #38591

Authored-by: Eero Lihavainen <eero.lihavainen@nitor.com>
Signed-off-by: mwish <maplewish117@gmail.com>
mapleFU added this to the 15.0.0 milestone Nov 8, 2023

mapleFU commented Nov 8, 2023

Thanks @eeroel

I've merged this patch now; maybe I'll try to add a counting test tonight.


eeroel commented Nov 8, 2023

Thanks @eeroel

I've merged this patch now; maybe I'll try to add a counting test tonight.

Thank you!

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Nov 9, 2023
…tFileFormat::GetReaderAsync` (apache#38621)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…tFileFormat::GetReaderAsync` (apache#38621)
raulcd modified the milestones: 15.0.0, 14.0.2 Nov 28, 2023
raulcd pushed a commit that referenced this issue Nov 28, 2023
…ormat::GetReaderAsync` (#38621)
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…tFileFormat::GetReaderAsync` (apache#38621)