[Python] Allow passing file sizes to FileSystemDataset from Python #37857

Closed
eeroel opened this issue Sep 25, 2023 · 0 comments · Fixed by #37868

Comments

eeroel (Contributor) commented Sep 25, 2023

Describe the enhancement requested

When reading Parquet files from table formats such as Delta Lake, the file sizes are already known from the table format's metadata. However, when building a dataset from fragments using https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment, there is no way to inform PyArrow about the file sizes, which leads to unnecessary HEAD requests when reading from S3. Arrow already supports specifying the file size to avoid these requests to S3, but as far as I can see this is not exposed in PyArrow: #7547
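To make this concrete, here is a minimal sketch of building a dataset from fragments today (the bucket, paths, sizes, and schema are made up for illustration): each file's size has to be rediscovered over the network even though the table metadata already contains it.

```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

# File paths and sizes as recorded in the table format metadata
# (bucket, paths, sizes, and schema are illustrative assumptions).
known_files = {
    "my-bucket/table/part-0.parquet": 123_456,
    "my-bucket/table/part-1.parquet": 234_567,
}
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

s3 = fs.S3FileSystem(region="us-east-1")
parquet_format = ds.ParquetFileFormat()

# There is currently no way to hand the known sizes to make_fragment,
# so PyArrow has to issue a HEAD request per file to discover them itself.
fragments = [
    parquet_format.make_fragment(path, filesystem=s3)
    for path in known_files
]

dataset = ds.FileSystemDataset(fragments, schema, parquet_format, s3)
table = dataset.to_table()
```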

(As a side note, it seems that those HEAD requests in S3FileSystem are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)

I can try to put together a PR with an initial implementation.

Component(s)

Parquet, Python

AlenkaF changed the title Allow passing file sizes to FileSystemDataset from Python [Python] Allow passing file sizes to FileSystemDataset from Python Sep 26, 2023
pitrou pushed a commit that referenced this issue Dec 5, 2023
### Rationale for this change

Allow passing known file sizes to `make_fragment`, to avoid potential network requests.

### What changes are included in this PR?

A new optional file size argument on `FileFormat.make_fragment` in the Python bindings, passed through to the underlying Arrow fragment.

### Are these changes tested?

Yes, there are tests with S3 verifying that the provided file size gets used.

### Are there any user-facing changes?

Yes, new function arguments.
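For illustration, continuing the sketch from the issue description above, the new argument (named `file_size` here) would be used like this:

```python
# Same setup as the earlier sketch; now each fragment is created with its
# known size, so no HEAD request is needed to determine the file length.
fragments = [
    parquet_format.make_fragment(path, filesystem=s3, file_size=size)
    for path, size in known_files.items()
]
dataset = ds.FileSystemDataset(fragments, schema, parquet_format, s3)
```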

* Closes: #37857

Lead-authored-by: Eero Lihavainen <eero.lihavainen@nitor.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Eero Lihavainen <eeelliha@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 15.0.0 milestone Dec 5, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#37868)
jorisvandenbossche added a commit that referenced this issue Feb 27, 2024
…nit (#40143)

### Rationale for this change

Closes #40142

I'm developing a new dask integration with the pyarrow parquet reader (see dask/dask-expr#882) and want to rely more on the pyarrow Filesystem.

Right now, we perform a list operation ourselves to get all touched files, and I would like to pass the retrieved `FileInfo` objects directly to the dataset constructor. This API is already exposed in C++; this PR adds the necessary Python bindings.

The benefit of this API is that it cuts the need to perform additional HEAD requests to remote storage.

This came up in #38389 (comment), and there has already been related work in #37857.

### What changes are included in this PR?

Python bindings for the `DatasetFactory` constructor that accepts a list/vector of `FileInfo` objects.
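As a rough sketch of the intended usage (the bucket path is illustrative; the key point is that the `FileInfo` objects returned by a single LIST already carry each file's size):

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# One LIST operation yields FileInfo objects that already include sizes.
selector = fs.FileSelector("my-bucket/table", recursive=True)
file_infos = [info for info in s3.get_file_info(selector) if info.is_file]

# Pass the FileInfo objects directly instead of plain paths, so the
# dataset factory does not need a HEAD request per file.
dataset = ds.dataset(file_infos, filesystem=s3, format="parquet")
print(dataset.count_rows())
```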

### Are these changes tested?

~I slightly modified the minio test setup so that the Prometheus endpoint is exposed; this can be used to assert that there haven't been any HEAD requests.~ I ended up removing this again since parsing the response is a bit brittle.

### Are there any user-facing changes?

Yes, a list of `FileInfo` objects can now be passed when constructing a dataset.

* Closes: #40142

Lead-authored-by: fjetter <fjetter@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…aset init (apache#40143)
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…aset init (apache#40143)