
ARROW-17583: [C++][Python] Changed data width of WrittenFile.size to int64 to match C++ code #14032

Merged

Conversation

joosthooz
Contributor

This fixes an exception raised while writing large Parquet files:

```
Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/_dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.__init__
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
```
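
The root cause: the C++ writer reports the file size as an `int64_t`, but the Python binding accepted it through a 32-bit C int, which overflows for files of 2 GiB or more. For illustration, a minimal Cython sketch of the kind of change involved (not the exact diff from this PR):

```
# Sketch only, not the actual pyarrow source. A Cython parameter typed as
# plain `int` maps to a C int (32 bits on common platforms), so sizes
# >= 2**31 bytes (~2 GiB) raise OverflowError during conversion.
from libc.stdint cimport int64_t

cdef class WrittenFile:
    cdef readonly str path
    cdef readonly object metadata
    cdef readonly int64_t size  # widened from a 32-bit C int

    def __init__(self, str path, object metadata, int64_t size):
        self.path = path
        self.metadata = metadata
        self.size = size
```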

@github-actions
github-actions bot commented Sep 2, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@joosthooz
Contributor Author

I'm using this script to reproduce the problem:

```
import os
import pyarrow as pa
import pyarrow.dataset as ds
import tempfile

def file_visitor(written_file):
    print(f"path={written_file.path}")
    print(f"metadata={written_file.metadata}")

with tempfile.TemporaryDirectory() as path:
    # Write one CSV with 4M small integer values.
    with open(f"{path}/part-0.csv", "w") as f:
        for i in range(2**22):  # 4M values
            f.write(f"{i % 123}\n")

    # Symlink it 999 times so the dataset totals ~4 billion rows, enough
    # to push the single uncompressed parquet output file past 2 GiB.
    for add_part in range(1, 1000):
        os.symlink(f"{path}/part-0.csv", f"{path}/part-{add_part}.csv")

    d = ds.dataset(path, format=ds.CsvFileFormat())
    print(d.schema)
    outfile = f"{path}/pqfile.parquet"
    dataset_write_format = ds.ParquetFileFormat()
    write_options = dataset_write_format.make_write_options(compression=None)
    ds.write_dataset(
        d.scanner(),
        outfile,
        format=dataset_write_format,
        file_options=write_options,
        file_visitor=file_visitor,
    )
    print("output file size: " + str(os.path.getsize(f"{outfile}/part-0.parquet")))
```

It's a bit cumbersome and takes a minute or so, so I don't think it is suitable to add as a unit test.
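
Were it adapted into a test anyway, the assertion itself would be small. A sketch of what the visitor could check on a build with the fix (`written_file.size` is the attribute this PR widens):

```
def file_visitor(written_file):
    # Before the fix, constructing the WrittenFile raised OverflowError,
    # so this body never ran; with the fix the full size comes through.
    print(f"size={written_file.size}")
    assert written_file.size > 2**31  # past the ~2 GiB C int limit
```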

@joosthooz
Contributor Author

There's a failure in `test_write_dataset_max_rows_per_file`:

```
FileNotFoundError: [Errno 2] Failed to open local file '/tmp/pytest-of-root/pytest-0/test_write_dataset_max_rows_pe0/ds/part-1.parquet'. Detail: [errno 2] No such file or directory
```

I don't think this is due to my change.

@jorisvandenbossche
Member

> There's a failure in test_write_dataset_max_rows_per_file. I don't think this is due to my change.

That seems to be https://issues.apache.org/jira/browse/ARROW-17614

@jorisvandenbossche
Member

We do have `@pytest.mark.slow` and `@pytest.mark.large_memory` markers for such tests (and those are not run by default). Although in this case, I think it is also fine to just merge without a test.
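
For reference, a sketch of how such a reproduction could be gated with those markers (skipped unless explicitly enabled, per the marker setup above; the test name and body here are hypothetical):

```
import pytest

@pytest.mark.slow
@pytest.mark.large_memory
def test_written_file_size_over_2gib(tmp_path):
    # Would run the >2 GiB reproduction from this thread and assert on
    # written_file.size inside the file_visitor.
    ...
```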

@joosthooz
Contributor Author

Is there anything I can do? I'd be happy to run additional tests if needed.

@jorisvandenbossche
Member

Let's just merge this as is. Thanks for the PR!

@jorisvandenbossche jorisvandenbossche merged commit 43670af into apache:master Sep 8, 2022
@ursabot

ursabot commented Sep 8, 2022

Benchmark runs are scheduled for baseline = 6ff5224 and contender = 43670af. 43670af is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.1% ⬆️0.0%] test-mac-arm
[Failed ⬇️1.1% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.46% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 43670af0 ec2-t3-xlarge-us-east-2
[Failed] 43670af0 test-mac-arm
[Failed] 43670af0 ursa-i9-9960x
[Finished] 43670af0 ursa-thinkcentre-m75q
[Finished] 6ff52243 ec2-t3-xlarge-us-east-2
[Failed] 6ff52243 test-mac-arm
[Failed] 6ff52243 ursa-i9-9960x
[Finished] 6ff52243 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022