
ARROW-17583: [C++][Python] Changed data width of WrittenFile.size to int64 to match C++ code #14032

Merged

Conversation

joosthooz
Contributor

This fixes an exception raised while writing large Parquet files:

```
Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/_dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.__init__
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
```
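
The root cause: the C++ writer reports the file size as an `int64_t`, but the Python binding accepted it through a 32-bit C int, which overflows for files of 2 GiB or more. For illustration, a minimal Cython sketch of the kind of change involved (not the exact diff from this PR):

```
# Sketch only, not the actual pyarrow source. A Cython parameter typed as
# plain `int` maps to a C int (32 bits on common platforms), so sizes
# >= 2**31 bytes (~2 GiB) raise OverflowError during conversion.
from libc.stdint cimport int64_t

cdef class WrittenFile:
    cdef readonly str path
    cdef readonly object metadata
    cdef readonly int64_t size  # widened from a 32-bit C int

    def __init__(self, str path, object metadata, int64_t size):
        self.path = path
        self.metadata = metadata
        self.size = size
```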

@github-actions
github-actions bot commented Sep 2, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@joosthooz
Contributor Author

I'm using this script to reproduce the problem:

```
import os
import pyarrow as pa
import pyarrow.dataset as ds
import tempfile

def file_visitor(written_file):
    print(f"path={written_file.path}")
    print(f"metadata={written_file.metadata}")

with tempfile.TemporaryDirectory() as path:
    # Write one CSV with 4M small integer values.
    with open(f"{path}/part-0.csv", "w") as f:
        for i in range(2**22):  # 4M values
            f.write(f"{i % 123}\n")

    # Symlink it 999 times so the dataset totals ~4 billion rows, enough
    # to push the single uncompressed parquet output file past 2 GiB.
    for add_part in range(1, 1000):
        os.symlink(f"{path}/part-0.csv", f"{path}/part-{add_part}.csv")

    d = ds.dataset(path, format=ds.CsvFileFormat())
    print(d.schema)
    outfile = f"{path}/pqfile.parquet"
    dataset_write_format = ds.ParquetFileFormat()
    write_options = dataset_write_format.make_write_options(compression=None)
    ds.write_dataset(
        d.scanner(),
        outfile,
        format=dataset_write_format,
        file_options=write_options,
        file_visitor=file_visitor,
    )
    print("output file size: " + str(os.path.getsize(f"{outfile}/part-0.parquet")))
```

It's a bit cumbersome and takes a minute or so, so I don't think it is suitable to add as a unit test.
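
Were it adapted into a test anyway, the assertion itself would be small. A sketch of what the visitor could check on a build with the fix (`written_file.size` is the attribute this PR widens):

```
def file_visitor(written_file):
    # Before the fix, constructing the WrittenFile raised OverflowError,
    # so this body never ran; with the fix the full size comes through.
    print(f"size={written_file.size}")
    assert written_file.size > 2**31  # past the ~2 GiB C int limit
```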

@joosthooz
Contributor Author

There's a failure in `test_write_dataset_max_rows_per_file`:

```
FileNotFoundError: [Errno 2] Failed to open local file '/tmp/pytest-of-root/pytest-0/test_write_dataset_max_rows_pe0/ds/part-1.parquet'. Detail: [errno 2] No such file or directory
```

I don't think this is due to my change.

@jorisvandenbossche
Member

> There's a failure in test_write_dataset_max_rows_per_file. I don't think this is due to my change.

That seems to be https://issues.apache.org/jira/browse/ARROW-17614

@jorisvandenbossche
Member

We do have `@pytest.mark.slow` and `@pytest.mark.large_memory` markers for such tests (and those are not run by default). Although in this case, I think it is also fine to just merge without a test.
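
For reference, a sketch of how such a reproduction could be gated with those markers (skipped unless explicitly enabled, per the marker setup above; the test name and body here are hypothetical):

```
import pytest

@pytest.mark.slow
@pytest.mark.large_memory
def test_written_file_size_over_2gib(tmp_path):
    # Would run the >2 GiB reproduction from this thread and assert on
    # written_file.size inside the file_visitor.
    ...
```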

@joosthooz
Contributor Author

Is there anything I can do? I'd be happy to run additional tests if needed.

@jorisvandenbossche
Member

Let's just merge this as is. Thanks for the PR!

@jorisvandenbossche jorisvandenbossche merged commit 43670af into apache:master Sep 8, 2022
@ursabot

ursabot commented Sep 8, 2022

Benchmark runs are scheduled for baseline = 6ff5224 and contender = 43670af. 43670af is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.1% ⬆️0.0%] test-mac-arm
[Failed ⬇️1.1% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.46% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 43670af0 ec2-t3-xlarge-us-east-2
[Failed] 43670af0 test-mac-arm
[Failed] 43670af0 ursa-i9-9960x
[Finished] 43670af0 ursa-thinkcentre-m75q
[Finished] 6ff52243 ec2-t3-xlarge-us-east-2
[Failed] 6ff52243 test-mac-arm
[Failed] 6ff52243 ursa-i9-9960x
[Finished] 6ff52243 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022