CompressedOutputStream can't correctly compress/write files bigger than 16 GB #14699

@izapolsk

Description

Describe the bug, including details regarding any error messages, version, and platform.

I encountered an issue in pyarrow 10.0.0 with CompressedOutputStream: it cannot correctly compress files bigger than 16 GB. I tried several times with different Arrow files, with the same result.
Environment: Debian/Ubuntu.

import pyarrow as pa
from pathlib import Path
import datasets as ds
#%%
pa.__version__
>> '10.0.0'
#%%
data_dir = Path('~/tmp').expanduser()
big_dataset = data_dir.joinpath('train.arrow')
#%%
!ls -lh ~/tmp/train.arrow
>> -rw-rw-r-- 1 yzapols yzapols 28G Nov 11 13:35 ~/tmp/train.arrow
#%%
!md5sum ~/tmp/train.arrow
>>5afe31d206ce07249c127e067bcfa0fb  ~/tmp/train.arrow
#%%
schema = pa.schema([...])
#%%
compressed_dataset = data_dir.joinpath('train.arrow.bz2')
with pa.ipc.open_stream(str(big_dataset)) as istream:
    with pa.OSFile(str(compressed_dataset), 'wb') as output_file:
        with pa.CompressedOutputStream(output_file, compression='bz2') as ostream:
            with pa.RecordBatchStreamWriter(ostream, schema) as writer:
                try:
                    while True:
                        writer.write_batch(istream.read_next_batch())
                except StopIteration:
                    print('done')

>> done
#%%
!ls -lh ~/tmp/train.arrow.bz2
>> -rw-rw-r-- 1 yzapols yzapols 2.4G Nov 11 17:54 ~/tmp/train.arrow.bz2
#%%
!mv ~/tmp/train.arrow ~/tmp/train.arrow.old
#%%
!bunzip2 -k ~/tmp/train.arrow.bz2
#%%
!ls -lh ~/tmp/train.arrow
>> -rw-rw-r-- 1 yzapols yzapols 16G Nov 11 17:54 ~/tmp/train.arrow
#%%
!md5sum  ~/tmp/train.arrow
>> 2460c2c81c5c8672f4b488cfa2ecd8c1 ~/tmp/train.arrow
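
As the listing shows, decompressing the 2.4 GB archive yields a 16 GB file instead of the original 28 GB, and the md5 no longer matches, so the compressed stream appears to be truncated. Assuming the truncation happens inside pa.CompressedOutputStream, a possible workaround is to let the standard-library bz2 module do the compression instead (with pyarrow, one might then wrap the bz2 file object in pa.PythonFile so RecordBatchStreamWriter can write to it; that wiring is an assumption and untested here). The sketch below only demonstrates the stdlib streaming round-trip on a small stand-in payload:

```python
import bz2
import hashlib
import os
import tempfile

def bz2_roundtrip(data: bytes, path: str) -> bytes:
    """Compress `data` to `path` with streaming bz2, then read it back."""
    with bz2.open(path, "wb") as f:
        f.write(data)
    with bz2.open(path, "rb") as f:
        return f.read()

# 1 MiB of random bytes as a stand-in for the 28 GB train.arrow file
payload = os.urandom(1 << 20)
with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "train.arrow.bz2")
    restored = bz2_roundtrip(payload, out)
    # unlike the pyarrow path above, the stdlib round-trip preserves the checksum
    assert hashlib.md5(restored).hexdigest() == hashlib.md5(payload).hexdigest()
```

This sidesteps the pyarrow compression stream entirely; it does not confirm where the 16 GB limit comes from, only that stdlib bz2 streaming is a viable fallback.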

Component

Python
