Describe the bug, including details regarding any error messages, version, and platform.
I encountered an issue in pyarrow 10.0.0 with CompressedOutputStream: files larger than 16 GB are silently truncated on compression. The compression step reports no error, but after decompressing, the output is cut off at 16 GB and its checksum no longer matches the original. I reproduced this several times with different Arrow files.
Environment: Debian/Ubuntu.
import pyarrow as pa
from pathlib import Path
import datasets as ds
#%%
pa.__version__
>> '10.0.0'
#%%
data_dir = Path('~/tmp').expanduser()
big_dataset = data_dir.joinpath('train.arrow')
#%%
!ls -lh ~/tmp/train.arrow
>> -rw-rw-r-- 1 yzapols yzapols 28G Nov 11 13:35 ~/tmp/train.arrow
#%%
!md5sum ~/tmp/train.arrow
>> 5afe31d206ce07249c127e067bcfa0fb ~/tmp/train.arrow
#%%
schema = pa.schema([...])
#%%
compressed_dataset = data_dir.joinpath('train.arrow.bz2')
with pa.ipc.open_stream(str(big_dataset)) as istream:
    with pa.OSFile(str(compressed_dataset), 'wb') as output_file:
        with pa.CompressedOutputStream(output_file, compression='bz2') as ostream:
            with pa.RecordBatchStreamWriter(ostream, schema) as writer:
                try:
                    while True:
                        writer.write_batch(istream.read_next_batch())
                except StopIteration:
                    print('done')
>> done
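For comparison, here is a small-scale sketch (not the reproduction itself; sizes are scaled down, since a full >16 GB run takes hours) showing that Python's stdlib incremental bz2 codec round-trips streamed data without loss. This suggests the truncation is specific to pyarrow's `CompressedOutputStream` rather than to bz2 itself:

```python
import bz2
import os

# 1 MiB of incompressible random data as a stand-in for the 28 GB file.
original = os.urandom(1 << 20)

compressor = bz2.BZ2Compressor()
chunks = []
# Feed the data in 64 KiB chunks, mimicking batch-by-batch writes.
for i in range(0, len(original), 1 << 16):
    chunks.append(compressor.compress(original[i:i + (1 << 16)]))
chunks.append(compressor.flush())  # flush() finalizes the bz2 stream

restored = bz2.decompress(b''.join(chunks))
assert restored == original  # no truncation at the stdlib level
```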
#%%
!ls -lh ~/tmp/train.arrow.bz2
>> -rw-rw-r-- 1 yzapols yzapols 2.4G Nov 11 17:54 ~/tmp/train.arrow.bz2
#%%
!mv ~/tmp/train.arrow ~/tmp/train.arrow.old
#%%
!bunzip2 -k ~/tmp/train.arrow.bz2
#%%
!ls -lh ~/tmp/train.arrow
>> -rw-rw-r-- 1 yzapols yzapols 16G Nov 11 17:54 ~/tmp/train.arrow
#%%
!md5sum ~/tmp/train.arrow
>> 2460c2c81c5c8672f4b488cfa2ecd8c1 ~/tmp/train.arrow

Component: Python
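For reference, a pure-Python equivalent of the `md5sum` verification above, reading in chunks so arbitrarily large files can be hashed without loading them into memory (the paths in the comment are the reporter's files, shown as placeholders):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunk_size pieces."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# file_md5('~/tmp/train.arrow.old') should match the original checksum;
# after the lossy round trip, file_md5('~/tmp/train.arrow') differs.
```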