[Format] Clarify that 8 byte padding must not be applied to compressed buffers #31141

asfimport · 2022-02-15T16:47:33Z

I was unable to find where this is discussed, but I think we do not mention that 8 byte padding must not be applied when the buffer is compressed, as it causes us to lose the size of the compressed buffer.

For example:

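import pyarrow
import pyarrow.ipc

# A single int32 column: 5 values * 4 bytes = 20 bytes uncompressed.
data = [
    pyarrow.array([1, 2, 3, 4, 5], type="int32"),
]

batch = pyarrow.record_batch(data, names=['f0'])

# Write the batch to an Arrow IPC file with zstd buffer compression enabled.
options = pyarrow.ipc.IpcWriteOptions(compression="zstd")
with pyarrow.OSFile('test1.arrow', 'wb') as sink:
    with pyarrow.ipc.new_file(sink, batch.schema, options=options) as writer:
        writer.write(batch)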

This writes a single data buffer with the following bytes:

[20, 0, 0, 0, 0, 0, 0, 0, 40, 181, 47, 253, 32, 20, 161, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0]

which is 37 bytes long (padding to an 8-byte boundary would require 40 bytes).
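To make the byte count concrete, here is a minimal sketch that splits the buffer above into its parts using only the standard library. It assumes the layout used for compressed IPC buffers, where the first 8 bytes hold the uncompressed length as a little-endian int64 and the rest is the compressed payload. The helper `padded_to_8` is a hypothetical name introduced for illustration, not a pyarrow function.

```python
import struct

# The single data buffer quoted in this issue.
buffer_bytes = bytes([
    20, 0, 0, 0, 0, 0, 0, 0, 40, 181, 47, 253, 32, 20, 161, 0, 0,
    1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0,
])

# Assumed layout: 8-byte little-endian int64 with the uncompressed length,
# followed by the compressed payload.
(uncompressed_len,) = struct.unpack_from("<q", buffer_bytes, 0)
compressed_payload = buffer_bytes[8:]


def padded_to_8(n: int) -> int:
    """Round n up to the next multiple of 8 (hypothetical helper)."""
    return (n + 7) & ~7


print(len(buffer_bytes))               # 37
print(uncompressed_len)                # 20 (5 int32 values)
print(len(compressed_payload))         # 29
print(compressed_payload[:4].hex())    # 28b52ffd, the zstd frame magic
print(padded_to_8(len(buffer_bytes)))  # 40
```

The compressed length (29) is stored nowhere in the buffer itself; a reader derives it as the recorded buffer length minus the 8-byte prefix. If the recorded length were padded to 40, the same subtraction would yield 32, and the reader could not tell the 3 padding bytes apart from the compressed payload.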

My understanding is that we do not pad because doing so would make us unable to recover the original size of the (compressed) data, and padding offers no advantage since users can't mmap the data anyway.

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-15687. Please see the migration documentation for further details.
