fix: Padding now added to Arrow file marker and RecordBatches are being written with correct alignment #95
Conversation
… now written with correct alignment and metadata length.
Can we add tests for this?
var rbBlocks = [org_apache_arrow_flatbuf_Block]()

for batch in batches {
    addPadForAlignment(&writer)
Hmm. Do we need this?
What is this padding for in the specification?
Good question. Without this padding there is an unaligned block; however, I realise now that the schema hasn't been padded. I'll revert this one and place the padding after the schema message.
Confirmed that calling addPadForAlignment(&writer) after writing the schema fixed the alignment problem. Fixed in second commit.
withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) { writer.append(Data($0)) }
withUnsafeBytes(of: rbResult.1.o.littleEndian) { writer.append(Data($0)) }
writer.append(rbResult.0)
addPadForAlignment(&writer)
This is for "Padding bytes to an 8-byte boundary" https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message , right?
Yes - I believe that every time a FlatBuffers message is written it needs to be padded. I think that FlatBuffers messages align to 8 bytes internally but not necessarily where they terminate.
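For reference, a minimal sketch of what an alignment helper like addPadForAlignment might look like; the actual signature and body in this repo may differ:

import Foundation

func addPadForAlignment(_ writer: inout Data, alignment: Int = 8) {
    // Append zero bytes until the buffer length is a multiple of `alignment`.
    let remainder = writer.count % alignment
    if remainder != 0 {
        writer.append(Data(repeating: 0, count: alignment - remainder))
    }
}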
let metadataEnd = writer.count
let metadataLength = metadataEnd - startIndex
This includes the above padding for "The metadata_size includes the size of the Message plus padding" https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format , right?
Yes, exactly. What's strange, however, is that unless the 8-byte prefix, i.e.:
<continuation: 0xFFFFFFFF>
<metadata_size: int32>
is included in the metadata length, the file is invalid according to PyArrow.
Doing a small experiment with the example in testFileWriter_bool, writing the block metadata like this, without the 8-byte prefix in the metadataLength:
offset: 120
metadataLength: 208
bodyLength: 296
PyArrow throws an error:
pyarrow.lib.ArrowInvalid: flatbuffer size 8 invalid. File offset: 128, metadata length: 208
However, if the metadataLength includes the 8-byte prefix, i.e.:
offset: 120
metadataLength: 216
bodyLength: 296
The file is valid according to PyArrow. I think this might be missing from the spec, or maybe it's a bug in the C++ implementation.
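To make that concrete, here is a hedged sketch of measuring the footer Block's metaDataLength the way PyArrow accepts it; the names message and blockOffset are illustrative, not the actual identifiers in this PR:

let blockOffset = writer.count                 // start of this block in the file
withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) { writer.append(Data($0)) }
withUnsafeBytes(of: Int32(message.count).littleEndian) { writer.append(Data($0)) }
writer.append(message)                         // record batch Message flatbuffer
addPadForAlignment(&writer)                    // pad the message to an 8-byte boundary
// PyArrow only accepts the file if this span includes the 8-byte prefix:
let metadataLength = writer.count - blockOffset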
Good point. Let's open an issue for it on https://github.com/apache/arrow.
let metadataLength = metadataEnd - startIndex
switch writeRecordBatchData(&writer, fields: batch.schema.fields, columns: batch.columns) {
case .success:
    addPadForAlignment(&writer)
Hmm. Do we need this?
What is this padding for in the specification?
The actual data needs to be padded, but you're right, it's superfluous because this is already done in writeRecordBatchData. Removed.
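For context, a body writer in the style of writeRecordBatchData would typically pad each buffer it emits, which is why the outer call was redundant. This is a sketch under that assumption, with an illustrative function name:

func writeRecordBatchBuffer(_ buffer: Data, into writer: inout Data) {
    writer.append(buffer)
    addPadForAlignment(&writer)   // each buffer already ends on an 8-byte boundary
}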
I'd like to, but there are a couple of decisions to be made first, I think:
OK. Let's defer it. We can add tests later in a separate PR.
…rd batch body length from writer offsets. Add guard to check total block size.
+1
Files were being written with a non-padded file marker, and alignment padding was not being written when serializing record batches. Also, the metadata length was being set to zero in the block.
What's Changed
The padded version of the file marker is now written (see the sketch after this list).
Padding is written to record blocks and record block metadata, i.e. both are padded to an 8-byte boundary.
The metadata length was being written as zero. This was preventing PyArrow from reading files written by ArrowWriter. It is now calculated and set in the Block.

Closes #91.
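As a reference for the file marker fix, a minimal sketch per the Arrow File format: the magic is the 6-byte string "ARROW1", and at the start of a file it is followed by two zero bytes so the stream begins on an 8-byte boundary (the trailing magic at the end of the file is left unpadded). The writer variable here is illustrative:

let fileMarker = "ARROW1".data(using: .utf8)!   // 6-byte magic
var writer = Data()
writer.append(fileMarker)
writer.append(Data(repeating: 0, count: 2))     // pad start-of-file marker to 8 bytes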