
Conversation

willtemperley
Contributor

Files were being written with a non-padded file marker, and alignment padding was not being written when serializing record batches. The metadata length was also being set to zero in the block.

What's Changed

The padded version of the file marker is now written.

Padding is written to record blocks and record block metadata, i.e.:

 addPadForAlignment(&writer)

The metadata length was being written as zero, which prevented PyArrow from reading files written by ArrowWriter. It is now calculated and set in the Block.

Closes #91.
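
For context, a minimal sketch of what an 8-byte alignment helper like addPadForAlignment might look like (an illustration under assumptions, not the actual ArrowWriter implementation):

    import Foundation

    // Illustrative sketch only: append zero bytes until the buffer length is a
    // multiple of 8, the alignment the Arrow IPC format expects between messages.
    // The real addPadForAlignment in ArrowWriter may differ.
    func addPadForAlignment(_ writer: inout Data, alignment: Int = 8) {
        let remainder = writer.count % alignment
        if remainder != 0 {
            writer.append(Data(repeating: 0, count: alignment - remainder))
        }
    }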

… now written with correct alignment and metadata length.
@willtemperley willtemperley mentioned this pull request Oct 13, 2025
Member

@kou kou left a comment


Can we add tests for this?

var rbBlocks = [org_apache_arrow_flatbuf_Block]()

for batch in batches {
    addPadForAlignment(&writer)
Member


Hmm. Do we need this?

What is this padding for in the specification?

Contributor Author


Good question. Without this padding there is an unaligned block; however, I realise now that the schema hasn't been padded. I'll revert this one and place the padding after the schema message.

Contributor Author


Confirmed that calling addPadForAlignment(&writer) after writing the schema fixed the alignment problem. Fixed in second commit.
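
For illustration, a rough, self-contained sketch of that fix; the 0xAA bytes are a stand-in for a serialized schema message, not real Arrow data:

    import Foundation

    // Sketch of the fix: pad to an 8-byte boundary immediately after the schema
    // message so the first record batch block starts aligned.
    var writer = Data()
    writer.append(Data(repeating: 0xAA, count: 13))   // placeholder for the serialized schema message
    let remainder = writer.count % 8
    if remainder != 0 {
        writer.append(Data(repeating: 0, count: 8 - remainder))  // the padding this commit adds
    }
    assert(writer.count % 8 == 0)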

withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) {writer.append(Data($0))}
withUnsafeBytes(of: rbResult.1.o.littleEndian) {writer.append(Data($0))}
writer.append(rbResult.0)
addPadForAlignment(&writer)
Member


This is for "Padding bytes to an 8-byte boundary" https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message , right?

Contributor Author


Yes - I believe that every time a FlatBuffers message is written it needs to be padded. I think that FlatBuffers messages align to 8 bytes internally but not necessarily where they terminate.
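
As a small arithmetic illustration of that rule (the message size below is made up):

    // Each encapsulated IPC message is padded so the next message starts on an
    // 8-byte boundary. A hypothetical 205-byte FlatBuffers message would be
    // followed by 3 zero bytes of padding.
    let messageLength = 205                          // made-up FlatBuffers message size
    let paddedLength = (messageLength + 7) & ~7      // round up to a multiple of 8 -> 208
    let paddingBytes = paddedLength - messageLength  // 3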

Comment on lines 149 to 150
let metadataEnd = writer.count
let metadataLength = metadataEnd - startIndex
Member


This includes the above padding for "The metadata_size includes the size of the Message plus padding" https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format , right?

Contributor Author


Yes, exactly. What's strange, however, is that unless the 8-byte prefix, i.e.:

<continuation: 0xFFFFFFFF>
<metadata_size: int32>

is included in the metadata length, the file is invalid according to PyArrow.

In a small experiment with the example in testFileWriter_bool, writing the block metadata like this, without the 8-byte prefix in the metadataLength:

offset: 120
metadataLength: 208
bodyLength: 296

PyArrow throws an error:
pyarrow.lib.ArrowInvalid: flatbuffer size 8 invalid. File offset: 128, metadata length: 208

However, if the metadataLength includes the 8-byte prefix, i.e.:

offset: 120
metadataLength: 216
bodyLength: 296

the file is valid according to PyArrow. I think this might be missing from the spec, or maybe it's a bug in the C++ implementation.
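
Restating the experiment as a sketch of how the accepted block values relate (numbers taken from the testFileWriter_bool example above, not a general derivation):

    // The metadataLength that PyArrow accepts spans everything from `offset` to
    // the start of the body: the 4-byte continuation marker, the 4-byte
    // metadata_size field, the FlatBuffers message, and its alignment padding.
    let offset = 120                  // file position of the continuation marker
    let prefix = 8                    // continuation marker (4) + metadata_size (4)
    let messagePlusPadding = 208      // FlatBuffers message bytes plus padding
    let metadataLength = prefix + messagePlusPadding  // 216, the value PyArrow accepts
    let bodyLength = 296              // record batch body bytes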

Member


Good point. Let's open an issue for it on https://github.com/apache/arrow .

let metadataLength = metadataEnd - startIndex
switch writeRecordBatchData(&writer, fields: batch.schema.fields, columns: batch.columns) {
case .success:
    addPadForAlignment(&writer)
Member


Hmm. Do we need this?

What is this padding for in the specification?

Contributor Author


The actual data does need to be padded, but you're right that this call is superfluous because it's already done in writeRecordBatchData. Removed.

@willtemperley
Contributor Author

> Can we add tests for this?

I'd like to, but there are a couple of decisions to be made first, I think:
#96

@kou
Member

kou commented Oct 13, 2025

OK. Let's defer it. We can add tests later in a separate PR.

…rd batch body length from writer offsets. Add guard to check total block size.
@kou kou changed the title fix: Padding now added to arrow filemarker and RecordBatches are being written with correct alignment. fix: Padding now added to arrow filemarker and RecordBatches are being written with correct alignment Oct 14, 2025
@kou kou changed the title fix: Padding now added to arrow filemarker and RecordBatches are being written with correct alignment fix: Padding now added to Arrow file marker and RecordBatches are being written with correct alignment Oct 14, 2025
Member

@kou kou left a comment


+1

@kou kou merged commit 9259237 into apache:main Oct 14, 2025
10 checks passed

Successfully merging this pull request may close these issues: Padding bytes for the arrow filemarker are not being written.