fix: Padding now added to Arrow file marker and RecordBatches are being written with correct alignment #95
Conversation
… now written with correct alignment and metadata length.
Can we add tests for this?
var rbBlocks = [org_apache_arrow_flatbuf_Block]()

for batch in batches {
    addPadForAlignment(&writer)
Hmm. Do we need this?
What is this padding for in the specification?
Good question. Without this padding there is an unaligned block; however, I realise now that the schema hasn't been padded. I'll revert this one and place the padding after the schema message.
Confirmed that calling addPadForAlignment(&writer) after writing the schema fixed the alignment problem. Fixed in second commit.
withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) { writer.append(Data($0)) }
withUnsafeBytes(of: rbResult.1.o.littleEndian) { writer.append(Data($0)) }
writer.append(rbResult.0)
addPadForAlignment(&writer)
This is for "Padding bytes to an 8-byte boundary" https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message , right?
Yes - I believe that every time a FlatBuffers message is written it needs to be padded. I think that FlatBuffers messages align to 8 bytes internally but not necessarily where they terminate.
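For reference, a minimal sketch of what an alignment helper like addPadForAlignment might look like; the actual signature and body in this repo may differ:

import Foundation

func addPadForAlignment(_ writer: inout Data, alignment: Int = 8) {
    // Append zero bytes until the buffer length is a multiple of `alignment`.
    let remainder = writer.count % alignment
    if remainder != 0 {
        writer.append(Data(repeating: 0, count: alignment - remainder))
    }
}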
let metadataEnd = writer.count
let metadataLength = metadataEnd - startIndex
This includes the above padding for "The metadata_size includes the size of the Message plus padding" https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format , right?
Yes, exactly. What's strange, however, is that unless the 8-byte prefix, i.e.:
<continuation: 0xFFFFFFFF>
<metadata_size: int32>
is included in the metadata length, the file is invalid according to PyArrow.
Doing a small experiment with the example in testFileWriter_bool, writing the block metadata like this, without the 8-byte prefix in the metadataLength:
offset: 120
metadataLength: 208
bodyLength: 296
PyArrow throws an error:
pyarrow.lib.ArrowInvalid: flatbuffer size 8 invalid. File offset: 128, metadata length: 208
However, if the metadataLength includes the 8-byte prefix, i.e.:
offset: 120
metadataLength: 216
bodyLength: 296
The file is valid according to PyArrow. I think this might be missing from the spec, or maybe it's a bug in the C++ implementation.
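To make that concrete, here is a hedged sketch of measuring the footer Block's metaDataLength the way PyArrow accepts it; the names message and blockOffset are illustrative, not the actual identifiers in this PR:

let blockOffset = writer.count                 // start of this block in the file
withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) { writer.append(Data($0)) }
withUnsafeBytes(of: Int32(message.count).littleEndian) { writer.append(Data($0)) }
writer.append(message)                         // record batch Message flatbuffer
addPadForAlignment(&writer)                    // pad the message to an 8-byte boundary
// PyArrow only accepts the file if this span includes the 8-byte prefix:
let metadataLength = writer.count - blockOffset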
Good point. Let's open an issue for it on https://github.com/apache/arrow.
let metadataLength = metadataEnd - startIndex
switch writeRecordBatchData(&writer, fields: batch.schema.fields, columns: batch.columns) {
case .success:
    addPadForAlignment(&writer)
Hmm. Do we need this?
What is this padding for in the specification?
The actual data needs to be padded, but you're right, it's superfluous because this is already done in writeRecordBatchData. Removed.
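For context, a body writer in the style of writeRecordBatchData would typically pad each buffer it emits, which is why the outer call was redundant. This is a sketch under that assumption, with an illustrative function name:

func writeRecordBatchBuffer(_ buffer: Data, into writer: inout Data) {
    writer.append(buffer)
    addPadForAlignment(&writer)   // each buffer already ends on an 8-byte boundary
}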
I'd like to, but there are a couple of decisions to be made first, I think:
OK. Let's defer it. We can add tests later in a separate PR.
…rd batch body length from writer offsets. Add guard to check total block size.
+1
Files were being written with a non-padded file marker, and alignment padding was not being written when serializing record batches. Also, the metadata length was being set to zero in the block.
What's Changed
The padded version of the file marker is now written (see the sketch after this list).
Padding is written to record blocks and record block metadata, i.e. both are padded to an 8-byte boundary.
The metadata length was being written as zero. This was preventing PyArrow from reading files written by ArrowWriter. It is now calculated and set in the Block.

Closes #91.
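As a reference for the file marker fix, a minimal sketch per the Arrow File format: the magic is the 6-byte string "ARROW1", and at the start of a file it is followed by two zero bytes so the stream begins on an 8-byte boundary (the trailing magic at the end of the file is left unpadded). The writer variable here is illustrative:

let fileMarker = "ARROW1".data(using: .utf8)!   // 6-byte magic
var writer = Data()
writer.append(fileMarker)
writer.append(Data(repeating: 0, count: 2))     // pad start-of-file marker to 8 bytes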