feat: add list type support to ArrowWriter by jding-xyz · Pull Request #149 · apache/arrow-swift

jding-xyz · 2026-03-19T19:30:40Z

Summary

ArrowWriter currently cannot serialize schemas containing list<T> columns. This PR adds full list type support to the writer, including nested list<struct<...>> schemas.

1. List type writer support (`ArrowWriterHelper.swift`, `ArrowWriter.swift`)

- ArrowWriterHelper.swift: Added a .list case to toFBTypeEnum (returns the .list FlatBuffer type enum) and to toFBType (creates an empty org_apache_arrow_flatbuf_List table).
- ArrowWriter.swift: Added ArrowTypeList handling in writeField, writeFieldNodes, writeBufferInfo, and writeRecordBatchData. This mirrors the existing ArrowTypeStruct pattern but uses ArrowTypeList.elementField and NestedArray.values for the single child array.
- Also moved the nested-type recursion in writeBufferInfo and writeRecordBatchData outside the inner buffer loop, so it runs once per column rather than once per buffer (a latent correctness issue for types with multiple buffers).
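
To illustrate why the placement of the recursion matters, here is a minimal sketch using a simplified model. `ColumnModel` and both functions are hypothetical names invented for this example; they are not arrow-swift types, and the real writer's iteration order is not reproduced here. The sketch assumes a list column owns two buffers (validity and offsets) and its float32 child owns two (validity and data).

```swift
// Simplified, hypothetical model of a column: a name, how many buffers it
// owns, and its child columns. These are not the arrow-swift types.
struct ColumnModel {
    let name: String
    let bufferCount: Int
    let children: [ColumnModel]
}

// Old shape: the recursion sits inside the buffer loop, so the children
// are visited once per parent buffer.
func buffersRecursingPerBuffer(_ columns: [ColumnModel]) -> [String] {
    var out: [String] = []
    for column in columns {
        for index in 0..<column.bufferCount {
            out.append("\(column.name)[\(index)]")
            out += buffersRecursingPerBuffer(column.children)
        }
    }
    return out
}

// New shape: the recursion runs after the buffer loop, once per column.
func buffersRecursingPerColumn(_ columns: [ColumnModel]) -> [String] {
    var out: [String] = []
    for column in columns {
        for index in 0..<column.bufferCount {
            out.append("\(column.name)[\(index)]")
        }
        out += buffersRecursingPerColumn(column.children)
    }
    return out
}

// list<float32>: the list owns validity + offsets, the child validity + data.
let item = ColumnModel(name: "item", bufferCount: 2, children: [])
let raw = ColumnModel(name: "raw", bufferCount: 2, children: [item])

print(buffersRecursingPerColumn([raw]))
// ["raw[0]", "raw[1]", "item[0]", "item[1]"]
print(buffersRecursingPerBuffer([raw]))
// ["raw[0]", "item[0]", "item[1]", "raw[1]", "item[0]", "item[1]"]
```

In this model the per-buffer variant emits the child's buffers twice, which is the kind of duplication the move avoids for any parent type with more than one buffer.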

2. Field-node ordering fix for nested types (`ArrowWriter.swift`)

writeFieldNodes iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. External readers (pyarrow) reject such streams with "Array length did not match record batch length".

Moved child recursion (for both struct and list types) before the parent fbb.create call so children are prepended first and end up after their parent in the resulting vector.
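
The ordering effect can be reproduced with a small stand-alone model. This is a sketch under assumptions: `Column`, `PrependVector`, and the two write functions are hypothetical names for illustration, and `PrependVector` only imitates the "last written ends up first" behavior described above rather than using the FlatBuffers API.

```swift
// Hypothetical minimal model of a nested column; not the arrow-swift types.
struct Column {
    let name: String
    let children: [Column]
}

// Stand-in for a vector built with prepend semantics: each write lands at
// the front, so the last element written ends up first.
struct PrependVector {
    private(set) var items: [String] = []
    mutating func prepend(_ item: String) { items.insert(item, at: 0) }
}

// Previous order: the parent is written before recursing into children.
func writeParentFirst(_ columns: [Column], into out: inout PrependVector) {
    for column in columns.reversed() {
        out.prepend(column.name)
        writeParentFirst(column.children, into: &out)
    }
}

// Fixed order: recurse into children first, then write the parent.
func writeChildrenFirst(_ columns: [Column], into out: inout PrependVector) {
    for column in columns.reversed() {
        writeChildrenFirst(column.children, into: &out)
        out.prepend(column.name)
    }
}

// Schema shaped like: id, items: list<struct<label, score>>, ts
let label = Column(name: "label", children: [])
let score = Column(name: "score", children: [])
let element = Column(name: "item", children: [label, score])
let schema = [
    Column(name: "id", children: []),
    Column(name: "items", children: [element]),
    Column(name: "ts", children: []),
]

var before = PrependVector()
writeParentFirst(schema, into: &before)
print(before.items)
// ["id", "label", "score", "item", "items", "ts"]  (children precede parents)

var after = PrependVector()
writeChildrenFirst(schema, into: &after)
print(after.items)
// ["id", "items", "item", "label", "score", "ts"]  (depth-first pre-order)
```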

3. Schema message padding in streaming format (`ArrowWriter.swift`)

The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. writeFile already does this via addPadForAlignment, but writeStreaming did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream.

Added addPadForAlignment(&schemaData) in writeStreaming to match writeFile behavior.
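
For reference, padding a buffer to a multiple of 8 bytes can be sketched as follows. `padToMultiple` is a hypothetical helper written for this example; the actual signature and behavior of addPadForAlignment in arrow-swift may differ.

```swift
import Foundation

// Hypothetical helper, not the arrow-swift implementation: appends zero
// bytes until the length is a multiple of `alignment`.
func padToMultiple(_ data: inout Data, alignment: Int = 8) {
    let remainder = data.count % alignment
    if remainder != 0 {
        data.append(Data(count: alignment - remainder))
    }
}

var metadata = Data([1, 2, 3, 4, 5])   // 5 bytes of stand-in metadata
padToMultiple(&metadata)
print(metadata.count)                  // 8

var aligned = Data(count: 16)          // already a multiple of 8
padToMultiple(&aligned)
print(aligned.count)                   // 16
```

With the metadata length rounded up this way, whatever follows it in the stream starts at an offset that is a multiple of 8 relative to the start of the metadata.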

4. Fix `appendAny` bypass in nested builders (`ArrowArrayBuilder.swift`)

The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly, bypassing the overridden append methods in StructArrayBuilder and ListArrayBuilder that distribute values to child builders. This means list<struct<...>> columns produce struct arrays with empty children, since the struct's field builders never receive data.

Overrode appendAny in both StructArrayBuilder and ListArrayBuilder to delegate to self.append, ensuring child builder distribution runs correctly for nested types.
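
The bypass can be shown with a reduced class hierarchy. Everything in this sketch (`LeafBuilder`, `RowBuilder`, `StructRowBuilder`, `FixedStructRowBuilder`) is a hypothetical stand-in, assuming only the behavior described above: a base appendAny that writes directly to storage, and a subclass append that distributes values to child builders.

```swift
// Hypothetical stand-ins for the builder hierarchy; not the arrow-swift classes.
class LeafBuilder {
    private(set) var values: [String?] = []
    func append(_ value: String?) { values.append(value) }
}

class RowBuilder {
    private(set) var length = 0
    func append(_ row: [String?]?) { length += 1 }
    // Type-erased entry point: writes directly and never calls append(_:),
    // so a subclass override of append(_:) is skipped.
    func appendAny(_ value: Any?) { length += 1 }
}

class StructRowBuilder: RowBuilder {
    let children: [LeafBuilder]
    init(fieldCount: Int) {
        children = (0..<fieldCount).map { _ in LeafBuilder() }
        super.init()
    }
    // Distributes each field of the row to the matching child builder.
    // Assumes a non-nil row has exactly one entry per child.
    override func append(_ row: [String?]?) {
        super.append(row)
        for (index, child) in children.enumerated() {
            child.append(row.flatMap { $0[index] })
        }
    }
}

// The fix: route appendAny through the overridden append(_:).
final class FixedStructRowBuilder: StructRowBuilder {
    override func appendAny(_ value: Any?) { append(value as? [String?]) }
}

let row: [String?] = ["a", "b"]

let broken = StructRowBuilder(fieldCount: 2)
broken.appendAny(row)
print(broken.length, broken.children[0].values.count)   // 1 0

let fixed = FixedStructRowBuilder(fieldCount: 2)
fixed.appendAny(row)
print(fixed.length, fixed.children[0].values.count)     // 1 1
```

In the unfixed builder the parent reports one row while its children hold none, which corresponds to the empty-children symptom described above.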

5. Make `ArrowField` initializer public (`ArrowSchema.swift`)

ArrowField is a public class with all public properties, but its initializer was internal. This prevents external consumers from constructing ArrowTypeStruct instances with explicit fields, which is needed for list<struct<...>> column schemas.

Made ArrowField.init(_:type:isNullable:) public.

Test plan

- Verified all existing tests pass (swift test; the only failures are pre-existing ones caused by missing fixture files).
- Round-trip tested with a list<float32> schema: encoded via ArrowWriter.writeStreaming, decoded via ArrowReader.readStreaming, and verified that the schema fields and row count survive.
- Round-trip tested with a list<struct<string, float32>> schema: encoded via ArrowWriter.writeStreaming, decoded via ArrowReader.readStreaming, and verified that the schema fields and row count survive.
- End-to-end validated list<float32> and list<struct<string, float32>> streams against a pyarrow-based server (Arrow IPC streaming reader); both were accepted without errors.
- Tested alongside the timestamp nesting fix from #148 ("fix: move timezone string creation before startTimestamp to avoid nesting assertion"), which is an independent change.

ArrowWriterHelper.toFBTypeEnum and toFBType now handle .list, and ArrowWriter.writeField/writeFieldNodes/writeBufferInfo/writeRecordBatchData recurse into ArrowTypeList children the same way they already do for ArrowTypeStruct. This unblocks writing schemas that use list<T> columns (e.g. the cloud stream server 'raw' schema with list<float32>). Also moves the nested-type recursion in writeBufferInfo and writeRecordBatchData outside the inner buffer loop so it runs once per column rather than once per buffer. Made-with: Cursor

writeFieldNodes iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child field nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. pyarrow rejects such streams with "Array length did not match record batch length". Move child recursion (for both struct and list types) before the parent fbb.create call so children are prepended first and end up after their parent in the resulting vector. Made-with: Cursor

The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. writeFile already does this via addPadForAlignment, but writeStreaming did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream. Made-with: Cursor

The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly, bypassing the overridden append methods in StructArrayBuilder and ListArrayBuilder that distribute values to child builders. This causes list<struct<...>> columns to produce struct arrays with empty children, since the struct's field builders never receive data. Override appendAny in both classes to delegate to self.append, ensuring child builder distribution runs for nested types like list<struct<...>>. Made-with: Cursor

ArrowField is a public class but its initializer was internal, preventing external consumers from constructing ArrowTypeStruct instances with explicit fields. This is needed for list<struct<...>> column schemas. Made-with: Cursor

jding-xyz added 5 commits March 19, 2026 12:29

fix: make ArrowField initializer public

cbb1006

jding-xyz wants to merge 5 commits into apache:main from jding-xyz:feat/list-type-writer-support