
feat: add list type support to ArrowWriter #149

Open
jding-xyz wants to merge 5 commits into apache:main from jding-xyz:feat/list-type-writer-support

Conversation


@jding-xyz commented Mar 19, 2026

Summary

ArrowWriter currently cannot serialize schemas containing list<T> columns. This PR adds full list type support to the writer, including nested list<struct<...>> schemas.

1. List type writer support (ArrowWriterHelper.swift, ArrowWriter.swift)

  • ArrowWriterHelper.swift: Added .list case to toFBTypeEnum (returns .list FlatBuffer type enum) and toFBType (creates empty org_apache_arrow_flatbuf_List table).
  • ArrowWriter.swift: Added ArrowTypeList handling in writeField, writeFieldNodes, writeBufferInfo, and writeRecordBatchData — mirrors the existing ArrowTypeStruct pattern but uses ArrowTypeList.elementField and NestedArray.values for the single child array.
  • Also moved the nested-type recursion in writeBufferInfo and writeRecordBatchData outside the inner buffer loop, so it runs once per column rather than once per buffer (a latent correctness issue for types with multiple buffers).
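The "mirrors the existing struct pattern" recursion can be sketched with toy types (`ArrowType`, `collectFields`, and the `"item"` child name here are illustrative stand-ins, not the library's actual API):

```swift
// Hypothetical shape of the nested types: struct carries many child
// fields, list carries exactly one element field (ArrowTypeList.elementField).
indirect enum ArrowType {
    case float32
    case utf8
    case strct(fields: [(String, ArrowType)])
    case list(element: ArrowType)
}

// Collect field names depth-first, mirroring how writeField recurses
// into children for both struct and list types: struct visits each of
// its fields, list visits its single element child.
func collectFields(_ name: String, _ type: ArrowType, into out: inout [String]) {
    out.append(name)
    switch type {
    case .strct(let fields):
        for (childName, childType) in fields {
            collectFields(childName, childType, into: &out)
        }
    case .list(let element):
        // A list has one child field, conventionally named "item".
        collectFields("item", element, into: &out)
    default:
        break
    }
}
```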

2. Field-node ordering fix for nested types (ArrowWriter.swift)

writeFieldNodes iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. External readers (pyarrow) reject such streams with "Array length did not match record batch length".

Moved child recursion (for both struct and list types) before the parent fbb.create call so children are prepended first and end up after their parent in the resulting vector.
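The interaction between reverse iteration and prepend semantics can be modeled with a toy prepend-only vector (all names below are illustrative, not FlatBuffers' real API):

```swift
// Toy model of FlatBuffers prepend semantics: elements pushed with
// prepend() end up in reverse push order in the finished vector.
struct PrependVector<T> {
    private var items: [T] = []
    mutating func prepend(_ item: T) { items.insert(item, at: 0) }
    var finished: [T] { items }
}

// Hypothetical nested field shape: a field node with child nodes.
struct FieldNode {
    let name: String
    let children: [FieldNode]
}

// The fixed ordering: iterate fields in reverse, recurse into children
// BEFORE prepending the parent. Because later prepends land earlier in
// the finished vector, the parent then precedes its children, giving
// the depth-first pre-order Arrow IPC requires.
func writeFieldNodes(_ fields: [FieldNode], into vec: inout PrependVector<String>) {
    for field in fields.reversed() {
        writeFieldNodes(field.children, into: &vec)  // children first...
        vec.prepend(field.name)                      // ...so the parent lands before them
    }
}
```

For fields `[a(list with child item), b]` this yields `["a", "item", "b"]`; the pre-fix order (parent prepended before recursing) would yield `["item", "a", "b"]`, which readers reject.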

3. Schema message padding in streaming format (ArrowWriter.swift)

The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. writeFile already does this via addPadForAlignment, but writeStreaming did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream.

Added addPadForAlignment(&schemaData) in writeStreaming to match writeFile behavior.
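The padding itself is simple; a minimal sketch (the signature here is an assumption, not necessarily the library's exact one):

```swift
import Foundation

// Pad a metadata buffer with zero bytes up to the next multiple of 8,
// as the Arrow IPC format requires for message metadata. Buffers that
// are already aligned are left untouched.
func addPadForAlignment(_ data: inout Data, alignment: Int = 8) {
    let remainder = data.count % alignment
    if remainder != 0 {
        data.append(Data(repeating: 0, count: alignment - remainder))
    }
}
```

With this applied to the schema message, the record batch body that follows always starts on an 8-byte boundary.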

4. Fix appendAny bypass in nested builders (ArrowArrayBuilder.swift)

The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly, bypassing the overridden append methods in StructArrayBuilder and ListArrayBuilder that distribute values to child builders. This means list<struct<...>> columns produce struct arrays with empty children, since the struct's field builders never receive data.

Overrode appendAny in both StructArrayBuilder and ListArrayBuilder to delegate to self.append, ensuring child builder distribution runs correctly for nested types.
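The dispatch bug and its fix can be reproduced with toy classes (the names below are illustrative stand-ins for ArrowArrayBuilder and its nested subclasses):

```swift
// Base builder: before the fix, appendAny wrote straight to its own
// buffer, so subclass append overrides never ran.
class BaseBuilder {
    var buffer: [Any?] = []
    func append(_ value: Any?) { buffer.append(value) }
    func appendAny(_ value: Any?) {
        // Pre-fix behavior:  buffer.append(value)   // bypasses overrides
        // Fixed behavior: delegate to self.append so dynamic dispatch
        // reaches the subclass override.
        append(value)
    }
}

// Nested builder: its append override distributes values to child
// builders (modeled here by childValues) before storing its own entry.
class ChildDistributingBuilder: BaseBuilder {
    var childValues: [Any?] = []
    override func append(_ value: Any?) {
        childValues.append(value)  // child builders receive the value
        super.append(value)
    }
}
```

With the pre-fix body, `appendAny` on the subclass would leave `childValues` empty — exactly the empty-children struct arrays described above.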

5. Make ArrowField initializer public (ArrowSchema.swift)

ArrowField is a public class with all public properties, but its initializer was internal. This prevents external consumers from constructing ArrowTypeStruct instances with explicit fields, which is needed for list<struct<...>> column schemas.

Made ArrowField.init(_:type:isNullable:) public.
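What this enables for external consumers, sketched with a hypothetical mirror of the field type (the `Field` class and string-typed `type` parameter below are illustrative only):

```swift
// A public class is only constructible from outside its module when
// its initializer is explicitly public; memberwise/internal inits are
// invisible to consumers. With init public, callers can build the
// child fields a nested list<struct<...>> column schema needs.
public class Field {
    public let name: String
    public let type: String
    public let isNullable: Bool
    public init(_ name: String, type: String, isNullable: Bool) {
        self.name = name
        self.type = type
        self.isNullable = isNullable
    }
}

// e.g. the child fields for a list<struct<string, float32>> column:
let structFields = [
    Field("label", type: "utf8", isNullable: false),
    Field("score", type: "float32", isNullable: false),
]
```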

Test plan

  • Verified all existing tests pass (swift test — only pre-existing fixture-file-missing failures).
  • Round-trip tested with list<float32> schema: encode via ArrowWriter.writeStreaming, decode via ArrowReader.readStreaming, verified schema fields and row count survive.
  • Round-trip tested with list<struct<string, float32>> schema: encode via ArrowWriter.writeStreaming, decode via ArrowReader.readStreaming, verified schema fields and row count survive.
  • End-to-end validated list<float32> and list<struct<string, float32>> streams against a pyarrow-based server (Arrow IPC streaming reader) — both accepted without errors.
  • Tested alongside the independent timestamp nesting fix from #148 (fix: move timezone string creation before startTimestamp to avoid nesting assertion).
