feat: add list type support to ArrowWriter #149
Open
jding-xyz wants to merge 5 commits into apache:main from
ArrowWriterHelper.toFBTypeEnum and toFBType now handle .list, and ArrowWriter.writeField/writeFieldNodes/writeBufferInfo/writeRecordBatchData recurse into ArrowTypeList children the same way they already do for ArrowTypeStruct. This unblocks writing schemas that use list<T> columns (e.g. the cloud stream server 'raw' schema with list<float32>). Also moves the nested-type recursion in writeBufferInfo and writeRecordBatchData outside the inner buffer loop so it runs once per column rather than once per buffer. Made-with: Cursor
writeFieldNodes iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child field nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. pyarrow rejects such streams with "Array length did not match record batch length". Move child recursion (for both struct and list types) before the parent fbb.create call so children are prepended first and end up after their parent in the resulting vector. Made-with: Cursor
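The ordering argument in this commit message can be sketched without the Arrow library. The sketch below is a standalone simulation, not the PR's code: `Node`, `PrependBuilder`, `writeChildrenFirst`, and `writeParentFirst` are hypothetical names, and the builder only imitates the "each new element lands in front of the previous ones" behaviour described above.

```swift
// Hypothetical stand-in for a column: a name plus nested child columns.
struct Node {
    let name: String
    var children: [Node] = []
}

// Imitates a vector that is filled back to front: every new element is
// placed in front of everything written so far.
struct PrependBuilder {
    var items: [String] = []
    mutating func prepend(_ item: String) { items.insert(item, at: 0) }
}

// Ordering after the fix: recurse into children first, then prepend the parent.
func writeChildrenFirst(_ nodes: [Node], into builder: inout PrependBuilder) {
    for node in nodes.reversed() {
        writeChildrenFirst(node.children, into: &builder)
        builder.prepend(node.name)
    }
}

// Ordering before the fix: prepend the parent, then recurse into children.
func writeParentFirst(_ nodes: [Node], into builder: inout PrependBuilder) {
    for node in nodes.reversed() {
        builder.prepend(node.name)
        writeParentFirst(node.children, into: &builder)
    }
}

// Example schema: a, list<item: struct<x, y>>, b
let schema = [
    Node(name: "a"),
    Node(name: "list", children: [
        Node(name: "item", children: [Node(name: "x"), Node(name: "y")])
    ]),
    Node(name: "b"),
]

var fixed = PrependBuilder()
writeChildrenFirst(schema, into: &fixed)
print(fixed.items)  // ["a", "list", "item", "x", "y", "b"] (depth-first pre-order)

var old = PrependBuilder()
writeParentFirst(schema, into: &old)
print(old.items)    // ["a", "x", "y", "item", "list", "b"] (children ahead of parents)
```

In the simulation, only the children-first variant yields a vector in which every parent precedes its children, which is the order the commit message says readers expect.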
The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. writeFile already does this via addPadForAlignment, but writeStreaming did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream. Made-with: Cursor
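As an illustration of the padding rule described in this commit, here is a minimal sketch using Foundation's `Data`. The helper name `padToMultiple` is hypothetical and is not the library's `addPadForAlignment`; it only shows the arithmetic of rounding a buffer up to the next multiple of 8 bytes with zero bytes.

```swift
import Foundation

// Hypothetical helper: appends zero bytes until `data.count` is a multiple
// of `alignment`. Leaves already-aligned buffers untouched.
func padToMultiple(_ data: inout Data, alignment: Int = 8) {
    let remainder = data.count % alignment
    if remainder != 0 {
        data.append(Data(count: alignment - remainder))  // Data(count:) is zero-filled
    }
}

var metadata = Data([1, 2, 3, 4, 5])  // 5 bytes of stand-in metadata
padToMultiple(&metadata)
print(metadata.count)                 // 8

var aligned = Data(count: 16)         // already a multiple of 8
padToMultiple(&aligned)
print(aligned.count)                  // 16
```

With the metadata length rounded up this way, whatever follows it in the stream starts at an offset that is again a multiple of 8, provided the metadata itself started on one.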
The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly, bypassing the overridden append methods in StructArrayBuilder and ListArrayBuilder that distribute values to child builders. This causes list<struct<...>> columns to produce struct arrays with empty children, since the struct's field builders never receive data. Override appendAny in both classes to delegate to self.append, ensuring child builder distribution runs for nested types like list<struct<...>>. Made-with: Cursor
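The bypass described in this commit is a general method-dispatch pitfall, and a toy model may make it easier to see. The classes below are hypothetical stand-ins, not the library's builders: `Builder` plays the role of the base class whose `appendAny` writes to its own storage directly, `NestedBuilder` overrides only `append`, and `FixedNestedBuilder` also overrides `appendAny` to delegate to `append`.

```swift
// Hypothetical base builder: appendAny updates its own state directly
// instead of going through append, so subclass overrides of append are skipped.
class Builder {
    var rowCount = 0
    func append(_ value: [Int]?) { rowCount += 1 }
    func appendAny(_ value: Any?) { rowCount += 1 }
}

// Nested builder that forwards values to a child, but only from append.
class NestedBuilder: Builder {
    var child: [Int] = []
    override func append(_ value: [Int]?) {
        super.append(value)
        child.append(contentsOf: value ?? [])
    }
}

// Sketch of the fix: route appendAny through the overridden append.
class FixedNestedBuilder: NestedBuilder {
    override func appendAny(_ value: Any?) {
        append(value as? [Int])
    }
}

let broken = NestedBuilder()
broken.appendAny([1, 2])
print(broken.rowCount, broken.child)   // 1 []

let repaired = FixedNestedBuilder()
repaired.appendAny([1, 2])
print(repaired.rowCount, repaired.child)  // 1 [1, 2]
```

In the toy model the unfixed builder records a row but its child stays empty, which corresponds to the "struct arrays with empty children" symptom in the commit message.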
ArrowField is a public class but its initializer was internal, preventing external consumers from constructing ArrowTypeStruct instances with explicit fields. This is needed for list<struct<...>> column schemas. Made-with: Cursor
Summary
`ArrowWriter` currently cannot serialize schemas containing `list<T>` columns. This PR adds full list type support to the writer, including nested `list<struct<...>>` schemas.

1. List type writer support (`ArrowWriterHelper.swift`, `ArrowWriter.swift`)

- `ArrowWriterHelper.swift`: Added a `.list` case to `toFBTypeEnum` (returns the `.list` FlatBuffer type enum) and to `toFBType` (creates an empty `org_apache_arrow_flatbuf_List` table).
- `ArrowWriter.swift`: Added `ArrowTypeList` handling in `writeField`, `writeFieldNodes`, `writeBufferInfo`, and `writeRecordBatchData`. This mirrors the existing `ArrowTypeStruct` pattern but uses `ArrowTypeList.elementField` and `NestedArray.values` for the single child array.
- Moved the nested-type recursion in `writeBufferInfo` and `writeRecordBatchData` outside the inner buffer loop, so it runs once per column rather than once per buffer (a latent correctness issue for types with multiple buffers).

2. Field-node ordering fix for nested types (`ArrowWriter.swift`)

`writeFieldNodes` iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. External readers (pyarrow) reject such streams with "Array length did not match record batch length".

Moved child recursion (for both struct and list types) before the parent `fbb.create` call so children are prepended first and end up after their parent in the resulting vector.

3. Schema message padding in streaming format (`ArrowWriter.swift`)

The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. `writeFile` already does this via `addPadForAlignment`, but `writeStreaming` did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream.

Added `addPadForAlignment(&schemaData)` in `writeStreaming` to match `writeFile` behavior.

4. Fix `appendAny` bypass in nested builders (`ArrowArrayBuilder.swift`)

The base `ArrowArrayBuilder.appendAny` calls `bufferBuilder.append` directly, bypassing the overridden `append` methods in `StructArrayBuilder` and `ListArrayBuilder` that distribute values to child builders. As a result, `list<struct<...>>` columns produce struct arrays with empty children, since the struct's field builders never receive data.

Overrode `appendAny` in both `StructArrayBuilder` and `ListArrayBuilder` to delegate to `self.append`, ensuring child builder distribution runs correctly for nested types.

5. Make `ArrowField` initializer public (`ArrowSchema.swift`)

`ArrowField` is a public class with all public properties, but its initializer was `internal`. This prevents external consumers from constructing `ArrowTypeStruct` instances with explicit fields, which is needed for `list<struct<...>>` column schemas.

Made `ArrowField.init(_:type:isNullable:)` public.

Test plan

- `swift test`: only pre-existing fixture-file-missing failures.
- `list<float32>` schema: encode via `ArrowWriter.writeStreaming`, decode via `ArrowReader.readStreaming`, verified schema fields and row count survive.
- `list<struct<string, float32>>` schema: encode via `ArrowWriter.writeStreaming`, decode via `ArrowReader.readStreaming`, verified schema fields and row count survive.
- `list<float32>` and `list<struct<string, float32>>` streams tested against a pyarrow-based server (Arrow IPC streaming reader): both accepted without errors.