feat(parquet/pqarrow): support writing LARGE_LIST types #838

Open
lidavidm wants to merge 1 commit into apache:main from lidavidm:gh-834
Conversation

Member

@lidavidm lidavidm commented Jun 3, 2026

Rationale for this change

pqarrow currently can't write Arrow large list (LARGE_LIST) columns to a Parquet file.

What changes are included in this PR?

Implement support for large lists in pqarrow.

Are these changes tested?

Yes

Are there any user-facing changes?

No

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>

Closes apache#834.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@lidavidm lidavidm requested a review from zeroshade as a code owner June 3, 2026 07:51
@lidavidm lidavidm marked this pull request as draft June 3, 2026 07:51
@lidavidm lidavidm marked this pull request as ready for review June 3, 2026 22:32
@zeroshade zeroshade requested a review from Copilot June 5, 2026 18:07

Copilot AI left a comment

Pull request overview

This PR adds end-to-end support for Arrow LARGE_LIST types in the parquet/pqarrow integration so large-list arrays can be written to (and read back from) Parquet, including schema handling and regression tests for #834.

Changes:

  • Extend schema/type handling to preserve LARGE_LIST during (de)serialization and nested type reconstruction.
  • Add write-path traversal support for array.LargeList and read-path support by expanding list offsets to int64.
  • Add regression/round-trip tests covering nullable large lists, empty lists, and stored schema behavior.
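
The read-path change described in the list above (expanding list offsets to int64) can be illustrated with a minimal sketch. This is not the code from the PR: `widenOffsets` is a hypothetical helper name, and the sketch works on plain Go slices rather than Arrow buffers, purely to show what "expanding" the offsets means.

```go
package main

import "fmt"

// widenOffsets copies int32 list offsets into a newly allocated int64 slice.
// Hypothetical helper for illustration only; the PR operates on Arrow
// buffers, not on plain slices.
func widenOffsets(offsets []int32) []int64 {
	out := make([]int64, len(offsets))
	for i, o := range offsets {
		out[i] = int64(o)
	}
	return out
}

func main() {
	// Offsets describing three lists: [a, b], [], [c, d, e].
	offsets := []int32{0, 2, 2, 5}
	fmt.Println(widenOffsets(offsets)) // [0 2 2 5]
}
```

Note that this approach allocates a second buffer and copies every offset, which is the cost the review comment further down asks about.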

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| parquet/pqarrow/schema.go | Teach nested-type reconstruction to rebuild LargeList when the original schema used it. |
| parquet/pqarrow/path_builder.go | Add visitor support for array.LargeList by using int64 offsets in path building. |
| parquet/pqarrow/path_builder_test.go | Add regression test validating def/rep levels for nullable large-list scenarios. |
| parquet/pqarrow/file_writer.go | Normalize LargeList element field names to element when storing Arrow schema metadata. |
| parquet/pqarrow/file_reader.go | Ensure LARGE_LIST fields are routed through list reader construction. |
| parquet/pqarrow/column_readers.go | When reading LARGE_LIST, convert computed int32 offsets buffer into an int64 offsets buffer. |
| parquet/pqarrow/encode_arrow_test.go | Add round-trip + store-schema regression tests for large lists, plus minor whitespace cleanup. |

Comment on lines +2027 to +2032
cnk := arrow.NewChunked(field.Type, []arrow.Array{arr})
defer arr.Release()

tbl := array.NewTable(arrow.NewSchema([]arrow.Field{field}, nil), []arrow.Column{*arrow.NewColumn(field, cnk)}, -1)
defer cnk.Release()
defer tbl.Release()

buffers[0] = validityBuffer
}

if lr.field.Type.ID() == arrow.LARGE_LIST {
Member

Could we modify DefRepLevelsToListInfo to be generic and take []int64 directly to avoid having to first allocate and create the []int32 and then allocate again and copy everything over?

3 participants