Getting a 'pyarrow.lib.ArrowTypeError' error following a schema merge on delta table with nested columns #43893

@liamphmurphy

Describe the bug, including details regarding any error messages, version, and platform.

After a schema merge on a delta table that involves nested columns, PyArrow fails to load the data and raises the following error:

pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c: int64> output fields: struct<c: int64, d: int64>

I have confirmed this does not happen with a schema merge that DOES NOT involve any nested columns.

I believe this is a PyArrow-specific problem, since Spark reads the same table without error. My naive guess is that PyArrow assumes a new column is always added to the "end" of a schema, which it cannot rely on here.

Below is an example of how this can be reproduced:

import pyarrow as pa
import polars as pl
from deltalake import write_deltalake

df = pa.table({
    "a": [1, 2, 3],
    "b": [{"c": 1}, {"c": 2}, {"c": 3}]
})

schema = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.struct([
        pa.field("c", pa.int64())
    ]))
])

local_path = "./tables/merge_delta_table"

# Write the table to delta lake
write_deltalake(local_path, data=df, engine="rust", schema=schema, mode="append")

# Create a new table with a different schema: a new column 'd' is added before 'c'.
df2 = pa.table({
    "a": [4, 5, 6],
    "b": [{"d": 2, "c": 1}, {"c": 2}, {"c": 3}]
})

schema2 = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.struct([
        pa.field("d", pa.int64()),
        pa.field("c", pa.int64())
    ]))
])
# Write the new table to the same delta lake
write_deltalake(local_path, data=df2, schema=schema2, engine="rust", mode="append", schema_mode="merge")

# Now read the delta lake using polars
df = pl.read_delta(local_path)
print(df)

Component(s)

Parquet, Python
