Describe the bug, including details regarding any error messages, version, and platform.
After a schema merge operation on a Delta table that involves nested columns, PyArrow fails to load the data with the following error:
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c: int64> output fields: struct<c: int64, d: int64>
I have confirmed that this does not happen when the schema merge does not involve any nested columns.
I believe this is a PyArrow-specific problem, since Spark reads the same table without error. My very naive assumption is that PyArrow cannot rely on a new column always being added to the "end" of a schema.
Below is an example of how this can be reproduced:
import pyarrow as pa
import polars as pl
from deltalake import write_deltalake

# Initial table: struct column 'b' has a single field 'c'.
df = pa.table({
    "a": [1, 2, 3],
    "b": [{"c": 1}, {"c": 2}, {"c": 3}],
})
schema = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.struct([
        pa.field("c", pa.int64()),
    ])),
])
local_path = "./tables/merge_delta_table"

# Write the table to Delta Lake.
write_deltalake(local_path, data=df, engine="rust", schema=schema, mode="append")

# Create a second table with a different schema: struct 'b' gains a new field 'd' before 'c'.
df2 = pa.table({
    "a": [4, 5, 6],
    "b": [{"d": 2, "c": 1}, {"c": 2}, {"c": 3}],
})
schema2 = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.struct([
        pa.field("d", pa.int64()),
        pa.field("c", pa.int64()),
    ])),
])

# Append the second table to the same Delta table, merging the schemas.
write_deltalake(local_path, data=df2, schema=schema2, engine="rust", mode="append", schema_mode="merge")

# Now read the Delta table using polars; this raises the ArrowTypeError.
df = pl.read_delta(local_path)
print(df)
Component(s)
Parquet, Python