
AnalysisException after using mergeSchema option to add a new column containing only null #944

Closed
NicolasGuary opened this issue Feb 18, 2022 · 5 comments

@NicolasGuary

NicolasGuary commented Feb 18, 2022

Tested this on DBR 9.1 LTS and DBR 10.3.

Hello, I am currently facing an issue that, for me, makes the mergeSchema option unusable.
My goal is to append new columns to an existing table, but sometimes a new column arrives containing only null values and only later contains non-null values. That's what I've tried to reproduce, but I get this error when trying to merge a new record with the non-null value:

AnalysisException: The schema of your Delta table has changed in an incompatible way since your DataFrame or
DeltaTable object was created. Please redefine your DataFrame or DeltaTable object.
Changes:
Latest schema has additional field(s): X

Here's how to reproduce the bug:

  1. Create a base table:
import spark.implicits._

val path = "dbfs:/tmp/merge_new_column_with_null_value_test"
val df = Seq((1, 1, 1)).toDF("a", "b", "c")

display(df)

df.write
  .partitionBy("c")
  .format("delta")
  .mode("overwrite")
  .save(path)
  2. Merge with a record containing a new column X that holds only a null value:
import io.delta.tables._
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", true)

val newEvents = Seq((5, null)).toDF("a", "X")

DeltaTable
        .forPath(path)
        .as("delta_table")
        .merge(newEvents.as("event"), "delta_table.a = event.a")
        .whenMatched
        .updateAll
        .whenNotMatched
        .insertAll
        .execute

Note that at this point, if you display(spark.read.format("delta").load(path)), column X won't even exist on this table.

  3. Merge again, but this time column X contains a non-null value:
import io.delta.tables._
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", true)

val newEvents = Seq((5, "NON_NULL_VALUE")).toDF("a", "X")

DeltaTable
        .forPath(path)
        .as("delta_table")
        .merge(newEvents.as("event"), "delta_table.a = event.a")
        .whenMatched
        .updateAll
        .whenNotMatched
        .insertAll
        .execute

After running step 3, you should get the error above.

Thank you for your time and consideration, have a great day!

@allisonport-db allisonport-db added bug Something isn't working acknowledged This issue has been read and acknowledged by Delta admins labels Feb 19, 2022
@allisonport-db
Collaborator

Thanks for reporting this!

@allisonport-db
Collaborator

Hey @NicolasGuary, I wanted to provide an update. The issue here is that adding a column containing only NULL values creates a void column in the Delta log. Spark drops void columns when opening Delta tables, which causes the schema mismatch error.

You can avoid this error by explicitly specifying a type for null-only columns.
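
For the repro above, a minimal sketch of that workaround: build the source for step (2) with an explicit type for the null-only column, so Spark infers a concrete type instead of NullType (void). The choice of StringType here is an assumption; use whatever type the column will eventually hold.

import spark.implicits._
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Give the null-only column an explicit type up front.
val newEvents = Seq(5).toDF("a")
  .withColumn("X", lit(null).cast(StringType))

newEvents.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- X: string (nullable = true)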

Commit 989078f now throws an error for any MERGE command that adds a void column (so you should see an error after step (2) above). Supporting void columns is a major change and I'm not sure when/if it will be added.

@NicolasGuary
Author

Thank you for the explanation @allisonport-db!
If Spark cannot handle void columns, wouldn't the desired behavior for Delta be to ignore such a column? (i.e., if a column contains only NULL values, Delta ignores it and doesn't add it to the Delta log.)
That could be an easier fix for Delta than supporting void columns.

Let me know how that sounds to you and whether it makes sense for Delta to implement such a rule!
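
In the meantime, a hedged sketch of doing this on the caller side, assuming it is acceptable to simply never send null-only columns: drop NullType (void) columns from the source DataFrame before merging. Note that dropVoidColumns is a hypothetical helper, not a Delta API.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.NullType

// Hypothetical helper: drop every column whose inferred type is NullType
// (i.e. it contains only nulls) so it never reaches the Delta log.
def dropVoidColumns(df: DataFrame): DataFrame = {
  val voidCols = df.schema.fields.collect {
    case f if f.dataType == NullType => f.name
  }
  df.drop(voidCols: _*)
}

val safeEvents = dropVoidColumns(newEvents) // newEvents from step (2)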

@allisonport-db
Collaborator

Ignoring the column could cause later operations to fail, since the user expects that column to now exist in the table's schema, and there are a lot of corner cases. I think throwing an error is, for now, the solution that is clearest to the user.

Closing this issue since this specific bug should no longer be possible, and void column support is not in good shape: there are a lot of edge cases to deal with.

@ramankr44

I'm facing this error. Please tell me how to tackle it:

Error in SQL statement: AnalysisException: The schema of your Delta table has changed in an incompatible way since your DataFrame or
DeltaTable object was created. Please redefine your DataFrame or DeltaTable object.
Changes:
Latest schema is missing field(s): modified_timestamp, created_timestamp
Latest metadata for field customer_key is different from existing schema:
Latest: {"delta.identity.start":1,"delta.identity.step":1,"delta.identity.highWaterMark":423,"delta.identity.allowExplicitInsert":false}
Existing: {}
Latest metadata for field product_key is different from existing schema:
Latest: {"delta.identity.start":1,"delta.identity.step":1,"delta.identity.highWaterMark":423,"delta.identity.allowExplicitInsert":false}
Existing: {}
Latest metadata for field promotion_key is different from existing schema:
Latest: {"delta.identity.start":1,"delta.identity.step":1,"delta.identity.highWaterMark":423,"delta.identity.allowExplicitInsert":false}
Existing: {}

I introduced these two columns, modified_timestamp and created_timestamp, after creating the DataFrame.
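
For what it's worth, the error message points at its own fix: redefine the DataFrame/DeltaTable objects after the schema-changing operation instead of reusing handles created before it. A minimal sketch, with a hypothetical path:

import io.delta.tables.DeltaTable

val tablePath = "dbfs:/tmp/my_table" // hypothetical path

// Handles created before adding modified_timestamp / created_timestamp
// keep the old schema and fail with the error above. Recreate them after
// the schema change so they pick up the latest schema:
val deltaTable = DeltaTable.forPath(spark, tablePath)
val df = spark.read.format("delta").load(tablePath)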
