
AnalysisException after using mergeSchema option to add a new column containing only null #944

Closed
NicolasGuary opened this issue Feb 18, 2022 · 5 comments

@NicolasGuary

NicolasGuary commented Feb 18, 2022

Tested this on DBR 9.1 LTS and DBR 10.3.

Hello, I am currently facing an issue that, for me, makes the mergeSchema option unusable.
My goal is to append new columns to an existing table, but sometimes a new column arrives containing only null values and only later contains non-null values. That's what I've tried to reproduce, but I get this error when trying to merge a new record with the non-null value:

AnalysisException: The schema of your Delta table has changed in an incompatible way since your DataFrame or
DeltaTable object was created. Please redefine your DataFrame or DeltaTable object.
Changes:
Latest schema has additional field(s): X

Here's how to reproduce the bug:

  1. Create a base table:
import spark.implicits._

val path = "dbfs:/tmp/merge_new_column_with_null_value_test"
val df = Seq((1, 1, 1)).toDF("a", "b", "c")

display(df)

df.write
  .partitionBy("c")
  .format("delta")
  .mode("overwrite")
  .save(path)
  2. Merge with a record containing a new column X that holds only a null value:
import io.delta.tables._
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", true)

val newEvents = Seq((5, null)).toDF("a", "X")

DeltaTable
        .forPath(path)
        .as("delta_table")
        .merge(newEvents.as("event"), "delta_table.a = event.a")
        .whenMatched
        .updateAll
        .whenNotMatched
        .insertAll
        .execute

Note that at this point, if you display(spark.read.format("delta").load(path)), column X won't even exist on this table.

  3. Merge again, but this time column X contains a non-null value:
import io.delta.tables._
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", true)

val newEvents = Seq((5, "NON_NULL_VALUE")).toDF("a", "X")

DeltaTable
        .forPath(path)
        .as("delta_table")
        .merge(newEvents.as("event"), "delta_table.a = event.a")
        .whenMatched
        .updateAll
        .whenNotMatched
        .insertAll
        .execute

After running step 3, you should get the error above.

Thank you for your time and consideration, have a great day!

@allisonport-db allisonport-db added bug Something isn't working acknowledged This issue has been read and acknowledged by Delta admins labels Feb 19, 2022
@allisonport-db
Collaborator

Thanks for reporting this!

@allisonport-db
Collaborator

Hey @NicolasGuary, I wanted to provide an update. The issue here is that adding a column containing only NULL values creates a void column in the Delta log. Spark drops void columns when opening Delta tables, which causes the schema mismatch error.

You can avoid this error by explicitly specifying a type for null-only columns.
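
For the repro above, a minimal sketch of that workaround: build the source for step (2) with an explicit type for the null-only column, so Spark infers a concrete type instead of NullType (void). The choice of StringType here is an assumption; use whatever type the column will eventually hold.

import spark.implicits._
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Give the null-only column an explicit type up front.
val newEvents = Seq(5).toDF("a")
  .withColumn("X", lit(null).cast(StringType))

newEvents.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- X: string (nullable = true)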

Commit 989078f now throws an error for any MERGE command that adds a void column (so you should see an error after step (2) above). Supporting void columns is a major change and I'm not sure when/if it will be added.

@NicolasGuary
Author

Thank you for the explanation @allisonport-db!
If Spark cannot handle void columns, wouldn't the desired behavior for Delta be to ignore such a column? (i.e., if a column contains only NULL values, Delta ignores it and doesn't add it to the Delta log.)
That could be an easier fix for Delta than supporting void columns.

Let me know how that sounds to you and whether it makes sense for Delta to implement such a rule!
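
In the meantime, a hedged sketch of doing this on the caller side, assuming it is acceptable to simply never send null-only columns: drop NullType (void) columns from the source DataFrame before merging. Note that dropVoidColumns is a hypothetical helper, not a Delta API.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.NullType

// Hypothetical helper: drop every column whose inferred type is NullType
// (i.e. it contains only nulls) so it never reaches the Delta log.
def dropVoidColumns(df: DataFrame): DataFrame = {
  val voidCols = df.schema.fields.collect {
    case f if f.dataType == NullType => f.name
  }
  df.drop(voidCols: _*)
}

val safeEvents = dropVoidColumns(newEvents) // newEvents from step (2)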

@allisonport-db
Collaborator

Ignoring the column could cause later operations to fail, since the user expects that column to now exist in the table's schema, and there are a lot of corner cases. I think throwing an error is, for now, the solution that is clearest to the user.

Closing this issue since this specific bug should no longer be possible, and void column support is not in good shape: there are a lot of edge cases to deal with.

@ramankr44

I'm facing this error. Please tell me how to tackle it:

Error in SQL statement: AnalysisException: The schema of your Delta table has changed in an incompatible way since your DataFrame or
DeltaTable object was created. Please redefine your DataFrame or DeltaTable object.
Changes:
Latest schema is missing field(s): modified_timestamp, created_timestamp
Latest metadata for field customer_key is different from existing schema:
Latest: {"delta.identity.start":1,"delta.identity.step":1,"delta.identity.highWaterMark":423,"delta.identity.allowExplicitInsert":false}
Existing: {}
Latest metadata for field product_key is different from existing schema:
Latest: {"delta.identity.start":1,"delta.identity.step":1,"delta.identity.highWaterMark":423,"delta.identity.allowExplicitInsert":false}
Existing: {}
Latest metadata for field promotion_key is different from existing schema:
Latest: {"delta.identity.start":1,"delta.identity.step":1,"delta.identity.highWaterMark":423,"delta.identity.allowExplicitInsert":false}
Existing: {}

I introduced these two columns, modified_timestamp and created_timestamp, after creating the DataFrame.
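
For what it's worth, the error message points at its own fix: redefine the DataFrame/DeltaTable objects after the schema-changing operation instead of reusing handles created before it. A minimal sketch, with a hypothetical path:

import io.delta.tables.DeltaTable

val tablePath = "dbfs:/tmp/my_table" // hypothetical path

// Handles created before adding modified_timestamp / created_timestamp
// keep the old schema and fail with the error above. Recreate them after
// the schema change so they pick up the latest schema:
val deltaTable = DeltaTable.forPath(spark, tablePath)
val df = spark.read.format("delta").load(tablePath)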
