Support schema evolution / schema overwrite in DeltaLake MERGE #170
Any update on this issue? |
Not yet. It is something we want to do eventually, but we're not sure we're going to get to it this quarter. It is on our roadmap though. |
@gerardwolf this is great, thank you! It will be super helpful if you paste the code directly in comments in markdown format rather than screenshots; it would be much easier for others to copy and use the code. |
I was a little lazy, source code zipped and attached. |
Will this be supported in the next version? We have an issue with this at the moment. |
We are hoping to add this for 0.6.0. |
I am trying to use this workaround of yours @gerardwolf, and I am wondering what the FILTER=reduce does in the SQL query; could you please explain it a bit for me? :) |
@Lytiker The "reduce" is a variable I defined earlier in the notebook which can contain a WHERE clause. I use the same code to load all my source objects, so this allows me to populate the "reduce" variable with a clause relevant to a specific object's filter criteria. It allows me to MERGE INTO the target delta table using the same filter condition I have on the source query, basically so I compare apples with apples in the source and sink objects. Hope that makes sense. |
Ah, thank you @gerardwolf. So, do I understand you correctly that this is where you would handle the case of incorporating rows for ids 1, 2 and 4 into DF1 from the example at the top of this thread? |
Pretty much just a way for me to 'reduce' the dataset returned from the sink side for use in the dataframe comparison, pruning data I don't need to compare to speed up the process. |
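A minimal pyspark sketch of this pruning pattern (not gerardwolf's actual code; the table path, the `reduce` predicate, and column names are hypothetical):

```python
from delta.tables import DeltaTable

# Hypothetical "reduce" predicate, reused on both sides: the source query is
# filtered with it, and it is folded into the merge condition so the target
# scan is pruned to the same slice of data.
reduce = "event_date >= '2020-01-01'"

source_df = spark.read.table("staging_events").where(reduce)

target = DeltaTable.forPath(spark, "/delta/events")
(target.alias("t")
    .merge(source_df.alias("s"), "s.id = t.id AND t." + reduce)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```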
To get a clear idea about this issue, I want to share with you the different cases related to this feature and their expected results. Given target:
source:
merge command with:
expected result:
2- update clause only
expected result:
3- insert and update clauses:
expected result:
3.2- new columns in update actions only
expected result:
3.3- new columns in insert and update actions:
expected result:
3.3.2- same subset of columns in insert and update actions
expected result:
3.3.3- subset of columns in insert action, updateAll
expected result:
3.3.4- insertAll, subset of columns in update action
expected result:
3.3.5- subset of columns in insert action, subset of columns in update action (different)
expected result:
|
@JassAbidi These are good scenarios. At first glance, they seem to make sense. However, it is a little complex to correctly figure out and implement the cases where different subsets of the columns are explicitly referred to in different clauses. In 0.6.0, we implemented a simpler solution where we added schema evolution only for the updateAll and insertAll actions. We just released 0.6.0 a few minutes back: https://github.com/delta-io/delta/releases/tag/v0.6.0. See the docs linked in the release notes for more information on schema evolution. I am going to close this ticket to mark the initial implementation of schema evolution as done. There are obviously improvements possible on top of this; please open more specific tickets for them. |
How do I specify the option? Is there an example? I see the following error with pyspark and Delta 0.6.0:

```
AttributeError: 'DeltaMergeBuilder' object has no attribute 'option'
```

```python
d = DeltaTable.forPath(spark, targetFolder).alias("base")
```
|
Never mind, found the documentation: https://docs.delta.io/0.6.0/delta-update.html#automatic-schema-evolution. I was able to make it work with whenMatchedUpdateAll and whenNotMatchedInsertAll after setting:

```python
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```

This may work in some cases, but in certain cases we update the value of an existing record using whenMatchedUpdate. Is there any plan to support whenMatchedUpdate and whenNotMatchedInsert in the future? |
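Putting that together, a minimal sketch of the path that works in 0.6.0 (`targetFolder` as in the snippet above; `updates_df` is a hypothetical source DataFrame carrying new columns):

```python
from delta.tables import DeltaTable

# Enable automatic schema evolution for MERGE (Delta 0.6.0+).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

base = DeltaTable.forPath(spark, targetFolder)

# With autoMerge enabled, columns present in updates_df but missing from the
# target are added to the target schema, but only via updateAll/insertAll.
(base.alias("base")
    .merge(updates_df.alias("u"), "u.id = base.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```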
@kkr78 could you please post your |
@XBeg9 Sorry, I forgot to reply back. Not sure if it helps now. The mapping logic is in a separate method:

```python
def prepareMap(self, cdc):
```
|
This is my solution:

```scala
import scala.collection.mutable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}
import org.apache.spark.sql.types.StructType

// Build a target -> source expression map from the nested "object" struct in
// the Kafka payload: t.<field> is populated from s.object.<field>.
val objectDf = kafkaDf.select(col("value.after").alias("object"))
val index = objectDf.schema.fieldIndex("object")
val propSchema = objectDf.schema(index).dataType.asInstanceOf[StructType]
val columns = mutable.HashMap[String, String]()
propSchema.fields.foreach(field => {
  columns += ("t." + field.name -> "s.object.".concat(field.name))
})
```

and then you can do this:

```scala
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long): Unit = {
  sinkTable.as("t")
    .merge(microBatchOutputDF.sort(desc("eventTime")).dropDuplicates("key").as("s"), "s.key = t.id")
    .whenMatched("s.op == 'd'").delete()        // CDC delete events remove the row
    .whenMatched().updateExpr(columns.toMap)    // updateExpr takes an immutable Map
    .whenNotMatched().insertExpr(columns.toMap)
    .execute()
}
```
|
Is this feature removed with version …? I carried out one of the tests, but when I check the schema … |
Is this supported in SQL MERGE? |
@samkutty94 Yes, it is. All APIs have the same functionality and semantics. |
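For the SQL form, a hedged sketch run through pyspark (SQL MERGE on Delta tables needs Delta 0.7.0+ on Spark 3.x; table names are hypothetical). `UPDATE SET *` and `INSERT *` are the SQL counterparts of updateAll and insertAll:

```python
# The same conf drives schema evolution for SQL MERGE.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

spark.sql("""
    MERGE INTO events t
    USING updates s
    ON s.id = t.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```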
@tdas For SCD Type 2 I want to be able to 'retire' an existing / matching row and add another one with appropriate flags / dates, but I can't do that if I'm forced to use updateAll & insertAll. It seems I'm going to have to choose which feature I can have in my ETL process using Delta Lake, or is there an alternative approach which isn't documented? Any pointers? Thanks |
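One common workaround for the SCD Type 2 case (not from this thread; a hedged sketch assuming a target DeltaTable `target` with hypothetical columns id, value, effective_date, end_date, is_current, and an `updates` DataFrame with id, value, effective_date): stage every update twice, once with the real key so it matches and retires the current row, and once with a NULL key so it never matches and inserts the new version.

```python
# Staged input: each update appears twice. Rows with merge_key = id can match
# an existing current row (and retire it); rows with merge_key = NULL never
# match, so they always fall through to the INSERT clause.
staged = (
    updates.selectExpr("id AS merge_key", "id", "value", "effective_date")
    .unionByName(
        updates.selectExpr("CAST(NULL AS INT) AS merge_key",
                           "id", "value", "effective_date"))
)

(target.alias("t")
    .merge(staged.alias("s"), "t.id = s.merge_key AND t.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": "false",          # retire the old version
        "end_date": "s.effective_date",
    })
    .whenNotMatchedInsert(values={
        "id": "s.id",
        "value": "s.value",
        "effective_date": "s.effective_date",
        "end_date": "CAST(NULL AS DATE)",
        "is_current": "true",
    })
    .execute())
```

The design point is that explicit whenMatchedUpdate/whenNotMatchedInsert clauses carry the SCD2 logic, so this pattern does not rely on the updateAll/insertAll-only schema evolution discussed above.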
@tdas What is the depth to which schema evolution works while merging? Automatic schema evolution does not work while merging in the following case.
This looks like it fails when the nesting depth is more than 2 and the incoming df has missing columns. Using Delta Lake version 0.8.0. |
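A hedged sketch of a reproduction for this report (paths and data are hypothetical; whether it fails depends on the Delta version, the report being against 0.8.0): a struct nested three levels deep, with the incoming DataFrame missing one depth-3 leaf column.

```python
from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Target has a leaf column a.b.d three levels down.
spark.createDataFrame(
    [(1, ((10, 20),))],
    "id INT, a STRUCT<b: STRUCT<c: INT, d: INT>>",
).write.format("delta").save("/tmp/nested_target")

# Incoming data is missing the depth-3 column a.b.d.
source_df = spark.createDataFrame(
    [(1, ((11,),))],
    "id INT, a STRUCT<b: STRUCT<c: INT>>",
)

target = DeltaTable.forPath(spark, "/tmp/nested_target")
(target.alias("t")
    .merge(source_df.alias("s"), "s.id = t.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```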
What is the error? |
@tdas Thanks for the quick response. Note that if a similar schema change (a column missing in the incoming data) happens at the upper levels of the df (I think up to depth 2), it works fine. |
Can you give the full stack trace? |
@tdas Sure, below is the full stack trace of the error:
|
Is there any more of the stack trace? The Java part of the stack trace, after the last line you have shown? |
@tdas No, that is all it gives. |
I am facing an issue while merging the schema with an existing Delta table; please find the scenario below -
Second iteration, while appending another JSON file to this Delta table -
While appending with the option mergeSchema = true, it throws an error something like below - |
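For the plain append path, the documented option looks roughly like this (paths are hypothetical). Note that `mergeSchema` only handles added columns; if a field's type changed between the two JSON files (e.g. a scalar becoming a struct), the append can still fail, which may be what is happening here:

```python
# First load creates the table; the second append evolves the schema.
df1 = spark.read.json("/data/day1.json")
df1.write.format("delta").save("/delta/events")

df2 = spark.read.json("/data/day2.json")
(df2.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # adds new columns; does not resolve type conflicts
    .save("/delta/events"))
```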
@divasgupta Could you please share the code and the data you are merging? I might be able to help. |
@Rohit25negi I am facing the same problem, how did you solve it? |
I am facing a similar issue to @divasgupta's. The error I get is |
Does this work now? I can't seem to get it working; it just drops the new columns. |
As far as I can tell, schema evolution / schema overwrite in DeltaLake MERGE is not currently supported. The below pyspark code illustrates my issue (Spark 2.4.4, Scala 2.11, DeltaLake 0.3.0):
This quietly outputs:
It would be great if, by option (.option("overwriteSchema", "true") or .option("mergeSchema", "true")), schema evolution were supported, to get instead: … and a new schema:
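The original snippet and its output did not survive in this thread; below is a hedged reconstruction of the kind of example described (column names and ids are guesses, and it uses the Python merge API from Delta 0.4.0+ rather than whatever the original used):

```python
from delta.tables import DeltaTable

# Target table with schema (id, value).
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
    .write.format("delta").save("/tmp/target")

# The source carries an extra column new_col.
df1 = spark.createDataFrame(
    [(2, "b2", "x"), (3, "c", "y")], ["id", "value", "new_col"])

target = DeltaTable.forPath(spark, "/tmp/target")
(target.alias("t")
    .merge(df1.alias("s"), "s.id = t.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Before schema evolution existed, the merge succeeded but new_col was
# quietly dropped: the target schema is still (id, value).
spark.read.format("delta").load("/tmp/target").printSchema()
```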