Feature Request: Support mergeSchema option when using Spark MERGE INTO #5556

kunal-nandwana · 2022-08-17T04:52:58Z

Feature Request / Improvement

Hi Team,
I am using Iceberg in my project and found that one important feature is missing: schema merging ("merge schema"), which is readily available in Apache Hudi and Delta Lake. If possible, this feature should be added to Iceberg. My earlier ticket, linked below for reference, explains the problem I am facing.
#5548

@rdblue any thoughts on this?

Query engine

Spark

kbendick · 2022-08-17T18:18:52Z

For reference, the issue here is that the user wants to be able to use the mergeSchema option when writing via MERGE INTO.

I'm not sure of a way to support that presently. If somebody does know, please comment 🙂

kunal-nandwana · 2022-08-18T05:51:05Z

It would be a great help if someone could help us achieve this functionality. We are struggling to do this manually.

kbendick · 2022-08-18T05:56:00Z

Just an FYI but I would update the title to be Feature Request: Support mergeSchema option when using Spark MERGE INTO. This is more explicit and gets to the heart of what it is you need.

The hints might not be something we can add without changing Spark, but the core of the idea is that you need mergeSchema to work with MERGE INTO (which is currently SQL only).

Removing the implementation constraint from the title might attract more eyeballs / bring more ideas to the table (as ultimately you don’t care about anything other than needing mergeInto to work).

kbendick · 2022-08-18T05:59:39Z

Also, what about a table property? Does the table experience writes where you explicitly do not want mergeSchema?

Generally, I think mergeSchema is safer as a per-query option and is somewhat unsafe as a table-level configuration. But Spark makes it somewhat hard to support that, as there’s no DataFrame support for MERGE INTO currently.

For the long run, I’m going to bring up adding a merge into API to the dataframe / dataset API in Spark. But that could take a while. We might be able to provide implicit classes so that it’s do-able using the dataframe API in just Iceberg, but in the long run that should be moved to Spark (though that doesn’t solve your immediate problem, I know).

github-actions · 2023-02-15T00:13:03Z

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions · 2023-03-01T00:14:32Z

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

jhchee · 2023-05-18T14:29:20Z

Not stale

jaredtbates · 2023-08-25T20:09:14Z

This is definitely something we'd be interested in. We are doing something similar with Glue and would like to be able to support schema evolution with a MERGE INTO UPSERT. Currently we have to manually modify the Iceberg schema every time our source schema changes.

This is a very similar architecture to what we're doing - https://aws.amazon.com/blogs/big-data/automate-replication-of-relational-sources-into-a-transactional-data-lake-with-apache-iceberg-and-aws-glue/
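For readers following along, the manual workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an Iceberg feature: the helper name `add_column_statements`, the table name `db.customers`, and the column names are all hypothetical, and schemas are represented as plain name-to-type dictionaries. It covers only new top-level columns, not type changes or nested fields.

```python
from typing import Dict, List


def add_column_statements(
    table: str,
    target_schema: Dict[str, str],
    source_schema: Dict[str, str],
) -> List[str]:
    """Build ALTER TABLE statements for columns found only in the source.

    Hypothetical helper: schemas are {column_name: sql_type} dicts, and
    column names are compared case-insensitively.
    """
    existing = {name.lower() for name in target_schema}
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {dtype}"
        for name, dtype in source_schema.items()
        if name.lower() not in existing
    ]


# Example with made-up schemas: the source gained an "email" column.
target = {"id": "bigint", "name": "string"}
source = {"id": "bigint", "name": "string", "email": "string"}
statements = add_column_statements("db.customers", target, source)

for stmt in statements:
    # In a real job each statement would presumably be run with spark.sql(stmt)
    # before issuing the MERGE INTO; here it is only printed.
    print(stmt)
```

Running the generated DDL before the MERGE INTO keeps the schema change explicit for that one job, which is close in spirit to the per-query behavior discussed earlier in the thread.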

amogh-jahagirdar · 2023-09-08T22:37:13Z

Reopening due to interest in this

andreacfm · 2023-09-12T18:52:59Z

@kbendick is this still on your radar? If not, could you give me some direction on where I could start looking?

FabricioZGalvani · 2023-10-24T18:41:32Z

Any updates on this feature? I also have a strong interest in Iceberg providing this solution.

RussellSpitzer · 2023-10-24T18:43:44Z

Anyone who would like to work on the issue is welcome to. As far as I know, no one is currently working on it.

andreacfm · 2023-11-09T08:52:27Z

Delta Lake has the ability to set spark.databricks.delta.schema.autoMerge.enabled. I find this approach interesting as it can be enabled only when required. Once set, automatic schema evolution applies to every write operation.

abhishekkrbaliase · 2024-02-28T20:20:07Z

Is it still being worked on? It would be nice if we could have either:

1. Schema evolution (schema merge) for merge SQL statements, or
2. A DataFrame API for merge queries

kbendick mentioned this issue Aug 17, 2022

Schema Evolution #5548

Closed

kunal-nandwana changed the title ~~Support Hints for Dataframe Writer Options Like 'mergeSchema'~~ Feature Request: Support mergeSchema option when using Spark MERGE INTO Aug 18, 2022

github-actions bot added the stale label Feb 15, 2023

github-actions bot closed this as not planned Mar 1, 2023

amogh-jahagirdar reopened this Sep 8, 2023

nastra added not-stale and removed stale labels Nov 9, 2023

bk-mz mentioned this issue Feb 29, 2024

Spark 3.5.0 MERGE INTO breaks #9827

Open
