Feature Request: Support mergeSchema option when using Spark MERGE INTO #5556

Open
kunal-nandwana opened this issue Aug 17, 2022 · 14 comments

@kunal-nandwana

kunal-nandwana commented Aug 17, 2022

Feature Request / Improvement

Hi Team,
I am using Iceberg in my project and found that one important feature is missing that is readily available in Apache Hudi and Delta Lake: "merge schema". If possible, this feature should be added to Iceberg. I am linking my previous ticket, which explains the problem I am facing. Please see the ticket below for reference.
#5548

@rdblue any thoughts on this?

Query engine

Spark

@kbendick
Contributor

For reference, the issue here is that the user wants to be able to use the mergeSchema option when writing via MERGE INTO.

I'm not sure of a way to support that presently. If somebody does know, please comment 🙂
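To make the request concrete, here is a minimal sketch of the kind of upsert involved. The table, view, and key names are hypothetical, and `merge_into_sql` is an illustrative helper, not an Iceberg or Spark API. The point is that the statement is plain SQL, so there is no obvious place to attach a writer option such as mergeSchema.

```python
def merge_into_sql(target: str, source: str, key: str) -> str:
    """Render a Spark SQL MERGE INTO upsert keyed on a single column."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names; with a live session this would be passed to spark.sql(stmt).
stmt = merge_into_sql("db.events", "updates", "id")
print(stmt)
```

If `updates` carries a column that `db.events` does not have, nothing in this statement can ask for the table schema to be widened first.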

@kunal-nandwana
Author

It would be a great help if someone could help us achieve this functionality. We are struggling to do this manually.

@kbendick
Contributor

Just an FYI, but I would update the title to "Feature Request: Support mergeSchema option when using Spark MERGE INTO". This is more explicit and gets to the heart of what you need.

The hints might not be something we can add without changing Spark, but the core of the idea is that you need mergeSchema to work with MERGE INTO (which is currently SQL only).

Removing the implementation constraint from the title might attract more eyeballs / bring more ideas to the table (as ultimately you don’t care about anything other than needing mergeInto to work).

@kbendick
Contributor

Also, what about a table property? Does the table experience writes where you explicitly do not want mergeSchema?

Generally, I think mergeSchema is safer as a per-query option and is somewhat unsafe as a table-level configuration. But Spark makes that somewhat hard to support, as there’s no DataFrame support for MERGE INTO currently.

For the long run, I’m going to bring up adding a merge into API to the dataframe / dataset API in Spark. But that could take a while. We might be able to provide implicit classes so that it’s do-able using the dataframe API in just Iceberg, but in the long run that should be moved to Spark (though that doesn’t solve your immediate problem, I know).
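For comparison, here is a sketch of the per-write opt-in that already exists for DataFrame appends, as I understand the Iceberg Spark documentation: the table sets `write.spark.accept-any-schema`, and the writer passes `mergeSchema`. The helper names and the table name are made up for illustration, and `df` is assumed to be a PySpark DataFrame.

```python
ACCEPT_ANY_SCHEMA = "write.spark.accept-any-schema"


def accept_any_schema_sql(table: str) -> str:
    """DDL that lets the table accept writes whose schema differs from its own."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('{ACCEPT_ANY_SCHEMA}'='true')"


def append_with_merge_schema(df, table: str) -> None:
    """Append df to table, opting in to schema merge for this write only."""
    df.writeTo(table).option("mergeSchema", "true").append()


# Hypothetical table name; the statement would be run once via spark.sql(...).
print(accept_any_schema_sql("db.events"))
```

The table property only permits mismatched schemas; each append still has to opt in with the write option. MERGE INTO has no equivalent hook, which is the gap this issue is about.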

@kunal-nandwana kunal-nandwana changed the title from "Support Hints for Dataframe Writer Options Like 'mergeSchema'" to "Feature Request: Support mergeSchema option when using Spark MERGE INTO" Aug 18, 2022
@github-actions

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Feb 15, 2023
@github-actions

github-actions bot commented Mar 1, 2023

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Mar 1, 2023
@jhchee

jhchee commented May 18, 2023

Not stale

@jaredtbates

jaredtbates commented Aug 25, 2023

This is definitely something we'd be interested in. We are doing something similar with Glue and would like to be able to support schema evolution with a MERGE INTO upsert. Currently we have to manually modify the Iceberg schema every time our source schema changes.

This is a very similar architecture to what we're doing - https://aws.amazon.com/blogs/big-data/automate-replication-of-relational-sources-into-a-transactional-data-lake-with-apache-iceberg-and-aws-glue/
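The manual step described above can be sketched roughly as follows: compare the source and target columns, then generate the `ALTER TABLE ... ADD COLUMNS` statement to run before the MERGE INTO. This is only an illustration under simplifying assumptions. `add_columns_ddl` and the table and column names are hypothetical, and it handles new top-level columns only (no type promotion, nested fields, or case-insensitive matching).

```python
from typing import Dict, Optional


def add_columns_ddl(
    table: str, target: Dict[str, str], source: Dict[str, str]
) -> Optional[str]:
    """Return an ALTER TABLE statement adding source-only columns, or None."""
    missing = [(name, dtype) for name, dtype in source.items() if name not in target]
    if not missing:
        return None
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in missing)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


# With PySpark, the dicts could come from dict(df.dtypes) for the source
# DataFrame and dict(spark.table(table).dtypes) for the target table.
target_schema = {"id": "bigint", "name": "string"}
source_schema = {"id": "bigint", "name": "string", "email": "string"}
print(add_columns_ddl("db.customers", target_schema, source_schema))
```

Running the generated statement before each merge works, but it is exactly the bookkeeping a mergeSchema-style option would remove.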

@amogh-jahagirdar
Contributor

Reopening due to interest in this

@andreacfm
Contributor

andreacfm commented Sep 12, 2023

@kbendick is this still on your radar? If not, could you give me some direction on where to start looking?

@FabricioZGalvani

Any updates on this feature? I also have a strong interest in Iceberg providing this solution.

@RussellSpitzer
Member

Anyone who would like to work on the issue is welcome to; as far as I know, no one is currently working on it.

@nastra nastra added not-stale and removed stale labels Nov 9, 2023
@andreacfm
Contributor

andreacfm commented Nov 9, 2023

Delta Lake has the ability to set spark.databricks.delta.schema.autoMerge.enabled. I find this approach interesting, as it can be enabled only when required. Once set, automatic schema evolution works for every write operation.
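As a rough sketch of the "only when required" usage: a small context manager can scope a session conf to a single block of writes. `session_conf` is a hypothetical helper, and the key shown is the Delta Lake setting mentioned above. Iceberg does not read it, so this only illustrates the shape such an option could take.

```python
from contextlib import contextmanager

# Delta Lake session setting; not an Iceberg option.
AUTO_MERGE = "spark.databricks.delta.schema.autoMerge.enabled"


@contextmanager
def session_conf(spark, key: str, value: str):
    """Temporarily set a Spark session conf, restoring the previous state on exit."""
    previous = spark.conf.get(key, None)
    spark.conf.set(key, value)
    try:
        yield
    finally:
        if previous is None:
            spark.conf.unset(key)
        else:
            spark.conf.set(key, previous)


# Usage with a live SparkSession (merge_stmt is a MERGE INTO string):
#     with session_conf(spark, AUTO_MERGE, "true"):
#         spark.sql(merge_stmt)
```

Scoping the setting this way keeps the safety of a per-query opt-in even though the switch itself is session-wide.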

@abhishekkrbaliase

abhishekkrbaliase commented Feb 28, 2024

Is it still being worked on? It would be nice if we could have either:

  1. Schema evolution (schema merge) for MERGE SQL statements, or
  2. A DataFrame API for merge queries
