Flink SQL with Iceberg snapshots doesn't react if table has upsert #9948
Comments
Flink streaming read only supports append-only tables at the moment. When you upsert into a table, it generates delete files as well, which the streaming read cannot consume.
Thanks for getting back to me. Looking at the docs, I also see that Spark Structured Streaming carries a similar warning about only supporting appends. Is append+upsert (delete) streaming in Flink and Spark something you don't want to support because there is a different preferred approach, or is it just later on the roadmap? If so, any idea on the timeline?

For us this would unlock plenty of interesting use cases for Iceberg. If we could subscribe to the more complex upsert silver writes (late-arrival joins, aggregations, pivots, partial GDPR deletes), we could actually automate pipelines that maintain more "golden" tables: clean generated and maintained tables for data science teams, better views for Superset users, easier streaming of transformed data from Iceberg to other products, and so on. As it stands, we would perhaps need to use Kafka or something similar as a buffer and do dual writes to get CDC for upserts. Streaming from Iceberg would also sidestep some Flink limitations, such as having to buffer late arrivals in memory, since Flink lacks some of the transformation SQL that Spark has.

We would happily help contribute if possible, given a few pointers. I also saw that Flink implemented FLIP-282 in the 1.17 release, which enables row-level delete/update APIs.
@VidakM: I suggest bringing this up on the mailing list. I would also love to see this feature in Flink. I checked myself a few months ago, and the main blocker is that the planning part for streaming V2 tables is not implemented yet. The interfaces are created, but there is no working implementation for planning. Also, the proposed interfaces have a base class different from the one returned by the V1 planning, which will make the implementation on the execution-engine side harder. Seeing these tasks, I started working on Flink in-job compaction instead, as it has more immediate gains for us, but in the medium term (within a year) I would like to revisit this feature (no commitment though). If you need reviewers, I could help there, so any contribution would be welcome here! Thanks,
Query engine
Flink 1.17.2 with Iceberg 1.4.2 libraries
Question
I have a few Iceberg v2 tables defined and a Flink job that reads them in a streaming fashion before transforming to another Iceberg table.
If the source tables are basic, then subscribing to them works great and the SQL query can continuously run.
But if the tables are defined with 'write.upsert.enabled'='true', then the subscribing Flink SQL reads only once and does not react to new snapshots, even if the SQL definition asks it to monitor at intervals and the streaming strategy is any of the incremental options.

Flink streaming query that normally works:
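A minimal sketch of the kind of streaming read I mean (the table and column names are placeholders, not my actual schema; 'streaming' and 'monitor-interval' are the standard Iceberg Flink read options):

```sql
-- Continuously read new snapshots from an Iceberg table.
-- 'monitor-interval' controls how often new snapshots are discovered.
SELECT *
FROM db.orders /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */;
```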
The streaming join works great if the source Iceberg tables are defined like this:
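Roughly like the following sketch (table and column names are placeholders): a v2 table written append-only, i.e. without 'write.upsert.enabled'. Adding 'write.upsert.enabled'='true' to the WITH clause is what triggers the behaviour described below.

```sql
-- Hypothetical append-only source table: streaming reads work against this.
CREATE TABLE db.orders (
  order_id BIGINT,
  status   STRING,
  ts       TIMESTAMP(3)
) WITH (
  'format-version' = '2'
  -- adding 'write.upsert.enabled' = 'true' here reproduces the problem
);
```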
But the streaming join runs only once and then stops triggering on new snapshots. It does not finish, though; it just stops reacting to the source and produces no new records.
In my Flink job I simply define the connector and run the SQL join/insert. Both the source and target tables are already defined.
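For context, a sketch of that setup (catalog name, warehouse path, and table/column names are all placeholders for illustration, not my real configuration):

```sql
-- Register an Iceberg catalog, then run the streaming join/insert.
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 'file:///tmp/warehouse'
);

INSERT INTO iceberg_catalog.db.orders_enriched
SELECT o.order_id, o.status, c.country
FROM iceberg_catalog.db.orders
    /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */ AS o
JOIN iceberg_catalog.db.customers
    /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */ AS c
  ON o.customer_id = c.customer_id;
```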
I also noticed that if I have a SQL join, it too stops streaming if at least one table has upsert enabled.
Looking at the documentation for both Iceberg and Flink, I don't find any indication that enabling upsert should alter this behaviour, but I do remember reading somewhere that the FLIP-27 source only supports append, not update/delete. Is this the reason I'm seeing this behaviour?