Query engine
Spark
Question
Hi all – hoping someone might be able to help with this.
I have an Iceberg table partitioned by day(timestamp_field), with several other non-partition columns (col_1 to col_5). The table contains multiple terabytes of data. I'm performing a MERGE INTO using Spark (100M+ rows), with a statement like:
```sql
MERGE INTO target
USING source
ON target.timestamp_field = source.timestamp_field AND target.col_1 = source.col_1
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
```
My question is: should Spark + Iceberg be able to optimise this for predicate pushdown on the partition column (timestamp_field)?
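For what it's worth, this is roughly how I've been inspecting the plan. It's only a sketch: the single col_2 assignment stands in for the real SET list, and my expectation of where pushed filters show up is based on my reading of Iceberg's BatchScan output, so it may not match other environments exactly.

```sql
-- Sketch: wrap the MERGE in EXPLAIN to see whether any timestamp_field
-- predicate is pushed down to the scan of `target`.
EXPLAIN FORMATTED
MERGE INTO target t
USING source s
ON t.timestamp_field = s.timestamp_field AND t.col_1 = s.col_1
WHEN MATCHED THEN UPDATE SET col_2 = s.col_2   -- stand-in for the real SET list
WHEN NOT MATCHED THEN INSERT *;
-- In the output I look at the BatchScan node over `target`: if its filter list
-- carries no bound on timestamp_field, the merge appears to scan the full table.
```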
I’m observing what seem to be inconsistencies between the Spark plan, Spark logs, Iceberg logs, and the metrics shown in the Spark UI — particularly around how much of the target data is scanned before the merge.
More specifically:
- Is this kind of MERGE deterministic in terms of how much data is scanned from the target?
- Would adding an explicit timestamp_field BETWEEN clause improve scan efficiency? (A rough sketch of what I mean follows this list.)
- Or is it better to avoid MERGE INTO entirely in these cases, e.g., by querying the target, using a LEFT ANTI JOIN with the source, and overwriting partitions?
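For concreteness on the second bullet, this is the kind of explicit bound I have in mind. It's just a sketch: the one-week window is made up (in practice I'd derive it from the min/max timestamp of the source batch), and the SET and INSERT lists are illustrative.

```sql
MERGE INTO target t
USING source s
ON  t.timestamp_field = s.timestamp_field
AND t.col_1 = s.col_1
-- Explicit bound on the partition source column (values are placeholders):
AND t.timestamp_field BETWEEN TIMESTAMP '2024-06-01 00:00:00'
                          AND TIMESTAMP '2024-06-08 00:00:00'
WHEN MATCHED THEN UPDATE SET
  col_2 = s.col_2,
  col_3 = s.col_3,
  col_4 = s.col_4,
  col_5 = s.col_5
WHEN NOT MATCHED THEN INSERT *
```

If the last bullet ends up being the answer, I assume it would look like a SELECT over the same window that LEFT ANTI JOINs the target to the source, unioned with the source rows, and written back with a dynamic INSERT OVERWRITE (spark.sql.sources.partitionOverwriteMode=dynamic) so that only the touched daily partitions are rewritten. I'd like to understand whether that's actually necessary.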
Appreciate any thoughts or insight on how Spark and Iceberg handle this under the hood.
Thanks!
(Spark 3.5.3, Iceberg 1.7.1)