Query engine
Spark
Question
Hi all – hoping someone might be able to help with this.
I have an Iceberg table partitioned by day(timestamp_field), with several other non-partition columns (col_1 to col_5). The table contains multiple terabytes of data. I'm performing a MERGE INTO using Spark (100M+ rows), with a statement like:
```sql
MERGE INTO target
USING source
ON target.timestamp_field = source.timestamp_field AND target.col_1 = source.col_1
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
```
My question is: should Spark + Iceberg be able to optimise this for predicate pushdown on the partition column (timestamp_field)?
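For what it's worth, this is roughly how I've been inspecting the plan. It's only a sketch: the single col_2 assignment stands in for the real SET list, and my expectation of where pushed filters show up is based on my reading of Iceberg's BatchScan output, so it may not match other environments exactly.

```sql
-- Sketch: wrap the MERGE in EXPLAIN to see whether any timestamp_field
-- predicate is pushed down to the scan of `target`.
EXPLAIN FORMATTED
MERGE INTO target t
USING source s
ON t.timestamp_field = s.timestamp_field AND t.col_1 = s.col_1
WHEN MATCHED THEN UPDATE SET col_2 = s.col_2   -- stand-in for the real SET list
WHEN NOT MATCHED THEN INSERT *;
-- In the output I look at the BatchScan node over `target`: if its filter list
-- carries no bound on timestamp_field, the merge appears to scan the full table.
```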
I’m observing what seem to be inconsistencies between the Spark plan, Spark logs, Iceberg logs, and the metrics shown in the Spark UI — particularly around how much of the target data is scanned before the merge.
More specifically:
- Is this kind of MERGE deterministic in terms of how much data is scanned from the target?
- Would adding an explicit timestamp_field BETWEEN clause improve scan efficiency? (A rough sketch of what I mean follows this list.)
- Or is it better to avoid MERGE INTO entirely in these cases, e.g., by querying the target, using a LEFT ANTI JOIN with the source, and overwriting partitions?
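For concreteness on the second bullet, this is the kind of explicit bound I have in mind. It's just a sketch: the one-week window is made up (in practice I'd derive it from the min/max timestamp of the source batch), and the SET and INSERT lists are illustrative.

```sql
MERGE INTO target t
USING source s
ON  t.timestamp_field = s.timestamp_field
AND t.col_1 = s.col_1
-- Explicit bound on the partition source column (values are placeholders):
AND t.timestamp_field BETWEEN TIMESTAMP '2024-06-01 00:00:00'
                          AND TIMESTAMP '2024-06-08 00:00:00'
WHEN MATCHED THEN UPDATE SET
  col_2 = s.col_2,
  col_3 = s.col_3,
  col_4 = s.col_4,
  col_5 = s.col_5
WHEN NOT MATCHED THEN INSERT *
```

If the last bullet ends up being the answer, I assume it would look like a SELECT over the same window that LEFT ANTI JOINs the target to the source, unioned with the source rows, and written back with a dynamic INSERT OVERWRITE (spark.sql.sources.partitionOverwriteMode=dynamic) so that only the touched daily partitions are rewritten. I'd like to understand whether that's actually necessary.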
Appreciate any thoughts or insight on how Spark and Iceberg handle this under the hood.
Thanks!
(Spark 3.5.3, Iceberg 1.7.1)