
MERGE INTO behaviour/optimisation #13768

@david-at-zenobe

Description

Query engine: Spark

Question

Hi all – hoping someone might be able to help with this.

I have an Iceberg table partitioned by day(timestamp_field), with several other non-partition columns (col_1 to col_5). The table contains multiple terabytes of data. I'm performing a MERGE INTO using Spark (100M+ rows), with a statement like:

MERGE INTO target
USING source
ON target.timestamp_field = source.timestamp_field AND target.col_1 = source.col_1
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
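
For context, the target's layout is roughly as follows (column names/types and the exact Spark DDL transform spelling are illustrative, not the real schema):

CREATE TABLE target (
    timestamp_field TIMESTAMP,
    col_1           STRING,
    col_2           STRING,
    col_3           STRING,
    col_4           STRING,
    col_5           STRING
)
USING iceberg
PARTITIONED BY (days(timestamp_field))   -- hidden partitioning on the day of timestamp_field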

My question is: should Spark + Iceberg be able to optimise this for predicate pushdown on the partition column (timestamp_field)?

I’m observing what seem to be inconsistencies between the Spark plan, Spark logs, Iceberg logs, and the metrics shown in the Spark UI — particularly around how much of the target data is scanned before the merge.

More specifically:

  • Is this kind of MERGE deterministic in terms of how much data is scanned from the target?
  • Would adding an explicit timestamp_field BETWEEN clause improve scan efficiency? (Sketched below.)
  • Or is it better to avoid MERGE INTO entirely in these cases, e.g., by querying the target, using a LEFT ANTI JOIN with the source, and overwriting partitions? (Also sketched below.)
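
For reference, rough sketches of the last two options. The hard-coded timestamp bounds, the UPDATE SET * / INSERT * clauses, and the partition-overwrite setting are placeholders to illustrate the shape, not what I actually run:

-- Option 2: same MERGE, but with an explicit range on the partition source
-- column, in case the equality join alone isn't enough for partition pruning.
-- The BETWEEN bounds stand for the min/max timestamps in the source batch.
MERGE INTO target
USING source
ON target.timestamp_field = source.timestamp_field
   AND target.col_1 = source.col_1
   AND target.timestamp_field BETWEEN TIMESTAMP '2025-01-01 00:00:00'
                                  AND TIMESTAMP '2025-01-31 23:59:59'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

-- Option 3: MERGE-free alternative that rebuilds only the affected day
-- partitions. Assumes spark.sql.sources.partitionOverwriteMode=dynamic so
-- INSERT OVERWRITE only replaces partitions present in the output, and that
-- source and target share the same column order.
INSERT OVERWRITE target
SELECT * FROM source
UNION ALL
SELECT t.*
FROM target t
LEFT ANTI JOIN source s
  ON t.timestamp_field = s.timestamp_field AND t.col_1 = s.col_1
WHERE t.timestamp_field BETWEEN TIMESTAMP '2025-01-01 00:00:00'
                            AND TIMESTAMP '2025-01-31 23:59:59'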

Appreciate any thoughts or insight on how Spark and Iceberg handle this under the hood.

Thanks!

(Spark 3.5.3, Iceberg 1.7.1)
