Skip to content

[Feature Request] Merge performance improvements #1827

Closed
@johanl-db

Description

@johanl-db

Feature request

Overview

This feature request covers a number of performance improvements for the MERGE INTO command.

Motivation

I have a prototype for improvements to the MERGE command that shows up to 2x faster execution with the biggest improvements on delete only merge and merge touching a lot of files. The results have been gathered using new benchmark tests cases for merge that can be added to the existing benchmarks.

Performance improvement:

Test case  Base duration (s) Test duration (s)   Improvement ratio
delete_only_fileMatchedFraction_0.05_rowMatchedFraction_0.05 26,1 20,5 1,27
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 8,8 15,2 0,58
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5 27,7 17,5 1,58
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 36,3 21,2 1,71
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 14,9 14,8 1,01
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5 17,5 17,3 1,01
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 20,3 20,7 0,98
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.01_rowNotMatchedFraction_0.1 39,5 28,8 1,37
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.0_rowNotMatchedFraction_0.1 19,9 19,3 1,03
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.0 39,1 29,9 1,31
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.01 39,1 31 1,26
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.5_rowNotMatchedFraction_0.001 41,9 32,5 1,29
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.99_rowNotMatchedFraction_0.001 43,3 33,8 1,28
upsert_fileMatchedFraction_0.05_rowMatchedFraction_1.0_rowNotMatchedFraction_0.001 43,8 34,1 1,28
upsert_fileMatchedFraction_0.5_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001 147,9 84,8 1,74
upsert_fileMatchedFraction_1.0_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001 266,9 142,5 1,87

Further details

The following improvements are contributing to better overall performance:

Data skipping for matched only merges using matched condition: In the case where the merge statement only includes MATCHED clauses, we can benefit from data skipping using the target-only MATCHED conditions when scanning for the target for matches. For example, with:

MERGE INTO target
USING source
ON target.key = source.key
WHEN MATCHED AND target.value = 5 AND source.value = 6
WHEN MATCHED AND target.value = 7

we can skip scanning files using predicate: target.value =5 OR target.value = 7

Dedicated path for insert only merges with more than 1 insert clause: The current code path for insert-only only supports 1 NOT MATCHED clause. We can extend it to support any number of NOT MATCHED clause.

More efficient classic merge execution: We can improve overall merge performance by optimizing the way we write modified rows: instead of processing individual partitions, we can build a large single expression to process all rows at once.

Native counter for metrics instead of UDFs: We use UDFs to increment metrics across MERGE/UPDATE/DELETE. We can instead introduce a dedicated native expression that can be codegened.

Project Plan

Task Description Status PR
Merge benchmark Extend the existing benchmarks with test cases for MERGE In Review #1835
Improve Insert-only merge execution Allow handling of multiple NOT MATCHED clause in the insert-only path Done #1852
Rewrite classic merge execution The performance of the regular code path to execute merge can be improved by rewriting the way rows are processed Done #1854
Native metric counters Replace the use of UDFs by native expressions to increment metrics Done #1828
Data skipping for MATCHED only merge When they are only MATCHED clauses, the MATCHED condition can be used to skip data when scanning the target for matches. Done #1851

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions