Description
Feature request
Overview
This feature request covers a number of performance improvements for the MERGE INTO command.
Motivation
I have a prototype for improvements to the MERGE command that shows up to 2x faster execution with the biggest improvements on delete only merge and merge touching a lot of files. The results have been gathered using new benchmark tests cases for merge that can be added to the existing benchmarks.
Performance improvement:
Test case | Base duration (s) | Test duration (s) | Improvement ratio |
---|---|---|---|
delete_only_fileMatchedFraction_0.05_rowMatchedFraction_0.05 | 26,1 | 20,5 | 1,27 |
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 | 8,8 | 15,2 | 0,58 |
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5 | 27,7 | 17,5 | 1,58 |
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 | 36,3 | 21,2 | 1,71 |
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 | 14,9 | 14,8 | 1,01 |
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5 | 17,5 | 17,3 | 1,01 |
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 | 20,3 | 20,7 | 0,98 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.01_rowNotMatchedFraction_0.1 | 39,5 | 28,8 | 1,37 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.0_rowNotMatchedFraction_0.1 | 19,9 | 19,3 | 1,03 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.0 | 39,1 | 29,9 | 1,31 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.01 | 39,1 | 31 | 1,26 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.5_rowNotMatchedFraction_0.001 | 41,9 | 32,5 | 1,29 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.99_rowNotMatchedFraction_0.001 | 43,3 | 33,8 | 1,28 |
upsert_fileMatchedFraction_0.05_rowMatchedFraction_1.0_rowNotMatchedFraction_0.001 | 43,8 | 34,1 | 1,28 |
upsert_fileMatchedFraction_0.5_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001 | 147,9 | 84,8 | 1,74 |
upsert_fileMatchedFraction_1.0_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001 | 266,9 | 142,5 | 1,87 |
Further details
The following improvements are contributing to better overall performance:
Data skipping for matched only merges using matched condition: In the case where the merge statement only includes MATCHED clauses, we can benefit from data skipping using the target-only MATCHED conditions when scanning for the target for matches. For example, with:
MERGE INTO target
USING source
ON target.key = source.key
WHEN MATCHED AND target.value = 5 AND source.value = 6
WHEN MATCHED AND target.value = 7
we can skip scanning files using predicate: target.value =5 OR target.value = 7
Dedicated path for insert only merges with more than 1 insert clause: The current code path for insert-only only supports 1 NOT MATCHED clause. We can extend it to support any number of NOT MATCHED clause.
More efficient classic merge execution: We can improve overall merge performance by optimizing the way we write modified rows: instead of processing individual partitions, we can build a large single expression to process all rows at once.
Native counter for metrics instead of UDFs: We use UDFs to increment metrics across MERGE/UPDATE/DELETE. We can instead introduce a dedicated native expression that can be codegened.
Project Plan
Task | Description | Status | PR |
---|---|---|---|
Merge benchmark | Extend the existing benchmarks with test cases for MERGE | In Review | #1835 |
Improve Insert-only merge execution | Allow handling of multiple NOT MATCHED clause in the insert-only path | Done | #1852 |
Rewrite classic merge execution | The performance of the regular code path to execute merge can be improved by rewriting the way rows are processed | Done | #1854 |
Native metric counters | Replace the use of UDFs by native expressions to increment metrics | Done | #1828 |
Data skipping for MATCHED only merge | When they are only MATCHED clauses, the MATCHED condition can be used to skip data when scanning the target for matches. | Done | #1851 |
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- Yes. I can contribute this feature independently.
- Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- No. I cannot contribute this feature at this time.