Optimize insert-only merge to not rewrite existing files #246

Closed
tdas opened this issue Nov 11, 2019 · 1 comment

@tdas tdas commented Nov 11, 2019

A common pattern is to use merge to deduplicate new data against the data already in the Delta table, in the following way.

MERGE INTO target
USING source
ON condition
WHEN NOT MATCHED THEN INSERT ....

There is no update clause, only an insert clause, so ideally only new files (containing the new non-matching data from the source) should be added to the table, i.e. the operation should be append-only. However, the current implementation of merge does not optimize this case. This leads to the following problems.

  1. Target delta table files containing data that matches the source data are unnecessarily rewritten. This hurts performance.

  2. Since files are deleted and added in the commit, downstream streaming queries cannot treat the table as append-only, which they should be able to do because we are only appending deduplicated data. This makes it hard to build pipelines that dedup data.

  3. Merge's default semantics are to fail when multiple source keys match the same target key, because the update action becomes ambiguous: it is not clear which source row should be used to update the target row. However, in an insert-only merge, this is not ambiguous. Hence, this should be allowed.
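
To make the third point concrete, here is a small pure-Python sketch of why multiple matches are a problem for an update but not for an insert-only merge. It only models the semantics described above and is not Delta Lake code; the helper `matches_per_target_key` and the `id` key column are hypothetical names used for illustration.

```python
from collections import defaultdict


def matches_per_target_key(target_keys, source_rows, key="id"):
    """Group the source rows by the target key they match (hypothetical helper)."""
    wanted = set(target_keys)
    matched = defaultdict(list)
    for row in source_rows:
        if row[key] in wanted:
            matched[row[key]].append(row)
    return dict(matched)


source_rows = [
    {"id": 1, "v": "first"},
    {"id": 1, "v": "second"},  # second source row matching target key 1
    {"id": 9, "v": "new"},     # no match in the target
]
matched = matches_per_target_key([1, 2], source_rows)

# An update clause would have to pick between "first" and "second" for key 1.
ambiguous = {k for k, rows in matched.items() if len(rows) > 1}

# An insert-only merge ignores matched rows entirely, so nothing is ambiguous:
# only the unmatched source rows are written.
to_insert = [row for row in source_rows if row["id"] not in matched]
```

Here key 1 is ambiguous for an update, while the insert-only result contains just the unmatched row with `id` 9.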

The proposed solution is to optimize this case by performing an anti-join of the source data against the target and inserting only the source rows that have no match. This will

  1. do only one pass over the data, making it faster.
  2. not modify any existing file, only append new files, thus allowing downstream streaming queries to read deduped data.
  3. allow the source to have multiple rows with the same key and insert them unambiguously.
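
The anti-join idea above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the actual Delta Lake implementation: rows are modeled as dicts, each "file" as a list of rows, and `insert_only_merge` and the `id` key column are hypothetical names.

```python
from typing import Dict, List

Row = Dict[str, object]
File = List[Row]  # assumption: a "file" is modeled as a list of rows


def insert_only_merge(target_files: List[File], source: List[Row], key: str = "id") -> List[File]:
    """Return the new files to append; existing target files are never rewritten."""
    # Single pass over the target to collect the keys that already exist.
    existing_keys = {row[key] for f in target_files for row in f}
    # Anti-join: keep only the source rows whose key has no match in the target.
    new_rows = [row for row in source if row[key] not in existing_keys]
    # Append-only: at most one new file is produced, and nothing is deleted.
    return [new_rows] if new_rows else []


target = [[{"id": 1, "v": "a"}], [{"id": 2, "v": "b"}]]
source = [
    {"id": 2, "v": "dup1"},  # matches the target, dropped
    {"id": 2, "v": "dup2"},  # second match on the same key, dropped without error
    {"id": 3, "v": "x"},     # new key, inserted
    {"id": 3, "v": "y"},     # same new key again, also inserted
]
added = insert_only_merge(target, source)
```

In this sketch the two source rows with `id` 2 are simply filtered out, both rows with `id` 3 are appended as one new file, and the existing target files are left untouched.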
@tdas tdas added this to the 0.5.0 milestone Nov 11, 2019
@tdas tdas pinned this issue Nov 11, 2019
@rahulsmahadev rahulsmahadev self-assigned this Nov 11, 2019

@mukulmurthy mukulmurthy commented Nov 19, 2019

Resolved in 23c7613

@zsxwing zsxwing unpinned this issue Jan 13, 2020