Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upsert merge strategy #1129

Open
rudolfix opened this issue Mar 22, 2024 · 10 comments · May be fixed by #1466
Open

Add upsert merge strategy #1129

rudolfix opened this issue Mar 22, 2024 · 10 comments · May be fixed by #1466
Assignees

Comments

@rudolfix
Copy link
Collaborator

Background
dlt uses DELETE + INSERT to merge staging dataset into destination. It will also create temporary tables with _dlt_id of rows to be deleted and inserted. This may all not be necessary if MERGE sql is used. On supported database backend, MERGE looks even more standardized that various forms of DELETE that we had to hack.

Tasks

    • look at replace strategies and add merge strategies to config
    • pick a right merge job depending on merge strategy
    • make sure the deduplication of the staging datasets by primary key also takes place (hope MERGE will do it automatically)
    • which destinations to convert: duckdb (testing), bigquery, snowflake. postgres should work if duckdb works
    • tests and docs!
@jorritsandbrink
Copy link
Collaborator

See this draft PR for relevant discussions and code inspiration: #1294

@jorritsandbrink
Copy link
Collaborator

My take, based on previous work in #1294:

Add a new key-based merge strategy called upsert:

  • if key is not present in destination table ➜ insert record
  • if key is present in destination table ➜ update record

Source data can be both full or incremental extracts.

Constraints:

  • needs a primary_key
  • does not support merge_key (we need a one-to-one relationship between records in source and destination)

Furthermore, as with delete-insert:

  • do primary key-based deduplication
  • support dedup_sort and hard_delete column hints

Implementation:

  • do real updates as much as possible, i.e. no delete+insert mechanism
    • use the MERGE command to implement it for SQL destinations that support it
    • use INSERT INTO ... ON CONFLICT DO UPDATE on DuckDB (reference)
    • use the TableMerger API for delta table format on filesystem destination (implement in separate PR)
  • child tables still need deletes, to delete records for elements which have been removed from an updated list
  • hash primary key and store as _dlt_id in root table
    • use this as upsert key

@rudolfix
Copy link
Collaborator Author

@jorritsandbrink my biggest question to the above: what you can do with it that you cannot do with the current merge? can you update from partial records? (on first look we'd need to add something on top to do that)

I had something more modest in mind:

  1. replace INSERT / DELETE with MERGE wherever we can and make it somehow optional - but this is just internal change without impact on the UX
  2. implement existing merge strategy for delta (using their TableMerger API whenever possible, INSERT DELETE otherwise)

@jorritsandbrink
Copy link
Collaborator

@rudolfix

"what you can do with it that you cannot do with the current merge?"

There's no extra functionality indeed. Clear semantics would be the main benefit. People are familiar with key-based "upsert", and know which behavior to expect.

"replace INSERT / DELETE with MERGE wherever we can and make it somehow optional - but this is just internal change without impact on the UX"

An issue I see here: the merge strategy is called delete-insert. It will be confusing if we mix MERGE behavior into it.

Also, what is the benefit? MERGE does not always outperform DELETE + INSERT. There's a lot of content online comparing the two approaches, for different database backends, and it doesn't seem either one of the approaches is clearly more performant.

It would be challenging to elegantly mix both approaches. I think it makes more sense to give the user an option:

  • delete-insert based on DELETE and INSERT SQL statements
    • works with merge_key and doesn't require primary_key
  • upsert based on MERGE statement
    • does not work with merge_key and requires primary_key

It's a leaky abstraction, but it enables the user to use the approach that is most performant for their workload.

What do you think?

@rudolfix
Copy link
Collaborator Author

@jorritsandbrink
yeah OK. I like the upsert idea.

  1. How are you planning to handle child tables? DELETE + INSERT? we may try to merge on root key id + unique column which is deterministic
  2. How long is it going to take?
  3. Do you plan to implement both strategies for delta?

@jorritsandbrink
Copy link
Collaborator

@rudolfix

  1. I would approach it like this:
  • use DELETE + INSERT as base implementation
  • use MERGE if WHEN NOT MATCHED BY SOURCE is supported (e.g. on mssql and databricks)

Note that the DELETE + INSERT approach only deletes updated records, so it differs from the delete-insert strategy, which deletes all records.

  1. We might be able to merge functionality for the new upsert strategy in one week. Perhaps a little longer if the different SQL destinations require much customization. I can use code from Add upsert merge strategy #1294. Support for delta would be a separate PR.

  2. I'd say only upsert and scd2 at first and do delete-insert if/when requested in the community.

@jorritsandbrink jorritsandbrink changed the title convert sql merge jobs into MERGE sql statement Add upsert merge strategy Jun 11, 2024
@jorritsandbrink jorritsandbrink self-assigned this Jun 11, 2024
@rudolfix
Copy link
Collaborator Author

@jorritsandbrink
ad 1. how are you going to deal with deleted child records? ie. you have a list of strings in your main document and the list will get shorter. you need to be able to delete everything with list_idx > previous max list_idx

ad 2. OK! could we implement that for Snowflake first and try to merge it before other destinations? if that's adding too much work.

ad 3. Ok! I thing our initial merge strategy is the most powerful but I'll follow what you say. So we'll make upsert the default strategy for delta?

@jorritsandbrink
Copy link
Collaborator

@rudolfix

  1. I would do it based on root key and unique key like last time: https://github.com/dlt-hub/dlt/pull/1294/files#r1635024156
  2. I will do Snowflake first.
  3. Yes, upsert would be default for delta.

@jorritsandbrink jorritsandbrink linked a pull request Jun 14, 2024 that will close this issue
@jorritsandbrink jorritsandbrink linked a pull request Jun 14, 2024 that will close this issue
@rudolfix
Copy link
Collaborator Author

@jorritsandbrink can we close it?

@jorritsandbrink
Copy link
Collaborator

@rudolfix I think we should close this ticket after #1466 is merged. I will create a new ticket for upsert support for destinations beyond postgres and snowflake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

2 participants