replace write disposition must always use staging dataset. #271

Closed
rudolfix opened this issue Apr 14, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@rudolfix
Collaborator

rudolfix commented Apr 14, 2023

Background
We currently replace data directly in the destination dataset. This looks good from a performance perspective but has several problems:

  • we can't have two jobs for the same table; such jobs would overwrite each other's work
  • tables with child tables can end up in an inconsistent state: some are replaced, some are not
  • child tables that receive no data will not be replaced at all; the old data is left behind (bug)
  • loading to a staging dataset would also let us load many files in parallel

Implementation Note
Below is just one possible solution. Other options:

  • just truncate the replaced tables and then append the data. we already have all the code we need from merge; there is, however, a moment where the tables hold no data
  • introduce copy jobs that handle multiple files

Tasks

    • for the replace write disposition, load to the staging dataset the same way we do for merge
    • generate replace jobs the same way we generate merge jobs (only for parent tables, taking the full table chain into account)
    • replace jobs must cover the full table chain, including tables that have no data at this point, so that they can be truncated as well
    • use truncate for destinations that support it; otherwise fall back to delete (see the sketch below this list)
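
A minimal sketch of what such a replace job could generate, assuming a staging and a destination dataset (names like replace_job_sql, table_chain and supports_truncate are illustrative, not dlt's actual internals):

```python
from typing import List


def replace_job_sql(
    table_chain: List[str],          # parent table first, then all child tables
    staging_dataset: str,
    destination_dataset: str,
    supports_truncate: bool = True,  # fall back to DELETE where TRUNCATE is unsupported
) -> List[str]:
    """Illustrative only: build the statements a replace job could execute."""
    statements: List[str] = []
    for table in table_chain:
        dest = f"{destination_dataset}.{table}"
        # clear old data for every table in the chain, even child tables
        # that received no new rows in this load
        statements.append(
            f"TRUNCATE TABLE {dest};" if supports_truncate else f"DELETE FROM {dest};"
        )
        # copy the freshly loaded rows over from the staging dataset
        statements.append(
            f"INSERT INTO {dest} SELECT * FROM {staging_dataset}.{table};"
        )
    return statements


if __name__ == "__main__":
    for stmt in replace_job_sql(["orders", "orders__items"], "staging", "prod"):
        print(stmt)
```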

Tests

    • job client tests for replace must be updated
    • we have plenty of tests for replace; they must all pass
    • test edge cases, e.g. a replace of a table with child tables where one or more child tables receive no data
@adrianbr
Contributor

Replacing tables at virtually no cost can be done by dropping (or renaming) the old table and renaming the replacement to take its place.
On BigQuery, a rename saves a full table scan, which costs 5 USD/TB, though IMO this is pretty cheap anyway.
On Redshift, a rename saves the time that would otherwise be spent copying the data; I think this is under 10% of the total loading time, given how we load.
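
A rough sketch of that swap as SQL generated from Python (purely illustrative: the exact rename syntax, cross-dataset support, and transactional guarantees differ per destination, so the statement texts here are assumptions, not dlt code):

```python
from typing import List


def rename_swap_sql(dataset: str, table: str, staging_table: str) -> List[str]:
    """Illustrative only: replace `table` by dropping the old copy and renaming
    the freshly loaded staging table into its place."""
    return [
        # drop (or rename aside) the previous generation of the table; ideally both
        # statements run in one transaction where the destination supports it,
        # since there is a brief window in which the table does not exist
        f"DROP TABLE IF EXISTS {dataset}.{table};",
        # promote the staging table: on BigQuery this avoids a full table scan,
        # on Redshift it avoids copying the data
        f"ALTER TABLE {dataset}.{staging_table} RENAME TO {table};",
    ]


if __name__ == "__main__":
    print("\n".join(rename_swap_sql("prod", "orders", "orders__staging")))
```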

@z3z1ma
Collaborator

z3z1ma commented Apr 28, 2023

@adrianbr @rudolfix

This is tangentially related, so feel free to spin it into an issue if you want.
As more SQL necessarily proliferates, have you considered https://github.com/tobymao/sqlglot?
It would simplify handling the variation between databases when adding new ones, since you can work directly with an AST and render it in all supported dialects instantly. This covers your current DDLs, INSERTs, CTAS statements, and so on.

That lib is pretty legit in practice.
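
For illustration, a one-off transpile with sqlglot (dialect names as in sqlglot's docs; the DDL is just an example):

```python
import sqlglot

# write a DDL once and render it for several destination dialects
ddl = "CREATE TABLE orders (id BIGINT, created_at TIMESTAMP, amount DECIMAL(10, 2))"
for dialect in ("duckdb", "postgres", "redshift", "bigquery"):
    print(dialect, "->", sqlglot.transpile(ddl, read="postgres", write=dialect)[0])
```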
