replace write disposition must always use staging dataset. #271

Closed
rudolfix opened this issue Apr 14, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@rudolfix
Collaborator

rudolfix commented Apr 14, 2023

Background
We currently replace data directly in the destination dataset. This looks good from a performance perspective but has several problems:

  • we can't have two jobs for the same table; such jobs would overwrite each other's work
  • tables with child tables can end up in an inconsistent state: some are replaced, some are not
  • child tables that receive no data will not be replaced at all; the old data is left behind (bug)
  • loading to a staging dataset would also let us load many files in parallel

Implementation Note
Below is just one possible solution. Other options:

  • just truncate the replaced tables and then append the data. we already have all the code we need from merge; there is, however, a moment where the tables hold no data
  • introduce copy jobs that handle multiple files

Tasks

    • for the replace write disposition, load to the staging dataset the same way we do for merge
    • generate replace jobs the same way we generate merge jobs (only for parent tables, taking the full table chain into account)
    • replace jobs must cover the full table chain, including tables that have no data at this point, so that they can be truncated as well
    • use truncate for destinations that support it; otherwise fall back to delete (see the sketch below this list)
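
A minimal sketch of what such a replace job could generate, assuming a staging and a destination dataset (names like replace_job_sql, table_chain and supports_truncate are illustrative, not dlt's actual internals):

```python
from typing import List


def replace_job_sql(
    table_chain: List[str],          # parent table first, then all child tables
    staging_dataset: str,
    destination_dataset: str,
    supports_truncate: bool = True,  # fall back to DELETE where TRUNCATE is unsupported
) -> List[str]:
    """Illustrative only: build the statements a replace job could execute."""
    statements: List[str] = []
    for table in table_chain:
        dest = f"{destination_dataset}.{table}"
        # clear old data for every table in the chain, even child tables
        # that received no new rows in this load
        statements.append(
            f"TRUNCATE TABLE {dest};" if supports_truncate else f"DELETE FROM {dest};"
        )
        # copy the freshly loaded rows over from the staging dataset
        statements.append(
            f"INSERT INTO {dest} SELECT * FROM {staging_dataset}.{table};"
        )
    return statements


if __name__ == "__main__":
    for stmt in replace_job_sql(["orders", "orders__items"], "staging", "prod"):
        print(stmt)
```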

Tests

    • job client tests for replace must be updated
    • we have plenty of tests for replace; they must all pass
    • test edge cases, e.g. a replace of a table with child tables where one or more child tables receive no data
@adrianbr
Contributor

Replacing tables at virtually no cost can be done by dropping (or renaming) the old table and renaming the replacement to take its place.
On BigQuery, a rename saves a full table scan, which costs 5 USD/TB, though IMO this is pretty cheap anyway.
On Redshift, a rename saves the time that would otherwise be spent copying the data; I think this is under 10% of the total loading time, given how we load.
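
A rough sketch of that swap as SQL generated from Python (purely illustrative: the exact rename syntax, cross-dataset support, and transactional guarantees differ per destination, so the statement texts here are assumptions, not dlt code):

```python
from typing import List


def rename_swap_sql(dataset: str, table: str, staging_table: str) -> List[str]:
    """Illustrative only: replace `table` by dropping the old copy and renaming
    the freshly loaded staging table into its place."""
    return [
        # drop (or rename aside) the previous generation of the table; ideally both
        # statements run in one transaction where the destination supports it,
        # since there is a brief window in which the table does not exist
        f"DROP TABLE IF EXISTS {dataset}.{table};",
        # promote the staging table: on BigQuery this avoids a full table scan,
        # on Redshift it avoids copying the data
        f"ALTER TABLE {dataset}.{staging_table} RENAME TO {table};",
    ]


if __name__ == "__main__":
    print("\n".join(rename_swap_sql("prod", "orders", "orders__staging")))
```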

@z3z1ma
Collaborator

z3z1ma commented Apr 28, 2023

@adrianbr @rudolfix

This is tangentially related, so feel free to spin it into an issue if you want.
As more SQL necessarily proliferates, have you considered https://github.com/tobymao/sqlglot?
It would simplify handling the variation between databases when adding new ones, since you can work directly with an AST and render it in all supported dialects instantly. This covers your current DDLs, INSERTs, CTAS statements, and so on.

That lib is pretty legit in practice.
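
For illustration, a one-off transpile with sqlglot (dialect names as in sqlglot's docs; the DDL is just an example):

```python
import sqlglot

# write a DDL once and render it for several destination dialects
ddl = "CREATE TABLE orders (id BIGINT, created_at TIMESTAMP, amount DECIMAL(10, 2))"
for dialect in ("duckdb", "postgres", "redshift", "bigquery"):
    print(dialect, "->", sqlglot.transpile(ddl, read="postgres", write=dialect)[0])
```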
