Background
We replace data directly in the destination dataset. This looks good from the performance perspective but has several problems
we can't have two jobs for the same table; those jobs will replace each other's work
we end up with tables and child tables in an inconsistent state: some are replaced, some are not
child tables without new data will not be replaced at all; the old data is just left behind (bug)
as a bonus, loading to a staging dataset would let us load many files in parallel
Implementation Note
Below is just one of the possible solutions. Other solutions:
just truncate the tables that are replaced and then append the data. We have all the code we need for merge. There is, however, a moment where we have no data at all
introduce copy jobs that handle multiple files
Tasks
for the replace write disposition, load to the staging dataset like we do with merge
generate replace jobs the same way we generate merge jobs (only for parent tables, taking the full table chain into account)
replace jobs must take the full table chains into account - even the tables that have no data at this point, so we can truncate them
TRUNCATE should be used for destinations that support it; otherwise we use DELETE (a sketch of the generated SQL follows below)
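A minimal sketch of what a generated replace job could look like under the staging-dataset approach above. The helper name generate_replace_sql, its parameters, and the dataset/table naming are illustrative, not existing dlt code:

```python
# Hypothetical sketch: build the SQL for one replace job covering a full table
# chain that was loaded to a staging dataset. Names are illustrative only.
from typing import List, Set


def generate_replace_sql(
    table_chain: List[str],
    tables_with_data: Set[str],
    staging_dataset: str,
    dataset: str,
    supports_truncate: bool,
) -> List[str]:
    statements: List[str] = []
    for table in table_chain:
        # clear every table in the chain, even those that received no data,
        # so stale child rows are never left behind
        if supports_truncate:
            statements.append(f"TRUNCATE TABLE {dataset}.{table};")
        else:
            statements.append(f"DELETE FROM {dataset}.{table};")
        # copy from staging only for tables that actually got files in this load
        if table in tables_with_data:
            statements.append(
                f"INSERT INTO {dataset}.{table} "
                f"SELECT * FROM {staging_dataset}.{table};"
            )
    return statements
```

Generating one job per parent table chain would also address the first problem above: a single job owns the whole chain, so two jobs can no longer overwrite each other's work on the same table.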
Tests
job client tests for replace must be updated
we have plenty of tests for replace; they must pass
test edge cases, e.g. a replace of a table with child tables where one or more child tables do not receive any data
Replacing tables without cost can be done by dropping/renaming the old table and renaming the replacement to take its place.
On BigQuery, rename saves a full table scan, which costs $5/TB - but IMO this is pretty cheap.
On Redshift, rename saves the time that would otherwise be spent copying the data. I think this cost is under 10% of total loading time given how we load.
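A hedged sketch of the rename-based swap described above, assuming the replacement table was built under a temporary name in the same dataset (ALTER TABLE ... RENAME TO only works within a dataset/schema on BigQuery and Redshift). The execute callable and the __old/__new suffixes are illustrative:

```python
# Illustrative only: swap in a freshly built table via renames instead of copying data.
def swap_replace(execute, dataset: str, table: str) -> None:
    # move the current table aside, promote the replacement, then drop the old copy
    execute(f"ALTER TABLE {dataset}.{table} RENAME TO {table}__old;")
    execute(f"ALTER TABLE {dataset}.{table}__new RENAME TO {table};")
    execute(f"DROP TABLE {dataset}.{table}__old;")
```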
This is tangentially related, so feel free to spin it into an issue if you want.
As more SQL necessarily proliferates, have you considered https://github.com/tobymao/sqlglot?
This would simplify the work of adding new databases, as you can work directly with an AST and emit it in all supported dialects instantly. This includes your current DDLs, INSERTs, CTAS statements, and so on.
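For illustration, this is roughly how sqlglot could be used to author a statement once and emit it per destination dialect; the dialect names and the input statement are just examples:

```python
import sqlglot

# one canonical statement, transpiled into each destination's dialect
ddl = "CREATE TABLE events AS SELECT id, ts, payload FROM staging.events"

for dialect in ("bigquery", "redshift", "postgres"):
    print(dialect, "->", sqlglot.transpile(ddl, write=dialect)[0])
```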