replace with staging tables #488

Merged: 14 commits merged into devel from d#/replace_with_staging on Jul 21, 2023
Conversation

sh-rp (Collaborator) commented Jul 13, 2023

No description provided.

netlify bot commented Jul 13, 2023

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 80d5de4
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/64b9b4bdab8ace0008d70741

sh-rp force-pushed the d#/replace_with_staging branch 3 times, most recently from e939cb3 to 8e64643, on July 14, 2023 at 12:43
rudolfix (Collaborator) commented:

@sh-rp huh! IMO you are better off cherry-picking the replace changes than trying to merge this. Unfortunately, that was expected :/

sh-rp force-pushed the d#/replace_with_staging branch 2 times, most recently from 1537f54 to 37e439e, on July 16, 2023 at 15:13
sh-rp marked this pull request as ready for review on July 16, 2023 at 15:32
rudolfix (Collaborator) left a review:

This looks really good.

  1. Please take a look at test_write_dispositions and parametrize it for the replace strategies (a sketch follows this list). The test should of course pass.
  2. Add a new document to the docs in general-usage: full-loading.md, where we'll describe the mechanism, the strategies, etc.
  3. If you implement those optimizations in warehouses, please mention them in the per-warehouse docs.
  4. incremental-loading.md introduces full loading: refer to the document from (2) there.
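
For illustration only, a parametrized version of that test could look roughly like this; apart from test_write_dispositions and the strategy names discussed in this PR, the details are assumptions, not the actual test code:

import os

import pytest

# the three replace strategies introduced in this PR
REPLACE_STRATEGIES = ["truncate-and-insert", "insert-from-staging", "optimized"]


@pytest.mark.parametrize("replace_strategy", REPLACE_STRATEGIES)
def test_write_dispositions(replace_strategy: str) -> None:
    # dlt picks the strategy up from configuration, e.g. via an env var (as done later in this PR)
    os.environ["DESTINATION__REPLACE_STRATEGY"] = replace_strategy
    # ... run a pipeline with write_disposition="replace" and assert that only
    # the rows from the latest load remain in the destination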


@configspec(init=True)
class DestinationClientConfiguration(BaseConfiguration):
    destination_name: str = None  # which destination to load data to
    credentials: Optional[CredentialsConfiguration]
    replace_strategy: TLoaderReplaceStrategy = "classic"
rudolfix:

I'd say DestinationClientDwhConfiguration is a better place.

@@ -172,6 +176,20 @@ def start_file_load(self, table: TTableSchema, file_path: str, load_id: str) ->
def restore_file_load(self, file_path: str) -> LoadJob:
    pass

@abstractmethod
def get_stage_dispositions(self) -> List[TWriteDisposition]:
rudolfix:

This goes in a good direction, but it is not necessary to add those items to JobClientBase. IMO it is time for a little refactor:

  • move all stage-related methods (including create_merge_job) to SqlJobClientBase (or create another intermediary abstract client here)
  • run the staging part of load.py only if the client implements this interface (see the sketch below)

Why? filesystem, weaviate or dummy have no need for a staging dataset.
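
A minimal sketch of that split, assuming a hypothetical WithStagingDataset interface; the names below are illustrative, not the actual dlt classes:

from abc import ABC, abstractmethod
from typing import Any, List


class WithStagingDataset(ABC):
    """Hypothetical intermediary interface: only clients that really use a staging
    dataset (SQL warehouses) implement it; filesystem, weaviate or dummy do not."""

    @abstractmethod
    def get_stage_dispositions(self) -> List[str]:
        """Write dispositions that must be loaded through the staging dataset."""

    @abstractmethod
    def initialize_staging_storage(self, truncate_tables: List[str]) -> None:
        """Create or truncate the staging tables before jobs start."""


def run_staging_phase(job_client: Any) -> None:
    # sketch of the load.py side: the staging phase only runs when the client
    # actually implements the staging interface
    if isinstance(job_client, WithStagingDataset):
        job_client.initialize_staging_storage(truncate_tables=[])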

rudolfix:

DestinationClientDwhConfiguration got really misleading: dataset_name and default_schema_name are never configured; they are arguments to the client instance, always passed via code. We can deal with that, but later.

dlt/common/destination/reference.py (resolved comment hidden)
def create_table_chain_completed_followup_jobs(self, table_chain: Sequence[TTableSchema]) -> List[NewLoadJob]:
    """Creates a list of followup jobs that should be executed after a table chain is completed"""
    return []

@abstractmethod
def create_merge_job(self, table_chain: Sequence[TTableSchema]) -> NewLoadJob:
rudolfix:

Hmmm, it looks like this can become a private method _create_merge_job, used internally by SqlJobClientBase.

    dispositions.append("replace")
    return dispositions

def truncate_destination_table(self, disposition: TWriteDisposition) -> bool:
rudolfix:

This is only used internally by SqlJobClientBase, so remove it from the ABC and add a _ prefix.

rudolfix:

And _should_truncate_destination_table is a better name, because we only check whether to truncate, we don't actually do it.

def create_merge_job(self, table_chain: Sequence[TTableSchema]) -> NewLoadJob:
    return SqlMergeJob.from_table_chain(table_chain, self.sql_client)

def create_staging_copy_job(self, table_chain: Sequence[TTableSchema]) -> NewLoadJob:
rudolfix:

same here -> private method

@@ -81,8 +81,10 @@ def _maybe_make_terminal_exception_from_data_error(pg_ex: psycopg2.DataError) ->

class RedshiftCopyFileLoadJob(CopyRemoteFileLoadJob):

-    def __init__(self, table: TTableSchema, file_path: str, sql_client: SqlClientBase[Any], staging_credentials: Optional[CredentialsConfiguration] = None, staging_iam_role: str = None) -> None:
+    def __init__(self, table: TTableSchema, file_path: str, sql_client: SqlClientBase[Any], use_staging_table: bool, truncate_destination_table: bool, staging_credentials: Optional[CredentialsConfiguration] = None, staging_iam_role: str = None) -> None:
rudolfix:

Heh, maybe reformat this; the line got too long (one possible wrapping is sketched below).
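
Purely as a formatting sketch, the same signature wrapped one parameter per line (types as in the surrounding module):

def __init__(
    self,
    table: TTableSchema,
    file_path: str,
    sql_client: SqlClientBase[Any],
    use_staging_table: bool,
    truncate_destination_table: bool,
    staging_credentials: Optional[CredentialsConfiguration] = None,
    staging_iam_role: str = None,
) -> None:
    ...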

dlt/destinations/sql_jobs.py (two resolved comments hidden)
os.environ['DESTINATION__REPLACE_STRATEGY'] = "staging"

# snowflake requires gcs prefix instead of gs in bucket path
if destination == "snowflake":
rudolfix:

Maybe we should normalize paths in the copy jobs for Snowflake? It has a specialized one.
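
A tiny sketch of what such normalization could look like (hypothetical helper, not the actual dlt code):

def to_snowflake_bucket_url(bucket_url: str) -> str:
    # Snowflake expects the gcs:// prefix, while bucket configuration for
    # Google Cloud Storage commonly uses gs://
    if bucket_url.startswith("gs://"):
        return "gcs://" + bucket_url[len("gs://"):]
    return bucket_url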

dlt/load/load.py (outdated)
logger.info(f"Client for {job_client.config.destination_name} will TRUNCATE STAGING TABLES: {merge_tables}")
job_client.initialize_storage(staging=True, truncate_tables=merge_tables)
staging_tables = set(job.table_name for job in staging_table_jobs)
job_client.update_storage_schema(staging=True, only_tables=staging_tables | dlt_tables, expected_update=expected_update)
rudolfix:

You resolved the conflict incorrectly: it should be only_tables=merge_tables | {VERSION_TABLE_NAME}.

dlt/load/load.py (outdated)
-    merge_jobs = self.get_new_jobs_info(load_id, schema, "merge")
-    if merge_jobs:
+    staging_table_jobs = self.get_new_jobs_info(load_id, schema, job_client.get_stage_dispositions())
+    if staging_table_jobs:
         logger.info(f"Client for {job_client.config.destination_name} will start initialize STAGING storage")
         job_client.initialize_storage(staging=True)
rudolfix:

MAYBE we could remove the staging flag from initialize_storage and update_storage_schema and have a with_staging context manager on SqlJobClientBase (the refactor I proposed previously?). That would remove all traces of staging from JobClientBase (the top-level ABC).
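
A rough sketch of that idea, assuming a with_staging_dataset context manager; this is illustrative only, not the actual dlt implementation:

from contextlib import contextmanager
from typing import Iterator


class StagingAwareClientSketch:
    """Hypothetical stand-in for SqlJobClientBase to show the shape of the idea."""

    def __init__(self) -> None:
        self.in_staging_mode = False

    @contextmanager
    def with_staging_dataset(self) -> Iterator["StagingAwareClientSketch"]:
        # everything executed inside the block targets the staging dataset, so
        # initialize_storage / update_storage_schema no longer need a staging flag
        self.in_staging_mode = True
        try:
            yield self
        finally:
            self.in_staging_mode = False


# usage sketch in the loader:
# with job_client.with_staging_dataset():
#     job_client.initialize_storage(truncate_tables=staging_tables)
#     job_client.update_storage_schema(only_tables=staging_tables, expected_update=expected_update)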

sh-rp (author):

I have created a subclass that supports the staging interface; it could actually also be a mixin, not sure.

sh-rp (author):

Also, there is one method left called drop_tables which has a staging parameter; maybe this should also be handled within the context manager?

sh-rp (author):

Ok, I removed it there too.

sh-rp force-pushed the d#/replace_with_staging branch 8 times, most recently from 2e1bfa2 to 0af6f10, on July 19, 2023 at 22:07
rudolfix (Collaborator) left a review:

Thanks for cleaning up the tests. The PR LGTM. I'll find out why the last test fails and maybe rename one of the strategies; then it is good to merge.

# drop destination table
sql.append(f"DROP TABLE IF EXISTS {table_name};")
# recreate destination table with data cloned from staging table
sql.append(f"CREATE TABLE {table_name} CLONE {staging_table_name};")
rudolfix:

This is a copy-on-write clone, exactly like BigQuery!
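
For comparison, a sketch of the analogous BigQuery statement (BigQuery table clones are copy-on-write as well; the helper name and the use of CREATE OR REPLACE are assumptions, not the PR's code):

def bigquery_clone_statement(table_name: str, staging_table_name: str) -> str:
    # BigQuery supports CREATE OR REPLACE TABLE ... CLONE ..., which avoids the
    # separate DROP used in the Snowflake snippet above
    return f"CREATE OR REPLACE TABLE {table_name} CLONE {staging_table_name};"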

# needed for bigquery staging tests
DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID: AKIAT4QMVMC4J46G55G4
DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
CREDENTIALS__PROJECT_ID: chat-analytics-rasa-ci
rudolfix:

All CREDENTIALS__ values below are already defined; look further up.

dataset_name = self._sql_client.dataset_name
if self._should_truncate_destination_table:
    self._sql_client.execute_sql(f"""TRUNCATE TABLE {table_name}""")
rudolfix:

I think _sql_client has a truncate method now?

from tests.utils import ALL_DESTINATIONS
from tests.load.pipeline.utils import destinations_configs, DestinationTestConfiguration, set_destination_config_envs

REPLACE_STRATEGIES = ["truncate-and-insert", "insert-from-staging", "optimized"]
rudolfix:

I'd rename optimized -> staging-optimized

rudolfix merged commit 5f963e9 into devel on Jul 21, 2023
35 checks passed
rudolfix deleted the d#/replace_with_staging branch on July 21, 2023 at 05:48
codingcyclist pushed a commit to codingcyclist/dlt that referenced this pull request Aug 23, 2023
* implement staging replace disposition

* add third replace strategy and parametrize test by replace strategy

* make parametrized destination tests nicer


* update docs

* add s3 deps to snowflake tests

* add more env vars for staging tests

* fix classic test and revert redshift tx change

* explains optimized stage replace per warehouse doc

* fixes bumping schema version and renames to staging-optimized

* fixes ci workflow env

* fixes several race conditions in filesystem replace

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>