
replace child table handling #499

Merged: 11 commits merged into devel on Jul 27, 2023

Conversation

sh-rp (Collaborator) commented Jul 17, 2023

One important note: this PR only handles proper child table clearing for the staging replace strategy. More changes are needed if we also want to clear child tables with the classic strategy.

The one test that is always failing is tests/load/pipeline/test_merge_disposition.py::test_merge_on_keys_in_schema. The problem is that the test ensures that tables defined in the schema but receiving no data do not get created. With this PR, however, all child tables of any table that has data are created on staging and cleared, so that child table replacing works. The way to solve this would be to look at the destination, see which tables are already present, and prevent creation of tables that are not present, but I took that part out again.
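
To make the child table problem concrete, here is a minimal sketch (pipeline name, resource name and the duckdb destination are made up for illustration) of a resource whose nested rows produce a child table that has to be cleared together with its parent on replace:

    import dlt

    @dlt.resource(write_disposition="replace")
    def items():
        # the nested list produces a child table (items__sub_items) next to the parent table
        yield {"id": 1, "sub_items": [{"name": "a"}, {"name": "b"}]}

    pipeline = dlt.pipeline(pipeline_name="replace_child_tables_demo", destination="duckdb")
    pipeline.run(items())  # creates items and items__sub_items
    pipeline.run(items())  # replace must clear both tables, not only the parent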

netlify bot commented Jul 17, 2023

Deploy Preview for dlt-hub-docs ready!

Latest commit: 632f971
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/64c21ed8c029680008f9e42a
Deploy Preview: https://deploy-preview-499--dlt-hub-docs.netlify.app

sh-rp marked this pull request as ready for review July 17, 2023 10:00
sh-rp force-pushed the d#/replace_child_table_handling branch 7 times, most recently from 1f1a873 to 65f0602 on July 21, 2023 07:23
    # find defined resources that did not yield any pipeitems and create empty jobs for them
    for resource in source.resources.selected.values():
        if resource.write_disposition != "replace" or resource._table_name_hint_fun:
            continue

sh-rp (Collaborator, author): how does this _table_name_hint_fun work?

Collaborator: you do not need to worry :) we do not track table names.

sh-rp force-pushed the d#/replace_child_table_handling branch 3 times, most recently from 5de7e15 to 57a80a5 on July 25, 2023 15:13
rudolfix (Collaborator) left a comment: review of extract.py

dlt/common/destination/reference.py (thread resolved)
@@ -72,6 +72,11 @@ def extract(

    with collector(f"Extract {source.name}"):

        def _write_empty_file(table_name: str) -> None:
            table_name = schema.naming.normalize_identifier(table_name)
            collector.update(table_name)
Collaborator: don't update collector with empty tables

dlt/extract/extract.py (thread resolved)

            _write_static_table(resource, table_name)
            _write_item(table_name, pipe_item.item)

    # find defined resources that did not yield any pipeitems and create empty jobs for them
Collaborator: what you should do here:

  1. find the difference between the set of resources that had data and all selected resources with write disposition "replace"
  2. find all tables that belong to the resources in that set (the code below comes from dlt/pipeline/helpers.py):

        data_tables = {t["name"]: t for t in self.schema.data_tables()}  # Don't remove _dlt tables
        resource_tables = group_tables_by_resource(data_tables, pattern=self.resource_pattern)

  3. generate empty files for all the tables found (see the sketch after this comment)

explanation: a resource can create any number of tables thanks to dynamic hints or an explicit dlt.mark.with_table_name. see the following resources:

@dlt.transformer(name="github_repo_events", primary_key="id", write_disposition="merge", table_name=lambda i: i['type'])
def github_repo_events(page):
    yield page


@dlt.transformer(name="github_repo_events", primary_key="id", write_disposition="merge")
def github_repo_events_table_meta(page):
    yield from [dlt.mark.with_table_name(p, p['type']) for p in page]

this is actually dlt being amazing :) you create a separate table for every event type in the github event feed. all those tables come from a single resource.
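
A rough sketch of those three steps, for illustration only (it reuses resources_with_items, _write_empty_file and utils.group_tables_by_resource from the surrounding diff and is not the final implementation):

    # 1. selected "replace" resources that did not yield any data
    resources_to_clear = {
        r.name
        for r in source.resources.selected.values()
        if r.write_disposition == "replace" and r.name not in resources_with_items
    }

    # 2. all data tables that belong to those resources (data_tables() excludes _dlt tables)
    data_tables = {t["name"]: t for t in schema.data_tables()}
    tables_by_resource = utils.group_tables_by_resource(data_tables)

    # 3. one empty file per table so the destination truncates it
    for resource_name in resources_to_clear:
        for table in tables_by_resource.get(resource_name, []):
            _write_empty_file(table["name"])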

sh-rp (Collaborator, author): In the case where the main resource yields nothing, the table does not show up in group_tables_by_resource, so I had to add an extra bit of code for this case. I hope this is alright.

rudolfix (Collaborator) left a comment:
two bigger things:

  1. please check the filesystem destination. it also has a replace "problem" that is brute-force fixed in the copy job. what we should do is to adopt the truncate-insert strategy and truncate data in initialize_storage (see the sketch after this list).
    this code can just move from the copy job to the client:
        # fallback to replace for merge without any merge keys
        if write_disposition == 'merge':
            write_disposition = 'append' if has_merge_keys else 'replace'

        # replace existing files. also check if dir exists for bucket storages that cannot create dirs
        if write_disposition == 'replace' and fs_client.isdir(dataset_path):
            job_info = LoadStorage.parse_job_file_name(file_name)
            # remove those files
            search_prefix = posixpath.join(dataset_path, f"{schema_name}.{job_info.table_name}.")
            # but leave actual load id - files may be loaded from other threads
            ignore_prefix = posixpath.join(dataset_path, f"{schema_name}.{job_info.table_name}.{load_id}.")
            # NOTE: glob implementation in fsspec does not look thread safe, way better is to use ls and then filter
            all_files: List[str] = fs_client.ls(dataset_path, detail=False, refresh=True)
            items = [item for item in all_files if item.startswith(search_prefix) and not item.startswith(ignore_prefix)]
            # NOTE: deleting in chunks on s3 does not raise on access denied, file non existing and probably other errors
            # if items:
            #     fs_client.rm(items[0])
            for item in items:
                # ignore file not found as we can have races from other deleting threads
                try:
                    fs_client.rm_file(item)
                except FileNotFoundError:
                    pass

and we do not need to filter by ignore_prefix and catch FileNotFoundError because we do truncate BEFORE we start any loading. so there is just one writer. this is so good

  2. we need more tests (see the test sketch after this list)
  • test the situation with many tables per resource (the github resources will be a good test)
  • you must test all the file types we have: jsonl, insert-values, parquet, to make sure empty files are correctly processed. empty == without rows. in particular I'm not sure about insert-values
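
Per point 1, a sketch of how the truncate could look once it lives in the filesystem client's storage initialization (function and parameter names are assumptions, not the final code); because truncation happens before any load job runs, the ignore_prefix filter and the FileNotFoundError handling are no longer needed:

    import posixpath

    def truncate_tables_in_storage(fs_client, dataset_path: str, schema_name: str, truncate_tables: list) -> None:
        # NOTE: glob implementation in fsspec does not look thread safe, way better is to use ls and then filter
        all_files = fs_client.ls(dataset_path, detail=False, refresh=True)
        for table_name in truncate_tables:
            search_prefix = posixpath.join(dataset_path, f"{schema_name}.{table_name}.")
            for item in all_files:
                if item.startswith(search_prefix):
                    # single writer: truncation runs before any job starts, so no race handling is needed
                    fs_client.rm_file(item)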
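
Per point 2, a sketch of the empty-file test matrix (the test name, pipeline name and duckdb destination are illustrative, and loader file format support varies per destination):

    import dlt
    import pytest

    @pytest.mark.parametrize("file_format", ["jsonl", "insert_values", "parquet"])
    def test_replace_resource_without_rows(file_format: str) -> None:
        @dlt.resource(write_disposition="replace")
        def items():
            yield from []  # selected, but yields no rows -> an empty file must load cleanly

        pipeline = dlt.pipeline(pipeline_name="empty_replace_files", destination="duckdb")
        info = pipeline.run(items(), loader_file_format=file_format)
        assert not info.has_failed_jobs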

files = p.get_load_package_info(p.list_completed_load_packages()[1]).jobs["completed_jobs"]
assert all(f.job_file_info.file_format == "parquet" for f in files)
assert all(f.job_file_info.file_format in ["parquet", "sql"] for f in files)
Collaborator: we should have no "sql" jobs because we replace with truncate and without staging. if we have sql jobs we are doing something wrong.

sh-rp (Collaborator, author) commented Jul 25, 2023

I am not sure how to implement this merge keys fallback strategy, since this info is not available in initialize storage. I could put something in the loader to hand this over, though.

Ok, I think we need to discuss this quickly. I have moved the code into the initialize function, but I will have to add some kind of filter function to reduce all top_tables present in jobs to the right set to be truncated, and this filter function will probably live in the client. It kind of feels right there, but not quite. Do you know what I mean? For filesystem, in the case of the merge write disposition, we need to check each table for whether merge keys are present on the actual table.
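
One possible shape for that filter, purely as a sketch (the method name, its place on the client, and the mapping of merge without merge keys to truncation mirror the old copy-job fallback and are assumptions):

    from dlt.common.schema.typing import TTableSchema

    def should_truncate_table(self, table: TTableSchema) -> bool:
        # replace: truncate only when the truncate-and-insert strategy is configured
        if table["write_disposition"] == "replace":
            return self.config.replace_strategy == "truncate-and-insert"
        # merge without merge keys fell back to replace in the old copy job
        if table["write_disposition"] == "merge":
            return not any(c.get("merge_key") for c in table["columns"].values())
        return False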

sh-rp force-pushed the d#/replace_child_table_handling branch from 7234356 to 1d56ad0 on July 26, 2023 07:58
sh-rp force-pushed the d#/replace_child_table_handling branch from ff08ac8 to 27f12ed on July 26, 2023 11:36
sh-rp force-pushed the d#/replace_child_table_handling branch from f5ec413 to d6fca85 on July 26, 2023 14:53
sh-rp (Collaborator, author) commented Jul 26, 2023

I have also decided to have the file destination fall back to append for merge. I think if we do this replace, then there will be problems when loading from staging destinations to redshift etc. with merge. Data will go missing.

    def _should_truncate_destination_table(self, disposition: TWriteDisposition) -> bool:
        return disposition == "replace" and self.config.replace_strategy == "truncate-and-insert"

    def get_truncate_destination_table_dispositions(self) -> List[TWriteDisposition]:
        if self.config.replace_strategy:
Collaborator: I was sure we truncate only when doing truncate-and-insert, not always?

sh-rp (Collaborator, author): crap, this was an editing error, good catch!
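
For clarity, the intended logic after fixing the editing error is presumably something like this (a sketch consistent with _should_truncate_destination_table above):

    def get_truncate_destination_table_dispositions(self) -> List[TWriteDisposition]:
        # only "replace" tables are truncated up front, and only for the truncate-and-insert strategy
        if self.config.replace_strategy == "truncate-and-insert":
            return ["replace"]
        return []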

else:
    # get partial table from table template
    if resource._table_name_hint_fun:
        if isinstance(pipe_item.item, List):
            for item in pipe_item.item:
                _write_dynamic_table(resource, item)
                table_name = resource._table_name_hint_fun(item)
Collaborator: you can move it back into _write_dynamic_table

            _write_item(table_name, resource.name, pipe_item.item)

    # find defined resources that did not yield any pipeitems and create empty jobs for them
    tables_by_resources = utils.group_tables_by_resource(schema.tables)
Collaborator: I think we can safely use data_tables. this eliminates _dlt tables and incomplete tables that were never created in the destination.

    for resource in source.resources.selected.values():
        if resource.write_disposition != "replace" or resource.name in resources_with_items:
            continue
        if resource.name not in tables_by_resources:
Collaborator: IMO we do not need the empty file. if the resource is not present for any complete table, this means there's nothing to truncate on the destination.

sh-rp (Collaborator, author): yes, you are right, I had an error in the test that required this.

destinations.
destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction.
The downside of this strategy is that your tables will have no data for a while until the load is completed. You
may end up with new data in some tables and no data in other tables if the load fails during the run.
Collaborator: add something like "Such an incomplete load may however be detected by checking if the _dlt_loads table contains the load id from _dlt_load_id of the replaced tables."
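
The suggested check could look roughly like this (my_replaced_table is a placeholder table name; _dlt_load_id and _dlt_loads.load_id are the standard dlt load tracking columns):

    # load ids present in the replaced table but missing from _dlt_loads point to an incomplete load
    with pipeline.sql_client() as client:
        rows = client.execute_sql(
            """
            SELECT DISTINCT t._dlt_load_id
            FROM my_replaced_table t
            LEFT JOIN _dlt_loads l ON l.load_id = t._dlt_load_id
            WHERE l.load_id IS NULL
            """
        )
    incomplete_load_ids = [r[0] for r in rows]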

        all_files = self.fs_client.ls(self.dataset_path, detail=False, refresh=True)
        for table in truncate_tables:
            search_prefix = posixpath.join(self.dataset_path, f"{self.schema.name}.{table}.")
            for item in all_files:
Collaborator: please preserve the comments from above. they are a nice warning that you cannot trust fsspec too much:

    # NOTE: glob implementation in fsspec does not look thread safe, way better is to use ls and then filter

and

    # NOTE: deleting in chunks on s3 does not raise on access denied, file non existing and probably other errors

@@ -485,6 +499,7 @@ def test_write_dispositions(client: SqlJobClientBase, write_disposition: str, re
        assert len(db_rows) == idx + 1
    elif write_disposition == "replace":
        # we overwrite with the same row. merge fallbacks to replace when no keys specified
        print(db_rows)
Collaborator: you can remove that

}]
yield dlt.mark.with_table_name(data, "items")
yield dlt.mark.with_table_name(data, "other_items")
yield None
Collaborator: what bothers me a little is that when we added the round-robin strategy to the pipe iterator, we also forced the iterator to get the next value if None is yielded, so None should never get out of the pipe:

    item = None
    while item is None:
        item = next(gen)

so how is it possible that you've got this INSERT VALUE with NULL that we were looking at?

sh-rp (Collaborator, author): This happens if yield None is the only statement in the resource and no data is ever yielded. I did not investigate why this happens, though.

sh-rp (Collaborator, author): Should I? I think this would be a topic for a separate ticket, though, and we should merge the replace work soonish.
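
For reference, the shape of the resource described above, where yield None is the only statement, is roughly this (a hypothetical repro, not taken from the test suite):

    import dlt

    @dlt.resource(write_disposition="replace")
    def only_none():
        # no data is ever yielded; per the discussion above this surfaced
        # as an INSERT with a NULL row instead of being filtered out
        yield None

    pipeline = dlt.pipeline(pipeline_name="yield_none_repro", destination="duckdb")
    pipeline.run(only_none())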

rudolfix (Collaborator) commented

(quoting sh-rp) I have also decided to have the file destination fall back to append for merge. I think if we do this replace, then there will be problems when loading from staging destinations to redshift etc. with merge. Data will go missing.

hmmm why? the redshift destination copies files from the load package. all the other files were already copied and may be deleted (or not). I do not mean you should restore this edge case (it is not so important) - but no data will be missing in that case!

when we start implementing the delta lake then yeah, we will replace old files with new ones, but that's according to write disposition, so all is still good.

sh-rp (Collaborator, author) commented Jul 27, 2023

(quoting rudolfix) hmmm why? the redshift destination copies files from the load package. all the other files were already copied and may be deleted (or not). I do not mean you should restore this edge case (it is not so important) - but no data will be missing in that case!

Hm, I think you are right, I made an error in my reasoning here; I was already thinking about s3 and Athena. But I think it would be good to either keep it like this or throw an error if merge is used with the filesystem destination.

rudolfix (Collaborator) commented

(quoting sh-rp) Hm, I think you are right, I made an error in my reasoning here; I was already thinking about s3 and Athena. But I think it would be good to either keep it like this or throw an error if merge is used with the filesystem destination.

FYI: all the hints are handled as best effort by destinations. if something is not supported it gets ignored... so the existing code is good.

rudolfix merged commit 54499a1 into devel Jul 27, 2023
36 checks passed
rudolfix deleted the d#/replace_child_table_handling branch July 27, 2023 11:11