deterministic table name collision resolution for normalization #2206

Merged
jrhizor merged 23 commits into master from jrhizor/deterministic-collision-resolution on Mar 1, 2021

Conversation

@jrhizor (Contributor) commented Feb 25, 2021

Resolves #2055

Curious what you think of this approach of having hashes for all nested tables instead of just duplicates.

My reasoning was that it seemed unintuitive that you could start with a schema containing only B -> x, which is rendered as subtable x, and then update the schema to contain both B -> x AND A -> x. If the method used to decide which table is the duplicate defines A < B, the existing table from B would be relabeled x_hash while the new one from A would take x.

With this approach it isn't ergonomic to refer to the tables with the hashes, but it may still be more ergonomic than trying to figure out how to rename usages downstream.

I want to confirm this first. Then I'll add more unit testing.
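For readers following along, here is a minimal sketch of the idea (hypothetical helper, not the code in this PR): derive a short, deterministic hash from a nested table's full JSON path, so the chosen name no longer depends on which duplicate happens to be discovered first.

```python
import hashlib
from typing import List


def lineage_hash(json_path: List[str], length: int = 6) -> str:
    # Deterministic short hash of a nested table's full JSON path, so the same
    # lineage always produces the same suffix regardless of processing order.
    key = "/".join(json_path)
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:length]


# "A -> x" and "B -> x" get distinct but stable names:
print(lineage_hash(["A", "x"]))  # a 6-character hex string
print(lineage_hash(["B", "x"]))  # a different hash, stable across runs
```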

@michel-tricot (Contributor) left a comment

One thing: I think it would be more helpful if the format were:
{stream_name}_{lineage_hash}_{submodel_name}

Otherwise it is very hard to understand what is in the table from just the hashes.

@ChristopheDuong (Contributor) left a comment

> One thing, I think it would be more helpful if the format was:
> {stream_name}_{lineage_hash}_{submodel_name}
>
> Otherwise it is very hard to understand what is in the table with just hashes

On the Hubspot catalog, you get these fields, for example:

  • root stream name: contacts
  • Lineage: contacts/properties/
  • Nested Property name: hs_content_membership_registration_domain_sent_to

[Screenshot 2021-02-25 at 11:00:02]

On Postgres destinations, we already truncate that table name to hs_content_membership___tration_domain_sent_to to save a few characters, but if we were to also include contacts or contacts/properties on top of it, it would become even more difficult to fit...

But yes, it would be good if we could at least fit in contacts (maybe in the cases where we don't hit the character limits?).
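For illustration, a rough sketch of the kind of middle truncation being discussed (Postgres caps identifiers at 63 characters; the helper itself is hypothetical, not the PR's code):

```python
def middle_truncate(name: str, limit: int = 63) -> str:
    # Keep the head and tail of the identifier and mark the cut with '__',
    # similar to hs_content_membership___tration_domain_sent_to above.
    if len(name) <= limit:
        return name
    head = (limit - 2) // 2
    tail = limit - 2 - head
    return f"{name[:head]}__{name[-tail:]}"


# Prefixing the nested field with its stream makes it exceed the limit:
print(middle_truncate("contacts_properties_hs_content_membership_registration_domain_sent_to"))
```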

@@ -418,7 +421,7 @@ def generate_final_model(self, from_table: str, column_names: Dict[str, Tuple[st
     def list_fields(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]:
         return [column_names[field][0] for field in column_names]

-    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: str = "") -> str:
+    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: str = None) -> str:
Contributor:

If you run mypy you'll get this warning because of this change:

airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py:424:77: error: Incompatible default for argument "suffix" (default has type "None", argument has type "str")

So it was on purpose that suffix defaulted to the empty string "".
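If a None default is actually wanted, the usual fix is to widen the annotation instead; a sketch only, with the class name assumed from the file path in the error and the body simplified:

```python
from typing import Optional


class StreamProcessor:
    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: Optional[str] = None) -> str:
        # Treat a missing suffix the same way as the previous "" default.
        suffix = suffix if suffix is not None else ""
        return sql + suffix  # body elided; for illustration only
```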

@ChristopheDuong (Contributor) commented Feb 25, 2021

FYI when running the tests I included here: #2211

An exception is raised with your code because it is generating duplicated table names on the Hubspot catalog.json

I don't know why... is it because 6-character hashes are still generating collisions?

Another remark: intermediate tables are also using the lineage_hash in their naming...
So if the stream processor is working on a top-level stream charges.sql, its intermediate files are no longer named charges_ab1.sql, charges_ab2.sql, etc.

Should this be kept aligned, with the lineage hash used only on nested tables/intermediate tables?
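To make the question concrete, a small sketch (hypothetical helper, not the PR's code) where top-level streams keep their plain _ab1/_ab2 intermediate names and only nested tables carry the lineage hash:

```python
from typing import List


def intermediate_name(json_path: List[str], suffix: str, lineage_hash: str) -> str:
    base = json_path[-1]
    if len(json_path) == 1:
        # top-level stream: charges_ab1, charges_ab2, ...
        return f"{base}{suffix}"
    # nested table: {stream_name}_{lineage_hash}_{submodel_name}{suffix}
    return f"{json_path[0]}_{lineage_hash}_{base}{suffix}"


print(intermediate_name(["charges"], "_ab1", "1a2b3c"))             # charges_ab1
print(intermediate_name(["contacts", "properties"], "", "1a2b3c"))  # contacts_1a2b3c_properties
```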

@ChristopheDuong (Contributor) commented
Looking closer at the destination limits, BigQuery (Snowflake too?) has pretty generous limits on identifier length. Would it make sense not to include the not-so-user-friendly lineage hash by default for those destinations, since we would rarely need to truncate anything there?

@ChristopheDuong (Contributor) commented
@jrhizor, not sure if you will keep truncating in the middle of the string, so just in case, I put up a small PR to fix it: #2223

@jrhizor (Contributor, Author) commented Feb 26, 2021

PTAL, should be working as expected if you look at the test cases.

@@ -26,6 +26,7 @@

import pendulum as pendulum
from base_python import BaseClient

Contributor:

the new annoying formatting disagreement between spotless and black? 🙄

def get_table_name(
    name_transformer: DestinationNameTransformer, root_table: str, base_table_name: str, suffix: str, lineage_hash: str
) -> str:
    prefix = f"{name_transformer.normalize_table_name(root_table)[:10]}_{lineage_hash}_" if root_table else ""
Contributor:

Shouldn't we truncate only if we run into character limits problems?
If we have space to spare, why wouldn't we include the full name for clarity?

Contributor:

agreed. we should be degrading only if needed.
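As a sketch of that "degrade only if needed" idea (hypothetical helper, assuming the destination's max identifier length is known):

```python
def pick_identifier(readable_name: str, hashed_name: str, max_length: int) -> str:
    # Prefer the fully readable name; fall back to the hashed/truncated form
    # only when the destination's identifier limit would be exceeded.
    return readable_name if len(readable_name) <= max_length else hashed_name
```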

@michel-tricot (Contributor) left a comment

Looks good, modulo the comment above and adding some comments in the code to explain the truncation logic.

    # Add extra characters '__', signaling a truncate in identifier
    print(f"Truncating {input_name} (#{len(input_name)}) to {prefix}__{suffix} (#{2 + len(prefix) + len(suffix)})")
    input_name = f"{prefix}__{suffix}"
else:
Contributor:

Shouldn't it be that if it is not there, we assume there is no need for truncation?

@@ -463,18 +464,10 @@ def generate_new_table_name(self, is_intermediate: bool, suffix: str) -> str:
# see alias in dbt: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-aliases/
pass
pass
elif self.parent is None:
Contributor:

not self.parent

new_table_name = self.name_transformer.normalize_table_name(f"{table_name}_{i}")
if new_table_name not in tables_registry:
    break
new_table_name = get_table_name(self.name_transformer, self.json_path[0], new_table_name, suffix, self.json_path)
@ChristopheDuong (Contributor) commented Mar 1, 2021

While keeping your logic in get_table_name, we could make it even easier for the user to follow through the multiple levels of nesting. WDYT of "_".join(self.json_path) instead of self.json_path[0]?

The json_path would still be truncated in priority (keeping the first parent within the MINIMUM_PARENT_LENGTH characters) to fit within the proper limits when necessary, but we would expose as much of the JSON path as possible.

@jrhizor (Contributor, Author) commented Mar 1, 2021

done in 64df137, just needed to exclude the final element of the path (itself)
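For reference, a sketch of what that change amounts to (assuming json_path ends with the table's own name, which is why the final element is dropped before joining):

```python
from typing import List


def parent_lineage(json_path: List[str]) -> str:
    # Join every level of nesting except the table's own name (the last element),
    # e.g. ["contacts", "properties", "x"] -> "contacts_properties"
    return "_".join(json_path[:-1])
```

The result would then be passed to get_table_name in place of self.json_path[0].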

@jrhizor merged commit fa505c7 into master on Mar 1, 2021
@jrhizor deleted the jrhizor/deterministic-collision-resolution branch on March 1, 2021 at 19:25
@michel-tricot (Contributor) left a comment

Looks great!

@param input_name is the identifier name to middle truncate
@param custom_limit uses a custom length as the max instead of the destination max length
"""
limit = custom_limit if custom_limit > 0 else self.get_name_max_length()
Contributor:

shouldn't we use None instead of a placeholder value?
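A sketch of the None-based alternative; the class name and get_name_max_length come from the snippets above, while the truncating method's name and body are assumed and simplified here:

```python
from typing import Optional


class DestinationNameTransformer:
    def get_name_max_length(self) -> int:
        return 63  # e.g. Postgres; purely illustrative

    def truncate_identifier_name(self, input_name: str, custom_limit: Optional[int] = None) -> str:
        # None signals "use the destination default" without a magic sentinel value.
        limit = custom_limit if custom_limit is not None else self.get_name_max_length()
        return input_name[:limit]  # the real code would middle-truncate instead
```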
