deterministic table name collision resolution for normalization #2206

Merged
jrhizor merged 23 commits into master from jrhizor/deterministic-collision-resolution on Mar 1, 2021

Conversation

@jrhizor (Contributor) commented Feb 25, 2021

Resolves #2055

Curious what you think of this approach of having hashes for all nested tables instead of just duplicates.

My reasoning was that it seemed unintuitive that you could start with a schema containing only B -> x, which is rendered as subtable x, and then update the schema to contain both B -> x AND A -> x. If the method used to decide which table is the duplicate defines A < B, the existing table from B would be relabeled x_hash while the new one from A would take x.

With this approach it isn't ergonomic to refer to the tables with the hashes, but it may still be more ergonomic than trying to figure out how to rename usages downstream.

I want to confirm this first. Then I'll add more unit testing.
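For readers following along, here is a minimal sketch of the idea (hypothetical helper, not the code in this PR): derive a short, deterministic hash from a nested table's full JSON path, so the chosen name no longer depends on which duplicate happens to be discovered first.

```python
import hashlib
from typing import List


def lineage_hash(json_path: List[str], length: int = 6) -> str:
    # Deterministic short hash of a nested table's full JSON path, so the same
    # lineage always produces the same suffix regardless of processing order.
    key = "/".join(json_path)
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:length]


# "A -> x" and "B -> x" get distinct but stable names:
print(lineage_hash(["A", "x"]))  # a 6-character hex string
print(lineage_hash(["B", "x"]))  # a different hash, stable across runs
```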

@michel-tricot (Contributor) left a comment

One thing: I think it would be more helpful if the format were:
{stream_name}_{lineage_hash}_{submodel_name}

Otherwise it is very hard to understand what is in the table from just the hashes.

@ChristopheDuong (Contributor) left a comment

> One thing, I think it would be more helpful if the format was:
> {stream_name}_{lineage_hash}_{submodel_name}
>
> Otherwise it is very hard to understand what is in the table with just hashes

On the Hubspot catalog, you get these fields, for example:

  • root stream name: contacts
  • Lineage: contacts/properties/
  • Nested Property name: hs_content_membership_registration_domain_sent_to

[Screenshot 2021-02-25 at 11:00:02]

On Postgres destinations, we already truncate that table name to hs_content_membership___tration_domain_sent_to to save a few characters, but if we were to also include contacts or contacts/properties on top of it, it would become even more difficult to fit...

But yes, it would be good if we could at least fit in contacts (maybe in the cases where we don't hit the character limits?).
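For illustration, a rough sketch of the kind of middle truncation being discussed (Postgres caps identifiers at 63 characters; the helper itself is hypothetical, not the PR's code):

```python
def middle_truncate(name: str, limit: int = 63) -> str:
    # Keep the head and tail of the identifier and mark the cut with '__',
    # similar to hs_content_membership___tration_domain_sent_to above.
    if len(name) <= limit:
        return name
    head = (limit - 2) // 2
    tail = limit - 2 - head
    return f"{name[:head]}__{name[-tail:]}"


# Prefixing the nested field with its stream makes it exceed the limit:
print(middle_truncate("contacts_properties_hs_content_membership_registration_domain_sent_to"))
```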

@@ -418,7 +421,7 @@ def generate_final_model(self, from_table: str, column_names: Dict[str, Tuple[st
     def list_fields(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]:
         return [column_names[field][0] for field in column_names]

-    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: str = "") -> str:
+    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: str = None) -> str:
Contributor:

If you run mypy you'll get this warning because of this change:

airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py:424:77: error: Incompatible default for argument "suffix" (default has type "None", argument has type "str")

So it was on purpose that suffix defaulted to the empty string "".
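If a None default is actually wanted, the usual fix is to widen the annotation instead; a sketch only, with the class name assumed from the file path in the error and the body simplified:

```python
from typing import Optional


class StreamProcessor:
    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: Optional[str] = None) -> str:
        # Treat a missing suffix the same way as the previous "" default.
        suffix = suffix if suffix is not None else ""
        return sql + suffix  # body elided; for illustration only
```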

@ChristopheDuong (Contributor) commented Feb 25, 2021

FYI when running the tests I included here: #2211

An exception is raised with your code because it is generating duplicated table names on the Hubspot catalog.json

I don't know why... is it because 6-character hashes are still generating collisions?

Another remark: intermediate tables are also using the lineage_hash in their naming...
So if the stream processor is working on a top-level stream charges.sql, its intermediate files are no longer named charges_ab1.sql, charges_ab2.sql, etc.

Should this be kept aligned, with the lineage hash used only on nested tables/intermediate tables?
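To make the question concrete, a small sketch (hypothetical helper, not the PR's code) where top-level streams keep their plain _ab1/_ab2 intermediate names and only nested tables carry the lineage hash:

```python
from typing import List


def intermediate_name(json_path: List[str], suffix: str, lineage_hash: str) -> str:
    base = json_path[-1]
    if len(json_path) == 1:
        # top-level stream: charges_ab1, charges_ab2, ...
        return f"{base}{suffix}"
    # nested table: {stream_name}_{lineage_hash}_{submodel_name}{suffix}
    return f"{json_path[0]}_{lineage_hash}_{base}{suffix}"


print(intermediate_name(["charges"], "_ab1", "1a2b3c"))             # charges_ab1
print(intermediate_name(["contacts", "properties"], "", "1a2b3c"))  # contacts_1a2b3c_properties
```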

@ChristopheDuong (Contributor) commented
Looking closer at the destination limits, BigQuery (Snowflake too?) has pretty generous limits on identifier length. Would it make sense not to include the not-so-user-friendly lineage hash by default for those destinations, since we would rarely need to truncate anything there?

@ChristopheDuong (Contributor) commented
@jrhizor, not sure if you will keep truncating in the middle of the string, so just in case, I put up a small PR to fix it: #2223

@jrhizor (Contributor, Author) commented Feb 26, 2021

PTAL, should be working as expected if you look at the test cases.

@@ -26,6 +26,7 @@

import pendulum as pendulum
from base_python import BaseClient

Contributor:

the new annoying formatting disagreement between spotless and black? 🙄

def get_table_name(
    name_transformer: DestinationNameTransformer, root_table: str, base_table_name: str, suffix: str, lineage_hash: str
) -> str:
    prefix = f"{name_transformer.normalize_table_name(root_table)[:10]}_{lineage_hash}_" if root_table else ""
Contributor:

Shouldn't we truncate only if we run into character limits problems?
If we have space to spare, why wouldn't we include the full name for clarity?

Contributor:

agreed. we should be degrading only if needed.
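As a sketch of that "degrade only if needed" idea (hypothetical helper, assuming the destination's max identifier length is known):

```python
def pick_identifier(readable_name: str, hashed_name: str, max_length: int) -> str:
    # Prefer the fully readable name; fall back to the hashed/truncated form
    # only when the destination's identifier limit would be exceeded.
    return readable_name if len(readable_name) <= max_length else hashed_name
```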

@michel-tricot (Contributor) left a comment

Looks good, modulo the comment above and adding some comments in the code to explain the truncation logic.

    # Add extra characters '__', signaling a truncate in identifier
    print(f"Truncating {input_name} (#{len(input_name)}) to {prefix}__{suffix} (#{2 + len(prefix) + len(suffix)})")
    input_name = f"{prefix}__{suffix}"
else:
Contributor:

Shouldn't it be that if it is not there, we assume there is no need for truncation?

@@ -463,18 +464,10 @@ def generate_new_table_name(self, is_intermediate: bool, suffix: str) -> str:
# see alias in dbt: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-aliases/
pass
pass
elif self.parent is None:
Contributor:

not self.parent

new_table_name = self.name_transformer.normalize_table_name(f"{table_name}_{i}")
if new_table_name not in tables_registry:
    break
new_table_name = get_table_name(self.name_transformer, self.json_path[0], new_table_name, suffix, self.json_path)
@ChristopheDuong (Contributor) commented Mar 1, 2021

While keeping your logic in get_table_name, we could make it even easier for the user to follow through the multiple levels of nesting. WDYT of "_".join(self.json_path) instead of self.json_path[0]?

The json_path would still be truncated in priority (keeping the first parent within the MINIMUM_PARENT_LENGTH characters) to fit within the proper limits when necessary, but we would expose as much of the JSON path as possible.

@jrhizor (Contributor, Author) commented Mar 1, 2021

done in 64df137, just needed to exclude the final element of the path (itself)
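For reference, a sketch of what that change amounts to (assuming json_path ends with the table's own name, which is why the final element is dropped before joining):

```python
from typing import List


def parent_lineage(json_path: List[str]) -> str:
    # Join every level of nesting except the table's own name (the last element),
    # e.g. ["contacts", "properties", "x"] -> "contacts_properties"
    return "_".join(json_path[:-1])
```

The result would then be passed to get_table_name in place of self.json_path[0].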

@jrhizor merged commit fa505c7 into master on Mar 1, 2021
@jrhizor deleted the jrhizor/deterministic-collision-resolution branch on March 1, 2021 at 19:25
@michel-tricot (Contributor) left a comment

Looks great!

@param input_name is the identifier name to middle truncate
@param custom_limit uses a custom length as the max instead of the destination max length
"""
limit = custom_limit if custom_limit > 0 else self.get_name_max_length()
Contributor:

shouldn't we use None instead of a placeholder value?
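A sketch of the None-based alternative; the class name and get_name_max_length come from the snippets above, while the truncating method's name and body are assumed and simplified here:

```python
from typing import Optional


class DestinationNameTransformer:
    def get_name_max_length(self) -> int:
        return 63  # e.g. Postgres; purely illustrative

    def truncate_identifier_name(self, input_name: str, custom_limit: Optional[int] = None) -> str:
        # None signals "use the destination default" without a magic sentinel value.
        limit = custom_limit if custom_limit is not None else self.get_name_max_length()
        return input_name[:limit]  # the real code would middle-truncate instead
```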
