Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Remove hash when it is not necessary from normalization outputs #3704

Merged
merged 14 commits into from
Jun 1, 2021

Conversation

ChristopheDuong
Copy link
Contributor

@ChristopheDuong ChristopheDuong commented May 28, 2021

What

Closes #3523
Closes #2389

How

Introduce a two-stage loop through the catalog

  1. The first step is to collect all stream names (from top-level and nested ones) in the catalog
  2. Then, resolve the naming issues by taking care of all naming from this global view of the catalog (not stream by stream) (hash are used in the names only if necessary)
  3. Finally, actually generates model files using the resolved names.

In the future, we could separate the first stage when setting up the connection page, so the UI could have a chance to customize how to resolve naming conflicts. (The table name registry can then be imported from a file)

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

Ignore .sql and .json files changes

  1. airbyte-integrations/bases/base-normalization/normalization/transform_catalog/table_name_registry.py
  2. airbyte-integrations/bases/base-normalization/normalization/transform_catalog/catalog_processor.py
  3. airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py
  4. the rest

@@ -29,6 +29,7 @@
import pkgutil
import shutil
from enum import Enum
from typing import Any, Dict
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this file are made to comply to MyPy...

@ChristopheDuong ChristopheDuong marked this pull request as ready for review May 28, 2021 11:32
@ChristopheDuong
Copy link
Contributor Author

ChristopheDuong commented May 28, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/886224093
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/886224093

Copy link
Contributor

@davinchia davinchia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job Chris! Makes lots of sense. Only one comment from me.

I appreciated the great PR description and comments and the use of MyPy. Made much easier to review.

Feel free to merge whenever.


class TableNameRegistry:
"""
A registry object that records table names being used during the run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing! Definitely easier for others to follow with this!


stream_processor.collect_table_names()
for conflict in tables_registry.resolve_names():
print(f"WARN: Resolving conflict: {conflict[0]}.{conflict[1]} from '{'.'.join(conflict[2])}' into {conflict[3]}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea to print out the changed names!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the future, these "exceptions/conflicts" could be collected and displayed as warnings in the UI when setting up normalization for the connection

# before reaching the database name length limit
# 2 characters for signaling truncate with '__' and 6 others for generating unique strings
TRUNCATE_RESERVED_SIZE: int = 8
# we keep 4 characters for 1 underscore and 3 characters for suffix (_ab1, _ab2, etc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@ChristopheDuong
Copy link
Contributor Author

ChristopheDuong commented Jun 1, 2021

/publish connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/896333887
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/896333887

@ChristopheDuong ChristopheDuong merged commit bb4dcb1 into master Jun 1, 2021
@ChristopheDuong ChristopheDuong deleted the chris/normalization-hash branch June 1, 2021 15:07
@avaidyanatha avaidyanatha changed the title Remove hash when it is not necessary from normalization outputs 🎉 Remove hash when it is not necessary from normalization outputs Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants