Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add normalization test cases #2992

Merged

Conversation

ChristopheDuong
Copy link
Contributor

@ChristopheDuong ChristopheDuong commented Apr 20, 2021

What

Add a test case to normalization (#2750) where name collisions happens on postgres

How

Describe the solution
No solution yet.

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

  1. test.java
  2. component.ts
  3. the rest

@@ -171,6 +171,7 @@ def test_normalize_column_name(input_str: str, destination_type: str, expected:
("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iii", "Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iii"),
# over the limit
("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii", "Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii"),
("Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii", "Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii"),
Copy link
Contributor Author

@ChristopheDuong ChristopheDuong Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When truncated, this test case ends up with the exact same truncated name as the previous test...
So there is potential table name conflict!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch! that's bad. how should we handle it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like in the implementation above you're throwing an error. is there anyway to handle this without throwing an error?

Copy link
Contributor Author

@ChristopheDuong ChristopheDuong Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when it's finding a duplicate name, the code tries a second chance by adding a 3 character hash of the full stream name (without truncation) at the end of the truncated stream name.

If that second chance still has collisions (because the 3 character hash was already used too? or maybe because the catalog somehow contains the exact same stream name twice) then it'll fail
(I assume the second chance with the hash should have low probability to fail though)

We could make it not throw an error by appending a random suffix character or number (_1, _2, _3 etc) instead of erroring but it's not deterministic anymore ...

@ChristopheDuong ChristopheDuong marked this pull request as ready for review April 20, 2021 17:20
Copy link
Contributor

@cgardens cgardens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch. is there a way we can solve this collision without throwing an exception?

are all the files being added in this PR actually used? was having trouble tracking that.

@@ -171,6 +171,7 @@ def test_normalize_column_name(input_str: str, destination_type: str, expected:
("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iii", "Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iii"),
# over the limit
("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii", "Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii"),
("Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii", "Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch! that's bad. how should we handle it?

@@ -171,6 +171,7 @@ def test_normalize_column_name(input_str: str, destination_type: str, expected:
("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iii", "Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iii"),
# over the limit
("Aaaa_Bbbb_Cccc_Dddd_Eeee_Ffff_Gggg_Hhhh_Iiii", "Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii"),
("Aaaa_Bbbb_Cccc_Dddd_a_very_long_name_Ffff_Gggg_Hhhh_Iiii", "Aaaa_Bbbb_Cccc_Dddd___e_Ffff_Gggg_Hhhh_Iiii"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like in the implementation above you're throwing an error. is there anyway to handle this without throwing an error?

@@ -0,0 +1,3 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this file being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test was originally made for testing nested streams in the catalog but in this scenario, it's not using any nesting so it needs an empty nested file for expected nested streams

So for the test to pass, it is needed yes

@@ -0,0 +1,13 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this file being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is used and refer as the parameter "edge_cases_catalog" in airbyte-integrations/bases/base-normalization/unit_tests/test_stream_processor.py

@@ -0,0 +1,12 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this file being used?

@ChristopheDuong ChristopheDuong merged commit 07a45df into master Apr 22, 2021
@ChristopheDuong ChristopheDuong deleted the chris/normalization-test-stream-name-collisions branch April 22, 2021 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants