Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fix normalization issue with quoted & case sensitive columns #9317

Merged
merged 32 commits into from
Jan 6, 2022

Conversation

ChristopheDuong
Copy link
Contributor

@ChristopheDuong ChristopheDuong commented Jan 5, 2022

What

Relates to https://github.com/airbytehq/oncall/issues/84

How

It seems MSSQL does care about case sensitivity of columns (even quoted ones).
(This might be depending on a "collation" settings of the server too)

But this PR makes it so we always resolve column conflicts by lowercasing all identifiers.

When I added the new test case, it surfaced similar behavior/errors on other destinations, so I'm fixing it there too.
(bigquery/mysql)

Recommended reading order

  1. x.java
  2. y.python

@ChristopheDuong ChristopheDuong changed the base branch from master to chris/fix-bq-normalization-scd-float January 5, 2022 17:13
@ChristopheDuong
Copy link
Contributor Author

ChristopheDuong commented Jan 5, 2022

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1659526585
❌ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1659526585
🐛

@jrhizor jrhizor temporarily deployed to more-secrets January 5, 2022 17:24 Inactive
@ChristopheDuong
Copy link
Contributor Author

ChristopheDuong commented Jan 5, 2022

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1659612772
❌ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1659612772
🐛

@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 5, 2022 17:47 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets January 5, 2022 17:48 Inactive
Comment on lines +27 to +28
"User id",
"user id",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I don't understand how the first User id is not converted to lower case. The main change in the Python file is this one I think:

-            if not is_quoted and not self.needs_quotes(input_name):
-                result = input_name.lower()
+            result = input_name.lower()

which seems to universally convert names to lower case. Can you elaborate?

Copy link
Contributor Author

@ChristopheDuong ChristopheDuong Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file you commented on is the output for Postgres, not mssql (i think the mssql one is not versioned in git).

  • In Postgres, when quoted, the identifier's case matters (so "User Id", "User id" and "user id" are different column names). So nothing has changed here.
  • But as the integration test reveals and as the reported On-Call issue reports, MSSQL does not make a difference with these 3 columns names with different cases and considers them all to be equal, and thus conflicting.

So, to mirror this behavior with MSSQL in normalization code, I universally lowercase all column identifiers for MSSQL only even if they are quoted. As a consequence, normalization now detects column name conflicts and will resolve it by renaming conflicting names with extra '_1', '_2' etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm going to change the PR and isolate it to normalize this sort of casing only when looking for column name conflicts...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the explanation.

@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 6, 2022 12:12 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets January 6, 2022 12:13 Inactive
@ChristopheDuong
Copy link
Contributor Author

ChristopheDuong commented Jan 6, 2022

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1662948830
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1662948830
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                       Stmts   Miss  Cover
	 --------------------------------------------------------------
	 base_python/__init__.py                       13      0   100%
	 base_python/catalog_helpers.py                10      6    40%
	 base_python/cdk/__init__.py                    0      0   100%
	 base_python/cdk/abstract_source.py            89     64    28%
	 base_python/cdk/streams/__init__.py            0      0   100%
	 base_python/cdk/streams/auth/__init__.py       0      0   100%
	 base_python/cdk/streams/auth/core.py           8      1    88%
	 base_python/cdk/streams/auth/jwt.py            5      5     0%
	 base_python/cdk/streams/auth/oauth.py         37     26    30%
	 base_python/cdk/streams/auth/token.py          9      4    56%
	 base_python/cdk/streams/core.py               63     32    49%
	 base_python/cdk/streams/exceptions.py         10      2    80%
	 base_python/cdk/streams/http.py               67     33    51%
	 base_python/cdk/streams/rate_limiting.py      30     14    53%
	 base_python/cdk/utils/__init__.py              0      0   100%
	 base_python/cdk/utils/casing.py                4      0   100%
	 base_python/cdk/utils/event_timing.py         47      3    94%
	 base_python/client.py                         56     33    41%
	 base_python/entrypoint.py                     70     56    20%
	 base_python/integration.py                    52     25    52%
	 base_python/logger.py                         33     15    55%
	 base_python/schema_helpers.py                 56     41    27%
	 base_python/source.py                         51     34    33%
	 main_dev.py                                    3      3     0%
	 --------------------------------------------------------------
	 TOTAL                                        713    397    44%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     148      7    95%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 518    331    36%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     32    78%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1247    520    58%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        74      6    92%
	 source_acceptance_test/conftest.py                     109    109     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              242     96    60%
	 source_acceptance_test/tests/test_full_refresh.py       38      0   100%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  54     17    69%
	 source_acceptance_test/utils/compare.py                 62     23    63%
	 source_acceptance_test/utils/connector_runner.py       110     48    56%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  979    404    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     148      7    95%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 518    331    36%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     32    78%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1247    520    58%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     12    92%
	 normalization/transform_catalog/destination_name_transformer.py     148      5    97%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 518     39    92%
	 normalization/transform_catalog/table_name_registry.py              174     51    71%
	 normalization/transform_catalog/transform.py                         45     30    33%
	 normalization/transform_catalog/utils.py                             33      0   100%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     45    69%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1247    188    85%

@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 6, 2022 12:18 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets January 6, 2022 12:19 Inactive
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 6, 2022 12:52 Inactive
Comment on lines 10 to 14
json_value(_airbyte_data, ''$."User Id"'') as "User Id",
json_value(_airbyte_data, ''$."user_id"'') as user_id,
json_value(_airbyte_data, ''$."User id"'') as "User id_1",
json_value(_airbyte_data, ''$."user id"'') as "user id_2",
json_value(_airbyte_data, ''$."UserId"'') as userid,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuliren here is the output of the new test for MSSQL

@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 6, 2022 13:58 Inactive
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 6, 2022 14:07 Inactive
@ChristopheDuong ChristopheDuong changed the title Fix mssql normalization issue with case sensitive columns 🐛 Fix normalization issue with quoted & case sensitive columns Jan 6, 2022
Base automatically changed from chris/fix-bq-normalization-scd-float to master January 6, 2022 17:49
@github-actions github-actions bot added the area/connectors Connector related issues label Jan 6, 2022
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets January 6, 2022 17:57 Inactive
@ChristopheDuong ChristopheDuong merged commit c5d4a97 into master Jan 6, 2022
@ChristopheDuong ChristopheDuong deleted the chris/fix-mssql-normalization branch January 6, 2022 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues normalization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants