🐌 Optimize incremental normalization runtime with snowflake #8088

Merged
merged 7 commits into master from chris/fix-incremental-normalization
Nov 19, 2021

Conversation

ChristopheDuong
Contributor

@ChristopheDuong ChristopheDuong commented Nov 18, 2021

What

Closes #7987 (comment)
Closes #8093
Closes #7775

How

  1. Thanks to a user on Slack, I was able to confirm that this PR solves the Snowflake runtime issues by forcing the left join with an empty table.
  2. Adding comments listing what each model depends on, as recommended by the dbt error messages reported in "Facebook Marketing Normalization failed with dbt compilation errors" #8093.
  3. Removing the post-hook that deleted _ab3 tables in Postgres, since this breaks incremental mode (it forces a full refresh because the table no longer exists). Instead, I renamed it to _tmp.
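
As an illustration of item 2, the hint that dbt suggests in its compilation errors is a `-- depends_on: {{ ref('...') }}` comment at the top of the model. Below is a minimal sketch of how such a header could be generated; `depends_on_header` and the model names are hypothetical and are not taken from `stream_processor.py`.

```python
from typing import Iterable


def depends_on_header(model_names: Iterable[str]) -> str:
    """Build one `-- depends_on` hint per referenced dbt model.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    # Sort and de-duplicate so the generated SQL is stable across runs.
    unique_names = sorted(set(model_names))
    # Quadruple braces in an f-string produce the literal `{{` and `}}`
    # that dbt's Jinja templating expects.
    return "\n".join(
        f"-- depends_on: {{{{ ref('{name}') }}}}" for name in unique_names
    )


# Example model names are assumptions chosen only for this sketch.
header = depends_on_header(["users_ab3", "users_stg", "users_ab3"])
print(header)
```

The header would be written before the model's SQL, so that dbt can resolve the dependency even when the `ref` only appears inside a conditional block.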

Recommended reading order

  1. `airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py`
  2. the rest

🚨 User Impact 🚨

This is not a breaking change, but Postgres users will see "pollution" from an intermediate table in "dedupe history" mode. Because of a limitation between dbt and Postgres, this table needs to be persisted (in a different staging schema) for incremental to work. These tables will be suffixed with _stg.
On the other hand, the speed impact of this trade-off is difficult to measure without #7741.

@ChristopheDuong ChristopheDuong marked this pull request as draft November 18, 2021 09:26
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 18, 2021 09:28 Inactive
@ChristopheDuong
Contributor Author

ChristopheDuong commented Nov 18, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1475811514
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1475811514
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     120      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468    287    39%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     29    79%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    472    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                       Stmts   Miss  Cover
	 --------------------------------------------------------------
	 base_python/__init__.py                       13      0   100%
	 base_python/catalog_helpers.py                10      6    40%
	 base_python/cdk/__init__.py                    0      0   100%
	 base_python/cdk/abstract_source.py            83     59    29%
	 base_python/cdk/streams/__init__.py            0      0   100%
	 base_python/cdk/streams/auth/__init__.py       0      0   100%
	 base_python/cdk/streams/auth/core.py           8      1    88%
	 base_python/cdk/streams/auth/jwt.py            5      5     0%
	 base_python/cdk/streams/auth/oauth.py         37     26    30%
	 base_python/cdk/streams/auth/token.py          9      4    56%
	 base_python/cdk/streams/core.py               63     32    49%
	 base_python/cdk/streams/exceptions.py         10      2    80%
	 base_python/cdk/streams/http.py               67     33    51%
	 base_python/cdk/streams/rate_limiting.py      30     14    53%
	 base_python/cdk/utils/__init__.py              0      0   100%
	 base_python/cdk/utils/casing.py                4      0   100%
	 base_python/client.py                         56     33    41%
	 base_python/entrypoint.py                     70     56    20%
	 base_python/integration.py                    52     25    52%
	 base_python/logger.py                         33     19    42%
	 base_python/schema_helpers.py                 56     41    27%
	 base_python/source.py                         51     34    33%
	 main_dev.py                                    3      3     0%
	 --------------------------------------------------------------
	 TOTAL                                        660    393    40%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        75      8    89%
	 source_acceptance_test/conftest.py                     108    108     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              200     94    53%
	 source_acceptance_test/tests/test_full_refresh.py       38     27    29%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  41     24    41%
	 source_acceptance_test/utils/compare.py                 62     25    60%
	 source_acceptance_test/utils/connector_runner.py        82     49    40%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  896    440    51%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     12    92%
	 normalization/transform_catalog/destination_name_transformer.py     120      4    97%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468     37    92%
	 normalization/transform_catalog/table_name_registry.py              174     51    71%
	 normalization/transform_catalog/transform.py                         45     30    33%
	 normalization/transform_catalog/utils.py                             33      0   100%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     45    68%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    185    84%

@jrhizor jrhizor temporarily deployed to more-secrets November 18, 2021 09:50 Inactive
Contributor

@sherifnada sherifnada left a comment

@ChristopheDuong thanks for listing the user impact! very helpful :) will these tables be removed automatically or is it something we'll need to remove over time? (or should they just never be removed? if they shouldn't be removed would it make sense to call them something other than _tmp maybe _metadata?)

@ChristopheDuong
Contributor Author

ChristopheDuong commented Nov 18, 2021

@ChristopheDuong thanks for listing the user impact! very helpful :) will these tables be removed automatically or is it something we'll need to remove over time? (or should they just never be removed? if they shouldn't be removed would it make sense to call them something other than _tmp maybe _metadata?)

It has to be kept, with at least the row that has the maximum emitted_at. For the moment we keep the whole table, though...
If they are removed, they'll be rebuilt in full refresh on the next sync.
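
To make the point about the maximum emitted_at concrete, here is a minimal sketch of the selection logic in plain Python rather than dbt SQL. The function name, the row shape, and the inclusive `>=` comparison are assumptions made for illustration; the real logic lives in the generated dbt models.

```python
from typing import Dict, List, Optional

Row = Dict[str, int]


def rows_to_process(raw_rows: List[Row], staged_rows: Optional[List[Row]]) -> List[Row]:
    """Pick the raw rows that the next sync would need to normalize.

    Hypothetical helper; `emitted_at` is modeled as a plain integer.
    """
    # Missing or empty staging table: nothing to compare against,
    # so everything is rebuilt (full refresh).
    if not staged_rows:
        return list(raw_rows)
    # Only the maximum emitted_at is needed from the staging table,
    # which is why a single row would be enough to keep.
    cutoff = max(row["emitted_at"] for row in staged_rows)
    # Assumption: the comparison is inclusive, so rows sharing the
    # cutoff timestamp are reprocessed rather than skipped.
    return [row for row in raw_rows if row["emitted_at"] >= cutoff]


raw = [
    {"id": 1, "emitted_at": 10},
    {"id": 2, "emitted_at": 20},
    {"id": 3, "emitted_at": 30},
]
print(rows_to_process(raw, None))
print(rows_to_process(raw, [{"id": 2, "emitted_at": 20}]))
```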

_metadata adds a lot more characters to table names, where we have a limited number of... Can you think of something with only 3 letters?

@sherifnada
Contributor

_info seems reasonable? or does it have to be 3?

@ChristopheDuong
Contributor Author

ChristopheDuong commented Nov 19, 2021

_info seems reasonable? or does it have to be 3?

Yes, it has to be 3, unless we increase the number of reserved characters here:

# we keep 4 characters for 1 underscore and 3 characters for suffix (_ab1, _ab2, etc)

Let's go with _stg?
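
For readers unfamiliar with the constraint, here is a rough sketch of why the suffix length is fixed. It assumes a 63-character identifier limit (the Postgres default) and simple right-truncation; `with_suffix` and `SUFFIX_RESERVED` are hypothetical names, and the actual truncation logic in the normalization code may differ.

```python
SUFFIX_RESERVED = 4  # 1 underscore + 3 characters, as in the comment quoted above


def with_suffix(table_name: str, suffix: str, max_length: int = 63) -> str:
    """Append `_<suffix>` while keeping the result within max_length.

    Hypothetical helper; 63 is the default Postgres identifier limit.
    """
    if len(suffix) != SUFFIX_RESERVED - 1:
        raise ValueError(
            f"suffix must be exactly {SUFFIX_RESERVED - 1} characters: {suffix!r}"
        )
    # Truncate the base name first so the reserved suffix always fits.
    base = table_name[: max_length - SUFFIX_RESERVED]
    return f"{base}_{suffix}"


print(with_suffix("users", "stg"))
print(len(with_suffix("x" * 100, "stg")))
```

A four-letter suffix such as `info` is rejected by this sketch, which mirrors the reason `_stg` was proposed instead.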

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Nov 19, 2021
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 19, 2021 10:37 Inactive
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Nov 19, 2021
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 19, 2021 10:51 Inactive
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 19, 2021 10:58 Inactive
@github-actions github-actions bot added the area/connectors Connector related issues label Nov 19, 2021
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 19, 2021 11:08 Inactive
@ChristopheDuong
Contributor Author

ChristopheDuong commented Nov 19, 2021

/publish connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1480740149
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1480740149

@jrhizor jrhizor temporarily deployed to more-secrets November 19, 2021 11:19 Inactive
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 19, 2021 13:09 Inactive
@ChristopheDuong ChristopheDuong temporarily deployed to more-secrets November 19, 2021 13:40 Inactive
@ChristopheDuong ChristopheDuong merged commit c5a7267 into master Nov 19, 2021
@ChristopheDuong ChristopheDuong deleted the chris/fix-incremental-normalization branch November 19, 2021 14:03
@ChristopheDuong ChristopheDuong changed the title Optimize incremental normalization runtime with snowflake 🐌 Optimize incremental normalization runtime with snowflake Nov 19, 2021
@ChristopheDuong
Contributor Author

Closes #7775

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation area/platform issues related to the platform area/worker Related to worker connectors/destination/snowflake connectors/destinations-warehouse normalization
Projects
None yet
4 participants