🐌 Optimize incremental normalization runtime with snowflake #8088

ChristopheDuong · 2021-11-18T09:26:50Z

What

Closes #7987 (comment)
Closes #8093
Closes #7775

How

Thanks to a user on Slack, I was able to confirm that this PR solves the Snowflake runtime issues by forcing the left join with an empty table.
Added comments listing what each model depends on, as recommended by the dbt error messages reported in "Facebook Marketing Normalization failed with dbt compilation errors" #8093.
Removed the post-hook that deleted _ab3 tables in Postgres, since this breaks incremental (it forces a full refresh because the table would not exist anymore). So I renamed it _tmp instead.

🚨 User Impact 🚨

Not a breaking change, but Postgres users will see "pollution" from an intermediate table in "dedupe history" mode. This table needs to be persisted (in a different staging schema) for incremental to work, because of a limitation between dbt and Postgres. These tables will be suffixed with _stg.
On the other hand, the speed impact of this trade-off is difficult to measure without #7741.

ChristopheDuong · 2021-11-18T09:47:47Z

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1475811514
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1475811514
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     120      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468    287    39%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     29    79%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    472    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                       Stmts   Miss  Cover
	 --------------------------------------------------------------
	 base_python/__init__.py                       13      0   100%
	 base_python/catalog_helpers.py                10      6    40%
	 base_python/cdk/__init__.py                    0      0   100%
	 base_python/cdk/abstract_source.py            83     59    29%
	 base_python/cdk/streams/__init__.py            0      0   100%
	 base_python/cdk/streams/auth/__init__.py       0      0   100%
	 base_python/cdk/streams/auth/core.py           8      1    88%
	 base_python/cdk/streams/auth/jwt.py            5      5     0%
	 base_python/cdk/streams/auth/oauth.py         37     26    30%
	 base_python/cdk/streams/auth/token.py          9      4    56%
	 base_python/cdk/streams/core.py               63     32    49%
	 base_python/cdk/streams/exceptions.py         10      2    80%
	 base_python/cdk/streams/http.py               67     33    51%
	 base_python/cdk/streams/rate_limiting.py      30     14    53%
	 base_python/cdk/utils/__init__.py              0      0   100%
	 base_python/cdk/utils/casing.py                4      0   100%
	 base_python/client.py                         56     33    41%
	 base_python/entrypoint.py                     70     56    20%
	 base_python/integration.py                    52     25    52%
	 base_python/logger.py                         33     19    42%
	 base_python/schema_helpers.py                 56     41    27%
	 base_python/source.py                         51     34    33%
	 main_dev.py                                    3      3     0%
	 --------------------------------------------------------------
	 TOTAL                                        660    393    40%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        75      8    89%
	 source_acceptance_test/conftest.py                     108    108     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              200     94    53%
	 source_acceptance_test/tests/test_full_refresh.py       38     27    29%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  41     24    41%
	 source_acceptance_test/utils/compare.py                 62     25    60%
	 source_acceptance_test/utils/connector_runner.py        82     49    40%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  896    440    51%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     120      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468    287    39%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     29    79%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    472    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     12    92%
	 normalization/transform_catalog/destination_name_transformer.py     120      4    97%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468     37    92%
	 normalization/transform_catalog/table_name_registry.py              174     51    71%
	 normalization/transform_catalog/transform.py                         45     30    33%
	 normalization/transform_catalog/utils.py                             33      0   100%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     45    68%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    185    84%

sherifnada

@ChristopheDuong thanks for listing the user impact! very helpful :) will these tables be removed automatically or is it something we'll need to remove over time? (or should they just never be removed? if they shouldn't be removed would it make sense to call them something other than _tmp maybe _metadata?)

ChristopheDuong · 2021-11-18T20:38:18Z

> @ChristopheDuong thanks for listing the user impact! very helpful :) will these tables be removed automatically or is it something we'll need to remove over time? (or should they just never be removed? if they shouldn't be removed would it make sense to call them something other than _tmp maybe _metadata?)

The table has to be kept, with at least the row that has the maximum emitted_at. For the moment we keep the whole table though...
If the tables are removed, they'll be rebuilt with a full refresh on the next sync.
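
To illustrate the point about the maximum emitted_at, here is a minimal Python sketch of the incremental selection, not the actual dbt model. The function name `select_new_rows`, the in-memory row dictionaries, and the `>=` comparison are assumptions made for illustration; only the idea of a high-water mark taken from the persisted table, with a fall back to a full refresh when that table is gone, comes from the comment above.

```python
from typing import Dict, List, Optional

Row = Dict[str, int]


def select_new_rows(
    source_rows: List[Row],
    staging_rows: Optional[List[Row]],
    cursor: str = "emitted_at",
) -> List[Row]:
    """Return the source rows an incremental run would need to process.

    Hypothetical helper: the real logic lives in generated dbt SQL.
    """
    if not staging_rows:
        # The persisted table is missing or empty: no high-water mark is
        # available, so everything is reprocessed (a full refresh).
        return list(source_rows)

    # Only the row with the maximum cursor value is needed for this step.
    high_water_mark = max(row[cursor] for row in staging_rows)

    # Assumption: ">=" is used so rows sharing the latest timestamp are kept.
    return [row for row in source_rows if row[cursor] >= high_water_mark]


source = [
    {"id": 1, "emitted_at": 10},
    {"id": 2, "emitted_at": 20},
    {"id": 3, "emitted_at": 30},
]
incremental = select_new_rows(source, [{"id": 1, "emitted_at": 20}])
full_refresh = select_new_rows(source, None)
```

In this sketch, dropping the persisted table simply makes the next run process every row again, which matches the behaviour described above.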

_metadata adds a lot more characters to table names, where we have a limited number of... can you think of something with only 3 letters?

sherifnada · 2021-11-18T22:24:56Z

_info seems reasonable? or does it have to be 3?

ChristopheDuong · 2021-11-19T08:48:19Z

> _info seems reasonable? or does it have to be 3?

Yes, it has to be 3, unless we increase the number of reserved characters here:

airbyte/airbyte-integrations/bases/base-normalization/normalization/transform_catalog/destination_name_transformer.py

Line 32 in ad3b7be

    
           # we keep 4 characters for 1 underscore and 3 characters for suffix (_ab1, _ab2, etc)

Let's go with _stg?
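
The length constraint discussed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in destination_name_transformer.py: the function name `with_suffix` and the simple right-truncation are hypothetical, and the limit of 63 in the usage lines is Postgres's default identifier length. Only the rule of reserving 4 characters (1 underscore plus a 3-character suffix) comes from the quoted comment.

```python
# 1 underscore + 3 characters for the suffix (_ab1, _ab2, _stg, ...)
RESERVED_SUFFIX_LENGTH = 4


def with_suffix(table_name: str, suffix: str, max_length: int) -> str:
    """Append a 3-character suffix without exceeding max_length.

    Hypothetical helper: the truncation strategy here (cutting the end of
    the name) is an assumption and may differ from the real transformer.
    """
    if len(suffix) != RESERVED_SUFFIX_LENGTH - 1:
        raise ValueError("suffix must be exactly 3 characters")
    budget = max_length - RESERVED_SUFFIX_LENGTH
    if budget < 1:
        raise ValueError("max_length leaves no room for the table name")
    # Truncate the base name so the suffix always fits in the reserved space.
    return f"{table_name[:budget]}_{suffix}"


short_name = with_suffix("users", "stg", 63)
long_name = with_suffix("a" * 70, "stg", 63)
```

With this budget, a suffix such as _info (5 characters including the underscore) would not fit unless the reserved length were increased, which is why a 3-letter suffix is preferred in the discussion.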

ChristopheDuong · 2021-11-19T11:17:41Z

/publish connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1480740149
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/1480740149

ChristopheDuong · 2021-11-29T14:16:57Z

Closes #7775

normalization fix

09a698f

ChristopheDuong marked this pull request as draft November 18, 2021 09:26

github-actions bot added the normalization label Nov 18, 2021

ChristopheDuong temporarily deployed to more-secrets November 18, 2021 09:28 Inactive

jrhizor temporarily deployed to more-secrets November 18, 2021 09:50 Inactive

Add depends on comments

63b538b

ChristopheDuong temporarily deployed to more-secrets November 18, 2021 14:53 Inactive

ChristopheDuong marked this pull request as ready for review November 18, 2021 14:53

ChristopheDuong requested review from sherifnada, tuliren and marcosmarxm November 18, 2021 14:53

ChristopheDuong mentioned this pull request Nov 18, 2021

Syncs are taking hours after upgrading instance type #7656

Closed

sherifnada approved these changes Nov 18, 2021

Bump version

4c5a1c6

github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Nov 19, 2021

ChristopheDuong temporarily deployed to more-secrets November 19, 2021 10:37 Inactive

Update docs

e090a39

github-actions bot added the area/documentation Improvements or additions to documentation label Nov 19, 2021

ChristopheDuong temporarily deployed to more-secrets November 19, 2021 10:51 Inactive

Merge remote-tracking branch 'origin/master' into fix-incremental-nor…

70a3bc6

…malization

ChristopheDuong temporarily deployed to more-secrets November 19, 2021 10:58 Inactive

Format code

8db7573

github-actions bot added the area/connectors Connector related issues label Nov 19, 2021

ChristopheDuong temporarily deployed to more-secrets November 19, 2021 11:08 Inactive

jrhizor temporarily deployed to more-secrets November 19, 2021 11:19 Inactive

ChristopheDuong temporarily deployed to more-secrets November 19, 2021 13:09 Inactive

Fix acceptance tests

aca0cf4

ChristopheDuong temporarily deployed to more-secrets November 19, 2021 13:40 Inactive

ChristopheDuong merged commit c5a7267 into master Nov 19, 2021

ChristopheDuong deleted the chris/fix-incremental-normalization branch November 19, 2021 14:03

ChristopheDuong changed the title ~~Optimize incremental normalization runtime with snowflake~~ 🐌 Optimize incremental normalization runtime with snowflake Nov 19, 2021

This was referenced Nov 19, 2021

Bump Airbyte version from 0.32.4-alpha to 0.32.5-alpha #8144

Closed

Bump Airbyte version from 0.32.4-alpha to 0.32.5-alpha #8147

Closed

Bump Airbyte version from 0.32.4-alpha to 0.32.5-alpha #8153

Merged

ChristopheDuong mentioned this pull request Nov 29, 2021

Improve incremental normalization #7775

Closed

schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022

🐛🐌 Optimize incremental normalization runtime with snowflake (airbyte…

82b6979

…hq#8088)

karinakuz added connectors/destinations-warehouse connectors/destination/snowflake labels Jan 18, 2022
