Source Google Analytics v4: - add pk and lookback window #26283

Merged

Conversation

grubberr
Contributor

@grubberr grubberr commented May 18, 2023

What

Attempts to fix these on-call issues:
https://github.com/airbytehq/oncall/issues/2055
https://github.com/airbytehq/oncall/issues/2065

The customer reports that metric data for recent days is incomplete.
I propose re-fetching recent data multiple times and relying on deduplication.

  • added primary_key for deduplication
  • added LOOKBACK_WINDOW = 2
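
A minimal sketch of how a lookback window and primary-key deduplication can work together. It assumes the window is counted in days and that the last record seen for a key wins; the helper names `next_start_date` and `deduplicate` are hypothetical and are not taken from the connector.

```python
from datetime import date, timedelta

LOOKBACK_WINDOW = 2  # the value named in this PR; the unit (days) is an assumption


def next_start_date(cursor_value: date, lookback_days: int = LOOKBACK_WINDOW) -> date:
    """Start the next incremental read a few days before the saved cursor,
    so that recent, possibly incomplete days are fetched again."""
    return cursor_value - timedelta(days=lookback_days)


def deduplicate(records, primary_key):
    """Keep only the most recently seen record for each primary-key tuple."""
    latest = {}
    for record in records:
        key = tuple(record[field] for field in primary_key)
        latest[key] = record  # a later record overwrites an earlier one
    return list(latest.values())
```

With a cursor at 2023-05-18, the next read would start at 2023-05-16, and any row fetched a second time replaces its earlier version as long as the primary key matches.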

Backward incompatible!

https://docs.google.com/document/d/11cx-ZTBsWyt4ZVVURTbm_etvX4IYp1QUlbT9FzcdCzE/edit

Primary Key changed:

OLD = ["uuid"]
NEW = ["dimension1", "dimension2", "dimension3"]

The new implementation removes the old uuid field, which was the previous primary_key, so existing synchronizations can lose data through the deduplication mechanism. If the customer runs "refresh source schema", everything works correctly.

All dimensions form the primary key

The main idea of this PR is to use all dimensions passed to a stream as a multi-field primary_key.
For example, for our devices stream:

primary_key = ["date", "deviceCategory", "operatingSystem", "browser"]

The connector's request is equivalent to the following SQL:

SELECT date, deviceCategory, operatingSystem, browser, sum(metric1), sum(metric2)
FROM table
WHERE date >= start and date <= end
GROUP BY date, deviceCategory, operatingSystem, browser

As you can see, the combination of all dimensions uniquely identifies every record.
It is important to understand that the connector issues several such queries, one per slice:

slice1: WHERE date >= '2023-04-01' and date <= '2023-04-30'
slice2: WHERE date >= '2023-05-01' and date <= '2023-05-25'
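
The slicing above can be sketched as follows. This is only an illustration that reproduces the two example ranges; splitting on calendar-month boundaries and the helper name `month_slices` are assumptions, and the connector's real slice size may differ.

```python
import calendar
from datetime import date, timedelta


def month_slices(start: date, end: date):
    """Split [start, end] into consecutive, non-overlapping ranges
    that each end on the last day of a month or on `end`."""
    slices = []
    current = start
    while current <= end:
        last_day = calendar.monthrange(current.year, current.month)[1]
        slice_end = min(date(current.year, current.month, last_day), end)
        slices.append((current, slice_end))
        current = slice_end + timedelta(days=1)
    return slices
```

Because the ranges never overlap, a record that carries a "date" dimension can belong to only one slice, which is why the dimensions alone are enough to identify it.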

Missing "date" dimension

A stream may lack the "date" dimension; for example, a custom stream can be constructed without it.
In that case the "all dimensions are the primary_key" approach does not work, because the same PK values can appear in more than one slice.

slice1:

SELECT browser, sum(metric1), sum(metric2)
FROM table
WHERE date >= '2023-04-01' and date <= '2023-04-30'
GROUP BY browser

slice2:

SELECT browser, sum(metric1), sum(metric2)
FROM table
WHERE date >= '2023-05-01' and date <= '2023-05-25'
GROUP BY browser

Result:

Browser  Metric1  Metric2  Slice
Chrome   500      600      slice1
Chrome   700      800      slice2

To fix this for streams that lack the "date" dimension, I have added two new fields taken from the slice, "startDate" and "endDate", which are also part of the primary key:

New Result:

Browser  Metric1  Metric2  startDate   endDate     Slice
Chrome   500      600      2023-04-01  2023-04-30  slice1
Chrome   700      800      2023-05-01  2023-05-25  slice2

Because the primary key is now primary_key=["browser", "startDate", "endDate"], records from different slices no longer collide.
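
A rough sketch of the rule described above. The condition mirrors the check on "cohort_spec" and "date" that appears in the reviewed diff later in this thread; the function names `build_primary_key` and `with_slice_bounds` and the shape of `stream_slice` are assumptions made for illustration.

```python
def build_primary_key(config):
    """Use every dimension as the primary key; add the slice bounds
    when the stream has no "date" dimension (cohort reports excluded)."""
    primary_key = list(config["dimensions"])
    if "cohort_spec" not in config and "date" not in primary_key:
        primary_key += ["startDate", "endDate"]
    return primary_key


def with_slice_bounds(record, stream_slice, primary_key):
    """Copy the slice bounds into the record when they are part of the key."""
    if "startDate" in primary_key:
        return {
            **record,
            "startDate": stream_slice["startDate"],
            "endDate": stream_slice["endDate"],
        }
    return dict(record)
```

For a stream configured with only the "browser" dimension, the Chrome rows from slice1 and slice2 end up with different key tuples, so deduplication keeps both.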

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@github-actions
Contributor

github-actions bot commented May 18, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan and you've followed all steps in the Breaking Changes Checklist
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • You, or an Airbyter, have run /test successfully on this PR - or on a non-forked branch
  • You, or an Airbyter, have run /publish successfully on this PR - or on a non-forked branch
  • You've updated the connector's metadata.yaml file new!

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@grubberr grubberr self-assigned this May 18, 2023
grubberr and others added 8 commits May 18, 2023 21:44
…cs-data-api-2

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
…' of github.com:airbytehq/airbyte into grubberr/oncall-2055-source-google-analytics-data-api-2
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@grubberr
Contributor Author

grubberr commented May 19, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5024028643

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label May 19, 2023
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@grubberr
Contributor Author

grubberr commented May 19, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5024100126
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5024100126
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/utils.py              26      2    92%
source_google_analytics_data_api/source.py            235     29    88%
source_google_analytics_data_api/api_quota.py          94     12    87%
-----------------------------------------------------------------------
TOTAL                                                 401     45    89%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
============= 39 passed, 1 skipped, 1 warning in 415.57s (0:06:55) =============

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@grubberr
Contributor Author

grubberr commented May 19, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5024623024
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5024623024
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/utils.py              26      2    92%
source_google_analytics_data_api/source.py            235     29    88%
source_google_analytics_data_api/api_quota.py          94     12    87%
-----------------------------------------------------------------------
TOTAL                                                 401     45    89%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
============= 39 passed, 1 skipped, 1 warning in 411.52s (0:06:51) =============

grubberr and others added 3 commits May 19, 2023 13:29
…cs-data-api-2

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@grubberr
Contributor Author

grubberr commented May 23, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5054176289
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5054176289
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/utils.py              26      2    92%
source_google_analytics_data_api/source.py            235     29    88%
source_google_analytics_data_api/api_quota.py          94     12    87%
-----------------------------------------------------------------------
TOTAL                                                 401     45    89%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
============= 39 passed, 1 skipped, 1 warning in 416.21s (0:06:56) =============

@lazebnyi lazebnyi requested a review from davydov-d May 23, 2023 18:34
@grubberr
Contributor Author

grubberr commented Jun 1, 2023

@maxi297

Following your recommendation to find a smoother transition for customers, I have added this commit:

618034c

The main idea: produce the "uuid" field for existing syncs that still run with the old configured_catalog.

THE MAIN PROBLEM BEFORE THIS IMPROVEMENT:

If the new connector 0.3.0 stops producing the "uuid" field but the old configured_catalog still declares "uuid" as the primary_key, then after normalization every record has an empty "uuid" field, and deduplication collapses those records.

AFTER THIS IMPROVEMENT:

It is still preferable to ask customers to press "refresh source schema" after upgrading to 0.3.0,
but in my opinion it is no longer critical if a customer forgets to do it right away.
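
The transition idea could look roughly like the sketch below. It is not the code from commit 618034c: the function names are hypothetical, and it assumes the configured catalog exposes the primary key as a list of field paths, so the legacy key appears as [["uuid"]].

```python
import uuid


def uses_legacy_primary_key(configured_primary_key) -> bool:
    """True when the sync still runs with the old catalog keyed on "uuid"."""
    return configured_primary_key == [["uuid"]]


def finalize_record(record, configured_primary_key):
    """Add a generated "uuid" only for syncs that still expect it,
    so deduplication on the old key does not collapse rows."""
    if uses_legacy_primary_key(configured_primary_key):
        return {**record, "uuid": str(uuid.uuid4())}
    return record
```

Under this sketch a legacy sync keeps every row, since each generated uuid is unique, while a sync with a refreshed schema deduplicates on the dimension-based key.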

tag @pedroslopez

@grubberr grubberr requested a review from pedroslopez June 1, 2023 07:31
…cs-data-api-2

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@grubberr
Contributor Author

grubberr commented Jun 2, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5154302312
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5154302312
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/api_quota.py          94     12    87%
source_google_analytics_data_api/source.py            259     41    84%
source_google_analytics_data_api/utils.py              39     10    74%
-----------------------------------------------------------------------
TOTAL                                                 438     65    85%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:695: This tests currently leads to too much failures. We need to fix the connectors at scale first.
============ 39 passed, 2 skipped, 1 warning in 1033.49s (0:17:13) =============

…cs-data-api-2

Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
@grubberr
Contributor Author

grubberr commented Jun 2, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5159312061
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5159312061
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/api_quota.py          94     12    87%
source_google_analytics_data_api/source.py            259     41    84%
source_google_analytics_data_api/utils.py              39     10    74%
-----------------------------------------------------------------------
TOTAL                                                 438     65    85%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:695: This tests currently leads to too much failures. We need to fix the connectors at scale first.
============ 39 passed, 2 skipped, 1 warning in 1062.16s (0:17:42) =============

Contributor

@maxi297 maxi297 left a comment

Given the improvements and the added work, I will approve this, conditional on having a strategy for getting users to refresh the source schema.

@tybernstein
Contributor

@girarda @pedroslopez @artem1205 @davydov-d tagging you all as requested reviewers, just wanted to see what needs to be done to get this merged.

Contributor

@girarda girarda left a comment

code change looks sane. Thank you for working on smoothing out the transition!

questions:

  • have you tested both streams with the new pk and with the old one to ensure the transition is smooth?
  • can you update the release playbook?
  • how will we know when we can get rid of the transition code?

@@ -144,6 +144,14 @@ def get_json_schema(self) -> Mapping[str, Any]:
}
)

if "cohort_spec" not in self.config and "date" not in self.config["dimensions"]:
Contributor

can you add a comment on this block? It's not obvious why "cohort_spec" is a special case

Collaborator

Added a comment: cohorts don't support startDate and endDate.

# We pass the uuid field for synchronizations which still have the old
# configured_catalog with the old primary key. We need it to avoid removing rows
# in the deduplication process. As soon as the customer presses "refresh source schema"
# this part is no longer needed.
Contributor

how will we know when we can remove this code block?

Collaborator

I guess we need help from the TSC team to communicate with users and set a due date; after that date we can remove this code block.

@darynaishchenko
Collaborator

darynaishchenko commented Jun 9, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5221217048
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5221217048
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/api_quota.py          94     12    87%
source_google_analytics_data_api/source.py            259     41    84%
source_google_analytics_data_api/utils.py              39     10    74%
-----------------------------------------------------------------------
TOTAL                                                 438     65    85%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:695: This tests currently leads to too much failures. We need to fix the connectors at scale first.
============ 39 passed, 2 skipped, 1 warning in 1217.93s (0:20:17) =============

@darynaishchenko
Collaborator

code change looks sane. Thank you for working on smoothing out the transition!

questions:

  • have you tested both streams with the new pk and with the old one to ensure the transition is smooth?

I have tested it locally by generating new and old configured catalogs and running this code with each of them; reading succeeded in both cases.

@darynaishchenko
Collaborator

darynaishchenko commented Jun 14, 2023

/test connector=connectors/source-google-analytics-data-api

🕑 connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5265148114
✅ connectors/source-google-analytics-data-api https://github.com/airbytehq/airbyte/actions/runs/5265148114
Python tests coverage:

Name                                                Stmts   Miss  Cover
-----------------------------------------------------------------------
source_google_analytics_data_api/__init__.py            2      0   100%
source_google_analytics_data_api/authenticator.py      44      2    95%
source_google_analytics_data_api/api_quota.py          98     12    88%
source_google_analytics_data_api/utils.py              56     10    82%
source_google_analytics_data_api/source.py            267     49    82%
-----------------------------------------------------------------------
TOTAL                                                 467     73    84%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:695: This tests currently leads to too much failures. We need to fix the connectors at scale first.
============ 39 passed, 2 skipped, 1 warning in 1168.80s (0:19:28) =============

@amaliaroye
Contributor

@darynaishchenko Do users need to run a reset after refreshing their source schema? If they do, would there be a possibility of hitting our rate limit/quota for OAuth?

@darynaishchenko
Collaborator

@darynaishchenko Do users need to run a reset after refreshing their source schema? If they do, would there be a possibility of hitting our rate limit/quota for OAuth?

This PR doesn't change the connector's spec, so users don't need to run a reset. @bazarnov, please correct me if I'm wrong.

@amaliaroye
Contributor

@darynaishchenko Gotcha, thank you! If there's no reset needed, I'll send out the outreach today, and can we schedule this to be released on Thursday, June 22nd around 9am PT, 4pm UTC?

@darynaishchenko
Collaborator

@darynaishchenko Gotcha, thank you! If there's no reset needed, I'll send out the outreach today, and can we schedule this to be released on Thursday, June 22nd around 9am PT, 4pm UTC?

yes, thank you

@octavia-squidington-iii
Collaborator

source-google-analytics-data-api test report (commit 7ee544d6bd) - ✅

⏲️ Total pipeline duration: 1690 seconds

Step Result
Validate airbyte-integrations/connectors/source-google-analytics-data-api/metadata.yaml
Connector version semver check.
Connector version increment check.
QA checks
Code format checks
Connector package install
Build source-google-analytics-data-api docker image for platform linux/x86_64
Unit tests
Acceptance tests

🔗 View the logs here

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-google-analytics-data-api test

@darynaishchenko darynaishchenko merged commit 019153f into master Jun 22, 2023
22 checks passed
@darynaishchenko darynaishchenko deleted the grubberr/oncall-2055-source-google-analytics-data-api-2 branch June 22, 2023 16:00
@dancook-doxo

dancook-doxo commented Jul 5, 2023

What

Try to fix this oncalls https://github.com/airbytehq/oncall/issues/2055 https://github.com/airbytehq/oncall/issues/2065

The customer complains that data from metrics of recent days is not full. I propose to re-fetch data multiple times and use deduplication.

  • added primary_key for deduplication
  • added LOOKBACK_WINDOW = 2

What is the effect of LOOKBACK_WINDOW = 2? Reason I ask is because I'm starting to wonder if GA4 considers the data "golden" for a given day before this is actually true. I've been moving daily cron schedules for GA4 syncs in an attempt to make sure that once a sync is complete then any date in the past is "golden" and won't change if I were to reset the stream and re-sync that date.

Is there a generally accepted best time to schedule a daily GA4 sync now that LOOKBACK_WINDOW has been implemented?

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/google-analytics-data-api