Source Google Analytics v4: - add pk and lookback window #26283
Conversation
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
…cs-data-api-2 Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
…' of github.com:airbytehq/airbyte into grubberr/oncall-2055-source-google-analytics-data-api-2
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
/test connector=connectors/source-google-analytics-data-api
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
…cs-data-api-2 Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
After your recommendation to find a smoother transition for customers, I have added this commit. The main idea: produce the "uuid" field for already existing syncs that still work on the old configured_catalog.

THE MAIN PROBLEM BEFORE THIS IMPROVEMENT: the new connector 0.3.0 stops producing the "uuid" field, but the old configured_catalog still announces that the primary_key is "uuid"; after normalization we get an empty "uuid" field, which will be de-duplicated.

AFTER THIS IMPROVEMENT: it is still preferable to ask customers to press "refresh source schema" after upgrading to 0.3.0.

tag @pedroslopez
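The transition idea described above can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the helper name and the `[["uuid"]]` primary-key shape are assumptions for the example.

```python
import uuid


def add_uuid_for_old_catalog(record: dict, configured_primary_key) -> dict:
    # configured_primary_key would come from the sync's configured_catalog;
    # a value of [["uuid"]] (assumed shape) marks a catalog created before 0.3.0.
    # Emitting a fresh uuid keeps normalization from de-duplicating on an
    # empty field for those older syncs.
    if configured_primary_key and ["uuid"] in configured_primary_key:
        record["uuid"] = str(uuid.uuid4())
    return record
```

Once the customer refreshes the source schema, the catalog's primary key no longer contains "uuid" and the helper leaves records untouched.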
…cs-data-api-2 Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
…cs-data-api-2 Signed-off-by: Sergey Chvalyuk <grubberr@gmail.com>
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
Given the improvements and the added work, I will approve this, conditional on a strategy to refresh the source schema.
@girarda @pedroslopez @artem1205 @davydov-d tagging you all as requested reviewers; just wanted to see what needs to be done to get this merged.
code change looks sane. Thank you for working on smoothing out the transition!
questions:
- have you tested both streams with the new pk and with the old one to ensure the transition is smooth?
- can you update the release playbook?
- how will we know when we can get rid of the transition code?
@@ -144,6 +144,14 @@ def get_json_schema(self) -> Mapping[str, Any]:
}
)

if "cohort_spec" not in self.config and "date" not in self.config["dimensions"]:
can you add a comment on this block? It's not obvious why "cohort_spec" is a special case
Added a comment; cohorts don't support startDate and endDate.
# We pass the uuid field for synchronizations which still have the old
# configured_catalog with the old primary key. We need it to avoid removal of rows
# in the deduplication process. As soon as the customer presses "refresh source schema",
# this part is no longer needed.
how will we know when we can remove this code block?
I guess we need help from the TSC team to communicate with users and set a due date; after that date we can remove this code block.
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
I have tested it locally by generating new and old configured catalogs and running this code with those catalogs; reading succeeded.
/test connector=connectors/source-google-analytics-data-api
Build Passed. Test summary info:
@darynaishchenko Do users need to run a reset after refreshing their source schema? If they do, would there be a possibility of hitting our rate limit/quota for OAuth?
This PR doesn't change the connector's spec, so users don't need to run a reset. @bazarnov, please correct me if I'm wrong.
@darynaishchenko Gotcha, thank you! If there's no reset needed, I'll send out the outreach today. Can we schedule this to be released on Thursday, June 22nd around 9am PT, 4pm UTC?
yes, thank you
source-google-analytics-data-api test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-google-analytics-data-api/metadata.yaml | ✅ |
Connector version semver check. | ✅ |
Connector version increment check. | ✅ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-google-analytics-data-api docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Acceptance tests | ✅ |
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command:
airbyte-ci connectors --name=source-google-analytics-data-api test
What is the effect of LOOKBACK_WINDOW = 2? I ask because I'm starting to wonder whether GA4 considers the data "golden" for a given day before it actually is. I've been moving daily cron schedules for GA4 syncs to make sure that once a sync is complete, any date in the past is "golden" and won't change if I were to reset the stream and re-sync that date. Is there a generally accepted best time to schedule a daily GA4 sync now that LOOKBACK_WINDOW has been implemented?
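For context on the question: a lookback window of this kind is commonly applied by shifting the incremental start date back before reading, so recent, still-changing days are re-fetched and later de-duplicated. A minimal sketch, assuming a daily date cursor; only the constant name comes from the PR title, the function is illustrative:

```python
from datetime import date, timedelta

LOOKBACK_WINDOW = 2  # constant name from the PR title; value as asked about


def incremental_start(cursor_date: date, config_start_date: date) -> date:
    # Re-read the last LOOKBACK_WINDOW days before the saved cursor, but
    # never go earlier than the configured start date. Re-fetched rows are
    # then overwritten/de-duplicated via the primary key.
    return max(config_start_date, cursor_date - timedelta(days=LOOKBACK_WINDOW))
```

With this scheme, each daily sync re-reads the two most recent days, so values GA4 revises shortly after the fact get corrected on the next run.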
What
Tries to fix these oncall issues:
https://github.com/airbytehq/oncall/issues/2055
https://github.com/airbytehq/oncall/issues/2065
The customer complains that metric data for recent days is incomplete.
I propose to re-fetch the data multiple times and use deduplication.
Backward incompatible!
https://docs.google.com/document/d/11cx-ZTBsWyt4ZVVURTbm_etvX4IYp1QUlbT9FzcdCzE/edit
Primary Key changed:
Because the new implementation removed the old `uuid` field, which was the old `primary_key`, existing synchronizations can lose data because of the deduplication mechanism. If the customer runs "refresh source schema", everything works OK.

All dimensions are the Primary key
The main idea of this PR is to make all dimensions that we pass to a stream into a multi-field primary_key. For example, for our stream `devices`:
primary_key = ["date", "deviceCategory", "operatingSystem", "browser"]
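De-duplication over such a composite key can be sketched as follows; the `dedupe` helper is illustrative (the real work is done by normalization downstream), but the key fields mirror the `devices` stream above.

```python
# Dimension fields of the `devices` stream, acting as a composite primary key.
PRIMARY_KEY = ["date", "deviceCategory", "operatingSystem", "browser"]


def dedupe(records):
    # Keep only the latest copy of each row: later (re-fetched) records
    # overwrite earlier ones that share the same tuple of dimension values.
    latest = {}
    for rec in records:
        latest[tuple(rec[f] for f in PRIMARY_KEY)] = rec
    return list(latest.values())
```

This is exactly why re-fetching recent days is safe: a corrected row carries the same dimension values and simply replaces the stale one.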
This connector is equivalent to such SQL:
As you can see, the combination of all dimensions gives us a unique set of values for every record.
It's important to understand that we have multiple such SQL queries because of connector slicing.
slice1:
WHERE date >= '2023-04-01' AND date <= '2023-04-30'
slice2:
WHERE date >= '2023-05-01' AND date <= '2023-05-25'
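The slicing that produces these per-range queries could look roughly like this. The chunk size and function name are assumptions for the example; the actual connector's slicing logic may differ.

```python
from datetime import date, timedelta


def date_slices(start: date, end: date, chunk_days: int = 30):
    # Split [start, end] into consecutive inclusive date ranges, each at
    # most chunk_days long, matching the WHERE clauses shown above.
    slices = []
    while start <= end:
        chunk_end = min(start + timedelta(days=chunk_days - 1), end)
        slices.append({"startDate": start.isoformat(), "endDate": chunk_end.isoformat()})
        start = chunk_end + timedelta(days=1)
    return slices
```

For example, `date_slices(date(2023, 4, 1), date(2023, 5, 25))` yields two slices matching slice1 and slice2 above.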
Missing "date" dimension
A stream can be missing the "date" dimension; for example, a custom stream can be constructed without it. The previously mentioned approach, "all dimensions are the primary_key", then no longer works, because PK data can now be duplicated across slices.
slice1:
slice2:
Result:
To fix this problem for streams that are missing the "date" dimension, I have added two new fields from the slice, "startDate" and "endDate", which are also part of the primary key.

New Result:

Because now `primary_key = ["Browser", "startDate", "endDate"]`, everything works OK.
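Copying the slice boundaries into each record can be sketched like this; the helper name is illustrative, not the connector's actual code, but it shows why rows from different slices no longer collide on the primary key.

```python
def add_slice_dates(record: dict, stream_slice: dict) -> dict:
    # Stamp each record with its slice's date range so that identical
    # dimension values from different slices form distinct primary keys.
    record["startDate"] = stream_slice["startDate"]
    record["endDate"] = stream_slice["endDate"]
    return record
```

Two "Chrome" rows from slice1 and slice2 now differ in startDate/endDate, so deduplication keeps both.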