
Source Google Ads: handle page token expired exception #9812

Merged · 35 commits merged into master on Feb 4, 2022

Conversation

augan-rymkhan (Contributor) commented Jan 26, 2022

What

Resolves On-Call #103 "Page token has expired."
Halving the date range (from 1 month to 15 days) did not help. If there is a huge amount of data in the date range, processing can take more than 2 hours, and the page token expires after that time.

How

Override the read_records method in IncrementalGoogleAdsStream so that it handles GoogleAdsException with the EXPIRED_PAGE_TOKEN error code, updates the start_date key in the stream_slice with the latest read record's cursor value, and then retries the sync.

The first attempt

stream_slice = {"start_date": "2021-01-01", "end_date": "2021-01-15"}

{"segments.date": "2021-01-01", "click_view.gclid": "1"},
{"segments.date": "2021-01-02", "click_view.gclid": "2"},
{"segments.date": "2021-01-03", "click_view.gclid": "3"},
{"segments.date": "2021-01-03", "click_view.gclid": "4"},

Page token has expired.

The second attempt

stream_slice = {"start_date": "2021-01-03", "end_date": "2021-01-15"}

{"segments.date": "2021-01-03", "click_view.gclid": "3"},
{"segments.date": "2021-01-03", "click_view.gclid": "4"},
{"segments.date": "2021-01-03", "click_view.gclid": "5"},
{"segments.date": "2021-01-04", "click_view.gclid": "6"},
{"segments.date": "2021-01-05", "click_view.gclid": "7"},

If the connector couldn't read all the records within one day, it would enter an infinite loop, so the sync is stopped with an error instead.
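The retry flow described above can be sketched roughly as follows. This is an illustration of the control flow only: `ExpiredPageTokenError` and `read_page` are hypothetical stand-ins for the GoogleAdsException / google-ads client calls used in the real connector.

```python
class ExpiredPageTokenError(Exception):
    """Stand-in for GoogleAdsException with the EXPIRED_PAGE_TOKEN error code."""


def read_records_with_retry(read_page, stream_slice):
    """Yield records; if the page token expires mid-read, retry the slice
    starting from the latest cursor value already read."""
    while True:
        latest_cursor = None
        try:
            for record in read_page(stream_slice):
                latest_cursor = record["segments.date"]
                yield record
            return  # finished the slice without an expired-token error
        except ExpiredPageTokenError:
            if latest_cursor is None or latest_cursor == stream_slice["start_date"]:
                # No progress within the slice: retrying would loop forever,
                # so stop the sync with the error.
                raise
            stream_slice = {**stream_slice, "start_date": latest_cursor}
```

On retry, records with the last read cursor date are re-read (as in the two-attempt example above), which is why the duplicated `click_view.gclid: "3"` and `"4"` rows appear in both attempts.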

Refactored chunk_date_range; it now returns stream slices in the following format:

[
     {"start_date": "2021-06-18", "end_date": "2021-06-27"},
     {"start_date": "2021-06-28", "end_date": "2021-07-07"},
     {"start_date": "2021-07-08", "end_date": "2021-07-17"}
]
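A simplified sketch of slice generation in that shape. The real chunk_date_range in source_google_ads/streams.py uses pendulum and also handles time zones and abnormal state; this standard-library version is for illustration only.

```python
from datetime import date, timedelta


def chunk_date_range(start_date: str, end_date: str, range_days: int) -> list:
    """Split [start_date, end_date] into consecutive slices of range_days days each."""
    slices = []
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    while start <= end:
        # A slice covers range_days days inclusive, clamped to the overall end date.
        slice_end = min(start + timedelta(days=range_days - 1), end)
        slices.append({"start_date": start.isoformat(), "end_date": slice_end.isoformat()})
        start = slice_end + timedelta(days=1)
    return slices
```

With `range_days=10`, `chunk_date_range("2021-06-18", "2021-07-17", 10)` produces exactly the three slices shown above.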

Recommended reading order

  1. source-google-ads/source_google_ads/streams.py
  2. source-google-ads/unit_tests/test_streams.py
  3. unit_tests/test_google_ads.py
  4. source-google-ads/unit_tests/test_source.py
  5. source_google_ads/custom_query_stream.py

@github-actions github-actions bot added the area/connectors Connector related issues label Jan 26, 2022
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 26, 2022 11:45 Inactive
@augan-rymkhan augan-rymkhan changed the title Arymkhan/google ads page token expired fix Source Google Ads: handle page token expired exception and reduce date range dynamically Jan 27, 2022
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 27, 2022 11:38 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 27, 2022 12:54 Inactive
@augan-rymkhan augan-rymkhan marked this pull request as ready for review January 27, 2022 13:00
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 27, 2022 16:12 Inactive
codecov bot commented Jan 27, 2022

Codecov Report

❗ No coverage uploaded for pull request base (master@9e6da46).
The diff coverage is n/a.

❗ Current head 940c658 differs from pull request most recent head 53b5536. Consider uploading reports for the commit 53b5536 to get more accurate results


@@            Coverage Diff            @@
##             master    #9812   +/-   ##
=========================================
  Coverage          ?   72.44%           
=========================================
  Files             ?        5           
  Lines             ?      323           
  Branches          ?        0           
=========================================
  Hits              ?      234           
  Misses            ?       89           
  Partials          ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e6da46...53b5536.

augan-rymkhan (Contributor, Author) commented Jan 27, 2022

/test connector=connectors/source-google-ads

🕑 connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1757507412
❌ connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1757507412
🐛 https://gradle.com/s/wnxgbpyhxicgo
Python short test summary info:

=========================== short test summary info ============================
FAILED test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs0]
SKIPPED [1] ../usr/local/lib/python3.7/site-packages/source_acceptance_test/plugin.py:56: Skipping TestIncremental.test_two_sequential_reads because not found in the config
============= 1 failed, 18 passed, 1 skipped in 799.86s (0:13:19) ==============

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 27, 2022 17:15 Inactive
@@ -128,3 +130,66 @@ def streams(self, config: Mapping[str, Any]) -> List[Stream]:
]
)
return streams

def _read_incremental(
Contributor commented:

A few comments about this override:

  1. Please, every time we override something, add a docstring explaining why the override is needed (especially for private methods).
  2. I believe we can and should do error handling at the stream level (or in the GoogleAds class), not here.
  3. The pattern looks like a typical retry case. I don't think reducing the page size is a good solution, because we basically lose all progress. I propose retrying the query with the new state and continuing to read. The data in the response is returned in ascending order, so there should be no problem continuing from where we crashed.
    WDYT?

keu (Contributor) commented Jan 27, 2022:

@augan-rymkhan here is an explanation of the 3rd point.

Let's say we request the data:

from 2010-01-01 00:00:00 to 2010-01-15 00:00:00
we read 10 pages, but there are another 20 pages left
we crashed at cursor_value = 2010-01-05 13:00:12
we will retry the query
from 2010-01-05 13:00:12 to 2010-01-15 00:00:00.

If this doesn't work (I don't see why it wouldn't, but...), we can retry exactly the same query but continue from the page number where we crashed (10).

augan-rymkhan (Contributor, Author) commented Jan 28, 2022:

  1. I am checking whether it can be handled inside the stream class. I'll update here.
  2. stream_instance.stream_slices receives stream_state, which is the latest state taken from get_updated_state, and generates slices from that point. It doesn't lose the progress; it regenerates slices with a reduced date range starting from the latest read record's cursor value.

keu (Contributor) left a comment:

see my comments

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 31, 2022 10:53 Inactive
Currently this method returns `start_date` and `end_date` with 15 days difference.
"""

end_date = end_date or pendulum.yesterday(tz=time_zone)
Contributor commented:

Why is time_zone used only for end_date? And why do we need this parameter in general?

augan-rymkhan (Contributor, Author) commented Feb 2, 2022:

@vitaliizazmic This line was there before this change; get_date_params was a stream method. I just refactored it into a module-level function so it can be called inside the chunk_date_range function.

end_date = end_date or pendulum.yesterday(tz=time_zone)
start_date = pendulum.parse(start_date)
if start_date > pendulum.now():
    return start_date.to_date_string(), start_date.add(days=1).to_date_string()
Contributor commented:

Could you please explain why these dates are returned in this case?

Contributor (Author) commented:

# Fix issue #4806, start date should always be lower than end date.
if start_date.add(days=1).date() >= end_date.date():
    return start_date.add(days=1).to_date_string(), start_date.add(days=2).to_date_string()
return start_date.add(days=1).to_date_string(), end_date.to_date_string()
Contributor commented:

Could you please explain why one day is added to the start date?


@@ -35,16 +64,21 @@ def chunk_date_range(

# Return some state when the state is abnormal
if start_date > end_date:
    return [{field: start_date.to_date_string()}]
start, end = get_date_params(start_date.to_date_string(), time_zone=time_zone, range_days=range_days)
return [{"start_date": start, "end_date": end}]
Contributor commented:

I believe we can avoid duplicating code by setting the dates.

Contributor (Author) commented:

@vitaliizazmic Done. I refactored this line. Thanks for this suggestion!

    for record in self.parse_response(response):
        state = self.get_updated_state(state, record)
        yield record
except GoogleAdsException as e:
Contributor commented:

I think we shouldn't use names like "e".

Contributor (Author) commented:

@vitaliizazmic Changed the variable name.

start_date, end_date = parse_dates(stream_slice)
if (end_date - start_date).days == 1:
    # If the date range is 1 day, there is no need to retry, because it's the minimum date range
    raise e
Contributor commented:

Will it return a description of the error?

Contributor (Author) commented:

Yes, this will be the description: message: "Page token has expired."
I will add an extra log message here.

    raise e
else:
    # return control if no exception is raised
    return
Contributor commented:

In my view, this isn't a good solution.

Contributor (Author) commented:

@vitaliizazmic After reading records completes successfully, it needs to quit the loop. What solution do you suggest?
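For illustration, the try/except/else shape being discussed: `else` runs only when the `try` body raised nothing, which is how a fully successful read can quit the retry loop. The names here are hypothetical, not taken from the connector.

```python
def read_with_retries(attempts):
    """Run callables in order; retry on failure, return on the first success."""
    for attempt in attempts:
        try:
            result = attempt()
        except ValueError:
            continue  # e.g. adjust the slice and retry on the next iteration
        else:
            return result  # no exception was raised: quit the loop
    raise RuntimeError("all attempts failed")
```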

return start_date, end_date


def get_date_params(start_date: str, time_zone=None, range_days: int = None, end_date: pendulum.datetime = None):
Contributor commented:

This doesn't have a return type annotation.

Contributor (Author) commented:

Added the return type annotation.

try:
    response = self.google_ads_client.send_request(self.get_query(stream_slice))
    for record in self.parse_response(response):
        state = self.get_updated_state(state, record)
Contributor commented:

not sure I understand why we need to call get_updated_state here

Contributor (Author) commented:

@keu This method is called to get the cursor value from the record. I call it to be sure the cursor value is the latest state.
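A hedged sketch of what such a cursor update could look like (an assumed shape for illustration; the connector's actual get_updated_state lives in streams.py and may differ): keep the maximum cursor value seen across records.

```python
def get_updated_state(current_state, record, cursor_field="segments.date"):
    """Return a state dict holding the highest cursor value seen so far.
    Assumed shape for illustration, not the connector's exact signature."""
    candidate = record[cursor_field]
    if not current_state or candidate > current_state.get(cursor_field, ""):
        return {cursor_field: candidate}
    return current_state
```

Because the cursor is an ISO date string, lexicographic comparison matches chronological order here.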

if start_date.add(days=1).date() >= end_date.date():
    return start_date.add(days=1).to_date_string(), start_date.add(days=2).to_date_string()
return start_date.add(days=1).to_date_string(), end_date.to_date_string()
state = stream_state or {}
keu (Contributor) commented Feb 1, 2022:

The logic of this function has two parts:

  • handling the error
  • reading records (duplicating the logic of stream.read or the super class)

Can we implement this as a retry decorator, or at least move it to a separate function?

augan-rymkhan (Contributor, Author) commented Feb 2, 2022:

@keu Refactored this method; it now calls the parent's read_records method. I am not sure that using a decorator would improve this code.

keu (Contributor) left a comment:

see my comments

@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 2, 2022 07:19 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 2, 2022 09:49 Inactive
keu (Contributor) left a comment:

Let it go, as it is an on-call issue.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Feb 4, 2022
augan-rymkhan (Contributor, Author) commented Feb 4, 2022

/publish connector=connectors/source-google-ads

🕑 connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1795957523
✅ connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1795957523

@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 4, 2022 16:19 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 4, 2022 16:21 Inactive
@augan-rymkhan augan-rymkhan merged commit 359fcd8 into master Feb 4, 2022
@augan-rymkhan augan-rymkhan deleted the arymkhan/google-ads-page-token-expired-fix branch February 4, 2022 16:48
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 4, 2022 16:49 Inactive