
Source Google Ads: handle page token expired exception #9812

Merged · 35 commits merged into master on Feb 4, 2022

Conversation

augan-rymkhan (Contributor) commented Jan 26, 2022

What

Resolves On-Call #103 "Page token has expired."
Halving the date range (from 1 month to 15 days) did not help. If there is a huge amount of data in the date range, processing can take more than 2 hours, and the page token expires after that time.

How

Override the read_records method in IncrementalGoogleAdsStream so that it handles GoogleAdsException with the EXPIRED_PAGE_TOKEN error code, updates the start_date key in the stream_slice with the latest read record's cursor value, and then retries the sync.

The first attempt

stream_slice = {"start_date": "2021-01-01", "end_date": "2021-01-15"}

{"segments.date": "2021-01-01", "click_view.gclid": "1"},
{"segments.date": "2021-01-02", "click_view.gclid": "2"},
{"segments.date": "2021-01-03", "click_view.gclid": "3"},
{"segments.date": "2021-01-03", "click_view.gclid": "4"},

Page token has expired.

The second attempt

stream_slice = {"start_date": "2021-01-03", "end_date": "2021-01-15"}

{"segments.date": "2021-01-03", "click_view.gclid": "3"},
{"segments.date": "2021-01-03", "click_view.gclid": "4"},
{"segments.date": "2021-01-03", "click_view.gclid": "5"},
{"segments.date": "2021-01-04", "click_view.gclid": "6"},
{"segments.date": "2021-01-05", "click_view.gclid": "7"},

If the connector couldn't read all the records within one day, it would enter an infinite loop, so the sync is stopped with an error instead.
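The retry flow described above can be sketched roughly as follows. This is an illustration of the control flow only: `ExpiredPageTokenError` and `read_page` are hypothetical stand-ins for the GoogleAdsException / google-ads client calls used in the real connector.

```python
class ExpiredPageTokenError(Exception):
    """Stand-in for GoogleAdsException with the EXPIRED_PAGE_TOKEN error code."""


def read_records_with_retry(read_page, stream_slice):
    """Yield records; if the page token expires mid-read, retry the slice
    starting from the latest cursor value already read."""
    while True:
        latest_cursor = None
        try:
            for record in read_page(stream_slice):
                latest_cursor = record["segments.date"]
                yield record
            return  # finished the slice without an expired-token error
        except ExpiredPageTokenError:
            if latest_cursor is None or latest_cursor == stream_slice["start_date"]:
                # No progress within the slice: retrying would loop forever,
                # so stop the sync with the error.
                raise
            stream_slice = {**stream_slice, "start_date": latest_cursor}
```

On retry, records with the last read cursor date are re-read (as in the two-attempt example above), which is why the duplicated `click_view.gclid: "3"` and `"4"` rows appear in both attempts.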

Refactored chunk_date_range; it now returns stream slices in the following format:

[
     {"start_date": "2021-06-18", "end_date": "2021-06-27"},
     {"start_date": "2021-06-28", "end_date": "2021-07-07"},
     {"start_date": "2021-07-08", "end_date": "2021-07-17"}
]
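A simplified sketch of slice generation in that shape. The real chunk_date_range in source_google_ads/streams.py uses pendulum and also handles time zones and abnormal state; this standard-library version is for illustration only.

```python
from datetime import date, timedelta


def chunk_date_range(start_date: str, end_date: str, range_days: int) -> list:
    """Split [start_date, end_date] into consecutive slices of range_days days each."""
    slices = []
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    while start <= end:
        # A slice covers range_days days inclusive, clamped to the overall end date.
        slice_end = min(start + timedelta(days=range_days - 1), end)
        slices.append({"start_date": start.isoformat(), "end_date": slice_end.isoformat()})
        start = slice_end + timedelta(days=1)
    return slices
```

With `range_days=10`, `chunk_date_range("2021-06-18", "2021-07-17", 10)` produces exactly the three slices shown above.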

Recommended reading order

  1. source-google-ads/source_google_ads/streams.py
  2. source-google-ads/unit_tests/test_streams.py
  3. unit_tests/test_google_ads.py
  4. source-google-ads/unit_tests/test_source.py
  5. source_google_ads/custom_query_stream.py

@github-actions github-actions bot added the area/connectors Connector related issues label Jan 26, 2022
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 26, 2022 11:45 Inactive
@augan-rymkhan augan-rymkhan changed the title Arymkhan/google ads page token expired fix Source Google Ads: handle page token expired exception and reduce date range dynamically Jan 27, 2022
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 27, 2022 11:38 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 27, 2022 12:54 Inactive
@augan-rymkhan augan-rymkhan marked this pull request as ready for review January 27, 2022 13:00
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 27, 2022 16:12 Inactive
codecov bot commented Jan 27, 2022

Codecov Report

❗ No coverage uploaded for pull request base (master@9e6da46).
The diff coverage is n/a.

❗ Current head 940c658 differs from pull request most recent head 53b5536. Consider uploading reports for the commit 53b5536 to get more accurate results


@@            Coverage Diff            @@
##             master    #9812   +/-   ##
=========================================
  Coverage          ?   72.44%           
=========================================
  Files             ?        5           
  Lines             ?      323           
  Branches          ?        0           
=========================================
  Hits              ?      234           
  Misses            ?       89           
  Partials          ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e6da46...53b5536.

augan-rymkhan (Contributor, Author) commented Jan 27, 2022

/test connector=connectors/source-google-ads

🕑 connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1757507412
❌ connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1757507412
🐛 https://gradle.com/s/wnxgbpyhxicgo
Python short test summary info:

=========================== short test summary info ============================
FAILED test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs0]
SKIPPED [1] ../usr/local/lib/python3.7/site-packages/source_acceptance_test/plugin.py:56: Skipping TestIncremental.test_two_sequential_reads because not found in the config
============= 1 failed, 18 passed, 1 skipped in 799.86s (0:13:19) ==============

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 27, 2022 17:15 Inactive
@@ -128,3 +130,66 @@ def streams(self, config: Mapping[str, Any]) -> List[Stream]:
]
)
return streams

def _read_incremental(
Contributor commented:

A few comments about this override:

  1. Please, every time we override something, add a docstring explaining why the override is needed (especially for private methods).
  2. I believe we can and should do error handling at the stream level (or in the GoogleAds class), not here.
  3. The pattern looks like a typical retry case. I don't think reducing the page size is a good solution, because we basically lose all progress. I propose retrying the query with the new state and continuing to read. The data in the response is returned in ascending order, so there should be no problem continuing from where we crashed.
    WDYT?

keu (Contributor) commented Jan 27, 2022:

@augan-rymkhan here is an explanation of the 3rd point.

Let's say we request the data:

from 2010-01-01 00:00:00 to 2010-01-15 00:00:00
we read 10 pages, but there are another 20 pages left
we crashed at cursor_value = 2010-01-05 13:00:12
we will retry the query
from 2010-01-05 13:00:12 to 2010-01-15 00:00:00.

If this doesn't work (I don't see why it wouldn't, but...), we can retry exactly the same query but continue from the page number where we crashed (10).

augan-rymkhan (Contributor, Author) commented Jan 28, 2022:

  1. I am checking whether it can be handled inside the stream class. I'll update here.
  2. stream_instance.stream_slices receives stream_state, which is the latest state taken from get_updated_state, and generates slices from that point. It doesn't lose the progress; it regenerates slices with a reduced date range starting from the latest read record's cursor value.

keu (Contributor) left a comment:

see my comments

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 31, 2022 10:53 Inactive
Currently this method returns `start_date` and `end_date` with 15 days difference.
"""

end_date = end_date or pendulum.yesterday(tz=time_zone)
Contributor commented:

Why is time_zone used only for end_date? And why do we need this parameter in general?

augan-rymkhan (Contributor, Author) commented Feb 2, 2022:

@vitaliizazmic This line was there before this change; get_date_params was a stream method. I just refactored it into a module-level function so it can be called inside the chunk_date_range function.

end_date = end_date or pendulum.yesterday(tz=time_zone)
start_date = pendulum.parse(start_date)
if start_date > pendulum.now():
    return start_date.to_date_string(), start_date.add(days=1).to_date_string()
Contributor commented:

Could you please explain why these dates are returned in this case?

Contributor (Author) commented:

# Fix issue #4806, start date should always be lower than end date.
if start_date.add(days=1).date() >= end_date.date():
    return start_date.add(days=1).to_date_string(), start_date.add(days=2).to_date_string()
return start_date.add(days=1).to_date_string(), end_date.to_date_string()
Contributor commented:

Could you please explain why one day is added to the start date?


@@ -35,16 +64,21 @@ def chunk_date_range(

# Return some state when the state is abnormal
if start_date > end_date:
    return [{field: start_date.to_date_string()}]
start, end = get_date_params(start_date.to_date_string(), time_zone=time_zone, range_days=range_days)
return [{"start_date": start, "end_date": end}]
Contributor commented:

I believe we can avoid duplicating code by setting the dates.

Contributor (Author) commented:

@vitaliizazmic Done. I refactored this line. Thanks for this suggestion!

    for record in self.parse_response(response):
        state = self.get_updated_state(state, record)
        yield record
except GoogleAdsException as e:
Contributor commented:

I think we shouldn't use names like "e".

Contributor (Author) commented:

@vitaliizazmic Changed the variable name.

start_date, end_date = parse_dates(stream_slice)
if (end_date - start_date).days == 1:
    # If the date range is 1 day, there is no need to retry, because it's the minimum date range
    raise e
Contributor commented:

Will it return a description of the error?

Contributor (Author) commented:

Yes, this will be the description: message: "Page token has expired."
I will add an extra log message here.

    raise e
else:
    # return control if no exception is raised
    return
Contributor commented:

In my view, this isn't a good solution.

Contributor (Author) commented:

@vitaliizazmic After reading records completes successfully, it needs to quit the loop. What solution do you suggest?
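For illustration, the try/except/else shape being discussed: `else` runs only when the `try` body raised nothing, which is how a fully successful read can quit the retry loop. The names here are hypothetical, not taken from the connector.

```python
def read_with_retries(attempts):
    """Run callables in order; retry on failure, return on the first success."""
    for attempt in attempts:
        try:
            result = attempt()
        except ValueError:
            continue  # e.g. adjust the slice and retry on the next iteration
        else:
            return result  # no exception was raised: quit the loop
    raise RuntimeError("all attempts failed")
```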

return start_date, end_date


def get_date_params(start_date: str, time_zone=None, range_days: int = None, end_date: pendulum.datetime = None):
Contributor commented:

This doesn't have a return type annotation.

Contributor (Author) commented:

Added the return type annotation.

try:
    response = self.google_ads_client.send_request(self.get_query(stream_slice))
    for record in self.parse_response(response):
        state = self.get_updated_state(state, record)
Contributor commented:

not sure I understand why we need to call get_updated_state here

Contributor (Author) commented:

@keu This method is called to get the cursor value from the record. I call it to be sure the cursor value is the latest state.
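A hedged sketch of what such a cursor update could look like (an assumed shape for illustration; the connector's actual get_updated_state lives in streams.py and may differ): keep the maximum cursor value seen across records.

```python
def get_updated_state(current_state, record, cursor_field="segments.date"):
    """Return a state dict holding the highest cursor value seen so far.
    Assumed shape for illustration, not the connector's exact signature."""
    candidate = record[cursor_field]
    if not current_state or candidate > current_state.get(cursor_field, ""):
        return {cursor_field: candidate}
    return current_state
```

Because the cursor is an ISO date string, lexicographic comparison matches chronological order here.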

if start_date.add(days=1).date() >= end_date.date():
    return start_date.add(days=1).to_date_string(), start_date.add(days=2).to_date_string()
return start_date.add(days=1).to_date_string(), end_date.to_date_string()
state = stream_state or {}
keu (Contributor) commented Feb 1, 2022:

The logic of this function has two parts:

  • handling the error
  • reading records (duplicating the logic of stream.read or the super class)

Can we implement this as a retry decorator, or at least move it to a separate function?

augan-rymkhan (Contributor, Author) commented Feb 2, 2022:

@keu Refactored this method; it now calls the parent's read_records method. I am not sure that using a decorator would improve this code.

keu (Contributor) left a comment:

see my comments

@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 2, 2022 07:19 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 2, 2022 09:49 Inactive
keu (Contributor) left a comment:

Let it go, as it is an on-call issue.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Feb 4, 2022
augan-rymkhan (Contributor, Author) commented Feb 4, 2022

/publish connector=connectors/source-google-ads

🕑 connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1795957523
✅ connectors/source-google-ads https://github.com/airbytehq/airbyte/actions/runs/1795957523

@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 4, 2022 16:19 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 4, 2022 16:21 Inactive
@augan-rymkhan augan-rymkhan merged commit 359fcd8 into master Feb 4, 2022
@augan-rymkhan augan-rymkhan deleted the arymkhan/google-ads-page-token-expired-fix branch February 4, 2022 16:48
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets February 4, 2022 16:49 Inactive