Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reintroduce window in days, log warning when sampling occurs #9480

Merged

Conversation

eliziario
Copy link
Contributor

@eliziario eliziario commented Jan 13, 2022

Improve sampling awareness and reintroduce parameter for mitigation in source google analytics

Fixes #8570

Reason

Google analytics may return sampled data when there are too many sessions (about 500k) in a given period.
Also, it may return provisional data (isDataGolden=false) meaning that the report can change if requested again in the future.
The Sampling issue can be partially mitigated (but not completely eliminated) by using smaller values for window_in_days parameter.
For those cases when even using window_in_days equal to one is not enough to avoid sampling (when there are more than 500k sessions or other dimension values in a report in a day) the user should be warned in the logs.
Also, when isDataGolden=false the user should be warned in logs

Confirmation

  • Generating millions of data records in an analytics api using the instrumentation api in a script reproduces the issue, which is also described in the documentation

How does the code change in the PR fix the issue?

  • Changed the default value for the window_in_days parameter
  • Re-exposed the window_in_days parameter on configuration for better user ergonomics based on their usage profile (users with lots of analytics data will choose smaller values to avoid sampling, users with less data will prefer higher values to improve initial sync speed.
  • Added logging warning for when isDataGolden is false
  • Added logging warning for when analytics api is sampling data.

Recommended reading order

  1. source.py
  2. spec.json
  3. unit_test.py

@github-actions github-actions bot added the area/connectors Connector related issues label Jan 13, 2022
@eliziario eliziario temporarily deployed to more-secrets January 17, 2022 11:11 Inactive
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Jan 17, 2022
@eliziario eliziario temporarily deployed to more-secrets January 17, 2022 11:50 Inactive
@eliziario eliziario marked this pull request as ready for review January 17, 2022 11:50
@sergei-solonitcyn
Copy link
Contributor

Please, don't forget to run tests and checks before every comit and especially before pushing.

Copy link
Contributor

@vitaliizazmic vitaliizazmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only requested small changes is needed.

@eliziario eliziario temporarily deployed to more-secrets January 20, 2022 12:57 Inactive
eliziario and others added 6 commits January 21, 2022 22:20
…rce_google_analytics_v4/source.py

Co-authored-by: Sergei Solonitcyn <11441558+sergei-solonitcyn@users.noreply.github.com>
Signed-off-by: Sergei Solonitcyn <sergei.solonitcyn@zazmic.com>
Signed-off-by: Sergei Solonitcyn <sergei.solonitcyn@zazmic.com>
@sergei-solonitcyn sergei-solonitcyn force-pushed the eliziario/8570_google_analytics_improve_sampling_awareness branch from 37b1d68 to 635d0fc Compare January 21, 2022 22:04
@sergei-solonitcyn sergei-solonitcyn temporarily deployed to more-secrets January 21, 2022 22:06 Inactive
@sergei-solonitcyn
Copy link
Contributor

Made some updates, but tests are still broken and there are suspections that the code as well.
Will reassign it and finish the work.

@augan-rymkhan
Copy link
Contributor

augan-rymkhan commented Jan 24, 2022

/test connector=connectors/source-google-analytics-v4

🕑 connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1739826933
❌ connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1739826933
🐛 https://gradle.com/s/q4l327na465c4
Python short test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestBasicRead::test_read[inputs0] - Failed: Timeout >300.0s
FAILED test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs0]
FAILED test_incremental.py::TestIncremental::test_two_sequential_reads[inputs0]
SKIPPED [1] ../usr/local/lib/python3.7/site-packages/source_acceptance_test/tests/test_incremental.py:21: `future_state_path` not specified, skipping
============= 3 failed, 13 passed, 1 skipped in 1481.03s (0:24:41) =============

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 24, 2022 12:31 Inactive
@augan-rymkhan
Copy link
Contributor

augan-rymkhan commented Jan 25, 2022

/test connector=connectors/source-google-analytics-v4

🕑 connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1743603850
❌ connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1743603850
🐛 https://gradle.com/s/hhu7p2rouqzjs
Python short test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestConnection::test_check[inputs0] - Failed: Timeout >3...
FAILED test_core.py::TestConnection::test_check[inputs1] - Failed: Timeout >3...
FAILED test_core.py::TestBasicRead::test_read[inputs0] - docker.errors.Contai...
FAILED test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs0]
FAILED test_incremental.py::TestIncremental::test_two_sequential_reads[inputs0]
SKIPPED [1] ../usr/local/lib/python3.7/site-packages/source_acceptance_test/tests/test_incremental.py:21: `future_state_path` not specified, skipping
============= 5 failed, 11 passed, 1 skipped in 542.20s (0:09:02) ==============

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 25, 2022 04:24 Inactive
@augan-rymkhan
Copy link
Contributor

augan-rymkhan commented Jan 25, 2022

/test connector=connectors/source-google-analytics-v4

🕑 connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1745380598
❌ connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1745380598
🐛 https://gradle.com/s/z6xlo2vu7vjgo
Python short test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestBasicRead::test_read[inputs0] - Failed: Timeout >300.0s
SKIPPED [1] ../usr/local/lib/python3.7/site-packages/source_acceptance_test/tests/test_incremental.py:21: `future_state_path` not specified, skipping
============= 1 failed, 15 passed, 1 skipped in 1468.40s (0:24:28) =============

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 25, 2022 12:30 Inactive
@augan-rymkhan
Copy link
Contributor

augan-rymkhan commented Jan 25, 2022

/test connector=connectors/source-google-analytics-v4

🕑 connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1746013589
✅ connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1746013589
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        74      6    92%
	 source_acceptance_test/conftest.py                     109    109     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              242     96    60%
	 source_acceptance_test/tests/test_full_refresh.py       38      0   100%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  54     17    69%
	 source_acceptance_test/utils/compare.py                 62     23    63%
	 source_acceptance_test/utils/connector_runner.py       110     48    56%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  979    404    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                     Stmts   Miss  Cover
	 ------------------------------------------------------------
	 source_google_analytics_v4/__init__.py       2      0   100%
	 source_google_analytics_v4/source.py       273     34    88%
	 ------------------------------------------------------------
	 TOTAL                                      275     34    88%

Python short test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.7/site-packages/source_acceptance_test/tests/test_incremental.py:21: `future_state_path` not specified, skipping
======================== 16 passed, 1 skipped in 53.51s ========================

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 25, 2022 14:48 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 26, 2022 04:33 Inactive
@codecov
Copy link

codecov bot commented Jan 26, 2022

Codecov Report

❗ No coverage uploaded for pull request base (master@6490bf7). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9480   +/-   ##
=========================================
  Coverage          ?   87.63%           
=========================================
  Files             ?        2           
  Lines             ?      275           
  Branches          ?        0           
=========================================
  Hits              ?      241           
  Misses            ?       34           
  Partials          ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6490bf7...7afab85. Read the comment docs.

@augan-rymkhan
Copy link
Contributor

augan-rymkhan commented Jan 26, 2022

/publish connector=connectors/source-google-analytics-v4

🕑 connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1749126035
✅ connectors/source-google-analytics-v4 https://github.com/airbytehq/airbyte/actions/runs/1749126035

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 26, 2022 04:44 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 26, 2022 04:56 Inactive
@sergei-solonitcyn sergei-solonitcyn temporarily deployed to more-secrets January 26, 2022 07:45 Inactive
@augan-rymkhan augan-rymkhan merged commit f78ede0 into master Jan 26, 2022
@augan-rymkhan augan-rymkhan deleted the eliziario/8570_google_analytics_improve_sampling_awareness branch January 26, 2022 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Source Google Analytics: Improve sampling awareness
6 participants