Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Source Amazon Ads: fix fragile polling generator #8388

Merged
merged 12 commits into from
Dec 28, 2021

Conversation

monai
Copy link
Contributor

@monai monai commented Dec 1, 2021

What

Amazon Ads connector initiates report generation and then polls for successfully generated reports. Currently, this logic is implemented as a single long-running and deeply nested generator that fails on every, possibly recoverable, upstream error.

This change is an attempt to fix the issue. Fixes #6767.

How

The polling generator is refactored as a decorated function. The decorator repeatedly calls it on every known error for a maximum of 30 minutes. When the status check is constantly failing for 30 minutes, it initializes a new report and repeats reinitialization up to 5 times.

Recommended reading order

  1. source_amazon_ads/streams/report_streams/report_streams.py

🚨 User Impact 🚨

None.

Pre-merge Checklist

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

@github-actions github-actions bot added the area/connectors Connector related issues label Dec 1, 2021
(ReportGenerationFailure, ReportGenerationInProgress),
max_time=REPORT_WAIT_TIMEOUT,
)
def _try_read_records(self, report_infos):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job refactoring this messy code. But you have changed logic so now it wouldn't wait for reports status update and expect it to be completed immediately? You misunderstood REPORT_WAIT_TIMEOUT parameter, this is not time required for downloading report, its maximum time that requires for generating async report. Please read comment that you have deleted :)

I agree on that if single report failed with 500 error it should be retried but in this case we should take into account report processing time for each report.
Wouldn't you mind if I pick up this PR and continue work for you. @sherifnada what do you think on priority of this ticket?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avida it is most valuable to focus on the issues currently in the sprint (FB/sentry), is it possible to describe the needed changes so monai can do them instead? alternatively hand off to another member on the team or come back to the PR after FB/sentry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you have changed logic so now it wouldn't wait for reports status update and expect it to be completed immediately?

As I wrote in the PR description above, I refactored the polling generator as a decorated function. As a result, it polls report status for up to REPORT_WAIT_TIMEOUT, i.e., 20 minutes (see the documentation for max_time property).

You misunderstood REPORT_WAIT_TIMEOUT parameter, this is not time required for downloading report, its maximum time that requires for generating async report.

False; see the comment above.

I agree on that if single report failed with 500 error it should be retried but in this case we should take into account report processing time for each report.

As already does yours, my implementation uses the same timeout for an entire report batch stored in report_infos. Report generation is being initiated simultaneously for the batch and, therefore, should be completed in the same 20 minutes window. Even if they're generated not in the same order as initiated, the polling function takes and processes whichever is generated first.

Wouldn't you mind if I pick up this PR and continue work for you. @sherifnada what do you think on priority of this ticket?

Of course not, that would be great 😊 I started this PR because my original issue #6767 hasn't got any attention yet.

To add more context to this issue. Amazon Ads connector is virtually unusable with a real account with several actively running campaigns. Report extraction fails due to one or another error. Roughly 1 in 20, if not less, sync succeeds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monai Sorry for long time no response, lost this conversation. Thanks for detailed answer, it makes sense, lets merge it.

@monai
Copy link
Contributor Author

monai commented Dec 7, 2021

Few more changes:

  • Increased total wait timeout to 30 minutes; 20 minutes is not enough.
  • Added backoff to report download as it sometimes fails.
  • When the status check is constantly failing for 30 minutes, it initializes a new report and repeats reinitialization up to 5 times.

After these additional changes, I've finally managed to get the first successful sync for a whole month's reports. So this is the first version that works with an actual active account.

Sync status:

Succeeded
4.21 GB | 2,796,145 records | 1h 36m 39s | Sync

@monai monai marked this pull request as ready for review December 7, 2021 15:21
@harshithmullapudi
Copy link
Contributor

@marcosmarxm this looks good to me. I have tried with the credentials. Do we need to inform the Connector team to evaluate experience ?

@sherifnada
Copy link
Contributor

@harshithmullapudi if connector team approves the PR go ahead

@harshithmullapudi
Copy link
Contributor

@avida can you possibly take a quick look at this PR ?

Copy link
Contributor

@avida avida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I was confused by nonstandard backoff decorator using for retries but after @monai explained it became clear to me. Thanks!

@avida avida merged commit 039a1aa into airbytehq:master Dec 28, 2021
avida pushed a commit that referenced this pull request Dec 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Amazon Ads connector fails a lot in the EU region
6 participants