Increase Wikimedia request timeout #4003

Closed
AetherUnbound opened this issue Apr 1, 2024 · 0 comments · Fixed by #4004
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow

Comments

AetherUnbound commented Apr 1, 2024

Airflow log link

Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

https://airflow.openverse.engineering/log?execution_date=2024-03-24T00%3A00%3A00%2B00%3A00&task_id=ingest_data_day_shift_456.pull_mixed_data_day_shift_456&dag_id=wikimedia_reingestion_workflow&map_index=-1

(one of many examples)

Description

The Wikimedia API is often slow to reply. In many cases the query for a given parameter takes longer than 60 seconds (our default request timeout) to complete, and the workflow fails as a result. This is especially costly while the reingestion workflows are running, since numerous days can fail within a single reingestion run. A recent run had over a dozen errors, all of this type:

Exception: HTTPSConnectionPool(host='commons.wikimedia.org', port=443): Read timed out. (read timeout=60)

For Wikimedia only, we should consider increasing the request timeout to at least 120s to see whether this reduces the number of timeout failures. This can be done by passing the requests timeout argument into the get_response_json call for the ProviderDataIngester. The timeout is already overridden for Wikimedia, so it would only need to be bumped up to 120s.
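To illustrate the shape of the proposed change, here is a minimal sketch. It does not reproduce the actual Openverse code: `FakeRequester`, `SketchIngester`, `WikimediaSketchIngester`, the two constants, and the endpoint path are hypothetical stand-ins, and the real `ProviderDataIngester.get_response_json` signature may differ. The only idea taken from the issue is that a Wikimedia-specific override passes a larger `timeout` through to the request.

```python
# Hypothetical stand-ins for illustration only; the real ProviderDataIngester
# and its requester live in the Openverse catalog and may look different.
DEFAULT_TIMEOUT = 60     # seconds; the default request timeout cited in this issue
WIKIMEDIA_TIMEOUT = 120  # seconds; the proposed minimum for Wikimedia


class FakeRequester:
    """Records the keyword arguments it receives instead of doing real HTTP."""

    def __init__(self):
        self.calls = []

    def get_response_json(self, endpoint, params=None, **kwargs):
        # A real requester would forward kwargs (including timeout) to requests.
        self.calls.append({"endpoint": endpoint, "params": params, **kwargs})
        return {}


class SketchIngester:
    """Simplified stand-in for a provider ingester base class."""

    endpoint = None

    def __init__(self, requester):
        self.requester = requester

    def get_response_json(self, query_params, **kwargs):
        # Fall back to the default timeout unless the caller supplied one.
        kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
        return self.requester.get_response_json(
            self.endpoint, params=query_params, **kwargs
        )


class WikimediaSketchIngester(SketchIngester):
    # Illustrative endpoint; only the host appears in the error message above.
    endpoint = "https://commons.wikimedia.org/w/api.php"

    def get_response_json(self, query_params, **kwargs):
        # Wikimedia-only override: tolerate slow replies for up to 120 seconds.
        kwargs.setdefault("timeout", WIKIMEDIA_TIMEOUT)
        return super().get_response_json(query_params, **kwargs)
```

Keeping the override in the Wikimedia subclass leaves the 60 second default untouched for every other provider, which matches the "for Wikimedia only" scope of this issue.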

Reproduction

DAG status

Unchanged

Related issues

#1269

AetherUnbound added 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 1, 2024
AetherUnbound self-assigned this Apr 1, 2024