Science Museum queries may occasionally fail due to upstream bug #4013

Open
AetherUnbound opened this issue Apr 2, 2024 · 8 comments
Labels
💻 aspect: code (Concerns the software code in the repository)
🛠 goal: fix (Bug fix)
🟧 priority: high (Stalls work on the project or its dependents)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)
🔒 staff only (Restricted to staff members)
⛔ status: blocked (Blocked & therefore, not ready for work)
🔧 tech: airflow (Involves Apache Airflow)

Comments

@AetherUnbound
Contributor

Airflow log link

Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

https://airflow.openverse.engineering/log?execution_date=2024-03-01T00%3A00%3A00%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=science_museum_workflow&map_index=-1

[2024-04-01, 00:03:38 UTC] {requester.py:85} ERROR - Error with the request for URL: https://collection.sciencemuseumgroup.org.uk/search/
[2024-04-01, 00:03:38 UTC] {requester.py:86} INFO - HTTPError: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=39&date%5Bfrom%5D=0&date%5Bto%5D=200
[2024-04-01, 00:03:38 UTC] {requester.py:88} INFO - Using query parameters {'has_image': 1, 'image_license': 'CC', 'page[size]': 100, 'page[number]': 39, 'date[from]': 0, 'date[to]': 200}
[2024-04-01, 00:03:38 UTC] {requester.py:89} INFO - Using headers {'User-Agent': 'Openverse/0.1 (https://openverse.org; openverse@wordpress.org)', 'Accept': 'application/json'}
[2024-04-01, 00:03:38 UTC] {requester.py:154} ERROR - No retries remaining. Failure.
[2024-04-01, 00:03:38 UTC] {provider_data_ingester.py:513} INFO - Committed 0 records
[2024-04-01, 00:03:39 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=39&date%5Bfrom%5D=0&date%5Bto%5D=200
query_params: {"has_image": 1, "image_license": "CC", "page[size]": 100, "page[number]": 39, "date[from]": 0, "date[to]": 200}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/science_museum.py", line 81, in ingest_records
    super().ingest_records(year_range=year_range)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 238, in ingest_records
    batch, should_continue = self.get_batch(query_params)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 400, in get_batch
    response_json = self.get_response_json(query_params)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 421, in get_response_json
    return self.delayed_requester.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 165, in _attempt_retry_get_response_json
    return self.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 165, in _attempt_retry_get_response_json
    return self.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 165, in _attempt_retry_get_response_json
    return self.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 155, in _attempt_retry_get_response_json
    raise error
  File "/opt/airflow/catalog/dags/common/requester.py", line 181, in get_response_json
    response = self.get(endpoint, params=query_params, **kwargs)
  File "/opt/airflow/catalog/dags/common/requester.py", line 103, in get
    return self._make_request(self.session.get, url, params=params, **kwargs)
  File "/opt/airflow/catalog/dags/common/requester.py", line 70, in _make_request
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=39&date%5Bfrom%5D=0&date%5Bto%5D=200

Description

It appears as though the Science Museum DAG is failing for this particular URL (specifically, these parameters):

https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page[size]=100&page[number]=39&date[from]=0&date[to]=200
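For reference, the bracketed URL above and the percent-encoded URL in the log are the same request. A minimal sketch using only the standard library, with the parameters copied from the logged `query_params`; the `build_url` helper is hypothetical and not part of the catalog code:

```python
from urllib.parse import urlencode

BASE_URL = "https://collection.sciencemuseumgroup.org.uk/search/"

# Query parameters copied from the failing request in the Airflow log.
query_params = {
    "has_image": 1,
    "image_license": "CC",
    "page[size]": 100,
    "page[number]": 39,
    "date[from]": 0,
    "date[to]": 200,
}


def build_url(params: dict) -> str:
    """Return the full search URL; brackets are percent-encoded as %5B / %5D."""
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical helper usage: the failing page and its neighbour.
failing_url = build_url(query_params)
neighbour_url = build_url({**query_params, "page[number]": 40})
```

Only `page[number]` differs between the two URLs, which makes it easy to check by hand whether a 503 is tied to a single page.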

Reproduction

Changing the page[number] param from 39 to 40 returns a non-503 response:

https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page[size]=100&page[number]=40&date[from]=0&date[to]=200

Since this is entirely an upstream bug, I think the best approach here might be to skip a page specifically when it returns a 503 response.
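A rough sketch of what "skip the page on a 503" could look like, independent of the actual ingester classes. Everything here is hypothetical: `ServiceUnavailable` stands in for the HTTP 503 error, `fetch_page` stands in for whatever performs the request, and `max_skips` is an assumed safeguard so that a full upstream outage still fails the run instead of silently skipping everything.

```python
from typing import Callable, Iterator


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 raised by the request layer."""


def iter_batches(
    fetch_page: Callable[[int], list],
    first_page: int = 0,
    max_skips: int = 5,
) -> Iterator[list]:
    """Yield batches page by page, skipping pages that fail with a 503.

    Stops at the first empty batch. Re-raises once more than `max_skips`
    pages have been skipped.
    """
    page, skipped = first_page, 0
    while True:
        try:
            batch = fetch_page(page)
        except ServiceUnavailable:
            skipped += 1
            if skipped > max_skips:
                raise
            page += 1  # move past the broken page
            continue
        if not batch:
            return
        yield batch
        page += 1


# Fake upstream for illustration: page 39 always fails, pages >= 42 are empty.
def fake_fetch(page: int) -> list:
    if page == 39:
        raise ServiceUnavailable(f"503 on page {page}")
    return [f"record-{page}"] if page < 42 else []


batches = list(iter_batches(fake_fetch, first_page=38))
```

With the fake upstream, page 39 is dropped and ingestion continues with pages 40 and 41. The records on a skipped page are lost for that run, so a real implementation would probably want to log which pages were skipped.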

Note
We should take special care, once this issue is resolved, to confirm that we're actually ingesting data from this provider. The last large run returned nearly 100k results, but the most recent run before this failure returned only ~150. There may be a separate issue preventing standard ingestion of records, possibly a change in the shape of the results.
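One way to make that check explicit is a simple ratio test between consecutive runs. This is only a sketch: the function name and the `min_ratio` threshold are assumptions, and the counts used below are the approximate figures mentioned in the note, not exact values.

```python
def looks_like_ingestion_regression(
    current_count: int,
    previous_count: int,
    min_ratio: float = 0.5,  # assumed threshold, tune per provider
) -> bool:
    """Return True when the current run ingested suspiciously few records."""
    if previous_count <= 0:
        return False  # nothing to compare against
    return current_count < previous_count * min_ratio


# Approximate figures from the note: ~100k on the last large run, ~150 after.
suspicious = looks_like_ingestion_regression(150, 100_000)
```

A drop of that size would be flagged under any reasonable threshold, which supports treating it as a separate problem from the 503s.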

DAG status

No change; this is a monthly DAG, and we should hopefully address it soon.

@AetherUnbound AetherUnbound added 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 2, 2024
@sarayourfriend sarayourfriend added the 🔒 staff only Restricted to staff members label Apr 2, 2024
@sarayourfriend
Contributor

Whoever picks up this issue should also reach out to Science Museum and let them know about the bug: feedback@sciencemuseum.ac.uk, from https://collection.sciencemuseumgroup.org.uk/about

@AetherUnbound do you know if we've already done that, by chance?

@AetherUnbound
Contributor Author

I have not, though I intend to when I'm next in front of my computer! I had just enough time to fill out this issue and get the context down before I had to step away.

@AetherUnbound
Contributor Author

I ran the same range locally and the error no longer occurs, so I'm closing this!

@AetherUnbound AetherUnbound closed this as not planned Apr 12, 2024
@stacimc stacimc reopened this Apr 15, 2024
@stacimc
Contributor

stacimc commented Apr 15, 2024

Reopening as I've encountered this issue while testing #4105.

@AetherUnbound
Contributor Author

AetherUnbound commented Apr 15, 2024

@stacimc do you mind sharing the query params for the case that's failing currently?

@stacimc
Contributor

stacimc commented Apr 16, 2024

Sure -- just a few minutes into ingestion, unfortunately:

"initial_query_params": {
    "date[from]": 1500,
    "date[to]": 1750,
    "has_image": 1,
    "image_license": "CC",
    "page[number]": 43,
    "page[size]": 100
}

@AetherUnbound
Contributor Author

Thanks! I've emailed the folks at the Science Museum Group with this information.

@stacimc
Contributor

stacimc commented Apr 18, 2024

Thanks @AetherUnbound. For the time being I've updated the SKIPPED_INGESTION_ERRORS configuration to skip batches with this particular error, and restarted the DAG.
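The thread does not show what the SKIPPED_INGESTION_ERRORS value looks like, so the following is only an illustration of the general idea of config-driven error skipping. The config shape (DAG id mapped to a list of message substrings), the `skipped_errors_config` name, and the `should_skip_error` helper are all hypothetical; the DAG id and error text are taken from the log earlier in this issue.

```python
# Hypothetical shape: DAG id -> substrings identifying errors to skip.
# The real SKIPPED_INGESTION_ERRORS format is not shown in this thread.
skipped_errors_config = {
    "science_museum_workflow": ["503 Server Error: Service Unavailable"],
}


def should_skip_error(dag_id: str, error: Exception, config: dict) -> bool:
    """Return True if the error message matches a configured substring."""
    message = str(error).lower()
    return any(
        predicate.lower() in message for predicate in config.get(dag_id, [])
    )


# Example with the error text seen in the Airflow log.
error = Exception(
    "503 Server Error: Service Unavailable for url: "
    "https://collection.sciencemuseumgroup.org.uk/search/"
)
skip = should_skip_error("science_museum_workflow", error, skipped_errors_config)
```

Keeping the rule in configuration rather than in code means it can be removed once the upstream bug is fixed, without a deploy.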
