Stocksnap 403 forbidden error during ingestion #4101

Open
AetherUnbound opened this issue Apr 12, 2024 · 3 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs ⛔ status: blocked Blocked & therefore, not ready for work 🔧 tech: airflow Involves Apache Airflow

Comments

@AetherUnbound
Contributor

Airflow log link

Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

https://airflow.openverse.org/log?execution_date=2024-03-01T00%3A00%3A00%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=stocksnap_workflow&map_index=-1

Description

The Stocksnap DAG encountered an error during ingestion:

[2024-04-01, 23:09:02 UTC] {requester.py:85} ERROR - Error with the request for URL: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
[2024-04-01, 23:09:02 UTC] {requester.py:86} INFO - HTTPError: 403 Client Error: Forbidden for url: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
[2024-04-01, 23:09:02 UTC] {requester.py:89} INFO - Using headers {'User-Agent': 'Openverse/0.1 (https://openverse.org; openverse@wordpress.org)', 'Accept': 'application/json'}
[2024-04-01, 23:09:02 UTC] {media.py:233} INFO - Writing 68 lines from buffer to disk.
[2024-04-01, 23:09:02 UTC] {provider_data_ingester.py:513} INFO - Committed 31168 records
[2024-04-01, 23:09:02 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: 403 Client Error: Forbidden for url: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
query_params: {}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 241, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 455, in process_batch
    if not (record_data := self.get_record_data(data)):
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/stocksnap.py", line 92, in get_record_data
    filesize = self._get_filesize(url)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/stocksnap.py", line 152, in _get_filesize
    resp = self.delayed_requester.head(image_url)
  File "/opt/airflow/catalog/dags/common/requester.py", line 114, in head
    return self._make_request(self.session.head, url, **kwargs)
  File "/opt/airflow/catalog/dags/common/requester.py", line 70, in _make_request
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg

On top of this, Stocksnap uses a page counter instead of normal query params, so it's difficult to determine which page it failed on. The error above reports only query_params: {}.

In addition to resolving this issue, we should try to alter the DAGs that don't normally use query parameters so that they still have something to report when they fail.
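
As a rough illustration of that idea, here is a minimal sketch of an ingester that pages with a counter but still reports where it stopped. Everything in it is hypothetical (the `PageCounterIngester` class, `failure_context`, and the `fetch_page` callable are not taken from the catalog code); it only shows the shape such a change might take.

```python
class PageCounterIngester:
    """Hypothetical ingester that pages with a counter instead of query params."""

    def __init__(self, fetch_page, last_page):
        self.fetch_page = fetch_page  # callable taking a page number (assumed)
        self.last_page = last_page
        self.page = 1

    def failure_context(self):
        # Expose the counter in the same shape as query params, so a failure
        # log reports {"page": N} rather than an empty dict.
        return {"page": self.page}

    def ingest(self):
        while self.page <= self.last_page:
            try:
                self.fetch_page(self.page)
            except Exception as err:
                # Re-raise with the current page attached to the message.
                raise RuntimeError(
                    f"{err}\nquery_params: {self.failure_context()}"
                ) from err
            self.page += 1
```

With something along these lines, a failure on a given page would surface that page number in the error message, which could then be used as a starting point for a rerun.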

DAG status

Unchanged for now since this is a monthly DAG

@AetherUnbound AetherUnbound added 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 12, 2024
@AetherUnbound AetherUnbound self-assigned this Apr 12, 2024
@AetherUnbound
Contributor Author

I've opened #4102 to help us reproduce this. Once that's merged, we should run the DAG again and see if it fails in the same place. If it does, we can continue to troubleshoot. If it doesn't, we can close this and reopen if it comes up again.

@stacimc
Contributor

stacimc commented Apr 18, 2024

Confirmed (now that initial_query_params works!) that this fails locally when starting with the params {"page": 780}. By the time I tested locally, the error was actually happening on page 781, possibly because more records had been added in the meantime.
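
For readers unfamiliar with the idea of resuming from a page, the sketch below shows one way a page-counter provider could honor a starting page passed as {"page": 780}. It is an illustration under assumptions, not the catalog's actual initial_query_params handling: `pages_from` and `last_page` are hypothetical names.

```python
from typing import Iterator, Optional


def pages_from(initial_query_params: Optional[dict], last_page: int) -> Iterator[int]:
    """Yield page numbers, starting at initial_query_params["page"] if given."""
    # Fall back to page 1 when no starting params are supplied (assumed default).
    start = int((initial_query_params or {}).get("page", 1))
    yield from range(start, last_page + 1)


# Resuming near the failure point instead of re-ingesting from page 1.
for page in pages_from({"page": 780}, last_page=782):
    print(f"Would fetch page {page}")
```

Starting a couple of pages before the suspected failure, as was done here, leaves room for the failing record to have shifted if new records were added upstream.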

@AetherUnbound
Contributor Author

I have emailed Stocksnap about this issue.

@AetherUnbound AetherUnbound added the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 23, 2024
Projects
Status: ⛔ Blocked