Phylopic ingestion may fail if build changes during processing #3820

Closed
AetherUnbound opened this issue Feb 21, 2024 · 0 comments · Fixed by #3874
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

Comments

@AetherUnbound
Contributor

Airflow log link

Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

https://airflow.openverse.engineering/log?execution_date=2024-02-11T00%3A00%3A00%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=phylopic_workflow&map_index=-1

Description

When performing the Phylopic ingestion, the first step we take is to get the build_param:

self.build_param = resp.get("build")

I assume this value relates to some index or API version that is being referenced. Recently, we had ingestion fail due to this parameter:

[2024-02-18, 00:05:51 UTC] {provider_data_ingester.py:238} INFO - 3312 records ingested so far.
[2024-02-18, 00:05:56 UTC] {requester.py:79} WARNING - Unable to request URL: https://api.phylopic.org/images?build=307&page=69&embed_items=true  Status code: 410
[2024-02-18, 00:05:56 UTC] {requester.py:169} WARNING - Bad response_json:  None
[2024-02-18, 00:05:56 UTC] {requester.py:170} WARNING - Retrying:
_get_response_json(
    https://api.phylopic.org/images,
    {'build': 307, 'page': 69, 'embed_items': 'true'},
    retries=2)
[2024-02-18, 00:06:00 UTC] {requester.py:79} WARNING - Unable to request URL: https://api.phylopic.org/images?build=307&page=69&embed_items=true  Status code: 410
[2024-02-18, 00:06:00 UTC] {requester.py:169} WARNING - Bad response_json:  None
[2024-02-18, 00:06:00 UTC] {requester.py:170} WARNING - Retrying:
_get_response_json(
    https://api.phylopic.org/images,
    {'build': 307, 'page': 69, 'embed_items': 'true'},
    retries=1)
[2024-02-18, 00:06:05 UTC] {requester.py:79} WARNING - Unable to request URL: https://api.phylopic.org/images?build=307&page=69&embed_items=true  Status code: 410
[2024-02-18, 00:06:05 UTC] {requester.py:169} WARNING - Bad response_json:  None
[2024-02-18, 00:06:05 UTC] {requester.py:170} WARNING - Retrying:
_get_response_json(
    https://api.phylopic.org/images,
    {'build': 307, 'page': 69, 'embed_items': 'true'},
    retries=0)
[2024-02-18, 00:06:10 UTC] {requester.py:79} WARNING - Unable to request URL: https://api.phylopic.org/images?build=307&page=69&embed_items=true  Status code: 410
[2024-02-18, 00:06:10 UTC] {requester.py:169} WARNING - Bad response_json:  None
[2024-02-18, 00:06:10 UTC] {requester.py:170} WARNING - Retrying:
_get_response_json(
    https://api.phylopic.org/images,
    {'build': 307, 'page': 69, 'embed_items': 'true'},
    retries=-1)
[2024-02-18, 00:06:10 UTC] {requester.py:155} ERROR - No retries remaining.  Failure.

When visiting the above URL, a 410 is indeed returned with the following body:

{
  "build": 312,
  "errors": [
    {
      "developerMessage": "Outdated `build` index. Should be the current build index (312). The current value can always be gotten by omitting the `build` parameter and following the redirect. Or, see the body of this response.",
      "field": "build",
      "type": "RESOURCE_NOT_FOUND",
      "userMessage": "There was a problem with a request to list silhouette images."
    }
  ]
}
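Since the 410 body carries the current build index, a handler could read it straight from the response rather than making a second request. The following is a minimal sketch of that parsing step only; `current_build_from_410` is a hypothetical helper name, not something that exists in the catalog code, and it assumes the body has the shape shown above.

```python
import json
from typing import Optional


def current_build_from_410(body: str) -> Optional[int]:
    """Return the build index advertised in a 410 response body, or None.

    Hypothetical helper: assumes the body is a JSON object with a
    top-level integer "build" field, as in the response shown above.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    build = payload.get("build")
    return build if isinstance(build, int) else None


# Trimmed copy of the 410 body from this issue.
example_body = (
    '{"build": 312, "errors": [{"field": "build", "type": "RESOURCE_NOT_FOUND"}]}'
)
print(current_build_from_410(example_body))
```

Returning None for anything unexpected lets the caller fall back to re-requesting the build the usual way.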

I suspect this means the build changed while we were processing, since the earlier requests for that build ran fine. In these cases, we probably just want to kick the Phylopic DAG off again from the start. It would be a mistake to pick up the new build number and continue from the same page, as the data may be entirely different.

We have a few options here:

  • Check for 410s in the Phylopic DAG and raise a more specific exception which has context and directions for how to proceed.
  • Catch a 410 in this case, and reset both the build and the page number and just start processing over from the beginning again. The upsert process should filter out any duplicates that occur by this method.

Personally, I'm partial to the latter as it will mean less intervention from operators and the DAG will be able to complete as expected. @WordPress/openverse-catalog, any other thoughts/opinions?
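To make the second option concrete, here is a rough sketch of the restart logic, independent of the actual ingester classes. Everything in it is an assumption for illustration: `OutdatedBuildError`, `ingest_all_pages`, the `get_current_build` and `fetch_page` callables, the zero-based `first_page`, and the `max_restarts` cap are hypothetical names, and the dict keyed by record id only stands in for the upsert step. A fake API is included so the sketch runs without network access.

```python
class OutdatedBuildError(Exception):
    """Hypothetical exception for a 410 caused by an outdated build index."""

    def __init__(self, stale_build, current_build):
        super().__init__(
            f"Build {stale_build} is outdated; current build is {current_build}."
        )
        self.stale_build = stale_build
        self.current_build = current_build


def ingest_all_pages(get_current_build, fetch_page, first_page=0, max_restarts=3):
    """Ingest every page, starting over from first_page if the build changes.

    get_current_build() returns the current build index.
    fetch_page(build, page) returns a list of records (empty when done)
    and raises OutdatedBuildError when the build is no longer current.
    """
    records = {}  # keyed by id, standing in for the upsert de-duplication
    for _ in range(max_restarts + 1):
        build = get_current_build()  # reset the build...
        page = first_page            # ...and the page number
        try:
            while True:
                batch = fetch_page(build, page)
                if not batch:
                    return records
                for record in batch:
                    records[record["id"]] = record
                page += 1
        except OutdatedBuildError:
            continue  # build changed mid-run: start over from the beginning
    raise RuntimeError(f"Build changed more than {max_restarts} times; giving up.")


# Fake API for demonstration: the build flips from 307 to 312 on the third request.
state = {"calls": 0}
PAGES = [[{"id": "a"}], [{"id": "b"}], [{"id": "c"}], [{"id": "d"}]]


def fake_current_build():
    return 307 if state["calls"] < 3 else 312


def fake_fetch_page(build, page):
    state["calls"] += 1
    current = fake_current_build()
    if build != current:
        raise OutdatedBuildError(build, current)
    return PAGES[page] if page < len(PAGES) else []


result = ingest_all_pages(fake_current_build, fake_fetch_page)
print(sorted(result))
```

The cap on restarts is a design choice rather than something discussed above: it keeps the task from looping indefinitely if builds change faster than a full pass can complete.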

Reproduction

You should be able to replicate this by clicking one of the links above, or triggering the DAG locally with {"build": 307} as part of the additional_query_params.

DAG status

The DAG looks like it's chugging along successfully on a re-run so no need to change the status.

@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs help wanted Open to participation from the community labels Feb 21, 2024
@stacimc stacimc self-assigned this Feb 26, 2024