Recover from updated build_param in Phylopic DAG #3874

Merged
merged 5 commits into from Mar 12, 2024

Conversation

stacimc

@stacimc stacimc commented Mar 4, 2024

Fixes

Fixes #3820 by @AetherUnbound, fixes #1369

Description

#3820 gives an excellent explanation of the problem, but the tldr is that the Phylopic DAG works by first fetching a build_param that is used in all subsequent requests for data, and sometimes this build_param changes while ingestion is underway, causing errors. When that happens we want to identify the issue, fetch the new build_param, and start ingestion over from the top.

To accomplish that, this PR ended up making additional changes to the DelayedRequester. Currently the requester will, after all configured retries have been exhausted, raise a RetriesExceeded error with no other context about what the actual error was from the provider API. This PR updates the requester to instead raise the actual error from the API, which I think is more useful information. For example we already have one other provider (Freesound) which was trying to detect and handle specific API errors, and now we can do that.
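
The behavior change described above can be illustrated with a minimal sketch. This is not the actual DelayedRequester code: `FakeHTTPError` and `request_with_retries` are hypothetical names, and the exception class is a stdlib-only stand-in for the HTTPError raised by the real HTTP client. The sketch only assumes that a failing request is retried a fixed number of times and that the last provider error is then allowed to propagate.

```python
import time


class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP client's HTTPError."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def request_with_retries(make_request, retries=3, delay=0.0):
    """Call make_request(), retrying on failure.

    Once no retries remain, the provider's own error is re-raised
    instead of being replaced by a generic "retries exceeded" error.
    """
    while True:
        try:
            return make_request()
        except FakeHTTPError:
            if retries <= 0:
                # No retries remaining: let the original error propagate.
                raise
            retries -= 1
            time.sleep(delay)
```

Because the caller receives the original error, it can branch on details such as the status code, which a generic wrapper error would hide.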

Consequently tests were added and updated, but reviewers should weigh in on whether they think there's reason to keep the previous implementation.

Testing Instructions

The easiest way to test this manually is to add a few lines of code to the Phylopic DAG to make it return an incorrect build_param on the first try. I did this by updating the __init__ method to initialize a test variable:

     def __init__(self, *args, **kwargs):
+        self.test = 0

And then further updating the _get_initial_query_params method to reset build_param to a bad value on the first iteration, using that variable:

@@ -76,6 +77,10 @@ class PhylopicDataIngester(ProviderDataIngester):
         self.total_pages = resp.get("totalPages")
         self.build_param = resp.get("build")

+        if self.test == 0:
+            self.build_param = 307
+            self.test += 1

Now try running the Phylopic DAG locally and inspect the logs. You should see a log like this when it tries to fetch a batch with the bad build param:

[2024-03-04, 20:19:25 UTC] {requester.py:85} ERROR - Error with the request for URL: https://api.phylopic.org/images
[2024-03-04, 20:19:25 UTC] {requester.py:86} INFO - HTTPError: 410 Client Error: Gone for url: https://api.phylopic.org/images?build=307&page=0&embed_items=true

You should see the same request retry 3 times (and fail each time).

However the DAG should not fail; you should instead see the build_param is refetched and ingestion starts over, this time successfully. The logs will look like:


[2024-03-04, 20:19:45 UTC] {phylopic.py:61} INFO - Build_param changed from 307 to 321 during ingestion. Restarting ingestion from the beginning.
[2024-03-04, 20:19:45 UTC] {provider_data_ingester.py:228} INFO - Begin ingestion for PhylopicDataIngester

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Mar 4, 2024
@stacimc stacimc self-assigned this Mar 4, 2024
@stacimc stacimc requested a review from a team as a code owner March 4, 2024 20:36
        there are no remaining retries, it will instead raise the error.
        """
        if retries <= 0:
            logger.error("No retries remaining. Failure.")
Suggested change
logger.error("No retries remaining. Failure.")
logger.error("No retries remaining. Failure.")

Staci, the PR description reads like a fascinating story, and the code looks great. Nice that we could enable better error handling in the other providers. I will approve this PR after local testing.

@obulat obulat left a comment

Testing with the testing instructions works well locally. The comments I have inline are non-blocking.

        except HTTPError as error:
            if error.response.status_code == 410:
                # Refetch initial query params; this will update the build_param to the
                # most recent value and reset the `current_page` to 1.
Could this cause an infinite loop when the build_param keeps changing in the middle of the ingestion, and we keep resetting the page to 1? Do we need a mechanism to prevent that?

Good question -- it should not! Only the first call to super().ingest_records() is in the try; if it raises an error again on the second attempt, it will not be caught. I thought it was fair to only retry once, to prevent the infinite loop as you mentioned.

You can verify this using the same code from the testing instructions, but in _get_initial_query_params add:

+        if self.test == 0:
+            self.build_param = 307
+        if self.test == 1:
+            self.build_param = 218
+        self.test += 1

(You need to change the build_param to a different, also incorrect build param on the second run). The DAG errors after both bad build params have been tried.
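
The retry-once behavior discussed in this thread can be sketched as follows. This is a simplified model, not the Phylopic ingester itself: `GoneError`, `SketchIngester`, `fetch_build_param`, and `ingest_all` are hypothetical names, and the API is simulated by a list of build values plus a set of builds that are rejected with a 410. It assumes only what the thread states: the first ingestion attempt is inside the try, the build_param is refetched on a 410, the original error is re-raised if the build did not change, and the second attempt is not caught.

```python
class GoneError(Exception):
    """Hypothetical stand-in for an HTTPError with a 410 response."""

    status_code = 410


class SketchIngester:
    def __init__(self, builds, stale_builds):
        self._builds = iter(builds)      # builds the simulated API reports, in order
        self._stale = set(stale_builds)  # builds the simulated API rejects with 410
        self.build_param = None
        self.restarts = 0

    def fetch_build_param(self):
        self.build_param = next(self._builds)

    def ingest_all(self):
        if self.build_param in self._stale:
            raise GoneError(f"build {self.build_param} is gone")
        return f"ingested with build {self.build_param}"

    def ingest_records(self):
        self.fetch_build_param()
        try:
            return self.ingest_all()
        except GoneError:
            old_build_param = self.build_param
            self.fetch_build_param()
            if old_build_param == self.build_param:
                # The build did not change, so something else is wrong.
                raise
            self.restarts += 1
            # The second attempt is not wrapped in a try: a second
            # failure propagates, so ingestion restarts at most once.
            return self.ingest_all()
```

With builds 307 then 321, the sketch restarts once and succeeds; with 307 then 218 (both rejected), the second error is raised to the caller, which mirrors the DAG failing after both bad build params have been tried.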

Retrying only once makes sense (albeit I don't fully understand what build_param is and how and why it can change during ingestion :) ).

@stacimc, I think this PR also fixes this issue: Bubble up original exception when retries have exceeded

Agreed!! And wow, how simple it was to tackle that ticket and how useful an impact this will have on future failed runs! ✨

@AetherUnbound AetherUnbound left a comment

This looks fantastic and runs exactly as expected! I can't wait for the error handling change that's a part of this too 🚀 I have a couple of nits but nothing to block a merge.

@@ -142,12 +141,15 @@ def _get_set_info(self, set_url):
             set_id = response_json.get("id")
             set_name = response_json.get("name")
             return set_id, set_name
-        except RetriesExceeded:
+        except HTTPError as error:
I am so glad we can do this now!!

Comment on lines +53 to +56
                if old_build_param == self.build_param:
                    # If the build_param could not be updated, there must be another
                    # issue. Raise the original error.
                    raise
Good catch!

@AetherUnbound AetherUnbound linked an issue Mar 8, 2024 that may be closed by this pull request
1 task
@openverse-bot openverse-bot added the 🕹 aspect: interface Concerns end-users' experience with the software label Mar 8, 2024
@stacimc stacimc merged commit d80b590 into main Mar 12, 2024
40 checks passed
@stacimc stacimc deleted the fix/phylopic-updated-build-param branch March 12, 2024 21:57
4 participants