New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover from updated build_param in Phylopic DAG #3874
Conversation
catalog/dags/common/requester.py
Outdated
there are no remaining retries, it will instead raise the error. | ||
""" | ||
if retries <= 0: | ||
logger.error("No retries remaining. Failure.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.error("No retries remaining. Failure.") | |
logger.error("No retries remaining. Failure.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Staci, the PR description reads like a fascinating story, and the code looks great. Nice that we could enable better error handling in the other providers. I will approve this PR after local testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with the testing instructions works well locally. The comments I have inline are non-blocking.
except HTTPError as error: | ||
if error.response.status_code == 410: | ||
# Refetch initial query params; this will update the build_param to the | ||
# most recent value and reset the `current_page` to 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this cause an infinite loop when the build_param
keeps changing in the middle of the ingestion, and we keep resetting the page to 1? Do we need a mechanism to prevent that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question -- it should not! Only the first call to super().ingest_records()
is in the try
; if it raises an error again on the second attempt, it will not be caught. I though it was fair to only retry once, to prevent the infinite loop as you mentioned.
You can verify this using the same code from the testing instructions, but in get_initial_query_params
add:
+ if self.test == 0:
+ self.build_param = 307
+ if self.test == 1:
+ self.build_param = 218
+ self.test += 1
(You need to change the build_param to a different, also incorrect build param on the second run). The DAG errors after both bad build params have been tried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying only once makes sense (albeit I don't fully understand what build_param
is and how and why it can change during ingestion :) ).
@stacimc, I think this PR also fixes this issue: Bubble up original exception when retries have exceeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!! And wow, how simple it was to tackle that ticket and how useful an impact this will have on future failed runs! ✨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fantastic and runs exactly as expected! I can't wait for the error handling change that's a part of this too 🚀 I have a couple of nits but nothing to block a merge.
@@ -142,12 +141,15 @@ def _get_set_info(self, set_url): | |||
set_id = response_json.get("id") | |||
set_name = response_json.get("name") | |||
return set_id, set_name | |||
except RetriesExceeded: | |||
except HTTPError as error: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am so glad we can do this now!!
except HTTPError as error: | ||
if error.response.status_code == 410: | ||
# Refetch initial query params; this will update the build_param to the | ||
# most recent value and reset the `current_page` to 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!! And wow, how simple it was to tackle that ticket and how useful an impact this will have on future failed runs! ✨
if old_build_param == self.build_param: | ||
# If the build_param could not be updated, there must be another | ||
# issue. Raise the original error. | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
Fixes
Fixes #3820 by @AetherUnbound, fixes #1369
Description
#3820 gives an excellent explanation of the problem, but the tldr is that the Phylopic DAG works by first fetching a
build_param
that is used in all subsequent requests for data, and sometimes thisbuild_param
changes while ingestion is underway, causing errors. When that happens we want to identify the issue, fetch the newbuild_param
, and start ingestion over from the top.To accomplish that, this PR ended up making additional changes to the
DelayedRequester
. Currently the requester will, after all configured retries have been exhausted, raise aRetriesExceeded
error with no other context about what the actual error was from the provider API. This PR updates the requester to instead raise the actual error from the API, which I think is more useful information. For example we already have one other provider (Freesound) which was trying to detect and handle specific API errors, and now we can do that.Consequently tests were added and updated, but reviewers should weigh in on whether they think there's reason to keep the previous implementation.
Testing Instructions
The easiest way to test this manually is to add a few lines of code to the Phylopic DAG to make it return an incorrect
build_param
on the first try. I did this by updating the__init__
method to initialize a test variable:And then further updating the
_get_initial_query_params
to resetbuild_param
to a bad value on the first iteration, using the var:Now try running the Phylopic DAG locally and inspect the logs. You should see a log like this when it tries to fetch a batch with the bad build param:
You should see the same request retry 3 times (and fail each time).
However the DAG should not fail; you should instead see the
build_param
is refetched and ingestion starts over, this time successfully. The logs will look like:Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin