This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
Log last query_params hit before AirflowTaskTimeout #1058
Merged
Fixes
Description
The `ProviderDataIngester` includes error handling that, among other things, logs the last `query_params` reached by the DAG before an error is encountered during ingestion, which is helpful for debugging or resuming a failed DAG. `AirflowException`s are exempted from this custom handling, but that means we also don't get the `query_params` logging when a task is stopped by Airflow.

This PR changes nothing about the handling of those exceptions, but just adds a log for the last query params hit before raising. This is useful when a `pull_data` task times out, because we can see exactly where the DAG managed to get to and resume it at that point.

Testing Instructions
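For orientation before testing, here is a rough sketch of the behavior described above. It is not the actual `ProviderDataIngester` code: `SketchIngester`, `fetch_batch`, the page-based `query_params`, and the log wording are all made up for illustration, and the two exception classes are local stand-ins so the snippet runs without Airflow installed.

```python
import logging

logger = logging.getLogger("ingester_sketch")


class AirflowException(Exception):
    """Local stand-in for Airflow's exception type (assumption, not the real class)."""


class AirflowTaskTimeout(AirflowException):
    """Local stand-in for the timeout raised when a task runs too long."""


class SketchIngester:
    """Hypothetical, heavily simplified ingester; not the real ProviderDataIngester."""

    def __init__(self, fetch_batch, max_pages=3):
        # fetch_batch is a hypothetical callable taking a query_params dict.
        self.fetch_batch = fetch_batch
        self.max_pages = max_pages

    def get_next_query_params(self, prev_query_params):
        # Illustrative pagination: start at page 1, then increment.
        page = 1 if prev_query_params is None else prev_query_params["page"] + 1
        return {"page": page}

    def ingest_records(self):
        query_params = None
        for _ in range(self.max_pages):
            query_params = self.get_next_query_params(query_params)
            try:
                self.fetch_batch(query_params)
            except AirflowException:
                # Handling is unchanged: the exception is re-raised as-is.
                # The only addition is a log of where ingestion got to.
                logger.info("Last query_params used: %s", query_params)
                raise
            except Exception as error:
                # Stand-in for the custom handling applied to other errors.
                logger.error("Error with query_params: %s", query_params)
                raise RuntimeError(f"Ingestion failed at {query_params}") from error
```

With this shape, a timeout still propagates unchanged, but the task log records the last parameters used, which is what makes resuming from that point possible.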
Update the `pull_timeout` for a provider ingester to something small. You can do this in the `provider_workflows.py` file or via the Airflow variable as described in #976. I updated the Metropolitan Museum workflow to have a 1 minute pull timeout.

Then run the DAG locally and wait for the `pull_data` step to time out. The task should raise an `AirflowTaskTimeout` as normal, but when you view the task logs you should be able to scroll up and see a log like:

Checklist
Update index.md
).main
) or a parent feature branch.
errors.
Developer Certificate of Origin