Respect ingestion limit in process_batch #818
Conversation
Nice catch! Do you think we'd be able to remove any of the higher level checks if we have this one at the lower level? 😮
Weirdly, I don't think so! Or rather, we could, but while it would stop the ingester from committing any new records past the limit, it wouldn't stop it from trying to fetch additional batches (it would keep fetching and return early each time). The current checks are all needed to make sure ingestion halts immediately. There are two other checks:
It is surprisingly complicated to implement in a way that suits all our providers!
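To illustrate the point above, here is a minimal sketch (hypothetical names, not the actual Openverse ingester code) of why the lower-level check alone can't halt ingestion: `process_batch` can stop committing records, but the surrounding loop would still fetch batch after batch unless a higher-level check breaks out of it.

```python
def process_batch(batch, limit, count):
    # Lower-level check: commit only up to the limit, return how many
    # records were actually committed (a stand-in for the real logic).
    committable = len(batch)
    if limit is not None:
        committable = min(committable, max(limit - count, 0))
    return committable

def ingest_records(get_batch, limit):
    record_count = 0
    while True:
        batch = get_batch()  # still issues an API call every iteration
        if not batch:
            break
        record_count += process_batch(batch, limit, record_count)
        # Higher-level check: without it, once the limit is hit the loop
        # would keep fetching new batches and commit nothing from each.
        if limit is not None and record_count >= limit:
            break
    return record_count

# Simulate a provider API that has 10 batches of 50 records available.
calls = {"fetches": 0}

def get_batch():
    calls["fetches"] += 1
    return [0] * 50 if calls["fetches"] <= 10 else []

total = ingest_records(get_batch, limit=100)
# total == 100 after only 2 fetches; drop the higher-level check and the
# loop would make all 10 fetches while committing nothing past the limit.
```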
LGTM!
Works great! Also, thanks for the context; I was feeling déjà vu, like some work in this area had been done before.
You and me both! 😄
Description
I think this should fix the final big edge case with ingestion limit :)
Before this PR, when an `INGESTION_LIMIT` was set in a provider script, we would cap the `batch_limit` to that value, and then check whether the limit was exceeded after each call to `process_batch`. This leaves an edge case when a provider doesn't pass `batch_limit` to their API, and may get back huge batches that are larger than our limit.

This PR adds one extra check for the `ingestion_limit`, inside `process_batch` itself. Now we will halt processing even within a batch when the limit is reached.

Testing Instructions
Metropolitan is a good example of the buggy behavior. You can observe the bug in action by setting an `INGESTION_LIMIT` Airflow variable to `100` locally and running `metropolitan_workflow`. You'll see that it appears to ignore the limit and continues processing after 100.

Then check out this branch and try again. You could also try verifying that the limit is still respected on other providers like Cleveland.
Checklist
Developer Certificate of Origin