This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Respect ingestion limit in process_batch #818

Merged
stacimc merged 1 commit into main from fix/ingestion-limit-edge-case
Oct 25, 2022

Conversation

stacimc
Contributor

@stacimc stacimc commented Oct 21, 2022

Description

I think this should fix the final big edge case with ingestion limit :)

Before this PR, when an INGESTION_LIMIT was set in a provider script, we capped batch_limit at that value and then checked whether the limit had been exceeded after each call to process_batch. That leaves an edge case: a provider that doesn't pass batch_limit to its API may get back batches far larger than our limit, and the entire batch is processed before the check runs.

This PR adds one extra check for the ingestion_limit, inside process_batch itself. Now we will halt processing even within a batch when the limit is reached.
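For readers skimming the thread, here is a minimal sketch of what an in-batch limit check can look like. It is not the code from this PR: the class `SketchIngester`, its `stored` list, and the way `record_count` is updated by the caller are all assumptions made for illustration.

```python
from typing import List, Optional


class SketchIngester:
    """Hypothetical stand-in for a provider ingester (not the project's class)."""

    def __init__(self, limit: Optional[int] = None):
        self.limit = limit        # ingestion limit; None means "no limit"
        self.record_count = 0     # records committed by earlier batches
        self.stored: List[dict] = []

    def process_batch(self, batch: List[dict]) -> int:
        """Store records from one batch, stopping as soon as the limit is hit."""
        processed = 0
        for record in batch:
            # Check inside the batch, so an oversized batch cannot
            # push the total past the limit.
            if self.limit is not None and self.record_count + processed >= self.limit:
                break
            self.stored.append(record)
            processed += 1
        return processed
```

With a limit of 100 and a single batch of 1,000 records, this sketch stores only the first 100 and returns early, which is the behavior the description above is aiming for.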

Testing Instructions

Metropolitan is a good example of the buggy behavior. You can observe the bug in action by setting an INGESTION_LIMIT Airflow variable to 100 locally and running metropolitan_workflow. You'll see that it appears to ignore the limit and continues processing past 100 records.

Then check out this branch and try again. You could also try verifying that the limit is still respected on other providers like Cleveland.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Oct 21, 2022
@stacimc stacimc added this to In progress in Openverse PRs via automation Oct 21, 2022
@stacimc stacimc requested a review from a team as a code owner October 21, 2022 23:43
@stacimc stacimc self-assigned this Oct 21, 2022
@openverse-bot openverse-bot moved this from In progress to Needs review in Openverse PRs Oct 21, 2022
Contributor

@AetherUnbound AetherUnbound left a comment


Nice catch! Do you think we'd be able to remove any of the higher level checks if we have this one at the lower level? 😮

@stacimc
Contributor Author

stacimc commented Oct 22, 2022

Do you think we'd be able to remove any of the higher level checks if we have this one at the lower level? 😮

Weirdly, I don't think so! Or rather, we could, but while it would stop the ingester from committing any new records past the limit, it wouldn't stop it from trying to fetch additional batches (it would keep fetching and return early each time). The current checks are all needed to make sure ingestion totally halts immediately.

There are two other checks:

  • [1] At the beginning of ingest_records. We need this one to stop the ingester from attempting to start ingestion again in providers that override ingest_records to be called multiple times (like Finnish).
  • [2] At the end of ingest_records -- I thought we might be able to get rid of this one, but we need it to set should_continue to False and stop the loop.

It is surprisingly complicated to implement in a way that suits all our providers!
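To make the interaction between the checks concrete, here is a rough sketch of the loop described above. It is an illustration only: `LoopSketch`, `get_batch`, and `fetch_count` are hypothetical names, and the real ingester may be structured differently.

```python
from typing import Iterable, List, Optional


class LoopSketch:
    """Hypothetical ingester showing the checks around and inside a batch."""

    def __init__(self, batches: Iterable[List[int]], limit: Optional[int] = None):
        self._batches = iter(batches)
        self.limit = limit
        self.record_count = 0
        self.fetch_count = 0  # how many times we asked the "API" for a batch

    def _limit_reached(self) -> bool:
        return self.limit is not None and self.record_count >= self.limit

    def get_batch(self) -> Optional[List[int]]:
        self.fetch_count += 1
        return next(self._batches, None)

    def process_batch(self, batch: List[int]) -> int:
        processed = 0
        for _ in batch:
            # In-batch check: stop partway through an oversized batch.
            if self.limit is not None and self.record_count + processed >= self.limit:
                break
            processed += 1
        return processed

    def ingest_records(self) -> None:
        # Check [1]: do nothing if an earlier call already hit the limit.
        if self._limit_reached():
            return
        should_continue = True
        while should_continue:
            batch = self.get_batch()
            if batch is None:
                break
            self.record_count += self.process_batch(batch)
            # Check [2]: stop the loop so no further batches are fetched.
            if self._limit_reached():
                should_continue = False
```

In this sketch, removing check [2] would leave the in-batch check returning 0 for every later batch while `get_batch` keeps being called, and removing check [1] would trigger one more fetch each time `ingest_records` is called again.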

Contributor

@obulat obulat left a comment


LGTM!

Openverse PRs automation moved this from Needs review to Reviewer approved Oct 25, 2022
Member

@krysal krysal left a comment


Works great! Also, thanks for the context. I was feeling a déjà vu, like some work in this area had been done before.

@stacimc
Contributor Author

stacimc commented Oct 25, 2022

I was feeling a déjà vu, like some work in this area had been done before.

You and me both! 😄

@stacimc stacimc merged commit e9311a3 into main Oct 25, 2022
Openverse PRs automation moved this from Reviewer approved to Merged! Oct 25, 2022
@stacimc stacimc deleted the fix/ingestion-limit-edge-case branch October 25, 2022 19:05

4 participants