Prevent iNaturalist from running alongside any other DAGs #1276

Closed
AetherUnbound opened this issue Mar 1, 2023 · 2 comments · Fixed by #3025
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@AetherUnbound
Contributor

Description

Similar to #1277 in reasoning, iNaturalist can be intensive and disruptive for the other DAGs running on the instance. If possible, we would like to prevent iNaturalist from running while other DAGs are in progress. One way to do this would be to leverage Airflow pools. Currently most of our tasks sit in the default_pool, which has 128 slots (we could increase this to a higher number if desired).

Tasks which use pools can request multiple slots. One way to prevent iNaturalist concurrency would be to set its pool_slots to the maximum size for default_pool, meaning it would require all pool slots to be available. This, paired with a reduced priority weight, would mean that all other tasks would run prior to iNaturalist, and the latter could only run if all slots are available. This would apply per-task, so even if iNaturalist ran a task, as soon as it completed that task it would free up the pool slots for other DAGs.
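
The pool-slot idea can be sketched with a toy model. This is not Airflow's scheduler, only a simplified illustration of the reasoning. The names `Task`, `schedule`, `inaturalist_load`, and `provider_pull` are hypothetical, and the two fields are modeled on Airflow's `pool_slots` and `priority_weight` operator arguments.

```python
from dataclasses import dataclass

POOL_SIZE = 128  # size of default_pool as stated in this issue


@dataclass(frozen=True)
class Task:
    name: str                 # hypothetical task name, for illustration only
    pool_slots: int = 1       # slots the task occupies while it runs
    priority_weight: int = 1  # higher values are considered first


def schedule(queued, running):
    """Return the queued tasks that may start, given the tasks already running."""
    free = POOL_SIZE - sum(t.pool_slots for t in running)
    started = []
    # Consider higher-priority tasks first; sorted() is stable, so ties keep queue order.
    for task in sorted(queued, key=lambda t: -t.priority_weight):
        if task.pool_slots <= free:
            started.append(task)
            free -= task.pool_slots
    return started


# iNaturalist asks for the whole pool and gets the lowest priority (assumed values).
inat = Task("inaturalist_load", pool_slots=POOL_SIZE, priority_weight=1)
other = Task("provider_pull", pool_slots=1, priority_weight=10)
```

In this model, `schedule([inat], [other])` returns an empty list because one slot is occupied, while `schedule([inat], [])` lets the iNaturalist task start once the pool is empty. When both are queued, the other task is picked first and iNaturalist no longer fits.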

Alternatives

Additional context

See #1277 for an additional/alternate option.

@AetherUnbound AetherUnbound added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon help wanted Open to participation from the community 🟧 priority: high Stalls work on the project or its dependents and removed 🟨 priority: medium Not blocking but should be addressed soon labels Mar 1, 2023
@rwidom
Collaborator

rwidom commented Mar 10, 2023

I'm sure that this is much lower priority now, but I have some questions that I didn't want to lose for whenever we get back to this. Mainly, they're about the mapped task that takes the vast majority of the iNaturalist runtime. Are the mapped subtasks going to count as separate tasks where others could go in between? I can't tell if that would be good (just have iNaturalist upsert a few more records whenever there is downtime) or bad (keep the inaturalist schema around almost all the time). If the mapped task counts as one single task, then it might be like just stopping every other dag for two days once a month. Do we have two days a month without any dags running currently? What kinds of changes to batch size and/or timeouts might we want to make as a result of answers to the questions above?

@AetherUnbound
Contributor Author

The tasks themselves would count individually, rather than as a block. I think that should make it easy to run other tasks while iNaturalist is processing that batch of load tasks! I'm less concerned about the space the inaturalist schema takes up in that case, so I think we wouldn't need to make any other changes to other DAGs!
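
A rough sketch of what counting the mapped tasks individually could look like in practice. This is a simplified, assumed model (one scheduling round per tick, every task finishing within its tick, fewer other tasks per tick than pool slots), and `run_order`, `inat_batch_N`, and `other_dag_task` are hypothetical names rather than anything from the catalog code.

```python
from collections import deque


def run_order(inat_batches, other_arrivals):
    """Return the order in which tasks run, one scheduling round per tick.

    inat_batches: number of mapped iNaturalist load tasks, each needing the whole pool.
    other_arrivals: dict mapping tick -> list of task names from other DAGs.
    """
    pending_inat = deque(f"inat_batch_{i}" for i in range(inat_batches))
    last_arrival = max(other_arrivals, default=-1)
    order = []
    tick = 0
    while pending_inat or tick <= last_arrival:
        others = other_arrivals.get(tick, [])
        if others:
            # Other DAGs have higher priority and occupy slots, so iNaturalist waits.
            order.extend(others)
        elif pending_inat:
            # Pool is empty: one mapped task takes every slot, then releases them.
            order.append(pending_inat.popleft())
        tick += 1
    return order


print(run_order(3, {1: ["other_dag_task"]}))
```

Under these assumptions the other DAG's task lands between the first and second iNaturalist batches instead of waiting for the whole mapped task to finish.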

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Mar 27, 2023
@AetherUnbound AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Mar 28, 2023
@openverse-bot openverse-bot added this to Backlog in Openverse Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@rwidom rwidom self-assigned this Sep 13, 2023
Projects
Archived in project
Openverse
  
Backlog

3 participants