Investigate converting iNaturalist to an incremental DAG #1456

Closed
1 task
AetherUnbound opened this issue Aug 23, 2022 · 1 comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

Comments

@AetherUnbound
Contributor

Description

The iNaturalist DAG (#549) is a different beast from all our other provider scripts in that each run processes a massive amount of data. The first run of this DAG is already proving that the process will take quite a while to complete.

The DAG is presently set up to check when the upstream S3 files have been modified and, if they have, re-run the entire ingestion. It would be ideal (and likely more performant) if we could run this update incrementally, i.e. only process the records that have been updated since the last run. This would require either some marker changing upstream (like an observed_on date, for instance), or us keeping track of which records we've processed and performing a diff before ingestion. The latter seems like it might be quite intensive on both the data storage and compute ends, so we should explore the feasibility of the former first.
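To make the two options concrete, here is a minimal sketch in plain Python. It is not the DAG's actual code: the record shape, the function names (`select_by_marker`, `select_by_diff`, `fingerprint`), and the use of `observed_on` as a change marker are all assumptions for illustration. The real ingestion runs against S3 files and a database rather than in-memory lists.

```python
import hashlib
import json
from datetime import date


def select_by_marker(records, last_run, marker="observed_on"):
    """Option 1: keep only records whose upstream marker is newer than the last run.

    Assumes (hypothetically) that the marker changes whenever a record changes.
    """
    return [r for r in records if r[marker] > last_run]


def fingerprint(record):
    """Stable hash of a record's full contents (dates are serialized as strings)."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_by_diff(records, seen):
    """Option 2: compare each record against fingerprints stored from the last run.

    `seen` maps record id -> fingerprint. Returns the new or changed records
    plus the fingerprint map to persist for the next run.
    """
    changed, current = [], {}
    for r in records:
        fp = fingerprint(r)
        current[r["id"]] = fp
        if seen.get(r["id"]) != fp:
            changed.append(r)
    return changed, current


# Hypothetical sample data, not real iNaturalist records.
records = [
    {"id": 1, "observed_on": date(2022, 8, 1), "title": "oak"},
    {"id": 2, "observed_on": date(2022, 8, 20), "title": "fern"},
    {"id": 3, "observed_on": date(2022, 8, 22), "title": "moss"},
]

by_marker = select_by_marker(records, last_run=date(2022, 8, 15))

first_changed, state = select_by_diff(records, seen={})
records[1] = {**records[1], "title": "lady fern"}  # simulate an upstream edit
second_changed, state = select_by_diff(records, seen=state)
```

The diff variant has to hash every record and keep one fingerprint per record between runs, which is the storage and compute cost mentioned above. The marker variant keeps only a single timestamp, but it is only correct if the upstream marker reliably changes on every update.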

Additional context

CC @rwidom

Implementation

  • 🙋 I would be interested in implementing this feature.
@AetherUnbound AetherUnbound added ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🟩 priority: low Low priority and doesn't need to be rushed labels Aug 23, 2022
@AetherUnbound
Contributor Author

Thanks to @rwidom's incredible work on WordPress/openverse-catalog#745 (which has now been run successfully) and the fact that the iNaturalist DAG now also checks for updates to the S3 files before attempting to run again, I think this can be closed. The iNaturalist data doesn't have any date fields we could use to do an incremental update, and even if we operated only on new IDs, we might lose any updates that had happened since the previous run. So we'll leave things as-is for now!
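The concern about operating only on new IDs can be illustrated with a small sketch. The field names (`id`, `license`) and the helper `select_new_ids` are hypothetical, chosen only to show the failure mode, and do not reflect the actual iNaturalist schema.

```python
def select_new_ids(records, processed_ids):
    """Keep only records whose id was not seen in a previous run."""
    return [r for r in records if r["id"] not in processed_ids]


# Hypothetical snapshots of the upstream data at two points in time.
previous = [
    {"id": 1, "license": "cc-by"},
    {"id": 2, "license": "cc-by-nc"},
]
current = [
    {"id": 1, "license": "cc0"},       # existing record, edited upstream
    {"id": 2, "license": "cc-by-nc"},  # unchanged
    {"id": 3, "license": "cc-by"},     # genuinely new
]

processed_ids = {r["id"] for r in previous}
picked = select_new_ids(current, processed_ids)

# Records that changed upstream but are skipped by the new-id filter.
missed = [
    r["id"] for r in current
    if r["id"] in processed_ids and r not in previous
]
```

Here `picked` contains only record 3, while the edit to record 1 is never ingested. Without a date field or a full content diff, there is no way to notice that change.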

@AetherUnbound AetherUnbound closed this as not planned Feb 16, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023