Investigate converting iNaturalist to an incremental DAG #1456
Labels

- 💻 aspect: code — Concerns the software code in the repository
- ✨ goal: improvement — Improvement to an existing user-facing feature
- 🟩 priority: low — Low priority and doesn't need to be rushed
- 🔧 tech: airflow — Involves Apache Airflow
- 🐍 tech: python — Involves Python
Description
The iNaturalist DAG (#549) differs from all our other provider scripts in that each run processes a massive amount of data. The first run of this DAG is already showing that the process will take quite a while to complete.

The DAG is presently set up to check when the upstream S3 files have been modified and then re-run the entire ingestion. It would be ideal (and likely more performant) if we could run this update incrementally, i.e. only process the records that have been updated since the last run. This would require either some upstream marker that changes (like the `observed_on` date, for instance), or keeping track of which records we've already processed and performing a diff before ingestion. The latter seems like it might be quite intensive on both the data-storage and compute end, so we should explore the feasibility of the former first.

Additional context
CC @rwidom
Implementation
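A minimal sketch of the marker-based approach described above, assuming a hypothetical `observed_on`-style column and a `last_run` date that a real DAG would pull from Airflow metadata (e.g. the previous successful DagRun). The table and column names here are illustrative, not the actual iNaturalist schema:

```python
from datetime import date
from typing import Optional


def build_incremental_query(
    table: str, marker_column: str, last_run: Optional[date]
) -> str:
    """Build a SELECT that only pulls records changed since the last run.

    If no prior run is recorded, fall back to a full ingestion so the
    first run still loads everything.
    """
    base = f"SELECT * FROM {table}"
    if last_run is None:
        # First run: no marker recorded yet, so ingest everything.
        return base
    # Incremental run: only records whose marker is newer than the last run.
    return f"{base} WHERE {marker_column} > '{last_run.isoformat()}'"


# Incremental run: restricts ingestion to recently updated records.
incremental = build_incremental_query(
    "inaturalist.observations", "observed_on", date(2022, 8, 1)
)

# First run: no last-run marker, so the query ingests the full table.
full = build_incremental_query("inaturalist.observations", "observed_on", None)
```

This only works if the upstream data actually updates such a column on every change, which is exactly the open question this issue asks us to investigate.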