Investigate converting iNaturalist to an incremental DAG #1456

AetherUnbound · 2022-08-23T18:47:06Z

Description

The iNaturalist DAG (#549) is a different beast from all our other provider scripts in that each run processes a massive amount of data. The first run of this DAG is already proving that the process will take quite a while to complete.

The DAG is currently set up to check whether the upstream S3 files have been modified and, if so, re-run the entire ingestion. It would be ideal (and likely more performant) if we could run this update incrementally, i.e. only process the records that have changed since the last run. This would require either some upstream marker that changes with each update (like the observed_on date, for instance), or us keeping track of which records we've processed and performing a diff before ingestion. The latter seems like it might be quite intensive on both the data storage and compute ends, so we should explore the feasibility of the former first.
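To make the second option concrete, here is a minimal sketch of what "keeping track of what we've processed and diffing" could look like. It is not how the DAG works today: the function names, the `id` field, and the in-memory dict of fingerprints are all assumptions for illustration. It also shows where the cost comes from, since every record has to be hashed on every run and one digest per record has to be stored between runs.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Hash a record's full contents so that any field change alters the digest."""
    # Sorting keys gives a stable serialization regardless of field order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_records(previous: dict, current: list, id_field: str = "id"):
    """Compare the current records against fingerprints saved by the last run.

    `previous` maps record ID (as a string) to the digest stored last time.
    Returns (changed_records, removed_ids, new_fingerprints), where
    changed_records holds records that are new or whose contents differ.
    `id_field` is a hypothetical column name, not the real iNaturalist schema.
    """
    fingerprints = {}
    changed = []
    for record in current:
        key = str(record[id_field])
        digest = record_fingerprint(record)
        fingerprints[key] = digest
        if previous.get(key) != digest:
            changed.append(record)
    # IDs seen last run but absent now were deleted upstream.
    removed = sorted(set(previous) - set(fingerprints))
    return changed, removed, fingerprints


# First run: nothing stored yet, so everything counts as changed.
first = [{"id": 1, "license": "cc-by"}, {"id": 2, "license": "cc0"}]
changed, removed, state = diff_records({}, first)

# Second run: record 1 was edited, record 2 was dropped, record 3 is new.
second = [{"id": 1, "license": "cc-by-nc"}, {"id": 3, "license": "cc0"}]
changed_again, removed_again, state = diff_records(state, second)
```

Comparing content hashes rather than just IDs is what lets the edit to record 1 be caught; a check on new IDs alone would only pick up record 3.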

Additional context

CC @rwidom

Implementation

🙋 I would be interested in implementing this feature.


AetherUnbound · 2023-02-16T23:35:39Z

Thanks to @rwidom's incredible work on WordPress/openverse-catalog#745 (which has now been run successfully) and the fact that the iNaturalist DAG now also checks for updates to the S3 files before attempting to run again, I think this can be closed. The iNaturalist data doesn't have any date fields we could use to do an incremental update, and even if we operated only on new IDs, we might lose any updates that had happened since the previous run. So we'll leave things as-is for now!

rwidom mentioned this issue Sep 28, 2022

iNaturalist in-SQL loading WordPress/openverse-catalog#745

Merged

AetherUnbound closed this as not planned Feb 16, 2023

obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
