Systematically update CC catalog records (original #164) #1749
Labels
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🔧 tech: airflow
Involves Apache Airflow
🐍 tech: python
Involves Python
This issue has been migrated from the CC Search Catalog repository
Currently, when we pull data into the Catalog, it is stored but never refreshed on later pulls from the same source.
We need to discuss how we want to go about maintaining/updating data in the Catalog.
For a description of the strategy for reingestion we're using (but not the implementation), see:
https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/
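As a rough illustration of the general idea behind date-partitioned reingestion (recent upload dates are revisited often, older ones less and less often), here is a minimal sketch. It is not the Catalog's implementation; the function names and the `tiers` values are hypothetical, chosen only for the example.

```python
from datetime import date, timedelta


def reingestion_offsets(tiers):
    """Return day offsets (ages in days) to reingest on a single run.

    `tiers` is a hypothetical list of (gap_days, count) pairs: step back
    `gap_days` at a time, `count` times, then move on to the next tier.
    """
    offsets = []
    current = 0
    for gap_days, count in tiers:
        for _ in range(count):
            current += gap_days
            offsets.append(current)
    return offsets


def reingestion_dates(run_date, tiers):
    """Map the offsets onto concrete upload dates for a given run date."""
    return [run_date - timedelta(days=o) for o in reingestion_offsets(tiers)]


# Assumed example tiers: 3 daily steps, then 2 weekly steps, then 1 monthly step.
EXAMPLE_TIERS = [(1, 3), (7, 2), (30, 1)]

if __name__ == "__main__":
    for d in reingestion_dates(date(2020, 5, 29), EXAMPLE_TIERS):
        print(d.isoformat())
```

Because the same offsets are used on every daily run, a given upload date is reingested exactly when its age matches one of the offsets, so the gap between refreshes grows as the data gets older.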
Original Comments:
annatuma commented on Fri Dec 06 2019:
kgodey commented on Fri Mar 20 2020:
mathemancer commented on Thu May 14 2020:
mathemancer commented on Thu May 14 2020:
implies difficulty. If we need to reingest the entire Met Museum
collection, it's possible to do that at any time.
doing the daily ingestions, and combine them with a monthly 'complete'
ingestion of their entire catalog.
that we'd have to reingest the entire collection to make sure we're in a
consistent state.
source
mathemancer commented on Thu May 14 2020:
mathemancer commented on Fri May 15 2020:
mathemancer commented on Fri May 29 2020:
For the following API providers, reingestion is not necessary because we pull from a delta endpoint (i.e., we partition by date modified rather than date uploaded):
For the following API Providers, we need to create Apache Airflow DAGs implementing the Date-partitioned reingestion strategy:
The only current API provider not mentioned above is Thingiverse, which is BLOCKED by cc-archive/cccatalog#391. Whenever that is implemented, the Thingiverse implementation may or may not need to use a reingestion strategy.
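To illustrate why the two groups of providers are treated differently, here is a small in-memory sketch. The records, field names, and `pull_for_day` helper are made up for the example and do not correspond to any provider's API.

```python
from datetime import date

# Hypothetical provider records: both were uploaded on the same day,
# but record "b" was edited later.
RECORDS = [
    {"id": "a", "uploaded": date(2020, 5, 1), "modified": date(2020, 5, 1)},
    {"id": "b", "uploaded": date(2020, 5, 1), "modified": date(2020, 5, 20)},
]


def pull_for_day(records, day, field):
    """Return the ids a daily pull would see when partitioning on `field`."""
    return [r["id"] for r in records if r[field] == day]


if __name__ == "__main__":
    day = date(2020, 5, 20)
    # Partitioning on date modified picks up the edit to "b" on that day.
    print(pull_for_day(RECORDS, day, "modified"))
    # Partitioning on date uploaded sees nothing new, so the edit is missed
    # unless the 2020-05-01 partition is reingested later.
    print(pull_for_day(RECORDS, day, "uploaded"))
```

With a delta endpoint, changed records resurface in a later daily pull on their own; with an upload-date partition, the only way to pick up the change is to revisit the old date.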
source
mathemancer commented on Fri May 29 2020: