Systematically update CC catalog records #164
Comments
Loosely aiming for Q2 2020 to tackle this.
@mathemancer has a plan for this, starting with #298
There are three fundamentally different types of Provider APIs with regard to
I think we should focus on Wikimedia Commons next, both because it's the only unsolved provider for which this might be challenging, and because we need more data from WMC for Rekognition analysis.
See issue #395 for the Wikimedia Commons plan.
For the following API providers, reingestion is not necessary, because we ingest all their data every time we run:
For the following API providers, reingestion is not necessary, because we pull from a delta endpoint (i.e., we filter on date modified rather than date uploaded):
For the following API providers, we need to create Apache Airflow DAGs implementing the date-partitioned reingestion strategy:
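A delta endpoint makes explicit reingestion unnecessary because records edited after their original upload fall back into the query window and get re-fetched on a later run. A minimal sketch, assuming a hypothetical provider that accepts `modified_since`/`modified_before` query parameters (illustrative names, not any real provider's API):

```python
from datetime import date

def build_delta_params(last_run: date, today: date) -> dict:
    """Query parameters covering everything modified since the last run.

    Because the window is keyed on *date modified*, records edited after
    their original upload are re-fetched and refreshed automatically.
    """
    return {
        "modified_since": last_run.isoformat(),
        "modified_before": today.isoformat(),
    }

params = build_delta_params(date(2020, 4, 1), date(2020, 4, 8))
# params == {"modified_since": "2020-04-01", "modified_before": "2020-04-08"}
```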
The only current API provider not mentioned above is Thingiverse, which is BLOCKED by #391. Once that is resolved, the implementation may or may not need to use a reingestion strategy.
Moving forward, we should use the reingestion strategy from the outset for any new Provider API Script that uses the date-partitioning strategy for its initial ingestion.
Currently, when we pull data into the Catalog, it is stored, but never refreshed on future pulls from those sources.
We need to discuss how we want to go about maintaining/updating data in the Catalog.
For a description of the strategy for reingestion we're using (but not the implementation), see:
https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/
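The core of that strategy can be sketched in a few lines: every daily run ingests a fixed set of day-offsets, spaced so that recent dates are refreshed often and older dates progressively less often, keeping the per-run workload constant. The tier sizes below are illustrative, not the catalog's actual configuration:

```python
def reingestion_offsets(tiers):
    """Day-offsets (days before today) that every daily run reingests.

    tiers is a list of (spacing, count) pairs, newest first. With the
    example tiers below, a given date is ingested on its own day, then
    daily for 6 days, then weekly for 12 weeks, then every 30 days --
    so recent records stay fresh while older ones still get refreshed.
    """
    offsets = [0]
    last = 0
    for spacing, count in tiers:
        for _ in range(count):
            last += spacing
            offsets.append(last)
    return offsets

# Illustrative tiers: 6 daily, 12 weekly, 12 thirty-day slots.
EXAMPLE_TIERS = [(1, 6), (7, 12), (30, 12)]
offsets = reingestion_offsets(EXAMPLE_TIERS)
# offsets[:8] == [0, 1, 2, 3, 4, 5, 6, 13] -- 31 dates reingested per run
```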