
Systematically update CC catalog records (original #164) #1749

Closed
2 of 4 tasks
obulat opened this issue Apr 21, 2021 · 1 comment
Labels
🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

obulat commented Apr 21, 2021

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Fri Jun 14 2019
Labels: 🙅 status: discontinued

Currently, when we pull data into the Catalog, it is stored, but never refreshed on future pulls from those sources.

We need to discuss how we want to go about maintaining/updating data in the Catalog.

For a description of the strategy for reingestion we're using (but not the implementation), see:
https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/
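
To make the strategy concrete, here is a minimal sketch (with hypothetical tier sizes and function names, not the actual catalog code) of how a single day's date-partitioned reingestion workload could be computed, following the blog post's idea of refreshing recent upload dates often and older ones progressively less often:

```python
from datetime import date, timedelta

# Hypothetical tiers: (spacing in days, number of partitions at that spacing).
# Recent upload dates are refreshed daily, older ones weekly, then monthly.
REINGESTION_TIERS = [(1, 7), (7, 8), (30, 12)]

def dates_to_reingest(today: date) -> list[date]:
    """Return the upload-date partitions to refresh on `today`."""
    dates = []
    offset = 0
    for spacing, count in REINGESTION_TIERS:
        for _ in range(count):
            offset += spacing
            dates.append(today - timedelta(days=offset))
    return dates

# Example: 27 partition dates, reaching back roughly a year.
for d in dates_to_reingest(date(2021, 4, 21)):
    print(d)
```

The real workflows distribute these partitions across Airflow tasks; the tier sizes here are illustrative only.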


Original Comments:

annatuma commented on Fri Dec 06 2019:

Loosely aiming for Q2 2020 to tackle this.
source

kgodey commented on Fri Mar 20 2020:

@mathemancer has a plan for this, starting with cc-archive/cccatalog#298
source

mathemancer commented on Thu May 14 2020:

#298 has been implemented (see the PR cc-archive/cccatalog#394).
source

mathemancer commented on Thu May 14 2020:

There are three fundamentally different types of Provider APIs with regard to this issue:

  1. APIs that let us ingest metadata related to objects uploaded on a given date.
    • For these providers, a strategy like the one outlined in cc-archive/cccatalog#298 is necessary.
    • Of the scripts currently implemented, these are:
      • Flickr
      • Met Museum
      • PhyloPic
      • Wikimedia Commons
    • From that list, only Flickr and Wikimedia Commons are at a scale that
      implies difficulty. If we need to reingest the entire Met Museum
      collection, it's possible to do that at any time.
    • For the 'smaller' providers, the easiest solution would be to continue
      doing the daily ingestions, and combine them with a monthly 'complete'
      ingestion of their entire catalog (a rough sketch follows this comment).
  2. APIs for which we ingest the entire collection every time we ingest from it.
    • For these providers, the problem is already solved.
  3. APIs that offer a 'delta' endpoint that we use (currently none).
    • This would be tricky, since if we miss ingesting for some time, it may be
      that we'd have to reingest the entire collection to make sure we're in a
      consistent state.

source
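
As a rough illustration of the daily-plus-monthly combination suggested for the smaller type-1 providers, the dispatch logic could look like this sketch (the ingest functions are hypothetical placeholders, not the catalog's actual provider scripts):

```python
from datetime import date, timedelta

def ingest_date_partition(day: date) -> None:
    """Hypothetical: pull metadata for objects uploaded on `day`."""
    print(f"ingesting partition {day}")

def ingest_full_collection() -> None:
    """Hypothetical: walk the provider's entire (small) catalog."""
    print("ingesting full collection")

def run_daily(today: date) -> None:
    # Regular daily pull of yesterday's uploads.
    ingest_date_partition(today - timedelta(days=1))
    # Once a month, also re-walk the whole collection to pick up edits
    # and deletions that a date-of-upload query would never see.
    if today.day == 1:
        ingest_full_collection()
```

For a collection the size of the Met Museum's, which can be reingested in full at any time, the monthly complete pass makes a finer-grained reingestion schedule unnecessary.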

mathemancer commented on Thu May 14 2020:

I think we should focus on Wikimedia Commons next, both because it's the only provider that isn't solved for which this might be challenging, and because we need more data from WMC for Rekognition analysis.
source

mathemancer commented on Fri May 15 2020:

See issue cc-archive/cccatalog#395 for the Wikimedia Commons plan.
source

mathemancer commented on Fri May 29 2020:

For the following API providers, reingestion is not necessary, because we ingest all their data every time we run:

  • Cleveland Museum
  • RawPixel
  • Smithsonian Institution

For the following API providers, reingestion is not necessary, because we pull from a delta endpoint (i.e., we're filtering on date modified, rather than date uploaded):

  • PhyloPic

For the following API providers, we need to create Apache Airflow DAGs implementing the date-partitioned reingestion strategy (a rough DAG sketch follows this comment):

The only current API provider not mentioned above is Thingiverse, which is BLOCKED by cc-archive/cccatalog#391. Whenever that is implemented, the implementation may or may not need to use a reingestion strategy.
source
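
For the providers that do need date-partitioned reingestion, a minimal Airflow 2 sketch of such a DAG might look like the following; the DAG id, task ids, and ingestion callable are illustrative assumptions, not the actual openverse-catalog workflows:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_partition(provider: str, day: str) -> None:
    """Hypothetical stand-in for a provider script's ingestion entry point."""
    print(f"{provider}: ingesting records uploaded on {day}")

with DAG(
    dag_id="flickr_reingestion_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per reingestion offset; a real workflow would fan out over
    # the full tiered schedule (daily, weekly, monthly partitions).
    for offset in (1, 7, 30):
        PythonOperator(
            task_id=f"reingest_{offset}_days_ago",
            python_callable=ingest_partition,
            op_kwargs={
                "provider": "flickr",
                # `ds` is the DAG run's logical date, e.g. "2020-01-08".
                "day": "{{ macros.ds_add(ds, -%d) }}" % offset,
            },
        )
```

Each daily run then refreshes several past upload-date partitions alongside the newest one, which is the essence of the strategy described in the blog post linked above.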

mathemancer commented on Fri May 29 2020:

Moving forward, we should use the reingestion strategy out of the gate for any new Provider API Script, whenever it uses the date-partitioning strategy to ingest in the first place.
source

@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python 🔧 tech: airflow Involves Apache Airflow labels Jan 25, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@krysal krysal added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Feb 27, 2023
AetherUnbound commented:

All of the aforementioned providers now have reingestion workflows associated with each of them (with the exception of those that pull modified records instead of created records on each run, thus getting any modified data automatically). This can be closed!

@obulat obulat removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Mar 1, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023