
Systematically update CC catalog records (original #164) #1749

Closed
2 of 4 tasks
obulat opened this issue Apr 21, 2021 · 1 comment
Labels
🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

obulat commented Apr 21, 2021

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Fri Jun 14 2019
Labels: 🙅 status: discontinued

Currently, when we pull data into the Catalog, it is stored, but never refreshed on future pulls from those sources.

We need to discuss how we want to go about maintaining/updating data in the Catalog.

For a description of the strategy for reingestion we're using (but not the implementation), see:
https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/
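
To make the strategy concrete, here is a minimal sketch (with hypothetical tier sizes and function names, not the actual catalog code) of how a single day's date-partitioned reingestion workload could be computed, following the blog post's idea of refreshing recent upload dates often and older ones progressively less often:

```python
from datetime import date, timedelta

# Hypothetical tiers: (spacing in days, number of partitions at that spacing).
# Recent upload dates are refreshed daily, older ones weekly, then monthly.
REINGESTION_TIERS = [(1, 7), (7, 8), (30, 12)]

def dates_to_reingest(today: date) -> list[date]:
    """Return the upload-date partitions to refresh on `today`."""
    dates = []
    offset = 0
    for spacing, count in REINGESTION_TIERS:
        for _ in range(count):
            offset += spacing
            dates.append(today - timedelta(days=offset))
    return dates

# Example: 27 partition dates, reaching back roughly a year.
for d in dates_to_reingest(date(2021, 4, 21)):
    print(d)
```

The real workflows distribute these partitions across Airflow tasks; the tier sizes here are illustrative only.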


Original Comments:

annatuma commented on Fri Dec 06 2019:

Loosely aiming for Q2 2020 to tackle this.
source

kgodey commented on Fri Mar 20 2020:

@mathemancer has a plan for this, starting with cc-archive/cccatalog#298
source

mathemancer commented on Thu May 14 2020:

#298 has been implemented (see the PR cc-archive/cccatalog#394).
source

mathemancer commented on Thu May 14 2020:

There are three fundamentally different types of Provider APIs with regard to this issue:

  1. APIs that let us ingest metadata related to objects uploaded on a given date.
    • For these providers, a strategy like the one outlined in cc-archive/cccatalog#298 is necessary.
    • Of the scripts currently implemented, these are:
      • Flickr
      • Met Museum
      • PhyloPic
      • Wikimedia Commons
    • From that list, only Flickr and Wikimedia Commons are at a scale that
      implies difficulty. If we need to reingest the entire Met Museum
      collection, it's possible to do that at any time.
    • For the 'smaller' providers, the easiest solution would be to continue
      doing the daily ingestions, and combine them with a monthly 'complete'
      ingestion of their entire catalog (a rough sketch follows this comment).
  2. APIs for which we ingest the entire collection every time we ingest from it.
    • For these providers, the problem is already solved.
  3. APIs that offer a 'delta' endpoint that we use (currently none).
    • This would be tricky, since if we miss ingesting for some time, it may be
      that we'd have to reingest the entire collection to make sure we're in a
      consistent state.

source
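
As a rough illustration of the daily-plus-monthly combination suggested for the smaller type-1 providers, the dispatch logic could look like this sketch (the ingest functions are hypothetical placeholders, not the catalog's actual provider scripts):

```python
from datetime import date, timedelta

def ingest_date_partition(day: date) -> None:
    """Hypothetical: pull metadata for objects uploaded on `day`."""
    print(f"ingesting partition {day}")

def ingest_full_collection() -> None:
    """Hypothetical: walk the provider's entire (small) catalog."""
    print("ingesting full collection")

def run_daily(today: date) -> None:
    # Regular daily pull of yesterday's uploads.
    ingest_date_partition(today - timedelta(days=1))
    # Once a month, also re-walk the whole collection to pick up edits
    # and deletions that a date-of-upload query would never see.
    if today.day == 1:
        ingest_full_collection()
```

For a collection the size of the Met Museum's, which can be reingested in full at any time, the monthly complete pass makes a finer-grained reingestion schedule unnecessary.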

mathemancer commented on Thu May 14 2020:

I think we should focus on Wikimedia Commons next, both because it's the only provider that isn't solved for which this might be challenging, and because we need more data from WMC for Rekognition analysis.
source

mathemancer commented on Fri May 15 2020:

See issue cc-archive/cccatalog#395 for the Wikimedia Commons plan.
source

mathemancer commented on Fri May 29 2020:

For the following API providers, reingestion is not necessary, because we ingest all their data every time we run:

  • Cleveland Museum
  • RawPixel
  • Smithsonian Institution

For the following API providers, reingestion is not necessary, because we pull from a delta endpoint (i.e., we're filtering on date modified, rather than date uploaded):

  • PhyloPic

For the following API providers, we need to create Apache Airflow DAGs implementing the date-partitioned reingestion strategy (a rough DAG sketch follows this comment):

The only current API provider not mentioned above is Thingiverse, which is BLOCKED by cc-archive/cccatalog#391. Whenever that is implemented, the implementation may or may not need to use a reingestion strategy.
source
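
For the providers that do need date-partitioned reingestion, a minimal Airflow 2 sketch of such a DAG might look like the following; the DAG id, task ids, and ingestion callable are illustrative assumptions, not the actual openverse-catalog workflows:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_partition(provider: str, day: str) -> None:
    """Hypothetical stand-in for a provider script's ingestion entry point."""
    print(f"{provider}: ingesting records uploaded on {day}")

with DAG(
    dag_id="flickr_reingestion_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per reingestion offset; a real workflow would fan out over
    # the full tiered schedule (daily, weekly, monthly partitions).
    for offset in (1, 7, 30):
        PythonOperator(
            task_id=f"reingest_{offset}_days_ago",
            python_callable=ingest_partition,
            op_kwargs={
                "provider": "flickr",
                # `ds` is the DAG run's logical date, e.g. "2020-01-08".
                "day": "{{ macros.ds_add(ds, -%d) }}" % offset,
            },
        )
```

Each daily run then refreshes several past upload-date partitions alongside the newest one, which is the essence of the strategy described in the blog post linked above.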

mathemancer commented on Fri May 29 2020:

Moving forward, we should use the reingestion strategy out of the gate for any new Provider API Script, whenever it uses the date-partitioning strategy to ingest in the first place.
source

@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python 🔧 tech: airflow Involves Apache Airflow labels Jan 25, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@krysal krysal added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Feb 27, 2023
AetherUnbound commented:

All of the aforementioned providers now have reingestion workflows associated with each of them (with the exception of those that pull modified records instead of created records on each run, thus getting any modified data automatically). This can be closed!

@obulat obulat removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Mar 1, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023