This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Systematically update CC catalog records #164

Closed
kgodey opened this issue Jun 14, 2019 · 8 comments
Labels
🙅 status: discontinued Not suitable for work as repo is in maintenance

Comments

@kgodey
Contributor

kgodey commented Jun 14, 2019

Currently, when we pull data into the Catalog, it is stored, but never refreshed on future pulls from those sources.

We need to discuss how we want to go about maintaining/updating data in the Catalog.

For a description of the strategy for reingestion we're using (but not the implementation), see:
https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/

@kgodey kgodey created this issue from a note in Backlog (Next Sprint) Jun 14, 2019
@kgodey kgodey added this to To do in Active Sprint via automation Jun 14, 2019
@kgodey kgodey removed this from Next Sprint in Backlog Jun 14, 2019
@janetpkr janetpkr assigned kgodey and sarahpearson and unassigned janetpkr Jul 3, 2019
@kgodey kgodey added this to the Backlog milestone Jul 11, 2019
@kgodey kgodey removed this from To do in Active Sprint Jul 12, 2019
@kgodey kgodey modified the milestones: Backlog, Q3 Sprint 3 Jul 18, 2019
@kgodey kgodey modified the milestones: Q3 Sprint 3, Backlog Aug 12, 2019
@annatuma annatuma added this to To Be Prioritized in Backlog Nov 14, 2019
@annatuma annatuma removed this from the Backlog milestone Dec 5, 2019
@annatuma annatuma assigned annatuma and mathemancer and unassigned kgodey Dec 5, 2019
@annatuma annatuma moved this from To Be Prioritized to Q2 2020 in Backlog Dec 5, 2019
@annatuma
Contributor

annatuma commented Dec 5, 2019

Loosely aiming for Q2 2020 to tackle this.

@kgodey
Contributor Author

kgodey commented Mar 19, 2020

@mathemancer has a plan for this, starting with #298

@kgodey kgodey changed the title Discuss and decide how to systematically update CC catalog records Systematically update CC catalog records Apr 2, 2020
@kgodey kgodey moved this from Q2 2020 to Next Sprint in Backlog Apr 2, 2020
@kgodey kgodey moved this from Next Sprint to Q2 2020 in Backlog Apr 3, 2020
@kgodey kgodey moved this from Q2 2020 to Next Sprint in Backlog Apr 10, 2020
@kgodey kgodey removed this from Next Sprint in Backlog Apr 17, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Apr 17, 2020
@kgodey kgodey moved this from Ready for Development to Ticket Work Required in Active Sprint Apr 30, 2020
@mathemancer
Contributor

#298 has been implemented (see PR #394).

@mathemancer
Contributor

There are three fundamentally different types of Provider APIs with regard to
this issue:

  1. APIs that let us ingest metadata related to objects uploaded on a given date.
    • For these providers, a strategy like the one outlined in [Infrastructure] Implement Reingestion Strategy for Flickr #298 is necessary.
    • Of the scripts currently implemented, these are:
      • Flickr
      • Met Museum
      • PhyloPic
      • Wikimedia Commons
    • From that list, only Flickr and Wikimedia Commons are at a scale that
      implies difficulty. If we need to reingest the entire Met Museum
      collection, it's possible to do that at any time.
    • For the 'smaller' providers, the easiest solution would be to continue
      doing the daily ingestions, and combine them with a monthly 'complete'
      ingestion of their entire catalog.
  2. APIs for which we ingest the entire collection every time we ingest from it.
    • For these providers, the problem is already solved.
  3. APIs that offer a 'delta' endpoint that we use (currently none).
    • This would be tricky, since if we miss ingesting for some time, it may be
      that we'd have to reingest the entire collection to make sure we're in a
      consistent state.
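
The three cases above can be sketched as a small lookup that maps a provider to the jobs needed to keep its records fresh. This is only an illustration, not the Catalog's implementation: `ApiType`, `LARGE_PROVIDERS`, and `refresh_plan` are hypothetical names, and the job descriptions are plain strings rather than real workflows.

```python
from enum import Enum
from typing import List


class ApiType(Enum):
    DATE_PARTITIONED = 1  # type 1: metadata queried by upload date
    FULL_COLLECTION = 2   # type 2: entire collection ingested on every run
    DELTA = 3             # type 3: endpoint returns changes since the last pull


# Assumption: only these providers are too large to reingest in full at will.
LARGE_PROVIDERS = {"Flickr", "Wikimedia Commons"}


def refresh_plan(provider: str, api_type: ApiType) -> List[str]:
    """Return the (hypothetical) jobs needed to keep a provider up to date."""
    if api_type is ApiType.FULL_COLLECTION:
        # Every run already refreshes every record.
        return ["full ingestion on every run"]
    if api_type is ApiType.DELTA:
        # A missed run may leave gaps, so fall back to a full reingestion.
        return ["delta ingestion on every run",
                "full reingestion after any missed run"]
    if provider in LARGE_PROVIDERS:
        return ["daily ingestion by upload date",
                "date-partitioned reingestion"]
    # 'Smaller' date-partitioned providers can simply be re-pulled in full.
    return ["daily ingestion by upload date", "monthly full ingestion"]


print(refresh_plan("Met Museum", ApiType.DATE_PARTITIONED))
```

Keeping the size distinction in a single set means a provider that outgrows the monthly full ingestion only needs to be moved into `LARGE_PROVIDERS`.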

@mathemancer
Contributor

I think we should focus on Wikimedia Commons next, both because it's the only unsolved provider for which this might be challenging, and because we need more data from WMC for Rekognition analysis.

@mathemancer
Contributor

See issue #395 for the Wikimedia Commons plan.

@mathemancer
Contributor

mathemancer commented May 29, 2020

For the following API providers, reingestion is not necessary, because we ingest all their data every time we run:

  • Cleveland Museum
  • RawPixel
  • Smithsonian Institution

For the following API providers, reingestion is not necessary, because we pull from a delta endpoint (i.e., we're separating on date modified, rather than uploaded):

  • PhyloPic

For the following API Providers, we need to create Apache Airflow DAGs implementing the Date-partitioned reingestion strategy:

The only current API provider not mentioned above is Thingiverse, which is blocked by #391. Once that is implemented, the implementation may or may not need to use a reingestion strategy.

@mathemancer
Contributor

Moving forward, we should use the reingestion strategy out of the gate for any new Provider API Script, whenever it uses the date-partitioning strategy to ingest in the first place.
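
As a rough illustration of what a date-partitioned reingestion run might select, the sketch below lists the upload dates to re-pull on a given day, with gaps that widen further into the past. The function name `reingestion_dates` and the tier values are assumptions made for this example only; the actual schedule is described in the blog post linked at the top of this issue.

```python
from datetime import date, timedelta
from typing import List, Tuple


def reingestion_dates(today: date, tiers: List[Tuple[int, int]]) -> List[date]:
    """List upload dates to reingest, stepping back from `today`.

    Each tier is (step_days, count): take `count` dates spaced
    `step_days` apart, continuing from where the previous tier ended.
    """
    dates = []
    offset = 0
    for step_days, count in tiers:
        for _ in range(count):
            offset += step_days
            dates.append(today - timedelta(days=offset))
    return dates


# Hypothetical tiers: each of the last 7 days, then 3 dates one week apart.
example = reingestion_dates(date(2020, 5, 29), [(1, 7), (7, 3)])
print(example[0], example[-1], len(example))
```

Because recent dates are revisited often and older dates rarely, the number of requests per run stays bounded no matter how far back the provider's collection goes.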

@kgodey kgodey moved this from Ticket Work Required to In Progress in Active Sprint Aug 24, 2020
@kgodey kgodey removed the meta label Sep 22, 2020
@kgodey kgodey moved this from In Progress to [TEMPORARY] Deprioritize in Active Sprint Oct 27, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@TimidRobot TimidRobot removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 9, 2020
@TimidRobot TimidRobot added this to Pending Review in Backlog via automation Dec 9, 2020
@kgodey kgodey added 🙅 status: discontinued Not suitable for work as repo is in maintenance and removed 🏷 status: label work required Needs proper labelling before it can be worked on labels Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020
@kgodey kgodey moved this from Pending Review to Done in Backlog Dec 16, 2020

6 participants