This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

[Infrastructure] Implement Reingestion Strategy for Met Museum #413

Closed
mathemancer opened this issue May 29, 2020 · 2 comments

Comments

@mathemancer
Contributor

Current Situation

Currently, we generally ingest metadata for a given image only once, just after
it is uploaded to the Met Museum's website. This is insufficient for the
reasons outlined in #298: in short, some relevant metadata goes stale over time.

Suggested Improvement

We need to use a date-partitioned reingestion Apache Airflow DAG to keep the
data up to date over time. For examples, see:

  • src/cc_catalog_airflow/dags/flickr_ingestion_workflow.py, and
  • src/cc_catalog_airflow/dags/wikimedia_ingestion_workflow.py.

For the Met Museum, it's not clear yet how many days' worth of uploads we can
ingest per day. We need to:

  1. Determine (on average) how many days' worth of uploaded metadata we can
     ingest from their API per day.
  2. Come up with an ingestion schedule similar to the ones described in
     #298 ([Infrastructure] Implement Reingestion Strategy for Flickr) and
     #395 ([Infrastructure] Implement Reingestion Strategy for Wikimedia Commons).
  3. Use the same general structure as the ingestion workflows listed above to
     implement reingestion on that schedule; a sketch follows this list.
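
For illustration, here is a minimal sketch of what such a date-partitioned
reingestion DAG might look like. The DAG id, the partitioning tiers, and the
provider callable are placeholders invented for this example; a real
implementation should go through the dag_factory interface from #394 (see
Additional context below).

```python
# Sketch only: names, tiers, and the provider callable are placeholders,
# not the actual dag_factory API from #394.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def reingest_met_data(day_shift, execution_date=None, **_):
    # Placeholder provider call: re-pull Met metadata for the day that is
    # `day_shift` days before the execution date. (Airflow supplies
    # execution_date via the task context.)
    target_day = execution_date - timedelta(days=day_shift)
    print(f"Reingesting Met Museum records for {target_day:%Y-%m-%d}")


# Example tiers: recent days refreshed daily, older days less often.
# Step 1 above (measuring API throughput) determines the real numbers.
DAY_SHIFTS = (
    list(range(0, 7))           # the last week, revisited every day
    + list(range(7, 97, 15))    # then every 15th day out to ~3 months
    + list(range(97, 367, 30))  # then every 30th day out to ~1 year
)

with DAG(
    dag_id="met_museum_reingestion_workflow",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    for shift in DAY_SHIFTS:
        PythonOperator(
            task_id=f"reingest_day_shift_{shift}",
            python_callable=reingest_met_data,
            op_kwargs={"day_shift": shift},
        )
```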

Benefit

Implementing this will allow us to manage data that needs to be refreshed over
time for the Met Museum and catch any additions or deletions when they happen.
In the future, this would also allow us to ingest popularity data that changes
over time.

Additional context

The implementer is advised to use the dag_factory implemented in #394, as well
as its helper function for calculating the lists of day shifts (a sketch of
such a helper is below). It would be a good idea to follow the general pattern
used in the implementation of #394, just with different parameters. Doing so
will make this a quick and easy implementation.
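
For reference, a hedged sketch of the kind of day-shift helper meant above.
The actual helper in #394 may differ in name and signature; this version only
illustrates the idea of turning (interval, count) pairs into progressively
coarser reingestion tiers.

```python
# Illustrative only: the real helper in dag_factory (#394) may differ.
def get_day_shift_lists(spec):
    """Return one list of day shifts per (interval_days, num_partitions) pair.

    Example: [(1, 7), (15, 6)] -> [[1, 2, 3, 4, 5, 6, 7],
                                   [22, 37, 52, 67, 82, 97]]
    """
    shift_lists = []
    start = 0
    for interval_days, num_partitions in spec:
        shifts = [start + interval_days * (i + 1) for i in range(num_partitions)]
        shift_lists.append(shifts)
        start = shifts[-1]
    return shift_lists


if __name__ == "__main__":
    # Revisit each of the last 7 days daily, then six 15-day steps.
    print(get_day_shift_lists([(1, 7), (15, 6)]))
```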

@kgodey kgodey added this to Pending Review in Backlog May 29, 2020
@kgodey kgodey moved this from Pending Review to Q2 2020 in Backlog Jun 4, 2020
@annatuma annatuma moved this from Q2 2020 to Q3 2020 in Backlog Jun 12, 2020
@kss682
Contributor

kss682 commented Jul 15, 2020

@mathemancer @annatuma
Here are the object counts the Met Museum API returned for various metadataDate values over the last six months:

| Date       | Object count |
| ---------- | ------------ |
| 2020-01-01 | 474441       |
| 2020-03-01 | 474441       |
| 2020-03-31 | 119361       |
| 2020-04-30 | 62519        |
| 2020-05-15 | 55163        |
| 2020-05-30 | 44238        |
| 2020-06-14 | 30850        |
| 2020-06-29 | 20343        |

In the Met Museum API, the metadataDate query parameter returns all objects updated after the given date, so moving the date back yields a cumulatively increasing set of records.
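
For anyone who wants to reproduce the table above, a small sketch against the
Met's public collection API (the endpoint and metadataDate parameter are from
their published API; the dates are just examples from the table):

```python
import requests

ENDPOINT = "https://collectionapi.metmuseum.org/public/collection/v1/objects"

for metadata_date in ("2020-01-01", "2020-03-31", "2020-06-29"):
    response = requests.get(ENDPOINT, params={"metadataDate": metadata_date})
    response.raise_for_status()
    # "total" counts every object updated after metadata_date, so the
    # number grows as the date moves further back.
    print(f'{metadata_date}: {response.json()["total"]} objects updated since')
```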

In that case, once we have ingested all the data, we don't need to go back in time to refresh it, since the daily crawl automatically picks up everything updated since the previous run. (Is this correct?)

@mathemancer
Contributor Author

That's correct. Your research shows this ticket is unneeded. Thanks for pointing that out!

@kgodey kgodey removed this from Q3 2020 in Backlog Jul 16, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Jul 16, 2020
@kgodey kgodey moved this from Ready for Development to Done in Active Sprint Jul 16, 2020
@TimidRobot TimidRobot removed this from Done in Active Sprint Jan 12, 2022