This repository has been archived by the owner on Jan 13, 2022. It is now read-only.
[Infrastructure] Implement Reingestion Strategy for Met Museum #413
Comments
@mathemancer @annatuma
In Met Museum, once we get all the data, we don't need to go back in date to refresh it, as the daily crawl would automatically refresh it. (Is this correct?)
That's correct. Your research shows this ticket is unneeded. Thanks for pointing that out!
Current Situation
Currently, we generally only get information about a given image just after it
is uploaded to the Met Museum's website. This is insufficient for the reasons
outlined in #298. To summarize, some relevant metadata goes stale over time.
Suggested Improvement
We need to use a date-partitioned reingestion Apache Airflow DAG to keep the
data up to date over time. For examples, see
src/cc_catalog_airflow/dags/flickr_ingestion_workflow.py and
src/cc_catalog_airflow/dags/wikimedia_ingestion_workflow.py.
For the Met Museum, it's not clear yet how many days' worth of uploads we can
ingest from their API per day. We need to:
- determine how many days' worth of uploads we can ingest from their API per day (cf. [Infrastructure] Implement Reingestion strategy for Wikimedia Commons #395), and
- implement reingestion on that schedule.
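As a rough illustration of what a date-partitioned reingestion schedule looks like, here is a minimal sketch of a day-shift calculator. The function name and the tier parameters are hypothetical, not the actual helper from this repository: the idea is that recent records are reingested frequently and older records progressively less often, so that each daily run only touches a bounded number of past dates.

```python
def day_shift_list(tiers):
    """Build a list of day offsets to reingest on each run.

    tiers: list of (step_in_days, number_of_partitions) pairs,
    ordered from most to least frequent. For example, reingest
    some recent days daily, older data weekly, and the oldest
    data monthly. Names and parameters here are illustrative only.
    """
    shifts = []
    current = 0
    for step, count in tiers:
        for _ in range(count):
            current += step
            shifts.append(current)
    return shifts


# 6 daily partitions, then 4 weekly, then 3 roughly-monthly ones:
print(day_shift_list([(1, 6), (7, 4), (30, 3)]))
# → [1, 2, 3, 4, 5, 6, 13, 20, 27, 34, 64, 94, 124]
```

With a schedule like this, a run on any given day reingests metadata for 13 past dates in total, which is the kind of bound that would need to be tuned to whatever the Met Museum API can sustain per day.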
Benefit
Implementing this will allow us to manage data that needs to be refreshed over
time for the Met Museum and catch any additions or deletions when they happen.
In the future, this would also allow us to ingest popularity data that changes
over time.
Additional context
The implementer is advised to use the dag_factory implemented in #394, as
well as the helper function to calculate the lists of day shifts. It would be a
good idea to follow the general pattern used in the implementation of #394, just
with different parameters. Doing so will make this a quick and easy
implementation.
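To make the factory pattern concrete, here is a minimal, framework-free sketch of the idea: given a list of day shifts, produce one ingestion task per shift, each pulling metadata for the execution date minus its shift. All names here (`make_ingestion_tasks`, `ingest_fn`) are hypothetical stand-ins, not the actual dag_factory API from #394, which wires equivalent tasks into Airflow DAGs.

```python
from datetime import date, timedelta


def make_ingestion_tasks(day_shifts, ingest_fn):
    """Return one callable per day shift.

    Each callable takes the run's execution date and invokes
    ingest_fn for (execution_date - shift days). In the real
    dag_factory pattern, each of these would become a task in
    its own date-partitioned Airflow DAG.
    """
    def make_task(shift):
        def task(execution_date):
            target = execution_date - timedelta(days=shift)
            return ingest_fn(target)
        return task

    return [make_task(shift) for shift in day_shifts]


# Example: "ingest" just reports the target date as a string.
tasks = make_ingestion_tasks([1, 7], lambda d: d.isoformat())
print(tasks[0](date(2021, 1, 10)))  # → 2021-01-09
print(tasks[1](date(2021, 1, 10)))  # → 2021-01-03
```

Reusing this pattern with Met-specific parameters (the day-shift tiers chosen after the API-capacity investigation) is what should make the implementation quick.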