[Infrastructure] Implement reingestion strategy for Europeana #412
Comments
Strategy
Distribution of items over the years
No items were uploaded before 2013.
Calculation of days of ingestion per day: let x be the maximum number of items created on a single day. From the number of requests needed to get one day's data, we can estimate the number of historical days that can be ingested per day. From the above distribution we can see that 2018-2019 had the maximum number of items created: 5,033,245.
I have doubts regarding this approach, since there may have been bulk uploads during the initial stages of uploading (2013, 2014, etc.), and there might be days when the total of created items is greater than the estimated average.
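The capacity arithmetic described above can be sketched as follows. This is a minimal illustration; the per-request page size and daily request budget are placeholder assumptions, not confirmed Europeana API limits:

```python
import math

# Assumed values -- placeholders for illustration, not confirmed limits.
MAX_ITEMS_PER_DAY = 5_033_245 // 365  # rough per-day average for 2018-2019
ROWS_PER_REQUEST = 100                # assumed API page size
DAILY_REQUEST_BUDGET = 4_000          # assumed requests we can afford per day


def requests_for_one_day(items_created: int,
                         rows_per_request: int = ROWS_PER_REQUEST) -> int:
    """Requests needed to pull all items created on a single day."""
    return math.ceil(items_created / rows_per_request)


def days_ingestible_per_day(items_created: int,
                            budget: int = DAILY_REQUEST_BUDGET) -> int:
    """How many historical days we can reingest within one day's budget."""
    return budget // requests_for_one_day(items_created)
```

The doubt raised above matters here: if a bulk-upload day holds far more items than `MAX_ITEMS_PER_DAY`, `requests_for_one_day` grows and `days_ingestible_per_day` shrinks accordingly.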
It turns out Europeana's first upload date is earlier than expected. The strategy in the previous comment was based on the assumption of a roughly uniform distribution of data, with a maximum of 20,000 uploads per day. That assumption does not hold, as we can see that the maximum uploads per day exceed one hundred thousand.
Current Situation
Once #368 is merged, we will be ingesting the previous day's uploads' metadata
each day for Europeana. But, we won't be reingesting the metadata for a given
image at any point (at least not automatically).
Suggested Improvement
We need to use a date-partitioned reingestion Apache Airflow DAG to keep the
data up to date over time. For examples, see:
- src/cc_catalog_airflow/dags/flickr_ingestion_workflow.py, and
- src/cc_catalog_airflow/dags/wikimedia_ingestion_workflow.py.

For Europeana, it's not clear yet how many days' worth of uploads we can ingest
per day. We need to:
- determine how many days' worth of uploads we can ingest from their API per day (as was done for [Infrastructure] Implement Reingestion strategy for Wikimedia Commons #395),
- choose a reingestion schedule accordingly, and
- implement reingestion on that schedule.
Because Europeana is relatively low volume, we should try to reingest all of
their data at least once every month or two.
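A schedule like the one described (every item refreshed at least once every month or two) can be sketched as a list of day-shift groups, loosely mirroring the helper used for the Flickr workflow. The function name and parameters below are illustrative, not the repository's actual API:

```python
def day_shift_lists(days_per_run: int, total_days: int) -> list[list[int]]:
    """Partition the day shifts 1..total_days into groups, where each
    group is reingested on a single daily DAG run. One full pass over
    the groups revisits every historical day in the window."""
    shifts = list(range(1, total_days + 1))
    return [
        shifts[i : i + days_per_run]
        for i in range(0, len(shifts), days_per_run)
    ]


# Example: a 60-day window with 5 day-shifts per daily run means the
# whole window is reingested every 12 days, well inside a month.
groups = day_shift_lists(5, 60)
```

Tuning `days_per_run` against the per-day ingestion capacity estimated earlier is what determines whether the monthly target is achievable.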
Benefit
Implementing this will allow us to manage data that needs to be refreshed over
time for Europeana, and catch any additions or deletions when they happen. In
the future, this would also allow us to ingest popularity data that changes over
time.
Additional context
There is a dag_factory
implemented in Add Date Partitioned Flickr reingestion workflow #394, as well as a helper function to calculate the lists of day shifts. It
would be a good idea to follow the general pattern used in the implementation
of Add Date Partitioned Flickr reingestion workflow #394, just with different parameters. Doing so will make this a quick and
easy implementation.