This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

[Infrastructure] Implement reingestion strategy for Europeana #412

Closed
mathemancer opened this issue May 29, 2020 · 2 comments · Fixed by #473
Labels: help wanted (Open to participation from the community)

Comments

mathemancer (Contributor) commented May 29, 2020

Current Situation

Once #368 is merged, we will ingest the metadata for the previous day's uploads
each day for Europeana. However, we won't automatically reingest the metadata
for a given image at any later point.

Suggested Improvement

We need to use a date-partitioned reingestion Apache Airflow DAG to keep the
data up to date over time. For examples, see:

  • src/cc_catalog_airflow/dags/flickr_ingestion_workflow.py, and
  • src/cc_catalog_airflow/dags/wikimedia_ingestion_workflow.py.

For Europeana, it's not clear yet how many days' worth of uploads we can ingest
per day. We need to:

  1. Determine (on average) how many days' worth of uploaded metadata we can
     ingest from their API per day.
  2. Come up with an ingestion schedule similar to the ones described in
     [Infrastructure] Implement Reingestion Strategy for Flickr #298 and
     [Infrastructure] Implement Reingestion strategy for Wikimedia Commons #395.
  3. Use the same general structure as the ingestion workflows listed above to
     implement reingestion on that schedule; a rough sketch of one possible
     schedule follows this list.
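As a minimal sketch of what such a tiered, date-partitioned schedule might look like (the tier sizes below are illustrative assumptions, not the project's actual numbers; the real structure is in the Flickr and Wikimedia workflow files listed above):

```python
# Sketch of a tiered reingestion schedule: recent days are reingested
# daily, older data less and less often. Tier sizes are illustrative.
from datetime import datetime, timedelta


def build_reingestion_day_lists(tiers):
    """
    Given (days_apart, number_of_partitions) pairs, return a list of
    lists of "days ago" offsets. Each inner list is one tier of dates
    to reingest on a given DAG run.
    """
    day_lists = []
    current_day = 0
    for days_apart, partitions in tiers:
        tier = []
        for _ in range(partitions):
            current_day += days_apart
            tier.append(current_day)
        day_lists.append(tier)
    return day_lists


# Illustrative: the last week daily, ~6 months at 7-day spacing,
# then ~5 years of older data at 30-day spacing.
SCHEDULE = build_reingestion_day_lists([(1, 7), (7, 25), (30, 60)])

if __name__ == "__main__":
    today = datetime(2020, 7, 1)
    for tier in SCHEDULE:
        print([(today - timedelta(days=d)).strftime("%Y-%m-%d") for d in tier])
```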

Because Europeana is relatively low volume, we should try to reingest all of
their data at least once every month or two.

Benefit

Implementing this will allow us to manage data that needs to be refreshed over
time for Europeana, and catch any additions or deletions when they happen. In
the future, this would also allow us to ingest popularity data that changes over
time.

Additional context

@kgodey kgodey added this to Pending Review in Backlog May 29, 2020
@kgodey kgodey moved this from Pending Review to Q2 2020 in Backlog Jun 4, 2020
@annatuma annatuma moved this from Q2 2020 to Q3 2020 in Backlog Jun 12, 2020
@mathemancer mathemancer added help wanted Open to participation from the community and removed awaiting triage labels Jun 16, 2020
@kgodey kgodey moved this from Q3 2020 to Next Sprint in Backlog Jun 25, 2020
@kgodey kgodey removed this from Next Sprint in Backlog Jun 26, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Jun 26, 2020
kss682 (Contributor) commented Jul 20, 2020

Strategy

Distribution of items over the years

| start_time           | end_time             | count      |
| -------------------- | -------------------- | ---------- |
| 2013-01-01T00:00:00Z | 2014-01-01T00:00:00Z |    201,172 |
| 2014-01-01T00:00:00Z | 2015-01-01T00:00:00Z |  3,575,843 |
| 2015-01-01T00:00:00Z | 2016-01-01T00:00:00Z |  1,531,808 |
| 2016-01-01T00:00:00Z | 2016-12-31T00:00:00Z |  2,515,575 |
| 2017-01-01T00:00:00Z | 2018-01-01T00:00:00Z |    944,019 |
| 2018-01-01T00:00:00Z | 2019-01-01T00:00:00Z |  5,033,245 |
| 2019-01-01T00:00:00Z | 2020-01-01T00:00:00Z |  4,562,373 |
| Total                |                      | 18,364,035 |

No items were uploaded before 2013.
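For reference, a sketch of how the per-year counts above might be gathered. It assumes the Europeana Search API's search.json endpoint and the timestamp_created query field from the research in #279; EUROPEANA_API_KEY is a placeholder, and the exact field name should be treated as an assumption.

```python
# Count Europeana items created in a time window via the Search API,
# using rows=0 so only the totalResults count is returned.
import requests

API_URL = "https://api.europeana.eu/record/v2/search.json"
API_KEY = "EUROPEANA_API_KEY"  # placeholder, not a real key


def count_items(start_time, end_time):
    params = {
        "wskey": API_KEY,
        "query": f"timestamp_created:[{start_time} TO {end_time}]",
        "rows": 0,
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return response.json()["totalResults"]


if __name__ == "__main__":
    for year in range(2013, 2020):
        start = f"{year}-01-01T00:00:00Z"
        end = f"{year + 1}-01-01T00:00:00Z"
        print(year, count_items(start, end))
```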

Calculation of days of ingestion per day

Let x be the maximum number of items created on a single day.
We can make at most 10,000 requests per day, with at most 100 items per request (based on the research in the Europeana ticket #279).

Requests needed to get one day's data: x / 100

Number of days that can be ingested per day: 10,000 / (x / 100) = (10,000 × 100) / x

From the distribution above, the year 2018-2019 had the maximum number of items created: 5,033,245. That averages out to about 13,790 items created per day. Assuming this average holds across all years, and adding some buffer, let's take 20,000 items per day as a base.

Number of days ingested per day = (10,000 × 100) / 20,000 = 50 days
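As a quick check of that arithmetic (the 20,000-items-per-day base is the assumption made above):

```python
# Capacity arithmetic from the comment above.
MAX_REQUESTS_PER_DAY = 10_000
ITEMS_PER_REQUEST = 100
ASSUMED_ITEMS_PER_DAY = 20_000  # ~13,790 daily average plus buffer

requests_per_ingested_day = ASSUMED_ITEMS_PER_DAY / ITEMS_PER_REQUEST  # 200
days_per_day = MAX_REQUESTS_PER_DAY / requests_per_ingested_day
print(days_per_day)  # 50.0
```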

I have doubts about this approach, since there is a possibility of bulk uploads during the initial stages of uploading (2013, 2014, etc.), and there may be days on which the number of created items is far greater than the estimated average.

@mathemancer

kss682 (Contributor) commented Jul 21, 2020

So, Europeana's first upload date is Nov 21, 2013. Running a script to get the daily created-item count from that start date to the end date (Dec 31, 2019) gave some interesting results: they do not create or upload data on a daily basis. The script checked 2231 days, and data was created on only 561 of them; on every other day the item count is 0.

(Screenshot, 2020-07-21: daily created-item counts from the script, showing large spikes on a small number of days.)

The strategy in the previous comment was based on the assumption of a roughly uniform distribution of data, with a maximum upload of 20,000 items per day. That assumption does not hold: the maximum upload on a single day goes above one hundred thousand items.
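A sketch of the kind of day-by-day scan described above; it reuses the hypothetical count_items() helper from the earlier per-year sketch and flags the bulk-upload days that break the uniform-distribution assumption:

```python
# Day-by-day scan sketch; count_items() is the hypothetical helper
# defined in the earlier per-year count sketch.
from datetime import date, timedelta

START = date(2013, 11, 21)  # Europeana's first upload date
END = date(2019, 12, 31)

nonzero_days = 0
day = START
while day <= END:
    next_day = day + timedelta(days=1)
    count = count_items(f"{day}T00:00:00Z", f"{next_day}T00:00:00Z")
    if count > 0:
        nonzero_days += 1
    if count > 100_000:
        # These spikes far exceed the 20,000/day base assumed earlier.
        print(f"{day}: {count} items created (bulk upload)")
    day = next_day

print(f"{nonzero_days} days with data out of {(END - START).days + 1}")
```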

@kgodey kgodey moved this from Ready for Development to In Progress (Community) in Active Sprint Jul 24, 2020
@kss682 kss682 mentioned this issue Jul 27, 2020
Active Sprint automation moved this from In Progress (Community) to Done Aug 3, 2020
@TimidRobot TimidRobot removed this from Done in Active Sprint Jan 12, 2022