
[Infrastructure] Implement Reingestion Strategy for Flickr #298

Closed

mathemancer opened this issue Feb 28, 2020 · 4 comments · Fixed by #394
@mathemancer (Contributor) commented Feb 28, 2020

Current Situation

Currently, we generally only get information about a given image when it is first uploaded to one of our larger providers. For popularity data, this won't suffice; it's not very meaningful to know how many views an image had just after it was uploaded. Thus, we have come up with a strategy that will let us regularly update info about images uploaded in the past, with a preference towards 'freshening' the information about recently uploaded images. This will also (in the future) allow us to remove from CC Search images that were taken down at the source.

Suggested Improvement

Goal

In order to keep the data in CC Catalog up to date, it is necessary to 'recheck' the information for each image periodically. This makes tracking metrics that change over time, e.g., 'views', more sensible. It also gives us the possibility of removing images from the index that were removed from the source for some reason.

Ingestion Strategy -- broadly speaking

We would prefer to reingest the information for newer images more frequently, and the information for older images less frequently. This is because we assume the information about an image is updated at the source in more interesting ways while the image is new. For example, assume a picture is viewed 100 times per month.

 month | total views | % increase
-------|-------------|------------
   1   |     100     |  infinite
   2   |     200     |    100%
   3   |     300     |     50%
   4   |     400     |     33%
   5   |     500     |     25%
   6   |     600     |     20%
   7   |     700     |     17%
   8   |     800     |     14%
   9   |     900     |     13%
  10   |    1000     |     11%
  11   |    1100     |     10%
  12   |    1200     |      9%

As we see, given consistent monthly views, the 'percent increase' of the total views metric drops off as the picture ages: with constant monthly views, month m adds 1/(m-1) of the prior total, so the relative increase shrinks toward zero. (In reality, it appears that in most cases, pictures get most of their views while they are new.)

Thus, it makes sense to focus more on keeping the information up-to-date for the most recently uploaded images.

Metric: days ingested per day of processing

The basic thing to consider when trying to figure out a proper strategy for keeping image data up-to-date via reingestion is: For what percentage of the Provider's collection can we reingest the image data per day? This tells us how sparsely we need to spread out reingestion of image data. For example, if we can only reingest 1% per day, then we'd expect the mean time between reingestions of the data for a given image to be about 100 days. Since we ingest from most providers based on the date an image was uploaded, a convenient approximation of this is: How many days' worth of uploaded data can we ingest per day?

Example: Flickr

For Flickr, we can ingest about 100 days of uploaded image data per day. (This was calculated using the year 2016 as an example. Because 2016 was around the peak for the number of images uploaded to Flickr per day, the actual number of days of data we can ingest per day is likely higher.) This means it takes around 3.65 days to ingest the data for all images uploaded to Flickr in a year.

Algorithm to choose which days to (re)ingest

Basically, we assume we'll ingest some number n of days' worth of data each day. We set some maximum number of days D we're willing to wait between reingestions of the data for a given image, subject to the constraint that we need to have nD > x, where x is the total number of days of data we eventually want to ingest. If we have 'wiggle room' (e.g., if nD > x + y for some y >= 1), we use it to create a 'smooth increase' in the number of days between ingestions, with perhaps the first reingestion happening when the image is only 30 days old, rather than D days old. A rough sketch of this idea is shown below.
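To make the tiered idea concrete, here is a minimal Python sketch (illustrative only, not the final implementation; the function name and the (interval, count) pair format are made up for this example). It produces the list of day-offsets whose data we would (re)ingest on any given day:

```python
# Illustrative sketch only -- not the final implementation.
# `schedule` is a list of (interval_in_days, number_of_reingestions) pairs,
# ordered from the most frequent tier to the least frequent one.
# Returns the day-offsets to (re)ingest each day, starting with 0
# (the current date's data).

def reingestion_day_offsets(schedule):
    offsets = [0]
    for interval, count in schedule:
        base = offsets[-1]  # each tier starts where the previous one ended
        offsets.extend(base + interval * (i + 1) for i in range(count))
    return offsets
```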

Example: Flickr calculations/options

If we want to reingest the data for an image at least every 6 months, the best we can do (optimizing for the total number of days ingested eventually) is to reingest the data once every 6 months. This means we'd be able to reingest data that is up to 49.5 years old (assuming that we can ingest 100 days per day, and noting that we need to also ingest the current date's data). In months, that means we'd ingest the data for an image when it is

0, 6, 12, 18, ..., or 594

months old. Now, noticing that Flickr has only existed since 2004, we could modify our strategy to reingest their data when it is

0, 1, 2, ..., 23, 24, 27, 30, ..., 129, 132, 138, 144, ..., or 366

months old. We've 'sacrificed' reingesting (non-existent) data at the old end of our ingestion window, but we'd still get all data newer than 30.5 years (rolling) at least once every 6 months, on the following schedule:

  • Newer than two years: every month
  • 2 to 11 years old: every 3 months
  • 11 to 30.5 years old: every 6 months

So, using this strategy, we ensure that all data is updated at least every 6 months, with a preference towards data about pictures uploaded within the last two years. Because this covers 30.5 years back in time, this strategy would suffice to reingest all relevant Flickr data for the next 15 years or so (the current date is 2020).
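Plugging the Flickr plan into the hypothetical sketch above (using 30-day 'months'), the three tiers are monthly for 24 reingestions, every 3 months for 36 more, and every 6 months for the final 39:

```python
offsets = reingestion_day_offsets([(30, 24), (90, 36), (180, 39)])
assert len(offsets) == 100     # one offset per day of data we can ingest daily
assert offsets[-1] == 10980    # 366 thirty-day months, i.e. ~30 years back
```

This confirms the plan consumes exactly the 100 days-per-day of ingestion capacity estimated above.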

Benefit

Implementing this will allow us to manage data that needs to be refreshed over time for Flickr (e.g., popularity data such as views, comments, or likes). This is important for ranking in CC Search, and choosing images for the AWS Imagine Grant.

Alternatives

  • I'm open to suggestions from Community members.
  • We've also looked at using CommonCrawl data for this sort of repetitive updating.

Additional context

  • My assumption here is that reingesting the data for an image at a rate slower than once per year isn't sufficient. If that holds true, it puts a major constraint on the ratio of ingestion speed to overall collection volume for the Provider API.
@mathemancer (Contributor, Author)

We're very open to comments / criticisms / suggestions on this Issue until we begin coding on it.

Note that this is a much longer issue description than would usually be appropriate. This is because we copied a Wiki page here for visibility.

@mathemancer mathemancer added this to To Be Prioritized in Backlog via automation Feb 28, 2020
@mathemancer mathemancer moved this from To Be Prioritized to Next Sprint in Backlog Feb 28, 2020
@kgodey kgodey removed this from Next Sprint in Backlog Mar 23, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Mar 23, 2020
@kgodey kgodey moved this from Ready for Development to Ticket Work Required in Active Sprint Mar 23, 2020
@kgodey (Contributor) commented Mar 23, 2020

We're only doing Flickr for now. @mathemancer will update the ticket accordingly.

@mathemancer mathemancer changed the title [Infrastructure] Implement Reingestion Strategy for Wikimedia Commons or Flickr [Infrastructure] Implement Reingestion Strategy for Flickr Mar 25, 2020
@mathemancer (Contributor, Author)

Note for actual implementation: '30 days' is a perfectly acceptable substitute for 'month' in the description above. In fact it's better, because you don't have to decide which date counts as 'one month before', e.g., one month before March 31st.

@kgodey kgodey added help wanted Open to participation from the community and removed not ready for work labels Apr 3, 2020
@kgodey kgodey removed this from Ticket Work Required in Active Sprint Apr 3, 2020
@kgodey kgodey added this to Pending Review in Backlog via automation Apr 3, 2020
@kgodey kgodey moved this from Pending Review to Q2 2020 in Backlog Apr 3, 2020
@mathemancer (Contributor, Author)

The suggested mechanism for re-running the ingestion for a given date would be to ask Airflow (either from a cron job, or from another DAG) to rerun the script for that date. This could be done either by asking for a backfill of that date, or by clearing the already-existing run for that date.
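As a sketch of the clearing approach (assuming the Airflow 1.10-era CLI; the DAG id below is hypothetical, made up for illustration), a cron job or helper could shell out to the Airflow CLI like this:

```python
# Hypothetical sketch: ask Airflow to re-run an ingestion DAG for one date by
# clearing any existing run for that date, so the scheduler picks it up again.
# Assumes the Airflow 1.10 CLI is on PATH; the DAG id is made up.
import subprocess
from datetime import date

def reingest_date(dag_id: str, target: date) -> None:
    day = target.isoformat()
    subprocess.run(
        [
            "airflow", "clear",
            "--no_confirm",
            "--start_date", day,
            "--end_date", day,
            dag_id,
        ],
        check=True,
    )

# e.g., re-run the (hypothetical) Flickr workflow for images uploaded 2016-01-01:
# reingest_date("flickr_workflow", date(2016, 1, 1))
```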

@kgodey kgodey moved this from Q2 2020 to Next Sprint in Backlog Apr 30, 2020
@kgodey kgodey removed this from Next Sprint in Backlog Apr 30, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Apr 30, 2020
@mathemancer mathemancer moved this from Ready for Development to In Progress in Active Sprint May 11, 2020
@mathemancer mathemancer added in progress and removed help wanted Open to participation from the community labels May 14, 2020