
[Infrastructure] Implement Reingestion Strategy for Flickr #298

Closed

mathemancer opened this issue Feb 28, 2020 · 4 comments · Fixed by #394
@mathemancer (Contributor) commented Feb 28, 2020

Current Situation

Currently, we generally only get information about a given image when it is first uploaded to one of our larger providers. For popularity data, this won't suffice; it's not very meaningful to know how many views an image had just after it was uploaded. Thus, we have come up with a strategy that will let us regularly update info about images uploaded in the past, with a preference towards 'freshening' the information about recently uploaded images. This will also (in the future) allow us to remove from CC Search images that were taken down at the source.

Suggested Improvement

Goal

In order to keep the data in CC Catalog up to date, it is necessary to 'recheck' the information for each image periodically. This makes tracking metrics that change over time, e.g., 'views', more sensible. It also gives us the possibility of removing images from the index that were removed from the source for some reason.

Ingestion Strategy -- broadly speaking

We would prefer to reingest the information for newer images more frequently, and the information for older images less frequently. This is because we assume the information about an image is updated at the source in more interesting ways while the image is new. For example, assume a picture is viewed 100 times per month.

 month | total views | % increase
-------|-------------|------------
   1   |     100     |  infinite
   2   |     200     |    100%
   3   |     300     |     50%
   4   |     400     |     33%
   5   |     500     |     25%
   6   |     600     |     20%
   7   |     700     |     17%
   8   |     800     |     14%
   9   |     900     |     13%
  10   |    1000     |     11%
  11   |    1100     |     10%
  12   |    1200     |      9%

As we see, given consistent monthly views, the 'percent increase' of the total views metric drops off as the picture ages: with constant monthly views, month m adds 1/(m-1) of the prior total, so the relative increase shrinks toward zero. (In reality, it appears that in most cases, pictures get most of their views while they are new.)

Thus, it makes sense to focus more on keeping the information up-to-date for the most recently uploaded images.

Metric: days ingested per day of processing

The basic thing to consider when trying to figure out a proper strategy for keeping image data up-to-date via reingestion is: For what percentage of the Provider's collection can we reingest the image data per day? This tells us how sparsely we need to spread out reingestion of image data. For example, if we can only reingest 1% per day, then we'd expect the mean time between reingestions of the data for a given image to be about 100 days. Since we ingest from most providers based on the date an image was uploaded, a convenient approximation of this is: How many days' worth of uploaded data can we ingest per day?

Example: Flickr

For Flickr, we can ingest about 100 days of uploaded image data per day. (This was calculated using the year 2016 as an example. Because 2016 was around the peak for the number of images uploaded to Flickr per day, the actual number of days of data we can ingest per day is likely higher.) This means it takes around 3.65 days to ingest the data for all images uploaded to Flickr in a year.

Algorithm to choose which days to (re)ingest

Basically, we assume we'll ingest some number n of days' worth of data each day. We set some maximum number of days D we're willing to wait between reingestions of the data for a given image, subject to the constraint that we need to have nD > x, where x is the total number of days of data we eventually want to ingest. If we have 'wiggle room' (e.g., if nD > x + y for some y >= 1), we use it to create a 'smooth increase' in the number of days between ingestions, with perhaps the first reingestion happening when the image is only 30 days old, rather than D days old. A rough sketch of this idea is shown below.
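To make the tiered idea concrete, here is a minimal Python sketch (illustrative only, not the final implementation; the function name and the (interval, count) pair format are made up for this example). It produces the list of day-offsets whose data we would (re)ingest on any given day:

```python
# Illustrative sketch only -- not the final implementation.
# `schedule` is a list of (interval_in_days, number_of_reingestions) pairs,
# ordered from the most frequent tier to the least frequent one.
# Returns the day-offsets to (re)ingest each day, starting with 0
# (the current date's data).

def reingestion_day_offsets(schedule):
    offsets = [0]
    for interval, count in schedule:
        base = offsets[-1]  # each tier starts where the previous one ended
        offsets.extend(base + interval * (i + 1) for i in range(count))
    return offsets
```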

Example: Flickr calculations/options

If we want to reingest the data for an image at least every 6 months, the best we can do (optimizing for the total number of days ingested eventually) is to reingest the data once every 6 months. This means we'd be able to reingest data that is up to 49.5 years old (assuming that we can ingest 100 days per day, and noting that we need to also ingest the current date's data). In months, that means we'd ingest the data for an image when it is

0, 6, 12, 18, ..., or 594

months old. Now, noticing that Flickr has only existed since 2004, we could modify our strategy to reingest their data when it is

0, 1, 2, ..., 23, 24, 27, 30, ..., 129, 132, 138, 144, ..., or 366

months old. We've 'sacrificed' reingesting (non-existent) data at the old end of our ingestion window, but we'd still get all data newer than 30.5 years (rolling) at least once every 6 months, on the following schedule:

  • Newer than two years: every month
  • 2 to 11 years old: every 3 months
  • 11 to 30.5 years old: every 6 months

So, using this strategy, we ensure that all data is updated at least every 6 months, with a preference towards data about pictures uploaded within the last two years. Because this covers 30.5 years back in time, this strategy would suffice to reingest all relevant Flickr data for the next 15 years or so (the current date is 2020).
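Plugging the Flickr plan into the hypothetical sketch above (using 30-day 'months'), the three tiers are monthly for 24 reingestions, every 3 months for 36 more, and every 6 months for the final 39:

```python
offsets = reingestion_day_offsets([(30, 24), (90, 36), (180, 39)])
assert len(offsets) == 100     # one offset per day of data we can ingest daily
assert offsets[-1] == 10980    # 366 thirty-day months, i.e. ~30 years back
```

This confirms the plan consumes exactly the 100 days-per-day of ingestion capacity estimated above.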

Benefit

Implementing this will allow us to manage data that needs to be refreshed over time for Flickr (e.g., popularity data such as views, comments, or likes). This is important for ranking in CC Search, and choosing images for the AWS Imagine Grant.

Alternatives

  • I'm open to suggestions from Community members.
  • We've also looked at using CommonCrawl data for this sort of repetitive updating.

Additional context

  • My assumption here is that reingesting the data for an image at a rate slower than once per year isn't sufficient. If that holds true, it puts a major constraint on the ratio of ingestion speed to overall collection volume for the Provider API.
@mathemancer (Contributor, Author)

We're very open to comments / criticisms / suggestions on this Issue until we begin coding on it.

Note that this is a much longer issue description than would usually be appropriate. This is because we copied a Wiki page here for visibility.

@mathemancer mathemancer added this to To Be Prioritized in Backlog via automation Feb 28, 2020
@mathemancer mathemancer moved this from To Be Prioritized to Next Sprint in Backlog Feb 28, 2020
@kgodey kgodey removed this from Next Sprint in Backlog Mar 23, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Mar 23, 2020
@kgodey kgodey moved this from Ready for Development to Ticket Work Required in Active Sprint Mar 23, 2020
@kgodey (Contributor) commented Mar 23, 2020

We're only doing Flickr for now. @mathemancer will update the ticket accordingly.

@mathemancer mathemancer changed the title [Infrastructure] Implement Reingestion Strategy for Wikimedia Commons or Flickr [Infrastructure] Implement Reingestion Strategy for Flickr Mar 25, 2020
@mathemancer (Contributor, Author)

Note for actual implementation: '30 days' is a perfectly acceptable substitute for 'month' in the description above. In fact it's better, because you don't have to decide which date counts as 'one month before', e.g., one month before March 31st.

@kgodey kgodey added help wanted Open to participation from the community and removed not ready for work labels Apr 3, 2020
@kgodey kgodey removed this from Ticket Work Required in Active Sprint Apr 3, 2020
@kgodey kgodey added this to Pending Review in Backlog via automation Apr 3, 2020
@kgodey kgodey moved this from Pending Review to Q2 2020 in Backlog Apr 3, 2020
@mathemancer (Contributor, Author)

The suggested mechanism for re-running the ingestion for a given date would be to ask Airflow (either from a cron job, or from another DAG) to rerun the script for that date. This could be done either by asking for a backfill of that date, or by clearing the already-existing run for that date.
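As a sketch of the clearing approach (assuming the Airflow 1.10-era CLI; the DAG id below is hypothetical, made up for illustration), a cron job or helper could shell out to the Airflow CLI like this:

```python
# Hypothetical sketch: ask Airflow to re-run an ingestion DAG for one date by
# clearing any existing run for that date, so the scheduler picks it up again.
# Assumes the Airflow 1.10 CLI is on PATH; the DAG id is made up.
import subprocess
from datetime import date

def reingest_date(dag_id: str, target: date) -> None:
    day = target.isoformat()
    subprocess.run(
        [
            "airflow", "clear",
            "--no_confirm",
            "--start_date", day,
            "--end_date", day,
            dag_id,
        ],
        check=True,
    )

# e.g., re-run the (hypothetical) Flickr workflow for images uploaded 2016-01-01:
# reingest_date("flickr_workflow", date(2016, 1, 1))
```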

@kgodey kgodey moved this from Q2 2020 to Next Sprint in Backlog Apr 30, 2020
@kgodey kgodey removed this from Next Sprint in Backlog Apr 30, 2020
@kgodey kgodey added this to Ready for Development in Active Sprint via automation Apr 30, 2020
@mathemancer mathemancer moved this from Ready for Development to In Progress in Active Sprint May 11, 2020
@mathemancer mathemancer added in progress and removed help wanted Open to participation from the community labels May 14, 2020