This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Come up with a solution for consuming crawler events #457

Closed
aldenstpage opened this issue Jul 8, 2020 · 0 comments
Labels
✨ goal: improvement Improvement to an existing feature 🙅 status: discontinued Not suitable for work as repo is in maintenance

Comments

@aldenstpage
Contributor

aldenstpage commented Jul 8, 2020

We're reading metadata from images at large scale and writing it to several Kafka topics. We should start incorporating this data into the data layer so that CC Search can use it. The format of the data is documented here. In summary:

  • We record the dimensions, file size, and compression rate of images in the image_metadata_updates topic.
  • In some cases we are able to extract EXIF metadata, which also goes into the image_metadata_updates topic.
  • We record 404s in the link_rot topic.

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
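To make the streaming preference concrete, here is a minimal sketch of a consumer loop that handles events one at a time as they arrive instead of reading a topic in batches. Only the two topic names come from this issue. The field names (`identifier`, `width`, `height`), the `dead_link` flag, and the in-memory `store` are hypothetical placeholders, since the real message schema lives in the linked documentation. The loop takes any iterable of `(topic, payload)` pairs, so a real Kafka client could be plugged in without changing the handlers.

```python
import json
from typing import Callable, Dict, Iterable, Tuple

# Topic names are taken from the issue text.
IMAGE_METADATA_TOPIC = "image_metadata_updates"
LINK_ROT_TOPIC = "link_rot"


def handle_image_metadata(event: dict, store: dict) -> None:
    """Merge one metadata update into the record for its image.

    The "identifier" key is a hypothetical field name, not the real schema.
    """
    record = store.setdefault(event["identifier"], {})
    record.update({k: v for k, v in event.items() if k != "identifier"})


def handle_link_rot(event: dict, store: dict) -> None:
    """Flag an image whose URL returned a 404 (hypothetical "dead_link" flag)."""
    store.setdefault(event["identifier"], {})["dead_link"] = True


# Dispatch table: one handler per topic.
HANDLERS: Dict[str, Callable[[dict, dict], None]] = {
    IMAGE_METADATA_TOPIC: handle_image_metadata,
    LINK_ROT_TOPIC: handle_link_rot,
}


def consume(messages: Iterable[Tuple[str, bytes]], store: dict) -> int:
    """Process (topic, JSON payload) pairs one by one and return the count handled.

    `messages` may be an unbounded iterator fed by a Kafka client; nothing
    here buffers the stream, so each event is applied as soon as it is read.
    """
    handled = 0
    for topic, payload in messages:
        handler = HANDLERS.get(topic)
        if handler is None:
            continue  # Skip topics this consumer does not know about.
        handler(json.loads(payload), store)
        handled += 1
    return handled
```

In this sketch `store` is a plain dict standing in for whatever destination ends up replacing the meta_data column; swapping it for a real sink would only touch the two handlers.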

We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.

@aldenstpage aldenstpage added this to Pending Review in Backlog via automation Jul 8, 2020
@annatuma annatuma moved this from Pending Review to Q3 2020 in Backlog Jul 13, 2020
@kgodey kgodey moved this from Q3 2020 to tmp in Backlog Aug 13, 2020
@kgodey kgodey added ✨ goal: improvement Improvement to an existing feature and removed enhancement labels Sep 22, 2020
@annatuma annatuma moved this from Q3 2020 to Q4 2020 in Backlog Oct 1, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey moved this from Q4 2020 to CC Search in Backlog Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added 🙅 status: discontinued Not suitable for work as repo is in maintenance and removed 🏷 status: label work required Needs proper labelling before it can be worked on labels Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020
@kgodey kgodey moved this from CC Search to Done in Backlog Dec 16, 2020

4 participants