Come up with a solution for consuming crawler events #457

aldenstpage · 2020-07-08T15:54:35Z

We're reading metadata from images at large scale and publishing it to Kafka topics. We should start incorporating this data into the data layer so we can use it in CC Search. The format of the data is documented here. In summary:

The image_metadata_updates topic carries the dimensions, file size, and compression rate of each image.
When we are able to extract EXIF metadata, it also goes into the image_metadata_updates topic.
We record 404s in the link_rot topic.

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
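
A minimal sketch of what such a streaming consumer could look like, assuming the events are JSON-encoded. The field names (`identifier`, `width`, `height`, `filesize`, `compression_rate`, `exif`), the `Record` type, and the in-memory stores are all hypothetical stand-ins for illustration; the real schema is the one in the linked format documentation, and the real destination is still to be decided.

```python
import json
from typing import Any, Callable, Dict, Iterable, NamedTuple, Set


class Record(NamedTuple):
    """Stand-in for a Kafka record; only .topic and .value are used."""
    topic: str
    value: bytes


Handler = Callable[[dict], None]

# Hypothetical field names; replace with the documented schema.
METADATA_FIELDS = ("width", "height", "filesize", "compression_rate", "exif")


def make_handlers(metadata: Dict[str, dict], dead_links: Set[str]) -> Dict[str, Handler]:
    """Build one handler per topic, writing into the given stores."""

    def on_metadata(event: dict) -> None:
        # Keep only the fields present in this event (EXIF is optional).
        fields = {k: event[k] for k in METADATA_FIELDS if k in event}
        metadata.setdefault(event["identifier"], {}).update(fields)

    def on_link_rot(event: dict) -> None:
        dead_links.add(event["identifier"])

    return {"image_metadata_updates": on_metadata, "link_rot": on_link_rot}


def consume(records: Iterable[Any], handlers: Dict[str, Handler]) -> int:
    """Process records one at a time as they arrive; return how many were handled."""
    handled = 0
    for record in records:
        handler = handlers.get(record.topic)
        if handler is None:
            continue  # topic we do not care about
        try:
            handler(json.loads(record.value))
        except (ValueError, KeyError, TypeError):
            continue  # malformed event; skip it rather than stop the stream
        handled += 1
    return handled
```

Because `consume` accepts any iterable, it processes events as they are yielded instead of waiting for a batch. With kafka-python, for example, a `KafkaConsumer` subscribed to both topics could be passed as `records`, since iterating over it yields records with `topic` and `value` attributes; that wiring is left out of this sketch.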

We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.


aldenstpage added the enhancement label Jul 8, 2020

aldenstpage assigned mathemancer and aldenstpage Jul 8, 2020

aldenstpage added this to Pending Review in Backlog via automation Jul 8, 2020

annatuma moved this from Pending Review to Q3 2020 in Backlog Jul 13, 2020

kgodey moved this from Q3 2020 to tmp in Backlog Aug 13, 2020

kgodey added the ✨ goal: improvement label and removed the enhancement label Sep 22, 2020

annatuma moved this from Q3 2020 to Q4 2020 in Backlog Oct 1, 2020

cc-open-source-bot added the 🏷 status: label work required label Dec 2, 2020

kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey moved this from Q4 2020 to CC Search in Backlog Dec 2, 2020

kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey added the 🙅 status: discontinued label and removed the 🏷 status: label work required label Dec 16, 2020

kgodey closed this as completed Dec 16, 2020

kgodey moved this from CC Search to Done in Backlog Dec 16, 2020

obulat mentioned this issue Apr 17, 2023

Come up with a solution for consuming crawler events (original #457) WordPress/openverse#1765

Closed
