Incorporate Rekognition data into the catalog #431

Open
2 of 4 tasks
obulat opened this issue Feb 18, 2023 · 7 comments
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🧭 project: thread An issue used to track a project and its progress 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Contributor

obulat commented Feb 18, 2023

Summary

Rekognition data in the form of object labels was collected for roughly 100m records in the Openverse catalog.

These labels should be sanitized for suitability in the Openverse project and applied to records in the Openverse Catalog as tags.

Description

Some exploratory work was done to assess the quality of these labels. The team generally felt positive about them, provided we blanket-remove a subset of them (e.g. ones that assume a gender). We will need to do a broader analysis to determine whether there are more labels we want to exclude, and then incorporate the remaining labels into the existing tags for each record in the catalog. Each automated tag comes with a confidence score, and we should also factor those scores into the overall document score for relevant searches.

Best guess at list of implementation plans:

  • Strategy for filtering then upserting the tags into their associated records.
  • Determining whether/how to surface these tags in the frontend and differentiate them from provider-supplied tags

Documents

Issues

Prior Art

@obulat obulat added the 🧭 project: thread An issue used to track a project and its progress label Feb 18, 2023
@openverse-bot openverse-bot added this to Backlog in Openverse Feb 18, 2023
@obulat obulat removed this from Backlog in Openverse Feb 22, 2023
@zackkrida
Member

Early Testing

Back in April I ran a simple script to do some basic analysis of the Rekognition labels. I mostly wanted to test the speed of reading all of the data.

Here's the script I used: https://gist.github.com/zackkrida/cb125155e87aa1c296887e5c27ea33ff

Infra setup

The script was run on a manually-provisioned EC2 instance. The instance was configured with permissions to access our S3 bucket. I also used an instance with Enhanced Networking support so the script would theoretically stream the Rekognition data as fast as possible.

Unfortunately I only loosely recall how long it took, and am struggling to find my notes. I believe it was around 4-5 hrs. I do remember being happy with the speed.

General recommendations

For this project I would strongly recommend we download the full list of Rekognition labels from this page: https://docs.aws.amazon.com/rekognition/latest/dg/labels.html and filter out anything related to gender prediction.
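
To illustrate the filtering step, here is a minimal sketch. It assumes each stored label is a dict with `Name` and `Confidence` keys, as in the Rekognition `DetectLabels` response; the actual shape of our stored data may differ. The `EXCLUDED_LABELS` set and the `filter_labels` helper are hypothetical, and the real exclusion list would come from reviewing the full label list, not from this example.

```python
from typing import Iterable, List

# Hypothetical denylist for illustration only; the real list should be
# derived from a review of the full Rekognition label set.
EXCLUDED_LABELS = frozenset({"man", "woman", "boy", "girl", "male", "female"})


def filter_labels(labels: Iterable[dict], excluded: frozenset = EXCLUDED_LABELS) -> List[dict]:
    """Drop labels whose name is on the denylist (case-insensitive match)."""
    return [
        label
        for label in labels
        if label["Name"].strip().lower() not in excluded
    ]


# Assumed input shape: one dict per label with "Name" and "Confidence".
sample = [
    {"Name": "Bicycle", "Confidence": 98.2},
    {"Name": "Woman", "Confidence": 91.5},
    {"Name": "Tree", "Confidence": 76.0},
]
kept = filter_labels(sample)
print([label["Name"] for label in kept])
```

Matching on a normalized name keeps the check cheap enough to run inline while streaming, though an exact-match denylist will miss related labels that are not listed explicitly.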

As for how we import the Rekognition data, we could probably use a script much like the one I wrote to stream the Rekognition data and then perform SQL updates in batches, adding the new tags to the existing array with a provider value of "Rekognition". We may also want to store the confidence of each tag in the Catalog DB. This would give us more flexibility in the future: we could fine-tune tags in Elasticsearch, for example, and choose to show only those above a certain confidence level.
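
A rough sketch of the tag-merging side of that idea, leaving out the S3 streaming and the SQL itself. The tag shape (`name`, `provider`, `accuracy`) and the helper names `to_tags`, `merge_tags`, and `batched` are assumptions made for illustration, as is the input shape (`Name` and `Confidence` on a 0-100 scale, as in the Rekognition `DetectLabels` response).

```python
from itertools import islice
from typing import Iterable, Iterator, List

PROVIDER = "Rekognition"  # provider value suggested in the comment above


def to_tags(labels: Iterable[dict], min_confidence: float = 0.0) -> List[dict]:
    """Convert Rekognition-style labels to tag dicts, keeping the confidence.

    The "accuracy" field name and the 0-1 scale are assumptions.
    """
    return [
        {"name": label["Name"], "provider": PROVIDER, "accuracy": label["Confidence"] / 100}
        for label in labels
        if label["Confidence"] >= min_confidence
    ]


def merge_tags(existing: List[dict], new: List[dict]) -> List[dict]:
    """Append new tags, skipping (name, provider) pairs already present."""
    seen = {(tag["name"], tag.get("provider")) for tag in existing}
    merged = list(existing)
    for tag in new:
        key = (tag["name"], tag["provider"])
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield lists of up to `size` items so updates can be sent in batches."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


existing = [{"name": "bike", "provider": "flickr"}]
labels = [
    {"Name": "Bicycle", "Confidence": 90.0},
    {"Name": "Wheel", "Confidence": 40.0},
]
merged = merge_tags(existing, to_tags(labels, min_confidence=50.0))
print(merged)
```

Deduplicating on the `(name, provider)` pair makes the merge safe to re-run, which matters if a batch has to be retried. Storing the confidence alongside each tag, rather than filtering at import time, would leave the threshold decision to the indexing step.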

@AetherUnbound
Contributor

The project proposal has recently been merged, and issues for the 3 implementation plans have been created (linked above). I plan on starting the API-related IP soon.

@openverse-bot
Collaborator

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@AetherUnbound
Contributor

No change since the previous update - IPs still need to be drafted.

@openverse-bot
Collaborator

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@AetherUnbound
Contributor

The IP for the API side of things has been merged (#4189) and can be seen here. The only issue necessary for this work has been created and will be worked on in the next week or so: #4273.

@fcoveram has also established mock-ups for how the machine-generated tags will be displayed in the frontend in #4192. This was a necessary prerequisite for the frontend IP, #4039, which @obulat will be working on.

Work can also begin on the final IP, #4040, which will be a more subjective dive into the tags themselves and what policy Openverse will take for machine-generated labels.

@openverse-bot
Collaborator

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

Projects
Status: 💬 In RFC

4 participants