Use image retrieval techniques to find similar images #27

Closed
dlangenk opened this issue Jan 4, 2019 · 8 comments · Fixed by #128

dlangenk commented Jan 4, 2019

More of a nice-to-have.

I just browsed through the results of novelty detection. Unfortunately, the classes are quite scattered, so selection takes some time. In addition, some classes are much more abundant than others, so the rare classes might get "lost" in the downstream steps. It would be nice to have a "show me more thumbnails that look like this one" mechanism. Algorithms for that are available in image retrieval. We could, for example, use MPEG-7 features or something similar to create a tree structure over the data to make it easier to browse. Creating that structure shouldn't take much time or resources.
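
A minimal sketch of what such a lookup could look like, assuming per-thumbnail descriptors (MPEG-7-style or otherwise) are already extracted as a NumPy array; names and data below are purely illustrative:

```python
# Sketch: build a tree structure over thumbnail descriptors once, then
# answer "show me more thumbnails that look like this one" queries.
import numpy as np
from scipy.spatial import cKDTree

features = np.random.rand(10_000, 64)   # one 64-dim descriptor per thumbnail (placeholder data)
tree = cKDTree(features)                # tree built once, cheap to query afterwards

query = features[42]                    # descriptor of the thumbnail the user clicked
distances, indices = tree.query(query, k=10)  # 10 most similar thumbnails
print(indices)                          # indices into the thumbnail list
```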

@mzur mzur added the student label Jan 4, 2019

mzur commented Apr 7, 2021

#66 should be implemented first.


mzur commented Apr 7, 2021

Idea for the UI: If this feature is active (it is optional, and disabled if not enough training data is available), the grid of image patches in MAIA is split vertically (e.g. 80% of the rows show the regular patches, 20% show patches suggested by this method). This way the original MAIA workflow is still possible even if this method performs poorly for a given use case.


mzur commented Aug 13, 2021

This can be done with the image features and similarity search implemented for biigle/core#336. The function should be available for training proposals and annotation candidates.


mzur commented Oct 18, 2021

Next idea for the UI: The selected proposal/candidate is shown fixed and highlighted at the first position in the grid. The remaining grid items are sorted according to their similarity to the selected patch; they scroll and can be interacted with as usual. The filtering can be enabled with a hover button on each patch and disabled with a button on the highlighted, fixed patch.
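
A rough sketch of the similarity sorting described above, assuming L2-normalized feature vectors per patch; variable names are illustrative only:

```python
# Sketch: order the grid items by cosine similarity to the selected patch,
# keeping the selection itself fixed at the first position.
import numpy as np

features = np.random.rand(500, 384)                              # placeholder descriptors
features /= np.linalg.norm(features, axis=1, keepdims=True)      # unit length

selected = 7                                   # index of the fixed, highlighted patch
scores = features @ features[selected]         # cosine similarity to the selection
order = np.argsort(-scores)                    # most similar first
order = order[order != selected]               # the selection stays pinned at position 0
```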

@mzur mzur changed the title from "Use image retrieval techniques to find similar images after novelty detection" to "Use image retrieval techniques to find similar images" Nov 30, 2021

mzur commented Nov 30, 2021

Updated the title to make clear that this should be implemented both for training proposals and annotation candidates.

@mzur mzur removed the student label Oct 20, 2022

mzur commented Feb 9, 2023

With the student experiments based on DINO features and #96 done, this can move forward now.

@mzur mzur self-assigned this Jun 13, 2023

mzur commented Oct 4, 2023

I want to pick this up again. New thoughts:

  • Use DINOv2 for feature extraction.
  • Use pgvector to store the features directly in the database (will work with annotation patches, too).
  • I thought about using a separate (vector) database for storing the features but 1. it's too convenient to use the existing constraints and logic to update/delete the rows and 2. there are probably no performance issues (right now) with the amount of data we manage.
  • pgvector supports indexing up to 2000 dimensions per feature vector. Each dimension requires ~4 bytes. DINOv2 can produce feature vectors with between 384 and 1536 dimensions. A 1536-dim. feature vector would require ~6144 bytes. From a rough estimate, the features of the current BIIGLE image annotations would require >90 GB, which is too much, IMO. A 384-dim. feature vector would result in ~23 GB of additional storage. As a start, I'll experiment with patches of size 224×224 and the ViT-S/14 model (384 dims); a rough pgvector sketch follows below this list.
  • We could use PCA for dimension reduction with MAIA but we can't for the other use cases (e.g. Largo), as the annotations are created continuously and we can't know the principal components in advance.
  • When this goes live (also for regular annotations in Largo) we must think about migrating the database host to a flavor with more storage.
  • We also have to implement incremental backups, I think (for the "frequent" backup). The hourly backups should be fine even with a much larger DB size.
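
A rough pgvector sketch of the storage and query side described above, assuming a Postgres instance with the pgvector extension; table and column names are hypothetical, not BIIGLE's actual schema:

```python
# Sketch: store 384-dim descriptors next to annotation IDs and query neighbours.
import psycopg2

conn = psycopg2.connect("dbname=vectors user=biigle")  # placeholder connection
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS patch_features (
        annotation_id bigint PRIMARY KEY,
        embedding vector(384)   -- ViT-S/14 descriptor, ~4 bytes per dimension
    )
""")
# Approximate index; pgvector indexes vectors with up to 2000 dimensions.
cur.execute("""
    CREATE INDEX IF NOT EXISTS patch_features_embedding_idx
    ON patch_features USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)
""")

# Insert one descriptor (pgvector accepts the '[x,y,...]' text representation).
vec = [0.0] * 384
cur.execute(
    "INSERT INTO patch_features (annotation_id, embedding) VALUES (%s, %s)",
    (1, "[" + ",".join(map(str, vec)) + "]"),
)

# The 10 nearest neighbours of annotation 1 by L2 distance.
cur.execute("""
    SELECT annotation_id
    FROM patch_features
    ORDER BY embedding <-> (SELECT embedding FROM patch_features WHERE annotation_id = 1)
    LIMIT 10
""")
print(cur.fetchall())
conn.commit()
```

Note that the ivfflat index is an approximate nearest-neighbour index, which should be fine for a "show me similar patches" feature.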

Here is a notebook with a minimal feature-extraction example with DINOv2: https://colab.research.google.com/drive/1LbtYkzdOezl2SadyxCRJFYhLd_aQNjlq?usp=sharing
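
For reference, an inline sketch of the same minimal feature extraction, assuming the public torch.hub entry point for DINOv2 and a local patch file (the file name is a placeholder):

```python
# Sketch: extract a 384-dim DINOv2 (ViT-S/14) descriptor for one image patch.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open('patch.png').convert('RGB')  # hypothetical patch file
batch = preprocess(image).unsqueeze(0)          # shape: (1, 3, 224, 224)

with torch.no_grad():
    features = model(batch)                     # (1, 384) global descriptor

print(features.shape)
```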


mzur commented Oct 4, 2023

Thinking about it, maybe I prefer decoupling the vector database from our main database. With MAIA and Largo it's easy to implement cleanup of vector database rows, since the annotation/candidate/proposal patch files are also cleaned. Cleanup can be asynchronous as well.

This has the advantage that the vector DB does not have an impact on the regular DB backups. It can have its own (less frequent) backups and be run on a different host.

Laravel can work with different database connections (also for migrations). We only need to sync (and index) the model IDs from the regular DB to the vector DB but this shouldn't be a problem.

I'll still stick with pgvector, as I don't want to introduce a new technology to the stack.
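
A loose sketch of the asynchronous cleanup/sync idea, assuming two separate Postgres connections; connection strings and table names are hypothetical:

```python
# Sketch: remove vector rows whose annotation IDs no longer exist in the main DB.
import psycopg2

main = psycopg2.connect("dbname=biigle")      # regular application database
vectors = psycopg2.connect("dbname=vectors")  # decoupled vector database

with main.cursor() as cur:
    cur.execute("SELECT id FROM annotations")
    live_ids = {row[0] for row in cur.fetchall()}

with vectors.cursor() as cur:
    cur.execute("SELECT annotation_id FROM patch_features")
    stale = [row[0] for row in cur.fetchall() if row[0] not in live_ids]
    if stale:
        cur.execute(
            "DELETE FROM patch_features WHERE annotation_id = ANY(%s)",
            (stale,),
        )
vectors.commit()
```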

mzur added a commit that referenced this issue Oct 4, 2023
@mzur mzur mentioned this issue Oct 4, 2023
@mzur mzur linked a pull request Oct 12, 2023 that will close this issue
@mzur mzur closed this as completed in #128 Oct 13, 2023
@mzur mzur mentioned this issue Dec 12, 2023