A high-performance image similarity search engine. Uses a pre-trained CNN (MobileNetV2) in Python/PyTorch for feature extraction and a custom k-d tree implementation in C for efficient nearest-neighbor search.

cattolatte/image-similarity-search

Image Similarity Search Engine

Languages: C, Python. Library: PyTorch.

A high-performance image similarity search engine built from scratch. This project uses a pre-trained CNN in Python/PyTorch for high-dimensional feature extraction and a custom k-d tree implementation in C for efficient nearest-neighbor search.

It's a practical demonstration of how modern AI pipelines and classic, high-performance data structures can work together to solve complex problems.


Core Concepts

This project is divided into two main components:

  1. Feature Extraction (The "Visual Brain")

    • A Python script uses a pre-trained Convolutional Neural Network (MobileNetV2) to convert images into meaningful 1280-dimensional feature vectors (embeddings).
    • This process, known as Transfer Learning, leverages a model already trained on millions of images to understand the "content" of a new image and represent it numerically.
  2. Efficient Search (The "Librarian Brain")

    • A C program loads the thousands of feature vectors generated by the Python script.
    • To avoid a slow linear scan, it organizes these high-dimensional points into a k-d tree, a specialized binary search tree for spatial data.
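    • Given a query image, a nearest-neighbor search then finds the stored image whose vector has the smallest Euclidean distance to the query's vector. While descending the tree, the search prunes any subtree whose splitting plane lies farther from the query than the best match found so far, so the vectors in that subtree are never compared.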
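The build and pruning steps above can be sketched in a few lines. This is a minimal illustration written in Python for brevity, not the project's C implementation; the names `Node`, `build`, and `nearest` are hypothetical, and the median split by cycling through axes is one common construction that the C code may or may not use.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    point: Sequence[float]      # the feature vector stored at this node
    index: int                  # position of the vector in the input list
    axis: int                   # coordinate this node splits on
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(items: List[Tuple[int, Sequence[float]]], depth: int = 0) -> Optional[Node]:
    """Build a k-d tree from (index, vector) pairs by median splitting."""
    if not items:
        return None
    axis = depth % len(items[0][1])          # cycle through the dimensions
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    idx, point = items[mid]
    return Node(point, idx, axis,
                build(items[:mid], depth + 1),
                build(items[mid + 1:], depth + 1))


def nearest(node: Optional[Node], query: Sequence[float],
            best: Optional[Tuple[float, int]] = None) -> Optional[Tuple[float, int]]:
    """Return (squared distance, index) of the closest stored vector."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node.point, query))
    if best is None or d2 < best[0]:
        best = (d2, node.index)
    # Signed distance from the query to this node's splitting plane.
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Pruning: visit the far side only if the plane is closer than the best match.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best
```

Comparing squared distances avoids a square root per node; the ordering of candidates is the same as with the Euclidean distance itself.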

Project Structure

image-similarity-search/
├── c_search/              # C program for the k-d tree search
│   ├── src/
│   ├── include/
│   ├── data/
│   └── Makefile
│
├── python_extractor/      # Python script for feature extraction
│   ├── dataset/
│   ├── extract_features.py
│   └── requirements.txt
│
├── .gitignore             # Files and folders to ignore
├── .gitattributes         # Configures Git LFS for large files
└── README.md              # You are here

Setup and Installation

Prerequisites

  • A C compiler (like GCC) and make.
  • Python 3.8+ and pip.
  • Git LFS (for handling the large vectors.csv file).

Installation Steps

  1. Clone the repository: Make sure Git LFS is installed and initialized for your user account first (run git lfs install once after installing it).

    git clone https://github.com/coderstale/image-similarity-search.git
    cd image-similarity-search
  2. Set up the Python environment:

    cd python_extractor
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

Usage

The process has two steps: first generate the data, then run the search.

Step 1: Generate Feature Vectors

Run the Python script to download the CIFAR-10 dataset and generate the vectors.csv file.

# Make sure you are in the python_extractor/ directory with the venv active
python extract_features.py

This will create a large vectors.csv file in the c_search/data/ directory.

Step 2: Compile and Run the C Search Program

Navigate to the C directory, compile the code with make, and run the application.

# From the project root, navigate to the C directory
cd c_search

# Compile the program
make

# Run the search application
./bin/search_app

The program will load the 50,000 vectors, build the k-d tree, and then prompt you to enter an image ID to find its most similar match.
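A brute-force linear scan is a handy baseline for checking the tree's answers on a few IDs. The sketch below is an illustration in Python, not part of the C program. It assumes each row of vectors.csv holds an integer image ID followed by the vector components, which may differ from the real file layout, and it assumes the query image itself should be excluded from the results (otherwise it would always match itself at distance 0). The names `load_vectors` and `most_similar` are hypothetical.

```python
import csv
import math


def load_vectors(handle):
    """Parse rows of the assumed form: image_id, v0, v1, ..., vN."""
    ids, vectors = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # tolerate blank lines
        ids.append(int(row[0]))
        vectors.append([float(x) for x in row[1:]])
    return ids, vectors


def most_similar(query_id, ids, vectors):
    """Linear scan: return (image_id, distance) of the closest other image."""
    query = vectors[ids.index(query_id)]
    best_id, best_dist = None, math.inf
    for image_id, vec in zip(ids, vectors):
        if image_id == query_id:
            continue  # skip the query itself
        dist = math.dist(query, vec)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_id, best_dist = image_id, dist
    return best_id, best_dist
```

A typical call would be `ids, vectors = load_vectors(open("c_search/data/vectors.csv", newline=""))` followed by `most_similar(42, ids, vectors)`, where 42 stands in for any ID present in the file.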


Future Work

  • Web Frontend: The C backend could be refactored into a simple web server using a library like Mongoose. A simple HTML/JavaScript frontend could then be built to provide a graphical interface for searching and displaying images.
  • K-Nearest Neighbors: The search algorithm could be extended to find the K nearest neighbors instead of just one, providing a gallery of similar images.
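One way to approach the K-nearest-neighbors extension is to replace the single "best so far" with a bounded max-heap of the K best candidates. The sketch below is a hypothetical Python illustration (the names `BestK` and `k_nearest` are not from the project) and uses a linear scan to stay self-contained. In a k-d tree search, the pruning test would compare the distance to the splitting plane against `bound()` instead of against the single best distance.

```python
import heapq
import math


class BestK:
    """Keep the k smallest (distance, id) pairs offered so far."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []  # max-heap simulated by storing negated distances

    def offer(self, dist, item_id):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-dist, item_id))
        elif dist < -self._heap[0][0]:
            # Better than the current worst of the k best: swap it in.
            heapq.heapreplace(self._heap, (-dist, item_id))

    def bound(self):
        """Pruning radius: infinite until k candidates have been collected."""
        return -self._heap[0][0] if len(self._heap) == self.k else math.inf

    def results(self):
        """Return the kept (distance, id) pairs, closest first."""
        return sorted((-neg, item_id) for neg, item_id in self._heap)


def k_nearest(query, vectors, k):
    """Linear-scan demonstration of the heap on a list of vectors."""
    best = BestK(k)
    for item_id, vec in enumerate(vectors):
        best.offer(math.dist(query, vec), item_id)
    return best.results()
```

Because the heap root is always the worst of the K kept candidates, both the replacement check and the pruning bound cost constant time per visited vector.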
