A high-performance image similarity search engine built from scratch. This project uses a pre-trained CNN in Python/PyTorch for high-dimensional feature extraction and a custom k-d tree implementation in C for efficient nearest-neighbor search.
It's a practical demonstration of how modern AI pipelines and classic, high-performance data structures can work together to solve complex problems.
This project is divided into two main components:
- Feature Extraction (The "Visual Brain")
  - A Python script uses a pre-trained Convolutional Neural Network (MobileNetV2) to convert images into meaningful 1280-dimensional feature vectors (embeddings).
  - This process, known as Transfer Learning, leverages a model already trained on millions of images to understand the "content" of a new image and represent it numerically.
- Efficient Search (The "Librarian Brain")
  - A C program loads the thousands of feature vectors generated by the Python script.
  - To avoid a slow linear scan, it organizes these high-dimensional points into a k-d tree, a binary search tree that splits the data along one coordinate axis at each level.
  - This allows a fast nearest-neighbor search for the image with the smallest Euclidean distance to a query image: the search prunes any subtree that cannot contain a point closer than the best match found so far.
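The build-and-prune idea described above can be sketched in a few lines. This is only an illustration in Python for brevity, not the project's C implementation; the names `build` and `nearest`, the nested-tuple node layout, and the small random 8-dimensional vectors are all assumptions made for the example.

```python
import random
from math import dist, inf  # math.dist needs Python 3.8+


def build(items, depth=0):
    """Build a k-d tree from (id, vector) pairs; returns nested tuples or None."""
    if not items:
        return None
    axis = depth % len(items[0][1])            # cycle through the dimensions
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2                      # the median becomes the split node
    return (items[mid], axis,
            build(items[:mid], depth + 1),
            build(items[mid + 1:], depth + 1))


def nearest(node, query, best=(inf, None)):
    """Return (distance, id) of the stored vector closest to query."""
    if node is None:
        return best
    (item_id, vec), axis, left, right = node
    d = dist(vec, query)                       # Euclidean distance
    if d < best[0]:
        best = (d, item_id)
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Pruning: the far side can only hold a closer point if the splitting
    # plane is nearer to the query than the best distance found so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


# Hypothetical data: 500 random vectors standing in for image embeddings.
random.seed(0)
vectors = [(i, [random.random() for _ in range(8)]) for i in range(500)]
tree = build(vectors)
query = [random.random() for _ in range(8)]
print(nearest(tree, query))
```

The result can be checked against a plain linear scan, `min((dist(v, query), i) for i, v in vectors)`, which should return the same pair whenever the nearest neighbor is unique.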
image-similarity-search/
├── c_search/ # C program for the k-d tree search
│ ├── src/
│ ├── include/
│ ├── data/
│ └── Makefile
│
├── python_extractor/ # Python script for feature extraction
│ ├── dataset/
│ ├── extract_features.py
│ └── requirements.txt
│
├── .gitignore # Files and folders to ignore
├── .gitattributes # Configures Git LFS for large files
└── README.md # You are here
- A C compiler (like GCC) and make.
- Python 3.8+ and pip.
- Git LFS (for handling the large vectors.csv file).
- Clone the repository: Make sure Git LFS is installed and initialized for your user account (git lfs install).
  git clone https://github.com/coderstale/image-similarity-search.git
  cd image-similarity-search
- Set up the Python environment:
  cd python_extractor
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
The process is two steps: first generate the data, then run the search.
Run the Python script to download the CIFAR-10 dataset and generate the vectors.csv file.
# Make sure you are in the python_extractor/ directory with the venv active
python extract_features.py
This will create a large vectors.csv file in the c_search/data/ directory.
Navigate to the C directory, compile the code with make, and run the application.
# From the project root, navigate to the C directory
cd c_search
# Compile the program
make
# Run the search application
./bin/search_app
The program will load the 50,000 vectors, build the k-d tree, and then prompt you to enter an image ID to find its most similar match.
- Web Frontend: The C backend could be refactored into a simple web server using a library like mongoose. A simple HTML/JavaScript frontend could then be built to provide a graphical interface for searching and displaying images.
- K-Nearest Neighbors: The search algorithm could be extended to find the K nearest neighbors instead of just one, providing a gallery of similar images.
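
One possible shape for the K-nearest-neighbors extension is to keep a bounded max-heap of the best K matches and prune against the worst of them. The sketch below is in Python for brevity and is not part of the project; `build`, `k_nearest`, and the node layout are hypothetical names chosen for the example, and it assumes `k >= 1`.

```python
import heapq
import random
from math import dist  # math.dist needs Python 3.8+


def build(items, depth=0):
    """Build a k-d tree from (id, vector) pairs; returns nested tuples or None."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build(items[:mid], depth + 1),
            build(items[mid + 1:], depth + 1))


def k_nearest(tree, query, k):
    """Return the k closest (distance, id) pairs, nearest first (assumes k >= 1)."""
    heap = []  # max-heap via negated distances: heap[0] is the worst kept match

    def visit(node):
        if node is None:
            return
        (item_id, vec), axis, left, right = node
        d = dist(vec, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, item_id))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, item_id))  # evict the worst match
        diff = query[axis] - vec[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Search the far side while fewer than k matches are held, or if the
        # splitting plane is closer than the worst match kept so far.
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far)

    visit(tree)
    return sorted((-neg_d, item_id) for neg_d, item_id in heap)


# Hypothetical data: 300 random vectors standing in for image embeddings.
random.seed(1)
points = [(i, [random.random() for _ in range(6)]) for i in range(300)]
target = [random.random() for _ in range(6)]
for distance, image_id in k_nearest(build(points), target, 5):
    print(image_id, round(distance, 4))
```

With `k = 1` this reduces to the single nearest-neighbor search; larger values of `k` loosen the pruning bound, so more of the tree is visited.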