### Andrew Taylor
### atayl136
### Creating AI Enabled Systems

# Search Demo Notebook

This notebook demonstrates nearest neighbor search using the implemented FAISS index and various distance measures: **Euclidean**, **Cosine**, **Dot Product**, and **Minkowski**. We compute embeddings for gallery images, perform searches with 10 probe images, and report the rank positions.
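
For reference, the four measures can be written out directly in NumPy. This is a minimal sketch, independent of the `FaissSearch` implementation used later: the function names are illustrative, and the conventions chosen here (cosine reported as `1 - similarity`, dot product reported as a raw similarity) are assumptions that may differ from what the notebook's modules return.

```python
import numpy as np

def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    """Inner product; a similarity score, so larger means closer."""
    return float(np.dot(a, b))

def minkowski(a, b, p=3):
    """Minkowski distance of order p (p=2 reduces to Euclidean)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two orthogonal unit vectors as a toy input
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(euclidean(a, b), cosine_distance(a, b), dot_product(a, b), minkowski(a, b, p=3))
```

Note that the first, second, and fourth functions are distances (smaller is closer), while the dot product is a similarity, so its values cannot be read on the same scale as the others.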


In [1]:
import os
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

import sys
# FAISS is used through the index classes imported below.
import faiss

# Add the parent directory (one level above the notebook's working directory)
# to sys.path so that the `modules` package can be imported.
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(parent_dir)

from modules.extraction.embedding import Embedding
from modules.extraction.preprocessing import Preprocessing


# Import the brute-force index and the search wrapper.
from modules.retrieval.index.bruteforce import FaissBruteForce
from modules.retrieval.search import FaissSearch

In [2]:
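import os

# Workaround for the "duplicate OpenMP runtime" error that can appear when more than
# one library (e.g. FAISS and PyTorch) loads its own copy of OpenMP. The runtime's own
# error message describes this setting as unsafe and unsupported, so treat it as a
# convenience for this demo only.
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"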



In [3]:
from PIL import Image
import torch
import os
import glob
import numpy as np


# Initialize the preprocessing pipeline and embedding model once.
preprocessing = Preprocessing(image_size=160)
device = 'cpu'
embedding_model = Embedding(pretrained='casia-webface', device=device)

def compute_embedding(image_path):
    # Open and preprocess the image.
    image = Image.open(image_path).convert("RGB")
    processed_image = preprocessing.process(image)
    # Compute the embedding.
    embedding_vector = embedding_model.encode(processed_image)
    return embedding_vector

def get_first_image_from_folder(folder_path):
    """Get the first valid image file from a folder, skipping problematic files."""
    valid_extensions = ['.jpg', '.jpeg', '.png', '.bmp', '.gif']
    
    try:
        for filename in os.listdir(folder_path):
            # Skip files starting with ._ (macOS metadata files)
            if filename.startswith('._'):
                continue
                
            file_path = os.path.join(folder_path, filename)
            if os.path.isfile(file_path) and any(file_path.lower().endswith(ext) for ext in valid_extensions):
                # Verify it's a valid image before returning
                try:
                    with Image.open(file_path) as img:
                        # verify() checks file integrity without decoding the pixel data
                        img.verify()
                    return file_path
                except Exception:
                    # Skip this file if PIL can't open it
                    continue
    except Exception as e:
        print(f"Error accessing folder {folder_path}: {e}")
    
    return None

# Forward slashes avoid the invalid escape sequences ('\s', '\m') in the original path.
gallery_dir = '../storage/multi_image_gallery'
folder_paths = glob.glob(os.path.join(gallery_dir, '*'))
print(f"Found {len(folder_paths)} folders in gallery.")

embeddings = []
metadata = []
successful_count = 0
error_count = 0

for folder_path in folder_paths:
    try:
        img_path = get_first_image_from_folder(folder_path)
        if img_path:
            try:
                embedding = compute_embedding(img_path)
                embeddings.append(embedding)
                metadata.append(os.path.basename(folder_path))
                successful_count += 1
                if successful_count % 10 == 0:
                    print(f"Successfully processed {successful_count} folders")
            except Exception as e:
                error_count += 1
                print(f"Error processing image {img_path}: {e}")
        else:
            print(f"No valid images found in folder: {folder_path}")
    except Exception as e:
        print(f"Fatal error with folder {folder_path}: {e}")

print(f"Processing complete. Success: {successful_count}, Errors: {error_count}")

# FAISS expects float32 input.
embeddings = np.array(embeddings, dtype=np.float32)
print(f"Created embeddings for {len(embeddings)} folders")

# Build a FAISS BruteForce index with Euclidean metric for demonstration.
faiss_index = FaissBruteForce(dim=512, metric='euclidean')
faiss_index.add_embeddings(embeddings, metadata)
print("Gallery embeddings indexed.")



Found 1000 folders in gallery.
Successfully processed 10 folders
Successfully processed 20 folders
Successfully processed 30 folders
Successfully processed 40 folders
Successfully processed 50 folders
Successfully processed 60 folders
Successfully processed 70 folders
Successfully processed 80 folders
Successfully processed 90 folders
Successfully processed 100 folders
Successfully processed 110 folders
Successfully processed 120 folders
Successfully processed 130 folders
Successfully processed 140 folders
Successfully processed 150 folders
Successfully processed 160 folders
Successfully processed 170 folders
Successfully processed 180 folders
Successfully processed 190 folders
Successfully processed 200 folders
Successfully processed 210 folders
Successfully processed 220 folders
Successfully processed 230 folders
Successfully processed 240 folders
Successfully processed 250 folders
Successfully processed 260 folders
Successfully processed 270 folders
Successfully processed 280 folder

In [4]:
# Step 2: Load probe images and compute embeddings
print("\nStep 2: Processing probe images...")
probe_dir = '../storage/probe'
probe_folders = glob.glob(os.path.join(probe_dir, '*'))

# Make sure we don't select more probes than are available
num_probes = min(10, len(probe_folders))
probe_folders = np.random.choice(probe_folders, size=num_probes, replace=False)

probe_embeddings = []
probe_metadata = []

print("Selected probe images:")
for i, folder_path in enumerate(probe_folders):
    img_path = get_first_image_from_folder(folder_path)
    if img_path:
        try:
            probe_embedding = compute_embedding(img_path)
            # Collect the embedding; the list is converted to a float32 array after the loop
            probe_embeddings.append(probe_embedding)
            probe_metadata.append(os.path.basename(folder_path))
            print(f"Probe {i+1}: {os.path.basename(folder_path)}")
        except Exception as e:
            print(f"Error processing probe image {img_path}: {e}")
    else:
        print(f"No valid image in probe folder: {folder_path}")

probe_embeddings = np.array(probe_embeddings, dtype=np.float32)
print(f"Created embeddings for {len(probe_embeddings)} probe folders with shape {probe_embeddings.shape}")

# Step 3: Define metrics and perform searches
distance_metrics = ['euclidean', 'cosine', 'dot_product', 'minkowski']
k = 5  # Retrieve top 5 nearest neighbors for each probe

# Dictionary to store results for each metric
results = {metric: [] for metric in distance_metrics}


Step 2: Processing probe images...
Selected probe images:
Probe 1: Daniela_Hantuchova
Probe 2: Kim_Ryong-sung
Probe 3: Rick_Romley
Probe 4: Donatella_Versace
Probe 5: Steven_Spielberg
Probe 6: George_Lopez
Probe 7: Paul_McCartney
Probe 8: Larry_Coker
Probe 9: Ann_Veneman
Probe 10: Alec_Baldwin
Created embeddings for 10 probe folders with shape (10, 512)


In [5]:


#searcher = FaissSearch(faiss_index, metric="euclidean")
#distances, indices, meta_results = searcher.search(probe, k=5)
#print(f"  Search completed successfully")

In [6]:
print("\nStep 3: Performing searches with different distance metrics...")

for metric in distance_metrics:
    print(f"\n=== Distance metric: {metric} ===")
    
    # Create a fresh index for each metric
    if metric == 'minkowski':
        # For Minkowski, build the index with the Euclidean metric; p=3 is passed to FaissSearch below
        faiss_index = FaissBruteForce(dim=embeddings.shape[1], metric='euclidean')
    else:
        faiss_index = FaissBruteForce(dim=embeddings.shape[1], metric=metric)
    
    # Add embeddings to the fresh index
    faiss_index.add_embeddings(embeddings, metadata)
    print(f"Created {metric} index with {faiss_index.index.ntotal} vectors")
    
    # Create searcher for this metric
    searcher = FaissSearch(faiss_index, metric=metric, p=3)  # p=3 for Minkowski
    
    # Dictionary to store probe results for this metric
    metric_results = []
    
    # Search for each probe
    for i, probe in enumerate(probe_embeddings):
        probe_name = probe_metadata[i]
        print(f"\nProbe {i+1}/{len(probe_embeddings)}: {probe_name}")
        
        try:
            # Ensure probe is correct format
            probe = np.ascontiguousarray(probe, dtype=np.float32)
            
            # Perform search
            distances, indices, meta_results = searcher.search(probe, k=k)
            
            # Store results
            probe_result = {
                'probe': probe_name,
                'matches': []
            }
            
            print(f"Top {k} matches:")
            for rank, (dist, match) in enumerate(zip(distances[0], meta_results), start=1):
                match_info = {
                    'rank': rank,
                    'name': match,
                    'distance': float(dist)
                }
                probe_result['matches'].append(match_info)
                print(f"  Rank {rank}: {match} (Distance: {dist:.4f})")
            
            metric_results.append(probe_result)
            
        except Exception as e:
            print(f"Error searching for probe {probe_name}: {str(e)}")
            import traceback
            traceback.print_exc()
    
    # Store all results for this metric
    results[metric] = metric_results
    print(f"Completed search with {metric} metric")

# Step 4: Print summary report
print("\n=== SEARCH RESULTS SUMMARY ===")
for metric in distance_metrics:
    print(f"\n--- {metric.upper()} METRIC ---")
    for probe_result in results[metric]:
        probe_name = probe_result['probe']
        print(f"Probe: {probe_name}")
        print("  Rank    | Match                    | Distance")
        print("  --------|--------------------------|---------")
        for match in probe_result['matches']:
            print(f"  {match['rank']:<7} | {match['name']:<24} | {match['distance']:.4f}")
        print()


Step 3: Performing searches with different distance metrics...

=== Distance metric: euclidean ===
Created euclidean index with 1000 vectors

Probe 1/10: Daniela_Hantuchova
Top 5 matches:
  Rank 1: Augustin_Calleri (Distance: 0.1098)
  Rank 2: Monica_Bellucci (Distance: 0.1213)
  Rank 3: Isaiah_Washington (Distance: 0.1320)
  Rank 4: Angelo_Reyes (Distance: 0.1362)
  Rank 5: Nicanor_Duarte_Frutos (Distance: 0.1398)

Probe 2/10: Kim_Ryong-sung
Top 5 matches:
  Rank 1: Sachiko_Yamada (Distance: 0.1147)
  Rank 2: Alberto_Fujimori (Distance: 0.1228)
  Rank 3: Keira_Knightley (Distance: 0.1262)
  Rank 4: Charlotte_Rampling (Distance: 0.1384)
  Rank 5: Stanley_Tong (Distance: 0.1389)

Probe 3/10: Rick_Romley
Top 5 matches:
  Rank 1: Joan_Claybrook (Distance: 0.5574)
  Rank 2: John_F_Kennedy_Jr (Distance: 0.6059)
  Rank 3: Francis_Mer (Distance: 0.6148)
  Rank 4: Eduardo_Duhalde (Distance: 0.6198)
  Rank 5: Paul_ONeill (Distance: 0.6628)

Probe 4/10: Donatella_Versace
Top 5 matches:
  Rank 1
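
The introduction mentions reporting rank positions. One way to derive them from the `results` structure built above is sketched here. The helper name `rank_of_identity` and the toy entry are hypothetical, and the sketch assumes that a probe folder and its matching gallery folder share the same name, which the notebook does not state.

```python
def rank_of_identity(probe_result):
    """Return the 1-based rank at which the probe's own name appears, or None."""
    for match in probe_result['matches']:
        if match['name'] == probe_result['probe']:
            return match['rank']
    return None  # the identity is not among the retrieved top-k matches

# Hypothetical toy entry with the same shape as the items in results[metric]
toy_result = {
    'probe': 'person_a',
    'matches': [
        {'rank': 1, 'name': 'person_b', 'distance': 0.10},
        {'rank': 2, 'name': 'person_a', 'distance': 0.12},
    ],
}
print(rank_of_identity(toy_result))
```

Applied to every entry in `results[metric]`, this would give one rank (or `None`) per probe, which makes the metrics easier to compare than the raw distance values.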

## Observations

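- **Euclidean**, **Cosine**, and **Dot Product** can yield different rankings when the embeddings are not L2-normalized. For unit-length embeddings, cosine similarity and dot product are identical, and Euclidean distance produces the same ordering because $\|a - b\|^2 = 2 - 2\,a \cdot b$.
- The **Minkowski** and **Euclidean** metrics reported the smallest distance values. These raw values are not comparable across metrics, since each metric has its own scale (and dot product is a similarity, where larger means closer), so a smaller number does not by itself indicate a better match.
- The choice of distance measure can change the ranking of nearest neighbors; further experiments are needed to determine which measure best fits the application.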
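
The relationship between the metrics on normalized embeddings can be checked with a small experiment. This is a sketch on synthetic random vectors, not on the gallery embeddings; the array sizes, seed, and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.normal(size=(50, 8))   # synthetic stand-in for gallery embeddings
probe = rng.normal(size=8)           # synthetic stand-in for one probe embedding

# L2-normalize so that every vector has unit length
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probe /= np.linalg.norm(probe)

dot = gallery @ probe                                     # similarity
cos = dot / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(probe))
euc = np.linalg.norm(gallery - probe, axis=1)             # distance

# Sort similarities in descending order and distances in ascending order
rank_dot = np.argsort(-dot)[:5]
rank_cos = np.argsort(-cos)[:5]
rank_euc = np.argsort(euc)[:5]

# Barring floating-point ties, the three top-5 lists should agree
print(np.array_equal(rank_dot, rank_cos), np.array_equal(rank_dot, rank_euc))
```

If the three metrics disagree on the real gallery, one plausible explanation is that the embeddings are not unit length, which could be confirmed by inspecting `np.linalg.norm(embeddings, axis=1)`.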

