# Demonstration of Extraction Process

In [1]:
import os
from PIL import Image
import numpy as np
import torch
import sys


# Add the parent directory to the system path
sys.path.append(os.path.abspath(".."))

# Now you can import from the modules folder
from modules.extraction.preprocessing import Preprocessing
from modules.extraction.embedding import Embedding


def load_first_image(directory):
    """
    Load the first valid image file found in the given directory.
    Assumes image files have extensions .jpg or .png and ignores hidden/system files.
    """
    files = [f for f in os.listdir(directory)
             if f.lower().endswith(('.jpg', '.png')) and not f.startswith("._")]
    if not files:
        raise FileNotFoundError(f"No image files found in {directory}")
    files.sort()  # Ensure consistent order
    return os.path.join(directory, files[0])


def euclidean_distance(vec1, vec2):
    """Compute the Euclidean distance between two vectors."""
    return np.linalg.norm(vec1 - vec2)

# List of individuals to process
individuals = ['Drew_Barrymore', 'Warren_Buffett', 'Owen_Wilson', 'Nelson_Mandela', 'Ian_Thorpe']

# Dictionaries to store the file paths for probe and gallery images
probe_images = {}
gallery_images = {}


# Get the absolute path to the parent directory
parent_dir = os.path.abspath("..")

# Construct file paths for probe and gallery images using the parent directory
for person in individuals:
    probe_dir = os.path.join(parent_dir, "storage", "probe", person)
    gallery_dir = os.path.join(parent_dir, "storage", "multi_image_gallery", person)
    probe_images[person] = load_first_image(probe_dir)
    gallery_images[person] = load_first_image(gallery_dir)


print("Probe Images:")
for person, path in probe_images.items():
    print(f"{person}: {path}")

print("\nGallery Images:")
for person, path in gallery_images.items():
    print(f"{person}: {path}")

# Initialize the preprocessing pipeline
preprocessor = Preprocessing(image_size=160)

def process_and_encode(image_path, embedding_model):
    """
    Open the image, preprocess it, and compute its embedding using the given model.
    """
    image = Image.open(image_path)
    tensor = preprocessor.process(image)
    embedding = embedding_model.encode(tensor)
    return embedding

# Determine the device based on availability
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize two embedding models with different pretrained sources
embedding_casia = Embedding(pretrained='casia-webface', device=device)
embedding_vgg = Embedding(pretrained='vggface2', device=device)


Probe Images:
Drew_Barrymore: C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\probe\Drew_Barrymore\Drew_Barrymore_0002.jpg
Warren_Buffett: C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\probe\Warren_Buffett\Warren_Buffett_0002.jpg
Owen_Wilson: C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\probe\Owen_Wilson\Owen_Wilson_0002.jpg
Nelson_Mandela: C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\probe\Nelson_Mandela\Nelson_Mandela_0002.jpg
Ian_Thorpe: C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\probe\Ian_Thorpe\Ian_Thorpe_0002.jpg

Gallery Images:
Drew_Barrymore: C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ir

#### **Task 1: Assignment Instructions:**

Demonstrate the embedding capability of the service for 'casia-webface' and 'vggface2' by calculating the Euclidean distance between each of the following five probe images and its corresponding gallery image:

- Drew Barrymore
- Warren Buffett
- Owen Wilson
- Nelson Mandela
- Ian Thorpe

In [2]:
results = {}

for person in individuals:
    # Process and encode the probe and gallery images using the "casia-webface" model
    emb_probe_casia = process_and_encode(probe_images[person], embedding_casia)
    emb_gallery_casia = process_and_encode(gallery_images[person], embedding_casia)
    dist_casia = euclidean_distance(emb_probe_casia, emb_gallery_casia)
    
    # Process and encode using the "vggface2" model
    emb_probe_vgg = process_and_encode(probe_images[person], embedding_vgg)
    emb_gallery_vgg = process_and_encode(gallery_images[person], embedding_vgg)
    dist_vgg = euclidean_distance(emb_probe_vgg, emb_gallery_vgg)
    
    results[person] = {"casia-webface": dist_casia, "vggface2": dist_vgg}

# Display the computed Euclidean distances
print("Euclidean distances between probe and gallery images:")
for person, distances in results.items():
    print(f"{person}: casia-webface = {distances['casia-webface']:.4f}, vggface2 = {distances['vggface2']:.4f}")


Euclidean distances between probe and gallery images:
Drew_Barrymore: casia-webface = 0.5837, vggface2 = 1.0935
Warren_Buffett: casia-webface = 0.9267, vggface2 = 0.9987
Owen_Wilson: casia-webface = 0.7567, vggface2 = 0.5736
Nelson_Mandela: casia-webface = 0.6206, vggface2 = 0.8532
Ian_Thorpe: casia-webface = 0.6561, vggface2 = 1.1900


#### Observations:

Here are some observations based on the computed Euclidean distances:

- **Variation Between Models:**  
  For four of the five individuals (Drew Barrymore, Warren Buffett, Nelson Mandela, Ian Thorpe), the casia-webface model produces the lower probe-to-gallery distance, suggesting it might be more consistent for these faces. For Owen Wilson, however, vggface2 produces the lower distance (0.5736 vs. 0.7567), indicating that performance can be subject-dependent.

- **Model Sensitivity:**  
  The differences in distances might be attributed to how each model was trained. For example, casia-webface seems to generate embeddings that are closer for Drew Barrymore and Ian Thorpe, while vggface2 shows an advantage for Owen Wilson. Warren Buffett’s distances are quite similar across both models, hinting at consistent performance for that identity.

- **Interpretation of Distances:**  
  Since Euclidean distance is a measure of dissimilarity (lower means more similar), these results imply that for most individuals, casia-webface places the probe and gallery embeddings closer together, with the exception of Owen Wilson, where vggface2 shows a better match. However, without a predefined threshold or validation against negative (non-matching) pairs, these numbers serve only as relative indicators.
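
The point about thresholds and negative pairs can be sketched in code. The example below is a minimal illustration only: it uses hypothetical synthetic unit vectors in place of real embeddings (no output from the models above is used), and the midpoint rule for choosing the threshold is an assumption, not the method used by this service.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical stand-ins for embeddings: each identity has a center,
# and each image of that identity is the center plus a little noise.
dim, n_ids = 128, 20
centers = [unit(rng.normal(size=dim)) for _ in range(n_ids)]
probe = [unit(c + 0.03 * rng.normal(size=dim)) for c in centers]
gallery = [unit(c + 0.03 * rng.normal(size=dim)) for c in centers]

# Genuine pairs share an identity; impostor pairs do not.
genuine = np.array([np.linalg.norm(probe[i] - gallery[i])
                    for i in range(n_ids)])
impostor = np.array([np.linalg.norm(probe[i] - gallery[j])
                     for i in range(n_ids)
                     for j in range(n_ids) if i != j])

# Assumed rule for illustration: threshold midway between the two means.
threshold = (genuine.mean() + impostor.mean()) / 2
false_reject_rate = float(np.mean(genuine > threshold))
false_accept_rate = float(np.mean(impostor <= threshold))

print(f"mean genuine distance:  {genuine.mean():.3f}")
print(f"mean impostor distance: {impostor.mean():.3f}")
print(f"threshold:              {threshold:.3f}")
print(f"false reject rate:      {false_reject_rate:.3f}")
print(f"false accept rate:      {false_accept_rate:.3f}")
```

With real data, the genuine and impostor distances would come from probe and gallery embeddings of the same and of different people, and each model would need its own threshold.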




---

#### **Task 2: Assignment Instructions:**

Precompute the embeddings of ALL images stored in storage/gallery/*. For each of the five probe images, calculate the following distance against all the images in the gallery. Sort the embeddings from shortest to longest distance and print the images of the ten nearest neighbors and the name associated with each image. Note your observations.

In [3]:
def dot_product_distance(vec1, vec2):
    """
    Compute a distance based on the negative dot product.
    A higher dot product (more similar) gives a lower "distance".
    """
    return -np.dot(vec1, vec2)

def cosine_distance(vec1, vec2):
    """
    Compute the cosine distance between two vectors.
    Cosine distance is defined as 1 minus the cosine similarity.
    """
    cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    return 1 - cos_sim

def minkowski_distance(vec1, vec2, p=3):
    """
    Compute the Minkowski distance with order p (default p=3).
    """
    return np.sum(np.abs(vec1 - vec2) ** p) ** (1/p)


In [4]:
# Precompute embeddings for all images in the gallery.
gallery_embeddings = []

# Set the gallery base folder; adjust as needed.
gallery_base = os.path.join(parent_dir, "storage", "multi_image_gallery")

# List all subdirectories (each representing a person)
persons_in_gallery = [d for d in os.listdir(gallery_base) if os.path.isdir(os.path.join(gallery_base, d))]

for person in persons_in_gallery:
    person_dir = os.path.join(gallery_base, person)
    # List valid image files (exclude hidden files)
    image_files = [f for f in os.listdir(person_dir) if f.lower().endswith(('.jpg', '.png')) and not f.startswith("._")]
    image_files.sort()
    for img in image_files:
        img_path = os.path.join(person_dir, img)
        try:
            image = Image.open(img_path)
            tensor = preprocessor.process(image)
            # Using the casia-webface model for gallery embeddings
            embedding = embedding_casia.encode(tensor)
            gallery_embeddings.append({"person": person, "image_path": img_path, "embedding": embedding})
        except Exception as e:
            print(f"Error processing image {img_path}: {e}")

print(f"Precomputed embeddings for {len(gallery_embeddings)} gallery images.")


Precomputed embeddings for 2265 gallery images.
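
With a couple of thousand gallery embeddings, the per-image distance loop can also be written as a single array operation. The sketch below assumes each embedding is a 1-D NumPy array of the same length, so that the gallery can be stacked into one matrix (for example with `np.stack`). The helper `top_k_euclidean` and the toy data are hypothetical and not part of the `modules` package.

```python
import numpy as np

def top_k_euclidean(probe, gallery_matrix, k=10):
    """Return (indices, distances) of the k gallery rows closest to probe."""
    # Broadcasting subtracts the probe from every gallery row at once.
    dists = np.linalg.norm(gallery_matrix - probe, axis=1)
    # A stable sort keeps the original order among exact ties.
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

# Hypothetical toy gallery: 5 embeddings of dimension 3.
gallery_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
    [0.8, 0.6, 0.0],
])
probe = np.array([1.0, 0.0, 0.0])

idx, d = top_k_euclidean(probe, gallery_matrix, k=3)
print(idx.tolist())           # [0, 4, 3]
print(np.round(d, 4).tolist())  # [0.0, 0.6325, 0.8944]
```

The returned indices can be used to look up the `person` and `image_path` fields of the matching gallery entries.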


In [5]:
# For each probe image in our individuals list, compute distances against all gallery embeddings.
# We'll use the casia-webface embedding for consistency.
probe_results = {}

for person in individuals:
    # Compute the probe embedding with the helper defined earlier
    probe_embedding = process_and_encode(probe_images[person], embedding_casia)
    
    # Prepare dictionaries to store distances per metric
    distances = {"euclidean": [], "dot_product": [], "cosine": [], "minkowski": []}
    
    for gallery_entry in gallery_embeddings:
        g_embedding = gallery_entry["embedding"]
        d_euclid = euclidean_distance(probe_embedding, g_embedding)
        d_dot = dot_product_distance(probe_embedding, g_embedding)
        d_cos = cosine_distance(probe_embedding, g_embedding)
        d_mink = minkowski_distance(probe_embedding, g_embedding)
        
        distances["euclidean"].append((d_euclid, gallery_entry))
        distances["dot_product"].append((d_dot, gallery_entry))
        distances["cosine"].append((d_cos, gallery_entry))
        distances["minkowski"].append((d_mink, gallery_entry))
    
    # Sort each metric's list: from smallest to largest "distance"
    for metric in distances:
        distances[metric].sort(key=lambda x: x[0])
    
    probe_results[person] = distances

# Display the top 10 nearest neighbors for each probe image and for each metric.
for person, metrics in probe_results.items():
    print(f"\nResults for probe image of {person}:")
    for metric, results_list in metrics.items():
        print(f"\nTop 10 nearest neighbors using {metric} distance:")
        for rank, (dist_val, gallery_entry) in enumerate(results_list[:10], start=1):
            print(f"Rank {rank}: {gallery_entry['person']} - {gallery_entry['image_path']} with distance {dist_val:.4f}")



Results for probe image of Drew_Barrymore:

Top 10 nearest neighbors using euclidean distance:
Rank 1: Julie_Gerberding - C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\multi_image_gallery\Julie_Gerberding\Julie_Gerberding_0007.jpg with distance 0.3953
Rank 2: Geoff_Hoon - C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\multi_image_gallery\Geoff_Hoon\Geoff_Hoon_0006.jpg with distance 0.3983
Rank 3: Oscar_De_La_Hoya - C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\multi_image_gallery\Oscar_De_La_Hoya\Oscar_De_La_Hoya_0005.jpg with distance 0.4025
Rank 4: Dwayne_Johnson - C:\Users\Putna\OneDrive - Johns Hopkins\Documents\Johns Hopkins\Creating AI Enabled Systems\SP25\ironclad\storage\multi_image_gallery\Dwayne_Johnson\Dwayne_Johnson_0001.jpg with distance 0.4046
Rank 5: Julie_Gerberding - C

#### Observations:

These results offer several interesting insights:

- **Consistency Across Metrics:**  
  For each probe, the ordering of nearest neighbors is nearly identical across Euclidean, dot product, cosine, and Minkowski distances, even though the absolute values differ: Euclidean distances fall in the range 0.39–0.63, dot-product distances are negative, cosine distances are small fractions, and Minkowski (p=3) distances are lower still. Agreement among Euclidean, dot product, and cosine is expected if the embeddings are L2-normalized (an assumption this notebook does not verify), because for unit vectors all three are monotonic functions of the same dot product. The stable ranking therefore reflects how the metrics relate to one another more than how well the embeddings separate identities.

- **Identity Match Variability:**  
  - For **Nelson_Mandela**, the correct gallery image appears at rank 1 across all metrics, suggesting that his facial features are well captured by the model.  
  - In contrast, the probes for **Drew_Barrymore**, **Warren_Buffett**, **Owen_Wilson**, and **Ian_Thorpe** do not have their corresponding identities in the top 10. For example, Drew_Barrymore’s probe is most similar to images of Julie_Gerberding, Geoff_Hoon, and Oscar De La Hoya rather than to another Drew_Barrymore image. This discrepancy might reflect either a limitation in the embedding quality for those particular faces or a lack of variability in the gallery images for those identities.

- **Distance Gaps and Clustering:**  
  The differences between successive ranks (e.g., for Drew_Barrymore, the gap between the top candidate at 0.3953 and the 10th at 0.4317 in Euclidean distance) are relatively small. This narrow range suggests that several gallery images lie within a tight cluster in the embedding space, which could lead to ambiguity when distinguishing among similar faces.

- **Implications for System Performance:**  
  While the embedding space appears to be internally consistent (as evidenced by similar rankings across metrics), the fact that several probe images do not retrieve the correct identity within the top 10 may indicate that further fine-tuning, enhanced preprocessing, or additional data might be necessary to improve discrimination between identities.

Overall, these observations highlight the importance of not only the choice of distance metric—which in this case appears to have a minimal impact on ranking—but also the need to ensure that the embedding model can reliably distinguish among all target identities.
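
The agreement between metrics can be checked on synthetic data. This is a minimal sketch using hypothetical random unit vectors rather than the gallery embeddings. It assumes unit-length inputs, in which case squared Euclidean distance equals twice the cosine distance, so Euclidean, cosine, and negative dot product all sort the gallery the same way. Minkowski distance with p=3 is not tied to the others by this identity, so its ordering can differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unit-length vectors standing in for embeddings.
gallery = rng.normal(size=(200, 64))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probe = rng.normal(size=64)
probe /= np.linalg.norm(probe)

dots = gallery @ probe
euclid = np.linalg.norm(gallery - probe, axis=1)
cosine = 1 - dots   # all norms are 1, so cosine similarity equals the dot product
neg_dot = -dots
mink3 = np.sum(np.abs(gallery - probe) ** 3, axis=1) ** (1 / 3)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b) = 2 * cosine distance.
identity_holds = bool(np.allclose(euclid ** 2, 2 * cosine))

# Euclidean and negative dot product give the same ordering.
same_order = bool(np.array_equal(np.argsort(euclid, kind="stable"),
                                 np.argsort(neg_dot, kind="stable")))

# Minkowski (p=3) may reorder neighbors; count the top-10 overlap.
top10_euclid = set(np.argsort(euclid, kind="stable")[:10].tolist())
top10_mink = set(np.argsort(mink3, kind="stable")[:10].tolist())
overlap = len(top10_euclid & top10_mink)

print("squared Euclidean == 2 * cosine distance:", identity_holds)
print("Euclidean and dot-product orderings match:", same_order)
print("top-10 overlap between Euclidean and Minkowski (p=3):", overlap)
```

Running the same `np.allclose` check on the real embeddings would show whether the unit-length assumption holds for this service.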

---

#### **Task 3: Assignment Instructions:**

Report the rank positions of the five probes' associated gallery images. Note your observations.

In [7]:
# Task 3: Report rank positions of the probe's associated gallery images

print("\n--- Rank Positions for Correct Gallery Matches ---\n")
for person, metrics in probe_results.items():
    print(f"Probe Image: {person}")
    for metric, results_list in metrics.items():
        rank_found = None
        # Iterate over the sorted list for the current metric.
        for rank, (dist_val, gallery_entry) in enumerate(results_list, start=1):
            # Check if the gallery image's identity matches the probe's identity.
            if gallery_entry["person"].lower() == person.lower():
                rank_found = rank
                break
        if rank_found is not None:
            print(f"  {metric} distance: Correct gallery image found at rank {rank_found}")
        else:
            print(f"  {metric} distance: No matching gallery image found in the gallery")
    print("-" * 60)



--- Rank Positions for Correct Gallery Matches ---

Probe Image: Drew_Barrymore
  euclidean distance: Correct gallery image found at rank 360
  dot_product distance: Correct gallery image found at rank 360
  cosine distance: Correct gallery image found at rank 360
  minkowski distance: Correct gallery image found at rank 377
------------------------------------------------------------
Probe Image: Warren_Buffett
  euclidean distance: Correct gallery image found at rank 521
  dot_product distance: Correct gallery image found at rank 521
  cosine distance: Correct gallery image found at rank 521
  minkowski distance: Correct gallery image found at rank 510
------------------------------------------------------------
Probe Image: Owen_Wilson
  euclidean distance: Correct gallery image found at rank 437
  dot_product distance: Correct gallery image found at rank 437
  cosine distance: Correct gallery image found at rank 437
  minkowski distance: Correct gallery image found at rank 444
---

#### Observations:

Here are some observations based on the rank positions of the correct gallery matches:

- **Distinctiveness in Embedding Space:**  
  Nelson_Mandela’s probe image is uniquely represented, as his corresponding gallery image is ranked 1 across all metrics. This suggests that his facial features are very distinctive in the embedding space.

- **Poor Rank Positions for Some Identities:**  
  For Drew_Barrymore, Warren_Buffett, and Owen_Wilson, the first correct gallery image appears very far down the sorted list of 2265 gallery images (Euclidean ranks 360, 521, and 437, respectively). Hundreds of images of other people are therefore scored as more similar than the true match, which might reflect overlapping features or a lack of distinctiveness for these individuals in the embedding space.

- **Metric Consistency:**  
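  In the output shown, the rank positions are identical across Euclidean, dot product, and cosine distances, with only minor variations for Minkowski. Identical ranks for the first three are what one would expect if the embeddings are unit length, since the three metrics then order neighbors the same way; Minkowski with p=3 weights large per-dimension differences more heavily, so its ranks shift slightly.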

- **Intermediate Performance for Ian_Thorpe:**  
  Ian_Thorpe’s correct match appears at a relatively lower rank (rank 33 for most metrics and rank 26 for Minkowski), suggesting a better match compared to Drew_Barrymore, Warren_Buffett, and Owen_Wilson, yet not as distinct as Nelson_Mandela.

- **Implications for System Tuning:**  
  The large rank differences for certain identities imply that further fine-tuning of the embedding model, preprocessing steps, or even augmenting the gallery data might be necessary to improve discrimination—especially for those identities with very high rank positions.

Overall, while the embedding space shows consistency across different distance metrics, the variation in rank positions highlights that some identities are much better captured than others, pointing to potential areas for improvement in the system.
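
To condense the rank positions into single figures, one option is a rank-k identification rate together with the mean reciprocal rank. The sketch below uses the Euclidean ranks stated in the output and observations above. The helper names are hypothetical, and five probes are too few for the rates to be more than illustrative.

```python
import numpy as np

# Euclidean rank of the first correct match, as reported above.
ranks = {
    "Drew_Barrymore": 360,
    "Warren_Buffett": 521,
    "Owen_Wilson": 437,
    "Nelson_Mandela": 1,
    "Ian_Thorpe": 33,
}

def rank_k_rate(rank_values, k):
    """Fraction of probes whose first correct match is within the top k."""
    return float(np.mean(np.asarray(rank_values) <= k))

def mean_reciprocal_rank(rank_values):
    """Average of 1/rank; 1.0 means every probe matched at rank 1."""
    return float(np.mean(1.0 / np.asarray(rank_values, dtype=float)))

values = list(ranks.values())
for k in (1, 10, 50, 500):
    print(f"rank-{k} identification rate: {rank_k_rate(values, k):.2f}")
print(f"mean reciprocal rank: {mean_reciprocal_rank(values):.4f}")
```

For these five probes the rank-1 and rank-10 rates are both 0.20, because only Nelson_Mandela is retrieved within the top 10.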