# 🛍️ VectorSearch Pro: Embedding Pipeline

**Course:** ITAI 1378 - Computer Vision  
**Author:** Alhassane Samassekou  
**Date:** December 2025  
**Notebook:** 2 of 3 - Vector Embedding Generation

---

## Project Overview

This notebook is the **second stage** of a three-part pipeline for building a semantic fashion search engine. The complete pipeline consists of:

| Notebook | Purpose | Output |
|----------|---------|--------|
| 01 - Setup & Exploration | Data acquisition and validation | Cleaned dataset |
| **02 - Embedding Pipeline** | Vector generation with CLIP | Embedding files (.npy) |
| 03 - Search Demo | Interactive search interface | Gradio web app |

### Objectives of This Notebook

1. **Model Loading**: Initialize the CLIP vision-language model
2. **Batch Processing**: Efficiently process 44,000+ images
3. **Vector Generation**: Create normalized 512-dimensional embeddings
4. **Artifact Export**: Save embeddings and metadata for search indexing

---

## Technical Background

### What is CLIP?

**CLIP (Contrastive Language-Image Pre-training)** is a neural network developed by OpenAI that learns visual concepts from natural language supervision. It was trained on 400 million image-text pairs collected from the internet.

### Why CLIP for Fashion Search?

| Advantage | Explanation |
|-----------|-------------|
| **Zero-Shot Capability** | No fine-tuning required for fashion domain |
| **Multimodal Alignment** | Text and images share the same embedding space |
| **Semantic Understanding** | Captures style, color, and category concepts |
| **Pre-trained Quality** | Robust representations from massive training data |

### Embedding Process

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Image     │────▶│   CLIP      │────▶│   512-dim   │────▶│   L2 Norm   │
│   (JPEG)    │     │   ViT-B/32  │     │   Vector    │     │   Vector    │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                           │
                    ┌──────┴──────┐
                    │  Patch      │
                    │  Embedding  │
                    │  + Attention│
                    └─────────────┘
```

---

## 1. Environment Setup

This section imports the required libraries:

- **transformers**: Hugging Face library for accessing CLIP
- **torch**: PyTorch deep learning framework
- **PIL**: Python Imaging Library for image loading
- **tqdm**: Progress bar for batch processing
- **numpy/pandas**: Data manipulation

In [None]:
import os
import numpy as np
import pandas as pd
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from tqdm.notebook import tqdm

---

## 2. Data Verification

Before proceeding, we verify that the dataset from Notebook 01 is available. If not, the Kaggle download process is triggered.

### Expected Directory Structure

```
working_directory/
├── images/
│   ├── 15970.jpg
│   ├── 39386.jpg
│   └── ... (44,000+ images)
├── styles.csv
└── (notebook files)
```
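
A quick way to confirm this layout before running the rest of the notebook is sketched below. The helper name `check_dataset_layout` is hypothetical (it is not part of Notebook 01); it only assumes the `images/` folder and `styles.csv` file shown above.

```python
from pathlib import Path


def check_dataset_layout(root="."):
    """Report whether the expected dataset files exist under `root`.

    Hypothetical helper: assumes an images/ folder of .jpg files
    sitting next to styles.csv, as in the tree above.
    """
    root = Path(root)
    images_dir = root / "images"
    csv_file = root / "styles.csv"
    # Count .jpg files only if the folder is actually there
    n_images = len(list(images_dir.glob("*.jpg"))) if images_dir.is_dir() else 0
    return {
        "has_images_dir": images_dir.is_dir(),
        "has_styles_csv": csv_file.is_file(),
        "n_images": n_images,
    }


print(check_dataset_layout("."))
```

If `has_images_dir` or `has_styles_csv` comes back `False`, the download cell below (or Notebook 01) needs to run first.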

In [None]:
# If the dataset is not present, download it
if not os.path.exists('images'):
    print("⚠️ Data not found. Please run Notebook 01 first, or run the cells below:")
    from google.colab import files
    print("Upload your kaggle.json file:")
    uploaded = files.upload()
    !mkdir -p ~/.kaggle
    !cp kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !pip install -q kaggle
    !kaggle datasets download -d paramaggarwal/fashion-product-images-small
    !unzip -q fashion-product-images-small.zip
else:
    print("✅ Dataset found.")

⚠️ Data not found. Please run Notebook 01 first, or run the cells below:
Upload your kaggle.json file:


Saving kaggle.json to kaggle.json
Dataset URL: https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small
License(s): MIT
Downloading fashion-product-images-small.zip to /content
100% 565M/565M [00:03<00:00, 150MB/s]


---

## 3. Model Initialization

We load the **CLIP ViT-B/32** model from Hugging Face. This model uses a Vision Transformer (ViT) as the image encoder.

### Model Specifications

| Parameter | Value |
|-----------|-------|
| Model ID | `openai/clip-vit-base-patch32` |
| Architecture | Vision Transformer (ViT-B/32) |
| Patch Size | 32×32 pixels |
| Embedding Dimension | 512 |
| Input Resolution | 224×224 (auto-resized) |
| Parameters | ~150M |
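
The patch size and input resolution in the table determine how many tokens the vision encoder sees per image. The arithmetic below is a small illustration; the variable names exist only for this example, and the extra class token is the usual convention for ViT-style encoders.

```python
# Values taken from the specification table above
input_resolution = 224   # pixels per side after preprocessing
patch_size = 32          # pixels per side of each square patch

patches_per_side = input_resolution // patch_size
num_patches = patches_per_side ** 2

# ViT-style encoders prepend one class token to the patch sequence
sequence_length = num_patches + 1

print(f"{patches_per_side} x {patches_per_side} grid = {num_patches} patches")
print(f"Sequence length including class token: {sequence_length}")
```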

### Hardware Detection

The code automatically detects CUDA availability and uses GPU acceleration when possible.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model_id = "openai/clip-vit-base-patch32"
print(f"Loading model: {model_id}...")

model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)
print("✅ Model loaded successfully!")

Using device: cuda
Loading model: openai/clip-vit-base-patch32...



✅ Model loaded successfully!


---

## 4. Data Preparation

This section loads the cleaned metadata and prepares the image paths for batch processing.

### Data Cleaning Steps

1. Load `styles.csv`, skipping malformed rows (`on_bad_lines='skip'`)
2. Construct full image paths from product IDs
3. Filter to only include existing image files
4. Extract parallel lists of paths and IDs for tracking

In [None]:
# Reload the metadata CSV to rebuild the image file paths
csv_path = 'styles.csv'
images_dir = 'images/'

# Skip malformed rows rather than failing on them
df = pd.read_csv(csv_path, on_bad_lines='skip')
# Image files are named "<id>.jpg"
df['image_path'] = df['id'].astype(str).map(lambda pid: os.path.join(images_dir, pid + '.jpg'))

# Filter for existing images only
df['exists'] = df['image_path'].apply(os.path.exists)
df_clean = df[df['exists']].copy()

# We need a list of paths and IDs to keep track of what we are embedding
image_paths = df_clean['image_path'].tolist()
image_ids = df_clean['id'].tolist()

print(f"Ready to process {len(image_paths)} images.")

Ready to process 44419 images.


---

## 5. Batch Embedding Generation

This is the core processing loop that generates embeddings for all images.

### Why Batch Processing?

Processing images one-by-one is inefficient due to:
- GPU memory transfer overhead
- Underutilization of parallel compute
- Python loop overhead

**Batch processing** (32 images at a time) provides:
- ~10x speedup on GPU (rough estimate; the actual gain depends on hardware)
- Better memory utilization
- Progress visibility with tqdm
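
The batch count follows directly from the dataset size and the batch size. The sketch below uses a hypothetical `iter_batches` helper and a stand-in list of paths to show how slicing produces the batches, including the final partial one.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the real list of image paths (44,419 entries in this dataset)
fake_paths = [f"images/{n}.jpg" for n in range(44419)]
batches = list(iter_batches(fake_paths, batch_size=32))

num_batches = len(batches)
last_batch_size = len(batches[-1])

print(num_batches)                   # number of batches
print(last_batch_size)               # size of the final, partial batch
print(math.ceil(44419 / 32))         # same count via ceiling division
```

Because slicing past the end of a list simply returns the remaining items, the last batch needs no special handling.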

### Processing Pipeline

For each batch of 32 images:

```
1. Load Images      → Open JPEG files, convert to RGB
2. Preprocess       → Resize to 224×224, normalize pixels
3. Forward Pass     → Run through CLIP vision encoder
4. L2 Normalize     → Unit-length vectors for cosine similarity
5. Store Results    → Append to embeddings list
```

### Normalization

L2 normalization is **critical** for vector search:
- Converts vectors to unit length (||v|| = 1)
- Enables cosine similarity via inner product
- Makes all vectors comparable regardless of magnitude
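
To make the second point concrete, here is a minimal NumPy check on two random vectors (synthetic data, not real CLIP embeddings): once both vectors are scaled to unit length, their plain inner product equals their cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512).astype(np.float32)
b = rng.normal(size=512).astype(np.float32)

# Cosine similarity computed directly from its definition
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# L2-normalize once, then a plain inner product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print(np.isclose(cosine, inner, atol=1e-5))
print(np.linalg.norm(a_unit), np.linalg.norm(b_unit))
```

Normalizing once at embedding time means the search stage can rank results with inner products alone.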

In [None]:
# Processing 44k images one by one is slow, so we process them in batches of 32.

batch_size = 32
embeddings = []
embedded_ids = []  # IDs of the images that were actually embedded, in row order

print("Starting embedding process... (This will take time!)")

# Loop through the data in chunks of 'batch_size'
for i in tqdm(range(0, len(image_paths), batch_size)):
    # 1. Get the batch of file paths
    batch_paths = image_paths[i : i + batch_size]

    # 2. Open images (handling potential errors)
    images = []
    valid_indices = []  # Positions within this batch that opened successfully
    for idx, path in enumerate(batch_paths):
        try:
            image = Image.open(path).convert("RGB")
            images.append(image)
            valid_indices.append(idx)
        except Exception as e:
            print(f"Skipping unreadable image: {path} ({e})")

    if not images:
        continue

    # 3. Preprocess images for the model
    # return_tensors="pt" gives us PyTorch tensors
    inputs = processor(images=images, return_tensors="pt").to(device)

    # 4. Generate embeddings (inference)
    with torch.no_grad():  # Disable gradient calculation to save memory
        outputs = model.get_image_features(**inputs)

    # 5. Normalize vectors (crucial for cosine similarity later!)
    # This makes the "length" of every arrow 1, so we only compare direction.
    outputs = outputs / outputs.norm(p=2, dim=-1, keepdim=True)

    # 6. Move to CPU and store, recording which IDs these rows belong to
    embeddings.append(outputs.cpu().numpy())
    embedded_ids.extend(image_ids[i + idx] for idx in valid_indices)

# 7. If any image was skipped, drop it from the ID list and metadata so that
# row k of the embeddings still corresponds to image_ids[k]
if len(embedded_ids) != len(image_ids):
    df_clean = df_clean[df_clean['id'].isin(embedded_ids)].copy()
    image_paths = df_clean['image_path'].tolist()
    image_ids = df_clean['id'].tolist()

Starting embedding process... (This will take time!)

---

## 6. Export Artifacts

This section saves the generated embeddings and metadata to disk for use in the search demo.

### Output Files

| File | Format | Description | Size |
|------|--------|-------------|------|
| `image_embeddings.npy` | NumPy binary | 44,419 × 512 float32 matrix | ~86 MB |
| `image_ids.npy` | NumPy binary | Product ID array (int64) | ~355 KB |
| `metadata_clean.csv` | CSV | ID, name, path mapping | ~2 MB |

### Data Shapes

```
image_embeddings.npy: (44419, 512)
                       ↑       ↑
                   # images  embedding dim
```
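
The file sizes follow from the array shapes and dtypes. The arithmetic below is a rough sketch: it assumes float32 embeddings and int64 IDs (what NumPy typically produces from Python integers on Linux; an int32 ID array would be half the size) and ignores the small `.npy` header.

```python
import numpy as np

n_images, dim = 44419, 512

# Raw array payloads (the .npy header adds a small fixed overhead on top)
embedding_bytes = n_images * dim * np.dtype(np.float32).itemsize
id_bytes_int64 = n_images * np.dtype(np.int64).itemsize  # assumption: int64 IDs

print(f"Embeddings: {embedding_bytes:,} bytes = {embedding_bytes / 2**20:.1f} MiB")
print(f"IDs (int64): {id_bytes_int64:,} bytes = {id_bytes_int64 / 2**10:.0f} KiB")
```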

In [None]:
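# Concatenate all batch outputs into one big array
final_embeddings = np.concatenate(embeddings)
final_ids = np.array(image_ids)

# Each embedding row must correspond to exactly one product ID
assert final_embeddings.shape[0] == len(final_ids), "Embeddings and IDs are misaligned"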

print(f"\n✅ Processing complete!")
print(f"Embeddings Shape: {final_embeddings.shape}")
# Expected shape: (num_images, 512)

# Save to disk
np.save("image_embeddings.npy", final_embeddings)
np.save("image_ids.npy", final_ids)
# Save a simplified CSV for the demo later
df_clean[['id', 'productDisplayName', 'image_path']].to_csv("metadata_clean.csv", index=False)

print("Files saved: image_embeddings.npy, image_ids.npy, metadata_clean.csv")


✅ Processing complete!
Embeddings Shape: (44419, 512)
Files saved: image_embeddings.npy, image_ids.npy, metadata_clean.csv


---

## 7. Summary & Performance

### Processing Statistics

| Metric | Value |
|--------|-------|
| Total Images Processed | 44,419 |
| Batch Size | 32 |
| Total Batches | 1,389 |
| Processing Time | ~1h 54m (CPU) |
| Embedding Dimension | 512 |
| Output Size | ~86 MB |

### Performance Notes

- **GPU Acceleration**: With CUDA-enabled GPU, processing time drops to ~15-20 minutes
- **Memory Usage**: Peak ~4GB RAM for batch processing
- **Error Handling**: Corrupt images are skipped without stopping the pipeline

### Verification

To verify the embeddings are correctly normalized:
```python
# All vectors should have unit length (≈1.0)
norms = np.linalg.norm(final_embeddings, axis=1)
print(f"Norm range: [{norms.min():.4f}, {norms.max():.4f}]")
```
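
Beyond the norm check, it can be worth confirming that the saved arrays still line up with each other. The sketch below uses a hypothetical `check_artifacts` helper and synthetic data; in this notebook it would be called with the arrays loaded via `np.load`.

```python
import numpy as np


def check_artifacts(embeddings, ids, dim=512, tol=1e-3):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if embeddings.ndim != 2 or embeddings.shape[1] != dim:
        problems.append(f"unexpected embedding shape {embeddings.shape}")
        return problems
    if len(ids) != embeddings.shape[0]:
        problems.append(f"{len(ids)} ids vs {embeddings.shape[0]} embedding rows")
    if not np.isfinite(embeddings).all():
        problems.append("embeddings contain NaN or inf")
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=tol):
        problems.append("some rows are not unit length")
    return problems


# Synthetic stand-in for the real artifacts
rng = np.random.default_rng(0)
vecs = rng.normal(size=(10, 512)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

print(check_artifacts(vecs, ids=np.arange(10)))   # matching row counts
print(check_artifacts(vecs, ids=np.arange(9)))    # one ID short
```

A mismatch between the number of IDs and embedding rows is the failure to watch for, since it would silently map search results to the wrong products.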

---

## Next Steps

Continue to **Notebook 03 - Search Demo** to:
1. Build a FAISS index from the embeddings
2. Implement text and image search functions
3. Create an interactive Gradio web interface