# AI/ML Engineer Assignment - Multi-Modal Fashion Recommendation Engine

## Assignment Overview

This notebook implements **Task 1** of the fashion recommendation engine assignment:

### Objective
Train a **multi-modal embedding model** that combines fashion product text + images into a shared semantic embedding space.

### Components Covered
1. **Multi-Modal Model Training**: Fine-tune CLIP using contrastive learning
2. **Evaluation Metrics**: Cosine similarity and Top-k accuracy for retrieval
3. **Model Optimization**: Domain-specific adaptation for fashion terminology
4. **Model Deployment**: Prepare for vector database integration


### Dataset
- **Source**: Fashion Product Images Dataset (Small) from Kaggle
- **Content**: Product images, titles, descriptions, categories, gender, price
- **Size**: Optimized subset for efficient training and evaluation

---

### imports

In [1]:
# First, install compatible NumPy version
%pip install numpy==1.26.4
%pip install jupyterlab ipykernel ipywidgets requests isodate pandas datasets torch==2.3.0 torchvision torchaudio transformers==4.48.0 sentence-transformers==3.3.1 accelerate

Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
Successfully installed numpy-1.26.4
Note: you may need to restart the kernel to use updated packages.
Collecting jupyterlab
  Downloading jupyterlab-4.4.5-py3-none-any.whl (12.3 MB)
Collecting isodate
  Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Collecting pandas
  Downloading pandas-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting datasets
  Downloading datasets-4.0.0-py3-none-

In [None]:
# ONLY if you're on TPU (Kaggle/Colab TPU)
!pip install torch==2.0.1 torchvision==0.15.2
!pip install -U torch_xla==2.0 -f https://storage.googleapis.com/libtpu-releases/index.html


In [1]:
from datasets import load_dataset

from PIL import Image
import requests

# Add this import at the top
import numpy as np

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.evaluation import TripletEvaluator, SentenceEvaluator

from typing import List, Dict
import torch



### import model and dataset

##  Load Pre-trained CLIP Model and Fashion Dataset

Load the pre-trained CLIP ViT-L-14 model and prepare the fashion product images dataset for multi-modal embedding training.

In [2]:
model_name = "sentence-transformers/clip-ViT-L-14"
model = SentenceTransformer(model_name)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


### Download the dataset from my public Hugging Face Hub repository

In [3]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("dejasi5459/fashion-product-images-small")

# Explore the available splits
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['image', 'styles', 'final_style'],
        num_rows: 44419
    })
})


### freeze model params

## Configure Model Parameters for Fashion Fine-tuning

Freeze all model parameters except the two projection layers (`visual_projection` and `text_projection`), so CLIP can be adapted to fashion product embeddings while training only a small fraction of its weights.

In [4]:
# pick specific layers to train (note: you can add more layers to this list)
trainable_layers_list = ['projection']

# Apply freezing configuration
for name, param in model.named_parameters():
    # freeze all params
    param.requires_grad = False
    

    # unfreeze layers in trainable_layers_list
    if any(layer in name for layer in trainable_layers_list):
        param.requires_grad = True

In [5]:
# Verify trainable parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}")

Trainable: 0.model.visual_projection.weight
Trainable: 0.model.text_projection.weight


In [6]:
# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Percentage of trainable parameters: {100 * trainable_params / total_params:.2f}%")

Total parameters: 427,616,513
Trainable parameters: 1,376,256
Percentage of trainable parameters: 0.32%
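
The trainable count above can be cross-checked by hand. The following is a minimal sketch that assumes the standard CLIP ViT-L/14 dimensions (vision width 1024, text width 768, shared projection dimension 768) and bias-free projection layers; these dimensions are an assumption, since the notebook does not print them.

```python
# Assumed CLIP ViT-L/14 dimensions (not shown in the notebook output).
vision_hidden = 1024   # width of the vision transformer
text_hidden = 768      # width of the text transformer
projection_dim = 768   # size of the shared embedding space

# Each projection is treated as a bias-free linear layer: one weight matrix.
visual_projection = vision_hidden * projection_dim
text_projection = text_hidden * projection_dim
trainable = visual_projection + text_projection

total = 427_616_513  # total parameter count printed above
print(f"visual_projection: {visual_projection:,}")
print(f"text_projection:   {text_projection:,}")
print(f"trainable:         {trainable:,} ({100 * trainable / total:.2f}%)")
```

Under these assumptions the two weight matrices account for all 1,376,256 trainable parameters.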


### preprocess data

##  Preprocess Fashion Product Dataset

Load and preprocess the Fashion Product Images Dataset for multi-modal training. Create triplets of:
- **Anchor**: Product images
- **Positive**: Matching product description (category attributes + product display name)
- **Negative**: Non-matching product descriptions

This enables contrastive learning to align similar fashion items in the embedding space.
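
To make the contrastive objective concrete, here is a minimal pure-Python sketch in the spirit of a multiple-negatives ranking loss: each anchor must pick its own positive out of every positive and negative in the batch. The toy 2-D vectors are hypothetical values chosen for illustration, and the scale of 20 is an assumed temperature, not a value taken from this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranking_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must select positive i among all candidates."""
    candidates = positives + negatives  # in-batch positives first, then hard negatives
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])  # negative log-softmax at the true index
    return sum(losses) / len(losses)

# Hypothetical toy embeddings: two "images" and their candidate "texts".
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each text points the same way as its image
swapped = [[0.1, 0.9], [0.9, 0.1]]   # texts paired with the wrong image
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = ranking_loss(anchors, aligned, negatives)
loss_swapped = ranking_loss(anchors, swapped, negatives)
print(f"aligned pairs: {loss_aligned:.4f}")
print(f"swapped pairs: {loss_swapped:.4f}")
```

The loss is close to zero when matching pairs are already aligned and large when they are not, which is the signal that pulls matching image and text embeddings together during fine-tuning.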

In [9]:
import torch
import torchvision.transforms as T
from PIL import Image

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Resize transform; it operates on PIL images and runs on the CPU
resize = T.Resize((224, 224), interpolation=T.InterpolationMode.LANCZOS)


def preprocess_gpu(batch):
    """Resize each image to 224x224 RGB and expose the triplet columns."""
    anchor_images = []

    for img in batch["image"]:
        try:
            # Convert to RGB and resize directly as a PIL image; the output must be
            # a PIL image anyway, so moving pixels to the GPU and back adds no value
            image = resize(img.convert("RGB"))
            anchor_images.append(image)
        except Exception as e:
            # Fall back to a black placeholder so the batch keeps its length
            print(f"Error processing image: {e}")
            placeholder = Image.new('RGB', (224, 224), color='black')
            anchor_images.append(placeholder)
    
    return {
        "anchor": anchor_images,
        "positive": [fs["Positive"] for fs in batch["final_style"]],
        "negative": [fs["NEGATIVE"] for fs in batch["final_style"]],
    }

# Select the first 10,000 examples from the train split because of memory constraints; this can be increased later
small_dataset = dataset["train"].select(range(10000))
columns_to_remove = [col for col in dataset['train'].column_names if col not in ['image', 'final_style']]

# Apply preprocessing
processed_dataset = small_dataset.map(
    preprocess_gpu,
    batched=True,
    batch_size=32,    # Efficient batch size for GPU
    # num_proc=4,       # Parallel processing with multiple workers
    remove_columns=columns_to_remove
)


Using device: cuda


Map: 100%|██████████| 10000/10000 [03:27<00:00, 48.17 examples/s]


In [10]:
print(processed_dataset[0])  # for a single dataset


{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=60x80 at 0x27E0755A150>, 'final_style': {'Unnamed: 0': 4682, 'gender': 'Women', 'masterCategory': 'Apparel', 'subCategory': 'Topwear', 'articleType': 'Tops', 'baseColour': 'White', 'season': 'Summer', 'year': 2011.0, 'usage': 'Casual', 'productDisplayName': 'UCB Women Sleeveless White Top', 'Positive': 'Women , Apparel , Topwear , Tops , White , Summer , Casual , UCB Women Sleeveless White Top', 'NEGATIVE': 'Men , Lounge Pants , Winter , Formal'}, 'anchor': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=224x224 at 0x27E07559250>, 'positive': 'Women , Apparel , Topwear , Tops , White , Summer , Casual , UCB Women Sleeveless White Top', 'negative': 'Men , Lounge Pants , Winter , Formal'}
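
The dataset ships the `Positive` string precomputed, and the notebook does not show how it was built. Judging only from the record printed above, it looks like a `" , "`-joined list of metadata fields followed by the display name. The sketch below reproduces that string under this assumption; `POSITIVE_FIELDS` and `build_positive` are hypothetical names, not part of the dataset or any library.

```python
# Metadata copied from the record printed above.
record = {
    "gender": "Women",
    "masterCategory": "Apparel",
    "subCategory": "Topwear",
    "articleType": "Tops",
    "baseColour": "White",
    "season": "Summer",
    "year": 2011.0,
    "usage": "Casual",
    "productDisplayName": "UCB Women Sleeveless White Top",
}

# Assumed field order, inferred from the printed 'Positive' value (year is absent).
POSITIVE_FIELDS = [
    "gender", "masterCategory", "subCategory", "articleType",
    "baseColour", "season", "usage", "productDisplayName",
]

def build_positive(style):
    """Join the selected metadata fields into one description string."""
    return " , ".join(str(style[field]) for field in POSITIVE_FIELDS)

positive_text = build_positive(record)
print(positive_text)
```

How the `NEGATIVE` string was chosen is not recoverable from the record, so it is not reconstructed here.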


In [11]:
dataset=processed_dataset

In [12]:
dataset

Dataset({
    features: ['image', 'final_style', 'anchor', 'positive', 'negative'],
    num_rows: 10000
})

In [13]:
from datasets import Dataset, DatasetDict

# STEP 1: Split into train/valid/test
train_valid = processed_dataset.train_test_split(test_size=0.2, seed=42)
valid_test = train_valid["test"].train_test_split(test_size=0.5, seed=42)

dataset = DatasetDict({
    "train": train_valid["train"],
    "valid": valid_test["train"],
    "test":  valid_test["test"]
})

# STEP 2: Keep only necessary columns
for split in ["train", "valid", "test"]:
    dataset[split] = dataset[split].select_columns(['anchor', 'positive', 'negative'])

# STEP 3: Ensure 'positive' and 'negative' are strings (for text encoders)
def ensure_text(example):
    return {
        "anchor": example["anchor"],  # Usually a PIL.Image or tensor
        "positive": str(example["positive"]),
        "negative": str(example["negative"])
    }

for split in ["train", "valid", "test"]:
    dataset[split] = dataset[split].map(ensure_text)

#  Final confirmation
print(dataset)


Map: 100%|██████████| 8000/8000 [02:21<00:00, 56.62 examples/s] 
Map: 100%|██████████| 1000/1000 [00:17<00:00, 57.73 examples/s]
Map: 100%|██████████| 1000/1000 [00:17<00:00, 56.01 examples/s]

DatasetDict({
    train: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 8000
    })
    valid: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 1000
    })
})





### eval pre-trained model

## Evaluate Pre-trained Model on Fashion Data

Test the baseline performance of the pre-trained CLIP model on fashion product data before fine-tuning. This establishes our baseline metrics for fashion product retrieval performance.
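
As a reference for how a triplet metric of this kind can be read, the sketch below computes cosine-based triplet accuracy on toy data: a triplet counts as correct when the anchor is more similar to its positive than to its negative. The vectors are hypothetical, and this is an illustration of the idea rather than the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    hits = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return hits / len(anchors)

# Hypothetical toy embeddings; the third triplet is deliberately wrong.
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positives = [[0.9, 0.2], [0.1, 0.8], [-1.0, 0.5]]
negatives = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.9]]

accuracy = triplet_accuracy(anchors, positives, negatives)
print(f"triplet accuracy: {accuracy:.3f}")
```

Because each anchor is compared against a single negative, this metric is much easier to score well on than retrieval over many candidates.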

In [14]:
# Updated create_triplet_evaluator with smaller batch size to avoid memory issues
def create_triplet_evaluator(set_name):
    """
    Create triplet evaluator for "train", "valid", or "test" split
    """
    data = dataset[set_name]
    # Take smaller subset for evaluation to avoid memory issues
    max_samples = min(100, len(data["anchor"]))  # Limit to 100 samples for evaluation
    
    anchors = list(data["anchor"][:max_samples])
    positives = list(data["positive"][:max_samples])
    negatives = list(data["negative"][:max_samples])
    
    return TripletEvaluator(
        anchors=anchors,
        positives=positives,
        negatives=negatives,
        name=f"fashion-{set_name}",
        batch_size=4,  # Use smaller batch size
        show_progress_bar=True
    )

In [15]:
evaluator_train = create_triplet_evaluator("train")
evaluator_valid = create_triplet_evaluator("valid")
print("Train:", evaluator_train(model))
print("Valid:", evaluator_valid(model))

Batches: 100%|██████████| 25/25 [00:05<00:00,  4.46it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 47.18it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 63.12it/s]


Train: {'fashion-train_cosine_accuracy': 0.9700000286102295}


Batches: 100%|██████████| 25/25 [00:04<00:00,  5.50it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 55.71it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 58.68it/s]

Valid: {'fashion-valid_cosine_accuracy': 1.0}





In [16]:
class ImageTextRetrievalEvaluator(SentenceEvaluator):
    """
    Custom evaluator for fashion product image-text retrieval performance.

    Measures Recall@k: how often the correct product description appears among
    the top-k most similar texts for each fashion product image.
    """
    def __init__(
        self,
        images: List,
        texts: List[str],
        name: str = '',
        k: int = 1,
        batch_size: int = 32,
        show_progress_bar: bool = False
    ):
        # Limit dataset size for evaluation to avoid memory issues
        max_samples = min(100, len(images))
        self.images = list(images[:max_samples])
        self.texts = list(texts[:max_samples])
        self.name = name
        self.k = k
        self.batch_size = batch_size
        self.show_progress_bar = show_progress_bar

    def __call__(self,
        model: SentenceTransformer,
        output_path: str = None,
        epoch: int = -1,
        steps: int = -1) -> Dict[str, float]:
        
        # Get embeddings for all images
        img_embeddings = model.encode(
            self.images,
            batch_size=self.batch_size,
            show_progress_bar=self.show_progress_bar,
            convert_to_tensor=True
        )
        
        # Get embeddings for all texts
        text_embeddings = model.encode(
            self.texts,
            batch_size=self.batch_size,
            show_progress_bar=self.show_progress_bar,
            convert_to_tensor=True
        )
        
        # Compute similarity matrix
        cos_scores = torch.nn.functional.cosine_similarity(
            img_embeddings.unsqueeze(1),
            text_embeddings.unsqueeze(0),
            dim=2
        )
        
        # Get indices of top k predictions for each image
        _, top_indices = torch.topk(cos_scores, k=self.k, dim=1)
        
        # Calculate Recall@k (correct if ground truth index is in top k predictions)
        correct = sum(i in top_indices[i].tolist() for i in range(len(self.images)))
        recall_at_k = correct / len(self.images)

        return {f'{self.name}_Recall@{self.k}': recall_at_k}
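
The Recall@k logic can be checked on a tiny similarity matrix without loading a model. This is a minimal sketch in plain Python; the scores are hypothetical, and `recall_at_k` is an illustrative helper, not part of sentence-transformers.

```python
def recall_at_k(similarity, k):
    """similarity[i][j] scores image i against text j; text i is the true match."""
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring texts for image i
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)

# Hypothetical scores for 3 images against 3 texts.
similarity = [
    [0.9, 0.3, 0.1],  # image 0: correct text ranked first
    [0.8, 0.7, 0.2],  # image 1: correct text ranked second
    [0.4, 0.6, 0.5],  # image 2: correct text ranked second
]

recall_1 = recall_at_k(similarity, k=1)
recall_2 = recall_at_k(similarity, k=2)
print(f"Recall@1: {recall_1:.2f}")
print(f"Recall@2: {recall_2:.2f}")
```

Raising k can only keep or increase the score, so Recall@1 is the strictest setting of this metric.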

In [17]:
def create_recall_evaluator(set_name, k=1):
    """
        Create recall evaluator for "train", "valid", or "test" split
    """
    # Convert to lists to avoid numpy indexing issues
    data = dataset[set_name]
    
    return ImageTextRetrievalEvaluator(
        images=list(data["anchor"]),
        texts=list(data["positive"]),
        name=f"faahion-recall-{set_name}",
        k=k,
        batch_size=4  # Smaller batch size to avoid memory issues
    )

In [18]:
# Create new evaluator with Recall@k
evaluator_recall_train = create_recall_evaluator("train", k=1)
evaluator_recall_valid = create_recall_evaluator("valid", k=1)

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))

Train: {'faahion-recall-train_Recall@1': 0.57}
Valid: {'faahion-recall-valid_Recall@1': 0.65}


### define training args

### fine-tune model

In [19]:
# define loss (note: loss expects columns to be ordered as anchor-positive-negative)
loss = MultipleNegativesRankingLoss(model)

# hyperparameters
num_epochs = 2
batch_size = 16
lr = 1e-4
finetuned_model_name = "clip-fashion-embeddings-10k-ft"

train_args = SentenceTransformerTrainingArguments(
    output_dir=f"models/{finetuned_model_name}",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=lr,
    # Evaluation settings
    eval_strategy="epoch",
    eval_steps=1,
    logging_steps=1,
)

In [20]:
%%time
trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
trainer.train()

                                                                             

| Epoch | Training Loss | Validation Loss | Faahion-recall-train Recall@1 | Faahion-recall-valid Recall@1 | Sequential Score |
|---|---|---|---|---|---|
| 1 | 0.1202 | 0.158124 | 0.7 | 0.74 | 0.74 |
| 2 | 0.0665 | 0.127873 | 0.75 | 0.75 | 0.75 |


CPU times: total: 29min 58s
Wall time: 16min 27s


TrainOutput(global_step=1000, training_loss=0.184337726441212, metrics={'train_runtime': 978.381, 'train_samples_per_second': 16.354, 'train_steps_per_second': 1.022, 'total_flos': 0.0, 'train_loss': 0.184337726441212, 'epoch': 2.0})

In [21]:
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

from sentence_transformers.evaluation import TripletEvaluator  # or any evaluator you're using

import torch

# Confirm GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training will run on: {device}")

# Move model to GPU
model = model.to(device)

#  Define loss
loss = MultipleNegativesRankingLoss(model)

#  Hyperparameters
num_epochs = 10
batch_size = 32
lr = 1e-5
finetuned_model_name = "clip-fashionAssign-embeddings"

# Training arguments
train_args = SentenceTransformerTrainingArguments(
    output_dir=f"models/{finetuned_model_name}",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=lr,
    save_strategy="epoch",
    eval_strategy="epoch",
    logging_steps=1,
    save_total_limit=2,
    fp16=True  # Optional: Enable mixed precision (faster on modern GPUs)
)

# Training
trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
print("test")
trainer.train()


Training will run on: cuda
test


| Epoch | Training Loss | Validation Loss | Faahion-recall-train Recall@1 | Faahion-recall-valid Recall@1 | Sequential Score |
|---|---|---|---|---|---|
| 1 | 0.2256 | 0.248443 | 0.72 | 0.78 | 0.78 |
| 2 | 0.085 | 0.245227 | 0.75 | 0.77 | 0.77 |
| 3 | 0.1758 | 0.240613 | 0.76 | 0.78 | 0.78 |
| 4 | 0.142 | 0.239504 | 0.76 | 0.79 | 0.79 |
| 5 | 0.191 | 0.237289 | 0.77 | 0.78 | 0.78 |
| 6 | 0.0553 | 0.238423 | 0.77 | 0.78 | 0.78 |
| 7 | 0.1824 | 0.232531 | 0.78 | 0.78 | 0.78 |
| 8 | 0.2342 | 0.232669 | 0.79 | 0.77 | 0.77 |
| 9 | 0.327 | 0.231574 | 0.79 | 0.76 | 0.76 |
| 10 | 0.0668 | 0.231298 | 0.79 | 0.76 | 0.76 |


TrainOutput(global_step=2500, training_loss=0.1683788457810879, metrics={'train_runtime': 1601.6873, 'train_samples_per_second': 49.947, 'train_steps_per_second': 1.561, 'total_flos': 0.0, 'train_loss': 0.1683788457810879, 'epoch': 10.0})
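
The `global_step` values reported by the two training runs follow from the dataset size and batch size. A minimal sketch of the arithmetic, assuming a single device and no gradient accumulation (neither is configured in the training arguments shown); `total_steps` is a hypothetical helper:

```python
import math

def total_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total optimizer steps) for a single device."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

train_size = 8000  # size of the train split created earlier

first_run = total_steps(train_size, batch_size=16, num_epochs=2)
second_run = total_steps(train_size, batch_size=32, num_epochs=10)

print(first_run)   # (500, 1000)
print(second_run)  # (250, 2500)
```

These totals match the `global_step=1000` and `global_step=2500` values in the two `TrainOutput` records.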

### evaluate fine-tuned model

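### Load Checkpoint-1000 ("Epoch 4" Model) for Evaluation

Load checkpoint-1000 from the `clip-fashion-embeddings-10k-ft` training run for evaluation and upload to the Hugging Face Hub. That run used 8,000 training examples with batch size 16 for 2 epochs (500 steps per epoch), so checkpoint-1000 is its final step, the end of epoch 2; the code below nonetheless labels it "epoch 4".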

In [22]:
# Load the epoch 4 model from checkpoint-1000 (clip-fashion-embeddings-10k-ft)
epoch_4_model_path = "models/clip-fashion-embeddings-10k-ft/checkpoint-1000"
epoch_4_model = SentenceTransformer(epoch_4_model_path)

print(f"Loaded epoch 4 model from: {epoch_4_model_path}")


Loaded epoch 4 model from: models/clip-fashion-embeddings-10k-ft/checkpoint-1000


In [23]:
evaluator_test = create_triplet_evaluator("test")

# Evaluate using the epoch 4 model (checkpoint-1000)
print(" Evaluation with Epoch 4 Model (checkpoint-1000)")
print("Train:", evaluator_train(epoch_4_model))
print("Valid:", evaluator_valid(epoch_4_model))
print("Test:", evaluator_test(epoch_4_model))

 Evaluation with Epoch 4 Model (checkpoint-1000)


Batches: 100%|██████████| 25/25 [00:05<00:00,  4.96it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 50.88it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 59.95it/s]



Train: {'fashion-train_cosine_accuracy': 1.0}


Batches: 100%|██████████| 25/25 [00:04<00:00,  5.42it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 60.68it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 63.78it/s]


Valid: {'fashion-valid_cosine_accuracy': 1.0}


Batches: 100%|██████████| 25/25 [00:04<00:00,  5.38it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 62.54it/s]
Batches: 100%|██████████| 25/25 [00:00<00:00, 64.69it/s]

Test: {'fashion-test_cosine_accuracy': 1.0}





In [24]:
evaluator_recall_test = create_recall_evaluator("test")

# Evaluate using the epoch 4 model (checkpoint-1000)
print("Recall Evaluation with Epoch 4 Model (checkpoint-1000)")
print("Train:", evaluator_recall_train(epoch_4_model))
print("Valid:", evaluator_recall_valid(epoch_4_model))
print("Test:", evaluator_recall_test(epoch_4_model))

Recall Evaluation with Epoch 4 Model (checkpoint-1000)
Train: {'faahion-recall-train_Recall@1': 0.75}
Valid: {'faahion-recall-valid_Recall@1': 0.75}
Test: {'faahion-recall-test_Recall@1': 0.78}


In [25]:
# Push the epoch 4 model to Hugging Face Hub
epoch_4_model_name = "clip-fashion-embeddings-final-10k-ft"

try:
    epoch_4_model.push_to_hub(f"dejasi5459/{epoch_4_model_name}")
    print(f"Successfully pushed dejasi5459/{epoch_4_model_name}")
except Exception as e:
    print(f"Error pushing model to hub: {e}")
    print("Make sure you're logged in to Hugging Face")

model.safetensors: 100%|██████████| 1.71G/1.71G [02:59<00:00, 9.50MB/s] 



Successfully pushed dejasi5459/clip-fashion-embeddings-final-10k-ft
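
For the vector database step named in the overview, one common preparation is to L2-normalize embeddings so that a plain inner product equals cosine similarity. The following is a minimal sketch with hypothetical 3-D vectors and an in-memory dictionary standing in for a real index; it does not use any particular vector database API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical product embeddings; a real index would hold model.encode(...) outputs.
raw_index = {
    "white top": [0.9, 0.1, 0.2],
    "black jeans": [0.1, 0.8, 0.3],
}
index = {name: l2_normalize(vec) for name, vec in raw_index.items()}

# Hypothetical query embedding, normalized the same way as the index entries.
query = l2_normalize([1.0, 0.0, 0.1])

# Brute-force nearest neighbour by inner product.
best_match = max(index, key=lambda name: dot(query, index[name]))
print(best_match)
```

Normalizing once at indexing time keeps query-time scoring to a single dot product per candidate.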
