# Vision Language Model as a Judge
In this notebook we use a VLM as a judge to evaluate the recognition portion of our pipeline. We pass the original clothing image and the predicted features to the VLM judge and ask it to mark each predicted feature as correct or incorrect based on the image. With this setup, we ran the recognition model to generate predictions on around 3000 images (in Generated_Recognition_Eval_Set.ipynb), and then passed those predictions and the original images to the VLM judge.

The model we chose for this task is the Llama 3.2 11B Vision Instruct model. Although the Qwen VL model is rated slightly higher on several vision understanding benchmarks, using the same model for both generation and evaluation risks a bias toward its own output.


As a future step, we would use an ensemble of models that each independently evaluate the generated descriptions, and then aggregate their evaluations into a final, more reliable evaluation.

In [None]:
%%capture
# Install required packages
!pip install git+https://github.com/huggingface/transformers
!pip install accelerate
!pip install torch
!pip install pandas
!pip install numpy
!pip install huggingface_hub[hf_transfer]

In [None]:
import pandas as pd
import numpy as np

import os
import random
from PIL import Image as PIL_Image
import tempfile
import io
from pathlib import Path
from IPython.display import Image
from PIL import ImageEnhance
import json
from typing import Union, Dict, List

import torch
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import MllamaForConditionalGeneration, AutoProcessor

## Load Datasets
1. Load the images from the GitHub repo
2. Load the CSV of predicted values

In [None]:
# clone the repository of images
!git clone https://github.com/alexeygrigorev/clothing-dataset-small.git

Cloning into 'clothing-dataset-small'...
remote: Enumerating objects: 3839, done.
remote: Counting objects: 100% (400/400), done.
remote: Compressing objects: 100% (400/400), done.
remote: Total 3839 (delta 9), reused 385 (delta 0), pack-reused 3439 (from 1)
Receiving objects: 100% (3839/3839), 100.58 MiB | 14.00 MiB/s, done.
Resolving deltas: 100% (10/10), done.


In [None]:
# load in csv of predictions
df = pd.read_csv("/content/drive/MyDrive/MIDS Capstone/eval_set_predictions_1.csv")
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head(5)

Unnamed: 0,image_path,json_string,Category,Type of clothing,Primary Color,Accent Color(s),Pattern,Shape,Fit,Neckline,Key design elements,Brand,Style,Occasion suitability,Weather appropriateness,Fabric Material,Fabric Weight,Functional Features,Additional Notes
0,/content/clothing-dataset-small/train/pants/9f...,"```json\n{\n ""Category"": ""Bottom"",\n ""Type o...",Bottom,Jeans,Dark Blue,,Solid,Straight-Leg,Relaxed,Not applicable,"Belt loops, button fly, pockets",Not specified,Classic/Traditional,Casual,All-season,Denim,Medium,,These jeans feature a relaxed fit with straigh...
1,/content/clothing-dataset-small/train/pants/65...,"```json\n{\n ""Category"": ""Bottom"",\n ""Type o...",Bottom,Pants,Blue,Black,Solid,Straight-leg,Relaxed,Not applicable,"Belt loops, pockets",Not specified,Casual,Casual,All-season,Cotton,Medium,Not applicable,The blue pants feature a relaxed fit with stra...
2,/content/clothing-dataset-small/train/pants/f6...,"```json\n{\n ""Category"": ""Bottom"",\n ""Type o...",Bottom,Jeans,Blue,White,Ripped,Straight,Slim-Fit,Not applicable,"Ripped details, button fly, pockets",Not specified,Casual,Casual,All-season,Denim,Medium,Not applicable,These jeans feature a slim fit with distinctiv...
3,/content/clothing-dataset-small/train/pants/17...,"```json\n{\n ""Category"": ""Bottom"",\n ""Type o...",Bottom,Pants,Dark Blue,Not applicable,Solid,Straight,Slim-fit,Not applicable,"Side pockets, subtle stitching details",Not specified,Minimalist,Casual,All-season,Cotton,Medium,Not applicable,These dark blue slim-fit pants feature subtle ...
4,/content/clothing-dataset-small/train/pants/4e...,"```json\n{\n ""Category"": ""Bottom"",\n ""Type o...",Bottom,Pants,Black,,Solid,Straight leg,Relaxed,Not applicable,"Elastic waistband, side pockets",Not specified,Athleisure,Casual,All-season,Cotton,Medium,"Elastic waistband for comfort, side pockets fo...",These pants are designed for comfort and funct...


## Prepare Model for Judging

For judging we will use the Llama 3.2 11B Vision Instruct Model. Although the Qwen VL model is rated higher in its vision understanding capabilities, the model may have inherent biases or tendencies. If the model generates a description, it might be more likely to favor its own output during evaluation, leading to a lack of objectivity. Thus we opt to use a different but still powerful VL model in Llama to act as the judge, rather than use the same model for both generation and evaluation.

Future: We can use an ensemble of models that each independently evaluate the generated descriptions, and then aggregate their evaluations into a final, more reliable evaluation.
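One simple way to aggregate an ensemble is a per-feature majority vote. The sketch below is illustrative only: `aggregate_votes`, the sample judge outputs, and the tie-breaking rule are assumptions, not part of the current pipeline.

```python
from collections import Counter
from typing import Dict, List


def aggregate_votes(judgements: List[Dict[str, str]]) -> Dict[str, str]:
    """Majority-vote each feature across several judges' evaluations."""
    features = sorted({key for judgement in judgements for key in judgement})
    aggregated = {}
    for feature in features:
        votes = Counter(j[feature] for j in judgements if feature in j)
        correct = votes.get("Correct", 0)
        incorrect = votes.get("Incorrect", 0)
        # Ties fall back to "Incorrect" (a conservative, hypothetical choice).
        aggregated[feature] = "Correct" if correct > incorrect else "Incorrect"
    return aggregated


# Made-up outputs from three hypothetical judges for one image.
judge_outputs = [
    {"Evaluation of Category": "Correct", "Evaluation of Fit": "Incorrect"},
    {"Evaluation of Category": "Correct", "Evaluation of Fit": "Correct"},
    {"Evaluation of Category": "Correct", "Evaluation of Fit": "Incorrect"},
]
final = aggregate_votes(judge_outputs)
print(final)
```

An odd number of judges avoids ties altogether; with an even number, the tie-breaking rule becomes a design decision worth documenting.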

### Load Model

In [None]:
# Define the model repository ID
repo_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Download the full model snapshot (hf_transfer is only used if HF_HUB_ENABLE_HF_TRANSFER=1 is set)
file_path = snapshot_download(repo_id)

print(f"Model downloaded to: {file_path}")

Model downloaded to: /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-11B-Vision-Instruct/snapshots/9eb2daaa8597bf192a8b0e73f848f3a102794df5


In [None]:
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_id)



OutOfMemoryError: CUDA out of memory. Tried to allocate 39.75 GiB. GPU 0 has a total capacity of 39.56 GiB of which 39.14 GiB is free. Process 8605 has 414.00 MiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
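As a rough sanity check on the error above, the weight footprint can be estimated from the checkpoint size. The sketch assumes the five safetensors shards (4.99, 4.97, 4.92, 5.00 and 1.47 GB, as reported by the Hugging Face download) are decimal gigabytes and store the weights in bfloat16; both are assumptions, and activations and the KV cache are ignored.

```python
# Shard sizes in GB as reported by the Hugging Face download
# (assumed to be decimal gigabytes, i.e. 1 GB = 1e9 bytes).
shard_sizes_gb = [4.99, 4.97, 4.92, 5.00, 1.47]

GIB = 2**30
bf16_gib = sum(shard_sizes_gb) * 1e9 / GIB  # weights at 2 bytes per parameter (assumed)
fp32_gib = 2 * bf16_gib                     # the same weights at 4 bytes per parameter
gpu_capacity_gib = 39.56                    # total capacity quoted in the error message

print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"fp32 weights: {fp32_gib:.2f} GiB")
print(f"bf16 fits on the GPU: {bf16_gib < gpu_capacity_gib}")
print(f"fp32 fits on the GPU: {fp32_gib < gpu_capacity_gib}")
```

Under these assumptions the bfloat16 weights alone should fit on the 39.56 GiB GPU, while the failed 39.75 GiB request is close to the 4-byte-per-parameter figure. That is an observation rather than a diagnosis of the error.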

### Define System Prompt

In [None]:
sys_prompt = """
You are a meticulous and objective visual-language model (VLM) judge.
Your task is to evaluate a generated product description for a clothing item against the actual image of the item.
The description is provided as a JSON string, where each key-value pair represents a specific feature of the clothing item.
For each key-value pair in the description, compare it to the visual details in the image and determine if the feature is Correct or Incorrect.
Return your evaluation as a JSON string, where each key is "Evaluation of {feature name}" and the value is either "Correct" or "Incorrect".

Instructions:
1. Carefully analyze the image and the provided JSON description.
2. For each key-value pair, determine if the feature described matches the visual details in the image.
3. If the feature matches, label it as "Correct". If it does not match, label it as "Incorrect".

Return your evaluation in the exact JSON format specified below.
{
  "Evaluation of Category": "Correct/Incorrect",
  "Evaluation of Type of clothing": "Correct/Incorrect",
  "Evaluation of Primary Color": "Correct/Incorrect",
  "Evaluation of Accent Color(s)": "Correct/Incorrect",
  "Evaluation of Pattern": "Correct/Incorrect",
  "Evaluation of Shape": "Correct/Incorrect",
  "Evaluation of Fit": "Correct/Incorrect",
  "Evaluation of Neckline": "Correct/Incorrect",
  "Evaluation of Key design elements": "Correct/Incorrect",
  "Evaluation of Brand": "Correct/Incorrect",
  "Evaluation of Style": "Correct/Incorrect",
  "Evaluation of Occasion suitability": "Correct/Incorrect",
  "Evaluation of Weather appropriateness": "Correct/Incorrect",
  "Evaluation of Fabric Material": "Correct/Incorrect",
  "Evaluation of Fabric Weight": "Correct/Incorrect",
  "Evaluation of Functional Features": "Correct/Incorrect",
  "Evaluation of Additional Notes": "Correct/Incorrect"
}
"""

### Helper Functions for eval

In [None]:
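def run_one_eval(image_path, pred_json_string):
  # load the image from disk; the processor expects image data, not a path string
  image = PIL_Image.open(image_path).convert("RGB")

  # define message with system prompt, image placeholder, and predicted features
  # NOTE: some versions of the Llama 3.2 Vision chat template may reject a system
  # message combined with an image; if so, move sys_prompt into the user text
  messages = [
      {
          "role": "system",
          "content": [
              {"type": "text", "text": sys_prompt},
          ],
      },
      {
          "role": "user",
          "content": [
              {"type": "image"},
              {"type": "text", "text": pred_json_string},
          ],
      }
  ]

  # Preparation for inference
  text = processor.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
  )

  inputs = processor(
      image,
      text,
      add_special_tokens=False,
      return_tensors="pt"
  ).to(model.device)

  # Inference: Generation of the output
  output = model.generate(**inputs, max_new_tokens=512)

  # decode only the newly generated tokens so the prompt is not echoed back
  prompt_length = inputs["input_ids"].shape[-1]
  return processor.decode(output[0][prompt_length:], skip_special_tokens=True)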

## Run VLM as judge evaluation

In [None]:
# Initialize lists to store image paths and eval description json strings
image_paths = []
eval_descriptions = []

# Loop through every row of the predictions dataframe
for idx, row in df.iterrows():

    # Pass in image path and predicted json string into model run
    description = run_one_eval(row['image_path'], row['json_string'])

    # Append the file path and description to the lists
    eval_descriptions.append(description)
    image_paths.append(row['image_path'])

# Create a DataFrame from the collected data
eval_set = pd.DataFrame(
    {'image_path': image_paths,
     'eval_json_string': eval_descriptions
    })

# Display the first 5 rows of the DataFrame
eval_set.head(5)

In [None]:
def clean_json_string(json_string):
    # Find the start and end of the valid JSON string
    start_index = json_string.find("{")
    end_index = json_string.rfind("}") + 1
    valid_json_string = json_string[start_index:end_index]
    return valid_json_string

def parse_output(json_string: str) -> pd.DataFrame:
    """
    Parse a JSON string into a pandas DataFrame.

    Args:
        json_string (str): A string containing JSON data

    Returns:
        pd.DataFrame: A pandas DataFrame containing the parsed data

    Raises:
        ValueError: If the JSON string is invalid or empty
        TypeError: If the JSON data structure is not compatible with DataFrame conversion
    """
    # Check if input is empty or None
    if not json_string or json_string.isspace():
        raise ValueError("Input JSON string is empty or None")

    # Extract the valid JSON string
    valid_json_string = clean_json_string(json_string)

    try:
        # Parse JSON string into Python object
        data = json.loads(valid_json_string)
    except json.JSONDecodeError as e:
        # Show the raw text so the failing judge output can be inspected
        print(f"Offending string: {json_string}")
        raise ValueError(f"Invalid JSON string: {e}") from e

    # Data should be a single dictionary
    if not isinstance(data, dict):
        raise TypeError("JSON structure is not a dictionary.")

    return pd.DataFrame([data])
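To illustrate why the brace slicing is needed, the sketch below runs the same idea on a made-up judge response wrapped in a markdown fence and followed by an end-of-turn token. The helper is repeated under the hypothetical name `extract_json_object` only so the snippet runs on its own, and the sample text is invented.

```python
import json


def extract_json_object(raw: str) -> str:
    """Same brace-slicing idea as clean_json_string, repeated so this snippet runs alone."""
    return raw[raw.find("{"): raw.rfind("}") + 1]


# Hypothetical raw judge output: a fenced JSON object followed by an end-of-turn token.
raw_output = (
    "Here is my evaluation:\n"
    "```json\n"
    '{"Evaluation of Category": "Correct", "Evaluation of Fit": "Incorrect"}\n'
    "```<|eot_id|>"
)

parsed = json.loads(extract_json_object(raw_output))
print(parsed["Evaluation of Fit"])
```

The slice runs from the first `{` to the last `}`, so it assumes the text contains exactly one JSON object. Any other braces in the decoded text would end up inside the slice and break parsing.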

In [None]:
# add each evaluated attribute as a column to the eval_set dataframe
for index, row in eval_set.iterrows():
  # use a separate name so the predictions dataframe `df` is not overwritten
  parsed = parse_output(row['eval_json_string'])
  for column in parsed.columns:
    eval_set.loc[index, column] = parsed.loc[0, column]

eval_set.head(5)

In [None]:
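# save output as csv
# NOTE: this is the same path the predictions were read from above, so this call overwrites that file
eval_set.to_csv("/content/drive/MyDrive/MIDS Capstone/eval_set_predictions_1.csv")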

## Analysis on feature accuracy
- accuracy of each feature / category
- look at some of the low-accuracy features and investigate whether the predictions are actually problematic or the VLM judge is making the mistake
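A minimal sketch of the per-feature accuracy computation, assuming the judge output has already been parsed into one "Evaluation of ..." column per feature. The `eval_demo` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the parsed judge output: one row per image,
# one "Evaluation of ..." column per feature.
eval_demo = pd.DataFrame({
    "image_path": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "Evaluation of Category": ["Correct", "Correct", "Correct", "Incorrect"],
    "Evaluation of Fit": ["Correct", "Incorrect", "Incorrect", None],
})

eval_cols = [c for c in eval_demo.columns if c.startswith("Evaluation of ")]

# Share of "Correct" labels per feature; missing labels are left out of the denominator.
accuracy = (
    eval_demo[eval_cols]
    .apply(lambda col: (col.dropna() == "Correct").mean())
    .sort_values()
)
print(accuracy)
```

Sorting in ascending order puts the lowest-accuracy features first, which are the ones to inspect by hand.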

## Combine dataframe of predictions with their respective evals

In [None]:
#
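One possible way to combine the two tables is a join on `image_path`, sketched below with made-up frames. `preds_demo` and `evals_demo` are hypothetical stand-ins, and the one-to-one check assumes each image appears exactly once in both tables.

```python
import pandas as pd

# Hypothetical stand-ins for the predictions and the parsed judge evaluations.
preds_demo = pd.DataFrame({
    "image_path": ["a.jpg", "b.jpg"],
    "Category": ["Bottom", "Top"],
})
evals_demo = pd.DataFrame({
    "image_path": ["a.jpg", "b.jpg"],
    "Evaluation of Category": ["Correct", "Incorrect"],
})

# validate="one_to_one" raises if an image path is duplicated on either side.
combined = preds_demo.merge(
    evals_demo, on="image_path", how="inner", validate="one_to_one"
)
print(combined)
```

With predictions and evaluations side by side, each predicted value can be read next to the judge's verdict when reviewing low-accuracy features.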