# Color Detection Pipeline vs Ground Truth (v4 - Dual Model + Structural)

This notebook samples **200** random items with **known color labels** from the training data (`item data 2026_AW` + `item data 2026_SS`), finds their images in `pics/`, runs them through the LangGraph color detection pipeline, and compares predicted colors against ground truth.

**Improvements in this run (v4 - structural overhaul):**
- **Dual-model architecture**: gpt-4o for description + primary color (vision-critical), gpt-4o-mini for multi-gate, secondary colors, neutral verify
- **Dedicated multi-gate node**: separate binary classifier before secondary colors, with hard code guards
- **Neutral-family verifier**: conditional re-examination when primary is neutral (Natural/Beige/Off-white/Brown/etc.)
- **Multi stripped from secondary classifier**: secondary now purely picks colors 2 and 3
- **Conditional graph routing**: multi=true skips secondary entirely; neutral primary triggers verification
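
The routing rules in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline's actual LangGraph code: the function name `next_nodes`, the node names, and the contents of `NEUTRAL_FAMILY` are hypothetical (the notes list only a few neutral colors followed by "etc."), and the sketch assumes the two conditions are evaluated independently.

```python
# Hypothetical sketch of the v4 routing rules; all names here are illustrative.
NEUTRAL_FAMILY = {"Natural", "Beige", "Off-white", "Brown"}  # assumed subset of the neutral family


def next_nodes(primary_color: str, is_multi: bool) -> list[str]:
    """Return the nodes to run after the primary-color and multi-gate steps."""
    nodes = []
    # A neutral primary color triggers the verification step.
    if primary_color in NEUTRAL_FAMILY:
        nodes.append("neutral_verify")
    # A positive multi-gate skips the secondary-color classifier entirely.
    if not is_multi:
        nodes.append("secondary_colors")
    return nodes


print(next_nodes("Beige", is_multi=False))
print(next_nodes("Red", is_multi=True))
```

Keeping the multi decision in its own gate means the secondary classifier only ever has to pick colors 2 and 3, which matches the "Multi stripped from secondary classifier" item above.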

**Ground truth columns:** `color1`, `color2`, `color3` (detail color names from Colors.xlsx)  
**Pipeline output:** `detail_color_1`, `detail_color_2`, `detail_color_3` + mapped `main_color_1/2/3`

In [1]:
import sys
import os
import glob
import random
import base64
import time
import json
from pathlib import Path
from datetime import datetime

import pandas as pd
import numpy as np
from IPython.display import display, HTML, Image as IPImage

# Add parent directory to path for imports
sys.path.insert(0, os.path.abspath('..'))

print(f"Working directory: {os.getcwd()}")
print(f"Python version: {sys.version}")

Working directory: /Users/szma/Desktop/black-hippo/black-hippo-experiments/notebooks
Python version: 3.13.2 (v3.13.2:4f8bb3947cf, Feb  4 2025, 11:51:10) [Clang 15.0.0 (clang-1500.3.9.4)]


## 1. Load Item Data with Known Colors

In [2]:
# Load both season datasets
aw = pd.read_csv('../resources/item data 2026_AW(Sheet1).csv', encoding='latin-1')
ss = pd.read_csv('../resources/item data 2026_SS(Sheet1).csv', encoding='latin-1')

# Tag each with its season source
aw['source_file'] = 'AW'
ss['source_file'] = 'SS'

# Keep only items that have at least color_1 assigned
aw_with_color = aw[aw['color_1_id'].notna()].copy()
ss_with_color = ss[ss['color_1_id'].notna()].copy()

items_df = pd.concat([aw_with_color, ss_with_color], ignore_index=True)
print(f"Items with color labels: {len(items_df):,}")
print(f"  AW: {len(aw_with_color):,}")
print(f"  SS: {len(ss_with_color):,}")
print(f"\nColor fill rates:")
print(f"  color1: {items_df['color1'].notna().sum():,} / {len(items_df):,}")
print(f"  color2: {items_df['color2'].notna().sum():,} / {len(items_df):,}")
print(f"  color3: {items_df['color3'].notna().sum():,} / {len(items_df):,}")

Items with color labels: 6,238
  AW: 3,272
  SS: 2,966

Color fill rates:
  color1: 6,238 / 6,238
  color2: 2,506 / 6,238
  color3: 26 / 6,238


## 2. Build Image Index and Match Items to Images

In [3]:
def build_image_index(season: str) -> dict:
    """Build mapping from item_id -> first matching image path for a season."""
    index = {}
    pattern = f'../pics/{season}/*/images/*'
    all_files = glob.glob(pattern)
    for f in all_files:
        basename = os.path.basename(f)
        if '_thumbnail' in basename:
            continue
        parts = basename.split('_')
        if parts[0].isdigit():
            iid = int(parts[0])
            if iid not in index:
                index[iid] = f
    return index


aw_image_index = build_image_index('2026_AW')
ss_image_index = build_image_index('2026_SS')

# Merge indexes (if an item ID appears in both seasons, the SS image wins)
image_index = {**aw_image_index, **ss_image_index}
print(f"Image index size: {len(image_index):,}")

# Match items to images
items_df['image_path'] = items_df['item_id'].apply(
    lambda iid: image_index.get(int(iid))
)
items_with_images = items_df[items_df['image_path'].notna()].copy()
print(f"Items with color labels AND images: {len(items_with_images):,}")

Image index size: 30,851
Items with color labels AND images: 5,766
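
The filename rule used by `build_image_index` can be isolated into a small helper to make its edge cases visible. This is a sketch only: `parse_item_id` is a hypothetical name and the example paths are invented for illustration, not taken from the `pics/` directory.

```python
import os
from typing import Optional


def parse_item_id(path: str) -> Optional[int]:
    """Return the leading numeric item ID of an image path, or None if the file is skipped."""
    basename = os.path.basename(path)
    # Thumbnails are ignored, as in build_image_index.
    if "_thumbnail" in basename:
        return None
    # The item ID is whatever precedes the first underscore.
    prefix = basename.split("_")[0]
    return int(prefix) if prefix.isdigit() else None


# Hypothetical filenames, for illustration only.
for name in [
    "pics/2026_AW/a/images/56067_1.jpg",
    "pics/2026_AW/a/images/56067_1_thumbnail.jpg",
    "pics/2026_AW/a/images/cover.jpg",
    "pics/2026_AW/a/images/56067.jpg",
]:
    print(name, "->", parse_item_id(name))
```

The last path is skipped because, without an underscore, the "prefix" still contains the file extension and fails the `isdigit()` check. The same holds for the notebook's own loop, so images named that way would not be indexed.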


## 3. Sample 200 Random Items

In [4]:
SAMPLE_SIZE = 200
random.seed(42)

sample_df = items_with_images.sample(n=min(SAMPLE_SIZE, len(items_with_images)), random_state=42).copy()
sample_df = sample_df.reset_index(drop=True)

print(f"Sampled {len(sample_df)} items")
print(f"\nSeason distribution:")
print(sample_df['source_file'].value_counts())
print(f"\nGround truth color1 distribution (top 10):")
print(sample_df['color1'].value_counts().head(10))
print(f"\nItems with color2: {sample_df['color2'].notna().sum()}")
print(f"Items with color3: {sample_df['color3'].notna().sum()}")

Sampled 200 items

Season distribution:
source_file
SS    100
AW    100
Name: count, dtype: int64

Ground truth color1 distribution (top 10):
color1
White         32
Natural       22
Multi         14
Beige         12
Orange        11
Gold          10
Green         10
Dark green     9
Blue           8
Brown          7
Name: count, dtype: int64

Items with color2: 90
Items with color3: 1


## 4. Initialize Color Detection Pipeline

In [5]:
from dotenv import load_dotenv
load_dotenv(os.path.join('..', '.env'))

from app.langchain_modules.graph.colorDetectionGraph import ColorDetectionGraphBuilder
from app.langchain_modules.llm_definitions.openrouter_client import create_dual_clients

llm_vision, llm_fast = create_dual_clients()
color_graph = ColorDetectionGraphBuilder(llm_vision, llm_fast)
color_graph.build()

print("Color detection graph built successfully (v4 - dual model)")
print(f"Vision model: {getattr(llm_vision, 'model_name', getattr(llm_vision, 'model', 'unknown'))}")
print(f"Fast model:   {getattr(llm_fast, 'model_name', getattr(llm_fast, 'model', 'unknown'))}")

Color detection graph built successfully
LLM model: openai/gpt-4o-mini


## 5. Run Pipeline on All 200 Samples

In [6]:
def image_to_data_uri(image_path: str) -> str:
    """Convert an image file to a data URI string."""
    suffix = Path(image_path).suffix.lower()
    mime_map = {'.jpg': 'image/jpeg', '.jpeg': 'image/jpeg', '.png': 'image/png'}
    mime_type = mime_map.get(suffix, 'image/jpeg')
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f"data:{mime_type};base64,{encoded}"


results = []

for i, (_, row) in enumerate(sample_df.iterrows()):
    print(f"\r[{i+1}/{len(sample_df)}] item_id={int(row['item_id'])} ...", end='', flush=True)
    
    start_time = time.time()
    
    try:
        data_uri = image_to_data_uri(row['image_path'])
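        # Pass item metadata for improved accuracy. NaN is truthy, so it would
        # slip past an `or ''` fallback; blank it out explicitly instead.
        desc = row.get('supplier_reference_description', '')
        desc = desc if pd.notna(desc) else ''
        mats = row.get('materials', '')
        mats = mats if pd.notna(mats) else ''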
        result = color_graph.detect_colors(
            image_data=data_uri,
            supplier_reference_description=desc if desc else None,
            materials=mats if mats else None,
        )
        elapsed = time.time() - start_time
        result_dict = result.to_dict()
        
        results.append({
            'item_id': int(row['item_id']),
            'season': row['season'],
            'supplier_name': row['supplier_name'],
            'supplier_reference_description': row.get('supplier_reference_description', ''),
            'image_path': row['image_path'],
            # Ground truth
            'gt_color1': row['color1'],
            'gt_color2': row.get('color2') if pd.notna(row.get('color2')) else None,
            'gt_color3': row.get('color3') if pd.notna(row.get('color3')) else None,
            # Predictions
            'pred_detail_1': result_dict['detail_color_1'],
            'pred_detail_2': result_dict['detail_color_2'],
            'pred_detail_3': result_dict['detail_color_3'],
            'pred_main_1': result_dict['main_color_1'],
            'pred_main_2': result_dict['main_color_2'],
            'pred_main_3': result_dict['main_color_3'],
            'is_multi': result_dict['is_multi'],
            # Confidence
            'confidence_1': result_dict['metadata']['confidence'].get('color_1', {}).get('detail_color_confidence'),
            'confidence_2': result_dict['metadata']['confidence'].get('color_2', {}).get('detail_color_confidence'),
            'confidence_3': result_dict['metadata']['confidence'].get('color_3', {}).get('detail_color_confidence'),
            # Meta
            'image_description': result_dict['metadata'].get('image_description'),
            'errors': '; '.join(result_dict.get('errors', [])) if result_dict.get('errors') else None,
            'latency_seconds': round(elapsed, 2),
            'success': True,
        })
    
    except Exception as e:
        elapsed = time.time() - start_time
        error_msg = str(e)
        results.append({
            'item_id': int(row['item_id']),
            'season': row['season'],
            'supplier_name': row['supplier_name'],
            'supplier_reference_description': row.get('supplier_reference_description', ''),
            'image_path': row['image_path'],
            'gt_color1': row['color1'],
            'gt_color2': row.get('color2') if pd.notna(row.get('color2')) else None,
            'gt_color3': row.get('color3') if pd.notna(row.get('color3')) else None,
            'pred_detail_1': None, 'pred_detail_2': None, 'pred_detail_3': None,
            'pred_main_1': None, 'pred_main_2': None, 'pred_main_3': None,
            'is_multi': None,
            'confidence_1': None, 'confidence_2': None, 'confidence_3': None,
            'image_description': None,
            'errors': error_msg,
            'latency_seconds': round(elapsed, 2),
            'success': False,
        })
        
        if '429' in error_msg or 'rate' in error_msg.lower():
            print(f"\n  Rate limited, waiting 30s...")
            time.sleep(30)

print(f"\n\nDone! Processed {len(results)} items.")
print(f"Successes: {sum(1 for r in results if r['success'])}")
print(f"Failures: {sum(1 for r in results if not r['success'])}")

[200/200] item_id=47536 ...

Done! Processed 200 items.
Successes: 200
Failures: 0
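
The loop above records a rate-limited item as a failure and then sleeps, so that item is not retried. A minimal sketch of a retry wrapper follows, assuming the same `'429'` / `'rate'` string check as the notebook. `call_with_retry` and its parameters are hypothetical and not part of the pipeline; the `sleep` argument is injectable so the helper can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only when the error message looks like a rate limit."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            msg = str(exc)
            rate_limited = "429" in msg or "rate" in msg.lower()
            # Re-raise anything that is not a rate limit, or the final failed attempt.
            if not rate_limited or attempt == max_attempts:
                raise
            sleep(wait_seconds)
    raise AssertionError("unreachable: the loop returns or raises")


# Demo with a stub that fails twice with a 429 and then succeeds (no real waiting).
state = {"calls": 0}


def flaky_stub() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("HTTP 429 Too Many Requests")
    return "ok"


print(call_with_retry(flaky_stub, sleep=lambda seconds: None), state["calls"])
```

In the notebook this could wrap the pipeline call, for example `call_with_retry(lambda: color_graph.detect_colors(image_data=data_uri))`. The substring check is loose (any message containing "rate" matches), so a real implementation would likely inspect the exception type or status code instead.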


## 6. Build Comparison DataFrame

In [7]:
comp_df = pd.DataFrame(results)
success_df = comp_df[comp_df['success'] == True].copy()

print(f"Total: {len(comp_df)}, Successful: {len(success_df)}")
print(f"Mean latency: {success_df['latency_seconds'].mean():.1f}s")
print()
display(success_df[['item_id', 'gt_color1', 'gt_color2', 'gt_color3',
                     'pred_detail_1', 'pred_detail_2', 'pred_detail_3',
                     'confidence_1']].head(15))

Total: 200, Successful: 200
Mean latency: 8.7s



| | item_id | gt_color1 | gt_color2 | gt_color3 | pred_detail_1 | pred_detail_2 | pred_detail_3 | confidence_1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 56067 | Multi | | | Multi | | | 1.0 |
| 1 | 57664 | Red | White | | Red | White | | 0.95 |
| 2 | 73523 | Black | | | Black | | | 0.95 |
| 3 | 54177 | Green | | | Green | Black | | 0.9 |
| 4 | 57769 | Beige | | | Off-white | Brown | White | 0.85 |
| 5 | 66938 | Silver | | | Silver | | | 0.95 |
| 6 | 73645 | Camel | | | Beige | Natural | | 0.9 |
| 7 | 57221 | Black | | | Black | | | 0.95 |
| 8 | 59248 | Natural | White | | Natural | White | | 0.95 |
| 9 | 49895 | Off-white | Orange | | Orange | White | | 0.9 |


## 7. Accuracy Metrics

In [8]:
# Load main color mapping for main-level comparison
with open('../resources/color_main_mapping.json') as f:
    color_mapping = json.load(f)


def get_main_color(detail_color):
    """Map a detail color to its main color."""
    if detail_color and detail_color in color_mapping:
        return color_mapping[detail_color]['main_color']
    return detail_color  # fallback


# --- DETAIL-LEVEL METRICS ---
# Exact match: predicted color1 == ground truth color1
exact_match_1 = (success_df['pred_detail_1'] == success_df['gt_color1']).sum()
print(f"=== Detail Color Accuracy (Color 1) ===")
print(f"Exact match: {exact_match_1} / {len(success_df)} ({exact_match_1/len(success_df)*100:.1f}%)")

# Set-level match: does predicted color1 appear anywhere in ground truth {color1, color2, color3}?
def gt_color_set(row):
    """Get the set of non-null ground truth colors."""
    colors = set()
    for col in ['gt_color1', 'gt_color2', 'gt_color3']:
        if row[col] and pd.notna(row[col]):
            colors.add(row[col])
    return colors

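def pred_color_set(row):
    """Get the set of non-null predicted colors, excluding 'Multi'.

    Note: gt_color_set keeps 'Multi', so a correct Multi prediction yields an
    empty predicted set and scores no overlap in the set-level metrics below.
    """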
    colors = set()
    for col in ['pred_detail_1', 'pred_detail_2', 'pred_detail_3']:
        if row[col] and pd.notna(row[col]) and row[col] != 'Multi':
            colors.add(row[col])
    return colors

success_df['gt_set'] = success_df.apply(gt_color_set, axis=1)
success_df['pred_set'] = success_df.apply(pred_color_set, axis=1)

# Pred color1 in GT set
pred1_in_gt = sum(
    1 for _, r in success_df.iterrows()
    if r['pred_detail_1'] and r['pred_detail_1'] in r['gt_set']
)
print(f"Pred color1 in GT set: {pred1_in_gt} / {len(success_df)} ({pred1_in_gt/len(success_df)*100:.1f}%)")

# Any overlap between predicted and GT sets
any_overlap = sum(
    1 for _, r in success_df.iterrows()
    if r['pred_set'] & r['gt_set']
)
print(f"Any set overlap: {any_overlap} / {len(success_df)} ({any_overlap/len(success_df)*100:.1f}%)")

# Full set match (order-independent)
full_set_match = sum(
    1 for _, r in success_df.iterrows()
    if r['pred_set'] == r['gt_set']
)
print(f"Full set match: {full_set_match} / {len(success_df)} ({full_set_match/len(success_df)*100:.1f}%)")

# Jaccard similarity (average)
jaccards = []
for _, r in success_df.iterrows():
    gt = r['gt_set']
    pred = r['pred_set']
    union = gt | pred
    # Two empty sets count as a perfect match; otherwise |intersection| / |union|
    jaccard = len(gt & pred) / len(union) if union else 1.0
    jaccards.append(jaccard)
success_df['jaccard'] = jaccards
print(f"Mean Jaccard similarity: {np.mean(jaccards):.3f}")

print()

# --- MAIN COLOR LEVEL METRICS ---
success_df['gt_main_1'] = success_df['gt_color1'].apply(get_main_color)
success_df['gt_main_2'] = success_df['gt_color2'].apply(lambda c: get_main_color(c) if c and pd.notna(c) else None)

main_exact_1 = (success_df['pred_main_1'] == success_df['gt_main_1']).sum()
print(f"=== Main Color Accuracy (Color 1) ===")
print(f"Exact match: {main_exact_1} / {len(success_df)} ({main_exact_1/len(success_df)*100:.1f}%)")

# Main-level set overlap
def gt_main_set(row):
    colors = set()
    for col in ['gt_color1', 'gt_color2', 'gt_color3']:
        if row[col] and pd.notna(row[col]):
            colors.add(get_main_color(row[col]))
    return colors

def pred_main_set(row):
    colors = set()
    for col in ['pred_main_1', 'pred_main_2', 'pred_main_3']:
        if row[col] and pd.notna(row[col]) and row[col] != 'Multi':
            colors.add(row[col])
    return colors

success_df['gt_main_set'] = success_df.apply(gt_main_set, axis=1)
success_df['pred_main_set'] = success_df.apply(pred_main_set, axis=1)

main_any_overlap = sum(
    1 for _, r in success_df.iterrows()
    if r['pred_main_set'] & r['gt_main_set']
)
print(f"Any main-level overlap: {main_any_overlap} / {len(success_df)} ({main_any_overlap/len(success_df)*100:.1f}%)")

main_jaccards = []
for _, r in success_df.iterrows():
    gt = r['gt_main_set']
    pred = r['pred_main_set']
    union = gt | pred
    # Two empty sets count as a perfect match; otherwise |intersection| / |union|
    j = len(gt & pred) / len(union) if union else 1.0
    main_jaccards.append(j)
print(f"Mean main-level Jaccard: {np.mean(main_jaccards):.3f}")

=== Detail Color Accuracy (Color 1) ===
Exact match: 87 / 200 (43.5%)
Pred color1 in GT set: 101 / 200 (50.5%)
Any set overlap: 124 / 200 (62.0%)
Full set match: 42 / 200 (21.0%)
Mean Jaccard similarity: 0.377

=== Main Color Accuracy (Color 1) ===
Exact match: 115 / 200 (57.5%)
Any main-level overlap: 155 / 200 (77.5%)
Mean main-level Jaccard: 0.531
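
To make the set-level scores concrete, here is a small worked example of the Jaccard computation using two rows from the comparison table in Section 6. The standalone `jaccard` helper is a restatement written for illustration; the sets are built by hand the way `gt_color_set` and `pred_color_set` would build them.

```python
def jaccard(gt: set, pred: set) -> float:
    """Jaccard similarity |gt & pred| / |gt | pred|; two empty sets count as a perfect match."""
    union = gt | pred
    if not union:
        return 1.0
    return len(gt & pred) / len(union)


# Item 49895: GT = Off-white + Orange, predicted = Orange + White.
# One shared color out of three distinct colors.
print(round(jaccard({"Off-white", "Orange"}, {"Orange", "White"}), 3))  # 0.333

# Item 56067: GT = Multi, predicted = Multi. 'Multi' stays in the GT set but is
# dropped from the predicted set, so the score is 0.0 even though the label is right.
print(jaccard({"Multi"}, set()))  # 0.0
```

The second case shows that correctly predicted Multi items pull the set-level metrics down, which is worth keeping in mind when reading the overlap, full-set-match, and Jaccard figures above.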


## 8. Multi-Color Analysis

In [9]:
# Ground truth multi items
gt_multi = success_df[success_df['gt_color1'] == 'Multi']
pred_multi = success_df[success_df['is_multi'] == True]

print(f"Ground truth Multi items: {len(gt_multi)}")
print(f"Predicted Multi items: {len(pred_multi)}")

# Overlap
if len(gt_multi) > 0:
    both_multi = sum(
        1 for _, r in success_df.iterrows()
        if r['gt_color1'] == 'Multi' and r['is_multi'] == True
    )
    print(f"Correctly predicted Multi: {both_multi} / {len(gt_multi)} ({both_multi/len(gt_multi)*100:.1f}%)")

# False multi (predicted multi but GT is not)
false_multi = sum(
    1 for _, r in success_df.iterrows()
    if r['is_multi'] == True and r['gt_color1'] != 'Multi'
)
print(f"False Multi predictions: {false_multi}")

Ground truth Multi items: 14
Predicted Multi items: 14
Correctly predicted Multi: 4 / 14 (28.6%)
False Multi predictions: 10
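
The four counts above can be restated as precision and recall for the Multi class. The sketch below derives them from the printed numbers only; the helper name `precision_recall_f1` is introduced here for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a binary class, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the output above: 14 GT Multi, 14 predicted Multi, 4 in both.
gt_multi_count, pred_multi_count, correct = 14, 14, 4
p, r, f1 = precision_recall_f1(
    tp=correct,
    fp=pred_multi_count - correct,  # predicted Multi, GT is not: 10
    fn=gt_multi_count - correct,    # GT Multi, not predicted: 10
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")  # precision=0.286 recall=0.286 f1=0.286
```

Matching totals (14 predicted, 14 in ground truth) hide the fact that only 4 of the items are the same, so the gate is wrong about as often in each direction.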


## 9. Confusion Analysis - Where Does It Go Wrong?

In [10]:
# Items where primary color is wrong
wrong_primary = success_df[
    (success_df['pred_detail_1'] != success_df['gt_color1']) &
    (success_df['gt_color1'] != 'Multi')  # Exclude multi from mismatch analysis
].copy()

print(f"Wrong primary color: {len(wrong_primary)} / {len(success_df[success_df['gt_color1'] != 'Multi'])}")
print()

if len(wrong_primary) > 0:
    # Most common confusions
    wrong_primary['confusion'] = wrong_primary.apply(
        lambda r: f"{r['gt_color1']} -> {r['pred_detail_1']}", axis=1
    )
    print("Top 15 confusions (GT -> Predicted):")
    print(wrong_primary['confusion'].value_counts().head(15))
    print()
    
    # Check if wrong at detail but correct at main level
    wrong_primary['main_correct'] = wrong_primary.apply(
        lambda r: get_main_color(r['pred_detail_1']) == get_main_color(r['gt_color1'])
        if r['pred_detail_1'] else False, axis=1
    )
    main_correct_count = wrong_primary['main_correct'].sum()
    print(f"Wrong detail but correct main color: {main_correct_count} / {len(wrong_primary)} ({main_correct_count/len(wrong_primary)*100:.1f}%)")
    print("(These are 'close misses' - right color family, wrong shade)")
    
    # --- BACKGROUND BIAS DIAGNOSTIC ---
    bg_colors = {'White', 'Off-white', 'Light grey', 'Grey'}
    bg_false_pos = wrong_primary[
        wrong_primary['pred_detail_1'].isin(bg_colors) & 
        ~wrong_primary['gt_color1'].isin(bg_colors)
    ]
    print(f"\n--- Background Bias Diagnostic ---")
    print(f"Predicted background-like color (White/Grey/Off-white) when GT is non-grey/white: {len(bg_false_pos)}")
    if len(bg_false_pos) > 0:
        for _, r in bg_false_pos.iterrows():
            print(f"  item {int(r['item_id'])}: GT={r['gt_color1']}, Pred={r['pred_detail_1']}")
    
    # --- TRANSPARENT DRIFT DIAGNOSTIC ---
    trans_false = wrong_primary[wrong_primary['pred_detail_1'] == 'Transparent']
    print(f"\nFalse Transparent predictions: {len(trans_false)}")
    if len(trans_false) > 0:
        for _, r in trans_false.iterrows():
            print(f"  item {int(r['item_id'])}: GT={r['gt_color1']}, Desc='{str(r.get('supplier_reference_description', ''))[:60]}'")
    
    # --- NEUTRAL FAMILY CONFUSION ---
    neutral_set = {'Natural', 'Beige', 'Off-white', 'Ecru', 'Light beige', 'Greige', 'Taupe', 'Camel', 'Brown', 'Dark brown'}
    neutral_wrong = wrong_primary[
        wrong_primary['gt_color1'].isin(neutral_set) & 
        wrong_primary['pred_detail_1'].isin(neutral_set)
    ]
    print(f"\nNeutral-family confusion (both GT and Pred are neutral, but different): {len(neutral_wrong)}")
    if len(neutral_wrong) > 0:
        for _, r in neutral_wrong.iterrows():
            print(f"  item {int(r['item_id'])}: GT={r['gt_color1']} -> Pred={r['pred_detail_1']}")

Wrong primary color: 103 / 186

Top 15 confusions (GT -> Predicted):
confusion
Dark green -> Green     5
White -> Off-white      5
Natural -> Beige        4
Light pink -> Pink      3
Beige -> Light beige    3
Orange -> Multi         2
White -> Natural        2
Fuchsia -> Pink         2
Orange -> Natural       2
Beige -> Off-white      2
Camel -> Beige          2
Gold -> Transparent     2
Ecru -> Off-white       2
Brown -> Camel          1
Black -> Taupe          1
Name: count, dtype: int64

Wrong detail but correct main color: 28 / 103 (27.2%)
(These are 'close misses' - right color family, wrong shade)

--- Background Bias Diagnostic ---
Predicted background-like color (White/Grey/Off-white) when GT is non-grey/white: 10
  item 57769: GT=Beige, Pred=Off-white
  item 43639: GT=Ecru, Pred=Off-white
  item 62478: GT=Brown, Pred=Grey
  item 67846: GT=Gold, Pred=White
  item 56217: GT=Blue, Pred=White
  item 54923: GT=Brown, Pred=Off-white
  item 63102: GT=Ecru, Pred=Off-white
  item 71508

## 10. Color 2 Accuracy (where ground truth exists)

In [11]:
has_gt_color2 = success_df[success_df['gt_color2'].notna()].copy()

if len(has_gt_color2) > 0:
    print(f"Items with ground truth color2: {len(has_gt_color2)}")
    
    # Did pipeline predict a color2?
    pred_has_color2 = has_gt_color2['pred_detail_2'].notna().sum()
    print(f"Pipeline also predicted color2: {pred_has_color2} / {len(has_gt_color2)}")
    
    # Exact match on color2 (order-sensitive)
    exact_2 = (has_gt_color2['pred_detail_2'] == has_gt_color2['gt_color2']).sum()
    print(f"Exact color2 match: {exact_2} / {len(has_gt_color2)} ({exact_2/len(has_gt_color2)*100:.1f}%)")
    
    # GT color2 appears anywhere in predicted set
    gt2_in_pred = sum(
        1 for _, r in has_gt_color2.iterrows()
        if r['gt_color2'] in r['pred_set']
    )
    print(f"GT color2 in predicted set: {gt2_in_pred} / {len(has_gt_color2)} ({gt2_in_pred/len(has_gt_color2)*100:.1f}%)")
else:
    print("No items in sample have ground truth color2.")

Items with ground truth color2: 90
Pipeline also predicted color2: 79 / 90
Exact color2 match: 31 / 90 (34.4%)
GT color2 in predicted set: 53 / 90 (58.9%)
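
The gap between the exact color2 match and the "in predicted set" figure comes from slot order. The sketch below shows both checks on two rows from the Section 6 table; `color2_matches` is a hypothetical helper written for this illustration.

```python
from typing import Optional


def color2_matches(gt_color2: str, pred_colors: list[Optional[str]]) -> tuple[bool, bool]:
    """Return (exact_slot_match, found_anywhere) for a ground-truth second color."""
    pred_2 = pred_colors[1] if len(pred_colors) > 1 else None
    exact = pred_2 == gt_color2
    # Mirror pred_color_set: ignore empty slots and the 'Multi' label.
    pred_set = {c for c in pred_colors if c and c != "Multi"}
    return exact, gt_color2 in pred_set


# Item 49895: GT color2 is Orange; the pipeline put Orange first and White second.
print(color2_matches("Orange", ["Orange", "White", None]))  # (False, True)

# Item 57664: GT color2 is White and the pipeline also put White second.
print(color2_matches("White", ["Red", "White", None]))  # (True, True)
```

Item 49895 counts toward the set-membership figure but not the exact match, because the pipeline found the right color in a different slot.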


## 11. Visual Spot Check - 10 Random Comparisons

In [12]:
spot_check = success_df.sample(min(10, len(success_df)), random_state=123)

with open('../resources/color_main_mapping.json') as f:
    color_map = json.load(f)

html_parts = ['<div style="display: flex; flex-wrap: wrap; gap: 16px;">']

for _, row in spot_check.iterrows():
    # Ground truth colors
    gt_swatches = ''
    for col in ['gt_color1', 'gt_color2', 'gt_color3']:
        c = row[col]
        if c and pd.notna(c):
            hex_val = color_map.get(c, {}).get('hex_value', '#ccc') or '#ccc'
            gt_swatches += f'<span style="display:inline-block;width:16px;height:16px;background:{hex_val};border:1px solid #999;vertical-align:middle;margin-right:3px;"></span>'
            gt_swatches += f'<small>{c}</small> '
    
    # Predicted colors
    pred_swatches = ''
    for i in range(1, 4):
        dc = row[f'pred_detail_{i}']
        conf = row.get(f'confidence_{i}')
        if dc and pd.notna(dc):
            hex_val = color_map.get(dc, {}).get('hex_value', '#ccc') or '#ccc'
            conf_str = f" ({conf:.0%})" if conf and pd.notna(conf) else ''
            pred_swatches += f'<span style="display:inline-block;width:16px;height:16px;background:{hex_val};border:1px solid #999;vertical-align:middle;margin-right:3px;"></span>'
            pred_swatches += f'<small>{dc}{conf_str}</small> '
    
    # Match indicator
    match_color = '#4CAF50' if row['pred_detail_1'] == row['gt_color1'] else '#FF5722'
    match_text = 'MATCH' if row['pred_detail_1'] == row['gt_color1'] else 'MISMATCH'
    
    html_parts.append(f'''
    <div style="width: 300px; border: 1px solid #ddd; padding: 8px; border-radius: 8px;">
        <img src="{row['image_path']}" style="width:100%;max-height:180px;object-fit:contain;" />
        <p style="font-size:10px;color:#666;margin:4px 0;">item {int(row['item_id'])} | {row['supplier_name'][:25]}</p>
        <div style="margin:4px 0;"><strong style="font-size:11px;">GT:</strong> {gt_swatches}</div>
        <div style="margin:4px 0;"><strong style="font-size:11px;">Pred:</strong> {pred_swatches}</div>
        <span style="color:{match_color};font-size:11px;font-weight:bold;">{match_text}</span>
    </div>
    ''')

html_parts.append('</div>')
display(HTML(''.join(html_parts)))

## 12. Summary Metrics Table

In [13]:
n = len(success_df)
n_non_multi = len(success_df[success_df['gt_color1'] != 'Multi'])

# Compute additional metrics for summary
_false_multi = sum(1 for _, r in success_df.iterrows() if r['is_multi'] == True and r['gt_color1'] != 'Multi')
# Count each row once, even if it has both a reported error and a null primary color
_invalid_outputs = (success_df['errors'].notna() | success_df['pred_detail_1'].isna()).sum()

# Background bias count
bg_colors = {'White', 'Off-white', 'Light grey', 'Grey'}
_bg_false_pos = sum(
    1 for _, r in success_df.iterrows()
    if r['pred_detail_1'] in bg_colors
    and r['gt_color1'] not in bg_colors
    and r['gt_color1'] != 'Multi'
    and r['pred_detail_1'] != r['gt_color1']
)

# False transparent count (GT Multi excluded, matching the Section 9 diagnostic)
_false_transparent = sum(
    1 for _, r in success_df.iterrows()
    if r['pred_detail_1'] == 'Transparent'
    and r['gt_color1'] not in ('Transparent', 'Multi')
)

# Neutral confusion count
neutral_set = {'Natural', 'Beige', 'Off-white', 'Ecru', 'Light beige', 'Greige', 'Taupe', 'Camel', 'Brown', 'Dark brown'}
_neutral_confusion = sum(
    1 for _, r in success_df.iterrows()
    if r['gt_color1'] in neutral_set
    and r['pred_detail_1'] in neutral_set
    and r['pred_detail_1'] != r['gt_color1']
)

summary = {
    'Metric': [
        'Total items processed',
        'Success rate',
        '--- PRIMARY COLOR ---',
        'Detail color 1 exact match',
        'Detail color 1 in GT set',
        'Main color 1 exact match',
        'Wrong detail, correct main (close miss)',
        '--- SET-LEVEL ---',
        'Any detail set overlap',
        'Any main-level set overlap',
        'Full detail set match',
        'Mean Jaccard (detail)',
        'Mean Jaccard (main)',
        '--- DIAGNOSTIC ---',
        'False Multi predictions',
        'Background-color false positives',
        'False Transparent predictions',
        'Neutral-family confusions',
        'Invalid/null primary outputs',
        '--- PERFORMANCE ---',
        'Mean latency (s)',
        'Mean confidence (color 1)',
    ],
    'Value': [
        f"{len(comp_df)}",
        f"{n}/{len(comp_df)} ({n/len(comp_df)*100:.0f}%)",
        '',
        f"{exact_match_1}/{n} ({exact_match_1/n*100:.1f}%)",
        f"{pred1_in_gt}/{n} ({pred1_in_gt/n*100:.1f}%)",
        f"{main_exact_1}/{n} ({main_exact_1/n*100:.1f}%)",
        f"{wrong_primary['main_correct'].sum() if len(wrong_primary) > 0 else 0}/{len(wrong_primary) if len(wrong_primary) > 0 else 0}",
        '',
        f"{any_overlap}/{n} ({any_overlap/n*100:.1f}%)",
        f"{main_any_overlap}/{n} ({main_any_overlap/n*100:.1f}%)",
        f"{full_set_match}/{n} ({full_set_match/n*100:.1f}%)",
        f"{np.mean(jaccards):.3f}",
        f"{np.mean(main_jaccards):.3f}",
        '',
        f"{_false_multi}",
        f"{_bg_false_pos}",
        f"{_false_transparent}",
        f"{_neutral_confusion}",
        f"{_invalid_outputs}",
        '',
        f"{success_df['latency_seconds'].mean():.1f}",
        f"{success_df['confidence_1'].dropna().mean():.3f}",
    ]
}

summary_df = pd.DataFrame(summary)
display(summary_df.style.hide(axis='index'))

| Metric | Value |
|---|---|
| Total items processed | 200 |
| Success rate | 200/200 (100%) |
| --- PRIMARY COLOR --- | |
| Detail color 1 exact match | 87/200 (43.5%) |
| Detail color 1 in GT set | 101/200 (50.5%) |
| Main color 1 exact match | 115/200 (57.5%) |
| Wrong detail, correct main (close miss) | 28/103 |
| --- SET-LEVEL --- | |
| Any detail set overlap | 124/200 (62.0%) |
| Any main-level set overlap | 155/200 (77.5%) |


## 13. Export Results

In [14]:
timestamp = datetime.now().strftime('%Y%m%d_%H%M')
export_path = f'../resources/color_detection_vs_gt_{timestamp}.csv'

export_cols = [
    'item_id', 'season', 'supplier_name',
    'gt_color1', 'gt_color2', 'gt_color3',
    'pred_detail_1', 'pred_detail_2', 'pred_detail_3',
    'pred_main_1', 'pred_main_2', 'pred_main_3',
    'is_multi',
    'confidence_1', 'confidence_2', 'confidence_3',
    'image_description',
    'errors', 'latency_seconds', 'success',
    'image_path'
]

comp_df[export_cols].to_csv(export_path, index=False)
print(f"Results exported to: {export_path}")
print(f"Shape: {comp_df[export_cols].shape}")

Results exported to: ../resources/color_detection_vs_gt_20260211_2352.csv
Shape: (200, 21)
