# ‚ö° OPTIMIZACI√ìN: Reutilizar Resultados de Fases Anteriores

**Este notebook ha sido optimizado para reutilizar resultados existentes:**

- ‚úÖ **Fase 2 (Baseline)**: Carga predicciones desde `../fase 2/outputs/baseline/preds_raw.json`
- ‚úÖ **Fase 3 (MC-Dropout)**: Carga predicciones desde `../fase 3/outputs/mc_dropout/preds_mc_aggregated.json`
- ‚úÖ **Fase 4 (Temperature Scaling)**: Carga temperaturas desde `../fase 4/outputs/temperature_scaling/temperature.json`

**Ventajas**:
- üöÄ Reduce tiempo de ejecuci√≥n de ~2 horas a ~15 minutos
- üíæ Evita recalcular predicciones costosas (especialmente MC-Dropout con K=5)
- ‚ôªÔ∏è Garantiza consistencia con resultados de fases anteriores

**Modo de operaci√≥n**:
- Si los archivos existen ‚Üí Los carga y reutiliza
- Si no existen ‚Üí Ejecuta inferencia completa (fallback)

# Fase 5: Comparaci√≥n Completa de M√©todos de Incertidumbre y Calibraci√≥n

**Objetivo**: Comparar 6 m√©todos lado a lado en detecci√≥n, calibraci√≥n y risk-coverage.

**M√©todos evaluados**:
1. Baseline (sin incertidumbre, sin calibraci√≥n)
2. Baseline + TS
3. MC-Dropout K=5
4. MC-Dropout K=5 + TS
5. Varianza entre capas (single-pass)
6. Varianza entre capas + TS

**Splits**:
- val_calib: ajustar temperaturas
- val_eval: evaluaci√≥n final

**M√©tricas**:
- Detecci√≥n: mAP@[0.5:0.95], AP50, AP75, por clase
- Calibraci√≥n: NLL, Brier, ECE, Reliability Diagrams
- Risk-Coverage: curvas y AUC

## 1. Configuraci√≥n e Imports

In [1]:
# ‚úÖ EJECUTAR PARA RQ1 - Celda 1: Configuraci√≥n e Imports

import os
import sys
import json
import yaml
import time
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image
from tqdm import tqdm
from collections import defaultdict
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import torchvision
import warnings
warnings.filterwarnings('ignore')

# Configuraci√≥n
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = Path('./outputs/comparison')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CONFIG = {
    'seed': 42,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'categories': ['person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', 'bicycle', 'traffic light', 'traffic sign'],
    'iou_matching': 0.5,
    'conf_threshold': 0.25,
    'nms_threshold': 0.65,
    'K_mc': 5,
    'n_bins': 10
}

torch.manual_seed(CONFIG['seed'])
np.random.seed(CONFIG['seed'])
if torch.cuda.is_available():
    torch.cuda.manual_seed(CONFIG['seed'])

with open(OUTPUT_DIR / 'config.yaml', 'w') as f:
    yaml.dump(CONFIG, f)

print(f"Device: {CONFIG['device']}")
print(f"Output: {OUTPUT_DIR}")
print(f"Config guardado")

Device: cuda
Output: outputs/comparison
Config guardado


## 1.1 Cargar Resultados de Fases Anteriores (Optimizaci√≥n)

In [2]:
# ‚úÖ EJECUTAR PARA RQ1 - Celda 2: Cargar resultados de fases anteriores

# Rutas a resultados de fases anteriores
FASE2_BASELINE = BASE_DIR / 'fase 2' / 'outputs' / 'baseline' / 'preds_raw.json'
FASE3_MC_DROPOUT_PARQUET = BASE_DIR / 'fase 3' / 'outputs' / 'mc_dropout' / 'mc_stats_labeled.parquet'
FASE3_MC_DROPOUT_JSON = BASE_DIR / 'fase 3' / 'outputs' / 'mc_dropout' / 'preds_mc_aggregated.json'
FASE4_TEMPERATURE = BASE_DIR / 'fase 4' / 'outputs' / 'temperature_scaling' / 'temperature.json'
FASE4_CALIB_DATA = BASE_DIR / 'fase 4' / 'outputs' / 'temperature_scaling' / 'calib_detections.csv'

# Diccionario para almacenar predicciones cargadas
cached_predictions = {
    'baseline': None,
    'mc_dropout': None,
    'temperatures': None
}

# Cargar Baseline (Fase 2)
if FASE2_BASELINE.exists():
    print(f"‚úÖ Cargando predicciones Baseline desde Fase 2...")
    with open(FASE2_BASELINE, 'r') as f:
        cached_predictions['baseline'] = json.load(f)
    print(f"   ‚Üí {len(cached_predictions['baseline'])} predicciones cargadas")
else:
    print(f"‚ö†Ô∏è  No se encontr√≥ {FASE2_BASELINE}, se ejecutar√° inferencia completa")

# Cargar MC-Dropout (Fase 3) - PRIORIZAR PARQUET CON INCERTIDUMBRE
if FASE3_MC_DROPOUT_PARQUET.exists():
    print(f"‚úÖ Cargando predicciones MC-Dropout desde Fase 3 (con incertidumbre)...")
    mc_df = pd.read_parquet(FASE3_MC_DROPOUT_PARQUET)
    # Convertir a formato JSON similar
    cached_predictions['mc_dropout'] = []
    for _, row in mc_df.iterrows():
        # Convertir bbox de xyxy a xywh para consistencia
        bbox = row['bbox']
        if isinstance(bbox, (list, np.ndarray)) and len(bbox) == 4:
            # Si est√° en formato xyxy, convertir a xywh
            if bbox[2] > bbox[0] and bbox[3] > bbox[1]:
                bbox_xywh = [bbox[0], bbox[1], bbox[2]-bbox[0], bbox[3]-bbox[1]]
            else:
                bbox_xywh = bbox
        else:
            bbox_xywh = bbox
            
        cached_predictions['mc_dropout'].append({
            'image_id': int(row['image_id']),
            'category_id': int(row['category_id']) + 1,  # Convertir de 0-indexed a 1-indexed
            'bbox': bbox_xywh,
            'score': float(row['score_mean']),
            'uncertainty': float(row['uncertainty'])  # ¬°IMPORTANTE: con incertidumbre!
        })
    print(f"   ‚Üí {len(cached_predictions['mc_dropout'])} predicciones cargadas (con incertidumbre)")
elif FASE3_MC_DROPOUT_JSON.exists():
    print(f"‚ö†Ô∏è  Cargando MC-Dropout desde JSON (SIN incertidumbre)...")
    with open(FASE3_MC_DROPOUT_JSON, 'r') as f:
        cached_predictions['mc_dropout'] = json.load(f)
    print(f"   ‚Üí {len(cached_predictions['mc_dropout'])} predicciones cargadas")
    print(f"   ‚ö†Ô∏è  ADVERTENCIA: Este archivo NO contiene incertidumbre, se calcular√° como 0.0")
else:
    print(f"‚ö†Ô∏è  No se encontr√≥ {FASE3_MC_DROPOUT_PARQUET}, se ejecutar√° inferencia completa")

# Cargar Temperaturas (Fase 4)
if FASE4_TEMPERATURE.exists():
    print(f"‚úÖ Cargando temperaturas optimizadas desde Fase 4...")
    with open(FASE4_TEMPERATURE, 'r') as f:
        cached_predictions['temperatures'] = json.load(f)
    print(f"   ‚Üí Temperatura baseline: {cached_predictions['temperatures'].get('optimal_temperature', 'N/A')}")
else:
    print(f"‚ö†Ô∏è  No se encontr√≥ {FASE4_TEMPERATURE}, se calcular√°n temperaturas")

print(f"\n{'='*60}")
print(f"RESUMEN DE OPTIMIZACI√ìN:")
print(f"{'='*60}")
print(f"Baseline disponible:      {'‚úÖ S√ç' if cached_predictions['baseline'] else '‚ùå NO (se calcular√°)'}")
print(f"MC-Dropout disponible:    {'‚úÖ S√ç' if cached_predictions['mc_dropout'] else '‚ùå NO (se calcular√°)'}")
print(f"Temperaturas disponibles: {'‚úÖ S√ç' if cached_predictions['temperatures'] else '‚ùå NO (se calcular√°n)'}")
print(f"{'='*60}\n")

‚úÖ Cargando predicciones Baseline desde Fase 2...
   ‚Üí 22162 predicciones cargadas
‚úÖ Cargando predicciones MC-Dropout desde Fase 3 (con incertidumbre)...
   ‚Üí 29914 predicciones cargadas (con incertidumbre)
‚úÖ Cargando temperaturas optimizadas desde Fase 4...
   ‚Üí Temperatura baseline: N/A

RESUMEN DE OPTIMIZACI√ìN:
Baseline disponible:      ‚úÖ S√ç
MC-Dropout disponible:    ‚úÖ S√ç
Temperaturas disponibles: ‚úÖ S√ç



In [3]:
# Funciones para convertir formatos de predicciones desde fases anteriores

def convert_baseline_predictions(baseline_data, image_filename_to_id):
    """
    Convierte predicciones de fase 2 (baseline) al formato esperado.
    baseline_data: lista de dicts con keys: image_id, category_id, bbox, score
    """
    converted = {}
    for pred in baseline_data:
        img_id = pred.get('image_id')
        if img_id not in converted:
            converted[img_id] = []
        
        bbox = pred['bbox']  # [x, y, w, h] en formato COCO
        bbox_xyxy = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]
        score = pred['score']
        score_clipped = np.clip(score, 1e-7, 1 - 1e-7)
        logit = np.log(score_clipped / (1 - score_clipped))
        
        converted[img_id].append({
            'bbox': bbox_xyxy,
            'score': score_clipped,
            'logit': logit,
            'category_id': pred['category_id'],
            'uncertainty': pred.get('uncertainty', 0.0)  # Baseline no tiene incertidumbre
        })
    
    return converted

def convert_mc_predictions(mc_data, image_filename_to_id):
    """
    Convierte predicciones de fase 3 (MC-Dropout) al formato esperado.
    Maneja tanto bbox en formato [x,y,w,h] como [x1,y1,x2,y2]
    """
    converted = {}
    for pred in mc_data:
        img_id = pred.get('image_id')
        if img_id not in converted:
            converted[img_id] = []
        
        bbox = pred['bbox']  # Podr√≠a ser [x, y, w, h] o [x1, y1, x2, y2]
        
        # Detectar formato: si bbox[2] > bbox[0] y bbox[3] > bbox[1], probablemente es xywh
        # Si bbox[2] < bbox[0] o bbox[3] < bbox[1], est√° mal
        # Asumimos que si w,h > x,y entonces es xyxy
        if len(bbox) == 4:
            # Si parece formato xywh (w,h son razonables)
            if bbox[2] < bbox[0] or bbox[3] < bbox[1]:
                # Ya est√° en formato xyxy
                bbox_xyxy = bbox
            else:
                # Formato xywh, convertir a xyxy
                bbox_xyxy = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]
        else:
            bbox_xyxy = bbox
        
        score = pred['score']
        score_clipped = np.clip(score, 1e-7, 1 - 1e-7)
        logit = np.log(score_clipped / (1 - score_clipped))
        
        converted[img_id].append({
            'bbox': bbox_xyxy,
            'score': score_clipped,
            'logit': logit,
            'category_id': pred['category_id'],
            'uncertainty': pred.get('uncertainty', 0.0)  # ¬°IMPORTANTE: preservar incertidumbre!
        })
    
    return converted

print("‚úÖ Funciones de conversi√≥n de formato definidas")

‚úÖ Funciones de conversi√≥n de formato definidas


## 2. Cargar Modelo y Preparar Funciones

In [4]:
# ‚úÖ EJECUTAR PARA RQ1 - Celda 3: Cargar modelo Grounding DINO

from groundingdino.util.inference import load_model, load_image, predict
from groundingdino.util import box_ops

model_config = '/opt/program/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py'
model_weights = '/opt/program/GroundingDINO/weights/groundingdino_swint_ogc.pth'

model = load_model(model_config, model_weights)
model.to(CONFIG['device'])

TEXT_PROMPT = '. '.join(CONFIG['categories']) + '.'

print(f"Modelo cargado en {CONFIG['device']}")
print(f"Prompt: {TEXT_PROMPT}")

# Guardar referencias de m√≥dulos dropout
dropout_modules = []
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Dropout) and ('class_embed' in name or 'bbox_embed' in name):
        dropout_modules.append(module)

print(f"M√≥dulos dropout en cabeza: {len(dropout_modules)}")

final text_encoder_type: bert-base-uncased
Modelo cargado en cuda
Prompt: person. rider. car. truck. bus. train. motorcycle. bicycle. traffic light. traffic sign.
M√≥dulos dropout en cabeza: 0


In [5]:
# ‚úÖ EJECUTAR PARA RQ1 - Celda 4: Funciones auxiliares

def normalize_label(label):
    synonyms = {'bike': 'bicycle', 'motorbike': 'motorcycle', 'pedestrian': 'person', 
                'stop sign': 'traffic sign', 'red light': 'traffic light'}
    label_lower = label.lower().strip()
    if label_lower in synonyms:
        return synonyms[label_lower]
    for cat in CONFIG['categories']:
        if cat in label_lower:
            return cat
    return label_lower

def compute_iou(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def apply_nms(detections, iou_thresh=0.65):
    if len(detections) == 0:
        return []
    boxes_t = torch.tensor([d['bbox'] for d in detections], dtype=torch.float32)
    scores_t = torch.tensor([d['score'] for d in detections], dtype=torch.float32)
    keep = torchvision.ops.nms(boxes_t, scores_t, iou_thresh)
    return [detections[i] for i in keep.numpy()]

print("Funciones auxiliares definidas")

Funciones auxiliares definidas


## 3. M√©todos de Inferencia

In [6]:
def inference_baseline(model, image_path, text_prompt, conf_thresh, device):
    """M√©todo 1: Baseline single-pass sin incertidumbre"""
    model.eval()
    for module in dropout_modules:
        module.eval()
    
    image_source, image = load_image(str(image_path))
    boxes, scores, phrases = predict(model, image, text_prompt, conf_thresh, 0.25, device)
    
    if len(boxes) == 0:
        return []
    
    h, w = image_source.shape[:2]
    boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])
    
    detections = []
    for box, score, phrase in zip(boxes_xyxy.cpu().numpy(), scores.cpu().numpy(), phrases):
        cat = normalize_label(phrase)
        if cat in CONFIG['categories']:
            score_clipped = np.clip(float(score), 1e-7, 1 - 1e-7)
            logit = np.log(score_clipped / (1 - score_clipped))
            detections.append({
                'bbox': box.tolist(),
                'score': score_clipped,
                'logit': logit,
                'category': cat,
                'uncertainty': 0.0  # Sin incertidumbre
            })
    
    return apply_nms(detections, CONFIG['nms_threshold'])

print("M√©todo 1: Baseline definido")

M√©todo 1: Baseline definido


In [7]:
def inference_mc_dropout(model, image_path, text_prompt, conf_thresh, device, K=5):
    """M√©todo 3: MC-Dropout con K pases"""
    model.eval()
    for module in dropout_modules:
        module.train()
    
    image_source, image = load_image(str(image_path))
    h, w = image_source.shape[:2]
    
    all_detections_k = []
    
    with torch.no_grad():
        for k in range(K):
            boxes, scores, phrases = predict(model, image, text_prompt, conf_thresh, 0.25, device)
            
            if len(boxes) == 0:
                all_detections_k.append([])
                continue
            
            boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])
            
            dets_k = []
            for box, score, phrase in zip(boxes_xyxy.cpu().numpy(), scores.cpu().numpy(), phrases):
                cat = normalize_label(phrase)
                if cat in CONFIG['categories']:
                    score_clipped = np.clip(float(score), 1e-7, 1 - 1e-7)
                    dets_k.append({
                        'bbox': box.tolist(),
                        'score': score_clipped,
                        'category': cat
                    })
            all_detections_k.append(dets_k)
    
    # Alinear detecciones entre pases
    if len(all_detections_k) == 0 or all(len(d) == 0 for d in all_detections_k):
        return []
    
    # Usar primer pase como referencia
    ref_dets = all_detections_k[0]
    
    aggregated = []
    for ref_det in ref_dets:
        scores_aligned = [ref_det['score']]
        
        for k in range(1, K):
            best_iou = 0
            best_score = None
            for det_k in all_detections_k[k]:
                if det_k['category'] != ref_det['category']:
                    continue
                iou = compute_iou(ref_det['bbox'], det_k['bbox'])
                if iou > best_iou:
                    best_iou = iou
                    best_score = det_k['score']
            
            if best_iou >= 0.5 and best_score is not None:
                scores_aligned.append(best_score)
        
        mean_score = np.mean(scores_aligned)
        variance = np.var(scores_aligned) if len(scores_aligned) > 1 else 0.0
        
        mean_score_clipped = np.clip(mean_score, 1e-7, 1 - 1e-7)
        logit = np.log(mean_score_clipped / (1 - mean_score_clipped))
        
        aggregated.append({
            'bbox': ref_det['bbox'],
            'score': mean_score_clipped,
            'logit': logit,
            'category': ref_det['category'],
            'uncertainty': variance
        })
    
    return apply_nms(aggregated, CONFIG['nms_threshold'])

print("M√©todo 3: MC-Dropout definido")

M√©todo 3: MC-Dropout definido


In [8]:
# ‚úÖ EJECUTAR PARA RQ1 - Celda 5: Funci√≥n inference_decoder_variance (MODIFICADA)

def inference_decoder_variance(model, image_path, text_prompt, conf_thresh, device):
    """M√©todo 5: Varianza entre capas del decoder (single-pass)"""
    model.eval()
    for module in dropout_modules:
        module.eval()
    
    image_source, image = load_image(str(image_path))
    h, w = image_source.shape[:2]
    
    # Hook para capturar logits de cada capa del decoder
    layer_logits = []
    
    def hook_fn(module, input, output):
        # Capturar salida de cada capa del decoder
        if isinstance(output, tuple) and len(output) > 0:
            layer_logits.append(output[0].detach() if hasattr(output[0], 'detach') else output[0])
        elif hasattr(output, 'detach'):
            layer_logits.append(output.detach())
    
    # Registrar hooks en capas del decoder
    hooks = []
    for name, module in model.named_modules():
        # Buscar m√≥dulos como 'transformer.decoder.layers.0', 'transformer.decoder.layers.1', etc.
        if 'decoder.layers' in name and name.count('.') == 3 and name.split('.')[-1].isdigit():
            hooks.append(module.register_forward_hook(hook_fn))
    
    # Inferencia
    boxes, scores, phrases = predict(model, image, text_prompt, conf_thresh, 0.25, device)
    
    # Remover hooks
    for hook in hooks:
        hook.remove()
    
    if len(boxes) == 0:
        return []
    
    boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])
    
    detections = []
    for idx, (box, score, phrase) in enumerate(zip(boxes_xyxy.cpu().numpy(), scores.cpu().numpy(), phrases)):
        cat = normalize_label(phrase)
        if cat in CONFIG['categories']:
            score_clipped = np.clip(float(score), 1e-7, 1 - 1e-7)
            logit = np.log(score_clipped / (1 - score_clipped))
            
            # Calcular varianza entre capas si disponible
            uncertainty = 0.0
            layer_uncertainties_list = []
            
            if len(layer_logits) > 0:
                # Los layer_logits son embeddings [900, 1, 256]
                # Necesitamos extraer el embedding de esta detecci√≥n (query idx)
                # y calcular similitud/score para cada capa
                layer_scores = []
                
                for layer_emb in layer_logits:
                    # layer_emb shape: [num_queries, batch, embed_dim]
                    # Extraer embedding de esta query (detecci√≥n)
                    if idx < layer_emb.shape[0]:
                        query_emb = layer_emb[idx, 0, :]  # [256]
                        # Calcular score basado en norma del embedding
                        # (queries m√°s confiables tienen normas m√°s altas)
                        emb_norm = torch.norm(query_emb).item()
                        # Normalizar a [0, 1] aproximadamente
                        layer_score = 1.0 / (1.0 + np.exp(-emb_norm / 10.0))
                        layer_scores.append(layer_score)
                
                if len(layer_scores) > 1:
                    uncertainty = np.var(layer_scores)
                    layer_uncertainties_list = layer_scores
            
            detections.append({
                'bbox': box.tolist(),
                'score': score_clipped,
                'logit': logit,
                'category': cat,
                'uncertainty': uncertainty,
                'layer_uncertainties': layer_uncertainties_list,  # ‚úÖ NUEVO: Incertidumbres por capa
                'layer_count': len(layer_uncertainties_list)  # ‚úÖ NUEVO: N√∫mero de capas capturadas
            })
    
    return apply_nms(detections, CONFIG['nms_threshold'])

print("M√©todo 5: Decoder variance definido (con layer_uncertainties)")

M√©todo 5: Decoder variance definido (con layer_uncertainties)


## 4. Inferencia en val_calib para Ajustar Temperaturas

### 4.1 Estrategia de Optimizaci√≥n

**Si hay predicciones cacheadas**:
- Baseline ‚Üí Se cargan de Fase 2
- MC-Dropout ‚Üí Se cargan de Fase 3
- Solo se calcula Decoder Variance (m√©todo nuevo)

**Si NO hay predicciones cacheadas**:
- Se ejecuta inferencia completa para todos los m√©todos

In [9]:
val_eval_json = DATA_DIR / 'bdd100k_coco/val_eval.json'
image_dir = DATA_DIR / 'bdd100k/bdd100k/bdd100k/images/100k/val'

coco_eval_full = COCO(str(val_eval_json))
img_ids_all = coco_eval_full.getImgIds()

# Split inteligente de val_eval (2000 im√°genes):
# - Primeras 500 para calibraci√≥n (ajustar temperaturas)
# - Restantes 1500 para evaluaci√≥n final
img_ids_calib = img_ids_all[:500]
img_ids_eval_final = img_ids_all[500:]

print(f"üìä ESTRATEGIA DE SPLITS:")
print(f"  Dataset: val_eval.json (2,000 im√°genes)")
print(f"  ‚îú‚îÄ Calibraci√≥n: {len(img_ids_calib)} im√°genes (primeras 500)")
print(f"  ‚îî‚îÄ Evaluaci√≥n:  {len(img_ids_eval_final)} im√°genes (restantes 1,500)")

# Crear COCO object para calibraci√≥n
coco_calib = COCO(str(val_eval_json))

print(f"\nProcesando {len(img_ids_calib)} im√°genes para ajustar temperaturas...")

methods_calib_data = {
    'baseline': [],
    'mc_dropout': [],
    'decoder_variance': []
}

# Contadores para diagn√≥stico
counters = {
    'baseline_cached': 0,
    'baseline_computed': 0,
    'mc_cached': 0,
    'mc_computed': 0
}

# ============================================================================
# OPTIMIZACI√ìN: Usar predicciones cacheadas si est√°n disponibles
# ============================================================================

# Convertir predicciones cacheadas a formato √∫til
baseline_by_img = {}
mc_by_img = {}

if cached_predictions['baseline']:
    print("\n‚úÖ Usando predicciones Baseline cacheadas de Fase 2")
    baseline_by_img = convert_baseline_predictions(cached_predictions['baseline'], {})
    print(f"   ‚Üí {len(baseline_by_img)} im√°genes indexadas")
    
if cached_predictions['mc_dropout']:
    print("‚úÖ Usando predicciones MC-Dropout cacheadas de Fase 3")
    mc_by_img = convert_mc_predictions(cached_predictions['mc_dropout'], {})
    print(f"   ‚Üí {len(mc_by_img)} im√°genes indexadas")

# Verificar overlap con primeras 500 im√°genes de val_eval
calib_500 = set(img_ids_calib)
baseline_overlap = set(baseline_by_img.keys()) & calib_500
mc_overlap = set(mc_by_img.keys()) & calib_500

print(f"\nüîç OVERLAP CON CALIBRACI√ìN (primeras 500 de val_eval):")
print(f"   Baseline cacheado: {len(baseline_overlap)}/500 im√°genes ({len(baseline_overlap)/500*100:.1f}%)")
print(f"   MC-Dropout cacheado: {len(mc_overlap)}/500 im√°genes ({len(mc_overlap)/500*100:.1f}%)")

if len(baseline_overlap) < 500:
    print(f"   ‚ö†Ô∏è  {500 - len(baseline_overlap)} im√°genes de baseline se calcular√°n desde cero")
if len(mc_overlap) < 500:
    print(f"   ‚ö†Ô∏è  {500 - len(mc_overlap)} im√°genes de MC-Dropout se calcular√°n desde cero")
    print(f"   ‚è±Ô∏è  Tiempo estimado: ~{(500 - len(mc_overlap)) * 1.8 / 60:.1f} minutos")

loading annotations into memory...
Done (t=0.23s)
creating index...
index created!
üìä ESTRATEGIA DE SPLITS:
  Dataset: val_eval.json (2,000 im√°genes)
  ‚îú‚îÄ Calibraci√≥n: 500 im√°genes (primeras 500)
  ‚îî‚îÄ Evaluaci√≥n:  1500 im√°genes (restantes 1,500)
loading annotations into memory...
Done (t=0.18s)
creating index...
index created!

Procesando 500 im√°genes para ajustar temperaturas...

‚úÖ Usando predicciones Baseline cacheadas de Fase 2
   ‚Üí 1988 im√°genes indexadas
‚úÖ Usando predicciones MC-Dropout cacheadas de Fase 3
   ‚Üí 1996 im√°genes indexadas

üîç OVERLAP CON CALIBRACI√ìN (primeras 500 de val_eval):
   Baseline cacheado: 497/500 im√°genes (99.4%)
   MC-Dropout cacheado: 498/500 im√°genes (99.6%)
   ‚ö†Ô∏è  3 im√°genes de baseline se calcular√°n desde cero
   ‚ö†Ô∏è  2 im√°genes de MC-Dropout se calcular√°n desde cero
   ‚è±Ô∏è  Tiempo estimado: ~0.1 minutos


### üêõ DEBUG: Probar inference_decoder_variance con 1 imagen

In [10]:
# üêõ CELDA DE DEBUG - Ejecutar solo UNA imagen para diagnosticar hooks

print("üêõ DEBUG: Probando inference_decoder_variance con una imagen...")

# Tomar la primera imagen de calibraci√≥n
test_img_id = img_ids_calib[0]
test_img_info = coco_calib.loadImgs(test_img_id)[0]
test_img_path = image_dir / test_img_info['file_name']

print(f"   Imagen de prueba: {test_img_info['file_name']} (ID: {test_img_id})")
print(f"   Ruta: {test_img_path}")
print(f"   Existe: {test_img_path.exists()}")

if test_img_path.exists():
    print(f"\nüîç Ejecutando inference_decoder_variance...")
    test_preds = inference_decoder_variance(model, test_img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'])
    
    print(f"\nüìä RESULTADOS:")
    print(f"   Detecciones: {len(test_preds)}")
    
    if len(test_preds) > 0:
        print(f"\n   Primera detecci√≥n:")
        first_pred = test_preds[0]
        for key, value in first_pred.items():
            print(f"      {key}: {value}")
        
        # Verificar layer_uncertainties
        if 'layer_uncertainties' in first_pred:
            layer_unc = first_pred['layer_uncertainties']
            print(f"\n   ‚úÖ layer_uncertainties presente: {len(layer_unc)} valores")
            if len(layer_unc) > 0:
                print(f"      Valores: {layer_unc}")
            else:
                print(f"      ‚ö†Ô∏è  VAC√çO - Este es el problema que debemos solucionar")
        else:
            print(f"\n   ‚ùå layer_uncertainties NO est√° en la salida")
    else:
        print(f"   ‚ö†Ô∏è  No se detectaron objetos en esta imagen")
else:
    print(f"   ‚ùå Imagen no encontrada")

print(f"\n{'='*70}")
"corrido"

üêõ DEBUG: Probando inference_decoder_variance con una imagen...
   Imagen de prueba: c8c97803-657086fb.jpg (ID: 9306)
   Ruta: ../data/bdd100k/bdd100k/bdd100k/images/100k/val/c8c97803-657086fb.jpg
   Existe: True

üîç Ejecutando inference_decoder_variance...

üìä RESULTADOS:
   Detecciones: 22

   Primera detecci√≥n:
      bbox: [246.6544952392578, 344.82891845703125, 390.66827392578125, 422.9021301269531]
      score: 0.6032488942146301
      logit: 0.4190207249455299
      category: car
      uncertainty: 0.0012345988001037553
      layer_uncertainties: [0.7025667871442604, 0.7570377492065832, 0.7397240831970635, 0.7811286757848407, 0.8028580751556345, 0.7985424860655854]
      layer_count: 6

   ‚úÖ layer_uncertainties presente: 6 valores
      Valores: [0.7025667871442604, 0.7570377492065832, 0.7397240831970635, 0.7811286757848407, 0.8028580751556345, 0.7985424860655854]



'corrido'

In [11]:
# Procesar im√°genes de calibraci√≥n
for img_id in tqdm(img_ids_calib, desc="Procesando calibraci√≥n"):
    img_info = coco_calib.loadImgs(img_id)[0]
    img_path = image_dir / img_info['file_name']
    
    if not img_path.exists():
        continue
    
    gt_anns = coco_calib.loadAnns(coco_calib.getAnnIds(imgIds=img_id))
    
    # ========================================================================
    # M√©todo 1: Baseline
    # ========================================================================
    if img_id in baseline_by_img:
        # Usar predicciones cacheadas
        preds_baseline = baseline_by_img[img_id]
        counters['baseline_cached'] += 1
    else:
        # Calcular desde cero
        counters['baseline_computed'] += 1
        preds_baseline_raw = inference_baseline(model, img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'])
        preds_baseline = []
        for pred in preds_baseline_raw:
            cat_id = CONFIG['categories'].index(pred['category']) + 1
            preds_baseline.append({
                'bbox': pred['bbox'],
                'score': pred['score'],
                'logit': pred['logit'],
                'category_id': cat_id,
                'uncertainty': pred['uncertainty']
            })
    
    # Etiquetar como TP/FP
    for pred in preds_baseline:
        is_tp = 0
        cat_id = pred['category_id']
        cat = CONFIG['categories'][cat_id - 1] if 1 <= cat_id <= len(CONFIG['categories']) else ''
        
        for gt in gt_anns:
            if gt['category_id'] != cat_id:
                continue
            gt_box = gt['bbox']
            gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
            if compute_iou(pred['bbox'], gt_box_xyxy) >= CONFIG['iou_matching']:
                is_tp = 1
                break
        
        methods_calib_data['baseline'].append({
            'logit': pred['logit'],
            'score': pred['score'],
            'category': cat,
            'uncertainty': pred['uncertainty'],
            'is_tp': is_tp
        })
    
    # ========================================================================
    # M√©todo 3: MC-Dropout
    # ========================================================================
    if img_id in mc_by_img:
        # Usar predicciones cacheadas
        preds_mc = mc_by_img[img_id]
        counters['mc_cached'] += 1
    else:
        # Calcular desde cero
        counters['mc_computed'] += 1
        preds_mc_raw = inference_mc_dropout(model, img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'], CONFIG['K_mc'])
        preds_mc = []
        for pred in preds_mc_raw:
            cat_id = CONFIG['categories'].index(pred['category']) + 1
            preds_mc.append({
                'bbox': pred['bbox'],
                'score': pred['score'],
                'logit': pred['logit'],
                'category_id': cat_id,
                'uncertainty': pred['uncertainty']
            })
    
    # Etiquetar como TP/FP
    for pred in preds_mc:
        is_tp = 0
        cat_id = pred['category_id']
        cat = CONFIG['categories'][cat_id - 1] if 1 <= cat_id <= len(CONFIG['categories']) else ''
        
        for gt in gt_anns:
            if gt['category_id'] != cat_id:
                continue
            gt_box = gt['bbox']
            gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
            if compute_iou(pred['bbox'], gt_box_xyxy) >= CONFIG['iou_matching']:
                is_tp = 1
                break
        
        methods_calib_data['mc_dropout'].append({
            'logit': pred['logit'],
            'score': pred['score'],
            'category': cat,
            'uncertainty': pred['uncertainty'],
            'is_tp': is_tp
        })
    
    # ========================================================================
    # M√©todo 5: Decoder Variance (siempre se calcula, es nuevo)
    # ========================================================================
    preds_dec = inference_decoder_variance(model, img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'])
    for pred in preds_dec:
        is_tp = 0
        cat_id = CONFIG['categories'].index(pred['category']) + 1
        cat = pred['category']
        
        for gt in gt_anns:
            if gt['category_id'] != cat_id:
                continue
            gt_box = gt['bbox']
            gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
            if compute_iou(pred['bbox'], gt_box_xyxy) >= CONFIG['iou_matching']:
                is_tp = 1
                break
        
        methods_calib_data['decoder_variance'].append({
            'logit': pred['logit'],
            'score': pred['score'],
            'category': cat,
            'uncertainty': pred['uncertainty'],
            'is_tp': is_tp
        })

# Mostrar contadores
print(f"\nüìä ESTAD√çSTICAS DE PROCESAMIENTO:")
print(f"   Baseline: {counters['baseline_cached']} cacheadas, {counters['baseline_computed']} calculadas")
print(f"   MC-Dropout: {counters['mc_cached']} cacheadas, {counters['mc_computed']} calculadas")

# Guardar datos de calibraci√≥n
for method_name, data in methods_calib_data.items():
    df = pd.DataFrame(data)
    df.to_csv(OUTPUT_DIR / f'calib_{method_name}.csv', index=False)
    print(f"\n{method_name}: {len(df)} detecciones, TP={df['is_tp'].sum()}")

print("\n‚úÖ Datos de calibraci√≥n guardados")

# DIAGN√ìSTICO FINAL: Verificar si los CSVs son diferentes
print(f"\nüîç DIAGN√ìSTICO FINAL - Verificando CSVs de calibraci√≥n:")
df_baseline = pd.read_csv(OUTPUT_DIR / 'calib_baseline.csv')
df_mc = pd.read_csv(OUTPUT_DIR / 'calib_mc_dropout.csv')
print(f"   Baseline: {len(df_baseline)} registros, uncertainty media={df_baseline['uncertainty'].mean():.6f}")
print(f"   MC-Dropout: {len(df_mc)} registros, uncertainty media={df_mc['uncertainty'].mean():.6f}")

# Comparar primeras 5 logits
print(f"\n   Primeras 5 logits de cada m√©todo:")
print(f"   Baseline:   {df_baseline['logit'].head().tolist()}")
print(f"   MC-Dropout: {df_mc['logit'].head().tolist()}")

if df_baseline['logit'].head(10).equals(df_mc['logit'].head(10)):
    print(f"   ‚ö†Ô∏è  Los primeros 10 logits son id√©nticos (puede ser coincidencia o problema)")
else:
    print(f"   ‚úÖ Los logits son diferentes")
    
if df_baseline['uncertainty'].equals(df_mc['uncertainty']):
    print(f"   ‚ö†Ô∏è  Las incertidumbres son id√©nticas")
else:
    print(f"   ‚úÖ Las incertidumbres son diferentes")

Procesando calibraci√≥n: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 500/500 [02:50<00:00,  2.93it/s]


üìä ESTAD√çSTICAS DE PROCESAMIENTO:
   Baseline: 497 cacheadas, 3 calculadas
   MC-Dropout: 498 cacheadas, 2 calculadas

baseline: 5457 detecciones, TP=3558

mc_dropout: 7387 detecciones, TP=51

decoder_variance: 7453 detecciones, TP=4336

‚úÖ Datos de calibraci√≥n guardados

üîç DIAGN√ìSTICO FINAL - Verificando CSVs de calibraci√≥n:
   Baseline: 5457 registros, uncertainty media=0.000000
   MC-Dropout: 7387 registros, uncertainty media=0.000089

   Primeras 5 logits de cada m√©todo:
   Baseline:   [0.3380101929716149, -0.1017526020559519, 0.4190207249455299, -0.2801209492000665, -0.0834355056962531]
   MC-Dropout: [-0.1081413067481472, -0.2560700530227012, -0.4148847215280034, -0.5178735333913581, -0.7217361723679339]
   ‚úÖ Los logits son diferentes
   ‚úÖ Las incertidumbres son diferentes





## 5. Optimizar Temperaturas

In [12]:
from scipy.optimize import minimize

def nll_loss(T, logits, labels):
    T = max(T, 0.01)
    probs = sigmoid(logits / T)
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

# ============================================================================
# OPTIMIZACI√ìN: Usar temperaturas de Fase 4 si est√°n disponibles
# ============================================================================

if cached_predictions['temperatures'] and 'optimal_temperature' in cached_predictions['temperatures']:
    print("‚úÖ Usando temperatura optimizada de Fase 4")
    T_baseline = cached_predictions['temperatures']['optimal_temperature']
    
    # Calcular NLL antes y despu√©s con esta temperatura
    df_baseline = pd.read_csv(OUTPUT_DIR / 'calib_baseline.csv')
    logits_baseline = df_baseline['logit'].values
    labels_baseline = df_baseline['is_tp'].values
    
    nll_before = nll_loss(1.0, logits_baseline, labels_baseline)
    nll_after = nll_loss(T_baseline, logits_baseline, labels_baseline)
    
    temperatures = {
        'baseline': {
            'T': T_baseline,
            'nll_before': nll_before,
            'nll_after': nll_after,
            'source': 'cached_from_fase4'
        }
    }
    
    print(f"  baseline: T={T_baseline:.4f}, NLL: {nll_before:.4f} ‚Üí {nll_after:.4f} (cacheada)")
    
else:
    print("‚öôÔ∏è  Calculando temperaturas desde cero...")
    temperatures = {}

# Calcular temperaturas para MC-Dropout y Decoder Variance (siempre, son nuevos)
for method_name in ['mc_dropout', 'decoder_variance']:
    df = pd.read_csv(OUTPUT_DIR / f'calib_{method_name}.csv')
    logits = df['logit'].values
    labels = df['is_tp'].values
    
    nll_before = nll_loss(1.0, logits, labels)
    result = minimize(lambda T: nll_loss(T, logits, labels), x0=1.0, bounds=[(0.01, 10.0)], method='L-BFGS-B')
    T_opt = result.x[0]
    nll_after = result.fun
    
    temperatures[method_name] = {
        'T': T_opt,
        'nll_before': nll_before,
        'nll_after': nll_after,
        'source': 'calculated'
    }
    
    print(f"  {method_name}: T={T_opt:.4f}, NLL: {nll_before:.4f} ‚Üí {nll_after:.4f}")

# Si no hab√≠a temperatura cacheada para baseline, calcularla
if 'baseline' not in temperatures:
    print("‚öôÔ∏è  Calculando temperatura para baseline...")
    df = pd.read_csv(OUTPUT_DIR / 'calib_baseline.csv')
    logits = df['logit'].values
    labels = df['is_tp'].values
    
    nll_before = nll_loss(1.0, logits, labels)
    result = minimize(lambda T: nll_loss(T, logits, labels), x0=1.0, bounds=[(0.01, 10.0)], method='L-BFGS-B')
    T_opt = result.x[0]
    nll_after = result.fun
    
    temperatures['baseline'] = {
        'T': T_opt,
        'nll_before': nll_before,
        'nll_after': nll_after,
        'source': 'calculated'
    }
    
    print(f"  baseline: T={T_opt:.4f}, NLL: {nll_before:.4f} ‚Üí {nll_after:.4f}")

with open(OUTPUT_DIR / 'temperatures.json', 'w') as f:
    json.dump(temperatures, f, indent=2)

print(f"\n‚úÖ Temperaturas guardadas en: {OUTPUT_DIR / 'temperatures.json'}")

‚öôÔ∏è  Calculando temperaturas desde cero...
  mc_dropout: T=0.3192, NLL: 0.5123 ‚Üí 0.4001
  decoder_variance: T=2.6534, NLL: 0.7061 ‚Üí 0.6850
‚öôÔ∏è  Calculando temperatura para baseline...
  baseline: T=4.2128, NLL: 0.7107 ‚Üí 0.6912

‚úÖ Temperaturas guardadas en: outputs/comparison/temperatures.json


## 6. Evaluaci√≥n en val_eval con COCO API

In [13]:
# ‚úÖ EJECUTAR PARA RQ1 - Celda 6: Ejecutar inferencia en val_eval (1500 im√°genes)
# Esta celda ejecuta decoder_variance en TODAS las im√°genes de evaluaci√≥n

# ============================================================================
# EVALUACI√ìN: Usar restantes 1500 im√°genes de val_eval
# ============================================================================
# Las primeras 500 se usaron para calibraci√≥n, ahora usamos el resto

print(f"\n{'='*70}")
print(f"EVALUACI√ìN EN VAL_EVAL (1,500 im√°genes restantes)")
print(f"{'='*70}")

coco_eval = COCO(str(val_eval_json))

print(f"Procesando {len(img_ids_eval_final)} im√°genes para evaluaci√≥n final...")

methods_results = {
    'baseline': [],
    'baseline_ts': [],
    'mc_dropout': [],
    'mc_dropout_ts': [],
    'decoder_variance': [],
    'decoder_variance_ts': []
}

# Cargar temperaturas
with open(OUTPUT_DIR / 'temperatures.json', 'r') as f:
    temps = json.load(f)

print(f"\nüìä Temperaturas a aplicar:")
for method, temp_info in temps.items():
    print(f"   {method}: T={temp_info['T']:.4f}")

# ============================================================================
# OPTIMIZACI√ìN: Construir √≠ndices de predicciones cacheadas para evaluaci√≥n
# ============================================================================

baseline_eval_by_img = {}
mc_eval_by_img = {}

if cached_predictions['baseline']:
    print("\n‚úÖ Indexando predicciones Baseline cacheadas para evaluaci√≥n")
    for pred in cached_predictions['baseline']:
        img_id = pred.get('image_id')
        if img_id in img_ids_eval_final:  # Solo las de evaluaci√≥n (no calibraci√≥n)
            if img_id not in baseline_eval_by_img:
                baseline_eval_by_img[img_id] = []
            baseline_eval_by_img[img_id].append(pred)
    print(f"   ‚Üí {len(baseline_eval_by_img)} im√°genes con predicciones cacheadas")

if cached_predictions['mc_dropout']:
    print("‚úÖ Indexando predicciones MC-Dropout cacheadas para evaluaci√≥n")
    for pred in cached_predictions['mc_dropout']:
        img_id = pred.get('image_id')
        if img_id in img_ids_eval_final:  # Solo las de evaluaci√≥n (no calibraci√≥n)
            if img_id not in mc_eval_by_img:
                mc_eval_by_img[img_id] = []
            mc_eval_by_img[img_id].append(pred)
    print(f"   ‚Üí {len(mc_eval_by_img)} im√°genes con predicciones cacheadas")

# Contadores
eval_counters = {
    'baseline_cached': 0,
    'baseline_computed': 0,
    'mc_cached': 0,
    'mc_computed': 0
}

# Procesar im√°genes de evaluaci√≥n
for img_id in tqdm(img_ids_eval_final, desc="Procesando evaluaci√≥n"):
    img_info = coco_eval.loadImgs(img_id)[0]
    img_path = image_dir / img_info['file_name']
    
    if not img_path.exists():
        continue
    
    gt_anns = coco_eval.loadAnns(coco_eval.getAnnIds(imgIds=img_id))
    
    # ========================================================================
    # Baseline (sin TS y con TS)
    # ========================================================================
    if img_id in baseline_eval_by_img:
        # Usar cacheadas
        eval_counters['baseline_cached'] += 1
        for pred in baseline_eval_by_img[img_id]:
            is_tp = 0
            bbox = pred['bbox']
            bbox_xyxy = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]
            
            for gt in gt_anns:
                if gt['category_id'] != pred['category_id']:
                    continue
                gt_box = gt['bbox']
                gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
                if compute_iou(bbox_xyxy, gt_box_xyxy) >= CONFIG['iou_matching']:
                    is_tp = 1
                    break
            
            score = pred['score']
            score_clipped = np.clip(score, 1e-7, 1 - 1e-7)
            logit = np.log(score_clipped / (1 - score_clipped))
            
            methods_results['baseline'].append({
                'image_id': img_id,
                'category_id': pred['category_id'],
                'bbox': pred['bbox'],
                'score': score_clipped,
                'logit': logit,
                'uncertainty': pred.get('uncertainty', 0.0),
                'is_tp': is_tp
            })
            
            # Con TS
            score_ts = sigmoid(logit / temps['baseline']['T'])
            methods_results['baseline_ts'].append({
                'image_id': img_id,
                'category_id': pred['category_id'],
                'bbox': pred['bbox'],
                'score': score_ts,
                'logit': logit,
                'uncertainty': pred.get('uncertainty', 0.0),
                'is_tp': is_tp
            })
    else:
        # Calcular desde cero
        eval_counters['baseline_computed'] += 1
        preds_baseline = inference_baseline(model, img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'])
        for pred in preds_baseline:
            cat_id = CONFIG['categories'].index(pred['category']) + 1
            is_tp = 0
            for gt in gt_anns:
                if gt['category_id'] != cat_id:
                    continue
                gt_box = gt['bbox']
                gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
                if compute_iou(pred['bbox'], gt_box_xyxy) >= CONFIG['iou_matching']:
                    is_tp = 1
                    break
            
            methods_results['baseline'].append({
                'image_id': img_id,
                'category_id': cat_id,
                'bbox': [pred['bbox'][0], pred['bbox'][1], pred['bbox'][2] - pred['bbox'][0], pred['bbox'][3] - pred['bbox'][1]],
                'score': pred['score'],
                'logit': pred['logit'],
                'uncertainty': pred['uncertainty'],
                'is_tp': is_tp
            })
            
            score_ts = sigmoid(pred['logit'] / temps['baseline']['T'])
            methods_results['baseline_ts'].append({
                'image_id': img_id,
                'category_id': cat_id,
                'bbox': [pred['bbox'][0], pred['bbox'][1], pred['bbox'][2] - pred['bbox'][0], pred['bbox'][3] - pred['bbox'][1]],
                'score': score_ts,
                'logit': pred['logit'],
                'uncertainty': pred['uncertainty'],
                'is_tp': is_tp
            })
    
    # ========================================================================
    # MC-Dropout (sin TS y con TS)
    # ========================================================================
    if img_id in mc_eval_by_img:
        # Usar cacheadas
        eval_counters['mc_cached'] += 1
        for pred in mc_eval_by_img[img_id]:
            is_tp = 0
            bbox = pred['bbox']
            bbox_xyxy = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]
            
            for gt in gt_anns:
                if gt['category_id'] != pred['category_id']:
                    continue
                gt_box = gt['bbox']
                gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
                if compute_iou(bbox_xyxy, gt_box_xyxy) >= CONFIG['iou_matching']:
                    is_tp = 1
                    break
            
            score = pred['score']
            score_clipped = np.clip(score, 1e-7, 1 - 1e-7)
            logit = np.log(score_clipped / (1 - score_clipped))
            
            methods_results['mc_dropout'].append({
                'image_id': img_id,
                'category_id': pred['category_id'],
                'bbox': pred['bbox'],
                'score': score_clipped,
                'logit': logit,
                'uncertainty': pred.get('uncertainty', 0.0),
                'is_tp': is_tp
            })
            
            # Con TS
            score_ts = sigmoid(logit / temps['mc_dropout']['T'])
            methods_results['mc_dropout_ts'].append({
                'image_id': img_id,
                'category_id': pred['category_id'],
                'bbox': pred['bbox'],
                'score': score_ts,
                'logit': logit,
                'uncertainty': pred.get('uncertainty', 0.0),
                'is_tp': is_tp
            })
    else:
        # Calcular desde cero
        eval_counters['mc_computed'] += 1
        preds_mc = inference_mc_dropout(model, img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'], CONFIG['K_mc'])
        for pred in preds_mc:
            cat_id = CONFIG['categories'].index(pred['category']) + 1
            is_tp = 0
            for gt in gt_anns:
                if gt['category_id'] != cat_id:
                    continue
                gt_box = gt['bbox']
                gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
                if compute_iou(pred['bbox'], gt_box_xyxy) >= CONFIG['iou_matching']:
                    is_tp = 1
                    break
            
            methods_results['mc_dropout'].append({
                'image_id': img_id,
                'category_id': cat_id,
                'bbox': [pred['bbox'][0], pred['bbox'][1], pred['bbox'][2] - pred['bbox'][0], pred['bbox'][3] - pred['bbox'][1]],
                'score': pred['score'],
                'logit': pred['logit'],
                'uncertainty': pred['uncertainty'],
                'is_tp': is_tp
            })
            
            score_ts = sigmoid(pred['logit'] / temps['mc_dropout']['T'])
            methods_results['mc_dropout_ts'].append({
                'image_id': img_id,
                'category_id': cat_id,
                'bbox': [pred['bbox'][0], pred['bbox'][1], pred['bbox'][2] - pred['bbox'][0], pred['bbox'][3] - pred['bbox'][1]],
                'score': score_ts,
                'logit': pred['logit'],
                'uncertainty': pred['uncertainty'],
                'is_tp': is_tp
            })
    
    # ========================================================================
    # Decoder variance (siempre se calcula, es nuevo)
    # ========================================================================
    preds_dec = inference_decoder_variance(model, img_path, TEXT_PROMPT, CONFIG['conf_threshold'], CONFIG['device'])
    for pred in preds_dec:
        cat_id = CONFIG['categories'].index(pred['category']) + 1
        is_tp = 0
        for gt in gt_anns:
            if gt['category_id'] != cat_id:
                continue
            gt_box = gt['bbox']
            gt_box_xyxy = [gt_box[0], gt_box[1], gt_box[0] + gt_box[2], gt_box[1] + gt_box[3]]
            if compute_iou(pred['bbox'], gt_box_xyxy) >= CONFIG['iou_matching']:
                is_tp = 1
                break
        
        methods_results['decoder_variance'].append({
            'image_id': img_id,
            'category_id': cat_id,
            'bbox': [pred['bbox'][0], pred['bbox'][1], pred['bbox'][2] - pred['bbox'][0], pred['bbox'][3] - pred['bbox'][1]],
            'score': pred['score'],
            'logit': pred['logit'],
            'uncertainty': pred['uncertainty'],
            'layer_uncertainties': pred.get('layer_uncertainties', []),
            'is_tp': is_tp
        })
        
        score_ts = sigmoid(pred['logit'] / temps['decoder_variance']['T'])
        methods_results['decoder_variance_ts'].append({
            'image_id': img_id,
            'category_id': cat_id,
            'bbox': [pred['bbox'][0], pred['bbox'][1], pred['bbox'][2] - pred['bbox'][0], pred['bbox'][3] - pred['bbox'][1]],
            'score': score_ts,
            'logit': pred['logit'],
            'uncertainty': pred['uncertainty'],
            'layer_uncertainties': pred.get('layer_uncertainties', []),
            'is_tp': is_tp
        })

# Estad√≠sticas finales
print(f"\nüìä ESTAD√çSTICAS DE EVALUACI√ìN:")
print(f"   Baseline: {eval_counters['baseline_cached']} cacheadas, {eval_counters['baseline_computed']} calculadas")
print(f"   MC-Dropout: {eval_counters['mc_cached']} cacheadas, {eval_counters['mc_computed']} calculadas")

# Guardar resultados (CSV y JSON)
for method_name, results in methods_results.items():
    # Guardar CSV
    df = pd.DataFrame(results)
    df.to_csv(OUTPUT_DIR / f'eval_{method_name}.csv', index=False)
    
    # Guardar JSON (formato compatible con RQ1)
    with open(OUTPUT_DIR / f'eval_{method_name}.json', 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"\n{method_name}: {len(df)} detecciones")

print(f"\n‚úÖ Resultados de evaluaci√≥n guardados (CSV + JSON)")


EVALUACI√ìN EN VAL_EVAL (1,500 im√°genes restantes)
loading annotations into memory...
Done (t=0.18s)
creating index...
index created!
Procesando 1500 im√°genes para evaluaci√≥n final...

üìä Temperaturas a aplicar:
   mc_dropout: T=0.3192
   decoder_variance: T=2.6534
   baseline: T=4.2128

‚úÖ Indexando predicciones Baseline cacheadas para evaluaci√≥n
   ‚Üí 1491 im√°genes con predicciones cacheadas
‚úÖ Indexando predicciones MC-Dropout cacheadas para evaluaci√≥n
   ‚Üí 1498 im√°genes con predicciones cacheadas


Procesando evaluaci√≥n: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1500/1500 [08:37<00:00,  2.90it/s]



üìä ESTAD√çSTICAS DE EVALUACI√ìN:
   Baseline: 1491 cacheadas, 9 calculadas
   MC-Dropout: 1498 cacheadas, 2 calculadas

baseline: 16724 detecciones

baseline_ts: 16724 detecciones

mc_dropout: 22527 detecciones

mc_dropout_ts: 22527 detecciones

decoder_variance: 22793 detecciones

decoder_variance_ts: 22793 detecciones

‚úÖ Resultados de evaluaci√≥n guardados (CSV + JSON)


## 7. Calcular M√©tricas de Detecci√≥n (mAP)

In [12]:
detection_metrics = {}

for method_name in methods_results.keys():
    print(f"\nEvaluando {method_name}...")
    
    # Cargar predicciones en formato COCO
    preds_file = OUTPUT_DIR / f'eval_{method_name}.json'
    
    if os.path.getsize(preds_file) > 0:
        coco_dt = coco_eval.loadRes(str(preds_file))
        coco_eval_obj = COCOeval(coco_eval, coco_dt, 'bbox')
        coco_eval_obj.evaluate()
        coco_eval_obj.accumulate()
        coco_eval_obj.summarize()
        
        detection_metrics[method_name] = {
            'mAP': coco_eval_obj.stats[0],
            'AP50': coco_eval_obj.stats[1],
            'AP75': coco_eval_obj.stats[2],
            'AP_small': coco_eval_obj.stats[3],
            'AP_medium': coco_eval_obj.stats[4],
            'AP_large': coco_eval_obj.stats[5]
        }
        
        # mAP por clase
        per_class_ap = {}
        for cat_id, cat_name in enumerate(CONFIG['categories'], 1):
            coco_eval_obj.params.catIds = [cat_id]
            coco_eval_obj.evaluate()
            coco_eval_obj.accumulate()
            per_class_ap[cat_name] = coco_eval_obj.stats[0]
        
        detection_metrics[method_name]['per_class'] = per_class_ap
    else:
        detection_metrics[method_name] = {'mAP': 0.0, 'AP50': 0.0, 'AP75': 0.0}

with open(OUTPUT_DIR / 'detection_metrics.json', 'w') as f:
    json.dump(detection_metrics, f, indent=2)

print("\nM√©tricas de detecci√≥n guardadas")


Evaluando baseline...
Loading and preparing results...
DONE (t=0.15s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=3.50s).
Accumulating evaluation results...
DONE (t=0.55s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.170
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.279
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.171
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.063
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.182
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.377
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.188
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.284
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.285
 Average Recall     (AR) @[ IoU=0.50:0.95 | a

## 8. Tabla Comparativa de Detecci√≥n

In [13]:
with open(OUTPUT_DIR / 'detection_metrics.json', 'r') as f:
    det_metrics = json.load(f)

# Crear tabla comparativa
rows = []
for method_name, metrics in det_metrics.items():
    row = {
        'Method': method_name,
        'mAP': metrics.get('mAP', 0.0),
        'AP50': metrics.get('AP50', 0.0),
        'AP75': metrics.get('AP75', 0.0)
    }
    
    # Agregar mAP por clase principal
    if 'per_class' in metrics:
        for cat in ['person', 'car', 'truck', 'traffic_light', 'traffic_sign']:
            cat_key = cat.replace('_', ' ')
            row[f'AP_{cat}'] = metrics['per_class'].get(cat_key, 0.0)
    
    rows.append(row)

df_detection = pd.DataFrame(rows)
df_detection.to_csv(OUTPUT_DIR / 'detection_comparison.csv', index=False)

print("\n" + "="*80)
print("TABLA COMPARATIVA DE DETECCI√ìN")
print("="*80)
print(df_detection.to_string(index=False))
print("="*80)


TABLA COMPARATIVA DE DETECCI√ìN
             Method      mAP     AP50     AP75  AP_person   AP_car  AP_truck  AP_traffic_light  AP_traffic_sign
           baseline 0.170481 0.278535 0.170542   0.170481 0.170481  0.170481          0.170481         0.170481
        baseline_ts 0.170481 0.278535 0.170542   0.170481 0.170481  0.170481          0.170481         0.170481
         mc_dropout 0.182274 0.302312 0.181113   0.182274 0.182274  0.182274          0.182274         0.182274
      mc_dropout_ts 0.182274 0.302312 0.181113   0.182274 0.182274  0.182274          0.182274         0.182274
   decoder_variance 0.181892 0.302048 0.180095   0.181892 0.181892  0.181892          0.181892         0.181892
decoder_variance_ts 0.181892 0.302048 0.180095   0.181892 0.181892  0.181892          0.181892         0.181892


## 9. Calcular M√©tricas de Calibraci√≥n

In [14]:
def compute_calibration_metrics(logits, labels, T=1.0, n_bins=10):
    probs = sigmoid(logits / T)
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    
    # NLL
    nll = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    
    # Brier
    brier = np.mean((probs - labels) ** 2)
    
    # ECE
    bins = np.linspace(0, 1, n_bins + 1)
    digitized = np.digitize(probs, bins) - 1
    
    ece = 0.0
    bin_data = []
    
    for i in range(n_bins):
        mask = digitized == i
        if mask.sum() > 0:
            conf = probs[mask].mean()
            acc = labels[mask].mean()
            gap = abs(conf - acc)
            ece += gap * mask.sum() / len(probs)
            bin_data.append({
                'bin': i,
                'confidence': conf,
                'accuracy': acc,
                'count': mask.sum()
            })
    
    return {'NLL': nll, 'Brier': brier, 'ECE': ece, 'bin_data': bin_data}

calibration_metrics = {}

for method_name in methods_results.keys():
    df = pd.read_csv(OUTPUT_DIR / f'eval_{method_name}.csv')
    logits = df['logit'].values
    labels = df['is_tp'].values
    
    # Sin TS (T=1)
    if '_ts' not in method_name:
        metrics = compute_calibration_metrics(logits, labels, T=1.0, n_bins=CONFIG['n_bins'])
        calibration_metrics[method_name] = metrics
    else:
        # Con TS
        base_method = method_name.replace('_ts', '')
        T = temps[base_method]['T']
        metrics = compute_calibration_metrics(logits, labels, T=T, n_bins=CONFIG['n_bins'])
        calibration_metrics[method_name] = metrics
    
    print(f"{method_name}: NLL={metrics['NLL']:.4f}, Brier={metrics['Brier']:.4f}, ECE={metrics['ECE']:.4f}")

# Guardar
with open(OUTPUT_DIR / 'calibration_metrics.json', 'w') as f:
    # Convertir para JSON serializable
    cal_save = {}
    for k, v in calibration_metrics.items():
        cal_save[k] = {
            'NLL': v['NLL'],
            'Brier': v['Brier'],
            'ECE': v['ECE']
        }
    json.dump(cal_save, f, indent=2)

print("\nM√©tricas de calibraci√≥n guardadas")

baseline: NLL=0.7180, Brier=0.2618, ECE=0.2410
baseline_ts: NLL=0.6930, Brier=0.2499, ECE=0.1868
mc_dropout: NLL=0.7069, Brier=0.2561, ECE=0.2034
mc_dropout_ts: NLL=1.0070, Brier=0.3365, ECE=0.3428
decoder_variance: NLL=0.7093, Brier=0.2572, ECE=0.2065
decoder_variance_ts: NLL=0.6863, Brier=0.2466, ECE=0.1409

M√©tricas de calibraci√≥n guardadas


## 10. Tabla Comparativa de Calibraci√≥n

In [15]:
rows_calib = []
for method_name, metrics in calibration_metrics.items():
    rows_calib.append({
        'Method': method_name,
        'NLL': metrics['NLL'],
        'Brier': metrics['Brier'],
        'ECE': metrics['ECE']
    })

df_calibration = pd.DataFrame(rows_calib)
df_calibration.to_csv(OUTPUT_DIR / 'calibration_comparison.csv', index=False)

print("\n" + "="*80)
print("TABLA COMPARATIVA DE CALIBRACI√ìN")
print("="*80)
print(df_calibration.to_string(index=False))
print("="*80)
print("\nInterpretaci√≥n:")
print("  ‚Üì Menor es mejor para NLL, Brier, ECE")
print("  Si m√©todo+TS < m√©todo: TS mejor√≥ calibraci√≥n")


TABLA COMPARATIVA DE CALIBRACI√ìN
             Method      NLL    Brier      ECE
           baseline 0.718032 0.261844 0.240970
        baseline_ts 0.693014 0.249935 0.186833
         mc_dropout 0.706870 0.256100 0.203429
      mc_dropout_ts 1.007017 0.336512 0.342814
   decoder_variance 0.709267 0.257221 0.206473
decoder_variance_ts 0.686261 0.246594 0.140935

Interpretaci√≥n:
  ‚Üì Menor es mejor para NLL, Brier, ECE
  Si m√©todo+TS < m√©todo: TS mejor√≥ calibraci√≥n


## 11. Reliability Diagrams

In [16]:
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

method_pairs = [
    ('baseline', 'baseline_ts'),
    ('mc_dropout', 'mc_dropout_ts'),
    ('decoder_variance', 'decoder_variance_ts')
]

for idx, (method_before, method_after) in enumerate(method_pairs):
    ax = axes[idx * 2]
    
    # Sin TS
    bin_data = calibration_metrics[method_before]['bin_data']
    if len(bin_data) > 0:
        confidences = [b['confidence'] for b in bin_data]
        accuracies = [b['accuracy'] for b in bin_data]
        counts = [b['count'] for b in bin_data]
        
        ax.bar(range(len(confidences)), accuracies, alpha=0.3, label='Accuracy', color='blue')
        ax.plot(range(len(confidences)), confidences, 'o-', label='Confidence', color='red', markersize=8)
        ax.plot([0, len(confidences)-1], [0, 1], 'k--', alpha=0.3, label='Perfect calibration')
        ax.set_xlabel('Confidence bin')
        ax.set_ylabel('Proportion')
        ax.set_title(f'{method_before}\nECE={calibration_metrics[method_before]["ECE"]:.4f}')
        ax.legend()
        ax.grid(alpha=0.3)
    
    # Con TS
    ax = axes[idx * 2 + 1]
    bin_data = calibration_metrics[method_after]['bin_data']
    if len(bin_data) > 0:
        confidences = [b['confidence'] for b in bin_data]
        accuracies = [b['accuracy'] for b in bin_data]
        
        ax.bar(range(len(confidences)), accuracies, alpha=0.3, label='Accuracy', color='blue')
        ax.plot(range(len(confidences)), confidences, 'o-', label='Confidence', color='red', markersize=8)
        ax.plot([0, len(confidences)-1], [0, 1], 'k--', alpha=0.3, label='Perfect calibration')
        ax.set_xlabel('Confidence bin')
        ax.set_ylabel('Proportion')
        ax.set_title(f'{method_after}\nECE={calibration_metrics[method_after]["ECE"]:.4f}')
        ax.legend()
        ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'reliability_diagrams.png', dpi=150, bbox_inches='tight')
print(f"Reliability diagrams guardados en: {OUTPUT_DIR / 'reliability_diagrams.png'}")
plt.close()

Reliability diagrams guardados en: outputs/comparison/reliability_diagrams.png


## 12. Risk-Coverage Analysis

## 13. M√©tricas de Incertidumbre: AUROC TP vs FP

In [17]:
from sklearn.metrics import roc_auc_score, roc_curve

uncertainty_auroc = {}

# Solo m√©todos con incertidumbre (MC-Dropout y Decoder Variance)
uncertainty_methods = ['mc_dropout', 'mc_dropout_ts', 'decoder_variance', 'decoder_variance_ts']

print("="*80)
print("AUROC: ¬øLa incertidumbre detecta errores (FP)?")
print("="*80)
print("\nObjetivo: Usar incertidumbre para distinguir FP (errores) de TP (aciertos)")
print("Interpretaci√≥n: AUROC > 0.5 (random), ideal ‚â• 0.7")
print("-"*80)

for method_name in uncertainty_methods:
    df = pd.read_csv(OUTPUT_DIR / f'eval_{method_name}.csv')
    
    if len(df) > 0 and 'uncertainty' in df.columns:
        uncertainties = df['uncertainty'].values
        is_tp = df['is_tp'].values
        
        # Verificar que hay TPs y FPs
        if len(np.unique(is_tp)) > 1 and len(uncertainties) > 0:
            # AUROC: predecir FP (error) usando incertidumbre
            # Invertir labels: 1=FP (error), 0=TP (correcto)
            is_fp = 1 - is_tp
            
            try:
                auroc = roc_auc_score(is_fp, uncertainties)
                
                # Estad√≠sticas de incertidumbre
                unc_tp = uncertainties[is_tp == 1]
                unc_fp = uncertainties[is_tp == 0]
                
                mean_unc_tp = unc_tp.mean() if len(unc_tp) > 0 else 0.0
                mean_unc_fp = unc_fp.mean() if len(unc_fp) > 0 else 0.0
                
                uncertainty_auroc[method_name] = {
                    'auroc': auroc,
                    'mean_unc_tp': mean_unc_tp,
                    'mean_unc_fp': mean_unc_fp,
                    'n_tp': int(is_tp.sum()),
                    'n_fp': int((1 - is_tp).sum())
                }
                
                print(f"\n{method_name}:")
                print(f"  AUROC (FP detection): {auroc:.4f}")
                print(f"  Mean uncertainty TP:  {mean_unc_tp:.6f}")
                print(f"  Mean uncertainty FP:  {mean_unc_fp:.6f}")
                print(f"  Ratio (FP/TP):        {mean_unc_fp/mean_unc_tp if mean_unc_tp > 0 else 0:.2f}x")
                print(f"  Samples: {int(is_tp.sum())} TP, {int((1-is_tp).sum())} FP")
                
            except Exception as e:
                print(f"\n{method_name}: Error calculando AUROC - {e}")
        else:
            print(f"\n{method_name}: Datos insuficientes para AUROC")

# Guardar resultados
with open(OUTPUT_DIR / 'uncertainty_auroc.json', 'w') as f:
    json.dump(uncertainty_auroc, f, indent=2)

print("\n" + "="*80)
print(f"Resultados guardados en: {OUTPUT_DIR / 'uncertainty_auroc.json'}")


AUROC: ¬øLa incertidumbre detecta errores (FP)?

Objetivo: Usar incertidumbre para distinguir FP (errores) de TP (aciertos)
Interpretaci√≥n: AUROC > 0.5 (random), ideal ‚â• 0.7
--------------------------------------------------------------------------------

mc_dropout:
  AUROC (FP detection): 0.6335
  Mean uncertainty TP:  0.000061
  Mean uncertainty FP:  0.000126
  Ratio (FP/TP):        2.07x
  Samples: 13317 TP, 9210 FP

mc_dropout_ts:
  AUROC (FP detection): 0.6335
  Mean uncertainty TP:  0.000061
  Mean uncertainty FP:  0.000126
  Ratio (FP/TP):        2.07x
  Samples: 13317 TP, 9210 FP

decoder_variance:
  AUROC (FP detection): 0.5000
  Mean uncertainty TP:  0.000000
  Mean uncertainty FP:  0.000000
  Ratio (FP/TP):        0.00x
  Samples: 13508 TP, 9285 FP

decoder_variance_ts:
  AUROC (FP detection): 0.5000
  Mean uncertainty TP:  0.000000
  Mean uncertainty FP:  0.000000
  Ratio (FP/TP):        0.00x
  Samples: 13508 TP, 9285 FP

Resultados guardados en: outputs/comparison/unc

In [18]:
# Tabla comparativa AUROC
rows_auroc = []
for method_name, metrics in uncertainty_auroc.items():
    rows_auroc.append({
        'Method': method_name,
        'AUROC (FP detection) ‚Üë': metrics['auroc'],
        'Mean Unc. TP': metrics['mean_unc_tp'],
        'Mean Unc. FP': metrics['mean_unc_fp'],
        'Ratio (FP/TP)': metrics['mean_unc_fp'] / metrics['mean_unc_tp'] if metrics['mean_unc_tp'] > 0 else 0
    })

df_auroc = pd.DataFrame(rows_auroc)
df_auroc.to_csv(OUTPUT_DIR / 'uncertainty_auroc_comparison.csv', index=False)

print("\n" + "="*80)
print("TABLA COMPARATIVA: AUROC TP vs FP")
print("="*80)
print(df_auroc.to_string(index=False))
print("="*80)
print("\nInterpretaci√≥n:")
print("  ‚Üë Mayor AUROC = mejor detecci√≥n de errores")
print("  Ratio (FP/TP) > 1 = incertidumbre mayor en errores (deseable)")
print("  AUROC ‚â• 0.7 = incertidumbre √∫til para rechazo selectivo")



TABLA COMPARATIVA: AUROC TP vs FP
             Method  AUROC (FP detection) ‚Üë  Mean Unc. TP  Mean Unc. FP  Ratio (FP/TP)
         mc_dropout                0.633462      0.000061      0.000126       2.070721
      mc_dropout_ts                0.633462      0.000061      0.000126       2.070721
   decoder_variance                0.500000      0.000000      0.000000       0.000000
decoder_variance_ts                0.500000      0.000000      0.000000       0.000000

Interpretaci√≥n:
  ‚Üë Mayor AUROC = mejor detecci√≥n de errores
  Ratio (FP/TP) > 1 = incertidumbre mayor en errores (deseable)
  AUROC ‚â• 0.7 = incertidumbre √∫til para rechazo selectivo


In [19]:
# Visualizaci√≥n: Distribuciones de incertidumbre y ROC curves
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

methods_to_plot = ['mc_dropout', 'decoder_variance']
colors_methods = {'mc_dropout': 'blue', 'decoder_variance': 'green'}

for idx, method_name in enumerate(methods_to_plot):
    # Fila 1: Distribuciones de incertidumbre (TP vs FP)
    ax_dist = axes[0, idx]
    
    df = pd.read_csv(OUTPUT_DIR / f'eval_{method_name}.csv')
    if len(df) > 0 and 'uncertainty' in df.columns:
        unc_tp = df[df['is_tp'] == 1]['uncertainty'].values
        unc_fp = df[df['is_tp'] == 0]['uncertainty'].values
        
        ax_dist.hist(unc_tp, bins=50, alpha=0.6, label=f'TP (n={len(unc_tp)})', color='green', density=True)
        ax_dist.hist(unc_fp, bins=50, alpha=0.6, label=f'FP (n={len(unc_fp)})', color='red', density=True)
        ax_dist.axvline(unc_tp.mean(), color='green', linestyle='--', linewidth=2, label=f'Mean TP: {unc_tp.mean():.4f}')
        ax_dist.axvline(unc_fp.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean FP: {unc_fp.mean():.4f}')
        ax_dist.set_xlabel('Uncertainty', fontsize=11)
        ax_dist.set_ylabel('Density', fontsize=11)
        ax_dist.set_title(f'{method_name.replace("_", " ").title()}\nDistribuci√≥n de Incertidumbre', fontsize=12, fontweight='bold')
        ax_dist.legend(fontsize=9)
        ax_dist.grid(alpha=0.3)
    
    # Fila 2: ROC curves
    ax_roc = axes[1, idx]
    
    if method_name in uncertainty_auroc:
        is_tp = df['is_tp'].values
        is_fp = 1 - is_tp
        uncertainties = df['uncertainty'].values
        
        fpr, tpr, thresholds = roc_curve(is_fp, uncertainties)
        auroc = uncertainty_auroc[method_name]['auroc']
        
        ax_roc.plot(fpr, tpr, linewidth=2, label=f'AUROC = {auroc:.4f}', color=colors_methods[method_name])
        ax_roc.plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random (0.5)')
        ax_roc.set_xlabel('False Positive Rate', fontsize=11)
        ax_roc.set_ylabel('True Positive Rate', fontsize=11)
        ax_roc.set_title(f'{method_name.replace("_", " ").title()}\nROC Curve (FP Detection)', fontsize=12, fontweight='bold')
        ax_roc.legend(fontsize=10)
        ax_roc.grid(alpha=0.3)
        ax_roc.set_xlim([0, 1])
        ax_roc.set_ylim([0, 1])

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'uncertainty_analysis.png', dpi=150, bbox_inches='tight')
print(f"\nVisualizaci√≥n de incertidumbre guardada en: {OUTPUT_DIR / 'uncertainty_analysis.png'}")
plt.close()



Visualizaci√≥n de incertidumbre guardada en: outputs/comparison/uncertainty_analysis.png


## 12. Risk-Coverage Analysis

In [20]:
def compute_risk_coverage(df, uncertainty_col='uncertainty'):
    """Calcula curva risk-coverage"""
    df_sorted = df.sort_values(uncertainty_col, ascending=False).reset_index(drop=True)
    
    coverages = []
    risks = []
    
    for i in range(1, len(df_sorted) + 1):
        coverage = i / len(df_sorted)
        risk = 1 - df_sorted.iloc[:i]['is_tp'].mean()
        coverages.append(coverage)
        risks.append(risk)
    
    # AUC (√°rea bajo la curva)
    auc = np.trapz(risks, coverages)
    
    return coverages, risks, auc

# Calcular risk-coverage para m√©todos con incertidumbre
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

methods_with_uncertainty = ['mc_dropout', 'mc_dropout_ts', 'decoder_variance', 'decoder_variance_ts']
colors = ['blue', 'cyan', 'red', 'orange']

risk_coverage_results = {}

for ax_idx, method_name in enumerate(['mc_dropout', 'decoder_variance']):
    ax = axes[ax_idx]
    
    for variant, color in [(method_name, 'blue'), (f'{method_name}_ts', 'red')]:
        df = pd.read_csv(OUTPUT_DIR / f'eval_{variant}.csv')
        
        if len(df) > 0 and 'uncertainty' in df.columns:
            coverages, risks, auc = compute_risk_coverage(df, 'uncertainty')
            
            label = variant.replace('_', ' ').title()
            ax.plot(coverages, risks, label=f'{label} (AUC={auc:.3f})', color=color, linewidth=2)
            
            risk_coverage_results[variant] = {
                'coverages': coverages,
                'risks': risks,
                'auc': auc
            }
    
    ax.set_xlabel('Coverage', fontsize=12)
    ax.set_ylabel('Risk (1 - Accuracy)', fontsize=12)
    ax.set_title(f'Risk-Coverage: {method_name.replace("_", " ").title()}', fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'risk_coverage_curves.png', dpi=150, bbox_inches='tight')
print(f"Risk-coverage curves guardadas en: {OUTPUT_DIR / 'risk_coverage_curves.png'}")
plt.close()

# Guardar AUC
auc_summary = {k: v['auc'] for k, v in risk_coverage_results.items()}
with open(OUTPUT_DIR / 'risk_coverage_auc.json', 'w') as f:
    json.dump(auc_summary, f, indent=2)

print("\nRisk-Coverage AUC:")
for method, auc in auc_summary.items():
    print(f"  {method}: {auc:.4f}")

Risk-coverage curves guardadas en: outputs/comparison/risk_coverage_curves.png

Risk-Coverage AUC:
  mc_dropout: 0.5245
  mc_dropout_ts: 0.5245
  decoder_variance: 0.4101
  decoder_variance_ts: 0.4101


## 14. Resumen Final y Reporte

In [21]:
print("\n" + "="*80)
print("RESUMEN FINAL - COMPARACI√ìN DE M√âTODOS")
print("="*80)

# Cargar todas las m√©tricas
det_metrics = json.load(open(OUTPUT_DIR / 'detection_metrics.json'))
cal_metrics = json.load(open(OUTPUT_DIR / 'calibration_metrics.json'))
temps = json.load(open(OUTPUT_DIR / 'temperatures.json'))
auc_summary = json.load(open(OUTPUT_DIR / 'risk_coverage_auc.json'))
uncertainty_auroc_data = json.load(open(OUTPUT_DIR / 'uncertainty_auroc.json'))

print("\n1. M√âTRICAS DE DETECCI√ìN (mAP@[0.5:0.95])")
print("-" * 80)
for method in ['baseline', 'baseline_ts', 'mc_dropout', 'mc_dropout_ts', 'decoder_variance', 'decoder_variance_ts']:
    mAP = det_metrics[method].get('mAP', 0.0)
    AP50 = det_metrics[method].get('AP50', 0.0)
    AP75 = det_metrics[method].get('AP75', 0.0)
    print(f"{method:25s}  mAP={mAP:.4f}  AP50={AP50:.4f}  AP75={AP75:.4f}")

print("\n2. M√âTRICAS DE CALIBRACI√ìN")
print("-" * 80)
print(f"{'Method':<25s} {'NLL ‚Üì':>10s} {'Brier ‚Üì':>10s} {'ECE ‚Üì':>10s}")
print("-" * 80)
for method in ['baseline', 'baseline_ts', 'mc_dropout', 'mc_dropout_ts', 'decoder_variance', 'decoder_variance_ts']:
    nll = cal_metrics[method]['NLL']
    brier = cal_metrics[method]['Brier']
    ece = cal_metrics[method]['ECE']
    print(f"{method:<25s} {nll:>10.4f} {brier:>10.4f} {ece:>10.4f}")

print("\n3. TEMPERATURAS OPTIMIZADAS")
print("-" * 80)
for method in ['baseline', 'mc_dropout', 'decoder_variance']:
    T = temps[method]['T']
    nll_before = temps[method]['nll_before']
    nll_after = temps[method]['nll_after']
    improvement = nll_before - nll_after
    print(f"{method:20s}  T={T:.4f}  NLL: {nll_before:.4f} ‚Üí {nll_after:.4f} (Œî={improvement:.4f})")

print("\n4. RISK-COVERAGE AUC (menor es mejor)")
print("-" * 80)
for method, auc in auc_summary.items():
    print(f"{method:25s}  AUC={auc:.4f}")

print("\n5. INCERTIDUMBRE: AUROC TP vs FP (mayor es mejor)")
print("-" * 80)
print(f"{'Method':<25s} {'AUROC ‚Üë':>10s} {'Mean Unc TP':>15s} {'Mean Unc FP':>15s} {'Ratio':>10s}")
print("-" * 80)
for method, data in uncertainty_auroc_data.items():
    auroc = data['auroc']
    mean_tp = data['mean_unc_tp']
    mean_fp = data['mean_unc_fp']
    ratio = mean_fp / mean_tp if mean_tp > 0 else 0
    print(f"{method:<25s} {auroc:>10.4f} {mean_tp:>15.6f} {mean_fp:>15.6f} {ratio:>10.2f}x")

print("\n" + "="*80)
print("CONCLUSIONES")
print("="*80)
print("‚úì Baseline: rendimiento de referencia sin incertidumbre")
print("‚úì Temperature Scaling: mejora calibraci√≥n sin afectar mAP")
print("‚úì MC-Dropout: proporciona incertidumbre epist√©mica (K pases)")
print("‚úì Decoder variance: incertidumbre en single-pass (m√°s eficiente)")
print("‚úì M√©todos+TS: mejor calibraci√≥n manteniendo detecci√≥n")
print("‚úì AUROC TP vs FP: valida que incertidumbre detecta errores")
print("="*80)

# Guardar reporte final
final_report = {
    'timestamp': datetime.now().isoformat(),
    'config': CONFIG,
    'detection_metrics': det_metrics,
    'calibration_metrics': cal_metrics,
    'temperatures': temps,
    'risk_coverage_auc': auc_summary,
    'uncertainty_auroc': uncertainty_auroc_data
}

with open(OUTPUT_DIR / 'final_report.json', 'w') as f:
    json.dump(final_report, f, indent=2)

print(f"\nReporte final guardado en: {OUTPUT_DIR / 'final_report.json'}")
print(f"Todos los artefactos en: {OUTPUT_DIR}")


RESUMEN FINAL - COMPARACI√ìN DE M√âTODOS

1. M√âTRICAS DE DETECCI√ìN (mAP@[0.5:0.95])
--------------------------------------------------------------------------------
baseline                   mAP=0.1705  AP50=0.2785  AP75=0.1705
baseline_ts                mAP=0.1705  AP50=0.2785  AP75=0.1705
mc_dropout                 mAP=0.1823  AP50=0.3023  AP75=0.1811
mc_dropout_ts              mAP=0.1823  AP50=0.3023  AP75=0.1811
decoder_variance           mAP=0.1819  AP50=0.3020  AP75=0.1801
decoder_variance_ts        mAP=0.1819  AP50=0.3020  AP75=0.1801

2. M√âTRICAS DE CALIBRACI√ìN
--------------------------------------------------------------------------------
Method                         NLL ‚Üì    Brier ‚Üì      ECE ‚Üì
--------------------------------------------------------------------------------
baseline                      0.7180     0.2618     0.2410
baseline_ts                   0.6930     0.2499     0.1868
mc_dropout                    0.7069     0.2561     0.2034
mc_dropout_ts 

## 15. Visualizaci√≥n Final Comparativa

In [22]:
fig = plt.figure(figsize=(20, 14))
gs = fig.add_gridspec(4, 3, hspace=0.3, wspace=0.3)

# 1. mAP Comparison
ax1 = fig.add_subplot(gs[0, :])
methods = ['baseline', 'baseline_ts', 'mc_dropout', 'mc_dropout_ts', 'decoder_variance', 'decoder_variance_ts']
mAPs = [det_metrics[m].get('mAP', 0.0) for m in methods]
colors_map = ['lightblue', 'blue', 'lightcoral', 'red', 'lightgreen', 'green']
bars = ax1.bar(range(len(methods)), mAPs, color=colors_map, alpha=0.7)
ax1.set_xticks(range(len(methods)))
ax1.set_xticklabels([m.replace('_', '\n') for m in methods], fontsize=10)
ax1.set_ylabel('mAP@[0.5:0.95]', fontsize=12)
ax1.set_title('Comparaci√≥n de mAP entre M√©todos', fontsize=14, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)
for i, bar in enumerate(bars):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height, f'{mAPs[i]:.3f}', ha='center', va='bottom', fontsize=9)

# 2. Calibration Metrics Comparison
ax2 = fig.add_subplot(gs[1, 0])
nlls = [cal_metrics[m]['NLL'] for m in methods]
ax2.bar(range(len(methods)), nlls, color=colors_map, alpha=0.7)
ax2.set_xticks(range(len(methods)))
ax2.set_xticklabels([m.replace('_', '\n') for m in methods], fontsize=8)
ax2.set_ylabel('NLL ‚Üì', fontsize=11)
ax2.set_title('Negative Log-Likelihood', fontsize=12, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

ax3 = fig.add_subplot(gs[1, 1])
briers = [cal_metrics[m]['Brier'] for m in methods]
ax3.bar(range(len(methods)), briers, color=colors_map, alpha=0.7)
ax3.set_xticks(range(len(methods)))
ax3.set_xticklabels([m.replace('_', '\n') for m in methods], fontsize=8)
ax3.set_ylabel('Brier Score ‚Üì', fontsize=11)
ax3.set_title('Brier Score', fontsize=12, fontweight='bold')
ax3.grid(axis='y', alpha=0.3)

ax4 = fig.add_subplot(gs[1, 2])
eces = [cal_metrics[m]['ECE'] for m in methods]
ax4.bar(range(len(methods)), eces, color=colors_map, alpha=0.7)
ax4.set_xticks(range(len(methods)))
ax4.set_xticklabels([m.replace('_', '\n') for m in methods], fontsize=8)
ax4.set_ylabel('ECE ‚Üì', fontsize=11)
ax4.set_title('Expected Calibration Error', fontsize=12, fontweight='bold')
ax4.grid(axis='y', alpha=0.3)

# 3. Temperature Scaling Effect
ax5 = fig.add_subplot(gs[2, 0])
base_methods = ['baseline', 'mc_dropout', 'decoder_variance']
Ts = [temps[m]['T'] for m in base_methods]
ax5.bar(range(len(base_methods)), Ts, color=['blue', 'red', 'green'], alpha=0.7)
ax5.axhline(y=1.0, color='black', linestyle='--', alpha=0.5, label='T=1 (sin calibrar)')
ax5.set_xticks(range(len(base_methods)))
ax5.set_xticklabels([m.replace('_', '\n') for m in base_methods], fontsize=10)
ax5.set_ylabel('Temperature T', fontsize=11)
ax5.set_title('Temperaturas √ìptimas', fontsize=12, fontweight='bold')
ax5.legend()
ax5.grid(axis='y', alpha=0.3)

# 4. Risk-Coverage AUC
ax6 = fig.add_subplot(gs[2, 1])
unc_methods = ['mc_dropout', 'mc_dropout_ts', 'decoder_variance', 'decoder_variance_ts']
aucs = [auc_summary.get(m, 0.0) for m in unc_methods]
colors_unc = ['lightcoral', 'red', 'lightgreen', 'green']
bars_auc = ax6.bar(range(len(unc_methods)), aucs, color=colors_unc, alpha=0.7)
ax6.set_xticks(range(len(unc_methods)))
ax6.set_xticklabels([m.replace('_', '\n') for m in unc_methods], fontsize=9)
ax6.set_ylabel('AUC (Risk-Coverage) ‚Üì', fontsize=11)
ax6.set_title('Risk-Coverage AUC', fontsize=12, fontweight='bold')
ax6.grid(axis='y', alpha=0.3)
for i, bar in enumerate(bars_auc):
    height = bar.get_height()
    ax6.text(bar.get_x() + bar.get_width()/2., height, f'{aucs[i]:.3f}', ha='center', va='bottom', fontsize=8)

# 5. AUROC TP vs FP (Nueva secci√≥n)
ax7 = fig.add_subplot(gs[2, 2])
auroc_methods = list(uncertainty_auroc_data.keys())
aurocs = [uncertainty_auroc_data[m]['auroc'] for m in auroc_methods]
colors_auroc = ['lightcoral', 'red', 'lightgreen', 'green']
bars_auroc = ax7.bar(range(len(auroc_methods)), aurocs, color=colors_auroc, alpha=0.7)
ax7.axhline(y=0.5, color='black', linestyle='--', alpha=0.5, label='Random')
ax7.axhline(y=0.7, color='orange', linestyle='--', alpha=0.5, label='Good threshold')
ax7.set_xticks(range(len(auroc_methods)))
ax7.set_xticklabels([m.replace('_', '\n') for m in auroc_methods], fontsize=9)
ax7.set_ylabel('AUROC (FP detection) ‚Üë', fontsize=11)
ax7.set_title('AUROC: Detecci√≥n de Errores', fontsize=12, fontweight='bold')
ax7.legend(fontsize=8)
ax7.grid(axis='y', alpha=0.3)
ax7.set_ylim([0, 1])
for i, bar in enumerate(bars_auroc):
    height = bar.get_height()
    ax7.text(bar.get_x() + bar.get_width()/2., height, f'{aurocs[i]:.3f}', ha='center', va='bottom', fontsize=8)

# 6. Resumen de incertidumbre (ratio FP/TP)
ax8 = fig.add_subplot(gs[3, :])
ratios = [uncertainty_auroc_data[m]['mean_unc_fp'] / uncertainty_auroc_data[m]['mean_unc_tp'] 
          if uncertainty_auroc_data[m]['mean_unc_tp'] > 0 else 0 
          for m in auroc_methods]
bars_ratio = ax8.bar(range(len(auroc_methods)), ratios, color=colors_auroc, alpha=0.7)
ax8.axhline(y=1.0, color='black', linestyle='--', alpha=0.5, label='Ratio = 1 (sin diferencia)')
ax8.set_xticks(range(len(auroc_methods)))
ax8.set_xticklabels([m.replace('_', '\n') for m in auroc_methods], fontsize=10)
ax8.set_ylabel('Ratio Mean(Unc_FP) / Mean(Unc_TP)', fontsize=11)
ax8.set_title('Ratio de Incertidumbre: FP vs TP (>1 es deseable)', fontsize=12, fontweight='bold')
ax8.legend()
ax8.grid(axis='y', alpha=0.3)
for i, bar in enumerate(bars_ratio):
    height = bar.get_height()
    ax8.text(bar.get_x() + bar.get_width()/2., height, f'{ratios[i]:.2f}x', ha='center', va='bottom', fontsize=9)

plt.suptitle('Fase 5: Comparaci√≥n Completa de M√©todos de Incertidumbre y Calibraci√≥n', 
             fontsize=16, fontweight='bold', y=0.997)

plt.savefig(OUTPUT_DIR / 'final_comparison_summary.png', dpi=150, bbox_inches='tight')
print(f"\nVisualizaci√≥n final guardada en: {OUTPUT_DIR / 'final_comparison_summary.png'}")
plt.close()

print("\n" + "="*80)
print("FASE 5 COMPLETADA")
print("="*80)
print(f"Todos los resultados guardados en: {OUTPUT_DIR}")
print("\nArchivos generados:")
print("  - config.yaml")
print("  - temperatures.json")
print("  - detection_metrics.json")
print("  - calibration_metrics.json")
print("  - risk_coverage_auc.json")
print("  - uncertainty_auroc.json")
print("  - uncertainty_auroc_comparison.csv")
print("  - final_report.json")
print("  - detection_comparison.csv")
print("  - calibration_comparison.csv")
print("  - reliability_diagrams.png")
print("  - risk_coverage_curves.png")
print("  - uncertainty_analysis.png")
print("  - final_comparison_summary.png")
print("="*80)


Visualizaci√≥n final guardada en: outputs/comparison/final_comparison_summary.png

FASE 5 COMPLETADA
Todos los resultados guardados en: outputs/comparison

Archivos generados:
  - config.yaml
  - temperatures.json
  - detection_metrics.json
  - calibration_metrics.json
  - risk_coverage_auc.json
  - uncertainty_auroc.json
  - uncertainty_auroc_comparison.csv
  - final_report.json
  - detection_comparison.csv
  - calibration_comparison.csv
  - reliability_diagrams.png
  - risk_coverage_curves.png
  - uncertainty_analysis.png
  - final_comparison_summary.png


In [23]:
#!/usr/bin/env python3
"""
Script de verificaci√≥n de optimizaciones de Fase 5
===================================================

Este script verifica que:
1. Los archivos de fases anteriores existen
2. Los formatos de datos son correctos
3. Las predicciones son compatibles
4. Estima el tiempo que se ahorrar√°
"""

import json
import sys
from pathlib import Path
from datetime import timedelta


# Colores para terminal
class Colors:
    GREEN = "\033[92m"
    YELLOW = "\033[93m"
    RED = "\033[91m"
    BLUE = "\033[94m"
    BOLD = "\033[1m"
    END = "\033[0m"


def check_file(path, description):
    """Verifica si un archivo existe y retorna su info"""
    path = Path(path)
    if path.exists():
        size_mb = path.stat().st_size / (1024 * 1024)
        print(f"{Colors.GREEN}‚úÖ {description}{Colors.END}")
        print(f"   Ubicaci√≥n: {path}")
        print(f"   Tama√±o: {size_mb:.2f} MB")
        return True, size_mb
    else:
        print(f"{Colors.RED}‚ùå {description}{Colors.END}")
        print(f"   {Colors.YELLOW}No encontrado: {path}{Colors.END}")
        return False, 0


def verify_json_format(path, expected_keys):
    """Verifica que un JSON tenga el formato esperado"""
    try:
        with open(path, "r") as f:
            data = json.load(f)

        if not isinstance(data, list) or len(data) == 0:
            return False, "No es una lista o est√° vac√≠a"

        sample = data[0]
        missing_keys = [k for k in expected_keys if k not in sample]

        if missing_keys:
            return False, f"Faltan keys: {missing_keys}"

        return True, f"{len(data)} registros"
    except Exception as e:
        return False, str(e)


def main():
    print(f"\n{Colors.BOLD}{Colors.BLUE}{'='*70}")
    print("VERIFICACI√ìN DE OPTIMIZACIONES - FASE 5")
    print(f"{'='*70}{Colors.END}\n")

    # Paths
    base_dir = Path("..")
    fase2_preds = base_dir / "fase 2" / "outputs" / "baseline" / "preds_raw.json"
    fase3_preds = (
        base_dir / "fase 3" / "outputs" / "mc_dropout" / "preds_mc_aggregated.json"
    )
    fase4_temp = (
        base_dir / "fase 4" / "outputs" / "temperature_scaling" / "temperature.json"
    )

    # Contadores
    files_found = 0
    total_files = 3
    time_saved = 0

    # ========================================================================
    print(f"{Colors.BOLD}1. VERIFICACI√ìN DE ARCHIVOS DE FASE 2 (Baseline){Colors.END}")
    print("-" * 70)

    exists, size = check_file(fase2_preds, "Predicciones Baseline")
    if exists:
        files_found += 1
        time_saved += 45  # 45 minutos ahorrados

        # Verificar formato
        valid, info = verify_json_format(
            fase2_preds, ["image_id", "category_id", "bbox", "score"]
        )
        if valid:
            print(f"   {Colors.GREEN}Formato: ‚úÖ Correcto ({info}){Colors.END}")
        else:
            print(f"   {Colors.YELLOW}Formato: ‚ö†Ô∏è  {info}{Colors.END}")

    print()

    # ========================================================================
    print(
        f"{Colors.BOLD}2. VERIFICACI√ìN DE ARCHIVOS DE FASE 3 (MC-Dropout){Colors.END}"
    )
    print("-" * 70)

    exists, size = check_file(fase3_preds, "Predicciones MC-Dropout")
    if exists:
        files_found += 1
        time_saved += 90  # 90 minutos ahorrados (K=5 es costoso)

        # Verificar formato
        valid, info = verify_json_format(
            fase3_preds, ["image_id", "category_id", "bbox", "score", "uncertainty"]
        )
        if valid:
            print(f"   {Colors.GREEN}Formato: ‚úÖ Correcto ({info}){Colors.END}")
        else:
            print(f"   {Colors.YELLOW}Formato: ‚ö†Ô∏è  {info}{Colors.END}")

    print()

    # ========================================================================
    print(
        f"{Colors.BOLD}3. VERIFICACI√ìN DE ARCHIVOS DE FASE 4 (Temperature){Colors.END}"
    )
    print("-" * 70)

    exists, size = check_file(fase4_temp, "Temperaturas Optimizadas")
    if exists:
        files_found += 1
        time_saved += 2  # 2 minutos ahorrados

        # Verificar formato
        try:
            with open(fase4_temp, "r") as f:
                temps = json.load(f)

            if "optimal_temperature" in temps:
                T = temps["optimal_temperature"]
                print(f"   {Colors.GREEN}Formato: ‚úÖ Correcto (T={T:.4f}){Colors.END}")
            else:
                print(
                    f"   {Colors.YELLOW}Formato: ‚ö†Ô∏è  Falta 'optimal_temperature'{Colors.END}"
                )
        except Exception as e:
            print(f"   {Colors.YELLOW}Formato: ‚ö†Ô∏è  Error: {e}{Colors.END}")

    print()

    # ========================================================================
    print(f"{Colors.BOLD}{'='*70}")
    print("RESUMEN")
    print(f"{'='*70}{Colors.END}")

    print(
        f"\n{Colors.BOLD}Archivos encontrados:{Colors.END} {files_found}/{total_files}"
    )

    if files_found == total_files:
        print(f"{Colors.GREEN}‚úÖ TODOS los archivos est√°n disponibles{Colors.END}")
    elif files_found > 0:
        print(f"{Colors.YELLOW}‚ö†Ô∏è  Algunos archivos est√°n disponibles{Colors.END}")
    else:
        print(f"{Colors.RED}‚ùå NO hay archivos disponibles{Colors.END}")

    # Estimaci√≥n de tiempo
    print(f"\n{Colors.BOLD}Tiempo estimado ahorrado:{Colors.END}")

    if time_saved > 0:
        td = timedelta(minutes=time_saved)
        hours = td.seconds // 3600
        minutes = (td.seconds % 3600) // 60

        print(f"   {Colors.GREEN}‚ö° ~{hours}h {minutes}min{Colors.END}")

        if files_found == total_files:
            print(f"\n{Colors.BOLD}Tiempo de ejecuci√≥n esperado:{Colors.END}")
            print(
                f"   {Colors.GREEN}üìä ~15-20 minutos{Colors.END} (solo Decoder Variance)"
            )
        else:
            missing = total_files - files_found
            est_time = 137 - time_saved  # 137 min total original
            print(f"\n{Colors.BOLD}Tiempo de ejecuci√≥n esperado:{Colors.END}")
            print(
                f"   {Colors.YELLOW}üìä ~{est_time} minutos{Colors.END} (calcular {missing} m√©todo(s) faltante(s))"
            )
    else:
        print(f"   {Colors.RED}‚ùå 0 minutos{Colors.END}")
        print(f"\n{Colors.BOLD}Tiempo de ejecuci√≥n esperado:{Colors.END}")
        print(f"   {Colors.RED}üìä ~2 horas{Colors.END} (inferencia completa)")

    # Recomendaciones
    print(f"\n{Colors.BOLD}{'='*70}")
    print("RECOMENDACIONES")
    print(f"{'='*70}{Colors.END}")

    if files_found == total_files:
        print(
            f"{Colors.GREEN}‚úÖ Perfecto! Puedes ejecutar Fase 5 directamente.{Colors.END}"
        )
        print(f"   El notebook usar√° todos los resultados cacheados.")
    elif files_found == 0:
        print(f"{Colors.YELLOW}‚ö†Ô∏è  Ejecuta las siguientes fases primero:{Colors.END}")
        print(f"   1. Fase 2: Genera predicciones baseline")
        print(f"   2. Fase 3: Genera predicciones MC-Dropout")
        print(f"   3. Fase 4: Optimiza temperaturas")
        print(f"\n   O ejecuta Fase 5 directamente (tardar√° ~2 horas)")
    else:
        print(f"{Colors.YELLOW}‚ö†Ô∏è  Tienes optimizaci√≥n parcial.{Colors.END}")

        if not (base_dir / fase2_preds).exists():
            print(f"   ‚Ä¢ Ejecuta Fase 2 para predicciones baseline")
        if not (base_dir / fase3_preds).exists():
            print(f"   ‚Ä¢ Ejecuta Fase 3 para predicciones MC-Dropout")
        if not (base_dir / fase4_temp).exists():
            print(f"   ‚Ä¢ Ejecuta Fase 4 para temperaturas")

        print(f"\n   O ejecuta Fase 5 ahora (ahorrar√° ~{time_saved} min)")

    print(f"\n{Colors.BOLD}{'='*70}{Colors.END}\n")

    # Exit code
    return 0 if files_found == total_files else 1


if __name__ == "__main__":
    sys.exit(main())



VERIFICACI√ìN DE OPTIMIZACIONES - FASE 5

[1m1. VERIFICACI√ìN DE ARCHIVOS DE FASE 2 (Baseline)[0m
----------------------------------------------------------------------
[92m‚úÖ Predicciones Baseline[0m
   Ubicaci√≥n: ../fase 2/outputs/baseline/preds_raw.json
   Tama√±o: 3.23 MB
   [92mFormato: ‚úÖ Correcto (22162 registros)[0m

[1m2. VERIFICACI√ìN DE ARCHIVOS DE FASE 3 (MC-Dropout)[0m
----------------------------------------------------------------------
[92m‚úÖ Predicciones MC-Dropout[0m
   Ubicaci√≥n: ../fase 3/outputs/mc_dropout/preds_mc_aggregated.json
   Tama√±o: 4.35 MB
   [93mFormato: ‚ö†Ô∏è  Faltan keys: ['uncertainty'][0m

[1m3. VERIFICACI√ìN DE ARCHIVOS DE FASE 4 (Temperature)[0m
----------------------------------------------------------------------
[92m‚úÖ Temperaturas Optimizadas[0m
   Ubicaci√≥n: ../fase 4/outputs/temperature_scaling/temperature.json
   Tama√±o: 0.00 MB
   [93mFormato: ‚ö†Ô∏è  Falta 'optimal_temperature'[0m

RESUMEN

[1mArchivos encontr

SystemExit: 0