# BDD100K to COCO Format Conversion

Este notebook convierte el dataset BDD100K al formato COCO y lo divide en:
- **val_calib.json** (80%): Para calibraci√≥n de Temperature Scaling
- **val_eval.json** (20%): Para evaluaci√≥n final

## Fuentes del Dataset:
- https://www.kaggle.com/datasets/awsaf49/bdd100k-dataset
- https://www.kaggle.com/datasets/solesensei/solesensei_bdd100k?resource=download ‚úì (usado)

In [None]:
import json
import os
from pathlib import Path
from collections import defaultdict
from PIL import Image
import numpy as np
from tqdm import tqdm
import shutil

# Configuraci√≥n de rutas
BASE_DIR = Path(r"C:\Users\SP1VEVW\Desktop\projects\OVD-MODEL-EPISTEMIC-UNCERTAINTY\data")
BDD_DIR = BASE_DIR / "bdd100k"
COCO_DIR = BASE_DIR / "bdd100k_coco"

# Crear directorio de salida
COCO_DIR.mkdir(exist_ok=True, parents=True)

print(f"‚úì Base directory: {BASE_DIR}")
print(f"‚úì BDD100K directory: {BDD_DIR}")
print(f"‚úì COCO output directory: {COCO_DIR}")

## 1. Verificar Estructura del Dataset

Verificamos que tenemos las anotaciones y las im√°genes de validaci√≥n:

In [None]:
# Rutas esperadas
labels_val = BDD_DIR / "labels" / "det_20" / "det_val.json"
images_val_dir = BDD_DIR / "images" / "100k" / "val"

# Verificar existencia
if labels_val.exists():
    print(f"‚úì Anotaciones encontradas: {labels_val}")
    with open(labels_val) as f:
        annotations = json.load(f)
    print(f"  Total de im√°genes anotadas: {len(annotations)}")
else:
    print(f"‚úó No se encontraron anotaciones en: {labels_val}")

if images_val_dir.exists():
    image_files = list(images_val_dir.glob("*.jpg"))
    print(f"‚úì Directorio de im√°genes: {images_val_dir}")
    print(f"  Total de im√°genes: {len(image_files)}")
else:
    print(f"‚úó No se encontr√≥ el directorio de im√°genes: {images_val_dir}")

## 2. Mapeo de Categor√≠as BDD100K ‚Üí COCO

BDD100K tiene 10 clases de objetos. Las mapeamos a formato COCO con IDs consecutivos:

In [None]:
# Categor√≠as de BDD100K (10 clases)
BDD_CATEGORIES = [
    "pedestrian",
    "rider",
    "car",
    "truck",
    "bus",
    "train",
    "motorcycle",
    "bicycle",
    "traffic light",
    "traffic sign"
]

# Crear mapeo a formato COCO
categories_coco = []
category_name_to_id = {}

for idx, cat_name in enumerate(BDD_CATEGORIES, start=1):
    categories_coco.append({
        "id": idx,
        "name": cat_name,
        "supercategory": "object"
    })
    category_name_to_id[cat_name] = idx

print("Categor√≠as COCO:")
for cat in categories_coco:
    print(f"  ID {cat['id']}: {cat['name']}")

## 3. Funci√≥n de Conversi√≥n BDD100K ‚Üí COCO

Convertimos las anotaciones de BDD100K (formato personalizado) al formato est√°ndar COCO:

In [None]:
def convert_bdd_to_coco(bdd_annotations, images_dir, category_mapping):
    """
    Convierte anotaciones de BDD100K a formato COCO.
    
    Args:
        bdd_annotations: Lista de anotaciones en formato BDD100K
        images_dir: Directorio con las im√°genes
        category_mapping: Diccionario {nombre_categoria: id_coco}
    
    Returns:
        Diccionario en formato COCO
    """
    coco_output = {
        "images": [],
        "annotations": [],
        "categories": categories_coco
    }
    
    annotation_id = 1
    skipped_images = 0
    
    for img_idx, bdd_img in enumerate(tqdm(bdd_annotations, desc="Convirtiendo")):
        img_name = bdd_img["name"]
        img_path = images_dir / img_name
        
        # Verificar que la imagen existe
        if not img_path.exists():
            skipped_images += 1
            continue
        
        # Obtener dimensiones de la imagen
        try:
            with Image.open(img_path) as img:
                width, height = img.size
        except Exception as e:
            print(f"Error abriendo imagen {img_name}: {e}")
            skipped_images += 1
            continue
        
        # Agregar informaci√≥n de imagen
        image_id = img_idx + 1
        coco_output["images"].append({
            "id": image_id,
            "file_name": img_name,
            "width": width,
            "height": height
        })
        
        # Convertir anotaciones (labels)
        if "labels" in bdd_img:
            for label in bdd_img["labels"]:
                category = label.get("category")
                
                # Verificar que la categor√≠a existe en nuestro mapeo
                if category not in category_mapping:
                    continue
                
                # Obtener bounding box
                box2d = label.get("box2d")
                if not box2d:
                    continue
                
                x1 = box2d["x1"]
                y1 = box2d["y1"]
                x2 = box2d["x2"]
                y2 = box2d["y2"]
                
                # Calcular ancho y alto
                bbox_width = x2 - x1
                bbox_height = y2 - y1
                
                # Validar bbox
                if bbox_width <= 0 or bbox_height <= 0:
                    continue
                
                # Crear anotaci√≥n COCO
                coco_output["annotations"].append({
                    "id": annotation_id,
                    "image_id": image_id,
                    "category_id": category_mapping[category],
                    "bbox": [x1, y1, bbox_width, bbox_height],
                    "area": bbox_width * bbox_height,
                    "iscrowd": 0
                })
                annotation_id += 1
    
    print(f"\n‚úì Im√°genes procesadas: {len(coco_output['images'])}")
    print(f"‚úì Anotaciones creadas: {len(coco_output['annotations'])}")
    if skipped_images > 0:
        print(f"‚ö† Im√°genes omitidas: {skipped_images}")
    
    return coco_output

## 4. Convertir el Dataset Completo

Convertimos todas las im√°genes de validaci√≥n de BDD100K a formato COCO:

In [None]:
# Cargar anotaciones BDD100K
with open(labels_val) as f:
    bdd_val_annotations = json.load(f)

print(f"Total de im√°genes en BDD100K val: {len(bdd_val_annotations)}\n")

# Convertir a COCO
coco_val_full = convert_bdd_to_coco(
    bdd_val_annotations, 
    images_val_dir, 
    category_name_to_id
)

## 5. Dividir en Calibraci√≥n y Evaluaci√≥n

Dividimos el set de validaci√≥n en:
- **val_calib.json** (80%): Para calibrar Temperature Scaling
- **val_eval.json** (20%): Para evaluaci√≥n final del modelo

In [None]:
# Configurar split
np.random.seed(42)
split_ratio = 0.8

# Obtener IDs de im√°genes y mezclar
image_ids = [img["id"] for img in coco_val_full["images"]]
np.random.shuffle(image_ids)

# Dividir
split_idx = int(len(image_ids) * split_ratio)
calib_ids = set(image_ids[:split_idx])
eval_ids = set(image_ids[split_idx:])

print(f"Total de im√°genes: {len(image_ids)}")
print(f"Calibraci√≥n (80%): {len(calib_ids)} im√°genes")
print(f"Evaluaci√≥n (20%): {len(eval_ids)} im√°genes")

# Crear subsets
def create_subset(coco_data, image_ids_subset):
    """Crea un subset de COCO con las im√°genes especificadas"""
    subset = {
        "images": [],
        "annotations": [],
        "categories": coco_data["categories"]
    }
    
    for img in coco_data["images"]:
        if img["id"] in image_ids_subset:
            subset["images"].append(img)
    
    for ann in coco_data["annotations"]:
        if ann["image_id"] in image_ids_subset:
            subset["annotations"].append(ann)
    
    return subset

# Crear splits
coco_val_calib = create_subset(coco_val_full, calib_ids)
coco_val_eval = create_subset(coco_val_full, eval_ids)

print(f"\n‚úì val_calib: {len(coco_val_calib['images'])} im√°genes, {len(coco_val_calib['annotations'])} anotaciones")
print(f"‚úì val_eval: {len(coco_val_eval['images'])} im√°genes, {len(coco_val_eval['annotations'])} anotaciones")

## 6. Guardar Archivos COCO

Guardamos los tres archivos JSON en formato COCO:

In [None]:
# Guardar archivos
output_files = {
    "val_full.json": coco_val_full,
    "val_calib.json": coco_val_calib,
    "val_eval.json": coco_val_eval
}

for filename, data in output_files.items():
    output_path = COCO_DIR / filename
    with open(output_path, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"‚úì Guardado: {output_path}")
    print(f"  - {len(data['images'])} im√°genes")
    print(f"  - {len(data['annotations'])} anotaciones")
    print()

## 7. Verificaci√≥n Final

Verificamos que los archivos se hayan creado correctamente y mostramos estad√≠sticas:

In [None]:
# Estad√≠sticas por categor√≠a
def get_category_stats(coco_data):
    """Obtiene estad√≠sticas de anotaciones por categor√≠a"""
    category_counts = defaultdict(int)
    for ann in coco_data["annotations"]:
        category_counts[ann["category_id"]] += 1
    return category_counts

print("=" * 60)
print("ESTAD√çSTICAS POR CATEGOR√çA")
print("=" * 60)

for split_name, coco_data in [("val_calib", coco_val_calib), ("val_eval", coco_val_eval)]:
    print(f"\n{split_name.upper()}:")
    stats = get_category_stats(coco_data)
    for cat in categories_coco:
        count = stats.get(cat["id"], 0)
        print(f"  {cat['name']:20s}: {count:5d} anotaciones")

print("\n" + "=" * 60)
print("RESUMEN")
print("=" * 60)
print(f"‚úì Dataset convertido exitosamente a formato COCO")
print(f"‚úì Archivos guardados en: {COCO_DIR}")
print(f"‚úì val_calib: {len(coco_val_calib['images'])} im√°genes (80%)")
print(f"‚úì val_eval: {len(coco_val_eval['images'])} im√°genes (20%)")
print(f"‚úì Total de categor√≠as: {len(categories_coco)}")
print("=" * 60)

## üìå Pr√≥ximos Pasos

Los archivos COCO generados est√°n listos para ser usados:

1. **val_calib.json** ‚Üí Para entrenar/calibrar Temperature Scaling (Fase 5)
2. **val_eval.json** ‚Üí Para evaluaci√≥n final del modelo con incertidumbre calibrada

### Uso en Fases siguientes:

```python
# Fase 5: Cargar datos de calibraci√≥n
calib_data = "data/bdd100k_coco/val_calib.json"

# Fase 5: Cargar datos de evaluaci√≥n
eval_data = "data/bdd100k_coco/val_eval.json"
```

### Notas importantes:

- ‚úì Las im√°genes permanecen en `data/bdd100k/images/100k/val/`
- ‚úì Los archivos JSON solo contienen las anotaciones y referencias a las im√°genes
- ‚úì El split es reproducible (seed=42)
- ‚úì No hay overlap entre calibraci√≥n y evaluaci√≥n

In [3]:
import os
HOME = os.getcwd()
print(HOME)  
BDD100K_DIR = os.path.join(HOME, "bdd100k")
os.makedirs(BDD100K_DIR, exist_ok=True)
print(BDD100K_DIR, "; exist:", os.path.exists(BDD100K_DIR))

c:\Users\SP1VEVW\Desktop\projects\OVD-Model-ADAS\data
c:\Users\SP1VEVW\Desktop\projects\OVD-Model-ADAS\data\bdd100k ; exist: True


In [None]:
#pip install pycocotools

Collecting pycocotools
  Using cached pycocotools-2.0.10-cp312-abi3-win_amd64.whl.metadata (1.3 kB)
Using cached pycocotools-2.0.10-cp312-abi3-win_amd64.whl (76 kB)
Installing collected packages: pycocotools
Successfully installed pycocotools-2.0.10
Note: you may need to restart the kernel to use updated packages.


In [4]:
import json
import os
from pathlib import Path
from datetime import datetime
import random
from pycocotools.coco import COCO

# ====== 1. VERIFICAR DATOS ======
print("="*50)
print("1. VERIFICANDO DATOS")
print("="*50)

# Rutas
VAL_IMAGES_DIR = os.path.join(BDD100K_DIR, "bdd100k/bdd100k/images/100k/val")
VAL_LABELS_FILE = os.path.join(BDD100K_DIR, "bdd100k_labels_release/bdd100k/labels/bdd100k_labels_images_val.json")

# Verificar existencia
print(f"\nIm√°genes Val: {os.path.exists(VAL_IMAGES_DIR)}")
print(f"Labels Val: {os.path.exists(VAL_LABELS_FILE)}")

# Contar im√°genes
val_images = [f for f in os.listdir(VAL_IMAGES_DIR) if f.endswith('.jpg')]
print(f"Total im√°genes Val: {len(val_images)}")

# Cargar labels
with open(VAL_LABELS_FILE, 'r') as f:
    val_labels = json.load(f)
print(f"Total anotaciones Val: {len(val_labels)}")

# Categor√≠as en BDD100K
categories_bdd = {}
for item in val_labels[:500]:  # Verificar primeras 500
    for label in item.get('labels', []):
        cat = label.get('category')
        if cat:
            categories_bdd[cat] = categories_bdd.get(cat, 0) + 1

print(f"\nCategor√≠as encontradas: {len(categories_bdd)}")
for cat, count in sorted(categories_bdd.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  - {cat}: {count}")

# ====== 2. CONVERTIR A FORMATO COCO ======
print("\n" + "="*50)
print("2. CONVIRTIENDO A FORMATO COCO")
print("="*50)

# Mapeo de categor√≠as BDD100K a IDs
# Nota: BDD100K usa 'bike', no 'bicycle'
CATEGORY_MAP = {
    'person': 1, 'rider': 2, 'car': 3, 'truck': 4, 'bus': 5,
    'train': 6, 'motorcycle': 7, 'bicycle': 8, 'traffic light': 9,
    'traffic sign': 10
}

# Alias para categor√≠as alternativas
CATEGORY_ALIASES = {
    'bike': 'bicycle'
}

def convert_to_coco(bdd_labels, images_dir, split_name):
    """Convierte BDD100K a formato COCO"""
    coco_format = {
        "info": {
            "description": f"BDD100K {split_name} Dataset - COCO Format",
            "version": "1.0",
            "year": 2024,
            "date_created": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        },
        "licenses": [],
        "images": [],
        "annotations": [],
        "categories": []
    }
    
    # Categor√≠as
    for cat_name, cat_id in CATEGORY_MAP.items():
        coco_format["categories"].append({
            "id": cat_id,
            "name": cat_name,
            "supercategory": "object"
        })
    
    annotation_id = 0
    
    # Procesar cada imagen
    for img_idx, item in enumerate(bdd_labels):
        img_name = item['name']
        img_path = os.path.join(images_dir, img_name)
        
        # Verificar que la imagen existe
        if not os.path.exists(img_path):
            continue
        
        # Informaci√≥n de imagen
        image_info = {
            "id": img_idx,
            "file_name": img_name,
            "width": 1280,  # BDD100K est√°ndar
            "height": 720
        }
        coco_format["images"].append(image_info)
        
        # Procesar anotaciones
        for label in item.get('labels', []):
            category = label.get('category')
            
            # Aplicar alias si existe (ej: bike -> bicycle)
            if category in CATEGORY_ALIASES:
                category = CATEGORY_ALIASES[category]
            
            if category not in CATEGORY_MAP:
                continue
            
            box2d = label.get('box2d')
            if not box2d:
                continue
            
            # Calcular bbox COCO (x, y, width, height)
            x1 = box2d['x1']
            y1 = box2d['y1']
            x2 = box2d['x2']
            y2 = box2d['y2']
            
            width = x2 - x1
            height = y2 - y1
            area = width * height
            
            # Validar bbox
            if width <= 0 or height <= 0:
                continue
            
            annotation = {
                "id": annotation_id,
                "image_id": img_idx,
                "category_id": CATEGORY_MAP[category],
                "bbox": [x1, y1, width, height],
                "area": area,
                "iscrowd": 0,
                "segmentation": []
            }
            coco_format["annotations"].append(annotation)
            annotation_id += 1
    
    return coco_format

# Convertir dataset completo
print("\nConvirtiendo dataset completo...")
coco_val_full = convert_to_coco(val_labels, VAL_IMAGES_DIR, "validation_full")

print(f"Total im√°genes procesadas: {len(coco_val_full['images'])}")
print(f"Total anotaciones: {len(coco_val_full['annotations'])}")

# ====== 3. DIVIDIR EN 80% TRAIN / 20% VAL ======
print("\n" + "="*50)
print("3. DIVIDIENDO EN 80% TRAIN / 20% VAL")
print("="*50)

# Mezclar im√°genes
all_images = coco_val_full['images'].copy()
random.seed(42)  # Para reproducibilidad
random.shuffle(all_images)

# Calcular split
total_imgs = len(all_images)
train_size = int(total_imgs * 0.8)
val_size = total_imgs - train_size

train_images = all_images[:train_size]
val_images = all_images[train_size:]

print(f"\nTotal: {total_imgs} im√°genes")
print(f"Train: {train_size} im√°genes (80%)")
print(f"Val: {val_size} im√°genes (20%)")

# Crear diccionarios de IDs
train_img_ids = {img['id'] for img in train_images}
val_img_ids = {img['id'] for img in val_images}

# Funci√≥n para crear split
def create_split(images, img_ids, split_name):
    split_data = {
        "info": coco_val_full["info"].copy(),
        "licenses": coco_val_full["licenses"],
        "images": images,
        "annotations": [],
        "categories": coco_val_full["categories"]
    }
    split_data["info"]["description"] = f"BDD100K {split_name} Dataset - COCO Format"
    
    # Filtrar anotaciones
    for ann in coco_val_full['annotations']:
        if ann['image_id'] in img_ids:
            split_data['annotations'].append(ann)
    
    return split_data

# Crear splits
train_coco = create_split(train_images, train_img_ids, "train")
val_coco = create_split(val_images, val_img_ids, "val")

print(f"\nTrain: {len(train_coco['images'])} imgs, {len(train_coco['annotations'])} anns")
print(f"Val: {len(val_coco['images'])} imgs, {len(val_coco['annotations'])} anns")

# ====== 4. GUARDAR ARCHIVOS COCO ======
print("\n" + "="*50)
print("4. GUARDANDO ARCHIVOS COCO")
print("="*50)

# Crear directorio de salida
COCO_OUTPUT_DIR = os.path.join(HOME, "bdd100k_coco")
os.makedirs(COCO_OUTPUT_DIR, exist_ok=True)

# Guardar archivos
train_file = os.path.join(COCO_OUTPUT_DIR, "val_calib.json")
val_file = os.path.join(COCO_OUTPUT_DIR, "val_eval.json")

with open(train_file, 'w') as f:
    json.dump(train_coco, f)
print(f"\n‚úì Train guardado: {train_file}")

with open(val_file, 'w') as f:
    json.dump(val_coco, f)
print(f"‚úì Val guardado: {val_file}")

# ====== 5. ESTAD√çSTICAS FINALES ======
print("\n" + "="*50)
print("5. ESTAD√çSTICAS FINALES")
print("="*50)

def print_stats(data, name):
    print(f"\n{name}:")
    print(f"  Im√°genes: {len(data['images'])}")
    print(f"  Anotaciones: {len(data['annotations'])}")
    print(f"  Promedio ann/img: {len(data['annotations'])/len(data['images']):.2f}")
    
    # Distribuci√≥n por categor√≠a
    cat_dist = {}
    for ann in data['annotations']:
        cat_id = ann['category_id']
        cat_name = next(c['name'] for c in data['categories'] if c['id'] == cat_id)
        cat_dist[cat_name] = cat_dist.get(cat_name, 0) + 1
    
    print("  Top 5 categor√≠as:")
    for cat, count in sorted(cat_dist.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"    - {cat}: {count}")

print_stats(train_coco, "TRAIN")
print_stats(val_coco, "VAL")

print("\n" + "="*50)
print("‚úì PROCESO COMPLETADO")
print("="*50)
print(f"\nArchivos generados:")
print(f"1. {train_file}")
print(f"2. {val_file}")
print(f"\nAhora puedes usar COCOeval para calcular mAP, AP@50, F1, etc.")

# ====== 6. VALIDAR CON PYCOCOTOOLS ======
print("\n" + "="*50)
print("6. VALIDANDO CON PYCOCOTOOLS")
print("="*50)

# Validar archivo de calibraci√≥n (80%)
print("\n‚úì Validando val_calib.json (80%)...")
coco_calib = COCO(train_file)
print(f"  - Cargado exitosamente")
print(f"  - Im√°genes: {len(coco_calib.getImgIds())}")
print(f"  - Categor√≠as: {len(coco_calib.getCatIds())}")
print(f"  - Anotaciones: {len(coco_calib.getAnnIds())}")
    
# Validar archivo de evaluaci√≥n (20%)
print("\n‚úì Validando val_eval.json (20%)...")
coco_eval = COCO(val_file)
print(f"  - Cargado exitosamente")
print(f"  - Im√°genes: {len(coco_eval.getImgIds())}")
print(f"  - Categor√≠as: {len(coco_eval.getCatIds())}")
print(f"  - Anotaciones: {len(coco_eval.getAnnIds())}")
    
# Mostrar categor√≠as
print("\n‚úì Categor√≠as disponibles:")
for cat in coco_calib.loadCats(coco_calib.getCatIds()):
    print(f"  - ID {cat['id']}: {cat['name']}")
    
print("\n‚úì Archivos COCO validados correctamente con pycocotools")
print("  Listos para usar con COCOeval para calcular m√©tricas")
    

1. VERIFICANDO DATOS

Im√°genes Val: True
Labels Val: True
Total im√°genes Val: 10000
Total anotaciones Val: 10000

Categor√≠as encontradas: 12
  - car: 5062
  - lane: 3808
  - traffic sign: 1754
  - traffic light: 1374
  - drivable area: 873
  - person: 746
  - truck: 212
  - bus: 91
  - bike: 40
  - rider: 35

2. CONVIRTIENDO A FORMATO COCO

Convirtiendo dataset completo...
Total anotaciones Val: 10000

Categor√≠as encontradas: 12
  - car: 5062
  - lane: 3808
  - traffic sign: 1754
  - traffic light: 1374
  - drivable area: 873
  - person: 746
  - truck: 212
  - bus: 91
  - bike: 40
  - rider: 35

2. CONVIRTIENDO A FORMATO COCO

Convirtiendo dataset completo...
Total im√°genes procesadas: 10000
Total anotaciones: 185074

3. DIVIDIENDO EN 80% TRAIN / 20% VAL

Total: 10000 im√°genes
Train: 8000 im√°genes (80%)
Val: 2000 im√°genes (20%)

Train: 8000 imgs, 148515 anns
Val: 2000 imgs, 36559 anns

4. GUARDANDO ARCHIVOS COCO
Total im√°genes procesadas: 10000
Total anotaciones: 185074

3. DI