YOLO Dataset Preparation and Validation Pipeline

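This notebook downloads a Pascal VOC dataset through Kaggle Hub and prepares it for YOLO training.
The Kaggle dataset actually downloaded is zaraks/pascal-voc-2007 (VOC 2007), although the working directory keeps the name VOC2012.
It keeps images that contain at least one target-class object (person, car, dog), converts those annotations
to YOLO format (normalized bounding boxes), and validates data integrity.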

Workflow:
1. Create directory structure for YOLO dataset format
2. Download the Pascal VOC dataset from Kaggle Hub (zaraks/pascal-voc-2007)
3. Keep images that contain at least one target-class object (person, car, dog)
4. Convert XML annotations to YOLO normalized format
5. Split dataset into train/val/test (70/15/15 ratio)
6. Validate YOLO format compliance
7. Generate data.yaml configuration file
8. Display statistics and class distribution

Dataset: Pascal VOC 2007, downloaded from Kaggle Hub (zaraks/pascal-voc-2007) and stored under VOCdevkit/VOC2012
Target Classes: person (0), car (1), dog (2)
Output Format: YOLO format with normalized bounding boxes
Expected Size: 2895 images after filtering for target classes (from the 4952 images in the run recorded below)
Download Size: ~1.65GB (compressed)


In [12]:
import os
import xml.etree.ElementTree as ET
import numpy as np
import yaml
from pathlib import Path
from PIL import Image
import shutil
import tarfile

# Install kagglehub for dataset download
import subprocess
import sys

try:
    import kagglehub
except ImportError:
    print("Installing kagglehub...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "kagglehub", "-q"])
    import kagglehub

np.random.seed(42)

# Configuration
DATA_DIR = Path('../data')
TARGET_CLASSES = ['person', 'car', 'dog']
SPLITS = ['train', 'val', 'test']

print("PASCAL VOC 2012 DATASET PREPARATION AND VALIDATION")
print("=" * 80)
print(f"Target classes: {TARGET_CLASSES}")
print(f"Dataset source: Pascal VOC 2012 (via Kaggle Hub)")
print(f"Output format: YOLO (normalized bounding boxes)")
print(f"Output directory: {DATA_DIR}")
print("=" * 80)


Installing kagglehub...
PASCAL VOC 2012 DATASET PREPARATION AND VALIDATION
Target classes: ['person', 'car', 'dog']
Dataset source: Pascal VOC 2012 (via Kaggle Hub)
Output format: YOLO (normalized bounding boxes)
Output directory: ..\data


Setup and Imports

This cell initializes the Python environment with all necessary libraries and sets up configuration parameters for the dataset preparation pipeline.

Libraries used:
- pathlib: For cross-platform file path handling
- PIL: For image processing and validation
- numpy: For numerical operations and random shuffling
- xml.etree.ElementTree: For parsing XML annotation files from Pascal VOC
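- kagglehub: For downloading the dataset from Kaggle Hub (installed with pip if missing)
- yaml: For writing the data.yaml configuration file
- tarfile: Imported but not used in this notebook, since kagglehub extracts the archive itself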
- shutil: For file copying operations

Configuration parameters:
- DATA_DIR: Base directory for storing dataset files
- TARGET_CLASSES: Classes to filter (person, car, dog)
- SPLITS: Data division names (train, val, test)
- Random seed: np.random.seed(42) fixes NumPy's global seed for reproducibility (there is no separate SEED variable)


In [8]:
# BLOCK 1: Directory Structure Initialization
# ===========================================================================
# Create YOLO-compatible directory structure for storing images and labels

print("\nBLOCK 1: Directory Structure Initialization")
print("-" * 80)

DIRECTORY_STRUCTURE = {
    'images_train': DATA_DIR / 'images' / 'train',
    'images_val': DATA_DIR / 'images' / 'val',
    'images_test': DATA_DIR / 'images' / 'test',
    'labels_train': DATA_DIR / 'labels' / 'train',
    'labels_val': DATA_DIR / 'labels' / 'val',
    'labels_test': DATA_DIR / 'labels' / 'test',
}

for dir_name, dir_path in DIRECTORY_STRUCTURE.items():
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f"Created directory: {dir_path}")

print("\nDirectory structure initialized successfully")



BLOCK 1: Directory Structure Initialization
--------------------------------------------------------------------------------
Created directory: ..\data\images\train
Created directory: ..\data\images\val
Created directory: ..\data\images\test
Created directory: ..\data\labels\train
Created directory: ..\data\labels\val
Created directory: ..\data\labels\test

Directory structure initialized successfully


Block 1: Directory Structure Initialization

This block creates the YOLO-compatible directory structure required for organizing images and label files.

YOLO requires a specific folder layout:
- images/train, images/val, images/test: Store image files
- labels/train, labels/val, labels/test: Store corresponding annotation files

All directories are created with mkdir(parents=True, exist_ok=True) to safely create nested directories if they do not exist.


In [14]:
# BLOCK 2: Pascal VOC Dataset Download using Kaggle Hub
# ===========================================================================
# Download Pascal VOC dataset from Kaggle Hub

print("\n\nBLOCK 2: Pascal VOC Dataset Download (Kaggle Hub)")
print("-" * 80)

voc_raw_dir = DATA_DIR / 'VOCdevkit' / 'VOC2012'
voc_images_dir = voc_raw_dir / 'JPEGImages'
voc_annotations_dir = voc_raw_dir / 'Annotations'

# Check if dataset already exists
dataset_exists = voc_images_dir.exists() and voc_annotations_dir.exists()

if dataset_exists:
    img_count = len(list(voc_images_dir.glob('*.jpg')))
    print(f"Pascal VOC dataset already cached - {img_count} images found")
    print("Skipping download")
else:
    print("Downloading Pascal VOC from Kaggle Hub...")
    print("Dataset: zaraks/pascal-voc-2007")
    print("Note: This download may take 5-15 minutes depending on internet speed\n")
    
    try:
        # Download dataset from Kaggle Hub
        print("Downloading...")
        kaggle_path = Path(kagglehub.dataset_download("zaraks/pascal-voc-2007"))
        print(f"Dataset downloaded to: {kaggle_path}")
        
        # Find the VOC directory structure
        print("\nSearching for VOC structure...")
        
        voc_source_dir = None
        
        # Check common VOC paths from Kaggle Hub
        potential_paths = [
            kaggle_path / 'VOCtest_06-Nov-2007' / 'VOCdevkit' / 'VOC2007',
            kaggle_path / 'VOCtrainval_06-Nov-2007' / 'VOCdevkit' / 'VOC2007',
            kaggle_path / 'PASCAL_VOC' / 'VOCdevkit' / 'VOC2007',
        ]
        
        for potential_path in potential_paths:
            if potential_path.exists():
                if (potential_path / 'JPEGImages').exists() and (potential_path / 'Annotations').exists():
                    voc_source_dir = potential_path
                    print(f"Found VOC structure at: {voc_source_dir}")
                    break
        
        if voc_source_dir is None:
            # Search recursively for VOCdevkit
            print("Searching recursively for VOC structure...")
            for root, dirs, files in os.walk(kaggle_path):
                if 'JPEGImages' in dirs and 'Annotations' in dirs:
                    voc_source_dir = Path(root)
                    print(f"Found VOC structure at: {voc_source_dir}")
                    break
        
        if voc_source_dir and voc_source_dir.exists():
            # Copy VOC to our data directory
            target_voc_dir = DATA_DIR / 'VOCdevkit' / 'VOC2012'
            target_voc_dir.parent.mkdir(parents=True, exist_ok=True)
            
            print(f"\nCopying dataset to: {target_voc_dir}")
            
            # Copy images
            src_images = voc_source_dir / 'JPEGImages'
            if src_images.exists():
                dst_images = target_voc_dir / 'JPEGImages'
                if dst_images.exists():
                    shutil.rmtree(dst_images)
                shutil.copytree(src_images, dst_images)
                img_count = len(list(dst_images.glob('*.jpg')))
                print(f"  Copied {img_count} images")
            
            # Copy annotations
            src_annotations = voc_source_dir / 'Annotations'
            if src_annotations.exists():
                dst_annotations = target_voc_dir / 'Annotations'
                if dst_annotations.exists():
                    shutil.rmtree(dst_annotations)
                shutil.copytree(src_annotations, dst_annotations)
                ann_count = len(list(dst_annotations.glob('*.xml')))
                print(f"  Copied {ann_count} annotation files")
            
            # Copy ImageSets if available
            src_imagesets = voc_source_dir / 'ImageSets'
            if src_imagesets.exists():
                dst_imagesets = target_voc_dir / 'ImageSets'
                if dst_imagesets.exists():
                    shutil.rmtree(dst_imagesets)
                shutil.copytree(src_imagesets, dst_imagesets)
                print(f"  Copied ImageSets directory")
            
            print("\nDataset preparation complete")
        else:
            raise FileNotFoundError(f"Could not find VOC structure (JPEGImages + Annotations) in downloaded dataset")
            
    except Exception as e:
        print(f"\nERROR: Failed to download dataset")
        print(f"Type: {type(e).__name__}")
        print(f"Message: {str(e)[:300]}")
        print(f"\nTroubleshooting:")
        print("  1. Ensure Kaggle API is configured: ~/.kaggle/kaggle.json")
        print("  2. Check internet connection")
        print("  3. Visit: https://www.kaggle.com/datasets/zaraks/pascal-voc-2007")
        raise

print("Pascal VOC dataset ready")




BLOCK 2: Pascal VOC Dataset Download (Kaggle Hub)
--------------------------------------------------------------------------------
Downloading Pascal VOC from Kaggle Hub...
Dataset: zaraks/pascal-voc-2007
Note: This download may take 5-15 minutes depending on internet speed

Downloading...
Dataset downloaded to: C:\Users\mlata\.cache\kagglehub\datasets\zaraks\pascal-voc-2007\versions\1

Searching for VOC structure...
Found VOC structure at: C:\Users\mlata\.cache\kagglehub\datasets\zaraks\pascal-voc-2007\versions\1\VOCtest_06-Nov-2007\VOCdevkit\VOC2007

Copying dataset to: ..\data\VOCdevkit\VOC2012
  Copied 4952 images
  Copied 4952 annotation files
  Copied ImageSets directory

Dataset preparation complete
Pascal VOC dataset ready


Block 2: Pascal VOC Dataset Download (Kaggle Hub)

This block downloads a copy of the Pascal VOC 2007 dataset hosted on Kaggle Hub (zaraks/pascal-voc-2007) rather than fetching it from the official VOC server.

Dataset characteristics:
- Source: Kaggle Hub (zaraks/pascal-voc-2007)
- Size: ~1.65GB compressed download
- Images: 4952 annotated images (the VOC 2007 test archive, which is the first candidate path the code finds)
- Classes: 20 object classes including person, car, dog
- Format: JPEG images with XML annotations

Download process:
- Uses kagglehub Python library for automated download
- Automatically extracts to local cache
- Caches dataset: subsequent runs reuse existing files (no re-download)
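- Checks known VOC 2007 paths first, then falls back to a recursive search for any directory that contains both JPEGImages and Annotations
- Copies dataset to ../data/VOCdevkit/VOC2012 for processing (the directory keeps the VOC2012 name even though the data is VOC 2007)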

Directory structure after download:
- VOCdevkit/VOC2012/JPEGImages/: Image files
- VOCdevkit/VOC2012/Annotations/: XML annotation files
- VOCdevkit/VOC2012/ImageSets/: Image split lists

Requirements:
- Kaggle API credentials in ~/.kaggle/kaggle.json, if Kaggle asks for authentication for this dataset
- Internet connection for first download only
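
The recursive fallback described above can be isolated into a small helper. The following is a minimal sketch, not the notebook's own code: the function name find_voc_root is hypothetical, and it assumes that a VOC root is any directory holding both a JPEGImages and an Annotations subdirectory.

```python
import os
from pathlib import Path


def find_voc_root(search_dir):
    """Return the first directory under search_dir that contains both
    'JPEGImages' and 'Annotations', or None if there is no such directory.

    find_voc_root is a hypothetical helper name used for illustration.
    """
    for root, dirs, _files in os.walk(search_dir):
        if "JPEGImages" in dirs and "Annotations" in dirs:
            return Path(root)
    return None
```

Calling find_voc_root(kaggle_path) would return whichever matching directory os.walk reaches first, so when an archive ships several VOC trees (for example trainval and test) the result should be checked before copying.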


In [15]:
# BLOCK 3: Parse Pascal VOC XML Annotations
# ===========================================================================
# Parse XML annotation files and identify images with target classes

print("\n\nBLOCK 3: Parse Pascal VOC XML Annotations")
print("-" * 80)

CLASS_TO_ID = {cls: idx for idx, cls in enumerate(TARGET_CLASSES)}

voc_raw_dir = DATA_DIR / 'VOCdevkit' / 'VOC2012'
voc_images_dir = voc_raw_dir / 'JPEGImages'
voc_annotations_dir = voc_raw_dir / 'Annotations'

print("Loading Pascal VOC annotations...")

# Parse all annotation files
annotation_files = sorted(voc_annotations_dir.glob('*.xml'))
print(f"Found {len(annotation_files)} annotation files")

voc_data_list = []

for idx, xml_file in enumerate(annotation_files):
    if (idx + 1) % 1000 == 0:
        print(f"  Processed {idx + 1} annotations...")
    
    try:
        tree = ET.parse(str(xml_file))
        root = tree.getroot()
        
        # Extract image filename
        filename = root.find('filename').text
        size = root.find('size')
        img_width = int(size.find('width').text)
        img_height = int(size.find('height').text)
        
        image_path = voc_images_dir / filename
        
        # Skip if image doesn't exist
        if not image_path.exists():
            continue
        
        # Extract objects
        objects = []
        for obj in root.findall('object'):
            class_name = obj.find('name').text
            
            # Only keep target classes
            if class_name in CLASS_TO_ID:
                bndbox = obj.find('bndbox')
                x_min = float(bndbox.find('xmin').text)
                y_min = float(bndbox.find('ymin').text)
                x_max = float(bndbox.find('xmax').text)
                y_max = float(bndbox.find('ymax').text)
                
                # Convert to YOLO format
                x_center = (x_min + x_max) / 2.0 / img_width
                y_center = (y_min + y_max) / 2.0 / img_height
                norm_width = (x_max - x_min) / img_width
                norm_height = (y_max - y_min) / img_height
                
                class_id = CLASS_TO_ID[class_name]
                objects.append((class_id, x_center, y_center, norm_width, norm_height))
        
        # Keep image if it has target objects
        if objects:
            voc_data_list.append({
                'image_path': image_path,
                'filename': filename,
                'objects': objects,
                'width': img_width,
                'height': img_height
            })
    
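    except Exception as e:
        # Skip malformed annotations, but report them instead of failing silently
        print(f"  Skipping {xml_file.name}: {type(e).__name__}: {e}")
        continue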

print(f"\nImages with target classes: {len(voc_data_list)}")
print(f"Ready for splitting and organization")




BLOCK 3: Parse Pascal VOC XML Annotations
--------------------------------------------------------------------------------
Loading Pascal VOC annotations...
Found 4952 annotation files
  Processed 1000 annotations...
  Processed 2000 annotations...
  Processed 3000 annotations...
  Processed 4000 annotations...

Images with target classes: 2895
Ready for splitting and organization


Block 3: Parse Pascal VOC XML Annotations

This block parses the Pascal VOC XML annotation files and converts their bounding boxes to YOLO format, keeping only objects of the target classes.

Pascal VOC XML format:
- filename: Name of image file
- size: Image dimensions (width, height, depth)
- object elements: Each contains class name and bounding box
- bndbox: Coordinates (xmin, ymin, xmax, ymax) in pixel space

Processing steps:
1. Iterate through all XML annotation files
2. Extract image metadata and object list
3. Filter: keep only target-class objects (person, car, dog) and drop images that contain none of them
4. Convert bounding boxes from Pascal VOC format (pixel corner coords) to YOLO format (normalized center coords)
5. Store processed images for next blocks

Coordinate transformation:
- Pascal VOC: (x_min, y_min, x_max, y_max) in pixels
- YOLO: (x_center, y_center, width, height) normalized to 0-1
- x_center = (x_min + x_max) / (2 * image_width)
- y_center = (y_min + y_max) / (2 * image_height)
- width = (x_max - x_min) / image_width
- height = (y_max - y_min) / image_height
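
The transformation above can be written as a standalone function. This is a minimal sketch that mirrors the formulas listed in this block; the name voc_to_yolo and the example box are illustrative and do not come from the dataset.

```python
def voc_to_yolo(x_min, y_min, x_max, y_max, img_width, img_height):
    """Convert a Pascal VOC pixel box (corner coordinates) to a YOLO box
    (center coordinates and size, each normalized by the image dimensions)."""
    x_center = (x_min + x_max) / 2.0 / img_width
    y_center = (y_min + y_max) / 2.0 / img_height
    width = (x_max - x_min) / img_width
    height = (y_max - y_min) / img_height
    return x_center, y_center, width, height


# Illustrative box: corners (100, 50) and (300, 250) in a 500x400 image
box = voc_to_yolo(100, 50, 300, 250, img_width=500, img_height=400)
print(box)  # (0.4, 0.375, 0.4, 0.5)
```

Each value stays within 0-1 as long as the pixel box lies inside the image, which is why an out-of-range value in a label file usually points to a wrong image size or a corrupted annotation.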


In [16]:
# BLOCK 4: Split Dataset and Copy Files
# ===========================================================================
# Split data into train/val/test and copy to YOLO directory structure

print("\n\nBLOCK 4: Dataset Split and File Organization")
print("-" * 80)

# Shuffle; the order is reproducible only because np.random.seed(42) ran in the setup cell
shuffled_indices = np.random.permutation(len(voc_data_list))
voc_shuffled = [voc_data_list[i] for i in shuffled_indices]

# Calculate split sizes
n_total = len(voc_shuffled)
n_train = int(0.70 * n_total)
n_val = int(0.15 * n_total)
n_test = n_total - n_train - n_val

split_indices = {
    'train': list(range(0, n_train)),
    'val': list(range(n_train, n_train + n_val)),
    'test': list(range(n_train + n_val, n_total))
}

print(f"Total images to process: {n_total}")
print(f"Train: {len(split_indices['train'])} ({len(split_indices['train'])/n_total*100:.1f}%)")
print(f"Val: {len(split_indices['val'])} ({len(split_indices['val'])/n_total*100:.1f}%)")
print(f"Test: {len(split_indices['test'])} ({len(split_indices['test'])/n_total*100:.1f}%)")

# Copy images and create labels
split_counters = {'train': 0, 'val': 0, 'test': 0}

for split_name, indices in split_indices.items():
    print(f"\nCopying {split_name} images...")
    
    for idx_pos, idx in enumerate(indices):
        img_data = voc_shuffled[idx]
        
        # Copy image
        src_img = img_data['image_path']
        dst_img_name = f"{split_name}_{split_counters[split_name]:04d}.jpg"
        dst_img = DIRECTORY_STRUCTURE[f'images_{split_name}'] / dst_img_name
        shutil.copy2(str(src_img), str(dst_img))
        
        # Create YOLO label
        label_path = DIRECTORY_STRUCTURE[f'labels_{split_name}'] / dst_img_name.replace('.jpg', '.txt')
        with open(label_path, 'w') as f:
            for class_id, x_center, y_center, norm_width, norm_height in img_data['objects']:
                f.write(f"{class_id} {x_center:.6f} {y_center:.6f} {norm_width:.6f} {norm_height:.6f}\n")
        
        split_counters[split_name] += 1
        
        if (idx_pos + 1) % 200 == 0:
            print(f"  Progress: {idx_pos + 1}/{len(indices)}")

print("\nDataset split and copied successfully")




BLOCK 4: Dataset Split and File Organization
--------------------------------------------------------------------------------
Total images to process: 2895
Train: 2026 (70.0%)
Val: 434 (15.0%)
Test: 435 (15.0%)

Copying train images...
  Progress: 200/2026
  Progress: 400/2026
  Progress: 600/2026
  Progress: 800/2026
  Progress: 1000/2026
  Progress: 1200/2026
  Progress: 1400/2026
  Progress: 1600/2026
  Progress: 1800/2026
  Progress: 2000/2026

Copying val images...
  Progress: 200/434
  Progress: 400/434

Copying test images...
  Progress: 200/435
  Progress: 400/435

Dataset split and copied successfully


Block 4: Dataset Split and File Organization

This block divides the filtered Pascal VOC images into train/val/test splits and copies them into the YOLO directory structure.

Split strategy:
- Combines all filtered images (those with target classes)
- Shuffles with fixed seed (42) for reproducibility
- Allocates 70% to train, 15% to val, 15% to test
- Maintains 1:1 correspondence between images and label files

For each split:
1. Copy original image file to destination directory
2. Create corresponding label file in YOLO format
3. Write normalized bounding boxes (one object per line)
4. Rename files for clarity (split_0000.jpg, etc.)

Final YOLO directory structure:
- data/images/train/train_XXXX.jpg
- data/images/val/val_XXXX.jpg
- data/images/test/test_XXXX.jpg
- data/labels/train/train_XXXX.txt
- data/labels/val/val_XXXX.txt
- data/labels/test/test_XXXX.txt
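
The split arithmetic can be reproduced in isolation. The sketch below is an assumption-laden variant rather than the notebook's code: split_dataset is a hypothetical name, and it uses a local random.Random(seed) instead of NumPy's global seed, so the ordering differs from the notebook's even though the split sizes are the same.

```python
import random


def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle items with a local, seeded RNG and split them into
    train/val/test. The test split receives whatever remains, so rounding
    never drops an item."""
    rng = random.Random(seed)  # local RNG: independent of any global seed
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


splits = split_dataset(range(2895))
print({name: len(part) for name, part in splits.items()})
# {'train': 2026, 'val': 434, 'test': 435}
```

A local generator keeps the split stable even when the splitting cell is re-run on its own, which a global seed set in an earlier cell does not guarantee.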

In [17]:
# BLOCK 5: Dataset Summary Before Validation
# ===========================================================================
# Display summary statistics of organized dataset

print("\n\nBLOCK 5: Dataset Organization Summary")
print("-" * 80)

print("Dataset successfully organized into YOLO structure")
print("\nDirectories created:")
for split in SPLITS:
    imgs = list(DIRECTORY_STRUCTURE[f'images_{split}'].glob('*.jpg'))
    lbls = list(DIRECTORY_STRUCTURE[f'labels_{split}'].glob('*.txt'))
    print(f"  {split}: {len(imgs)} images, {len(lbls)} labels")

print("\nReady for validation in next block")




BLOCK 5: Dataset Organization Summary
--------------------------------------------------------------------------------
Dataset successfully organized into YOLO structure

Directories created:
  train: 2026 images, 2026 labels
  val: 434 images, 434 labels
  test: 435 images, 435 labels

Ready for validation in next block


Block 5: Dataset Organization Summary

This block displays a summary of the organized dataset before validation checks.

Output information:
- Number of images in each split (train, val, test)
- Number of labels in each split
- Visual confirmation that files are copied
- A check that image and label counts match in each split

This provides a quick overview of the dataset structure before the more detailed validation.
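
Matching counts do not prove that every image has a label with the same name. A stricter pairing check compares file stems, as in this minimal sketch; unpaired_files is a hypothetical helper, and it assumes .jpg images and .txt labels as produced by Block 4.

```python
from pathlib import Path


def unpaired_files(images_dir, labels_dir):
    """Return (images_without_label, labels_without_image) as sorted lists
    of file stems, comparing *.jpg images against *.txt labels."""
    image_stems = {p.stem for p in Path(images_dir).glob("*.jpg")}
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Two empty lists mean that the split is fully paired; any stem that is returned identifies the file to inspect.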

In [18]:
# BLOCK 6: Dataset Validation
# ===========================================================================
# Validate dataset structure and annotation format

print("\n\nBLOCK 6: Dataset Validation")
print("-" * 80)

split_stats = {}

for split in SPLITS:
    images_dir = DIRECTORY_STRUCTURE[f'images_{split}']
    labels_dir = DIRECTORY_STRUCTURE[f'labels_{split}']
    
    img_files = sorted(images_dir.glob('*.jpg'))
    label_files = sorted(labels_dir.glob('*.txt'))
    
    img_count = len(img_files)
    label_count = len(label_files)
    
    # Validate 1:1 correspondence
    if img_count != label_count:
        raise AssertionError(f"{split}: image count ({img_count}) != label count ({label_count})")
    
    # Count total objects and class distribution
    total_objects = 0
    class_distribution = {cls: 0 for cls in TARGET_CLASSES}
    
    for label_file in label_files:
        with open(label_file, 'r') as f:
            for line in f:
                parts = line.split()
                # Count only well-formed lines with an in-range class ID,
                # so blank or malformed lines do not inflate the object total
                if len(parts) == 5:
                    class_id = int(parts[0])
                    if 0 <= class_id < len(TARGET_CLASSES):
                        total_objects += 1
                        class_distribution[TARGET_CLASSES[class_id]] += 1
    
    split_stats[split] = {
        'images': img_count,
        'labels': label_count,
        'total_objects': total_objects,
        'class_distribution': class_distribution
    }
    
    print(f"\n{split.upper()} Split:")
    print(f"  Images: {img_count}")
    print(f"  Labels: {label_count}")
    print(f"  Total Objects: {total_objects}")
    print(f"  Class Distribution:")
    for cls, count in class_distribution.items():
        avg_per_image = count / max(img_count, 1)
        print(f"    {cls}: {count} objects (avg {avg_per_image:.2f} per image)")



BLOCK 6: Dataset Validation
--------------------------------------------------------------------------------

TRAIN Split:
  Images: 2026
  Labels: 2026
  Total Objects: 5109
  Class Distribution:
    person: 3713 objects (avg 1.83 per image)
    car: 1025 objects (avg 0.51 per image)
    dog: 371 objects (avg 0.18 per image)

VAL Split:
  Images: 434
  Labels: 434
  Total Objects: 1040
  Class Distribution:
    person: 712 objects (avg 1.64 per image)
    car: 243 objects (avg 0.56 per image)
    dog: 85 objects (avg 0.20 per image)

TEST Split:
  Images: 435
  Labels: 435
  Total Objects: 1149
  Class Distribution:
    person: 802 objects (avg 1.84 per image)
    car: 273 objects (avg 0.63 per image)
    dog: 74 objects (avg 0.17 per image)


Block 6: Dataset Validation

This block verifies that the dataset has been correctly prepared and organized according to YOLO format specifications.

Validation checks:
- Image count matches label count in each split (an AssertionError is raised otherwise)
- Label lines are parsed and tallied per class; lines without the expected 5 fields (class_id, x_center, y_center, width, height) or with an out-of-range class ID are not added to the class counts
- Per-class statistics: total object count and average objects per image

Note that this block does not check whether coordinate values are normalized between 0 and 1, and it does not report malformed lines.
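
A per-line check of the YOLO label format could look like the following minimal sketch. The function validate_label_line is hypothetical and not part of the notebook; it assumes the 5-field layout described above and treats any coordinate outside 0-1 as an error.

```python
def validate_label_line(line, num_classes):
    """Return a list of problems found in one YOLO label line.
    An empty list means the line is valid."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]

    problems = []
    try:
        class_id = int(parts[0])
    except ValueError:
        return [f"class id is not an integer: {parts[0]!r}"]
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside [0, {num_classes})")

    try:
        coords = [float(value) for value in parts[1:]]
    except ValueError:
        return problems + ["non-numeric coordinate value"]

    names = ("x_center", "y_center", "width", "height")
    for name, value in zip(names, coords):
        # NaN also fails this comparison and is reported
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} outside [0, 1]")
    return problems


print(validate_label_line("0 0.859000 0.338369 0.102000 0.241692", 3))  # []
```

Running this over every line of every label file, and collecting the non-empty results, would turn the format claims above into an actual pass/fail check.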

In [19]:
# BLOCK 7: Image Format Validation
# ===========================================================================
# Verify image dimensions and format

print("\n\nBLOCK 7: Image Format Validation")
print("-" * 80)

for split in SPLITS:
    images_dir = DIRECTORY_STRUCTURE[f'images_{split}']
    img_files = list(images_dir.glob('*.jpg'))
    
    if img_files:
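        # Spot check: open only the first image found in this split,
        # and close the file handle when done
        with Image.open(img_files[0]) as sample_img:
            print(f"{split}: {sample_img.size[0]}x{sample_img.size[1]} pixels - {sample_img.mode}")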

# BLOCK 8: YOLO Format Validation
# ===========================================================================
# Verify annotation format compliance

print("\n\nBLOCK 8: YOLO Format Validation")
print("-" * 80)

labels_dir = DIRECTORY_STRUCTURE['labels_train']
label_files = list(labels_dir.glob('*.txt'))

if label_files:
    sample_label = label_files[0]
    with open(sample_label, 'r') as f:
        content = f.read()
    
    print(f"Sample label file: {sample_label.name}")
    print(f"Content (first 3 lines):")
    for line_idx, line in enumerate(content.strip().split('\n')[:3]):
        parts = line.split()
        if len(parts) == 5:
            class_id = int(parts[0])
            class_name = TARGET_CLASSES[class_id] if class_id < len(TARGET_CLASSES) else 'unknown'
            x_center, y_center, width, height = parts[1:5]
            print(f"  Line {line_idx + 1}: class_id={class_id} ({class_name})")
            print(f"    center: ({x_center}, {y_center})")
            print(f"    size: {width} x {height}")

# BLOCK 9: Generate data.yaml Configuration
# ===========================================================================
# Create YOLO configuration file

print("\n\nBLOCK 9: Generate YOLO Configuration")
print("-" * 80)

data_yaml_content = {
    'path': str(DATA_DIR.absolute()),
    'train': 'images/train',
    'val': 'images/val',
    'test': 'images/test',
    'nc': len(TARGET_CLASSES),
    'names': TARGET_CLASSES
}

yaml_path = DATA_DIR / 'data.yaml'
with open(yaml_path, 'w') as f:
    yaml.dump(data_yaml_content, f, default_flow_style=False, sort_keys=False)

print(f"Configuration file created: {yaml_path}")
print("\ndata.yaml content:")
with open(yaml_path, 'r') as f:
    print(f.read())

# BLOCK 10: Final Validation Summary
# ===========================================================================
# Summary and status report

print("\n\nBLOCK 10: Final Validation Summary")
print("-" * 80)

validation_passed = True
for split in SPLITS:
    stats = split_stats.get(split, {})
    images = stats.get('images', 0)
    labels = stats.get('labels', 0)
    
    if images > 0 and images == labels:
        print(f"OK - {split}: {images} images with corresponding labels")
    else:
        print(f"FAILED - {split}: image count mismatch or empty")
        validation_passed = False

print("\n" + "=" * 80)
if validation_passed:
    print("Dataset preparation COMPLETED SUCCESSFULLY")
    print("All validation checks passed")
    print("Dataset is ready for training in notebook 02_train_yolo.ipynb")
else:
    print("Dataset preparation FAILED")
    print("Please review errors above")
    raise RuntimeError("Dataset validation failed")

print("=" * 80)



BLOCK 7: Image Format Validation
--------------------------------------------------------------------------------
train: 500x331 pixels - RGB
val: 500x400 pixels - RGB
test: 500x375 pixels - RGB


BLOCK 8: YOLO Format Validation
--------------------------------------------------------------------------------
Sample label file: train_0000.txt
Content (first 3 lines):
  Line 1: class_id=0 (person)
    center: (0.859000, 0.338369)
    size: 0.102000 x 0.241692
  Line 2: class_id=0 (person)
    center: (0.657000, 0.403323)
    size: 0.134000 x 0.317221
  Line 3: class_id=0 (person)
    center: (0.496000, 0.439577)
    size: 0.160000 x 0.395770


BLOCK 9: Generate YOLO Configuration
--------------------------------------------------------------------------------
Configuration file created: ..\data\data.yaml

data.yaml content:
path: c:\Users\mlata\Documents\iajordy2\notebooks\..\data
train: images/train
val: images/val
test: images/test
nc: 3
names:
- person
- car
- dog



BLOCK 10: Final

Blocks 7-10: Format Validation and Final Report

These final blocks run spot checks on the prepared files and generate the configuration file:

Block 7: Image Format Validation
- Opens the first image found in each split with PIL
- Prints its dimensions
- Prints its color mode (RGB for all three samples in this run)

Block 8: YOLO Format Validation
- Displays the first three lines of one training label file
- Parses each line into class ID, center, and size for visual inspection
- Shows coordinate normalization examples (no pass/fail check is made)

Block 9: Generate YOLO Configuration
- Creates data.yaml file required by YOLO training
- Specifies paths to train/val/test directories
- Lists target classes with their IDs
- This file is used by notebooks 03_training.ipynb and 04_prediction.ipynb

Block 10: Final Validation Summary
- Reports OK or FAILED per split, where OK means a non-zero image count that equals the label count
- Raises RuntimeError if any split fails
- Otherwise indicates readiness for the training pipeline
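
Block 7 samples only one image per split. A cheap way to screen every file is to check the JPEG signature bytes, as in this minimal sketch. The helper names are hypothetical, and the check only confirms that a file starts like a JPEG; it does not detect truncated or otherwise corrupted image data, for which fully decoding each image (for example with PIL) would be more thorough.

```python
from pathlib import Path

# JPEG files begin with the start-of-image marker FF D8,
# followed by FF, the first byte of the next marker
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(path):
    """Return True if the file starts with the JPEG signature bytes."""
    with open(path, "rb") as f:
        return f.read(len(JPEG_SIGNATURE)) == JPEG_SIGNATURE


def find_suspect_images(images_dir):
    """Return the sorted names of *.jpg files that lack a JPEG signature."""
    return sorted(
        p.name for p in Path(images_dir).glob("*.jpg") if not looks_like_jpeg(p)
    )
```

An empty result from find_suspect_images for each split is a necessary but not sufficient condition for the images being readable during training.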