
<h1 align=center><font size = 5>CAPSTONE PROJECT</font></h1>
<h2 align=center><font size = 5>AIML Certification Programme</font></h2>



## Student Name and ID:
Mention your name and ID if done individually.<br>
If done as a group, clearly mention the contribution of each group member qualitatively and as a percentage.<br>
1. KUNA MURALI (ID: 2024AIML030)                          

2. MADIRE MAHESHKUMAR (ID: 2024AIML079)

3. V VIJAY KUMAR (ID: 2024AIML100)

4. GADIGA MOUNESWAR BABU (ID: 2024AIML095)


## Helmet Violation Detection from Indian CCTV Video

**Problem statement:**
Detect and flag two-wheeler helmet violations (helmetless riding) in real time from traffic camera frames captured in Indian cities.

**Description:**
Create a computer vision system using YOLOv8 and object tracking to detect two-wheeler riders and classify helmet usage. Optionally perform license plate OCR for enforcement.

**Dataset:**

 *  Indian Helmet Detection Dataset

 *  Research-generated dataset of Indian two-wheeler violations (images and video with annotations for helmet and plate)

   

## Setup

Install the pinned dependencies and import libraries:

In [None]:
!pip install opencv-python==4.9.0.80
!pip install matplotlib==3.8.4
!pip install numpy==1.26.4
!pip install pillow==10.3.0
!pip install pandas==2.2.2
!pip install seaborn==0.13.2
!pip install scikit-learn==1.4.2
!pip install torch==2.3.0
!pip install notebook==7.2.0
!pip install albumentations==1.4.8
!pip install albucore==0.0.16

In [None]:
import sys
import os
import random
import shutil
import hashlib
import numpy as np
import cv2
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter
import albumentations as A
import glob
import pandas as pd
import seaborn as sns
from itertools import combinations
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

scripts_dir = os.path.abspath(os.path.join(os.getcwd(), '..', 'scripts'))
sys.path.append(scripts_dir)
for entry in os.listdir(scripts_dir):
    entry_path = os.path.join(scripts_dir, entry)
    if os.path.isdir(entry_path):
        sys.path.append(entry_path)

from utils import show_images_grid
from flip import HorizontalFlip
from zoom import DynamicZoomer
from mosaic import MosaicAugmentor
from cutout import CutoutAugmentor
from synthetic import SyntheticImageAugmentor
from edgedetect import EdgeDetectAugmentor
from cutmix import CutMixAugmentor
from rotate import RotateAugmentor
from shadow import ShadowCastingAugmentor
from grayscale import GrayscaleAugmentor
from noise import NoiseInjectionAugmentor

## Exploratory Data Analysis

<h3>1. Dataset Structure & Counts</h3>

 *  Number of images in total.

 *  Number of labels (bounding boxes).

 *  Distribution of classes (e.g., how many helmet, person, motorbike).

 *  Check if all images have corresponding labels (sometimes labels are missing).

 *  Check for duplicate images.

In [None]:
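# Build paths with os.path.join so they work on Windows, Linux and macOS
image_folder = os.path.join("..", "data", "raw", "train", "images")
label_folder = os.path.join("..", "data", "raw", "train", "labels")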

class_map = {
    0: "NumberPlate",
    1: "Person",
    2: "Helmet",
    3: "Head",
    4: "Motorbike"
}

images = os.listdir(image_folder)
labels = os.listdir(label_folder)


print("Total images:", len(images))
print("Total label files:", len(labels))
print("Missing label files:", set(os.path.splitext(i)[0] for i in images) - set(os.path.splitext(l)[0] for l in labels))


**Interpretation:** The **training dataset** contains **800** images together with their corresponding label files.

Note: No image is missing its label file.
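
The check above only looks for images without labels. A minimal sketch of a two-way check, which also reports label files that have no matching image, could look as follows. `find_unpaired` is a hypothetical helper, not part of the original notebook, and the file names are toy values.

```python
import os

def find_unpaired(image_names, label_names):
    """Return (images without labels, labels without images), compared by file stem."""
    image_stems = {os.path.splitext(name)[0] for name in image_names}
    label_stems = {os.path.splitext(name)[0] for name in label_names if name.endswith(".txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)

# Toy file names; in the notebook these lists would come from os.listdir().
missing_labels, orphan_labels = find_unpaired(
    ["a.jpg", "b.jpg", "c.jpg"],
    ["a.txt", "b.txt", "d.txt"],
)
print("Images without labels:", missing_labels)
print("Labels without images:", orphan_labels)
```

Orphan label files are harmless for training in most pipelines, but they inflate label counts in the class statistics.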

In [None]:
# Count classes
class_counts = {}

# Walk through all label files
for lbl_name in os.listdir(label_folder):
    if not lbl_name.endswith(".txt"):
        continue
    with open(os.path.join(label_folder, lbl_name), "r") as f:
        for line in f:
            cls = int(line.strip().split()[0])
            class_counts[cls] = class_counts.get(cls, 0) + 1

# Replace class IDs with names for plotting
labels = [class_map.get(k, str(k)) for k in class_counts.keys()]
counts = list(class_counts.values())

# Plot
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(labels, counts, color="skyblue", edgecolor="black")

# Add numbers on top of bars
for rect in bars:
    height = rect.get_height()
    ax.annotate(f'{int(height)}',
                xy=(rect.get_x() + rect.get_width() / 2, height),
                xytext=(0, 3),  # vertical spacing
                textcoords="offset points",
                ha='center', va='bottom', fontsize=10, fontweight='bold')

# Labels and formatting
ax.set_xlabel("Class")
ax.set_ylabel("Number of objects")
ax.set_title("Class Distribution in Labels")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Interpretation:**

<h5>Class Imbalance Observations</h5>

 *  Helmet (1,787) and Motorbike (1,750) dominate the dataset → they are much more frequent than Head (293) and Person (791).

 *  Head, the class that marks a rider without a helmet, is significantly underrepresented (only 293 instances), which could cause poor detection performance for this class.

 *  NumberPlate (1,327) is moderately represented.
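
To put numbers on the imbalance, the sketch below computes the ratio of the most frequent class to each class, plus inverse-frequency weights. The counts are the ones reported above; `imbalance_ratios` and `inverse_frequency_weights` are hypothetical helpers, and using such weights in training is only one possible mitigation, not something the notebook does.

```python
# Instance counts as reported in the class-distribution plot above.
reported_counts = {
    "Helmet": 1787,
    "Motorbike": 1750,
    "NumberPlate": 1327,
    "Person": 791,
    "Head": 293,
}

def imbalance_ratios(counts):
    """Ratio of the most frequent class count to each class count."""
    largest = max(counts.values())
    return {name: largest / n for name, n in counts.items()}

def inverse_frequency_weights(counts):
    """Weights proportional to 1/count, scaled so that they average to 1."""
    inv = {name: 1.0 / n for name, n in counts.items()}
    scale = len(inv) / sum(inv.values())
    return {name: w * scale for name, w in inv.items()}

ratios = imbalance_ratios(reported_counts)
weights = inverse_frequency_weights(reported_counts)
for name in reported_counts:
    print(f"{name:12s} ratio to majority: {ratios[name]:.2f}  weight: {weights[name]:.2f}")
```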

In [None]:
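def file_md5(image_path):
    # MD5 of the raw file bytes: detects byte-identical copies only,
    # not resized or re-encoded near-duplicates (this is not a perceptual dHash).
    with open(image_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

hashes = {}
duplicates = []

for f in os.listdir(image_folder):
    path = os.path.join(image_folder, f)
    h = file_md5(path)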
    if h in hashes:
        duplicates.append((hashes[h], f))
    else:
        hashes[h] = f

print("Duplicate image pairs:", duplicates)


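**Interpretation:** There are no byte-identical duplicate images. Because the check hashes raw file bytes with MD5, resized or re-encoded near-duplicates would not be detected.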
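
Hashing the file bytes with MD5 only catches exact copies. A perceptual difference hash is one common way to look for near-duplicates. The sketch below is an illustration rather than the notebook's method: it assumes each image has already been converted to a small grayscale thumbnail (for instance 8 rows by 9 columns via `cv2.resize(gray, (9, 8))`) and passed in as a list of rows. `dhash_bits` and `hamming_distance` are hypothetical names.

```python
def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    gray_rows is assumed to be a small grayscale thumbnail given as a
    list of rows of pixel intensities.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Two toy 2x3 "thumbnails": the second is a slightly brighter copy of the first.
thumb_a = [[10, 20, 15], [200, 100, 150]]
thumb_b = [[12, 22, 17], [202, 102, 152]]
print(hamming_distance(dhash_bits(thumb_a), dhash_bits(thumb_b)))
```

A small Hamming distance between two hashes suggests visually similar images; the threshold to use is dataset-dependent and would need to be tuned by inspection.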

<h3>2. Image Properties</h3>
    
 *  Distribution of image dimensions (width, height).

 *  Aspect ratio distribution (wide vs tall images).

 *  Check if some images are too small or too large.

In [None]:
heights = []
widths = []
aspect_ratios = []

for img_name in os.listdir(image_folder):
    path = os.path.join(image_folder, img_name)
    img = cv2.imread(path)

    if img is None:   # Skip unreadable files
        print("Could not read:", path)
        continue

    h, w, _ = img.shape
    heights.append(h)
    widths.append(w)
    aspect_ratios.append(w / h)


plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
plt.hist(widths, bins=20, color="skyblue", edgecolor="black")
plt.title("Width Distribution")
plt.xlabel("Width (px)")
plt.ylabel("Number of Images")

plt.subplot(1,2,2)
plt.hist(heights, bins=20, color="salmon", edgecolor="black")
plt.title("Height Distribution")
plt.xlabel("Height (px)")
plt.ylabel("Number of Images")

plt.tight_layout()
plt.show()


# --- Aspect Ratio Distribution ---
plt.hist(aspect_ratios, bins=30, color="skyblue", edgecolor="black")
plt.xlabel("Aspect Ratio (Width / Height)")
plt.ylabel("Number of Images")
plt.title("Aspect Ratio Distribution")
plt.show()

# --- Scatter Plot (Width vs Height) ---
plt.scatter(widths, heights, alpha=0.5, color="green")
plt.xlabel("Width (px)")
plt.ylabel("Height (px)")
plt.title("Image Size Distribution (Width vs Height)")
plt.show()

# --- Check Small / Large Images ---
too_small = [(w, h) for w, h in zip(widths, heights) if w < 100 or h < 100]
too_large = [(w, h) for w, h in zip(widths, heights) if w > 2000 or h > 2000]

print(f"Total images: {len(widths)}")
print(f"Too small images (<100 px): {len(too_small)}")
print(f"Too large images (>2000 px): {len(too_large)}")


**Interpretation:** 

 * The dataset is well-curated, with all images being consistently sized (640x640 px), square in shape, and free of extreme outliers.

 * These properties are beneficial for training helmet violation detection models, minimizing the need for preprocessing.

<h3>3. Label Analysis</h3>

 *  Bounding box count per image (how many objects per image).

 *  Class imbalance (e.g., if helmets appear much less than motorbikes, the model may be biased).

 *  Bounding box size distribution (are boxes tiny or very large?).

 *  Check for outliers (e.g., boxes outside the image boundaries).
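
The first item, the bounding box count per image, can be sketched as below. The example works on in-memory label text so that it runs on its own; `boxes_per_image` is a hypothetical helper, and in the notebook the dictionary would be filled by reading each `.txt` file in `label_folder`.

```python
from collections import Counter

def boxes_per_image(label_texts):
    """Map each label file name to its number of non-empty annotation lines."""
    return {
        name: sum(1 for line in text.splitlines() if line.strip())
        for name, text in label_texts.items()
    }

# Toy label files in YOLO format: class x_center y_center width height
label_texts = {
    "img1.txt": "4 0.5 0.5 0.2 0.4\n2 0.5 0.3 0.05 0.05\n",
    "img2.txt": "4 0.4 0.6 0.3 0.5\n",
    "img3.txt": "",
}
counts = boxes_per_image(label_texts)
histogram = Counter(counts.values())  # boxes per image -> number of images
print(counts)
print(sorted(histogram.items()))
```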

In [None]:
# Store bbox sizes per class
bbox_sizes_per_class = {cls: [] for cls in class_map.keys()}

for lbl_name in os.listdir(label_folder):
    if not lbl_name.endswith(".txt"):
        continue

    with open(os.path.join(label_folder, lbl_name)) as f:
        for line in f:
            parts = line.strip().split()
            cls = int(parts[0])
            w, h = float(parts[3]), float(parts[4])  # normalized width, height
            bbox_sizes_per_class[cls].append(w * h)

# --- Plot per-class histograms ---
plt.figure(figsize=(12, 8))
for cls, sizes in bbox_sizes_per_class.items():
    if len(sizes) == 0:
        continue
    plt.hist(sizes, bins=30, alpha=0.6, label=class_map[cls])

plt.xlabel("Box size (normalized area)")
plt.ylabel("Frequency")
plt.title("Bounding Box Size Distribution per Class")
plt.legend()
plt.show()

# --- Alternative: Boxplot for easier comparison ---
plt.figure(figsize=(10, 6))
plt.boxplot([sizes for sizes in bbox_sizes_per_class.values()],
            labels=[class_map[c] for c in bbox_sizes_per_class.keys()],
            showfliers=False)  # hide extreme outliers
plt.ylabel("Normalized Box Area")
plt.title("Box Size Distribution (per class)")
plt.show()

Check for outliers (e.g., boxes outside the image boundaries).

In [None]:

invalid_boxes = []

for lbl_name in os.listdir(label_folder):
    if not lbl_name.endswith(".txt"):
        continue

    with open(os.path.join(label_folder, lbl_name)) as f:
        for line_num, line in enumerate(f, start=1):
            parts = line.strip().split()
            cls, x, y, w, h = int(parts[0]), float(parts[1]), float(parts[2]), float(parts[3]), float(parts[4])

            # Check for invalid conditions
            if not (0 <= x <= 1 and 0 <= y <= 1 and 0 < w <= 1 and 0 < h <= 1):
                invalid_boxes.append((lbl_name, line_num, "Out of [0,1] range", parts))
            
            # Check if box goes outside boundaries
            if (x - w/2) < 0 or (x + w/2) > 1 or (y - h/2) < 0 or (y + h/2) > 1:
                invalid_boxes.append((lbl_name, line_num, "Box exceeds image boundary", parts))

# Print results
if invalid_boxes:
    print("⚠️ Found invalid bounding boxes:")
    for lbl, ln, issue, box in invalid_boxes[:20]:  # show only first 20
        print(f"{lbl}, line {ln}: {issue} -> {box}")
else:
    print("✅ All bounding boxes look valid")
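
If boxes that exceed the image boundary are found, one possible remedy is to clip them rather than discard them. The sketch below shows the conversion from YOLO center format to corners and back; `clip_yolo_box` is a hypothetical helper and clipping is an assumption about how such boxes might be handled, not a step the notebook performs.

```python
def clip_yolo_box(x_c, y_c, w, h):
    """Clip a normalized YOLO box (center x, center y, width, height) to [0, 1]."""
    x1 = max(0.0, x_c - w / 2)
    y1 = max(0.0, y_c - h / 2)
    x2 = min(1.0, x_c + w / 2)
    y2 = min(1.0, y_c + h / 2)
    if x2 <= x1 or y2 <= y1:
        return None  # the box lies entirely outside the image
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

print(clip_yolo_box(0.05, 0.5, 0.2, 0.2))  # left edge is cut at x = 0
print(clip_yolo_box(0.5, 0.5, 0.2, 0.2))   # already inside the image
```

Small overshoots are often rounding noise from the annotation export, so comparing against a tolerance instead of exactly 0 and 1 may reduce false alarms.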


<h3>4. Sample Visualizations</h3>

 *  Show randomly selected images with their bounding boxes.

 *  Show some wrong labels/outliers (e.g., very small/large boxes).

 *  Compare images per class visually.

In [None]:
def draw_yolo_bboxes(image_folder, label_folder, filename, class_map=None):
    """
    Draw YOLO bounding boxes on a specific image and return the annotated image (RGB, np.array).
    """
    # Build paths
    image_path = os.path.join(image_folder, filename)
    label_path = os.path.join(label_folder, os.path.splitext(filename)[0] + ".txt")
    # Load image
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Image not found: {image_path}")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    h, w, _ = image.shape
    colors = {0:(255,0,0),1:(0,255,0),2:(0,255,255),3:(255,165,0),4:(0,0,255)}
    # Draw bounding boxes
    if os.path.exists(label_path):
        with open(label_path, "r") as f:
            for label in f.readlines():
                parts = label.strip().split()
                class_id = int(parts[0])
                x_center, y_center, bw, bh = map(float, parts[1:])
                x_center, y_center, bw, bh = x_center * w, y_center * h, bw * w, bh * h
                x1, y1 = int(x_center - bw / 2), int(y_center - bh / 2)
                x2, y2 = int(x_center + bw / 2), int(y_center + bh / 2)
                cv2.rectangle(image, (x1, y1), (x2, y2), colors.get(class_id, (255,255,255)), 2)
                label_text = class_map[class_id] if class_map and class_id in class_map else str(class_id)
                cv2.putText(image, label_text, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX,
                            0.7, colors.get(class_id, (255,255,255)), 2)
    return image

In [None]:
def show_random_images_grid(image_folder, label_folder, class_map=None, N=4):
    filenames = [f for f in os.listdir(image_folder) if f.endswith(".jpg")]
    random_files = random.sample(filenames, min(N, len(filenames)))
    cols = 2
    rows = (len(random_files) + 1) // cols
    plt.figure(figsize=(15, 7 * rows))
    for idx, filename in enumerate(random_files):
        img_annotated = draw_yolo_bboxes(
            image_folder=image_folder,
            label_folder=label_folder,
            filename=filename,
            class_map=class_map
        )
        plt.subplot(rows, cols, idx+1)
        plt.imshow(img_annotated)
        plt.axis("off")
        plt.title(filename)
    plt.tight_layout()
    plt.show()

In [None]:
show_random_images_grid(image_folder, label_folder, class_map, N=6)

In [None]:
def show_outlier_bboxes(image_folder, label_folder, class_map=None, min_area_ratio=0.001, max_area_ratio=0.9, N=6):
    """
    Display images with outlier bounding boxes (too small or too large)
    
    Args:
        image_folder: folder containing images
        label_folder: folder containing YOLO label files
        class_map: dict of class_id -> name
        min_area_ratio: boxes smaller than this fraction of image area are suspicious
        max_area_ratio: boxes larger than this fraction of image area are suspicious
        N: number of random images to check
    """
    filenames = [f for f in os.listdir(image_folder) if f.endswith(".jpg")]
    random_files = random.sample(filenames, min(N, len(filenames)))

    plt.figure(figsize=(15, 7 * ((N+1)//2)))
    colors = {0:(255,0,0),1:(0,255,0),2:(0,255,255),3:(255,165,0),4:(0,0,255)}

    for idx, filename in enumerate(random_files):
        image_path = os.path.join(image_folder, filename)
        label_path = os.path.join(label_folder, os.path.splitext(filename)[0] + ".txt")
        image = cv2.imread(image_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        h, w, _ = image.shape
        image_area = h * w

        # Flag boxes that are too small or too large
        if os.path.exists(label_path):
            with open(label_path, "r") as f:
                for label in f.readlines():
                    parts = label.strip().split()
                    class_id = int(parts[0])
                    x_center, y_center, bw, bh = map(float, parts[1:])
                    bw_pix, bh_pix = bw*w, bh*h
                    box_area = bw_pix * bh_pix

                    if box_area/image_area < min_area_ratio or box_area/image_area > max_area_ratio:
                        # Draw the outlier box
                        x1, y1 = int(x_center*w - bw_pix/2), int(y_center*h - bh_pix/2)
                        x2, y2 = int(x_center*w + bw_pix/2), int(y_center*h + bh_pix/2)
                        cv2.rectangle(image, (x1, y1), (x2, y2), (255,0,255), 2)  # magenta for outlier
                        label_text = class_map[class_id] if class_map and class_id in class_map else str(class_id)
                        cv2.putText(image, label_text, (x1, y1-5), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255,0,255), 2)

        plt.subplot((N+1)//2, 2, idx+1)
        plt.imshow(image)
        plt.axis("off")
        plt.title(filename)

    plt.tight_layout()
    plt.show()

show_outlier_bboxes(image_folder, label_folder, class_map, min_area_ratio=0.001, max_area_ratio=0.9, N=6)


In [None]:
def show_class_comparison(image_folder, label_folder, class_map=None, samples_per_class=3):
    """
    Display random samples per class side by side for visual comparison.
    
    Args:
        image_folder: folder containing images
        label_folder: folder containing YOLO label files
        class_map: dict mapping class_id -> class name
        samples_per_class: number of examples to show per class
    """
    # Collect all bounding boxes per class
    class_boxes = {cid: [] for cid in class_map.keys()}

    filenames = [f for f in os.listdir(image_folder) if f.endswith(".jpg")]

    for filename in filenames:
        image_path = os.path.join(image_folder, filename)
        label_path = os.path.join(label_folder, os.path.splitext(filename)[0] + ".txt")
        
        if not os.path.exists(label_path):
            continue
        
        image = cv2.imread(image_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        h, w, _ = image.shape

        with open(label_path, "r") as f:
            for line in f.readlines():
                parts = line.strip().split()
                class_id = int(parts[0])
                x_center, y_center, bw, bh = map(float, parts[1:])
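                # Clamp to the image so negative indices do not wrap around when cropping
                x1, y1 = max(0, int((x_center - bw/2) * w)), max(0, int((y_center - bh/2) * h))
                x2, y2 = min(w, int((x_center + bw/2) * w)), min(h, int((y_center + bh/2) * h))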
                
                # Crop bounding box
                crop = image[y1:y2, x1:x2]
                if crop.size > 0:
                    class_boxes[class_id].append(crop)
    
    # Plot samples per class
    n_classes = len(class_map)
    plt.figure(figsize=(samples_per_class*3, n_classes*3))

    for idx, class_id in enumerate(class_map.keys()):
        samples = random.sample(class_boxes[class_id], min(samples_per_class, len(class_boxes[class_id])))
        for s_idx, crop in enumerate(samples):
            plt.subplot(n_classes, samples_per_class, idx*samples_per_class + s_idx + 1)
            plt.imshow(crop)
            plt.axis("off")
            plt.title(class_map[class_id], fontsize=8)
            if s_idx == 0:
                plt.ylabel(class_map[class_id], fontsize=12)
    
    plt.tight_layout()
    plt.show()

show_class_comparison(image_folder, label_folder, class_map, samples_per_class=4)


<h3>5. Data Quality Checks</h3>

 *  Blurry, duplicate, or corrupted images.

 *  Bounding boxes with width/height = 0.

 *  Check if any object is labeled incorrectly (e.g., person labeled as helmet).

In [None]:
def check_corrupted_images(image_folder):
    corrupted = []
    for filename in os.listdir(image_folder):
        if not filename.endswith(".jpg"):
            continue
        path = os.path.join(image_folder, filename)
        try:
            img = cv2.imread(path)
            if img is None:
                corrupted.append(filename)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            corrupted.append(filename)
    print(f"Found {len(corrupted)} corrupted images")
    return corrupted


In [None]:
def check_duplicate_images(image_folder):
    hashes = {}
    duplicates = []
    for filename in os.listdir(image_folder):
        if not filename.endswith(".jpg"):
            continue
        path = os.path.join(image_folder, filename)
        with open(path, 'rb') as f:
            file_hash = hashlib.md5(f.read()).hexdigest()
        if file_hash in hashes:
            duplicates.append((filename, hashes[file_hash]))
        else:
            hashes[file_hash] = filename
    print(f"Found {len(duplicates)} duplicate images")
    return duplicates


In [None]:
def check_blurry_images(image_folder, threshold=100.0):
    blurry_images = []
    for filename in os.listdir(image_folder):
        if not filename.endswith(".jpg"):
            continue
        path = os.path.join(image_folder, filename)
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue
        variance = cv2.Laplacian(img, cv2.CV_64F).var()
        if variance < threshold:
            blurry_images.append((filename, variance))
    print(f"Found {len(blurry_images)} potentially blurry images")
    return blurry_images
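
The blur check relies on the variance of the Laplacian: sharp images have strong edges and therefore a wide spread of Laplacian responses, while blurry images give responses close to zero. The pure-Python sketch below illustrates the idea on tiny grids. It is a simplification that uses the 4-neighbour kernel on interior pixels only, so its values will not match `cv2.Laplacian` exactly; `laplacian_variance` is a hypothetical name.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels of a 2D grid."""
    responses = []
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            lap = (gray[r - 1][c] + gray[r + 1][c]
                   + gray[r][c - 1] + gray[r][c + 1]
                   - 4 * gray[r][c])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)

flat = [[128] * 6 for _ in range(6)]                                    # no edges
checker = [[255 * ((r + c) % 2) for c in range(6)] for r in range(6)]   # many edges
print(laplacian_variance(flat), laplacian_variance(checker))
```

The threshold of 100 used in `check_blurry_images` is a common starting value, but a suitable cut-off depends on image resolution and content.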


In [None]:
def check_invalid_bboxes(label_folder):
    invalid_boxes = []
    for filename in os.listdir(label_folder):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(label_folder, filename)
        with open(path, "r") as f:
            for line in f.readlines():
                parts = line.strip().split()
                if len(parts) != 5:
                    continue
                class_id, x_center, y_center, bw, bh = parts
                if float(bw) <= 0 or float(bh) <= 0:
                    invalid_boxes.append((filename, line.strip()))
    print(f"Found {len(invalid_boxes)} invalid bounding boxes")
    return invalid_boxes


In [None]:
def check_potential_mislabeled_boxes(image_folder, label_folder, class_map, min_area_ratio=0.001, max_area_ratio=0.5):
    suspicious = []
    for filename in os.listdir(image_folder):
        if not filename.endswith(".jpg"):
            continue
        image_path = os.path.join(image_folder, filename)
        label_path = os.path.join(label_folder, os.path.splitext(filename)[0]+".txt")
        if not os.path.exists(label_path):
            continue
        img = cv2.imread(image_path)
        if img is None:  # skip unreadable images instead of failing on img.shape
            continue
        h, w, _ = img.shape

        with open(label_path, "r") as f:
            for line in f.readlines():
                parts = line.strip().split()
                class_id = int(parts[0])
                x_center, y_center, bw, bh = map(float, parts[1:])
                box_area = bw * bh  # YOLO normalized
                if box_area < min_area_ratio or box_area > max_area_ratio:
                    suspicious.append((filename, class_map.get(class_id,str(class_id)), bw* w, bh* h))
    print(f"Found {len(suspicious)} potentially mislabeled boxes")
    return suspicious


In [None]:
def display_4_images(issue_list, image_folder, title="", duplicates=False):
    """
    Display 4 sample images in a 2x2 grid.
    
    Args:
        issue_list: list of image filenames (or tuples for duplicates)
        image_folder: folder containing images
        title: main title for the figure
        duplicates: if True, issue_list contains tuples (dup, original)
    """
    if not issue_list:
        print(f"No {title} found.")
        return
    
    samples = random.sample(issue_list, min(4, len(issue_list)))
    
    plt.figure(figsize=(10,10))
    plt.suptitle(title, fontsize=16)
    
    for idx, item in enumerate(samples):
        if duplicates:
            filename = item[0]  # show the duplicate file
            subtitle = f"{item[0]}\n(dup of {item[1]})"
        else:
            if isinstance(item, tuple):  # for invalid boxes or mislabeled boxes
                filename = item[0]
                subtitle = f"{os.path.basename(item[0])}\n{item[1]}"
            else:
                filename = item
                subtitle = filename
        
        path = os.path.join(image_folder, filename)
        img = cv2.imread(path)
        if img is None:
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        plt.subplot(2,2,idx+1)
        plt.imshow(img)
        plt.axis("off")
        plt.title(subtitle, fontsize=10)
    
    plt.tight_layout(rect=[0,0,1,0.95])
    plt.show()

In [None]:
corrupted = check_corrupted_images(image_folder)
duplicates = check_duplicate_images(image_folder)
blurry = check_blurry_images(image_folder)
invalid_boxes = check_invalid_bboxes(label_folder)
mislabeled = check_potential_mislabeled_boxes(image_folder, label_folder, class_map)

In [None]:
# Display 4 sample blurry images
display_4_images([x[0] for x in blurry], image_folder, title="Blurry Images")

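# Display 4 sample images with invalid bounding boxes.
# invalid_boxes stores label file names, so map them to image names (assumes .jpg images).
display_4_images([os.path.splitext(x[0])[0] + ".jpg" for x in invalid_boxes], image_folder, title="Invalid Bounding Boxes")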

# Display 4 sample potentially mislabeled boxes
display_4_images([x[0] for x in mislabeled], image_folder, title="Potentially Mislabeled Boxes")



In [None]:

def display_mislabeled_boxes(image_folder, label_folder, mislabeled_list, class_map, max_samples=4):
    """
    Display potentially mislabeled boxes with bounding boxes in a 2x2 grid.
    
    Args:
        image_folder: folder containing images
        label_folder: folder containing label txt files
        mislabeled_list: list of tuples (filename, class_name, box_width, box_height)
        class_map: dict mapping class_id -> name
        max_samples: maximum number of images to display
    """
    if not mislabeled_list:
        print("No potentially mislabeled boxes found.")
        return
    
    samples = random.sample(mislabeled_list, min(max_samples, len(mislabeled_list)))
    plt.figure(figsize=(10,10))
    plt.suptitle("Potentially Mislabeled Boxes", fontsize=16)
    
    colors = {0:(255,0,0),1:(0,255,0),2:(0,255,255),3:(255,165,0),4:(0,0,255)}

    for idx, (filename, class_name, bw, bh) in enumerate(samples):
        image_path = os.path.join(image_folder, filename)
        label_path = os.path.join(label_folder, os.path.splitext(filename)[0]+".txt")
        
        img = cv2.imread(image_path)
        if img is None:
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        h, w, _ = img.shape

        # Draw all bounding boxes from the label file
        if os.path.exists(label_path):
            with open(label_path, "r") as f:
                for line in f.readlines():
                    parts = line.strip().split()
                    class_id = int(parts[0])
                    x_center, y_center, bw_norm, bh_norm = map(float, parts[1:])
                    x1 = int((x_center - bw_norm/2) * w)
                    y1 = int((y_center - bh_norm/2) * h)
                    x2 = int((x_center + bw_norm/2) * w)
                    y2 = int((y_center + bh_norm/2) * h)
                    # Highlight in magenta every box of the flagged class in this image
                    # (this may include boxes of that class that are not themselves suspicious)
                    if class_map[class_id] == class_name:
                        color = (255,0,255)
                        thickness = 3
                    else:
                        color = colors.get(class_id, (255,255,255))
                        thickness = 2
                    cv2.rectangle(img, (x1, y1), (x2, y2), color, thickness)
                    cv2.putText(img, class_map[class_id], (x1, y1-5), cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2)

        plt.subplot(2,2,idx+1)
        plt.imshow(img)
        plt.axis("off")
        plt.title(filename, fontsize=10)
    
    plt.tight_layout(rect=[0,0,1,0.95])
    plt.show()


# --- Usage ---
display_mislabeled_boxes(image_folder, label_folder, mislabeled, class_map, max_samples=4)



<h3>6. Class Co-occurrence Analysis</h3>

In [None]:
# ==============================
# 1. Dataset Setup
# ==============================

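# Build paths with os.path.join so they work on Windows, Linux and macOS
image_folder = os.path.join("..", "data", "raw", "train", "images")
label_folder = os.path.join("..", "data", "raw", "train", "labels")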

# Class mapping
class_map = {
    0: "NumberPlate",
    1: "Person",
    2: "Helmet",
    3: "Head",
    4: "Motorbike"
}

# Load labels into DataFrame
records = []
for label_file in os.listdir(label_folder):
    if not label_file.endswith(".txt"):
        continue
    image_id = os.path.splitext(label_file)[0]
    img_path = os.path.join(image_folder, image_id + ".jpg")
    if not os.path.exists(img_path):
        img_path = os.path.join(image_folder, image_id + ".png")
    if not os.path.exists(img_path):
        continue
    
    img = cv2.imread(img_path)
    if img is None:
        continue
    h, w = img.shape[:2]

    with open(os.path.join(label_folder, label_file), "r") as f:
        for line in f.readlines():
            parts = line.strip().split()
            if len(parts) >= 5:
                class_id = int(parts[0])
                x_c, y_c, bw, bh = map(float, parts[1:5])
                # Convert YOLO normalized coords to pixel coords
                x_min = int((x_c - bw/2) * w)
                y_min = int((y_c - bh/2) * h)
                x_max = int((x_c + bw/2) * w)
                y_max = int((y_c + bh/2) * h)
                records.append([image_id, class_id, x_min, y_min, x_max, y_max, w, h])

df_labels = pd.DataFrame(records, columns=[
    "image_id", "class_id", "x_min", "y_min", "x_max", "y_max", "img_w", "img_h"
])
df_labels["class_name"] = df_labels["class_id"].map(class_map)


co_occurrence = pd.crosstab(df_labels['image_id'], df_labels['class_name'])
co_matrix = co_occurrence.T @ co_occurrence  # class vs class matrix

plt.figure(figsize=(6,4))
sns.heatmap(co_matrix, annot=True, cmap="Blues", fmt="d")
plt.title("Class Co-occurrence Heatmap")
plt.show()

**What it does:** Shows which classes tend to appear together in the same image (e.g., Person + Helmet + Motorbike).

**What to look for:**

Are Helmet and Person co-occurring as expected?

Are there images with Motorbike but no Person? (may be labeling issues)

Do some classes rarely co-occur, leading to fewer “realistic” combinations in training?

**Measures & Metrics:**

Co-occurrence heatmap counts.

Sparsity of matrix (how many class pairs are rarely seen).

Imbalances (e.g., 90% of Person images also have Helmet, but only 10% without helmet → violation imbalance).

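**Interpretation:**

Motorbike + Helmet = 5877 and Motorbike + Head = 1007. Because the matrix is built from per-image instance counts, each entry sums the products of the two classes' counts over all images, so these figures count object pairs rather than images. They still indicate that riders with helmets appear roughly six times as often as riders without helmets. Adding more images of riders without helmets to the training data would therefore give a more balanced class distribution.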
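
To count images rather than object pairs, class presence can be binarized before counting. The sketch below does this in plain Python on toy data; `image_level_cooccurrence` is a hypothetical helper. On the notebook's crosstab, the same idea could be expressed by applying the matrix product to `(co_occurrence > 0).astype(int)` instead of the raw counts.

```python
from collections import Counter
from itertools import combinations

def image_level_cooccurrence(classes_per_image):
    """Count, for each unordered class pair, the number of images containing both."""
    pair_counts = Counter()
    for class_names in classes_per_image.values():
        present = sorted(set(class_names))  # presence only, ignore instance counts
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts

# Toy example: image "a" has two helmets but still counts once per pair.
classes_per_image = {
    "a": ["Motorbike", "Helmet", "Helmet", "Person"],
    "b": ["Motorbike", "Head"],
    "c": ["Motorbike", "Helmet"],
}
pairs = image_level_cooccurrence(classes_per_image)
print(pairs[("Helmet", "Motorbike")], pairs[("Head", "Motorbike")])
```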

<h3>7. Spatial Distribution of Bounding Boxes</h3>

In [None]:
df_labels["x_center"] = (df_labels["x_min"] + df_labels["x_max"]) / 2
df_labels["y_center"] = (df_labels["y_min"] + df_labels["y_max"]) / 2

plt.figure(figsize=(6,6))
plt.scatter(df_labels["x_center"], df_labels["y_center"], alpha=0.2, s=10)
plt.gca().invert_yaxis()
plt.title("Bounding Box Center Distribution")
plt.xlabel("X Center")
plt.ylabel("Y Center")
plt.show()

**What it does:** Plots the center points of bounding boxes.

**What to look for:**

Are objects clustered in certain regions (e.g., bottom of image → road bias)?

Are helmets mostly on the top half (where heads are), motorbikes at bottom? (logical placement check).

If all bounding boxes are centered → dataset may lack variability.

**Measures & Metrics:**

Scatterplot density of centers.

Normalized histogram of X and Y positions.

Coverage % of the image area (are some regions never used?).

**Interpretation:** Most bounding box centers lie close to the horizontal center of the image.
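
The coverage metric listed above can be approximated by dividing the image into a grid and counting the cells that contain at least one box center. This is a minimal sketch with a hypothetical helper, `grid_coverage`. It assumes the centers are normalized to [0, 1], which for the notebook's DataFrame would mean dividing `x_center` by `img_w` and `y_center` by `img_h`.

```python
def grid_coverage(centers, grid=10):
    """Fraction of grid cells containing at least one normalized (x, y) center."""
    occupied = set()
    for x, y in centers:
        col = min(int(x * grid), grid - 1)  # x == 1.0 falls into the last column
        row = min(int(y * grid), grid - 1)
        occupied.add((row, col))
    return len(occupied) / (grid * grid)

centers = [(0.5, 0.5), (0.52, 0.55), (0.1, 0.9), (1.0, 1.0)]
print(f"Coverage: {grid_coverage(centers, grid=4):.2%}")
```

A low coverage value would support the observation that objects concentrate in one region of the frame.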

<h3>8. Occlusion & Crowding Analysis</h3>

This analysis has not been carried out yet.
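
As a possible starting point for this section, crowding can be approximated by the overlap between boxes in the same image, measured with intersection over union (IoU). The sketch below is an assumption about how the analysis might be done, with hypothetical helpers `iou` and `max_pairwise_iou`. Overlap between different classes is expected here (a helmet box sits inside a person box), so applying it to boxes of one class at a time, such as Motorbike, is probably more informative.

```python
from itertools import combinations

def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def max_pairwise_iou(boxes):
    """Largest IoU among all box pairs in one image (0.0 with fewer than two boxes)."""
    return max((iou(a, b) for a, b in combinations(boxes, 2)), default=0.0)

boxes = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 12, 12)]
print(max_pairwise_iou(boxes))
```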

<h3>9. Lighting & Brightness Analysis</h3>

In [None]:
brightness = []
for img_path in glob.glob(os.path.join(image_folder, "*.jpg")) + glob.glob(os.path.join(image_folder, "*.png")):
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    if img is not None:
        brightness.append(img.mean())

plt.hist(brightness, bins=50)
plt.title("Brightness Distribution of Images")
plt.xlabel("Average Pixel Intensity")
plt.show()

**What it does:** Checks average brightness levels of images.

**What to look for:**

Very dark images (night) vs very bright (daylight glare).

Are all images bright? → dataset may lack nighttime conditions.

Extreme outliers may be corrupted images.

**Measures & Metrics:**

Histogram of mean pixel intensity (0–255).

Brightness variance (spread across images).

% of images in “low-light” (<50 avg intensity).

**Interpretation:** Most images have a reasonable brightness level. However, a few images have a low overall mean intensity and can be treated as outliers during training.
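
The low-light share listed under the metrics can be computed directly from the `brightness` list. A minimal sketch follows; `low_light_fraction` is a hypothetical helper, the sample values are made up for illustration, and the threshold of 50 is the one suggested above.

```python
def low_light_fraction(brightness_values, threshold=50):
    """Fraction of images whose mean grayscale intensity is below the threshold."""
    if not brightness_values:
        return 0.0
    dark = sum(1 for value in brightness_values if value < threshold)
    return dark / len(brightness_values)

# Made-up mean intensities; in the notebook, pass the brightness list instead.
sample = [30.5, 48.0, 90.2, 120.7, 200.1]
print(f"Low-light share: {low_light_fraction(sample):.0%}")
```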

<h3>10. Background & Context Bias</h3>

In [None]:
features = []
sample_imgs = glob.glob(os.path.join(image_folder, "*.jpg"))[:300]  # sample
for img_path in sample_imgs:
    img = cv2.imread(img_path)
    if img is not None:
        hist = cv2.calcHist([img], [0,1,2], None, [8,8,8], [0,256,0,256,0,256]).flatten()
        features.append(hist)

features = np.array(features)
pca = PCA(n_components=2).fit_transform(features)
kmeans = KMeans(n_clusters=3, random_state=42).fit(pca)

plt.scatter(pca[:,0], pca[:,1], c=kmeans.labels_, cmap="tab10", alpha=0.7)
plt.title("Image Clustering (Background Bias Check)")
plt.show()

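**What it does:** Groups images by their overall color distribution, using a 3D color histogram per image that is reduced with PCA and clustered with KMeans. This serves as a rough proxy for background and scene type.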

**What to look for:**

If clusters strongly correspond to scene type (e.g., highway vs city vs parking lot).

Dataset too biased to one environment → poor generalization.

If violations (no helmet) only appear in certain backgrounds → model may “cheat”.

**Measures & Metrics:**

Cluster purity (% images in dominant clusters).

Number of distinct background clusters.

Correlation between cluster label and class presence (bias check).
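
The last metric, the relation between cluster label and class presence, can be checked with a simple per-cluster rate. The sketch below uses toy data and a hypothetical helper, `class_presence_by_cluster`. In the notebook, `cluster_labels` would be `kmeans.labels_` and the flags would say whether each sampled image contains a given class such as Head.

```python
from collections import defaultdict

def class_presence_by_cluster(cluster_labels, has_class):
    """Per cluster, the fraction of images that contain the class of interest."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cluster, present in zip(cluster_labels, has_class):
        totals[cluster] += 1
        positives[cluster] += int(present)
    return {cluster: positives[cluster] / totals[cluster] for cluster in totals}

cluster_labels = [0, 0, 0, 1, 1, 2]
has_head = [True, False, False, True, True, False]
rates = class_presence_by_cluster(cluster_labels, has_head)
print(rates)
```

Large differences between the per-cluster rates would point to a background bias for that class.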

In [None]:
# Short interpretation for the PCA + KMeans background clustering plot
interpretation = (
    "The points are grouped into three clusters (KMeans was run with n_clusters=3), suggesting the dataset contains a few dominant background/color groups. "
    "Clusters overlap somewhat, so some images share similar visual characteristics. "
    "If certain labels (e.g., helmetless riders) are concentrated in one cluster, this indicates background bias; collect or augment images from under-represented clusters to improve generalization."
)
print(interpretation)