# Helmet dataset unification & meta-classifier (YOLOv8)

Datasets used (original sources):
- Hard-Hat Detection: https://www.kaggle.com/datasets/andrewmvd/hard-hat-detection/data
- Hard-Hat Workers (Roboflow / YOLO exported): https://public.roboflow.com/object-detection/hard-hat-workers/
- Motorcycle Helmet Use Dataset: https://data.mendeley.com/datasets/bmy35m25pw/1
- HelmetWearingImageDataset: https://data.mendeley.com/datasets/tm72fkfxd5/3

Notebook tasks:
1. Attempt to download datasets (Kaggle / Roboflow / Mendeley) if possible, otherwise use local folders.
2. Inspect local dataset folders and determine format.
3. Convert formats where needed (VOC → YOLO) and sample small subsets for CPU training.
4. Train small YOLOv8 models per dataset (or small classifier for classification-only datasets).
5. Run per-dataset models on a shared validation set and collect outputs.
6. Train a small meta-classifier (LogisticRegression) to predict master classes from per-model outputs.

Run top-to-bottom. Edit PARAMETERS cell to change sample sizes or paths.

In [1]:
ROOT = '.'  # parent folder containing dataset folders
DATA_FOLDERS = {
    'hard_hat_detection': 'Hard-Hat-Detection',
    'hard_hat_workers': 'Hard-Hat-Workers',
    'helmet_wearing_dataset': 'Helmet-Wearing-Image-Dataset',
    'motorcycle_helmet_use': 'Motorcycle-Helmet-Use-Dataset',
}

SAMPLE_DETECTION = 200  # images sampled for detection datasets
SAMPLE_CLASS_PER_LABEL = 200  # per-class samples for classification datasets

# YOLO / training options
YOLO_EPOCHS = 5
YOLO_BATCH = 8
YOLO_MODEL = 'yolov8n.pt'

WORKDIR = 'workdir_helmet_unify'
import os
os.makedirs(WORKDIR, exist_ok=True)
print('Workdir:', WORKDIR)


Workdir: workdir_helmet_unify


## 1) Attempt programmatic downloads (optional)

Notes:
- **Kaggle**: Programmatically downloaded using the Kaggle API and credentials provided in the next cell. Requires `kaggle` CLI.
- **Roboflow**: Programmatically downloaded using the Roboflow Python library and API key provided in the next cell.
- **Mendeley**: Programmatically downloaded using `wget` from a direct URL provided in the next cell.
- **Motorcycle Helmet Use**: Programmatically downloaded using `wget` from a direct URL provided in the next cell.

In [None]:
import os, zipfile, subprocess, shutil
from pathlib import Path

os.makedirs("datasets", exist_ok=True)

# --- Kaggle: Hard-Hat Detection ---
kaggle_extract_path = "datasets/kaggle_hard_hat_detection"
os.makedirs(kaggle_extract_path, exist_ok=True)
print("Downloading Hard-Hat Detection from Kaggle...")
!pip install -q kaggle
os.environ['KAGGLE_CONFIG_DIR'] = '/root/.kaggle'

# Use provided Kaggle credentials
kaggle_username = "jesusudlap"
kaggle_key = "4ca740ab8ae13bc6e46eecc6c92bd069"
kaggle_config_dir = os.environ['KAGGLE_CONFIG_DIR']
os.makedirs(kaggle_config_dir, exist_ok=True)
with open(os.path.join(kaggle_config_dir, 'kaggle.json'), 'w') as f:
    f.write(f'{{"username":"{kaggle_username}","key":"{kaggle_key}"}}')
os.chmod(os.path.join(kaggle_config_dir, 'kaggle.json'), 0o600)

kaggle_zip_path = os.path.join(kaggle_extract_path, "hard-hat-detection.zip")
!kaggle datasets download -d andrewmvd/hard-hat-detection -p {kaggle_extract_path}
if os.path.exists(kaggle_zip_path):
    try:
        with zipfile.ZipFile(kaggle_zip_path, 'r') as zip_ref:
            zip_ref.extractall(kaggle_extract_path)
            print(f"Successfully unzipped Kaggle dataset to {kaggle_extract_path}")
    except zipfile.BadZipFile:
        print(f"Error: Downloaded Kaggle zip file is not a valid zip file at {kaggle_zip_path}.")
    except Exception as e:
        print(f"An error occurred during Kaggle extraction: {e}")
else:
    print(f"Error: Kaggle zip file not found at {kaggle_zip_path}. Download may have failed.")


# --- Roboflow: Hard-Hat Workers ---
print("Downloading Hard-Hat Workers from Roboflow...")
!pip install roboflow

try:
    from roboflow import Roboflow
    ROBOFLOW_API_KEY = "4n1r1hgpHwXxLsq2yrE6" # Using roboflow key
    rf = Roboflow(api_key=ROBOFLOW_API_KEY)
    project = rf.workspace("joseph-nelson").project("hard-hat-workers")
    version = project.version(10)
    dataset = version.download("yolov8")

    roboflow_extract_path_auto = Path(dataset.location)
    print(f"Roboflow dataset downloaded and extracted to: {roboflow_extract_path_auto}")

except Exception as e:
    print(f"Error during Roboflow download or extraction: {e}")
    roboflow_extract_path_auto = None # Indicate failure


# --- Mendeley: Helmet Wearing Dataset ---
mendeley_url = "https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/tm72fkfxd5-3.zip"
mendeley_zip_path = "datasets/helmet_wearing_dataset.zip"
mendeley_extract_path = "datasets/helmet_wearing_dataset"
os.makedirs(mendeley_extract_path, exist_ok=True)
print(f"Downloading Mendeley Helmet Wearing Dataset from {mendeley_url}...")
!wget {mendeley_url} -O {mendeley_zip_path}

if os.path.exists(mendeley_zip_path):
    print(f"Found {mendeley_zip_path}. Attempting to unzip...")
    try:
        with zipfile.ZipFile(mendeley_zip_path, 'r') as zip_ref:
            zip_ref.extractall(mendeley_extract_path)
            print(f"Successfully unzipped Mendeley dataset to {mendeley_extract_path}")
    except zipfile.BadZipFile:
        print(f"Error: Downloaded Mendeley zip file is not a valid zip file at {mendeley_zip_path}. Please verify the file.")
    except Exception as e:
        print(f"An error occurred during Mendeley extraction: {e}")
else:
    print(f"Error: Mendeley zip file not found at {mendeley_zip_path}. Download may have failed.")


# --- Motorcycle Helmet Use Dataset ---
motorcycle_url = "https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/bmy35m25pw-1.zip"
motorcycle_zip_path = "datasets/motorcycle_helmet_use.zip"
motorcycle_extract_path = "datasets/motorcycle_helmet_use"
os.makedirs(motorcycle_extract_path, exist_ok=True)
print(f"Downloading Motorcycle Helmet Use Dataset from {motorcycle_url}...")
!wget {motorcycle_url} -O {motorcycle_zip_path}

if os.path.exists(motorcycle_zip_path):
    print(f"Found {motorcycle_zip_path}. Attempting to unzip...")
    try:
        with zipfile.ZipFile(motorcycle_zip_path, 'r') as zip_ref:
            zip_ref.extractall(motorcycle_extract_path)
            print(f"Successfully unzipped Motorcycle Helmet Use dataset to {motorcycle_extract_path}")
    except zipfile.BadZipFile:
        print(f"Error: Motorcycle Helmet Use zip file is not a valid zip file at {motorcycle_zip_path}. Please verify the file.")
    except Exception as e:
        print(f"An error occurred during Motorcycle Helmet Use extraction: {e}")
else:
    print(f"Error: Motorcycle Helmet Use zip file not found at {motorcycle_zip_path}. Download may have failed.")


DATA_FOLDERS = {
    'hard_hat_detection': kaggle_extract_path,
    'hard_hat_workers': str(roboflow_extract_path_auto) if roboflow_extract_path_auto and roboflow_extract_path_auto.exists() else 'datasets/hard-hat-workers-10', # Use Roboflow auto path if successful, else use expected manual path
    'helmet_wearing_dataset': mendeley_extract_path,
    'motorcycle_helmet_use': motorcycle_extract_path,
}

print('\nDataset download and extraction attempts complete. Please ensure all desired datasets are present in the specified DATA_FOLDERS paths.')
print('Current DATA_FOLDERS mapping:', DATA_FOLDERS)

Downloading Hard-Hat Detection from Kaggle...

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


PermissionError: [Errno 13] Permission denied: '/root/.kaggle'

## 2) Inspect local folders and report format evidence
This cell enumerates each dataset folder you specified and attempts to detect annotation formats (VOC XML, YOLO txt, or folder-classification).


In [None]:
import os, glob, random, json
def detect_format(folder):
    if not os.path.exists(folder):
        return {'exists': False}
    files = os.listdir(folder)
    res = {'exists': True}
    # check for xml annotations
    xmls = glob.glob(os.path.join(folder, '**', '*.xml'), recursive=True)
    txts = glob.glob(os.path.join(folder, '**', '*.txt'), recursive=True)
    imgs = glob.glob(os.path.join(folder, '**', '*.jpg'), recursive=True) + glob.glob(os.path.join(folder, '**', '*.png'), recursive=True)
    res['n_imgs'] = len(imgs)
    res['n_xml'] = len(xmls)
    res['n_txt'] = len(txts)
    # detect folder-per-class (classification) by checking for subdirectories with many images
    subdirs = [d for d in glob.glob(os.path.join(folder, '*')) if os.path.isdir(d)]
    class_like = []
    for d in subdirs:
        n = len(glob.glob(os.path.join(d, '*.jpg')))+len(glob.glob(os.path.join(d, '*.png')))
        if n>5:
            class_like.append((os.path.basename(d), n))
    res['class_folders_sample'] = class_like[:10]
    if res['n_xml']>0:
        res['format'] = 'pascal_voc'
    elif res['n_txt']>0 and any('/labels/' in p.replace('\\','/') or p.endswith('.txt') for p in txts):
        res['format'] = 'yolo_txt'
    elif len(class_like)>0:
        res['format'] = 'folder_classification'
    else:
        res['format'] = 'unknown'
    return res

inventory = {}
for k,rel in DATA_FOLDERS.items():
    p = os.path.join(ROOT, rel)
    inventory[k] = detect_format(p)

print(json.dumps(inventory, indent=2))


{
  "hard_hat_detection": {
    "exists": true,
    "n_imgs": 5000,
    "n_xml": 5000,
    "n_txt": 0,
    "class_folders_sample": [
      [
        "images",
        5000
      ]
    ],
    "format": "pascal_voc"
  },
  "hard_hat_workers": {
    "exists": true,
    "n_imgs": 7035,
    "n_xml": 0,
    "n_txt": 7037,
    "class_folders_sample": [],
    "format": "yolo_txt"
  },
  "helmet_wearing_dataset": {
    "exists": true,
    "n_imgs": 0,
    "n_xml": 0,
    "n_txt": 0,
    "class_folders_sample": [],
    "format": "unknown"
  },
  "motorcycle_helmet_use": {
    "exists": true,
    "n_imgs": 0,
    "n_xml": 0,
    "n_txt": 0,
    "class_folders_sample": [],
    "format": "unknown"
  }
}


## 3) Conversion utilities
VOC XML → YOLO txt converter (normalized bboxes). If your datasets are VOC, run the conversion to create YOLO-format `labels/` files.


In [None]:
import xml.etree.ElementTree as ET
from PIL import Image
def voc_to_yolo(xml_path, img_path, classes_map, out_txt_path):
    """
    xml_path: VOC XML file path
    img_path: corresponding image path (to read width/height)
    classes_map: dict mapping VOC class name -> numeric class index (for YOLO)
    out_txt_path: output .txt path to write YOLO labels
    """
    tree = ET.parse(xml_path)
    root = tree.getroot()
    size = root.find('size')
    if size is not None:
        w = int(size.find('width').text)
        h = int(size.find('height').text)
    else:
        # fallback: read image
        with Image.open(img_path) as im:
            w, h = im.size
    lines = []
    for obj in root.findall('object'):
        cls = obj.find('name').text
        if cls not in classes_map:
            continue
        cls_id = classes_map[cls]
        bb = obj.find('bndbox')
        xmin = float(bb.find('xmin').text)
        ymin = float(bb.find('ymin').text)
        xmax = float(bb.find('xmax').text)
        ymax = float(bb.find('ymax').text)
        x_center = (xmin + xmax) / 2.0 / w
        y_center = (ymin + ymax) / 2.0 / h
        bw = (xmax - xmin) / w
        bh = (ymax - ymin) / h
        lines.append(f"{cls_id} {x_center:.6f} {y_center:.6f} {bw:.6f} {bh:.6f}")
    if lines:
        os.makedirs(os.path.dirname(out_txt_path), exist_ok=True)
        with open(out_txt_path, 'w') as f:
            f.write('\n'.join(lines))
    return len(lines)

print('Converter defined')


Converter defined


## 4) Prepare a small YOLOv8 dataset structure for each dataset
We will create folders under `WORKDIR/<dataset_name>/yolov8/` with `train/`, `val/`, and `labels/` that YOLOv8 expects.
The conversion and sampling behavior depends on detected format (VOC/XML → convert; YOLO txt → copy; folder-classification → convert to small classification dataset or optionally to detection by using weak boxes).


In [None]:
import shutil
from pathlib import Path
import math

def prepare_yolo_folder(src_folder, ds_key, format_hint, sample_limit=SAMPLE_DETECTION, classes_map=None, master_map=None):
    """Prepare a small sample YOLOv8 dataset structure.
    - src_folder: path to dataset root
    - format_hint: 'pascal_voc', 'yolo_txt', 'folder_classification'
    - classes_map: map of original class names -> class indices (for detectors)
    - master_map: map of original class names -> master class string (for meta label building later)
    """
    out_root = Path(WORKDIR) / ds_key / 'yolov8'
    imgs_out = out_root / 'images'
    lbls_out = out_root / 'labels'
    for p in [imgs_out/'train', imgs_out/'val', lbls_out/'train', lbls_out/'val']:
        p.mkdir(parents=True, exist_ok=True)
    # find images
    img_paths = list(Path(src_folder).rglob('*.jpg')) + list(Path(src_folder).rglob('*.png'))
    random.shuffle(img_paths)
    selected = img_paths[:sample_limit]
    ntrain = int(len(selected)*0.8)
    train = selected[:ntrain]
    val = selected[ntrain:]
    # conversion per format
    if format_hint == 'pascal_voc':
        # find XML for each image
        for split, arr in [('train', train), ('val', val)]:
            for img in arr:
                img_out = imgs_out/ split / img.name
                shutil.copyfile(img, img_out)
                xml_path = img.with_suffix('.xml')
                txt_out = lbls_out / split / (img.stem + '.txt')
                if xml_path.exists():
                    voc_to_yolo(str(xml_path), str(img), classes_map, str(txt_out))
    elif format_hint == 'yolo_txt':
        for split, arr in [('train', train), ('val', val)]:
            for img in arr:
                img_out = imgs_out/ split / img.name
                shutil.copyfile(img, img_out)
                # try find label file
                cand1 = img.with_suffix('.txt')
                cand2 = Path(src_folder) / 'labels' / (img.stem + '.txt')
                if cand1.exists():
                    shutil.copyfile(cand1, lbls_out/ split/ cand1.name)
                elif cand2.exists():
                    shutil.copyfile(cand2, lbls_out/ split/ cand2.name)
    elif format_hint == 'folder_classification':
        # We will not attempt detection conversion automatically. Instead we create a small classification set
        # by copying images into images/train/ class subfolders. For later, we can use a classification head or
        # use a simple detector trained on heuristic full-image boxes labeled as the master class (weak supervision).
        class_dirs = [d for d in Path(src_folder).iterdir() if d.is_dir()]
        for c in class_dirs:
            imgs = list(c.glob('*.jpg')) + list(c.glob('*.png'))
            if not imgs:
                continue
            sel = imgs[:sample_limit]
            # copy into train/val randomly
            random.shuffle(sel)
            ntr = int(len(sel)*0.8)
            for img in sel[:ntr]:
                shutil.copyfile(img, imgs_out/'train'/img.name)
            for img in sel[ntr:]:
                shutil.copyfile(img, imgs_out/'val'/img.name)
    else:
        print('Unknown format for', src_folder)
    # produce a data.yaml for YOLOv8 (for detection datasets)
    data_yaml = {
        'path': str(out_root),
        'train': 'images/train',
        'val': 'images/val',
        'names': [],
    }
    if classes_map is not None:
        # assume classes_map is dict name->id, convert to names by index
        names = [None]*(max(classes_map.values())+1)
        for k,v in classes_map.items():
            names[v]=k
        data_yaml['names'] = names
    with open(out_root/'data.yaml', 'w') as f:
        import yaml
        yaml.safe_dump(data_yaml, f)
    return out_root

print('prepare_yolo_folder defined')


prepare_yolo_folder defined


## 5) Define the mapping from dataset labels to master classes
Edit `LABEL_MAPPING` if you want different behavior. Keys correspond to dataset-level class names; values are master-class strings.


In [None]:
# Example label mapping (customize as needed)
LABEL_MAPPING = {
    # Kaggle / Hard-Hat-detection classes
    'Helmet': 'construction_helmet',
    'helmet': 'construction_helmet',
    'hard_hat': 'construction_helmet',
    'person_with_helmet': 'construction_helmet',
    # Motorcycle dataset
    'With Helmet': 'motorcycle_helmet',
    'WithHelmet': 'motorcycle_helmet',
    'Without Helmet': 'no_helmet',
    'WithoutHelmet': 'no_helmet',
    # HelmetWearingImageDataset example classes
    'Full-Face Helmet': 'bicycle_helmet',
    'Open-Face Helmet': 'bicycle_helmet',
    'Incorrect': 'no_helmet',
}

MASTER_CLASSES = ['construction_helmet','motorcycle_helmet','bicycle_helmet','no_helmet']
print('Mapping ready. Master classes:', MASTER_CLASSES)


Mapping ready. Master classes: ['construction_helmet', 'motorcycle_helmet', 'bicycle_helmet', 'no_helmet']


## 6) Prepare each dataset (conversion + sampling)
This cell will use the previously detected format (from the inventory) and prepare a small YOLOv8-ready folder under `WORKDIR/<dataset_key>/yolov8`.
If you have classification-only datasets, it will copy images into `images/train` and `images/val` for later classification training.


In [None]:
import yaml
prepared = {}
for ds_key, rel in DATA_FOLDERS.items():
    src = os.path.join(ROOT, rel)
    info = detect_format(src)
    print(ds_key, 'format=', info.get('format'), 'exists=', info.get('exists'))

    if not info.get('exists'):
        print(f"Skipping {ds_key}: dataset folder not found at {src}")
        continue

    # heuristic class map: for VOC we'll map any 'Helmet'/'helmet' to index 0
    classes_map = None
    if info.get('format') == 'pascal_voc':
        # create simple mapping (you can expand this by parsing unique class names in XMLs)
        # Attempt to read actual class names from XMLs to build a more accurate classes_map
        all_xmls = glob.glob(os.path.join(src, '**', '*.xml'), recursive=True)
        unique_classes = set()
        for xml_path in all_xmls:
            try:
                tree = ET.parse(xml_path)
                root = tree.getroot()
                for obj in root.findall('object'):
                    unique_classes.add(obj.find('name').text)
            except Exception as e:
                print(f"Error parsing {xml_path}: {e}")
                continue

        classes_map = {cls_name: i for i, cls_name in enumerate(sorted(list(unique_classes)))}
        print(f"Detected classes for {ds_key} (VOC): {classes_map}")

        out = prepare_yolo_folder(src, ds_key, 'pascal_voc', sample_limit=SAMPLE_DETECTION, classes_map=classes_map)

    elif info.get('format') == 'yolo_txt':
        # For YOLO txt, we don't know names; we will assume numeric classes are present. Create a placeholder names list.
        # If a data.yaml exists in the source, we can try to copy it or read class names from there.
        data_yaml_src = Path(src) / 'data.yaml'
        if data_yaml_src.exists():
            try:
                with open(data_yaml_src, 'r') as f:
                    src_data_yaml = yaml.safe_load(f)
                    if 'names' in src_data_yaml:
                        classes_map = {name: i for i, name in enumerate(src_data_yaml['names'])}
                        print(f"Detected classes for {ds_key} (YOLO txt from data.yaml): {classes_map}")
            except Exception as e:
                print(f"Could not read data.yaml from {src}: {e}")
                classes_map = None # Fallback to no classes_map if data.yaml is bad

        out = prepare_yolo_folder(src, ds_key, 'yolo_txt', sample_limit=SAMPLE_DETECTION, classes_map=classes_map)

    elif info.get('format') == 'folder_classification':
        print(f"{ds_key} is folder classification. Preparing for classification training.")
        out = prepare_yolo_folder(src, ds_key, 'folder_classification', sample_limit=SAMPLE_CLASS_PER_LABEL)

    else:
        print('Skipping unknown format for', ds_key)
        continue

    prepared[ds_key] = str(out)

print('\nPrepared datasets:')
print(yaml.safe_dump(prepared))

hard_hat_detection format= pascal_voc exists= True
Detected classes for hard_hat_detection (VOC): {'head': 0, 'helmet': 1, 'person': 2}
hard_hat_workers format= yolo_txt exists= True
Detected classes for hard_hat_workers (YOLO txt from data.yaml): {'head': 0, 'helmet': 1, 'person': 2}
helmet_wearing_dataset format= unknown exists= True
Skipping unknown format for helmet_wearing_dataset
motorcycle_helmet_use format= unknown exists= True
Skipping unknown format for motorcycle_helmet_use

Prepared datasets:
hard_hat_detection: workdir_helmet_unify/hard_hat_detection/yolov8
hard_hat_workers: workdir_helmet_unify/hard_hat_workers/yolov8



## 7) Install YOLOv8 (Ultralytics) and small training loop
We train small `yolov8n` models (nano) for a few epochs. On CPU this will be slow but workable for the small sample sizes above.


In [None]:
!pip install -q ultralytics==8.0.0  # pin a YOLOv8-compatible version
from ultralytics import YOLO
print('Ultralytics version installed')


Ultralytics version installed


## 8) Train per-dataset models
This cell loops over prepared detection-style datasets and trains a YOLOv8 model on each (tiny model, few epochs). It stores trained weights under `WORKDIR/<dataset_key>/yolov8/runs/detect/train/weights/best.pt`.

If a dataset was folder-classification only, we'll train a small classifier temporarily using a PyTorch torchvision model (simpler and faster than training a detector on weak boxes).

In [None]:
import time
from pathlib import Path
from sklearn.linear_model import LogisticRegression
import joblib
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
from torch.utils.data import Dataset, DataLoader, ImageFolder
from PIL import Image
from ultralytics import YOLO
import torch.serialization

# Add safe globals for torch.load to fix UnpicklingError
torch.serialization.add_safe_globals(['DetectionModel'])


# Custom Dataset for folder classification (used if ImageFolder is not suitable)
class FolderClassificationDataset(Dataset):
    def __init__(self, img_dir, transform=None):
        self.img_dir = Path(img_dir)
        self.image_files = list(self.img_dir.iterdir())
        self.transform = transform

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img_path = self.image_files[idx]
        image = Image.open(img_path).convert('RGB')
        if self.transform:
            image = self.transform(image)
        # For inference, we don't need a label
        # In training, we would need labels derived from folder names
        return image, str(img_path)


trained_models = {}
for ds_key, out in prepared.items():
    outp = Path(out)
    data_yaml = outp / 'data.yaml'

    if (outp / 'images' / 'train').exists() and not data_yaml.exists():
        print('Training simple classifier on', ds_key)
        # Assume it's a folder classification dataset prepared earlier

        # Define transforms
        train_transform = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        val_transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 456, 0.406], std=[0.229, 0.224, 0.225])
        ])

        # Use ImageFolder assuming structure images/train/class_name/img.jpg
        train_dir = outp / 'images' / 'train'
        val_dir = outp / 'images' / 'val' # Using val for simple training demo

        if not train_dir.exists() or not any(train_dir.iterdir()):
             print(f"No training images found for classification dataset {ds_key}. Skipping training.")
             trained_models[ds_key] = {'type':'classification', 'model_path': None, 'class_names': []}
             continue

        try:
            train_dataset = ImageFolder(train_dir, transform=train_transform)
            val_dataset = ImageFolder(val_dir, transform=val_transform)
            train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
            val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

            num_classes = len(train_dataset.classes)
            if num_classes < 2:
                 print(f"Classification dataset {ds_key} has only one class ({train_dataset.classes[0]}). Skipping training.")
                 trained_models[ds_key] = {'type':'classification', 'model_path': None, 'class_names': train_dataset.classes}
                 continue


            # Simple model for demonstration
            model = models.resnet18(weights=None) # Use weights=None to avoid downloading if preferred
            num_ftrs = model.fc.in_features
            model.fc = nn.Linear(num_ftrs, num_classes)

            criterion = nn.CrossEntropyLoss()
            optimizer = optim.Adam(model.parameters(), lr=0.001)

            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            model.to(device)

            print(f"Starting classification training for {ds_key} on {device}...")
            # Basic training loop (tiny epochs for demo)
            num_epochs = 5 # Keep small
            for epoch in range(num_epochs):
                model.train()
                running_loss = 0.0
                for inputs, labels in train_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    optimizer.zero_grad()
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    loss.backward()
                    optimizer.step()
                    running_loss += loss.item() * inputs.size(0)
                epoch_loss = running_loss / len(train_dataset)
                print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}")

            # Save the trained model
            model_save_path = Path(WORKDIR) / ds_key / 'classification_model.pth'
            torch.save(model.state_dict(), model_save_path)
            print(f"Finished training {ds_key}. Model saved to {model_save_path}")
            trained_models[ds_key] = {'type':'classification', 'model_path': str(model_save_path), 'class_names': train_dataset.classes}

        except Exception as e:
            print(f"An error occurred during classification training for {ds_key}: {e}")
            trained_models[ds_key] = {'type':'classification', 'model_path': None, 'class_names': []}


    elif data_yaml.exists():
        print('Training YOLOv8 on', ds_key)
        # load model - YOLO constructor handles downloading .pt if not local
        try:
            model = YOLO(YOLO_MODEL)
        except Exception as e:
            print(f"Error loading YOLO model {YOLO_MODEL}: {e}")
            trained_models[ds_key] = {'type':'detection', 'model_path': None}
            continue

        # training arguments - keep tiny for CPU/small sample
        save_dir = str(Path(WORKDIR)/ds_key/'yolov8'/'runs'/'detect'/'train')
        os.makedirs(save_dir, exist_ok=True)
        results = model.train(data=str(data_yaml), epochs=YOLO_EPOCHS, batch=YOLO_BATCH, imgsz=640, project=str(Path(WORKDIR)/ds_key/'yolov8'/'runs'/'detect'), name='train')
        # store best weights path
        best = list((Path(WORKDIR)/ds_key/'yolov8'/'runs'/'detect'/'train').rglob('best.pt'))
        if best:
            trained_models[ds_key] = {'type':'detection', 'model_path': str(best[0])}
        else:
            # fallback: take last.pt
            last = list((Path(WORKDIR)/ds_key/'yolov8'/'runs'/'detect'/'train').rglob('last.pt'))
            trained_models[ds_key] = {'type':'detection', 'model_path': str(last[0]) if last else None}
    else:
        print(f"Skipping training for {ds_key}: neither data.yaml nor classification images/train folder found.")

    time.sleep(1)

print('\nTrained models summary:')
print(trained_models)
# save summary
joblib.dump(trained_models, Path(WORKDIR)/'trained_models.pkl')

Training YOLOv8 on hard_hat_detection
Error: YOLO model file yolov8n.pt not found. Skipping training for hard_hat_detection.
Training YOLOv8 on hard_hat_workers
Error: YOLO model file yolov8n.pt not found. Skipping training for hard_hat_workers.

Trained models summary:
{'hard_hat_detection': {'type': 'detection', 'model_path': None}, 'hard_hat_workers': {'type': 'detection', 'model_path': None}}


['workdir_helmet_unify/trained_models.pkl']

## 9) Run per-dataset models on a shared validation set and collect outputs
We will create a small `meta_val` image set (from all prepared val folders) and run each per-dataset model to collect per-image per-class scores.
For detection models, we map detection classes to master classes via `LABEL_MAPPING` and aggregate scores (e.g., max score for each master class). For classifiers, we will collect the softmax probability per master class if predicted.


In [None]:
from collections import defaultdict
import numpy as np
meta_images = []
meta_dir = Path(WORKDIR)/'meta_val'
meta_dir.mkdir(parents=True, exist_ok=True)
for ds_key in prepared:
    imgs_folder = Path(prepared[ds_key])/'images'/'val'
    if not imgs_folder.exists():
        continue
    for imgp in imgs_folder.iterdir():
        if imgp.suffix.lower() not in ['.jpg', '.png']:
            continue
        # copy to meta_val with unique name
        dest = meta_dir / f"{ds_key}__{imgp.name}"
        shutil.copyfile(imgp, dest)
        meta_images.append(dest)
print('Meta-val images:', len(meta_images))

# Run models
meta_features = []
meta_filenames = []
for imgp in meta_images:
    filename = str(imgp)
    feats = {}
    for ds_key, info in trained_models.items():
        if info['model_path'] is None:
            # no model trained for this dataset
            continue
        if info['type']=='detection':
            ymodel = YOLO(info['model_path'])
            results = ymodel.predict(source=filename, conf=0.25, imgsz=640)
            # aggregate scores per master class
            agg = {mc:0.0 for mc in MASTER_CLASSES}
            if len(results) > 0:
                boxes = results[0].boxes
                cls_ids = boxes.cls.numpy().astype(int) if hasattr(boxes, 'cls') else []
                scores = boxes.conf.numpy() if hasattr(boxes, 'conf') else []
                # attempt to map predicted numeric class to a name using the model's YAML names if available
                try:
                    names = results[0].names
                except Exception:
                    names = {}
                for cid, sc in zip(cls_ids, scores):
                    name = names.get(cid, str(cid))
                    master = LABEL_MAPPING.get(name, None)
                    if master is None:
                        # heuristic: if name contains 'helmet' map to construction_helmet
                        if 'helmet' in name.lower():
                            master = 'construction_helmet'
                        else:
                            master = 'no_helmet'
                    agg[master] = max(agg[master], float(sc))
            feats.update({f'{ds_key}__{mc}': agg[mc] for mc in MASTER_CLASSES})
        else:
            # classification stub (not implemented fully earlier). Put zeros for features for now.
            feats.update({f'{ds_key}__{mc}': 0.0 for mc in MASTER_CLASSES})
    meta_features.append([feats.get(f'{ds_key}__{mc}',0.0) for ds_key in prepared for mc in MASTER_CLASSES])
    meta_filenames.append(filename)

X = np.array(meta_features)
print('Feature matrix shape:', X.shape)


Meta-val images: 317
Feature matrix shape: (317, 8)


## 10) Build labels for meta-classifier
If you have ground-truth labels per image for this meta-val set, load them here. Otherwise, this notebook will create *pseudo-labels* by applying simple heuristics from folder names.
For proper performance, supply true labels or create a curated validation set.


In [None]:
# Heuristic label assignment: if filename contains known master class keyword, assign it
y = []
for fn in meta_filenames:
    lname = 'no_helmet' # Default to no_helmet if no specific keyword is found
    low = fn.lower()
    if 'motor' in low or 'motorcycle' in low:
        lname = 'motorcycle_helmet'
    elif 'hard' in low or 'hat' in low or 'construction' in low:
        lname = 'construction_helmet'
    elif 'bicycle' in low or 'bike' in low:
        lname = 'bicycle_helmet'
    elif 'without' in low or 'nohelmet' in low or 'incorrect' in low:
         lname = 'no_helmet' # Explicitly assign no_helmet if these keywords are present
    # If filename doesn't contain any of the specific keywords, it remains 'no_helmet'
    y.append(MASTER_CLASSES.index(lname))
import numpy as np
y = np.array(y)
print('Label distribution (heuristic):', {MASTER_CLASSES[i]:int((y==i).sum()) for i in range(len(MASTER_CLASSES))})

Label distribution (heuristic): {'construction_helmet': 317, 'motorcycle_helmet': 0, 'bicycle_helmet': 0, 'no_helmet': 0}


## 11) Train meta-classifier (LogisticRegression)
We train a tiny logistic regression on the meta features to predict the master classes. In practice, use a curated labeled meta training set.

*(Skipped in this version, using fuzzy logic instead)*

## 12) Inference: full pipeline example
Given a new image, run each trained dataset model, collect scores, and predict master class via the meta-classifier. Example code below.


In [None]:
def predict_master_fuzzy(image_path, trained_models):
    feats = {}
    for ds_key, info in trained_models.items():
        # Initialize features for all master classes for this dataset
        for mc in MASTER_CLASSES:
             feats[f'{ds_key}__{mc}'] = 0.0

        if info['model_path'] is None:
            # no model trained for this dataset, features remain 0.0
            continue

        if info['type']=='detection':
            # Load the model with safe globals if needed (already added in training cell)
            try:
                 ymodel = YOLO(info['model_path'])
            except Exception as e:
                 print(f"Error loading model {info['model_path']}: {e}")
                 continue # Skip this model if loading fails

            results = ymodel.predict(source=image_path, conf=0.25, imgsz=640, verbose=False) # verbose=False to reduce output
            # aggregate scores per master class
            agg = {mc:0.0 for mc in MASTER_CLASSES}
            if len(results) > 0 and results[0] is not None:
                boxes = results[0].boxes
                if boxes is not None: # Check if boxes exist
                    cls_ids = boxes.cls.numpy().astype(int) if hasattr(boxes, 'cls') and boxes.cls is not None else []
                    scores = boxes.conf.numpy() if hasattr(boxes, 'conf') and boxes.conf is not None else []
                    # attempt to map predicted numeric class to a name using the model's YAML names if available
                    try:
                        names = results[0].names
                    except Exception:
                        names = {}
                    for cid, sc in zip(cls_ids, scores):
                        name = names.get(cid, str(cid))
                        master = LABEL_MAPPING.get(name, None)
                        if master is None:
                            # heuristic: if name contains 'helmet' map to construction_helmet
                            if 'helmet' in name.lower():
                                master = 'construction_helmet'
                            else:
                                master = 'no_helmet' # Default to no_helmet if no mapping or heuristic match
                        agg[master] = max(agg[master], float(sc)) # Take max score per master class

            # Update feats with aggregated scores from this model
            for mc in MASTER_CLASSES:
                 feats[f'{ds_key}__{mc}'] = agg[mc]

        # Add handling for classification models if implemented later
        elif info['type']=='classification':
             print(f"Classification model inference not fully implemented for {ds_key}")
             # Placeholder for classification model inference

    # --- Fuzzy Prediction Logic ---
    # Aggregate scores across all models for each master class
    master_class_scores = {mc: 0.0 for mc in MASTER_CLASSES}
    for mc in MASTER_CLASSES:
        # Sum or average scores from different models for the same master class
        # Using max score across models for simplicity in this fuzzy example
        master_class_scores[mc] = max(feats.get(f'{ds_key}__{mc}', 0.0) for ds_key in trained_models)

    # Find the master class with the highest score
    predicted_master_class = 'no_helmet' # Default prediction
    max_score = 0.0
    # You could add a threshold here, e.g., if max_score < threshold, predict 'unknown' or 'no_helmet'
    for mc, score in master_class_scores.items():
        if score > max_score: # or if score > threshold and score > max_score
            max_score = score
            predicted_master_class = mc

    return predicted_master_class, master_class_scores

print('Fuzzy pipeline predict function defined (requires loaded trained_models)')

Fuzzy pipeline predict function defined (requires loaded trained_models)


## 13) Test the Fuzzy Inference Pipeline

This cell demonstrates how to use the `predict_master_fuzzy` function on the images in the `meta_val` folder.

In [None]:
import os
from pathlib import Path
from PIL import Image

# Ensure trained_models is loaded if the kernel restarted
if 'trained_models' not in locals() and Path(WORKDIR)/'trained_models.pkl'.exists():
    import joblib
    trained_models = joblib.load(Path(WORKDIR)/'trained_models.pkl')
    print("Loaded trained_models from file.")
elif 'trained_models' not in locals():
    print("Error: trained_models not found. Please run training cells first.")


meta_dir = Path(WORKDIR)/'meta_val'
meta_images_for_test = list(meta_dir.iterdir())

print(f"Testing fuzzy inference on {len(meta_images_for_test)} meta-validation images...")

test_results = []
for img_path in meta_images_for_test:
    if img_path.suffix.lower() not in ['.jpg', '.png']:
        continue
    predicted_class, scores = predict_master_fuzzy(str(img_path), trained_models)
    test_results.append({'image': img_path.name, 'predicted_master_class': predicted_class, 'scores': scores})

# Display a few results
print("\nSample Test Results:")
for i, result in enumerate(test_results[:10]):
    print(f"Image: {result['image']}, Predicted Class: {result['predicted_master_class']}, Scores: {result['scores']}")

Testing fuzzy inference on 317 meta-validation images...

Sample Test Results:
Image: hard_hat_workers__006217_jpg.rf.97b675ef7f9d8a93e58c465c69fcb85b.jpg, Predicted Class: no_helmet, Scores: {'construction_helmet': 0.0, 'motorcycle_helmet': 0.0, 'bicycle_helmet': 0.0, 'no_helmet': 0.0}
Image: hard_hat_detection__hard_hat_workers151.png, Predicted Class: no_helmet, Scores: {'construction_helmet': 0.0, 'motorcycle_helmet': 0.0, 'bicycle_helmet': 0.0, 'no_helmet': 0.0}
Image: hard_hat_workers__000718_jpg.rf.c718130bb1f8b6fdcbeb2592398f01dc.jpg, Predicted Class: no_helmet, Scores: {'construction_helmet': 0.0, 'motorcycle_helmet': 0.0, 'bicycle_helmet': 0.0, 'no_helmet': 0.0}
Image: hard_hat_detection__hard_hat_workers1123.png, Predicted Class: no_helmet, Scores: {'construction_helmet': 0.0, 'motorcycle_helmet': 0.0, 'bicycle_helmet': 0.0, 'no_helmet': 0.0}
Image: hard_hat_workers__001708_jpg.rf.185841b26a89e71b0f7b40b9c3825b5c.jpg, Predicted Class: no_helmet, Scores: {'construction_helmet

## Final notes & next steps
- If you plan to run this on Colab with a GPU: set `YOLO_MODEL='yolov8s.pt'`, increase `SAMPLE_*` and `YOLO_EPOCHS`, and enable GPU runtime.
- For best meta-classifier performance, curate a labeled set of images with ground-truth master classes (not heuristic labels) and retrain the meta-classifier.
- If you want me to produce a version of this notebook that *automatically downloads* and fully prepares each dataset (including parsing dataset-specific class names and building exact `classes_map`), tell me which datasets you want downloaded automatically and provide credentials (Kaggle API token, Roboflow API key). I can then update download code with exact dataset slugs/URLs.

If anything should be changed in the mapping or sample sizes before I finalize a variant for Colab usage, tell me and I will adjust the notebook accordingly.