
# 🏷️ From Unlabeled Images to YOLO Dataset

You have only **unlabeled images**. This notebook helps you go from raw images → labeled YOLO dataset via two paths:

**A) Manual labeling (recommended for custom objects):**
- Use `labelImg` to draw boxes and save YOLO TXT files.

**B) Assisted labeling (faster for common objects):**
- Use a pretrained `yolov8n.pt` (COCO) to create **pseudo-labels** you can then correct in `labelImg`.

In both cases you'll end up with the YOLO layout:
```
data/
├─ images/{train,val}
└─ labels/{train,val}
```
Then you can run the training notebook.
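
In this layout each image is paired with a label file of the same stem under `labels/`, and each line of that file describes one box as `class_id x_center y_center width height`, with the four coordinates normalized to the image size. The sketch below illustrates that pairing. `label_path_for` and `parse_label_line` are hypothetical helper names used only for illustration, not part of any library.

```python
from pathlib import Path


def label_path_for(image_path):
    """Map .../images/<split>/name.ext to .../labels/<split>/name.txt."""
    parts = list(Path(image_path).parts)
    # Find the last path component named "images" (raises ValueError if absent).
    idx = len(parts) - 1 - parts[::-1].index("images")
    parts[idx] = "labels"
    return Path(*parts).with_suffix(".txt")


def parse_label_line(line):
    """Split one YOLO TXT line into (class_id, x_center, y_center, width, height)."""
    class_id, xc, yc, w, h = line.split()
    return int(class_id), float(xc), float(yc), float(w), float(h)


print(label_path_for("data/images/train/img_001.jpg"))
print(parse_label_line("0 0.5 0.5 0.25 0.1"))
```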


## 0) Configure where your raw images are and your target classes

In [None]:

from pathlib import Path
import os, shutil, random, yaml

# === EDIT THESE ===
RAW_IMAGES_DIR = Path("raw_images")  # folder containing ONLY your unlabeled images
DATA_ROOT = Path("data")             # where we will create YOLO folders
NAMES = ["class_0"]                  # your target class names, e.g., ["helmet", "vest"]

# Create folder structure
(IMAGES_TRAIN := DATA_ROOT/"images"/"train").mkdir(parents=True, exist_ok=True)
(IMAGES_VAL := DATA_ROOT/"images"/"val").mkdir(parents=True, exist_ok=True)
(LABELS_TRAIN := DATA_ROOT/"labels"/"train").mkdir(parents=True, exist_ok=True)
(LABELS_VAL := DATA_ROOT/"labels"/"val").mkdir(parents=True, exist_ok=True)

print("RAW_IMAGES_DIR:", RAW_IMAGES_DIR.resolve())
print("DATA_ROOT:", DATA_ROOT.resolve())
print("Classes:", NAMES)


## 1) Split raw images into train/val sets

In [None]:

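import random
import shutil
from pathlib import Path

# Choose split ratio, and a seed so re-running the cell reproduces the same split
VAL_RATIO = 0.2
SEED = 42

# Match extensions case-insensitively (e.g. .JPG from cameras); sort for a stable order
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}
all_imgs = sorted(
    p for p in RAW_IMAGES_DIR.rglob("*")
    if p.is_file() and p.suffix.lower() in IMAGE_EXTS
)

random.Random(SEED).shuffle(all_imgs)
val_count = int(len(all_imgs) * VAL_RATIO)
val_imgs = all_imgs[:val_count]
train_imgs = all_imgs[val_count:]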

def copy_to(srcs, dest_dir: Path):
    for p in srcs:
        dest = dest_dir / p.name
        if dest.resolve() != p.resolve():
            shutil.copy2(p, dest)

copy_to(train_imgs, IMAGES_TRAIN)
copy_to(val_imgs, IMAGES_VAL)

print(f"Copied {len(train_imgs)} images to {IMAGES_TRAIN}")
print(f"Copied {len(val_imgs)} images to {IMAGES_VAL}")
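
Because the copy step flattens subfolders into one directory and label files are matched by file stem, two raw images with the same stem (for example `a/img1.jpg` and `b/img1.jpg`, or `img3.jpg` and `img3.png`) would end up sharing a single image or label file. Re-running the split with a different shuffle or an updated image folder can also leave the same file in both `train` and `val`. The following is a minimal sketch of both checks; `find_stem_collisions` and `split_overlap` are hypothetical helper names.

```python
from collections import defaultdict
from pathlib import Path


def find_stem_collisions(paths):
    """Group paths by file stem and return only the stems used more than once."""
    by_stem = defaultdict(list)
    for p in paths:
        p = Path(p)
        by_stem[p.stem].append(p)
    return {stem: ps for stem, ps in by_stem.items() if len(ps) > 1}


def split_overlap(train_paths, val_paths):
    """Return the file names that appear in both splits, sorted."""
    train_names = {Path(p).name for p in train_paths}
    val_names = {Path(p).name for p in val_paths}
    return sorted(train_names & val_names)
```

In this notebook you could call `find_stem_collisions(all_imgs)` before copying, and `split_overlap(IMAGES_TRAIN.iterdir(), IMAGES_VAL.iterdir())` afterwards; both should come back empty.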


## 2) Prepare `labelImg` for **manual** labeling

In [None]:

# A small helper: create 'classes.txt' label list for labelImg to show your classes
classes_txt = DATA_ROOT / "classes.txt"
classes_txt.write_text("\n".join(NAMES), encoding="utf-8")
print("Wrote", classes_txt.resolve())

print("""
Manual steps (do this once):
1) Install labelImg (one of):
   - Windows: pip install labelImg  (or clone https://github.com/heartexlabs/labelImg)
   - macOS/Linux: pip install labelImg
2) Launch it with the image folder and your class list, so class ids follow the order in NAMES:
   labelImg data/images/train data/classes.txt
3) In labelImg:
   - Open Dir:   data/images/train
   - Change Save Dir: data/labels/train
   - Click the format button in the left toolbar (it shows 'PascalVOC' by default) until it reads 'YOLO'
   - Draw boxes and assign your class name(s)
   - Save (this writes one .txt per image into data/labels/train)
4) Repeat for val split: Open Dir = data/images/val, Save Dir = data/labels/val
""")

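Manual labeling usually happens over several sessions, so it helps to see which images still have no label file. The sketch below lists them by comparing file stems between an image folder and a label folder. `unlabeled_images` and `IMAGE_EXTS` are hypothetical names, and the extension set simply mirrors the ones this notebook searches for.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def unlabeled_images(img_dir, lbl_dir):
    """Return the images in img_dir that have no matching .txt file in lbl_dir."""
    img_dir, lbl_dir = Path(img_dir), Path(lbl_dir)
    labeled = {p.stem for p in lbl_dir.glob("*.txt")}
    return sorted(
        p for p in img_dir.iterdir()
        if p.suffix.lower() in IMAGE_EXTS and p.stem not in labeled
    )
```

For example, `unlabeled_images("data/images/train", "data/labels/train")` returns the training images that still need boxes. An image that genuinely contains no objects can be given an empty `.txt` file so it stops showing up here.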


## 3) (Optional) Assisted labeling — create pseudo-labels using YOLOv8 (COCO)

This is useful if your objects overlap with **COCO classes** (e.g., person, car, dog...).  
We will run `yolov8n.pt` to **auto‑annotate** your images, filter to your target classes if they exist in COCO, and save YOLO TXT files you can fix in `labelImg`.
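
The detector reports each box as absolute pixel corners `(x1, y1, x2, y2)`, while YOLO TXT files store a normalized center and size. The conversion is shown below as a standalone sketch; `xyxy_to_yolo` and `format_label_line` are hypothetical helper names, and the sample box is made up for illustration.

```python
def xyxy_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert absolute corner coordinates to normalized (xc, yc, w, h)."""
    xc = ((x1 + x2) / 2.0) / img_w
    yc = ((y1 + y2) / 2.0) / img_h
    bw = (x2 - x1) / img_w
    bh = (y2 - y1) / img_h
    return xc, yc, bw, bh


def format_label_line(class_id, box):
    """Render one YOLO TXT line with six decimals per coordinate."""
    xc, yc, bw, bh = box
    return f"{class_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}"


# A 200x100 px box centered in a 400x200 px image.
box = xyxy_to_yolo(100, 50, 300, 150, img_w=400, img_h=200)
print(format_label_line(0, box))  # 0 0.500000 0.500000 0.500000 0.500000
```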


In [None]:

# Only run this if your target classes are in COCO (or you want a starting point to correct).
from ultralytics import YOLO

# COCO class names used by yolov8n.pt
COCO_NAMES = [
    'person','bicycle','car','motorcycle','airplane','bus','train','truck','boat','traffic light',
    'fire hydrant','stop sign','parking meter','bench','bird','cat','dog','horse','sheep','cow',
    'elephant','bear','zebra','giraffe','backpack','umbrella','handbag','tie','suitcase','frisbee',
    'skis','snowboard','sports ball','kite','baseball bat','baseball glove','skateboard','surfboard',
    'tennis racket','bottle','wine glass','cup','fork','knife','spoon','bowl','banana','apple',
    'sandwich','orange','broccoli','carrot','hot dog','pizza','donut','cake','chair','couch',
    'potted plant','bed','dining table','toilet','tv','laptop','mouse','remote','keyboard','cell phone',
    'microwave','oven','toaster','sink','refrigerator','book','clock','vase','scissors','teddy bear',
    'hair drier','toothbrush'
]

# Map your NAMES to COCO ids if they exist
name_to_coco = {n:i for i,n in enumerate(COCO_NAMES)}
valid_targets = [n for n in NAMES if n in name_to_coco]
print("Will pseudo-label these classes (present in COCO):", valid_targets)

if len(valid_targets) == 0:
    print("None of your classes are in COCO; skip pseudo-labeling and use manual labeling.")
else:
    model = YOLO("yolov8n.pt")  # downloads if missing

    def run_pseudo_for_split(split):
        img_dir = (DATA_ROOT/"images"/split)
        lbl_dir = (DATA_ROOT/"labels"/split)
        lbl_dir.mkdir(parents=True, exist_ok=True)
        img_paths = [p for p in img_dir.iterdir() if p.is_file()]
        print(f"Predicting on {split} ({len(img_paths)} images)")
        # stream=True yields one Results object per image instead of holding them all in memory
        res = model.predict(source=str(img_dir), save=False, conf=0.25, iou=0.45,
                            imgsz=640, verbose=False, stream=True)
        for r in res:
            p = Path(r.path)  # image path
            out_txt = lbl_dir / (p.stem + ".txt")
            if out_txt.exists() and out_txt.stat().st_size > 0:
                continue  # keep labels you already drew or corrected
            h, w = r.orig_shape  # original image size as (height, width)
            boxes = r.boxes
            lines = []
            if boxes is not None and len(boxes) > 0:
                for b in boxes:
                    cls = int(b.cls.item())
                    cls_name = COCO_NAMES[cls]
                    if cls_name not in valid_targets:
                        continue  # skip classes you don't want
                    # convert absolute xyxy pixels to YOLO-normalized center and size
                    x1, y1, x2, y2 = b.xyxy[0].tolist()
                    xc = ((x1 + x2)/2.0) / w
                    yc = ((y1 + y2)/2.0) / h
                    bw = (x2 - x1) / w
                    bh = (y2 - y1) / h
                    # remap class index into your NAMES list
                    out_cid = NAMES.index(cls_name)
                    lines.append(f"{out_cid} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
            # Write label file (even if empty so you know to check it)
            with open(out_txt, "w", encoding="utf-8") as f:
                f.write("\n".join(lines))
        print("Done:", split)

    run_pseudo_for_split("train")
    run_pseudo_for_split("val")
    print("Pseudo-labels written. Open labelImg to **review and correct** them.")
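
Before training on pseudo-labels or hand-corrected labels, a quick format check can catch malformed lines early. The sketch below validates the text of one label file: five fields per line, an integer class id within range, and coordinates in `[0, 1]`. `validate_label_text` is a hypothetical helper name, and the checks are a reasonable minimum rather than an exhaustive validator.

```python
def validate_label_text(text, num_classes):
    """Return a list of problems found in the contents of one YOLO TXT file."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        fields = line.split()
        if len(fields) != 5:
            problems.append(f"line {lineno}: expected 5 fields, got {len(fields)}")
            continue
        try:
            class_id = int(fields[0])
            coords = [float(v) for v in fields[1:]]
        except ValueError:
            problems.append(f"line {lineno}: non-numeric value")
            continue
        if not 0 <= class_id < num_classes:
            problems.append(f"line {lineno}: class id {class_id} out of range")
        if any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append(f"line {lineno}: coordinate outside [0, 1]")
    return problems
```

In this notebook it could be applied to every file under `data/labels/` with `validate_label_text(path.read_text(), len(NAMES))`, skipping any `classes.txt` that the labeling tool may have written there.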


## 4) Write a `data.yaml` for training later

In [None]:

data_yaml = {
    "path": str(DATA_ROOT.resolve()),
    "train": str((DATA_ROOT/"images"/"train").resolve()),
    "val":   str((DATA_ROOT/"images"/"val").resolve()),
    "names": NAMES,
    "nc": len(NAMES),
}
config_dir = Path("configs")
config_dir.mkdir(parents=True, exist_ok=True)
yaml_file = config_dir / "data.yaml"
with open(yaml_file, "w", encoding="utf-8") as f:
    yaml.safe_dump(data_yaml, f, sort_keys=False)
print("Wrote", yaml_file.resolve())
print(yaml.safe_dump(data_yaml, sort_keys=False))
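
A quick sanity check on the config can save a failed training run later. The sketch below takes the config as a plain dict with the same keys written above (`path`, `train`, `val`, `names`, `nc`) and reports mismatches. `check_dataset_config` is a hypothetical helper name, and the rule that `train` and `val` are resolved against `path` is an assumption that matches how this notebook builds them.

```python
from pathlib import Path


def check_dataset_config(cfg):
    """Return a list of problems found in a dataset config dict."""
    problems = []
    names = cfg.get("names") or []
    if "nc" in cfg and cfg["nc"] != len(names):
        problems.append(f"nc={cfg['nc']} but {len(names)} names are listed")
    root = Path(cfg.get("path", "."))
    for key in ("train", "val"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
            continue
        # Joining with an absolute entry keeps the absolute entry as-is.
        split_dir = root / cfg[key]
        if not split_dir.is_dir():
            problems.append(f"{key}: {split_dir} is not a directory")
    return problems
```

Here you could pass the in-memory `data_yaml` dict directly, or the result of `yaml.safe_load` on `configs/data.yaml`; an empty list means the basic structure looks consistent.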
