# YOLO experimentation

This notebook generates synthetic training data and trains an Ultralytics YOLO model to produce oriented bounding boxes (OBB) around Pokémon cards in an image.

### notes

ONNX format: potentially faster inference, and usable from many libraries and runtimes. See https://docs.ultralytics.com/integrations/onnx/


Possible dataset improvements:

- cut out random shapes from around the edge
- glare that extends beyond the card
- colored rectangles around card to emulate sleeves
- tight grids to simulate binders
- higher frequency of full art/full card designs
- incomplete cards that clip off screen
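As an illustration of the sleeve idea from the list above, here is a minimal sketch that surrounds an RGBA card array with an opaque colored border. The function name `add_sleeve`, the border width, and the colors are all hypothetical and are not part of the notebook's pipeline. Whether the label should then cover the sleeve or only the card is left as an open design choice.

```python
import numpy as np

def add_sleeve(card_rgba: np.ndarray, border: int, color: tuple[int, int, int]) -> np.ndarray:
    """Return a new RGBA array with an opaque colored border around the card.

    Hypothetical helper: transparent card pixels (e.g. rounded corners)
    let the sleeve color show through.
    """
    h, w = card_rgba.shape[:2]
    sleeved = np.empty((h + 2 * border, w + 2 * border, 4), dtype=np.uint8)
    sleeved[..., :3] = color  # fill the whole canvas with the sleeve color
    sleeved[..., 3] = 255     # the sleeve is fully opaque

    # copy only the opaque card pixels on top of the sleeve
    region = sleeved[border:border + h, border:border + w]
    opaque = card_rgba[..., 3:4] > 0
    region[...] = np.where(opaque, card_rgba, region)
    return sleeved

# demo card: gray, with a transparent 17x17 top-left corner
card = np.full((825, 600, 4), (200, 200, 200, 255), dtype=np.uint8)
card[:17, :17] = 0
sleeved = add_sleeve(card, border=12, color=(30, 60, 200))
print(sleeved.shape)
```

In the notebook this would be applied before the perspective transform, after converting between `PIL.Image` and NumPy with `np.array(img)` and `Image.fromarray(arr)`.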

### setup

In [None]:
!pip install ultralytics "onnx>=1.12.0,<2.0.0" "onnxslim>=0.1.71" onnxruntime-gpu



In [3]:
# imports
import os
import random
import zipfile
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
import math
from ultralytics import YOLO
from PIL import Image, ImageStat, ImageEnhance, ImageDraw, ImageFilter
import tqdm
import numpy as np
import cv2

# Dataset creation

In [4]:
# configure dataset settings here

DATASET_PATH = "../datasets/YOLO_training"
OBJECTS_PATH = "../datasets/pokemon/data/images" # the objects you want the model to detect
OBJECTS_HEIGHT = 825
OBJECTS_WIDTH = 600
OBJECTS_ASPECT_RATIO = OBJECTS_HEIGHT / OBJECTS_WIDTH
IMAGE_DIMENSION = 640

# dataset generation settings
AMOUNT_TRAIN = 500
AMOUNT_VAL = 25
MAX_CARDS_PER_IMAGE = 9
ALLOW_OVERLAP = False
BLACK_AND_WHITE = False
OBSTRUCTIONS = False

# training settings
EPOCHS = 50
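To see how these settings bound card size, the following sketch repeats the sizing arithmetic that `create_test_item` performs later in the notebook: the image is split into a square grid, and the card's long side is limited so that its diagonal still fits inside one grid cell at any rotation. The helper name `max_card_height` is an assumption made for this example.

```python
import math

IMAGE_DIMENSION = 640
OBJECTS_HEIGHT, OBJECTS_WIDTH = 825, 600

def max_card_height(card_amount: int) -> int:
    """Largest long-side length (px) whose diagonal fits in one grid cell."""
    grid = math.ceil(math.sqrt(card_amount))  # cells per row/column
    cell = IMAGE_DIMENSION / grid             # cell edge length in pixels
    diagonal = math.hypot(OBJECTS_HEIGHT, OBJECTS_WIDTH)
    return math.floor(cell * max(OBJECTS_HEIGHT, OBJECTS_WIDTH) / diagonal)

for n in (1, 4, 9):
    print(n, max_card_height(n))
# 1 517
# 4 258
# 9 172
```

With `MAX_CARDS_PER_IMAGE = 9` the smallest upper bound is 172 px, which stays above the 50 px lower bound used when sampling card heights.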


### download images to use as backgrounds

In [5]:
url = "http://images.cocodataset.org/zips/val2017.zip"
output_path = DATASET_PATH + "/backgrounds/val2017.zip" 

# create directory
os.makedirs(os.path.dirname(output_path), exist_ok=True)

if not os.path.exists(DATASET_PATH + "/backgrounds/val2017/"):
    # download and extract
    print("Downloading background images...")
    urllib.request.urlretrieve(url, output_path)
    print("Extracting")
    with zipfile.ZipFile(output_path, 'r') as zip_ref:
        zip_ref.extractall(DATASET_PATH + "/backgrounds/")

    # cleanup
    os.remove(output_path)
print("Background images ready.")

Background images ready.


### Remove empty directories from objects path

This is needed because random images are found by picking random directories, which fails if an empty directory is selected.

In [8]:
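# remove all empty directories in OBJECTS_PATH (bottom-up, so parents emptied along the way are removed too)
for dirpath, _dirnames, _filenames in os.walk(OBJECTS_PATH, topdown=False):
    # re-list the directory: the names yielded by os.walk predate any removals made in this loop
    if dirpath != OBJECTS_PATH and not os.listdir(dirpath):
        os.rmdir(dirpath)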


### Utility functions

In [11]:
def load_random_background() -> str:
    bg_path = DATASET_PATH + "/backgrounds/val2017/"
    bg_image = random.choice(os.listdir(bg_path))
    return bg_path + bg_image

def load_random_card() -> str:
    # images may be in subdirectories, so descend until a file is reached
    path = OBJECTS_PATH
    while os.path.isdir(path):
        entries = os.listdir(path)
        if not entries:
            # return the first card that is downloaded in case all of the cards haven't been downloaded yet
            return os.path.join(OBJECTS_PATH, "base1", "base1-1.jpg")
        path = os.path.join(path, random.choice(entries))
    return path

def normalize_homography(H):
    """Normalize so H[2,2] == 1"""
    return H / H[2, 2]

def compose(*matrices):
    """
    Compose transforms left → right.
    compose(A, B, C) means: C @ B @ A
    """
    H = np.eye(3)
    for M in matrices:
        H = M @ H
    return normalize_homography(H)

def translate(tx, ty):
    return np.array([
        [1, 0, tx],
        [0, 1, ty],
        [0, 0, 1]
    ], dtype=float)


def rotate(theta_rad, center=None):
    c, s = math.cos(theta_rad), math.sin(theta_rad)
    R = np.array([
        [ c, -s, 0],
        [ s,  c, 0],
        [ 0,  0, 1]
    ], dtype=float)

    if center is None:
        return R

    cx, cy = center
    return compose(
        translate(-cx, -cy),
        R,
        translate(cx, cy)
    )

def perspective_from_corners(src_pts, dst_pts):
    """
    src_pts, dst_pts: lists of 4 (x, y) pairs
    Order must be consistent (e.g. TL, TR, BR, BL)
    """
    A = []
    b = []

    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([x, y, 1, 0, 0, 0, -u*x, -u*y])
        A.append([0, 0, 0, x, y, 1, -v*x, -v*y])
        b.append(u)
        b.append(v)

    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)

    h = np.linalg.solve(A, b)

    H = np.array([
        [h[0], h[1], h[2]],
        [h[3], h[4], h[5]],
        [h[6], h[7], 1.0]
    ])

    return normalize_homography(H)

def to_pillow_perspective(H):
    """
    Convert 3x3 homography to Pillow's 8-tuple
    """
    H = normalize_homography(H)
    return (
        H[0,0], H[0,1], H[0,2],
        H[1,0], H[1,1], H[1,2],
        H[2,0], H[2,1],
    )

def get_corners_after_transform(start_w, start_h, H: np.ndarray) -> list:
    corners = [
        (0, 0),
        (start_w, 0),
        (start_w, start_h),
        (0, start_h)
    ]
    transformed_corners = []
    for x, y in corners:
        vec = np.array([x, y, 1], dtype=float)
        tx, ty, tz = H @ vec
        transformed_corners.append((float(tx / tz), float(ty / tz)))
    return transformed_corners

def mesh_distort(img: Image.Image) -> Image.Image:
    np_img = np.array(img)
    rgb = np_img[..., :3]
    alpha = np_img[..., 3]

    cv_rgb = cv2.cvtColor(rgb, cv2.COLOR_RGBA2RGB)
    h, w = cv_rgb.shape[:2]

    focal_length = w
    camera_matrix = np.array([
        [focal_length, 0, w / 2],
        [0, focal_length, h / 2],
        [0, 0, 1]
    ], dtype="float32")

    randodist_coeffs = np.array([
        random.uniform(-0.05, 0.05),   # k1
        random.uniform(-0.02, 0.02),   # k2
        random.uniform(-0.005, 0.005), # p1
        random.uniform(-0.005, 0.005), # p2
        random.uniform(-0.01, 0.01)    # k3
    ], dtype=np.float32)

    new_camera_mtx, _ = cv2.getOptimalNewCameraMatrix(
        camera_matrix,
        randodist_coeffs,
        (w, h),
        alpha=0
    )

    map1, map2 = cv2.initUndistortRectifyMap(
        camera_matrix,
        randodist_coeffs,
        None,
        new_camera_mtx,
        (w, h),
        cv2.CV_32FC1
    )

    distorted_rgb = cv2.remap(
        cv_rgb,
        map1,
        map2,
        interpolation=cv2.INTER_LINEAR,
        borderMode=cv2.BORDER_CONSTANT,
        borderValue=(0, 0, 0)
    )

    distorted_alpha = cv2.remap(
        alpha,
        map1,
        map2,
        interpolation=cv2.INTER_LINEAR,
        borderMode=cv2.BORDER_CONSTANT,
        borderValue=0
    )

    # Stack channels
    rgba = np.dstack([distorted_rgb, distorted_alpha])

    # find border pixels (non-zero pixels beside zero pixels or edge of image)
    opaque = distorted_alpha != 0
    # pad with False so pixels on the image edge count as having an empty neighbour
    padded = np.pad(opaque, 1, constant_values=False)
    all_neighbours_opaque = (
        padded[1:-1, :-2] & padded[1:-1, 2:] &  # left, right
        padded[:-2, 1:-1] & padded[2:, 1:-1]    # up, down
    )
    border_mask = opaque & ~all_neighbours_opaque
    
    # expand mask to neighboring pixels to ensure no gaps
    kernel = np.ones((3, 3), dtype=np.uint8)
    expanded_mask = cv2.dilate(border_mask.astype(np.uint8), kernel, iterations=1).astype(bool)
    # set border pixels half transparent to create a smoother transition
    distorted_alpha[expanded_mask] = np.minimum(distorted_alpha[expanded_mask], 128)
    # update the distorted image with the new alpha values
    rgba[..., 3] = distorted_alpha

    # create final image with smoothed borders
    distorted_pil = Image.fromarray(rgba, mode="RGBA")

    # apply blur to entire image
    distorted_pil = distorted_pil.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1)))

    return distorted_pil
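The corner-based homography and the direction Pillow expects can be checked with a small standalone example. This is a sketch, not the notebook's code: `homography_from_corners` repeats the linear system from `perspective_from_corners`, `apply_homography` is a hypothetical helper, and the destination corners are made-up values. Pillow's perspective transform takes coefficients that map each output pixel back to an input position, which is why the notebook passes the inverse matrix `H_inv` to `to_pillow_perspective`.

```python
import numpy as np

def homography_from_corners(src_pts, dst_pts):
    """Solve the 8x8 linear system for the homography mapping src_pts to dst_pts."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs += [u, v]
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)  # H[2, 2] is fixed to 1

def apply_homography(H, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return (x / w, y / w)

# card corners in TL, TR, BR, BL order, and made-up warped positions
src = [(0, 0), (600, 0), (600, 825), (0, 825)]
dst = [(40, 30), (570, 10), (590, 800), (20, 780)]

H = homography_from_corners(src, dst)
H_inv = np.linalg.inv(H)

for s, d in zip(src, dst):
    # forward: source corner to warped corner (used for the labels)
    assert np.allclose(apply_homography(H, s), d, atol=1e-6)
    # inverse: output pixel back to source pixel (what Pillow samples from)
    assert np.allclose(apply_homography(H_inv, d), s, atol=1e-6)
```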


### Image creation


In [12]:
def create_test_item():
    shapes = []

    # smaller cards if more cards per image
    card_amount = random.randint(1, MAX_CARDS_PER_IMAGE)
    x = IMAGE_DIMENSION / math.ceil(math.sqrt(card_amount))
    # assuming a card is rotated so its diagonal is aligned with the grid, what is the max height so the card still fits
    max_card_size = math.floor((x * (OBJECTS_HEIGHT if OBJECTS_HEIGHT > OBJECTS_WIDTH else OBJECTS_WIDTH)) / (math.sqrt(OBJECTS_HEIGHT**2 + OBJECTS_WIDTH**2)))

    # use python image library to paste cards onto background
    
    # load background 
    bg_image_path = load_random_background()
    bg_image = Image.open(bg_image_path).convert("RGB")
    if BLACK_AND_WHITE:
        bg_image = bg_image.convert("L").convert("RGB")
    bg_image = bg_image.resize((IMAGE_DIMENSION, IMAGE_DIMENSION))

    # calculate average brightness of background
    bg_image_gray = bg_image.convert("L")
    stat = ImageStat.Stat(bg_image_gray)
    bg_brightness = stat.mean[0]

    # calculate image white balance
    # stat = ImageStat.Stat(bg_image)
    # r_avg, g_avg, b_avg = stat.mean
    # gray_avg = (r_avg + g_avg + b_avg) / 3
    # r_ratio = gray_avg / (r_avg + 1e-5)
    # g_ratio = gray_avg / (g_avg + 1e-5)
    # b_ratio = gray_avg / (b_avg + 1e-5)
    # # bound between 0.5 and 2 to avoid extreme color shifts
    # r_ratio = max(min(r_ratio, 2), 0.5)
    # g_ratio = max(min(g_ratio, 2), 0.5)
    # b_ratio = max(min(b_ratio, 2), 0.5)

    # paste cards
    for i in range(card_amount):
        card_height = random.uniform(50, max_card_size)
        # load card
        card_image_path = load_random_card()
        try:
            card_image = Image.open(card_image_path).convert("RGBA")
        except Exception:
            # skip cards that are missing or unreadable (e.g. a partial download)
            continue
        if BLACK_AND_WHITE:
            card_image = card_image.convert("L").convert("RGBA")
        card_image = card_image.resize((int(card_height / OBJECTS_ASPECT_RATIO), int(card_height)))

        # adjust brightness to match background
        card_image_gray = card_image.convert("L")
        stat = ImageStat.Stat(card_image_gray)
        card_brightness = stat.mean[0]

        brightness_ratio = bg_brightness / (card_brightness + 1e-5)
        enhancer = ImageEnhance.Brightness(card_image)
        card_image = enhancer.enhance(brightness_ratio)

        # adjust white balance to match background
        # r, g, b, a = card_image.split()
        # r = r.point(lambda i: i * r_ratio)
        # g = g.point(lambda i: i * g_ratio)
        # b = b.point(lambda i: i * b_ratio)
        # card_image = Image.merge('RGBA', (r, g, b, a))

        # change black to transparent in edges
        for y in range(card_image.height):
            for x in range(card_image.width):
                if x < 17 and y < 17 or x > card_image.width - 18 and y < 17 or x < 17 and y > card_image.height - 18 or x > card_image.width - 18 and y > card_image.height - 18:
                    r, g, b, a = card_image.getpixel((x, y))
                    if r == 0 and g == 0 and b == 0:
                        card_image.putpixel((x, y), (0, 0, 0, 0))

        # apply obstruction if enabled
        if OBSTRUCTIONS and random.random() < 0.66:
            draw = Image.new('RGBA', card_image.size, (0, 0, 0, 0))
            obstruction_amount = random.randint(0, 10)

            # generate random shapes of varying size, shape, brightness, and alpha
            for _ in range(obstruction_amount):
                obs_w = random.randint(int(card_image.width * 0.1), int(card_image.width * 0.6))
                obs_h = random.randint(int(card_image.height * 0.1), int(card_image.height * 0.6))
                obs_x = random.randint(0, card_image.width - obs_w)
                obs_y = random.randint(0, card_image.height - obs_h)
                shade = random.randint(0, 4) * 255 // 4
                obstruction = Image.new('RGBA', (obs_w, obs_h), (shade, shade, shade, random.randint(150, 255)))
                # remove random shapes from obstruction to make it less blocky
                mask = Image.new('L', (obs_w, obs_h), 255)
                for _ in range(random.randint(5, 15)):
                    shape_w = random.randint(int(obs_w * 0.1), int(obs_w * 0.5))
                    shape_h = random.randint(int(obs_h * 0.1), int(obs_h * 0.5))
                    shape_x = random.randint(0, obs_w - shape_w)
                    shape_y = random.randint(0, obs_h - shape_h)
                    shape = Image.new('L', (shape_w, shape_h), 0)
                    mask.paste(shape, (shape_x, shape_y))
                obstruction.putalpha(mask)
                obstruction = obstruction.rotate(random.uniform(0, 360), expand=True)
                draw.paste(obstruction, (obs_x, obs_y), obstruction)
            card_image = Image.alpha_composite(card_image, draw)


        transformed_max_dim = int(math.ceil(math.sqrt(card_image.width**2 + card_image.height**2)))
        # move to center of image
        translate_mat = translate((transformed_max_dim - card_image.width) / 2, (transformed_max_dim - card_image.height) / 2)
        # rotate card
        rotation_mat = rotate(random.uniform(0, 2 * math.pi), center=(card_image.width / 2, card_image.height / 2))
        # move corners inwards randomly to simulate perspective
        adjust_factor = 0.2
        corner_adjust = perspective_from_corners(
            [
                (0,0), 
                (card_image.width,0), 
                (card_image.width,card_image.height), 
                (0,card_image.height)
            ],
            [
                (random.random()*adjust_factor * card_image.width, random.random()*adjust_factor * card_image.height),
                (card_image.width - (random.random()*adjust_factor * card_image.width), random.random()*adjust_factor * card_image.height),
                (card_image.width - (random.random()*adjust_factor * card_image.width), card_image.height - (random.random()*adjust_factor * card_image.height)),
                (random.random()*adjust_factor * card_image.width, card_image.height - (random.random()*adjust_factor * card_image.height))
            ]
        )

        # compose transformations
        H = compose(
            corner_adjust,
            rotation_mat,
            translate_mat
        )
        H_inv = np.linalg.inv(H)

        # cache starting dimensions
        start_h = card_image.height
        start_w = card_image.width

        # apply perspective transform
        card_image = card_image.transform(
            (transformed_max_dim, transformed_max_dim),
            Image.PERSPECTIVE,
            (to_pillow_perspective(H_inv)),
            resample=Image.BICUBIC,
            fillcolor=(0, 0, 0, 0)
        )

        # apply slight lens distortion
        card_image = mesh_distort(card_image)

        # get transformed corners
        corners = get_corners_after_transform(start_w, start_h, H)

        paste_x = random.randint(0, IMAGE_DIMENSION - max_card_size)
        paste_y = random.randint(0, IMAGE_DIMENSION - max_card_size)

        # position card in grid to avoid overlap
        if not ALLOW_OVERLAP:
            grid_size = math.ceil(math.sqrt(card_amount))
            cell_size = IMAGE_DIMENSION / grid_size
            # free space left in the cell once the rotated card canvas is placed
            slack = max(cell_size - transformed_max_dim, 0)
            paste_x = (i % grid_size) * cell_size + random.uniform(0, slack)
            paste_y = (i // grid_size) * cell_size + random.uniform(0, slack)

            # keep the whole card canvas in bounds
            paste_x = max(min(paste_x, IMAGE_DIMENSION - transformed_max_dim), 0)
            paste_y = max(min(paste_y, IMAGE_DIMENSION - transformed_max_dim), 0)

        # paste card onto background
        bg_image.paste(card_image, (int(paste_x), int(paste_y)), card_image)

        # calculate segment shape
        x1 = corners[0][0] + paste_x
        y1 = corners[0][1] + paste_y
        x2 = corners[1][0] + paste_x
        y2 = corners[1][1] + paste_y
        x3 = corners[2][0] + paste_x
        y3 = corners[2][1] + paste_y
        x4 = corners[3][0] + paste_x
        y4 = corners[3][1] + paste_y


        # draw bounding box for debugging
        # draw = ImageDraw.Draw(bg_image)
        # draw.line([(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x1, y1)], fill=(255, 0, 0), width=2)

        shapes.append([x1/IMAGE_DIMENSION, y1/IMAGE_DIMENSION, x2/IMAGE_DIMENSION, y2/IMAGE_DIMENSION, x3/IMAGE_DIMENSION, y3/IMAGE_DIMENSION, x4/IMAGE_DIMENSION, y4/IMAGE_DIMENSION])
    
    return bg_image, shapes

def save_image_and_labels(image: Image.Image, shapes: list[list[float]], index: int, train: bool = True):
    # create directories if they don't exist
    images_path = DATASET_PATH + ("/images/train" if train else "/images/val")
    labels_path = DATASET_PATH + ("/labels/train" if train else "/labels/val")
    os.makedirs(images_path, exist_ok=True)
    os.makedirs(labels_path, exist_ok=True)

    # validate segments are within bounds
    for shape in shapes:
        for num in shape:
            if float(num) < 0 or float(num) > 1:
                print("out of bounds card")
                # return

    # save image
    image_save_path = os.path.join(images_path, f"image_{index:05d}.jpg")
    image.save(image_save_path)

    # save labels
    label_save_path = os.path.join(labels_path, f"image_{index:05d}.txt")
    with open(label_save_path, 'w') as f:
        for shape in shapes:
            f.write("0 " + " ".join(map(str, shape)) + "\n")

for i in tqdm.tqdm(range(AMOUNT_TRAIN), desc="Generating training dataset"):
    img, shapes = create_test_item()
    save_image_and_labels(img, shapes, i, train=True)

for i in tqdm.tqdm(range(AMOUNT_VAL), desc="Generating validation dataset"):
    img, shapes = create_test_item()
    save_image_and_labels(img, shapes, i, train=False)

yaml_content = f"""
path: {DATASET_PATH}
train: images/train
val: images/val

names:
    0: card
"""

with open(os.path.join(DATASET_PATH, "card_yolo_dataset.yaml"), 'w') as f:
    f.write(yaml_content)

Generating training dataset: 100%|██████████| 500/500 [01:24<00:00,  5.92it/s]
Generating validation dataset: 100%|██████████| 25/25 [00:04<00:00,  5.89it/s]
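The label files written above hold one line per card: the class index followed by four normalized corner points. A quick way to inspect a generated label is sketched below. The helper names `parse_label_line` and `out_of_bounds_points` and the sample line are assumptions made for this example, not part of the notebook.

```python
def parse_label_line(line: str) -> tuple[int, list[tuple[float, float]]]:
    """Split 'class x1 y1 x2 y2 ...' into a class id and a list of (x, y) points."""
    parts = line.split()
    class_id = int(parts[0])
    values = [float(v) for v in parts[1:]]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return class_id, list(zip(values[0::2], values[1::2]))

def out_of_bounds_points(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the points that fall outside the normalized [0, 1] image area."""
    return [(x, y) for x, y in points if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0)]

# made-up label line with one corner slightly past the right edge
sample = "0 0.10 0.20 0.55 0.18 1.02 0.80 0.12 0.82"
class_id, points = parse_label_line(sample)
print(class_id, len(points), out_of_bounds_points(points))
```

Running this over every file in `labels/train` would show how often the "out of bounds card" message corresponds to a corner that is only marginally outside the image.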


# Training

### model download

In [3]:
model = YOLO("yolo26n-seg.pt")

### model training

In [None]:
results = model.train(data=os.path.join(DATASET_PATH, "card_yolo_dataset.yaml"), epochs=EPOCHS, imgsz=IMAGE_DIMENSION, save_period=10, cache=True)

# save model
model.save("card_yolo.pt")

Ultralytics 8.4.6  Python-3.12.12 torch-2.9.0+cpu CPU (11th Gen Intel Core i5-1135G7 @ 2.40GHz)
engine\trainer: agnostic_nms=False, amp=True, angle=1.0, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=True, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=../datasets/YOLO_training\card_yolo_dataset.yaml, degrees=0.0, deterministic=True, device=cpu, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, epochs=50, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.01, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolo26n-seg.pt, momentum=0.937, mosaic=1.0, mul

### model testing

In [8]:
model = YOLO("./card_yolo.pt")

metrics = model.val(data=os.path.join(DATASET_PATH, "card_yolo_dataset.yaml"))

metrics.box.map  # map50-95
metrics.box.map50  # map50
metrics.box.map75  # map75
metrics.box.maps  # a list containing mAP50-95 for each category

Ultralytics 8.4.7  Python-3.12.12 torch-2.9.0+cu130 CUDA:0 (NVIDIA GeForce RTX 3060, 12287MiB)
YOLO26n-seg summary (fused): 139 layers, 2,689,079 parameters, 0 gradients, 9.0 GFLOPs
val: Fast image access  (ping: 0.0±0.0 ms, read: 639.2±169.1 MB/s, size: 68.9 KB)
val: Scanning C:\Code\React\CollectiblesApp\src\ai_dev\datasets\YOLO_training\labels\val.cache... 25 images, 0 backgrounds, 0 corrupt: 100% ━━━━━━━━━━━━ 25/25  0.0s
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95)
                   all         25        129      0.976      0.992      0.994      0.972      0.976      0.992      0.994      0.963
Speed: 4.2ms preprocess, 20.9ms inference, 0.0ms loss, 1.1ms postprocess per image
Results saved to C:\Code\React\CollectiblesApp\src\ai_dev\notebooks\runs\segment\val


array([    0.97155])

### model exporting

In [None]:
model = YOLO("./card_yolo.pt")

model.export(format="onnx", opset=12, dynamic=True)

Ultralytics 8.4.7  Python-3.12.12 torch-2.9.0+cu130 CPU (Intel Core i5-9600K 3.70GHz)
YOLO26n-seg summary (fused): 139 layers, 2,689,079 parameters, 0 gradients, 9.0 GFLOPs

PyTorch: starting from 'card_yolo.pt' with input shape (1, 3, 640, 640) BCHW and output shape(s) ((1, 300, 38), (1, 32, 160, 160)) (6.3 MB)

ONNX: starting export with onnx 1.20.1 opset 22...
ONNX: slimming with onnxslim 0.1.82...
ONNX: export success  2.4s, saved as 'card_yolo.onnx' (10.6 MB)

Export complete (2.9s)
Results saved to C:\Code\React\CollectiblesApp\src\ai_dev\notebooks
Predict:         yolo predict task=segment model=card_yolo.onnx imgsz=640 
Validate:        yolo val task=segment model=card_yolo.onnx imgsz=640 data=../datasets/YOLO_training\card_yolo_dataset.yaml  
Visualize:       https://netron.app


'card_yolo.onnx'
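The export log reports two outputs with shapes `(1, 300, 38)` and `(1, 32, 160, 160)`. The sketch below shows one way such outputs could be turned into per-card masks outside of Ultralytics. It rests on an assumption that is not confirmed by the notebook: that each of the 300 rows is laid out as `[x1, y1, x2, y2, confidence, class, 32 mask coefficients]` and that the second output holds 32 mask prototypes. The function name `decode_segment_output` is hypothetical, and random arrays stand in for real model outputs.

```python
import numpy as np

def decode_segment_output(detections: np.ndarray, protos: np.ndarray, conf_threshold: float = 0.5):
    """Decode one image's outputs under the ASSUMED row layout
    [x1, y1, x2, y2, conf, cls, K mask coefficients].

    detections: (N, 6 + K) array, protos: (K, Hm, Wm) array.
    """
    kept = detections[detections[:, 4] >= conf_threshold]
    boxes = kept[:, :4]
    scores = kept[:, 4]
    class_ids = kept[:, 5].astype(int)
    coeffs = kept[:, 6:]

    k, hm, wm = protos.shape
    # each mask is a linear combination of the prototypes
    logits = coeffs @ protos.reshape(k, hm * wm)
    # sigmoid(logit) > 0.5 is equivalent to logit > 0
    masks = logits.reshape(-1, hm, wm) > 0
    return boxes, scores, class_ids, masks

# random stand-ins with the shapes from the export log (batch dimension removed)
rng = np.random.default_rng(0)
detections = rng.random((300, 38), dtype=np.float32)
protos = rng.standard_normal((32, 160, 160), dtype=np.float32)

boxes, scores, class_ids, masks = decode_segment_output(detections, protos)
print(boxes.shape, masks.shape)
```

This sketch leaves out steps a full pipeline would need, such as cropping each mask to its box and scaling masks from 160x160 back to the input resolution. The layout should be verified against the exported model, for example with the Netron viewer linked in the export log.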