# Understanding the YOLO Algorithm and Fine-Tuning It
____


# Overview of the YOLO Algorithm

The YOLO algorithm performs object detection and image classification. These are usually handled by two separate models that take two passes over the image; YOLO combines them into a single pass, which is where the name "You Only Look Once" comes from. This makes detection fast enough for real-time applications. The output looks like this:

<img src="cover.png" width="400" height="300">

YOLO performs detection and classification in one pass by dividing the image into an S×S grid, like the image below:

<img src="SXS.png" width="300" height="300">

Then, for each cell, two kinds of features are produced: the first describes the object-detection bounding boxes, and the second classifies the contents of the cell. Object detection works as follows. For each cell in the S×S grid we predict a fixed number of bounding boxes. For example, with two boxes per cell we would have two vectors, each with the following data:

                                    [x, y, sqrt(W), sqrt(H), C]

X and Y are the center coordinates of the bounding box, W and H are its width and height, and C is a confidence score representing the model's confidence that an object actually exists in the bounding box.
Additionally, we have a vector with the following data:

                                    [P(c1), P(c2), ... P(cn)]

This gives the probability that the contents of the cell belong to each class. So for each cell we build a vector of length B×5+n, where B is the number of bounding boxes and n is the number of classes in the model. We repeat this for every cell in the S×S grid, giving a final S×S×(B×5+n) output tensor.
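
To make the sizes concrete, here is a small sketch of that arithmetic. It assumes the values used as defaults later in this notebook (a 7×7 grid, 2 boxes per cell, 3 classes); the function name is ours and only for illustration.

```python
def yolo_output_shape(grid_size, num_boxes, num_classes):
    """Return (S, S, B*5 + n): one vector of box and class values per grid cell."""
    # 5 values per box: x, y, sqrt(W), sqrt(H), C
    per_cell = num_boxes * 5 + num_classes
    return (grid_size, grid_size, per_cell)


shape = yolo_output_shape(grid_size=7, num_boxes=2, num_classes=3)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # (7, 7, 13) 637
```

With 20 classes instead of 3, the same call gives 30 values per cell.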

We then use a loss function, explained later in this notebook, to measure how well the bounding boxes locate objects, how well the cells are classified, and how well the two work together.

Because all of this is computed in one pass, YOLO runs quite fast. The original YOLO paper reports its base model running at 45 frames per second and a smaller, faster variant at about 150 frames per second, with less than 25 milliseconds of latency. The best alternative, R-CNN, runs at about 17 frames per second at most in its most optimized version.

In this tutorial, we will show you how to implement YOLOv1, and then how to download the pretrained YOLOv8 model and fine-tune it for your own use case.

# Data Engineering

Gage

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
x, t = None, None  # TODO: load the data; x and t are placeholders until then

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Model Overview / Example Usage

In [1]:
from PIL import Image, ImageDraw

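## Get Image ##
# TODO: load the input into `image` here (e.g. with Image.open); the source image is not specified yet.

## Get Prediction ##
# YOLO is presumably the class from the "CNN Implementation" section, so that cell must be run
# before this one (otherwise: NameError). post_process is a helper that still needs to be written.
model = YOLO()
model.eval()
predictions = model(image)
boxes, scores, classes = post_process(predictions)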


## Draw predictions on original image ##
draw = ImageDraw.Draw(image)
for box in boxes:
    draw.rectangle([(box[0], box[1]), (box[2], box[3])], outline="red")

image.show()

NameError: name 'YOLO' is not defined

Ben

# Loss Function

### Parameters

grid_size: the original image is divided into a grid_size × grid_size grid of cells  
num_boxes: number of bounding boxes predicted in each grid cell  
num_classes: number of classes an object can be identified as  
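
As a sketch of how these parameters determine the layout of one cell's vector: the slicing used in the loss code later in this section implies the order class scores first, then a confidence value followed by four box values for each box. The helper below is hypothetical (not part of the notebook) and just spells out the resulting index ranges.

```python
def cell_layout(num_boxes=2, num_classes=3):
    """Map each part of one cell's vector to its (start, stop) index range."""
    layout = {"class_scores": (0, num_classes)}
    offset = num_classes
    for b in range(1, num_boxes + 1):
        layout[f"confidence_{b}"] = (offset, offset + 1)
        layout[f"box_{b}"] = (offset + 1, offset + 5)  # x, y, w, h
        offset += 5
    layout["length"] = offset  # total values per cell
    return layout


layout = cell_layout()
for name, span in layout.items():
    print(name, span)
```

With the defaults, `box_1` covers indices 4 to 7 and `box_2` covers 9 to 12, which matches the `num_classes + 1:num_classes + 5` and `num_classes + 6:num_classes + 10` slices in the loss function.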

### Intersection over Union Utility Function

The Intersection over Union (IoU) metric measures how well a predicted box matches the annotated (label) box. It is a ratio: the numerator is the area where the predicted box and the label box overlap, and the denominator is the area of their union, that is, the total area covered by either box. A perfect match has an IoU of 1, a partial overlap scores between 0 and 1, and boxes that do not overlap at all score 0.

In [6]:
def intersection_over_union(boxes_preds, boxes_labels):
    """
    Parameters:
        boxes_preds (tensor): Predictions of Bounding Boxes (num_boxes, 4)
        boxes_labels (tensor): Correct labels of Bounding Boxes (num_boxes, 4)
    """

    ## Determine the boundary corners of the box given a box is defined in the params by (x,y,w,h) ##
    box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
    box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
    box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
    box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
    box2_x1 = boxes_labels[..., 0:1] - boxes_labels[..., 2:3] / 2
    box2_y1 = boxes_labels[..., 1:2] - boxes_labels[..., 3:4] / 2
    box2_x2 = boxes_labels[..., 0:1] + boxes_labels[..., 2:3] / 2
    box2_y2 = boxes_labels[..., 1:2] + boxes_labels[..., 3:4] / 2

    combined_x1 = torch.max(box1_x1, box2_x1)
    combined_y1 = torch.max(box1_y1, box2_y1)
    combined_x2 = torch.min(box1_x2, box2_x2)
    combined_y2 = torch.min(box1_y2, box2_y2)

    intersection = (combined_x2 - combined_x1).clamp(0) * (combined_y2 - combined_y1).clamp(0) # Clamp where there is no intersection
    
    box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
    box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))

    return intersection / (box1_area + box2_area - intersection + 1e-6) # include 1e-6 for no division by 0 error
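
To check the numbers by hand, here is a plain-Python sketch of the same computation for a single pair of boxes. It is a separate illustration without tensors (the name `iou_center_format` is ours), using the same (x, y, w, h) center format and the same 1e-6 term in the denominator.

```python
def iou_center_format(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    # Convert center format to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is zero when the boxes do not intersect along an axis
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / (union + 1e-6)


same = iou_center_format((0.5, 0.5, 1.0, 1.0), (0.5, 0.5, 1.0, 1.0))
partial = iou_center_format((0.5, 0.5, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0))
disjoint = iou_center_format((0.5, 0.5, 1.0, 1.0), (3.0, 3.0, 1.0, 1.0))
print(round(same, 3), round(partial, 3), round(disjoint, 3))  # 1.0 0.143 0.0
```

In the partial case the two unit squares share a 0.5 × 0.5 region, so the IoU is 0.25 / (1 + 1 - 0.25) ≈ 0.143.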


### Loss Function for Box Coordinates

In [None]:
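def loss_fn_box_coordinates(predictions, target, grid_size=7, num_boxes=2, num_classes=3):
    """
    Box-coordinate part of the loss (work in progress).
    The slicing below implies each cell is laid out as
    [class scores, C1, x1, y1, w1, h1, C2, x2, y2, w2, h2], with one label box per cell in target.
    """
    ## First calculate IoUs for the two bounding box predictions against the label box
    iou_b1 = intersection_over_union(predictions[..., num_classes + 1:num_classes + 5], target[..., num_classes + 1:num_classes + 5])
    iou_b2 = intersection_over_union(predictions[..., num_classes + 6:num_classes + 10], target[..., num_classes + 1:num_classes + 5])
    ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)

    ## bestbox is 0 or 1: the index of the predicted box with the higher IoU in each cell
    iou_maxes, bestbox = torch.max(ious, dim=0)
    ## exists_box is the label's objectness value for each cell, with a trailing dimension added
    exists_box = target[..., num_classes].unsqueeze(3)
    # TODO: compute and return the coordinate loss using bestbox and exists_box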
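
For reference, the coordinate term of the YOLOv1 loss, which this function appears to be working toward, can be written as below. This is stated from the original YOLO formulation rather than from the code above, which is not yet complete. Here the indicator is 1 when box j in cell i is the one responsible for an object (the role that `exists_box` and `bestbox` play in the code), hatted and unhatted symbols are the predicted and labeled values, and the weight λ_coord is set to 5 in the paper.

$$
\mathcal{L}_{\text{coord}} = \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
$$

The square roots on width and height are why the box vector in the overview is written with sqrt(W) and sqrt(H): the same absolute error matters more for a small box than for a large one.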

# CNN Implementation

In [8]:
class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)
        
    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))

In [None]:
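class YOLO(nn.Module):
    def __init__(self):
        super().__init__()

        # Work in progress: only part of the network is defined so far
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * 4 * 4, 512)

    def forward(self, x):
        # TODO: the forward pass has not been written yet
        raise NotImplementedError("YOLO.forward is not implemented yet")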
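
The class above is unfinished, but its layer sizes already imply a constraint: `fc1` expects 64 * 4 * 4 input features. The sketch below uses the standard output-size formula for convolution and pooling layers to check what spatial size would satisfy that. It assumes `conv3` followed by one `pool` feeds `fc1` directly, which is a guess, since the forward pass is not written; `conv_out_size` is a helper name introduced here.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 8                                              # assumed height/width entering conv3
size = conv_out_size(size, kernel_size=3, padding=1)  # conv3 keeps the size at 8
size = conv_out_size(size, kernel_size=2, stride=2)   # pool halves it to 4
flattened = 64 * size * size                          # 64 output channels from conv3
print(size, flattened)  # 4 1024
```

Under that assumption, the feature map entering `conv3` would need to be 8×8 with 32 channels for the flattened size to match `64 * 4 * 4`.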

Ben

# Training


Gage

# Using YOLOv8 for a Custom Application

The YOLOv8 detection model used here is pretrained on the COCO dataset:
https://cocodataset.org/#home

In [3]:
from ultralytics import YOLO
from IPython.display import display, Image

model = YOLO("yolov8m.pt")

Downloading https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8m.pt to 'yolov8m.pt'...


100%|██████████| 49.7M/49.7M [00:26<00:00, 1.93MB/s]


There are multiple models to choose from. Just remember that bigger models are more accurate but slower, so consider what is best for your specific application:

| Classification | Detection  | Segmentation   | Size   |
|----------------|------------|----------------|--------|
| yolov8n-cls.pt | yolov8n.pt | yolov8n-seg.pt | Nano   |
| yolov8s-cls.pt | yolov8s.pt | yolov8s-seg.pt | Small  |
| yolov8m-cls.pt | yolov8m.pt | yolov8m-seg.pt | Medium |
| yolov8l-cls.pt | yolov8l.pt | yolov8l-seg.pt | Large  |
| yolov8x-cls.pt | yolov8x.pt | yolov8x-seg.pt | Huge   |
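
The file names in the table follow a regular pattern: `yolov8`, a size letter, an optional task suffix, then `.pt`. As a small sketch, a hypothetical helper (not part of the ultralytics package) can build the name from those two choices:

```python
SIZE_LETTERS = ("n", "s", "m", "l", "x")  # nano, small, medium, large, largest
TASK_SUFFIX = {"detection": "", "segmentation": "-seg", "classification": "-cls"}


def checkpoint_name(size_letter, task="detection"):
    """Build a YOLOv8 weights file name such as 'yolov8m.pt' from the table's pattern."""
    if size_letter not in SIZE_LETTERS:
        raise ValueError(f"unknown size letter: {size_letter!r}")
    return f"yolov8{size_letter}{TASK_SUFFIX[task]}.pt"


print(checkpoint_name("m"))                  # yolov8m.pt
print(checkpoint_name("n", "segmentation"))  # yolov8n-seg.pt
```

The returned string can be passed to `YOLO(...)` in the same way as `"yolov8m.pt"` in the cell above.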



Here is how you use the model on a basic image, like a stop sign:  
<img src="Stop.jpg" width="400" height="400">  
But when we pass it a yield sign, it can't detect any object:  
<img src="Yield.jpg" width="400" height="400">


In [13]:
results1 = model.predict("Stop.jpg")
results2 = model.predict("Yield.jpg")


image 1/1 /Users/aidanhousenbold/GHW_DeepLearningProject/Stop.jpg: 448x640 1 car, 2 trucks, 1 stop sign, 142.6ms
Speed: 1.4ms preprocess, 142.6ms inference, 0.6ms postprocess per image at shape (1, 3, 448, 640)

image 1/1 /Users/aidanhousenbold/GHW_DeepLearningProject/Yield.jpg: 448x640 (no detections), 123.3ms
Speed: 0.9ms preprocess, 123.3ms inference, 0.3ms postprocess per image at shape (1, 3, 448, 640)


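As you can see, the yield sign is not detected: COCO has a stop sign class but no yield sign class. The following cell prints the class, box coordinates, and confidence of everything that was detected in each image: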

In [29]:
result = results1[0]
print("stop sign")
# An image with no detections yields an empty set of boxes, so check the length as well
if result.boxes is not None and len(result.boxes) > 0:
    for box in result.boxes:
        class_name = result.names[int(box.cls[0].item())]  # map class index to its name
        cords = box.xyxy[0].tolist()                       # corner format: [x1, y1, x2, y2]
        cords = [round(x) for x in cords]
        conf = round(box.conf[0].item(), 2)
        print("Object type:", class_name)
        print("Coordinates:", cords)
        print("Probability:", conf)
        print("---")
else:
    print("no objects detected")

print("yield sign image")
result = results2[0]
for box in result.boxes:
    class_name = result.names[int(box.cls[0].item())]
    cords = box.xyxy[0].tolist()
    cords = [round(x) for x in cords]
    conf = round(box.conf[0].item(), 2)
    print("Object type:", class_name)
    print("Coordinates:", cords)
    print("Probability:", conf)
    print("---")

stop sign
Object type: stop sign
Coordinates: [120, 16, 165, 63]
Probability: 0.93
---
Object type: truck
Coordinates: [23, 83, 72, 104]
Probability: 0.82
---
Object type: truck
Coordinates: [112, 89, 139, 100]
Probability: 0.45
---
Object type: car
Coordinates: [131, 89, 149, 100]
Probability: 0.45
---
yield sign image
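
Note that the coordinates printed above come from `box.xyxy`, so they are corner coordinates (x1, y1, x2, y2), whereas the `intersection_over_union` function earlier in this notebook expects center format (x, y, w, h). As a sketch, a small conversion (the function name is ours) applied to the stop sign box from the output above:

```python
def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] corners to [x_center, y_center, width, height]."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]


stop_sign = [120, 16, 165, 63]  # stop sign coordinates from the printed output
print(xyxy_to_xywh(stop_sign))  # [142.5, 39.5, 45, 47]
```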


So we can fine-tune the model on a specialized dataset to make it good at traffic sign detection, for applications like self-driving cars:
https://www.kaggle.com/datasets/pkdarabi/cardetection  
The following code trains our model on the signs dataset:


In [43]:
import os
model.train(data="data.yaml", epochs=30)

New https://pypi.org/project/ultralytics/8.2.6 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.2.2 🚀 Python-3.11.8 torch-2.2.0 CPU (Apple M1 Max)
engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=data.yaml, epochs=30, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=train13, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt

train: Scanning /Users/aidanhousenbold/GHW_DeepLearningProject/self/train/labels... 3530 images, 3 backgrounds, 0 corrupt: 100%|██████████| 3530/3530 [00:00<00:00, 4501.32it/s]

train: New cache created: /Users/aidanhousenbold/GHW_DeepLearningProject/self/train/labels.cache

val: Scanning /Users/aidanhousenbold/GHW_DeepLearningProject/self/valid/labels... 801 images, 0 backgrounds, 0 corrupt: 100%|██████████| 801/801 [00:00<00:00, 4445.92it/s]

val: New cache created: /Users/aidanhousenbold/GHW_DeepLearningProject/self/valid/labels.cache

Plotting labels to runs/detect/train13/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000526, momentum=0.9) with parameter groups 77 weight(decay=0.0), 84 weight(decay=0.0005), 83 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs/detect/train13
Starting training for 30 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       1/30         0G      2.578      5.047      2.579         35        640:  13%|█▎        | 28/221 [11:44<1:19:22, 24.67s/it]