In [1]:
import matplotlib.pyplot as plt

## YOLO

In this section, we will explore YOLO, one of the state-of-the-art (SOTA) model families for real-time object detection.

**You Only Look Once: Unified, Real-Time Object Detection**  
[Article](https://arxiv.org/pdf/1506.02640)  
[Presentation](https://www.youtube.com/watch?v=NM6lrxy0bxs&pp=ygUkeW91IG9ubHkgbG9vayBvbmNlIHByZXNlbnRhdGlvbiBjdnBy)

YOLO (You Only Look Once) is a deep neural network architecture initially designed for real-time object detection in images. Classical detectors such as R-CNN, Fast R-CNN, and Faster R-CNN operate in several phases:

1. **Region proposal generation**, using selective search (R-CNN, Fast R-CNN) or a Region Proposal Network (RPN, introduced in Faster R-CNN).
2. **Feature extraction** for each proposed region.
3. **Classification and bounding box refinement** for the detected objects.

Although these methods can achieve high accuracy, they are computationally expensive and often unsuitable for real-time applications.

YOLO follows a unified approach: it divides the image into a grid and processes all cells in a single forward pass to predict both the bounding boxes and the object classes. This integration enables YOLO to achieve high speed without significantly compromising accuracy. Furthermore, YOLO has since been adapted to a wide range of computer vision tasks: current versions support classification, detection, segmentation, object tracking in video, and human pose estimation.

This approach offers several advantages:

- **Very high speed**, making it suitable for real-time applications.
- **Simple architecture**, easy to train and deploy.
- **Global reasoning**: the model sees the entire image when making predictions, improving contextual understanding.


## Evolution of YOLO

YOLO (You Only Look Once) has undergone significant development since its first version in 2015. Below is a summary of the major milestones in its evolution:

### YOLO (2015)
- Introduced in the paper *"You Only Look Once: Unified, Real-Time Object Detection"* by Redmon et al.
- Proposed a unified model for object detection that treats detection as a regression problem.
- Divides the image into an S×S grid and predicts bounding boxes and class probabilities for each grid cell.
- Very fast but suffered from localization errors and struggled with detecting small objects.

### YOLOv2 (YOLO9000, 2016)
- Improved accuracy and speed using a better backbone network (Darknet-19).
- Introduced anchor boxes to improve localization of multiple objects.
- Trained jointly on ImageNet and COCO, allowing it to detect over 9000 object categories.
- Marked a significant step towards more scalable detection.

### YOLOv3 (2018)
- Further improved with the Darknet-53 backbone (deeper and more powerful).
- Supports multi-scale detection (feature maps at three different scales).
- Uses logistic classifiers and independent object class predictions.
- Balanced well between speed and accuracy; became one of the most widely adopted versions.

### YOLOv4 (2020)
- Developed by Alexey Bochkovskiy.
- Combined multiple training tricks and improvements (e.g., Mosaic augmentation, Mish activation).
- Optimized for both GPU and CPU performance.
- Open-source and compatible with OpenCV and TensorRT.

### YOLOv5 (2020)
- Released by Ultralytics (not by the original authors).
- Implemented in PyTorch, making it more accessible to the community.
- Provided pre-trained models in multiple sizes (s, m, l, x).
- Included tools for training, inference, and export to various deployment formats.

### YOLOv6 and YOLOv7 (2022)
- YOLOv6 (by Meituan): focused on industrial applications, written in PyTorch.
- YOLOv7 (by WongKinYiu): pushed the limit of real-time object detection accuracy, introducing features such as E-ELAN blocks and model reparameterization.

### YOLOv8 (2023)
- Major rewrite by Ultralytics with a new architecture, no longer based on previous YOLO codebases.
- Supports multiple vision tasks: detection, instance segmentation, classification, pose estimation.
- Unified interface and modern design, exported easily to ONNX, TensorRT, and other formats.

### YOLOv9, v10, v11 (2024)
- YOLOv9: Introduced GELAN architecture and PGI module for improved accuracy and efficiency.
- YOLOv10: Focused on end-to-end detection without post-processing (NMS-free).
- YOLOv11: Further optimized the architecture for low-latency applications; introduced new CSP-based components.

### YOLOv12 (2025)
- Latest known version as of February 2025.
- Released by Tian et al. as an attention-centric detector; it is available through the Ultralytics package but was not developed by Ultralytics.
- Documentation is still limited.



For initial experiments, version 5 is recommended, as it offers a good balance between ease of use and the quality of the results. Version 8 is another good entry point.


## Architecture

The original YOLO network consists of 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by [GoogLeNet](https://arxiv.org/pdf/1409.4842), it uses 1×1 reduction layers, which shrink the number of channels in the activation maps, followed by 3×3 convolutional layers.

![YOLO](../assets/bloc2/YOLO.png "YOLO")

### Unified Detection

The paper explains:

> We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes for all classes in an image simultaneously. This means that our network reasons globally about the full image and all objects within it.

The YOLO system divides the input image into an $S×S$ grid. If the center of an object falls within a grid cell, that cell is responsible for detecting the object. Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes.
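The cell assignment described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `responsible_cell` and the 448×448 image size are assumptions chosen for the example.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the box center,
    plus the center offset inside that cell along x and y."""
    # Express the center in grid units: 0..S along each axis
    gx = cx * S / img_w
    gy = cy * S / img_h
    # Clamp so a center exactly on the right/bottom edge stays in the last cell
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


# Hypothetical object centered at pixel (300, 100) in a 448x448 image
print(responsible_cell(cx=300, cy=100, img_w=448, img_h=448))
# (1, 4, 0.6875, 0.5625)
```

The two offsets correspond to the $(x, y)$ values that the responsible cell has to predict for this object.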

These confidence scores reflect the model's certainty that a box contains an object, as well as how accurate it believes the predicted box is. Each bounding box consists of five predictions: $x, y, w, h$ and confidence. The coordinates $(x, y)$ represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the entire image. Finally, the confidence score represents the Intersection over Union (IoU) between the predicted box and any ground truth box; formally, the paper defines it as $\Pr(\text{Object}) × \text{IoU}$, so it should be zero when the cell contains no object. Each grid cell also predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$, one set per cell regardless of the number of boxes $B$.
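Since the confidence target is built on IoU, it may help to see how IoU is computed. The following is a minimal sketch that assumes boxes are given as corner coordinates `(x1, y1, x2, y2)`; the function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clipped at zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes that share half of their area: 50 / 150, about 0.333
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

An IoU of 1 means the predicted box matches the ground truth exactly, and 0 means the boxes do not overlap at all.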

![YOLO](../assets/bloc2/YOLO_deteccio.png "YOLO")

The architecture described above has an output layer of size $7×7×30$, resulting from the formula: $S × S × (B × 5 + C)$.  
In the original paper: $S = 7$, $B = 2$, and $C = 20$, since the model was trained on the [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) dataset.
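As a quick check of the formula, the output shape can be computed directly. This is only an arithmetic sketch; the helper name `yolo_output_shape` is an assumption made for the example.

```python
def yolo_output_shape(S, B, C):
    """Shape of the original YOLO output tensor: S x S x (B * 5 + C).

    Each of the B boxes contributes 5 values (x, y, w, h, confidence),
    and each cell adds C class probabilities shared by its boxes.
    """
    return (S, S, B * 5 + C)


shape = yolo_output_shape(S=7, B=2, C=20)
print(shape)                            # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])   # 1470 values in total
```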

Modern YOLO architectures are significantly more complex and are typically divided into three main components:

- **Backbone**: A convolutional network that extracts features from the image. YOLOv2 and YOLOv3 use their own *Darknet* architectures (Darknet-19 and Darknet-53, respectively; the latter is a residual network with 53 convolutional layers), and many later versions build on CSP-based variants.
- **Neck**: This component connects the backbone to the head(s). It is responsible for tasks such as multi-scale object detection, using feature pyramid networks that aggregate information from different stages of the backbone.
- **Head**: The head performs the final predictions. In modern versions of YOLO, multiple detection modules are used to predict bounding boxes, objectness scores, and class probabilities for each grid cell in the feature map. These predictions are then aggregated to produce the final detections.
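In most YOLO versions, one common part of this aggregation is non-maximum suppression (NMS), the post-processing step that YOLOv10 removes. The sketch below is a simplified, class-agnostic greedy NMS in plain Python, not the implementation used by any particular YOLO release; the detection format `(x1, y1, x2, y2, score)` and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard every remaining box that overlaps the kept one too much
        remaining = [d for d in remaining if iou(best[:4], d[:4]) < iou_threshold]
    return kept


dets = [
    (10, 10, 110, 110, 0.90),
    (12, 12, 112, 112, 0.80),    # near-duplicate of the first box
    (200, 200, 300, 300, 0.70),  # separate object
]
kept = nms(dets)
print(len(kept))  # 2
```

Production implementations usually apply the suppression per class and run it on the GPU, but the selection logic follows the same idea.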

An example of this more complex architecture can be found in the official documentation for [YOLO v5](https://docs.ultralytics.com/yolov5/tutorials/architecture_description/#1-model-structure).


## Using YOLO

The simplest way to use the network is through the library provided by the company Ultralytics. This makes it very easy to experiment with different versions of the network, as well as to perform fine-tuning and transfer learning processes.


In [None]:
!pip install -U ultralytics

We will start by experimenting with YOLOv5, which comes in five sizes (n, s, m, l, x). Each size uses a backbone network of a different capacity. In addition, each size is available for two input image sizes: 640 pixels and 1280 pixels (the models with the `6` suffix).


<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>size<br><sup>(pixels)</sup></th>
      <th>mAP<sup>val<br>50-95</sup></th>
      <th>mAP<sup>val<br>50</sup></th>
      <th>Speed<br><sup>CPU b1<br>(ms)</sup></th>
      <th>Speed<br><sup>V100 b1<br>(ms)</sup></th>
      <th>Speed<br><sup>V100 b32<br>(ms)</sup></th>
      <th>params<br><sup>(M)</sup></th>
      <th>FLOPs<br><sup>@640 (B)</sup></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5n.pt" target="_blank">YOLOv5n</a></td>
      <td>640</td>
      <td>28.0</td>
      <td>45.7</td>
      <td><strong>45</strong></td>
      <td><strong>6.3</strong></td>
      <td><strong>0.6</strong></td>
      <td><strong>1.9</strong></td>
      <td><strong>4.5</strong></td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt" target="_blank">YOLOv5s</a></td>
      <td>640</td>
      <td>37.4</td>
      <td>56.8</td>
      <td>98</td>
      <td>6.4</td>
      <td>0.9</td>
      <td>7.2</td>
      <td>16.5</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5m.pt" target="_blank">YOLOv5m</a></td>
      <td>640</td>
      <td>45.4</td>
      <td>64.1</td>
      <td>224</td>
      <td>8.2</td>
      <td>1.7</td>
      <td>21.2</td>
      <td>49.0</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5l.pt" target="_blank">YOLOv5l</a></td>
      <td>640</td>
      <td>49.0</td>
      <td>67.3</td>
      <td>430</td>
      <td>10.1</td>
      <td>2.7</td>
      <td>46.5</td>
      <td>109.1</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5x.pt" target="_blank">YOLOv5x</a></td>
      <td>640</td>
      <td>50.7</td>
      <td>68.9</td>
      <td>766</td>
      <td>12.1</td>
      <td>4.8</td>
      <td>86.7</td>
      <td>205.7</td>
    </tr>
    <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5n6.pt" target="_blank">YOLOv5n6</a></td>
      <td>1280</td>
      <td>36.0</td>
      <td>54.4</td>
      <td>153</td>
      <td>8.1</td>
      <td>2.1</td>
      <td>3.2</td>
      <td>4.6</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s6.pt" target="_blank">YOLOv5s6</a></td>
      <td>1280</td>
      <td>44.8</td>
      <td>63.7</td>
      <td>385</td>
      <td>8.2</td>
      <td>3.6</td>
      <td>12.6</td>
      <td>16.8</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5m6.pt" target="_blank">YOLOv5m6</a></td>
      <td>1280</td>
      <td>51.3</td>
      <td>69.3</td>
      <td>887</td>
      <td>11.1</td>
      <td>6.8</td>
      <td>35.7</td>
      <td>50.0</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5l6.pt" target="_blank">YOLOv5l6</a></td>
      <td>1280</td>
      <td>53.7</td>
      <td>71.3</td>
      <td>1784</td>
      <td>15.8</td>
      <td>10.5</td>
      <td>76.8</td>
      <td>111.4</td>
    </tr>
    <tr>
      <td><a href="https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5x6.pt" target="_blank">YOLOv5x6</a><br>+ [TTA]</td>
      <td>1280<br>1536</td>
      <td>55.0<br><strong>55.8</strong></td>
      <td>72.7<br><strong>72.7</strong></td>
      <td>3136<br>-</td>
      <td>26.2<br>-</td>
      <td>19.4<br>-</td>
      <td>140.7<br>-</td>
      <td>209.8<br>-</td>
    </tr>
  </tbody>
</table>
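One way to read the table is as a speed/accuracy trade-off. As a rough sketch, the snippet below picks the most accurate 640-pixel model that fits a CPU latency budget; the numbers are copied from the table, while the helper `best_under_budget` and the 250 ms budget are assumptions for illustration.

```python
# (name, mAP 50-95, CPU b1 latency in ms), copied from the 640-pixel rows of the table
models = [
    ("YOLOv5n", 28.0, 45),
    ("YOLOv5s", 37.4, 98),
    ("YOLOv5m", 45.4, 224),
    ("YOLOv5l", 49.0, 430),
    ("YOLOv5x", 50.7, 766),
]


def best_under_budget(models, max_cpu_ms):
    """Return the highest-mAP model whose CPU latency fits the budget, or None."""
    candidates = [m for m in models if m[2] <= max_cpu_ms]
    return max(candidates, key=lambda m: m[1]) if candidates else None


print(best_under_budget(models, 250))  # ('YOLOv5m', 45.4, 224)
```

Latency on your own hardware will differ from the published figures, so the table is best used for relative comparisons.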


We will begin by testing the smallest version:

In [1]:
from ultralytics import YOLO

# Load a COCO-pretrained YOLOv5n model
model = YOLO("yolov5n.pt")  # the newer "u" variant is yolov5nu.pt

# Display model information (optional)
model.info()

PRO TIP  Replace 'model=yolov5n.pt' with new 'model=yolov5nu.pt'.
YOLOv5 'u' models are trained with https://github.com/ultralytics/ultralytics and feature improved performance vs standard YOLOv5 models trained with https://github.com/ultralytics/yolov5.

YOLOv5n summary: 262 layers, 2,654,816 parameters, 0 gradients, 7.8 GFLOPs


(262, 2654816, 0, 7.840102399999999)

### Inference

YOLOv5 has been trained on the [COCO (Common Objects in Context)](https://cocodataset.org/#home) dataset, which contains 80 different classes. Performing inference for detection is very simple: calling the model on an input is enough. The call returns a list of _Results_ objects, one per input image.

[Documentation](https://docs.ultralytics.com/reference/engine/results/#ultralytics.engine.results.Results).

Thus, the inference process uses the Ultralytics API and abstracts away from PyTorch.

YOLO accepts various input formats, including:

- **File path**: Path to an image or video file (e.g., `"image.jpg"`).
- **List of file paths**: Multiple images for batch processing (e.g., `["img1.jpg", "img2.jpg"]`).
- **URL**: Address of a remote image, as in the example below.
- **NumPy array**: An image loaded as a NumPy array in `[H, W, C]` layout. Ultralytics expects BGR channel order here (the OpenCV convention), so convert arrays that come from RGB sources.
- **PyTorch tensor**: A tensor of shape `[C, H, W]` or `[B, C, H, W]` with RGB pixel values scaled to the range 0 to 1.
- **Directory path**: Folder containing images.
    


 

In [4]:
img = "https://hips.hearstapps.com/hmg-prod/images/the-boys-serie-amazon-1565605836.jpg"

In [5]:
# Run inference; the call returns a list with one Results object per input image
results = model(img)
results[0];  # Results object for the first (and only) image


Found https://hips.hearstapps.com/hmg-prod/images/the-boys-serie-amazon-1565605836.jpg locally at the-boys-serie-amazon-1565605836.jpg
image 1/1 C:\Users\gabri\PycharmProjects\aa_2425\14_YOLO\the-boys-serie-amazon-1565605836.jpg: 448x640 5 persons, 2 handbags, 2 ties, 161.6ms
Speed: 0.0ms preprocess, 161.6ms inference, 0.0ms postprocess per image at shape (1, 3, 448, 640)


In [9]:
import matplotlib.pyplot as plt
r = results[0]

r.boxes

ultralytics.engine.results.Boxes object with attributes:

cls: tensor([ 0.,  0.,  0.,  0., 26.,  0., 26., 27., 27.])
conf: tensor([0.9311, 0.8853, 0.8708, 0.8288, 0.3462, 0.3201, 0.2963, 0.2759, 0.2705])
data: tensor([[1.1957e+03, 4.9963e+01, 1.9991e+03, 1.3012e+03, 9.3113e-01, 0.0000e+00],
        [5.1376e+02, 1.6846e+02, 1.1217e+03, 1.3078e+03, 8.8529e-01, 0.0000e+00],
        [1.1506e+00, 3.3177e+02, 3.4585e+02, 1.2959e+03, 8.7082e-01, 0.0000e+00],
        [1.8088e+02, 3.4668e+02, 5.2710e+02, 1.3006e+03, 8.2880e-01, 0.0000e+00],
        [3.6379e+02, 1.0700e+03, 5.5296e+02, 1.3080e+03, 3.4624e-01, 2.6000e+01],
        [1.0214e+03, 4.3079e+02, 1.1316e+03, 1.0513e+03, 3.2006e-01, 0.0000e+00],
        [4.2099e+02, 1.0726e+03, 5.5326e+02, 1.3072e+03, 2.9633e-01, 2.6000e+01],
        [3.4559e+02, 5.9484e+02, 4.5085e+02, 8.1905e+02, 2.7592e-01, 2.7000e+01],
        [3.4424e+02, 5.9315e+02, 4.3852e+02, 7.6313e+02, 2.7049e-01, 2.7000e+01]])
id: None
is_track: False
orig_shape: (1310, 2000)
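Each row of the `data` tensor above holds `x1, y1, x2, y2, confidence, class id`, with coordinates in pixels of the original image. The sketch below decodes a few of those rows in plain Python so it runs without the model; the values are copied from the printed output, the `names` dictionary is a three-entry subset of the COCO class names, and the helper `describe` with its 0.3 threshold is an assumption for illustration. With a live result, the same fields should be available as `r.boxes.xyxy`, `r.boxes.conf`, `r.boxes.cls`, and `r.names`.

```python
# Rows copied from the `data` tensor: x1, y1, x2, y2, confidence, class id
rows = [
    (1195.7, 49.963, 1999.1, 1301.2, 0.93113, 0),
    (513.76, 168.46, 1121.7, 1307.8, 0.88529, 0),
    (363.79, 1070.0, 552.96, 1308.0, 0.34624, 26),
    (345.59, 594.84, 450.85, 819.05, 0.27592, 27),
]
# Subset of the COCO id-to-name mapping used by the pretrained model
names = {0: "person", 26: "handbag", 27: "tie"}


def describe(rows, names, min_conf=0.3):
    """Return (label, confidence, width, height) for rows above min_conf."""
    out = []
    for x1, y1, x2, y2, conf, cls in rows:
        if conf < min_conf:
            continue  # skip low-confidence detections
        out.append((names[int(cls)], conf, x2 - x1, y2 - y1))
    return out


for label, conf, w, h in describe(rows, names):
    print(f"{label}: conf={conf:.2f}, size={w:.0f}x{h:.0f} px")
```

Raising `min_conf` is a simple way to discard uncertain detections such as the low-scoring handbags and ties in this image.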

### Training

To perform training, the `train` method of the `YOLO` class is used. There is no need to implement a training loop manually, as this function provides a higher-level abstraction. It is important to note that this method is highly configurable, so it is recommended to study its parameters carefully before starting a training session.

For the full list of parameters, refer to the [training documentation](https://docs.ultralytics.com/modes/train/#key-features-of-train-mode).

In [None]:
# Example adapted from the official documentation
from ultralytics import YOLO

# Load a COCO-pretrained YOLO model.
# NOTE: It is also possible to load a model without pretraining. These models are defined in `.yaml` files.
model = YOLO("yolov5n.pt")

# Train the model on the COCO8 example dataset for 100 epochs.
# NOTE: We can train here because `coco8` is included within `ultralytics`.
train_results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the trained model
results = model("path/to/image.jpg")
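To train on your own data, the annotations must be in the format the library expects. For detection, Ultralytics datasets use one text file per image, with one line per object of the form `class x_center y_center width height`, all normalized to the image size. The sketch below converts a pixel-space corner box to such a line; the helper name `to_yolo_label` and the sample values are assumptions for illustration.

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to a YOLO label line:
    'class x_center y_center width height', normalized to [0, 1]."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Hypothetical box of class 0 in a 1000x800 image
print(to_yolo_label(0, 100, 200, 300, 600, img_w=1000, img_h=800))
# 0 0.200000 0.500000 0.200000 0.500000
```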

### Segmentation using YOLO

YOLO was originally developed for object detection, not for segmentation: its architecture was optimized for real-time performance when identifying and localizing objects with bounding boxes. Segmentation is a more recent extension. It can be performed with a modified version 5 (see [link](https://github.com/ultralytics/yolov5/blob/master/segment/tutorial.ipynb)), but it is from version 8 onward that the task is natively integrated into the network through a dedicated segmentation head.

The documentation lists the pretrained weight files available for each task: [link](https://docs.ultralytics.com/models/yolov8/#supported-tasks-and-modes).

Below, we will look at an example of segmentation:


In [None]:
from ultralytics import YOLO

# Load a COCO-pretrained YOLOv8n segmentation model
model = YOLO("yolov8n-seg.pt")

# Reuse the image URL defined in the detection example
results_seg = model(img)

In [None]:
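# plot() returns the annotated image as a BGR NumPy array
img_result = results_seg[0].plot()

plt.figure()
plt.imshow(img_result[..., ::-1])  # reverse the channel axis: BGR -> RGB for matplotlib
plt.axis("off")
plt.show()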