### Intro ###

I'm Or Ohev-Zion, a Computer Science student at The Open University of Israel.
The following document is a project developed for study purposes for a course: Data Science Workshop.

As a veteran officer in the IDF, serving in a field technological unit that uses various technological platforms and tools to enhance the IDF's combat abilities, I was intrigued by the current development of combat drones and the problems they face, and wanted to learn how to solve one of the issues such technologies face today.

A military drone tasked with object detection in the field may run into several difficulties.
Some of them are: poor or varying light conditions, obstructions in the field of view, varying object sizes (tiny to large), images cluttered by rubble or background elements such as sand, water and concrete, and overlapping objects.



![Description](clip-2-superJumbo.jpg)


When choosing a subject for this course, I wanted to implement a machine learning model that can cope with the difficulties mentioned above.
This had to be done methodically, as I was unable to use real combat footage (and would not use it even if I could).
Instead, I decided to use animated footage from a video game (namely Baldur's Gate 3), which imitates these conditions.
Then, I designed and implemented a series of steps to create a model that is able to detect objects in a simulated combat environment.

Below is an illustration of how the video game footage can simulate such data: it has varying light conditions, clutter, and objects of the same class at varying distances and sizes. This is why it was used for data creation.


![Description](frame_000009.png)


The first step was to record video data from the game (using a PS5) and upload it to the cloud (Google Drive).
Next, I used ffmpeg (as seen below) to split the video clip into frames (images), one frame per second, for ease of use and dataset creation.

The frames were uploaded to the same cloud storage, into a "frames" folder, which is the location of the initial raw data.

In [1]:
from google.colab import drive

# 1. Mount Google Drive
drive.mount('/content/drive')
dataset_root = "/content/drive/MyDrive/AI Workshop/YOLODataset/SingleRunDataset"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!apt-get update
!apt-get install ffmpeg
!ffmpeg -version


import glob
import os

# 1. Define source and destination directories
frames_root = os.path.join(dataset_root, "frames")  # Where extracted frames will be saved

# 2. Find all .webm files under dataset_root (recursively)
webm_files = glob.glob(dataset_root + '/**/*.webm', recursive=True)
print(f"Found {len(webm_files)} .webm files:")

for f in webm_files:
    print("-", f)

# 3. For each .webm file, create a unique output folder, then run ffmpeg
for webm_path in webm_files:
    # Extract the base filename (without extension) for naming the subfolder
    base_name = os.path.splitext(os.path.basename(webm_path))[0]

    # e.g., if webm_path is ".../PS5/clip1.webm", base_name = "clip1"
    output_folder = os.path.join(frames_root, base_name)

    # Create the output folder (if it doesn't exist)
    os.makedirs(output_folder, exist_ok=True)

    # Build the ffmpeg command
    # -i "webm_path": input file
    # -vf fps=1: extract 1 frame per second
    # -pix_fmt rgb24: write 8-bit RGB pixels
    # output_folder/frame_%06d.png: store frames as frame_000001.png, frame_000002.png, etc.
    ffmpeg_cmd = f'ffmpeg -i "{webm_path}" -vf fps=1 -pix_fmt rgb24 "{output_folder}/frame_%06d.png"'

    print(f"\nExtracting frames for: {webm_path}\nSaving to: {output_folder}")

    # 4. Run the ffmpeg command in a shell
    !$ffmpeg_cmd

print("\nAll .webm files processed!")
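
The same ffmpeg call can also be built as an argument list and run with subprocess, which avoids shell quoting problems for paths that contain spaces (such as "AI Workshop"). This is a minimal sketch, not the code used above; the helper names build_ffmpeg_cmd and extract_frames and the example paths are assumptions for illustration.

```python
import os
import subprocess


def build_ffmpeg_cmd(video_path, output_folder, fps=1):
    """Return the ffmpeg argument list that extracts `fps` frames per second as PNGs."""
    return [
        "ffmpeg",
        "-i", video_path,                # input file
        "-vf", f"fps={fps}",             # sample this many frames per second
        "-pix_fmt", "rgb24",             # 8-bit RGB output pixels
        os.path.join(output_folder, "frame_%06d.png"),
    ]


def extract_frames(video_path, output_folder, fps=1):
    """Create the output folder and run ffmpeg; raises CalledProcessError on failure."""
    os.makedirs(output_folder, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, output_folder, fps), check=True)


# Hypothetical paths, only to show the resulting command
cmd = build_ffmpeg_cmd("/tmp/my clips/clip1.webm", "/tmp/frames/clip1")
print(cmd)
```

Because each argument is a separate list element, no manual quoting of the paths is needed.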

### Data preparation and EDA ###

An important first step was to decide how to create the data to train on.
Initially, many different objects in the footage were categorized under the same class, which made the model confuse them or fail to find them when predicting on new data.

Over several iterations of creating the data and running some basic ML on it (described in the next chapters), I was able to get better results.

I learned to assign a unique class to each unique object in the footage. For example, because two characters can look very different even when they share the same status or role in the scene, it was better to classify them as different objects.

This leads to the realization that it may be better to train an object detection model on separate classes for different objects of the same type, i.e. "Enemy1", "Enemy2", ... and "Friendly1", "Friendly2", ..., instead of just "Enemy" and "Friendly" for all similar objects in these categories.

The next step was to manually label the initial frames dataset.
This was done using an online tool: Roboflow.
The tool allows uploading images, manually labeling objects in them with shared classes, and later exporting a dataset in one of the common formats (for this workshop, PASCAL VOC was used, which keeps the images and the annotation files separate).

![Description](image_annotation_in_roboflow.png)


After creating a small but sufficient dataset (which takes a long time of manual labor), it is possible to start working with the data and learning what ML models can make of it.

### Model architecture choice ###
In a separate course, namely AI Seminar, I researched a hybrid CNN-transformer model, using various recent research papers published on arXiv.org, to design an ML model that would be a good fit for the technology defined above (object detection for military drones in combat scenes).

I decided to use YOLO for this purpose, as it was the CNN part of that model architecture, for various reasons.

For this reason, I decided to focus here on training a model using this architecture alone, using the Keras library for Python.

Many hyperparameter choices were considered to get a good result, and different sub-models were tried (from YOLOv8n to YOLOv8XL) to see which one performed best while still getting close to real-time performance, for the sake of usage in real combat conditions.

### Training and prediction score ###

The first few iterations gave very bad results, because the model learned mostly background noise. Because it was hard to calculate metrics for a varying number of objects in each image, a good measure of detection accuracy was to check the predictions manually, by looking at the bounding boxes and their locations in the validation dataset. Because the dataset is rather small (about 100 images), this was easy to do.

During training, the best-performing model was saved to disk (cloud) and reloaded once the training process ended, to start with predictions.



Although not shown in this notebook, the next step was to upload the images (frames) to Roboflow.com (free version), label 4 classes of objects in about 50 frames, and download the dataset (images with metadata specifying the location of bounding boxes and labels on the frames) in Pascal VOC format: one XML file with metadata per frame (a .jpg file of 640*640 resolution, for convenience). Additionally, a few preprocessing techniques were applied to the dataset, like "resize" and "auto-orient", followed by a few augmentation techniques to get more learning out of a small dataset ("flip", "90deg rotate", "rotation +-15deg", "shear +-10deg"), effectively producing a dataset of 135 input images.
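
For reference, a Pascal VOC annotation is a small XML file with one object element per labeled box. The sketch below parses an inline sample with the standard library; the file name and the coordinates in the sample are made up for illustration and are not taken from the dataset, and the helper name parse_voc is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, shaped like a Pascal VOC export
SAMPLE_XML = """
<annotation>
  <filename>example_frame.jpg</filename>
  <size><width>640</width><height>640</height><depth>3</depth></size>
  <object>
    <name>Player</name>
    <bndbox><xmin>238</xmin><ymin>227</ymin><xmax>416</xmax><ymax>431</ymax></bndbox>
  </object>
  <object>
    <name>Enemy Imp</name>
    <bndbox><xmin>300</xmin><ymin>228</ymin><xmax>334</xmax><ymax>276</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (filename, [(class_name, [xmin, ymin, xmax, ymax]), ...])."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        objects.append((obj.find("name").text, coords))
    return root.find("filename").text, objects


filename, objects = parse_voc(SAMPLE_XML)
print(filename)
for name, coords in objects:
    print(name, coords)
```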

Here we set the hyper-parameters for model training. They are mostly default values: learning rate 0.001, an 80% train / 20% evaluation split of the dataset, and 15 epochs (not too many, to avoid long training times and make it easy for the teacher to reproduce).

Hyperparameters used throughout the training

In [None]:
SPLIT_RATIO = 0.2
BATCH_SIZE = 4
LEARNING_RATE = 0.001
EPOCH = 15
GLOBAL_CLIPNORM = 10.0

Here we install the necessary packages for the model. The project requirement is to use the Keras software package, which gives many tools for deep learning. Among those is the KerasCV module (which, according to its documentation, is in a transition period into KerasHub over the next year but will remain usable), which supplies well-known, state-of-the-art pre-trained CNN models. I chose the YOLO model, as explained further on. Furthermore, I'm using TensorFlow for the methods needed to configure and train the model, like the optimizer, which integrates well with the Keras packages.

In [None]:
!pip install --upgrade git+https://github.com/keras-team/keras-cv -q
!pip install tensorflow --upgrade
!pip install keras --upgrade
!pip install keras-cv --upgrade


Here we import most of the necessary packages that will be used throughout the training of the model

In [None]:
import os
from tqdm.auto import tqdm
import xml.etree.ElementTree as ET

import tensorflow as tf
from tensorflow import keras

import keras_cv
from keras_cv import bounding_box
from keras_cv import visualization

Again we use the Drive mount, this time to fetch the images with metadata in the correct format for training.

In [None]:
# Paths
DATASET_ROOT = os.path.join(dataset_root, "labeled_dataset_5")

Here we prepare the class mapping and the lists of images and annotations, to create the Dataset.

In [None]:
class_ids = [
    "Enemy Imp",
    "Enemy Hellsboar",
    "Player",
    "Laezel",
    "Shadowheart",
]

class_mapping = dict(zip(range(len(class_ids)), class_ids))

# Path to images and annotations
path_images = os.path.join(DATASET_ROOT, "images")
path_annot = os.path.join(DATASET_ROOT, "annotations")

# Get all XML file paths in path_annot and sort them
xml_files = sorted(
    [
        os.path.join(path_annot, file_name)
        for file_name in os.listdir(path_annot)
        if file_name.endswith(".xml")
    ]
)

# Get all JPEG image file paths in path_images and sort them
jpg_files = sorted(
    [
        os.path.join(path_images, file_name)
        for file_name in os.listdir(path_images)
        if file_name.endswith(".jpg")
    ]
)

In [None]:
len(jpg_files)

300

Here we parse the annotations data, and bundle the images with them in the expected format, before initializing the dataset used by the model.

In [None]:
def parse_annotation(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    image_name = root.find("filename").text
    image_path = os.path.join(path_images, image_name)

    boxes = []
    classes = []
    for obj in root.iter("object"):
        cls = obj.find("name").text
        classes.append(cls)

        bbox = obj.find("bndbox")
        xmin = float(bbox.find("xmin").text)
        ymin = float(bbox.find("ymin").text)
        xmax = float(bbox.find("xmax").text)
        ymax = float(bbox.find("ymax").text)
        boxes.append([xmin, ymin, xmax, ymax])

    # Translate class names to integer ids by inverting class_mapping (id -> name)
    name_to_id = {name: idx for idx, name in class_mapping.items()}
    class_ids = [name_to_id[cls] for cls in classes]
    return image_path, boxes, class_ids


image_paths = []
bbox = []
classes = []
for xml_file in tqdm(xml_files):
    image_path, boxes, class_ids = parse_annotation(xml_file)
    image_paths.append(image_path)
    bbox.append(boxes)
    classes.append(class_ids)

boxes

  0%|          | 0/300 [00:00<?, ?it/s]

[[238.0, 227.0, 416.0, 431.0]]

This is a necessary and important step for our mission. Because not all images have the same number of bounding boxes (the rectangles enclosing objects), we need to convert our dataset to "Ragged Tensor" form, which allows a varying number of objects per image. For example, image 1 might have 10 labeled (or detected) objects in it, and image 2 might have only 3. The dataset created here can hold these varying amounts of objects, both for the "ground-truth" objects from the labeled data and for predicted objects (in the evaluation data, for example).

In [None]:
bbox = tf.ragged.constant(bbox)
classes = tf.ragged.constant(classes)
image_paths = tf.ragged.constant(image_paths)

data = tf.data.Dataset.from_tensor_slices((image_paths, classes, bbox))
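
To see why a ragged structure is convenient, compare it with the alternative: a dense tensor needs every image to have the same number of boxes, so shorter lists would have to be padded with a sentinel. The pure-Python sketch below shows that padding step; the helper name pad_boxes and the -1 sentinel are assumptions for illustration and are not part of the pipeline above.

```python
def pad_boxes(boxes_per_image, pad_value=-1.0):
    """Pad each image's box list to the longest one so the result is rectangular."""
    max_boxes = max(len(boxes) for boxes in boxes_per_image)
    padded = []
    for boxes in boxes_per_image:
        missing = max_boxes - len(boxes)
        padded.append([list(b) for b in boxes] + [[pad_value] * 4 for _ in range(missing)])
    return padded


# Two images with a different number of labeled objects (values are illustrative)
ragged = [
    [[300.0, 228.0, 334.0, 276.0], [239.0, 505.0, 273.0, 579.0]],
    [[306.0, 302.0, 346.0, 393.0]],
]
dense = pad_boxes(ragged)
print([len(b) for b in ragged])  # boxes per image before padding
print([len(b) for b in dense])   # boxes per image after padding
```

With a ragged tensor, no such sentinel rows have to be created or filtered out again later.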

Here we split the dataset into training and validation by the SPLIT_RATIO.

In [None]:
# Determine the number of validation samples
num_val = int(len(xml_files) * SPLIT_RATIO)

# Split the dataset into train and validation sets
val_data = data.take(num_val)
train_data = data.skip(num_val)

Consequently, the amount of files (images) for validation:

In [None]:
num_val

60
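
Note that take and skip split the data in file order, so the first 20% of the sorted files become the validation set. If consecutive frames are similar, a shuffled split may be preferable. The sketch below shows one way to pick the indices reproducibly; the function name split_indices and the seed are assumptions, and this is not what the notebook does.

```python
import random


def split_indices(num_samples, split_ratio, seed=42):
    """Shuffle indices reproducibly and return (train_indices, val_indices)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_samples * split_ratio)
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(300, 0.2)
print(len(train_idx), len(val_idx))
```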

Helper methods to load the data and package it in the expected structure, preparing it for batching and training.

In [None]:
def load_image(image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    return image


def load_dataset(image_path, classes, bbox):
    # Read Image
    image = load_image(image_path)
    bounding_boxes = {
        "classes": tf.cast(classes, dtype=tf.float32),
        "boxes": bbox,
    }
    return {"images": tf.cast(image, tf.float32), "bounding_boxes": bounding_boxes}

### Data Augmentation ###

An augmenter bundles layers that "puff up" the data by applying known data science methods to artificially increase the amount of data the model can learn from, although this is somewhat less effective than introducing new "real" data. I struggled to get results from the model and decided to keep it simple, so I removed the fancy augmentations to make it easier for the model to classify objects. In hindsight, I realized that using binary_crossentropy was the bad choice, but by then I had already removed the fancy augmentations. I got good results without Shear and Flip, so I kept only the resize.

In [None]:
augmenter = keras.Sequential(
    layers=[
        keras_cv.layers.RandomFlip(mode="horizontal", bounding_box_format="xyxy"),
        keras_cv.layers.RandomShear(
            x_factor=0.2, y_factor=0.2, bounding_box_format="xyxy"
        ),
        keras_cv.layers.JitteredResize(
            target_size=(640, 640), scale_factor=(0.75, 1.3), bounding_box_format="xyxy"
        ),
    ]
)

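# Simplified augmenter that is actually used: it overrides the definition above
# and keeps only the resize to 640x640, with a fixed scale factor of 1.
augmenter = keras.Sequential(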
    layers=[
        keras_cv.layers.JitteredResize(
            target_size=(640, 640), scale_factor=(1, 1), bounding_box_format="xyxy"
        ),
    ]
)
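
Augmenting a detection dataset means transforming the boxes together with the pixels. As an illustration of what a horizontal flip has to do to xyxy boxes, here is a minimal pure-Python sketch; it is not KerasCV's implementation, and the function name flip_boxes_horizontal is an assumption.

```python
def flip_boxes_horizontal(boxes, image_width):
    """Mirror xyxy boxes around the vertical center line of the image."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        # The old right edge becomes the new left edge, and vice versa
        flipped.append([image_width - xmax, ymin, image_width - xmin, ymax])
    return flipped


boxes = [[238.0, 227.0, 416.0, 431.0]]
flipped = flip_boxes_horizontal(boxes, image_width=640)
print(flipped)
```

Flipping twice returns the original boxes, which is a quick sanity check for this kind of transform.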

Here we prepare the training dataset for the model by chaining a few operations: loading, shuffling, batching the data together (for faster learning), and augmenting.

In [None]:
train_ds = train_data.map(load_dataset, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.shuffle(BATCH_SIZE * 4)
train_ds = train_ds.ragged_batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(augmenter, num_parallel_calls=tf.data.AUTOTUNE)
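
The batching settings determine how many steps one epoch takes. With 300 labeled images, 60 of them held out for validation, and a batch size of 4, the training set yields 60 batches per epoch, which matches the 60/60 progress counter in the training log further down. A quick check of that arithmetic (the helper name steps_per_epoch is an assumption):

```python
def steps_per_epoch(num_samples, batch_size, drop_remainder=True):
    """Number of batches produced per pass over the data."""
    full, rest = divmod(num_samples, batch_size)
    return full if drop_remainder or rest == 0 else full + 1


num_total, num_val, batch_size = 300, 60, 4
train_steps = steps_per_epoch(num_total - num_val, batch_size)
val_steps = steps_per_epoch(num_val, batch_size)
print(train_steps, val_steps)
```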

Here we prepare the validation dataset by applying a plain resize to 640x640, so the model is evaluated on data it did not train on. The random scale factor was removed to keep it simple.

In [None]:
resizing = keras_cv.layers.JitteredResize(
    target_size=(640, 640),
    scale_factor=(1, 1),
    bounding_box_format="xyxy",
)

val_ds = val_data.map(load_dataset, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.shuffle(BATCH_SIZE * 4)
val_ds = val_ds.ragged_batch(BATCH_SIZE, drop_remainder=True)
val_ds = val_ds.map(resizing, num_parallel_calls=tf.data.AUTOTUNE)

Here we take a sneak peek at the images with the labeled objects before starting training, to get a feel for how the images look and how much the model sees and is expected to predict. Reminder: this dataset was manually created by me in order to train a model to detect objects in difficult conditions (colors similar to the background, varying distances from the viewer, unfamiliar classes (the model probably never saw such objects during pre-training)).

In [None]:
def visualize_dataset(dataset, value_range, rows, cols, bounding_box_format):
    """
    Iterates over all batches in the dataset and plots bounding boxes
    using KerasCV's visualization utilities.

    Parameters:
      - dataset: a tf.data.Dataset whose elements are dictionaries
                 with keys "images" and "bounding_boxes"
      - value_range: tuple, e.g. (0, 255) for pixel range
      - rows, cols: how to arrange the images in the gallery
      - bounding_box_format: e.g. "xyxy"
    """
    for batch in dataset:
        # Expect batch to be a dictionary with "images" and "bounding_boxes"
        images = batch["images"]
        bounding_boxes = batch["bounding_boxes"]

        visualization.plot_bounding_box_gallery(
            images,
            value_range=value_range,
            rows=rows,
            cols=cols,
            y_true=bounding_boxes,
            scale=5,
            font_scale=0.7,
            bounding_box_format=bounding_box_format,
            class_mapping=class_mapping,
        )


visualize_dataset(
    train_ds, bounding_box_format="xyxy", value_range=(0, 255), rows=2, cols=2
)

visualize_dataset(
    val_ds, bounding_box_format="xyxy", value_range=(0, 255), rows=2, cols=2
)

Here we prepare the data one last time, converting each batch into the (images, labels) tuple that the model expects during training.

In [None]:
def dict_to_tuple(inputs):
    return inputs["images"], inputs["bounding_boxes"]


train_ds = train_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

val_ds = val_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)

Just checking that the classes and boxes are populated correctly here.

In [None]:
for images, labels in train_ds.take(1):
    print(labels)  # Check if the bounding boxes and classes are correctly structured
    print(labels["classes"])
    print(labels["boxes"])
    print(images)


{'classes': <tf.RaggedTensor [[0.0, 4.0, 1.0, 0.0], [2.0, 0.0, 1.0, 3.0, 0.0], [0.0, 4.0, 1.0, 0.0],
 [2.0, 4.0, 0.0]]>, 'boxes': <tf.RaggedTensor [[[300.0, 228.0, 334.0, 276.0],
  [239.0, 505.0, 273.0, 579.0],
  [421.0, 347.0, 476.0, 415.0],
  [627.0, 262.0, 640.0, 305.0]], [[288.0, 510.0, 350.0, 622.0],
                                  [476.0, 397.0, 527.0, 470.0],
                                  [270.0, 313.0, 344.0, 402.0],
                                  [246.0, 385.0, 276.0, 463.0],
                                  [317.0, 176.0, 356.0, 231.0]],
 [[293.0, 231.0, 322.0, 275.0],
  [275.0, 513.0, 299.0, 585.0],
  [432.0, 332.0, 478.0, 394.0],
  [622.0, 219.0, 640.0, 260.0]], [[306.0, 302.0, 346.0, 393.0],
                                  [215.0, 392.0, 245.0, 465.0],
                                  [187.0, 228.0, 221.0, 286.0]]]>}
<tf.RaggedTensor [[0.0, 4.0, 1.0, 0.0], [2.0, 0.0, 1.0, 3.0, 0.0], [0.0, 4.0, 1.0, 0.0],
 [2.0, 4.0, 0.0]]>
<tf.RaggedTensor [[[300.0, 228.0, 334

### Model setup and training ###

This is an important part: here we specify the exact pre-trained CNN model used, a large YOLOv8 variant with a COCO backbone (the code below loads the yolo_v8_xl_backbone_coco preset). This means it uses the 8th version of the YOLO CNN architecture, with rather high training and prediction costs (Small wasn't able to learn and predict this kind of data, so I went for a larger one), and it was pre-trained on a very well-known dataset of labeled images called COCO.

We also use the helper class from KerasCV, YOLOV8Detector, which is the code that runs the above model.

Then we configure an optimizer for the model, also taken from Keras, with mostly default hyper-parameter values.

Finally we compile everything together, specifying the loss functions that measure improvement during training: a classification loss for the classes (binary cross-entropy at first, later replaced by Focal Loss, as in the compile cell below) and "Complete IoU" for the bounding boxes (which takes into account the overlap of the boxes, the distance between their centers, and their aspect ratios).

There is a lot to be said about the choices here. fpn_depth sets the depth of the blocks in the feature pyramid network, the part of the model that helps it handle objects of varying sizes. Because most of the objects I labeled are similar in size, and after getting bad results (no detections) when increasing fpn_depth, I kept it at 1.

The score_threshold and nms_iou_threshold are inference parameters, not training parameters. After the model predicts on an image, these parameters decide how many of the predicted boxes to filter out, to avoid duplicate boxes and uncertain (low-confidence) detections. We can change these even after training is finished, and will play with them later.

The bounding box format is mandatory: xyxy, to fit the dataset structure (Pascal VOC format with x/y min/max).

In [None]:
# Initial model setup
backbone = keras_cv.models.YOLOV8Backbone.from_preset("yolo_v8_xl_backbone_coco")

yolo = keras_cv.models.YOLOV8Detector(
    num_classes=len(class_mapping),
    bounding_box_format="xyxy",
    backbone=backbone,
    fpn_depth=1,
    score_threshold=0.5,
    nms_iou_threshold=0.5,
)
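
Both the CIoU box loss and the NMS threshold build on intersection over union (IoU), the ratio between the overlap of two boxes and the area they cover together. A minimal sketch for xyxy boxes follows; CIoU adds penalty terms for center distance and aspect ratio on top of this, which are not shown here. The function name iou_xyxy is an assumption.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


half_overlap = iou_xyxy([0, 0, 10, 10], [5, 0, 15, 10])   # boxes sharing half their area
disjoint = iou_xyxy([0, 0, 10, 10], [20, 20, 30, 30])     # boxes that do not touch
print(half_overlap, disjoint)
```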

The "checkpoint" callback monitors progress during training and keeps the best model seen so far across epochs. We save it to a file, specified in the filepath param.

In [None]:
# Compilation methods
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_model2.keras",
    monitor="val_box_loss",  # or another metric, e.g., "val_loss"
    mode="min",          # lower loss is better
    save_best_only=True,
    save_weights_only=False,
    verbose=1,
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",   # Watch validation loss
    patience=2,           # Stop if no improvement for 2 epochs
    min_delta=0.1,        # Minimum improvement to continue training
    mode="min",           # Lower `val_loss` is better
    verbose=1
)
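
To make the patience and min_delta settings concrete, the sketch below replays a list of validation losses through a simplified version of the early-stopping rule: an epoch counts as an improvement only if it beats the best loss so far by more than min_delta, and training stops after patience epochs without one. This is an approximation written for illustration, not Keras's own code, and the loss values are invented. Note that a callback only takes effect if it is included in the callbacks list passed to fit.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.1):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Meaningful improvement: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Invented losses: large gains at first, then changes smaller than min_delta
losses = [4.9, 3.5, 3.45, 3.42, 3.0]
stop_at = early_stop_epoch(losses)
print(stop_at)
```

With a min_delta as large as 0.1, small but steady gains do not count as improvements, so training can stop while the loss is still slowly going down.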

Every time before fitting, we need to compile the model with an optimizer, a classification loss function and a box loss function. I can't stress enough how important these are. In my case, after much sweat and tears, I found that Focal Loss is mandatory for my dataset, because it keeps the model from training mostly on noise, which otherwise resulted in zero predictions on the actual objects we wish to find.
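
The effect of Focal Loss can be seen with a little arithmetic. For a single binary prediction, binary cross-entropy is -log(p_t), where p_t is the probability assigned to the correct label, and focal loss multiplies it by (1 - p_t)^gamma, so confidently correct ("easy") examples, such as the many background locations, contribute very little. The sketch below compares the two on an easy and a hard example; it omits the alpha class-balancing factor and is a simplified illustration, not KerasCV's implementation.

```python
import math


def binary_cross_entropy(p_true):
    """Loss for one prediction, given the probability assigned to the correct label."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled down by (1 - p_true) ** gamma."""
    return (1.0 - p_true) ** gamma * binary_cross_entropy(p_true)


for name, p in [("easy background", 0.99), ("hard object", 0.10)]:
    bce, fl = binary_cross_entropy(p), focal_loss(p)
    print(f"{name}: bce={bce:.5f} focal={fl:.7f} ratio={fl / bce:.4f}")
```

With gamma=2, the easy example keeps only a ten-thousandth of its cross-entropy loss, while the hard one keeps 81% of it, so the gradient is dominated by the examples the model still gets wrong.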

As for the learning rate, it's okay to start high (0.001) and reduce it by an order of magnitude in later epochs to make learning more precise, once the model has a basic grasp of the images and the relevant objects. The global_clipnorm caps the overall norm of the gradients in each update, which guards against exploding gradients and keeps training stable (it is not a measure against overfitting).

In [None]:
# Setup optimizer and compile
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    global_clipnorm=GLOBAL_CLIPNORM,
)

yolo.compile(
    optimizer=optimizer,
    classification_loss=keras_cv.losses.FocalLoss(from_logits=True, gamma=2.0),  # 🟢 Use Focal Loss for best performance with different classes
    box_loss="ciou"
)
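
For intuition on global_clipnorm: one norm is computed over all gradients together, and if it exceeds the limit, every gradient is scaled by the same factor so that the overall norm equals the limit. A pure-Python sketch of that rule follows; the function name clip_by_global_norm and the toy gradients are assumptions for illustration.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Scale all gradient vectors jointly so their combined L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= clip_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


# Toy gradients for two weight tensors
grads = [[30.0, 0.0], [0.0, 40.0]]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=10.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after, clipped)
```

Because all gradients share one scale factor, the direction of the update is preserved and only its length is limited.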

Here we just check class balance. It's quite okay: not great, but not bad either. So it probably doesn't require using class weights.

In [None]:
from collections import Counter

class_counts = Counter()
for images, labels in train_ds:
    # Access classes from the 'labels' dictionary
    for cls_array in labels["classes"]: # Iterate over sub-arrays
        for cls in cls_array.numpy().tolist(): # Convert to list for iteration
            # Convert cls to a hashable type, if needed
            cls = int(cls)  # Assuming cls is now a single numeric value
            class_counts[cls] += 1

print(class_counts)

Counter({0: 255, 2: 168, 1: 159, 3: 135, 4: 81})
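
If the imbalance were more severe, one common remedy would be inverse-frequency class weights. The sketch below derives them from the counts printed above; it is shown only for reference, since the notebook does not use class weights, and the weighting formula is one common choice among several.

```python
from collections import Counter

# Counts taken from the output above (class id -> number of labeled boxes)
class_counts = Counter({0: 255, 2: 168, 1: 159, 3: 135, 4: 81})

total = sum(class_counts.values())
num_classes = len(class_counts)

# weight = total / (num_classes * count); 1.0 would mean a perfectly balanced class
class_weights = {cls: total / (num_classes * n) for cls, n in sorted(class_counts.items())}
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for cls, weight in class_weights.items():
    print(cls, round(weight, 3))
print("most/least frequent ratio:", round(imbalance_ratio, 2))
```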


## Train CNN Model

Here we finally start training the model. It can take about an hour to fine-tune the model on our dataset.

After every epoch, we score the current model by testing its predictions on the validation data. The model doesn't learn from the validation data; it is used only to monitor progress and to pick the best checkpoint.

Here starts a game of training, checking predictions, re-training and re-checking, until the results are satisfactory. When I say training I mean fine-tuning the model, namely YOLOv8, which is great for this purpose: it specializes in object detection, is modern, and is relatively very fast.

In [None]:
# Setup training parameters and start training

yolo.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5,
    callbacks=[checkpoint_callback],
)

Epoch 1/5


Expected: ['keras_tensor_10597']
Received: inputs=Tensor(shape=(4, 640, 640, 3))


60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 11s/step - box_loss: 3.8623 - class_loss: 0.0189 - loss: 3.8811
Epoch 1: val_box_loss improved from inf to 4.84277, saving model to best_model2.keras
60/60 ━━━━━━━━━━━━━━━━━━━━ 759s 12s/step - box_loss: 3.8526 - class_loss: 0.0188 - loss: 3.8714 - val_box_loss: 4.8428 - val_class_loss: 0.0417 - val_loss: 4.8845
Epoch 2/5
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 11s/step - box_loss: 2.5884 - class_loss: 0.0076 - loss: 2.5961
Epoch 2: val_box_loss improved from 4.84277 to 3.52260, saving model to best_model2.keras
60/60 ━━━━━━━━━━━━━━━━━━━━ 700s 12s/step - box_loss: 2.5872 - class_loss: 0.0077 - loss: 2.5948 - val_box_loss: 3.5226 - val_class_loss: 0.0086 - val_loss: 3.5312
Epoch 3/5
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 11s/step - box_loss: 2.4026 - class_loss: 0.0068 - loss: 2.4094
Epoch 3: val_box_loss i

<keras.src.callbacks.history.History at 0x7e3f3ca30890>

By now we have a trained model, and we can restore the best one, according to the monitored validation loss, from the saved checkpoint.

Here we can load the best trained model's weights from the file system if needed. It would be better to save it on Drive to avoid losing it between runs, but for this project it's okay like this.

In [None]:
yolo = keras_cv.models.YOLOV8Detector(
    num_classes=len(class_mapping),
    bounding_box_format="xyxy",
    backbone=keras_cv.models.YOLOV8Backbone.from_preset("yolo_v8_xl_backbone_coco"),
)

yolo.load_weights("best_model3.keras")

I used this to quickly load the full model, although it caused some issues once I started playing with fpn_depth between epochs. It is best to just load the weights instead, but this is also one way to load the already fine-tuned model.

In [None]:
from keras.utils import custom_object_scope
import keras_cv

# Include all necessary YOLOv8 classes and import BoundingBoxes from the correct location
custom_objects = {
    "YOLOV8Detector": keras_cv.models.YOLOV8Detector,
    "YOLOV8Backbone": keras_cv.models.YOLOV8Backbone,
}


with custom_object_scope(custom_objects):
    yolo.fpn_depth = 1
    yolo = tf.keras.models.load_model("best_model3.keras", compile=True)

This is a quick and dirty way to re-load the model between epochs.

In [None]:
# First, reinitialize the YOLO model (must match the trained model config)
yolo = keras_cv.models.YOLOV8Detector(
    num_classes=len(class_mapping),
    bounding_box_format="xyxy",
    backbone=keras_cv.models.YOLOV8Backbone.from_preset("yolo_v8_xl_backbone_coco"),
    fpn_depth=2,
)

# Now, load the best weights
yolo.load_weights("best_model3.keras")

Ready it up for another run, this time with parameters different from the initial ones if needed (fine-fine-tuning).

In [None]:
yolo.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001), # Play with this
    classification_loss="binary_crossentropy",
    box_loss="ciou",
)

yolo.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[checkpoint_callback],
)

Epoch 1/10


Expected: ['keras_tensor_8953']
Received: inputs=Tensor(shape=(4, 640, 640, 3))


60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 12s/step - box_loss: 2.2878 - class_loss: 0.6385 - loss: 2.9263
Epoch 1: val_loss improved from 3.55063 to 3.41001, saving model to best_model.keras
60/60 ━━━━━━━━━━━━━━━━━━━━ 835s 13s/step - box_loss: 2.2879 - class_loss: 0.6394 - loss: 2.9273 - val_box_loss: 2.6931 - val_class_loss: 0.7169 - val_loss: 3.4100
Epoch 2/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 12s/step - box_loss: 2.2347 - class_loss: 0.6592 - loss: 2.8940
Epoch 2: val_loss improved from 3.41001 to 3.39160, saving model to best_model.keras
60/60 ━━━━━━━━━━━━━━━━━━━━ 774s 13s/step - box_loss: 2.2347 - class_loss: 0.6598 - loss: 2.8945 - val_box_loss: 2.6767 - val_class_loss: 0.7149 - val_loss: 3.3916
Epoch 3/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 12s/step - box_loss: 2.1991 - class_loss: 0.6356 - loss: 2.8347
Epoch 3: val_loss improved 

KeyboardInterrupt: 

Here is a function to print the images from validation dataset, and predict the objects in each one, with the finished model. We can then view the predictions it made and see if it indeed finds the objects we trained it to find, and if the bounding boxes are of appropriate sizes and shapes.

In [None]:
def visualize_entire_dataset(model, dataset, bounding_box_format, class_mapping):
    import math
    from keras_cv import visualization

    # We'll loop over each batch in the dataset
    batch_index = 0
    for images, y_true in dataset:
        # Run inference on this batch
        y_pred = model.predict(images, verbose=0)
        # print(y_pred)
        # Decide how many rows and columns to display
        # e.g., if you have 4 images in a batch, you might do 2x2
        batch_size = images.shape[0]
        cols = 2
        rows = math.ceil(batch_size / cols)  # enough rows to fit the batch

        visualization.plot_bounding_box_gallery(
            images,
            value_range=(0, 255),
            bounding_box_format=bounding_box_format,
            y_pred=y_pred,
            y_true=y_true,
            scale=4,
            rows=rows,
            cols=cols,
            show=True,
            font_scale=0.7,
            class_mapping=class_mapping,
        )
        batch_index += 1
        print(f"Displayed batch {batch_index}")


Change the confidence score threshold and the overlap threshold (respectively: whether to display a detection the model is not confident about, and how much overlap to allow between detections before they are treated as duplicates of the same object).

In [None]:
# Setup inference parameters
yolo.score_threshold = 0.8
yolo.nms_iou_threshold = 0.2
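
What these two thresholds do can be sketched as a greedy non-maximum suppression in plain Python: drop detections below the score threshold, then repeatedly keep the highest-scoring box and discard remaining boxes that overlap it by more than the IoU threshold. This is a simplified, class-agnostic illustration with invented detections, not KerasCV's decoder, and the function names are assumptions.

```python
def iou_xyxy(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold, iou_threshold):
    """Return indices of the detections that survive filtering, best score first."""
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Invented detections: two near-duplicates, one low-confidence box, one separate box
boxes = [[100, 100, 200, 200], [105, 105, 205, 205], [400, 400, 450, 450], [300, 50, 360, 120]]
scores = [0.95, 0.90, 0.40, 0.85]
kept = non_max_suppression(boxes, scores, score_threshold=0.8, iou_threshold=0.2)
print(kept)
```

A higher score threshold removes more uncertain detections, and a lower IoU threshold removes more overlapping ones.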

Calling this function (with either train_ds to predict on training data or val_ds to predict on validation data) gives a feel for the model's detection ability.

In [None]:
# Run inference on validation set
visualize_entire_dataset(
    model=yolo,
    dataset=val_ds, # or train_ds
    bounding_box_format="xyxy",
    class_mapping=class_mapping
)
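
The visual check can be complemented by simple counts. The sketch below greedily matches predicted boxes to ground-truth boxes of the same class at an IoU of at least 0.5 and reports precision and recall for one image. It is a simplified illustration with invented boxes, not a full mAP computation and not part of the notebook, and the function names are assumptions.

```python
def iou_xyxy(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(pred, truth, iou_threshold=0.5):
    """pred and truth are lists of (class_id, box). Returns (tp, fp, fn)."""
    unmatched = list(range(len(truth)))
    tp = 0
    for cls, box in pred:
        best_j, best_iou = None, iou_threshold
        for j in unmatched:
            t_cls, t_box = truth[j]
            overlap = iou_xyxy(box, t_box)
            if t_cls == cls and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            # Each ground-truth box can be matched by at most one prediction
            unmatched.remove(best_j)
            tp += 1
    return tp, len(pred) - tp, len(unmatched)


# Invented example: three labeled objects, two predictions (one good, one spurious)
truth = [(0, [300, 228, 334, 276]), (4, [239, 505, 273, 579]), (1, [421, 347, 476, 415])]
pred = [(0, [302, 230, 336, 278]), (1, [10, 10, 50, 50])]
tp, fp, fn = match_detections(pred, truth)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, fn, precision, recall)
```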

Final result and current capabilities of the model:

![Description](result_of_model1.png)


### Conclusions and remarks ###
This concludes this project. Here is a quick summary:

1. Created a dataset manually, labeling a few hard to detect objects in various environmental conditions (lighting, angles, obstructions in view, sizes, etc)

2. Planned and executed a deep learning architecture, namely a CNN pre-trained model, fine-tuned on our dataset, to perform object detection.

3. Invested time into trying to better the results, and understand the issues along the way that must be taken care of:

  3.1 For example, at first I tried to train the model on a much larger dataset consisting of 13 different classes, which gave very poor results, probably because I made very different objects share classes and the preparation of the data was not good enough (bounding boxes were not tight enough, etc.).

  3.2 It also took a long time to decide which model to use and how long to train it.

  3.3 Also, hyper-parameters played a crucial role in the results. For example, switching from binary_crossentropy to focal loss was a major improvement!

4. Printed the results of the object detection task on new data the model did not train on.

5. For the future: it is possible to build a much larger dataset, with a different class per object in the images, covering many different angles, sizes, lighting conditions and more for each. Train a large and popular model suitable for this task on the data for a long time, and incorporate many augmentation techniques. Then we can expect good results, and a model that can detect objects in video games in real time (or semi real time; about 15 fps is probably achievable with non-industrial technology and resources).

6. On a personal note, I've had the chance to wrestle with the machinery of object detection and understand what it takes to train a model to perform such a task. I've also learned a lot about the existing software packages that help perform this mission.
I've learned a lot about the technology of CNN models itself, understanding every part of the architecture not only theoretically but by actually needing it to run successfully, integrated with the rest of the system. And I've also learned a lot about how to use many online tools like Google Colab, Gemini, Roboflow and ChatGPT, and read a few articles on the matter, to quickly find a working architecture suitable for what I decided to do (object detection for Baldur's Gate 3).

7. This is a very hands-on course. Currently I work full-time at Intel and also actively serve in the IDF as a veteran. I have been balancing work, study and military service since the war started in October 2023. Unfortunately, I did not use the full potential of this course by incorporating extensive EDA techniques and scoring performance programmatically, mainly due to time and effort constraints. Hopefully, when I'm ready for the master's degree at The Open University of Israel, I'll have more time on my hands to really invest and immerse myself in data science, which is a very important, relevant and practical subject in computer science these days. This is the last course I'm doing in my BA in CS studies, and even just passing this course will conclude my degree with an average of 83.
I want to thank Dr. Idan Alter for his mentorship and comprehensive feedback on this project and data science in general, and wish him all the best in his efforts.

Respectfully, best regards, Or.