In [1]:
import pandas as pd
import os

In [2]:
OUTPUT_DIR = "output_csvs"
for filename in os.listdir(OUTPUT_DIR):
    if "results_val_vs" in filename and filename.endswith(".csv"):
        print(filename)

results_val_vs_original_dataset.csv
results_val_vs_randomized_clips.csv


# Metrics taken into account

## Accuracy

Accuracy is calculated as:

$$
\text{Accuracy} = \frac{\text{True Positive + True Negative}}{\text{True Positive + False Positive + True Negative + False Negative}}
$$


## Precision

The fraction of produced detections which are true positives.

$$
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive + False Positive}} = \frac{\text{True Positive}}{\text{Total Number of Predictions}}
$$


## Recall

The fraction of ground-truth boxes in the data that are matched by some produced detection.

$$
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{\text{True Positive}}{\text{Total Number of Ground-Truth Boxes}}
$$

## F1 score

F1 measures the balance between precision and recall. A high F1 is only possible when both precision and recall are high.

The F1 Score is calculated as:

$$
\text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}
$$
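
The four formulas above can be sketched as one small helper that works from raw counts. This is a minimal illustration, not the evaluation code used for the experiments: the function name `detection_metrics` and the counts in the example are hypothetical. In object detection, true negatives are usually not counted, so `tn` defaults to 0 here.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> dict:
    """Compute accuracy, precision, recall and F1 from raw counts."""
    total = tp + fp + tn + fn
    # Guard every division so that empty inputs return 0.0 instead of failing.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 80 weapons found, 20 false alarms, 10 weapons missed.
metrics = detection_metrics(tp=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 80/100 and recall is 80/90, so recall is the higher of the two even though one in five detections is a false alarm.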

# The problem

We are dealing with a weapon detection problem, where avoiding false negatives matters more than avoiding false positives. It is better not to miss a weapon, even at the cost of producing more detections with lower precision. For this reason, a low confidence threshold will also be used.

Given the context of weapon detection and the emphasis on avoiding false negatives (missing weapon detections) at the expense of potentially having more false positives (incorrect weapon detections), **recall** is more important than precision in this scenario.

Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive cases (weapons) that were correctly identified by the model. In this problem, a higher recall means that the model captures a larger portion of the actual weapons, minimizing the risk of missing a potentially dangerous object.

Precision (positive predictive value) is also significant: it measures how many of the predicted positives are correct. Here, however, a focus on precision might make the model overly cautious. It would produce fewer false positives, but it could also miss actual weapons, which is the more critical failure in a weapon detection scenario.

A higher **recall** rate is crucial for ensuring that potential weapons are not overlooked, even if it means accepting a higher number of false positives.
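
The effect of a low threshold on this trade-off can be illustrated with a toy example. The scores and labels below are made up for illustration (they are not taken from the experiments), and `precision_recall_at` is a hypothetical helper. As the threshold drops, more detections are kept, so recall rises while precision falls.

```python
# Hypothetical detections: (confidence score, is it really a weapon?).
detections = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False), (0.60, True),
    (0.45, False), (0.40, True), (0.30, False), (0.25, True), (0.15, False),
]
# Hypothetical number of ground-truth weapons in the data.
NUM_GROUND_TRUTH = 6


def precision_recall_at(threshold: float) -> tuple:
    """Keep detections scoring at least `threshold`; return (precision, recall)."""
    kept = [is_weapon for score, is_weapon in detections if score >= threshold]
    tp = sum(kept)          # kept detections that are real weapons
    fp = len(kept) - tp     # kept detections that are false alarms
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / NUM_GROUND_TRUTH
    return precision, recall


for threshold in (0.8, 0.5, 0.2):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, a threshold of 0.8 finds only half of the weapons with no false alarms, while a threshold of 0.2 finds all of them at the price of three false alarms.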

# Evaluating models on the small dataset
## Sorted by precision (50 epochs)

In [37]:
df = pd.read_csv(f"{OUTPUT_DIR}/results_val_vs_original_dataset.csv")
# df.set_index('model_key', inplace=True)

columns_of_interest = ['model','imgsz', 'epochs', 'batch', 'tf', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95', 'lr0', 'loss_function']

df[(df['dataset'] == 'v1') & (df['epochs'] == 50)][columns_of_interest].sort_values(by=['all_P'])

| index | model | imgsz | epochs | batch | tf | all_P | all_R | all_F1 | all_mAP@.5 | all_mAP@.5:.95 | lr0 | loss_function |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | yolov5 | 800 | 50 | 16 | no | 0.737 | 0.625 | 0.338198 | 0.701 | 0.358 | 0.01 | SGD |
| 3 | yolov5 | 640 | 50 | 16 | yolov5s | 0.773 | 0.712 | 0.370624 | 0.764 | 0.429 | 0.01 | SGD |
| 5 | yolov5 | 800 | 50 | 16 | yolov5s | 0.796 | 0.716 | 0.376942 | 0.755 | 0.428 | 0.01 | SGD |
| 8 | yolov7 | 640 | 50 | 16 | yolov7training | 0.839 | 0.902 | 0.43468 | 0.907 | 0.626 | 0.001 | SGD |
| 11 | yolov8 | 800 | 50 | 16 | no | 0.842 | 0.744 | 0.394986 | 0.841 | 0.579 | 0.01 | SGD |
| 21 | yolov8 | 800 | 50 | 16 | yolov8s | 0.864 | 0.776 | 0.40882 | 0.852 | 0.583 | 0.01 | SGD |
| 18 | yolov8 | 640 | 50 | 16 | yolov8s | 0.868 | 0.782 | 0.411379 | 0.857 | 0.595 | 0.001 | Adam |
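
A note on the `all_F1` column: the reported values appear to be half of the F1 score defined earlier, i.e. P·R/(P+R) without the factor of 2. This is an inference from the numbers, not something the source states. The sketch below recomputes F1 for three rows copied from the table (indices 0, 3 and 8) and compares half of it with the reported value; `f1_score` is a hypothetical helper.

```python
# (all_P, all_R, reported all_F1) copied from rows 0, 3 and 8 of the table.
rows = [
    (0.737, 0.625, 0.338198),
    (0.773, 0.712, 0.370624),
    (0.839, 0.902, 0.434680),
]


def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for p, r, reported in rows:
    f1 = f1_score(p, r)
    print(f"P={p:.3f} R={r:.3f} reported={reported:.6f} "
          f"F1={f1:.6f} F1/2={f1 / 2:.6f}")
```

Because every value is scaled by the same constant, the ranking of the models by `all_F1` is unaffected; only the absolute values differ from the usual F1 scale.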


### Initial conclusions

The experiments cover yolov5, yolov7 and yolov8, trained with different image sizes, numbers of epochs, and batch sizes. The first experiments were trained for 50 epochs.

Among the models, YOLOv8 generally achieves higher precision than YOLOv5 and YOLOv7 (up to 0.868), but YOLOv7 has the best recall (0.902) and the best mAP@.5 (0.907).

#### Transfer learning
Some models use transfer learning by starting with pre-trained weights ('yolov5s.pt', 'yolov7_training.pt', 'yolov8s.pt'). Transfer learning seems to generally improve performance in terms of precision ('all_P'), recall ('all_R'), and F1-score ('all_F1').
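
The 50-epoch table contains two pairs of runs that differ only in the use of pre-trained weights (yolov5 and yolov8, both at `imgsz` 800 with SGD and `lr0` 0.01). The sketch below computes the gain for each pair; the values are copied from the table, and the variable names are illustrative only.

```python
# (model, imgsz, tf, all_P, all_R, all_mAP@.5) copied from the 50-epoch table.
runs = [
    ("yolov5", 800, "no",      0.737, 0.625, 0.701),
    ("yolov5", 800, "yolov5s", 0.796, 0.716, 0.755),
    ("yolov8", 800, "no",      0.842, 0.744, 0.841),
    ("yolov8", 800, "yolov8s", 0.864, 0.776, 0.852),
]

# Index the runs trained from scratch by (model, imgsz).
scratch = {(m, s): (p, r, ap) for m, s, tf, p, r, ap in runs if tf == "no"}

# Gain of each pre-trained run over its from-scratch counterpart.
gains = {}
for m, s, tf, p, r, ap in runs:
    if tf != "no" and (m, s) in scratch:
        p0, r0, ap0 = scratch[(m, s)]
        gains[(m, s)] = (round(p - p0, 3), round(r - r0, 3), round(ap - ap0, 3))

for (model, imgsz), (dp, dr, dap) in gains.items():
    print(f"{model} @ {imgsz}: P +{dp}, R +{dr}, mAP@.5 +{dap}")
```

Both pairs improve on all three metrics, with the larger gain for yolov5.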

#### Image size

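Smaller image sizes (640) perform at least as well as larger ones (800) in most comparisons. For yolov8 with yolov8s weights, 640 is better on every metric, although that run also uses Adam and a smaller learning rate. For yolov5 with yolov5s weights, 640 gives a higher mAP@.5 (0.764 vs. 0.755) but slightly lower precision and recall.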

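#### Optimizer (`loss_function` column)

The Adam optimizer seems to achieve better precision than the SGD optimizer (0.868 vs. 0.864 for yolov8 with yolov8s weights), although those two runs also differ in image size and learning rate.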

#### Learning rate

A smaller learning rate leads to better-performing models, but they take longer to train.

## Training for 100 epochs

In [38]:
df = pd.read_csv(f"{OUTPUT_DIR}/results_val_vs_original_dataset.csv")

df[(df['dataset'] == 'v1') & (df['epochs'] == 100)][columns_of_interest].sort_values(by=['all_R'])

| index | model | imgsz | epochs | batch | tf | all_P | all_R | all_F1 | all_mAP@.5 | all_mAP@.5:.95 | lr0 | loss_function |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | yolov5 | 640 | 100 | 16 | yolov5s | 0.692 | 0.603 | 0.322221 | 0.665 | 0.333 | 0.001 | SGD |
| 4 | yolov5 | 800 | 100 | 16 | yolov5s | 0.854 | 0.745 | 0.397892 | 0.807 | 0.49 | 0.01 | SGD |
| 2 | yolov5 | 640 | 100 | 16 | yolov5s | 0.811 | 0.75 | 0.389654 | 0.807 | 0.476 | 0.01 | SGD |
| 19 | yolov8 | 800 | 100 | 16 | yolov8s | 0.877 | 0.756 | 0.406009 | 0.844 | 0.612 | 0.01 | SGD |
| 12 | yolov8 | 800 | 100 | 16 | yolov8l | 0.867 | 0.758 | 0.404422 | 0.838 | 0.597 | 0.01 | SGD |
| 10 | yolov8 | 800 | 100 | 16 | no | 0.871 | 0.776 | 0.41038 | 0.846 | 0.599 | 0.01 | SGD |
| 16 | yolov8 | 640 | 100 | 16 | yolov8s | 0.888 | 0.777 | 0.4144 | 0.86 | 0.623 | 0.01 | SGD |
| 15 | yolov8 | 640 | 100 | 16 | yolov8s | 0.852 | 0.796 | 0.411524 | 0.849 | 0.616 | 0.001 | Adam |
| 9 | yolov7 | 800 | 100 | 8 | yolov7training | 0.856 | 0.808 | 0.415654 | 0.858 | 0.584 | 0.01 | SGD |
| 7 | yolov7 | 640 | 100 | 16 | yolov7training | 0.874 | 0.849 | 0.430659 | 0.881 | 0.624 | 0.01 | SGD |


Increasing the number of training epochs from 50 to 100 seems to have improved the performance of most models across various metrics.

Training for more epochs might allow the models to learn more complex patterns, leading to better results.

From the table, we can draw the following conclusions:

1. **Size Doesn't Always Mean Better:** The table suggests that having a bigger model does not necessarily guarantee better performance. For instance, the yolov8l model, which is larger than yolov8s, does not perform better.

2. **Impact of Image Size:** For every model, an `imgsz` of 640 gives better recall than 800.

3. **Model Version:** Different versions of YOLO show differences in performance. Some versions might perform better in terms of certain metrics. In terms of recall, yolov7 performs better than yolov5 and yolov8. In terms of precision, the best model was achieved with yolov8.

4. **Transfer learning:** At 50 epochs, transfer learning improved precision (P), recall (R), F1-score (F1), and mean Average Precision (mAP) in both comparable pairs of runs. At 100 epochs the only comparable pair (yolov8 at 800) is less clear-cut: pre-trained weights raised precision (0.877 vs. 0.871) and mAP@.5:.95 (0.612 vs. 0.599) but lowered recall (0.756 vs. 0.776). Pre-trained models are therefore a useful starting point, but not a guaranteed improvement on every metric.

5. **Precision and Recall Trade-off:** It's clear that there's a trade-off between precision and recall. Some models have higher precision but lower recall, while others have higher recall but lower precision.

Besides the metrics, practical aspects such as computational resources and inference speed will be considered when choosing a model for deployment.
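
The image-size conclusion above can be checked against the 100-epoch table. The sketch below uses the runs that share pre-trained weights, `lr0` 0.01 and SGD; the values are copied from the table. Note that the two yolov7 runs also differ in batch size (16 vs. 8), so that pair is not a clean comparison.

```python
# (model, imgsz, batch, all_R) copied from the 100-epoch table
# (pre-trained weights, lr0 = 0.01, SGD).
runs = [
    ("yolov5", 640, 16, 0.750), ("yolov5", 800, 16, 0.745),
    ("yolov7", 640, 16, 0.849), ("yolov7", 800, 8, 0.808),
    ("yolov8", 640, 16, 0.777), ("yolov8", 800, 16, 0.756),
]

recall = {(model, imgsz): r for model, imgsz, _batch, r in runs}

for model in ("yolov5", "yolov7", "yolov8"):
    diff = recall[(model, 640)] - recall[(model, 800)]
    print(f"{model}: recall@640={recall[(model, 640)]:.3f} "
          f"recall@800={recall[(model, 800)]:.3f} diff={diff:+.3f}")
```

The difference is small for yolov5 (0.005) and larger for yolov8 (0.021) and yolov7 (0.041).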

# Training models on a bigger dataset

Ideally, given the previous evaluations, the best choice for training on a larger dataset would have been yolov7. However, memory limits on the training platform (Google Colab) made this unfeasible: attempts to train on the bigger dataset ended in memory overflows that terminated the training process before the model could be effectively trained. The next option was therefore yolov8, with the following configuration:

- image size: 640
- batch size: 16
- transfer learning: yolov8s
- lr: between 0.001 (with Adam) and 0.01 (with SGD)
- epochs: 100 and 300, to test if training longer gives better results
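
The configuration above can be written down as a small grid of training runs. This is only a sketch: it enumerates every combination of the listed values, which is not necessarily the exact set of runs that was performed. The `train_all` function assumes the `ultralytics` package is installed and that a dataset file exists at a hypothetical `data_yaml` path; neither is confirmed by the source.

```python
from itertools import product

# Values taken from the list above: (optimizer, lr0) pairs and epoch counts.
OPTIMIZERS = [("SGD", 0.01), ("Adam", 0.001)]
EPOCHS = [100, 300]

# One dictionary of training arguments per combination.
configs = [
    {"imgsz": 640, "batch": 16, "epochs": epochs,
     "optimizer": optimizer, "lr0": lr0}
    for (optimizer, lr0), epochs in product(OPTIMIZERS, EPOCHS)
]


def train_all(data_yaml: str, weights: str = "yolov8s.pt") -> None:
    """Train one model per config (assumes `ultralytics` is installed)."""
    from ultralytics import YOLO  # imported lazily; not needed to build the grid

    for cfg in configs:
        # Start each run from the same pre-trained weights (transfer learning).
        YOLO(weights).train(data=data_yaml, **cfg)


for cfg in configs:
    print(cfg)
```

Keeping the grid as plain dictionaries makes it easy to log each configuration next to its results, in the same way the CSV files store `imgsz`, `epochs`, `batch` and `lr0` for every run.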

In [42]:
df[(df['dataset'] == 'v2') & (df['epochs'] != 50)][columns_of_interest].sort_values(by=['all_R'])

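| index | model | imgsz | epochs | batch | tf | all_P | all_R | all_F1 | all_mAP@.5 | all_mAP@.5:.95 | lr0 | loss_function |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | yolov8 | 640 | 300 | 16 | yolov8s | 1.0 | 0.961 | 0.490056 | 0.982 | 0.942 | 0.01 | SGD |
| 20 | yolov8 | 800 | 100 | 16 | yolov8s | 0.987 | 0.961 | 0.486913 | 0.982 | 0.949 | 0.01 | SGD |
| 13 | yolov8 | 640 | 100 | 16 | yolov8m | 0.993 | 0.962 | 0.488627 | 0.977 | 0.941 | 0.01 | SGD |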


In [43]:
df = pd.read_csv(f"{OUTPUT_DIR}/results_val_vs_randomized_clips.csv")

df[(df['dataset'] == 'v2')][columns_of_interest].sort_values(by=['all_R'])

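| index | model | imgsz | epochs | batch | tf | all_P | all_R | all_F1 | all_mAP@.5 | all_mAP@.5:.95 | lr0 | loss_function |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | yolov8 | 800 | 100 | 16 | yolov8s | 0.578 | 0.302 | 0.198359 | 0.314 | 0.124 | 0.01 | SGD |
| 1 | yolov8 | 640 | 300 | 16 | yolov8s | 0.509 | 0.311 | 0.193048 | 0.325 | 0.137 | 0.01 | SGD |
| 0 | yolov8 | 640 | 100 | 16 | yolov8m | 0.576 | 0.335 | 0.211811 | 0.366 | 0.145 | 0.01 | SGD |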
