In [37]:
import pandas as pd
import os
pd.set_option('display.float_format', '{:.5f}'.format)

# Metrics taken into account

## Accuracy

Accuracy is calculated as:

$$
\text{Accuracy} = \frac{\text{True Positive + True Negative}}{\text{True Positive + False Positive + True Negative + False Negative}}
$$


## Precision

The fraction of produced detections which are true positives.

$$
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive + False Positive}} = \frac{\text{True Positive}}{\text{Total Number of Predictions}}
$$


## Recall

The fraction of ground-truth boxes in the data that are matched to some produced detection.

$$
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive + False Negative}}
$$

## F1 score

F1 measures the balance between precision and recall. If F1 is high, both precision and recall are high.

The F1 Score is calculated as:

$$
\text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}
$$
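
As a quick check of the four definitions above, here is a minimal sketch that computes them from raw confusion counts. The function names and the counts are hypothetical, chosen only for illustration; they are not part of the evaluation scripts that produced the CSV files.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> dict:
    """Accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_pr(precision, recall),
    }


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Hypothetical counts, for illustration only
print(detection_metrics(tp=80, fp=20, fn=10))

# Precision and recall taken from one row of the results tables (yolov5, 800, no transfer learning)
print(round(f1_from_pr(0.737, 0.625), 4))
```

Note that the `all_F1` values in the result tables further down appear to equal $\frac{P \times R}{P + R}$, that is, half of the F1 defined here: for P = 0.737 and R = 0.625 the table reports 0.3382, while the formula above gives about 0.6764. The ranking of models by `all_F1` is unaffected, but the absolute values should be read with that in mind.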

# The problem

We are dealing with a weapon detection problem, where false negatives are more costly than false positives: it is better to flag some objects that turn out not to be weapons than to miss a real one. For this reason, a low confidence threshold will also be used.

Given the context of weapon detection and the emphasis on avoiding false negatives (missing weapon detections) at the expense of potentially having more false positives (incorrect weapon detections), **recall** is more important than precision in this scenario.

Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive cases (weapons) that were correctly identified by the model. In the context of this problem, a higher recall means that the model captures a larger portion of the actual weapons, minimizing the risk of missing a potentially dangerous object.

Precision (positive predictive value) is also significant: it measures the fraction of predicted positives that are correct. Here, however, optimizing for precision would make the model more conservative, producing fewer false positives at the risk of missing actual weapons, which is the more critical failure in a weapon detection scenario.

A higher **recall** rate is crucial for ensuring that potential weapons are not overlooked, even if it means accepting a higher number of false positives.
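
The effect of the low threshold mentioned above can be sketched with a handful of scored detections. Everything in this example is made up for illustration (the confidence scores, the true/false labels, the number of ground-truth weapons and the helper name `precision_recall_at`); it only shows the direction of the trade-off, not results from the models evaluated here.

```python
def precision_recall_at(threshold, scored_detections, n_ground_truth):
    """Precision and recall when keeping detections with confidence >= threshold.

    scored_detections: list of (confidence, is_true_positive) pairs.
    """
    kept = [is_tp for conf, is_tp in scored_detections if conf >= threshold]
    tp = sum(kept)
    # Precision is undefined with no detections; report 0.0 in that case
    precision = tp / len(kept) if kept else 0.0
    recall = tp / n_ground_truth
    return precision, recall


# Hypothetical detections: (confidence, matches a real weapon?)
detections = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False), (0.60, True),
    (0.45, True), (0.35, False), (0.25, True), (0.15, False), (0.10, False),
]
N_WEAPONS = 6  # hypothetical number of ground-truth boxes

for threshold in (0.8, 0.5, 0.2):
    p, r = precision_recall_at(threshold, detections, N_WEAPONS)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With these made-up scores, lowering the threshold from 0.8 to 0.2 raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.75.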

# Evaluating models on the small dataset
## Training 50 epochs

In [38]:
OUTPUT_DIR = "./output_csvs"
for filename in os.listdir(OUTPUT_DIR):
    if filename.endswith(".csv"):
        print(filename)

eval_trained_on_dataset_v1_task_test_ds_original.csv
eval_trained_on_dataset_v1_task_val_ds_original.csv
eval_trained_on_dataset_v2_task_test_ds_original.csv
eval_trained_on_dataset_v1_task_test_ds_rc.csv
eval_trained_on_dataset_v2_task_test_ds_rc.csv
eval_trained_on_dataset_v2_task_val_ds_original.csv


In [40]:
df = pd.read_csv(f"{OUTPUT_DIR}/eval_trained_on_dataset_v1_task_test_ds_original.csv")
# df.set_index('model_key', inplace=True)

columns_of_interest = ['model','imgsz','transfer_learning', 'lr0', 'optimizer', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95']

df[(df['epochs'] == 50)][columns_of_interest].sort_values(by=['all_P'])

Unnamed: 0,model,imgsz,transfer_learning,lr0,optimizer,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
1,yolov5,800,no,0.01,SGD,0.737,0.625,0.3382,0.701,0.358
4,yolov5,640,yolov5s,0.01,SGD,0.773,0.712,0.37062,0.764,0.429
6,yolov5,800,yolov5s,0.01,SGD,0.796,0.716,0.37694,0.755,0.428
9,yolov7,640,yolov7training,0.001,SGD,0.839,0.902,0.43468,0.907,0.626
12,yolov8,800,no,0.01,SGD,0.842,0.744,0.39499,0.841,0.579
18,yolov8,800,yolov8s,0.01,SGD,0.864,0.776,0.40882,0.852,0.583
16,yolov8,640,yolov8s,0.001,Adam,0.868,0.782,0.41138,0.857,0.595


### Initial conclusions

The experiments cover yolov5, yolov7 and yolov8, varying image size, number of epochs, transfer learning, optimizer and learning rate. The first experiments were trained for 50 epochs.

Among the models, YOLOv8 generally performs better than YOLOv5 and YOLOv7 in terms of precision, but YOLOv7 performs better in recall and in mAP.

#### Transfer learning
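Some experiments use transfer learning by starting from pre-trained weights ('yolov5s.pt', 'yolov7_training.pt', 'yolov8s.pt'). In the comparable 50-epoch runs, transfer learning improves precision, recall and F1-score: for example, yolov5 at image size 800 goes from P 0.737 / R 0.625 without pre-trained weights to P 0.796 / R 0.716 with them.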

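#### Optimizer and learning rate

A smaller learning rate may lead to a better model but typically takes longer to train. In these 50-epoch runs, the only Adam configuration (yolov8, 640, `lr0 = 0.001`) has the highest precision in the table (0.868), marginally above the best SGD run (0.864). The two runs also differ in image size and learning rate, so the effect of the optimizer cannot be isolated from this comparison.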

## Training 100 epochs 

In [41]:
df[(df['epochs'] == 100)][columns_of_interest].sort_values(by=['all_P'])

Unnamed: 0,model,imgsz,transfer_learning,lr0,optimizer,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
2,yolov5,640,yolov5s,0.001,SGD,0.692,0.603,0.32222,0.665,0.333
0,yolov5,800,no,0.01,SGD,0.793,0.689,0.36868,0.747,0.422
3,yolov5,640,yolov5s,0.01,SGD,0.811,0.75,0.38965,0.807,0.476
14,yolov8,640,yolov8s,0.001,Adam,0.852,0.796,0.41152,0.849,0.616
5,yolov5,800,yolov5s,0.01,SGD,0.854,0.745,0.39789,0.807,0.49
10,yolov7,800,yolov7training,0.01,SGD,0.856,0.808,0.41565,0.858,0.583
13,yolov8,800,yolov8l,0.01,SGD,0.867,0.758,0.40442,0.838,0.597
7,yolov7,640,yolov7training,0.001,SGD,0.87,0.904,0.44334,0.914,0.64
11,yolov8,800,no,0.01,SGD,0.871,0.776,0.41038,0.846,0.599
8,yolov7,640,yolov7training,0.01,SGD,0.874,0.849,0.43066,0.881,0.624


Increasing the number of training epochs from 50 to 100 seems to have improved the performance of most models across various metrics.

Training for more epochs may allow the models to learn more complex patterns, leading to better results.
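
To make the 50 vs 100 epoch comparison concrete, the sketch below lines up the configurations that appear in both tables above and computes the change in precision and recall. The values are copied from those tables; the variable names are only illustrative.

```python
import pandas as pd

config = ["model", "imgsz", "transfer_learning", "lr0", "optimizer"]
metrics = ["all_P", "all_R"]

# (config..., epochs, all_P, all_R) for configurations present at both 50 and 100 epochs
rows = [
    ("yolov5", 800, "no", 0.01, "SGD", 50, 0.737, 0.625),
    ("yolov5", 800, "no", 0.01, "SGD", 100, 0.793, 0.689),
    ("yolov5", 640, "yolov5s", 0.01, "SGD", 50, 0.773, 0.712),
    ("yolov5", 640, "yolov5s", 0.01, "SGD", 100, 0.811, 0.750),
    ("yolov5", 800, "yolov5s", 0.01, "SGD", 50, 0.796, 0.716),
    ("yolov5", 800, "yolov5s", 0.01, "SGD", 100, 0.854, 0.745),
    ("yolov7", 640, "yolov7training", 0.001, "SGD", 50, 0.839, 0.902),
    ("yolov7", 640, "yolov7training", 0.001, "SGD", 100, 0.870, 0.904),
    ("yolov8", 800, "no", 0.01, "SGD", 50, 0.842, 0.744),
    ("yolov8", 800, "no", 0.01, "SGD", 100, 0.871, 0.776),
    ("yolov8", 640, "yolov8s", 0.001, "Adam", 50, 0.868, 0.782),
    ("yolov8", 640, "yolov8s", 0.001, "Adam", 100, 0.852, 0.796),
]
runs = pd.DataFrame(rows, columns=config + ["epochs"] + metrics)

# Index both subsets by configuration so the subtraction aligns matching runs
e50 = runs[runs["epochs"] == 50].set_index(config)[metrics]
e100 = runs[runs["epochs"] == 100].set_index(config)[metrics]
delta = (e100 - e50).round(3)
print(delta)
```

Recall improves for all six matched configurations, and precision improves for all except the yolov8 Adam run, where it drops slightly (0.868 to 0.852).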

From the table, we can draw the following conclusions:

1. **Size Doesn't Always Mean Better:** A bigger model does not necessarily guarantee better performance. For instance, the yolov8l run (P 0.867, R 0.758), which starts from a larger model than yolov8s, has lower recall than the yolov8s run (P 0.852, R 0.796), although the two runs also differ in image size and optimizer.

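2. **Impact of Image Size:** In these 100-epoch runs, the highest recall of each model family is obtained with an image size of 640 (yolov5: 0.750, yolov7: 0.904, yolov8: 0.796). The margin is negligible for yolov5 (0.750 vs 0.745 at 800), and some of the runs being compared also differ in learning rate or optimizer.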

3. **Model Version:** Different versions of YOLO perform differently depending on the metric. In terms of recall, yolov7 performs better than yolov5 and yolov8 (best values 0.904 vs 0.750 and 0.796). In terms of precision, yolov7 (0.874) and yolov8 (0.871) are practically tied at 100 epochs, while yolov8 had the highest precision at 50 epochs.

4. **Transfer learning:** For yolov5 at image size 800, starting from pre-trained weights improves every metric over training from scratch (P 0.854 vs 0.793, R 0.745 vs 0.689, mAP@.5 0.807 vs 0.747). For yolov8 at 800 the picture is less clear: the run without pre-trained weights (P 0.871, R 0.776) is on par with the one initialised from yolov8l (P 0.867, R 0.758). Pre-trained weights therefore help in most, but not all, of the configurations tested.

5. **Precision and Recall Trade-off:** It's clear that there's a trade-off between precision and recall. Some models have higher precision but lower recall, while others have higher recall but lower precision.

Besides the metrics, practical aspects such as computational resources and inference speed will be considered when choosing a model for deployment.
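
Since recall is the priority for this problem, it can help to look at the best run per model family ranked by `all_R` rather than `all_P`. The following sketch does that for the 100-epoch table above; the rows are copied from that table (a subset of its columns), and the variable names are illustrative only.

```python
import pandas as pd

columns = ["model", "imgsz", "transfer_learning", "lr0", "all_P", "all_R"]
rows = [
    ("yolov5", 640, "yolov5s", 0.001, 0.692, 0.603),
    ("yolov5", 800, "no", 0.01, 0.793, 0.689),
    ("yolov5", 640, "yolov5s", 0.01, 0.811, 0.750),
    ("yolov8", 640, "yolov8s", 0.001, 0.852, 0.796),
    ("yolov5", 800, "yolov5s", 0.01, 0.854, 0.745),
    ("yolov7", 800, "yolov7training", 0.01, 0.856, 0.808),
    ("yolov8", 800, "yolov8l", 0.01, 0.867, 0.758),
    ("yolov7", 640, "yolov7training", 0.001, 0.870, 0.904),
    ("yolov8", 800, "no", 0.01, 0.871, 0.776),
    ("yolov7", 640, "yolov7training", 0.01, 0.874, 0.849),
]
runs_100 = pd.DataFrame(rows, columns=columns)

# For each model family, keep the row with the highest recall
best_idx = runs_100.groupby("model")["all_R"].idxmax()
best_by_recall = runs_100.loc[best_idx].sort_values("all_R", ascending=False)
print(best_by_recall.to_string(index=False))
```

Ranked this way, yolov7 comes first (0.904), followed by yolov8 (0.796) and yolov5 (0.750), and all three best runs use an image size of 640.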

In [51]:
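df = pd.read_csv(f"{OUTPUT_DIR}/eval_trained_on_dataset_v1_task_test_ds_rc.csv")

# Include 'epochs' explicitly: this table mixes 50- and 100-epoch runs
columns_of_interest = ['model', 'epochs', 'imgsz', 'transfer_learning', 'lr0', 'optimizer', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95']

df[columns_of_interest].sort_values(by=['all_P'])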

Unnamed: 0,model,epochs,imgsz,transfer_learning,lr0,optimizer,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
5,yolov8,50,640,yolov8s,0.001,Adam,0.124,0.155,0.06889,0.0665,0.0257
1,yolov8,50,800,no,0.01,SGD,0.135,0.129,0.06597,0.0542,0.0202
0,yolov8,100,800,no,0.01,SGD,0.137,0.117,0.06311,0.0488,0.0198
2,yolov8,100,800,yolov8l,0.01,SGD,0.157,0.127,0.07021,0.0615,0.0261
4,yolov8,100,640,yolov8s,0.01,SGD,0.159,0.107,0.06396,0.0484,0.0199
7,yolov8,50,800,yolov8s,0.01,SGD,0.216,0.176,0.09698,0.0859,0.0296
6,yolov8,100,800,yolov8s,0.01,SGD,0.222,0.118,0.07705,0.0726,0.0307
3,yolov8,100,640,yolov8s,0.001,Adam,0.323,0.11,0.08206,0.0976,0.0334


# Training models on a bigger dataset

Ideally, given the previous evaluations, yolov7 would have been the choice for training on a larger dataset. However, the memory limits of the training platform (Google Colab) made this unfeasible: attempts to train on the bigger dataset ran out of memory and terminated the training process before the model could be effectively trained. The next option was therefore yolov8, with the following settings:

- image size: 640
- batch size: 16
- transfer learning: yolov8s
- lr: 0.001 (with Adam) or 0.01 (with SGD)
- epochs: 100 and 300, to test if training longer gives better results
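
The settings listed above could be expanded into a small grid of runs along these lines. This is a sketch, not the training script actually used: `build_runs`, `train_all` and the `data_yaml` path are hypothetical names, and the training call assumes the `ultralytics` package, whose `YOLO(...).train(...)` accepts `data`, `epochs`, `imgsz`, `batch`, `optimizer` and `lr0`.

```python
from itertools import product


def build_runs():
    """One config dict per (optimizer, lr0) pairing and epoch count."""
    optimizer_settings = [("Adam", 0.001), ("SGD", 0.01)]
    epoch_settings = [100, 300]
    return [
        {"epochs": epochs, "imgsz": 640, "batch": 16, "optimizer": opt, "lr0": lr0}
        for (opt, lr0), epochs in product(optimizer_settings, epoch_settings)
    ]


def train_all(runs, data_yaml, weights="yolov8s.pt"):
    """Train one model per config (assumes ultralytics is installed).

    data_yaml is a hypothetical path to the dataset description file.
    """
    from ultralytics import YOLO  # imported lazily so the grid can be built without it

    for cfg in runs:
        # Start every run from the same pre-trained weights (transfer learning)
        YOLO(weights).train(data=data_yaml, **cfg)


for cfg in build_runs():
    print(cfg)
```

Keeping the grid separate from the training call makes it easy to inspect or trim the list of runs before spending Colab time on them.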

In [47]:
df = pd.read_csv(f"{OUTPUT_DIR}/eval_trained_on_dataset_v2_task_test_ds_original.csv")

columns_of_interest = ['model','epochs','imgsz','transfer_learning', 'lr0', 'optimizer', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95']

df[columns_of_interest].sort_values(by=['all_P'])

Unnamed: 0,model,epochs,imgsz,transfer_learning,lr0,optimizer,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
4,yolov8,100,640,yolov8s,0.001,Adam,0.986,0.962,0.48693,0.981,0.941
7,yolov8,100,800,yolov8s,0.01,SGD,0.987,0.961,0.48691,0.982,0.949
10,yolov8,80,800,yolov8s,0.01,SGD,0.987,0.961,0.48691,0.982,0.948
3,yolov8,50,640,yolov8m,0.01,SGD,0.991,0.962,0.48814,0.974,0.938
2,yolov8,100,640,yolov8m,0.01,SGD,0.993,0.962,0.48863,0.977,0.941
9,yolov8,60,800,yolov8s,0.01,SGD,0.993,0.962,0.48863,0.977,0.946
6,yolov8,50,640,yolov8s,0.001,Adam,0.997,0.955,0.48777,0.979,0.932
8,yolov8,40,800,yolov8s,0.01,SGD,0.997,0.924,0.47956,0.98,0.932
0,yolov8,50,640,yolov8l,0.001,Adam,1.0,0.941,0.4848,0.972,0.932
1,yolov8,50,640,yolov8l,0.01,SGD,1.0,0.959,0.48954,0.984,0.937


In [54]:
df = pd.read_csv(f"{OUTPUT_DIR}/eval_trained_on_dataset_v2_task_test_ds_rc.csv")

df[columns_of_interest].sort_values(by=['all_P'])

Unnamed: 0,model,epochs,imgsz,transfer_learning,lr0,optimizer,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
6,yolov8,50,640,yolov8s,0.001,Adam,0.307,0.302,0.15224,0.214,0.0827
3,yolov8,50,640,yolov8m,0.01,SGD,0.331,0.345,0.16893,0.285,0.114
0,yolov8,50,640,yolov8l,0.001,Adam,0.352,0.276,0.1547,0.21,0.0796
8,yolov8,40,800,yolov8s,0.01,SGD,0.372,0.327,0.17403,0.268,0.11
4,yolov8,100,640,yolov8s,0.001,Adam,0.435,0.286,0.17255,0.269,0.102
1,yolov8,50,640,yolov8l,0.01,SGD,0.481,0.347,0.20158,0.309,0.13
5,yolov8,300,640,yolov8s,0.01,SGD,0.509,0.311,0.19305,0.325,0.137
10,yolov8,80,800,yolov8s,0.01,SGD,0.56,0.302,0.19619,0.31,0.123
2,yolov8,100,640,yolov8m,0.01,SGD,0.576,0.335,0.21181,0.366,0.145
7,yolov8,100,800,yolov8s,0.01,SGD,0.578,0.302,0.19836,0.314,0.124


I like the last 3 (to be reviewed)

In [55]:
df[columns_of_interest].sort_values(by=['all_R'])

Unnamed: 0,model,epochs,imgsz,transfer_learning,lr0,optimizer,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
0,yolov8,50,640,yolov8l,0.001,Adam,0.352,0.276,0.1547,0.21,0.0796
4,yolov8,100,640,yolov8s,0.001,Adam,0.435,0.286,0.17255,0.269,0.102
9,yolov8,60,800,yolov8s,0.01,SGD,0.581,0.293,0.19477,0.318,0.125
6,yolov8,50,640,yolov8s,0.001,Adam,0.307,0.302,0.15224,0.214,0.0827
7,yolov8,100,800,yolov8s,0.01,SGD,0.578,0.302,0.19836,0.314,0.124
10,yolov8,80,800,yolov8s,0.01,SGD,0.56,0.302,0.19619,0.31,0.123
5,yolov8,300,640,yolov8s,0.01,SGD,0.509,0.311,0.19305,0.325,0.137
8,yolov8,40,800,yolov8s,0.01,SGD,0.372,0.327,0.17403,0.268,0.11
2,yolov8,100,640,yolov8m,0.01,SGD,0.576,0.335,0.21181,0.366,0.145
3,yolov8,50,640,yolov8m,0.01,SGD,0.331,0.345,0.16893,0.285,0.114


In [None]:
# Run 2