In [2]:
import pandas as pd
import os

In [3]:
for filename in os.listdir():
    if "results_val_vs" in filename and filename.endswith(".csv"):
        print(filename)

results_val_vs_original_dataset.csv
results_val_vs_randomized_videos.csv
results_val_vs_randomized_videos_no_empty_images.csv


# Metrics taken into account

## Accuracy

Accuracy is calculated as:

$$
\text{Accuracy} = \frac{\text{True Positive + True Negative}}{\text{True Positive + False Positive + True Negative + False Negative}}
$$


## Precision

The fraction of produced detections which are true positives.

$$
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive + False Positive}} = \frac{\text{True Positive}}{\text{Total Number of Predictions}}
$$


## Recall

The fraction of ground-truth boxes in the data that are matched to some produced detection.

$$
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{\text{True Positive}}{\text{Total Number of Ground Truths}}
$$

## F1 score

F1 measures the balance between precision and recall. A high F1 means that both precision and recall are high.

The F1 Score is calculated as:

$$
\text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}
$$
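
The four formulas above can be sketched as one small helper. This is a minimal illustration, not part of the original evaluation pipeline: the function name `detection_metrics` and the counts passed to it are hypothetical. Note that in object detection true negatives are usually not well defined, so accuracy is rarely reported; the example simply sets `tn=0`.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration.
m = detection_metrics(tp=80, fp=20, tn=0, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 80/100 and recall is 80/90, so recall is the higher of the two even though 20 of the detections are false alarms.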

# The problem

We are dealing with a weapon detection problem. Avoiding false negatives matters more than avoiding false positives: it is better not to miss a weapon, at the cost of producing more detections with lower accuracy. For that reason, a low threshold will also be used.
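
To illustrate why a lower threshold favors recall, here is a minimal sketch. It assumes the threshold is applied to each detection's confidence score, and the detections, their scores, and the ground-truth count are all made up for the example; they are not taken from the validation runs.

```python
# Hypothetical detections: (confidence, matches_a_ground_truth_box).
detections = [(0.95, True), (0.90, True), (0.70, False), (0.60, True),
              (0.40, True), (0.30, False), (0.20, True), (0.10, False)]
num_ground_truth = 6  # hypothetical number of weapons actually present


def precision_recall_at(threshold, detections, num_ground_truth):
    """Precision and recall when only detections >= threshold are kept."""
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    tp = sum(kept)  # True counts as 1, False as 0
    precision = tp / len(kept) if kept else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.25, 0.05):
    p, r = precision_recall_at(threshold, detections, num_ground_truth)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, lowering the threshold from 0.8 to 0.05 raises recall from 2/6 to 5/6 while precision falls from 1.0 to 0.625, which is the trade the problem statement accepts.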


# Evaluating models vs small dataset

In [4]:
df = pd.read_csv("results_val_vs_original_dataset.csv")
# df.set_index('model_key', inplace=True)

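# Full column set including per-class (pistol, knife) metrics. Kept for
# reference only: the assignment below overrides it.
columns_of_interest = ['model','imgsz', 'epochs', 'batch', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95', 'pistol_P', 'pistol_R',
                       'pistol_F1', 'pistol_mAP@.5', 'pistol_mAP@.5:.95', 'knife_P', 'knife_R',
                       'knife_F1', 'knife_mAP@.5', 'knife_mAP@.5:.95']

# Aggregate ("all") metrics plus the training configuration, including 'tf'.
columns_of_interest = ['model','imgsz', 'epochs', 'batch', 'tf', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95']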

# df[df.dataset == "v1"][columns_of_interest]
df[columns_of_interest]

| | model | imgsz | epochs | batch | tf | all_P | all_R | all_F1 | all_mAP@.5 | all_mAP@.5:.95 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | yolov5 | 800 | 100 | 16 | no | 0.828 | 0.737 | 0.389927 | 0.8 | 0.467 |
| 1 | yolov5 | 640 | 100 | 16 | yolov5s | 0.844 | 0.822 | 0.416427 | 0.857 | 0.522 |
| 2 | yolov5 | 800 | 100 | 16 | yolov5s | 0.852 | 0.796 | 0.411524 | 0.831 | 0.517 |
| 3 | yolov7 | 640 | 100 | 16 | yolov7training | 0.886 | 0.901 | 0.446719 | 0.91 | 0.649 |
| 4 | yolov7 | 800 | 100 | 8 | yolov7training | 0.883 | 0.847 | 0.432313 | 0.893 | 0.616 |
| 5 | yolov8 | 800 | 100 | 16 | no | 0.871 | 0.776 | 0.41038 | 0.846 | 0.599 |
| 6 | yolov8 | 800 | 100 | 16 | yolov8l | 0.867 | 0.758 | 0.404422 | 0.838 | 0.597 |
| 7 | yolov8 | 640 | 73 | 16 | yolov8m | 0.993 | 0.962 | 0.488627 | 0.976 | 0.941 |
| 8 | yolov8 | 640 | 100 | 16 | yolov8s | 0.888 | 0.777 | 0.4144 | 0.86 | 0.623 |
| 9 | yolov8 | 800 | 100 | 16 | yolov8s | 0.877 | 0.756 | 0.406009 | 0.844 | 0.612 |
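
One thing worth checking in this output: the `all_F1` values are about half of what the F1 formula gives for the listed `all_P` and `all_R`. For row 0, precision 0.828 and recall 0.737 yield an F1 of roughly 0.780, while the table shows 0.389927. The reported values match P × R / (P + R), which suggests the factor of 2 was dropped when the column was generated; that is an inference from the numbers, not something the CSV documents. The sketch below recomputes F1 for every row, with the values copied by hand from the table (the helper name `f1_score` is hypothetical).

```python
# (model, imgsz, tf, all_P, all_R, reported all_F1), copied from the table.
rows = [
    ("yolov5", 800, "no",             0.828, 0.737, 0.389927),
    ("yolov5", 640, "yolov5s",        0.844, 0.822, 0.416427),
    ("yolov5", 800, "yolov5s",        0.852, 0.796, 0.411524),
    ("yolov7", 640, "yolov7training", 0.886, 0.901, 0.446719),
    ("yolov7", 800, "yolov7training", 0.883, 0.847, 0.432313),
    ("yolov8", 800, "no",             0.871, 0.776, 0.41038),
    ("yolov8", 800, "yolov8l",        0.867, 0.758, 0.404422),
    ("yolov8", 640, "yolov8m",        0.993, 0.962, 0.488627),
    ("yolov8", 640, "yolov8s",        0.888, 0.777, 0.4144),
    ("yolov8", 800, "yolov8s",        0.877, 0.756, 0.406009),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for model, imgsz, tf, p, r, reported in rows:
    recomputed = f1_score(p, r)
    print(f"{model} {imgsz} {tf:15s} reported={reported:.3f} "
          f"recomputed={recomputed:.3f} ratio={recomputed / reported:.2f}")
```

With the notebook's `df`, the same recomputation would be `2 * df['all_P'] * df['all_R'] / (df['all_P'] + df['all_R'])`. The ranking of the models is unchanged by the missing factor, but the absolute values should not be compared with F1 scores reported elsewhere.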


The best model might also depend on other factors such as computational resources, inference speed, and any specific requirements of your application.

From the provided table, we can draw the following conclusions:

1. **Model Comparison:** The table compares the performance of different YOLO models (yolov5, yolov7, yolov8) with varying image sizes, training epochs, batch sizes, and different versions of the models (e.g., yolov5s, yolov8l, yolov8s).

2. **Size Doesn't Always Mean Better:** A bigger model does not necessarily perform better. At imgsz 800, the larger yolov8l scores slightly below yolov8s in precision (0.867 vs. 0.877), mAP@.5 (0.838 vs. 0.844), and mAP@.5:.95 (0.597 vs. 0.612), with nearly identical recall (0.758 vs. 0.756).

3. **Impact of Image Size:** Model performance varies with image size (imgsz). For yolov7, recall drops from 0.901 at imgsz 640 to 0.847 at imgsz 800 (the batch size also changes from 16 to 8 between those two runs), and for yolov8s it drops from 0.777 to 0.756. For both yolov8 and yolov7, 640 is the better choice.

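4. **Model Version:** Different variants within the same family (e.g., yolov8s, yolov8m, and yolov8l) show differences in performance. The yolov8m run at imgsz 640 scores well above every other row (precision 0.993, recall 0.962, mAP@.5:.95 0.941), although it is also the only run trained for 73 epochs instead of 100.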

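5. **Pretrained Weights (tf):** The "tf" column does not appear to refer to TensorFlow. Its values (yolov5s, yolov7training, yolov8l, yolov8m, yolov8s) look like the names of the pretrained checkpoints each run started from, with "no" meaning no pretrained weights were used. Under that reading, starting from pretrained weights helps yolov5 at imgsz 800 (mAP@.5 0.831 vs. 0.800) but makes little difference for yolov8 at imgsz 800 (0.844 with yolov8s vs. 0.846 without).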

6. **Precision and Recall Trade-off:** Precision and recall trade off against each other. In this table almost every run has higher precision than recall; only yolov7 at imgsz 640 has recall (0.901) above precision (0.886). The choice of model should consider the specific requirements of the use case, and here missed weapons are the costlier error, so recall deserves the most weight.

7. **mAP Variation:** The mean Average Precision (mAP) metrics at different IoU thresholds (0.5 and 0.5:0.95) also show variation across models, indicating differences in object detection accuracy.

8. **Model Selection:** To choose the best model, you need to consider your application's requirements. If you want to minimize both false positives and false negatives, look for models that strike a good balance between precision and recall, as well as F1 score and mAP values.

9. **Consider Practical Aspects:** Besides the metrics, you should also consider practical aspects like computational resources and inference speed when choosing a model for deployment.
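
The image-size observation in point 3 can be checked directly by pairing runs that share the same model and "tf" value and differ in imgsz. The sketch below uses values copied by hand from the table; only three configurations have both a 640 and an 800 run, and the yolov7 pair also differs in batch size, so the comparison is indicative rather than controlled.

```python
# (model, tf, imgsz, all_R, all_mAP@.5), copied from the table.
results = [
    ("yolov5", "yolov5s",        640, 0.822, 0.857),
    ("yolov5", "yolov5s",        800, 0.796, 0.831),
    ("yolov7", "yolov7training", 640, 0.901, 0.910),
    ("yolov7", "yolov7training", 800, 0.847, 0.893),
    ("yolov8", "yolov8s",        640, 0.777, 0.860),
    ("yolov8", "yolov8s",        800, 0.756, 0.844),
]

# Group the two image sizes of each (model, tf) configuration together.
by_config = {}
for model, tf, imgsz, recall, map50 in results:
    by_config.setdefault((model, tf), {})[imgsz] = (recall, map50)

# Positive deltas mean the 640 run scored higher than the 800 run.
deltas = {}
for config, sizes in by_config.items():
    recall_640, map_640 = sizes[640]
    recall_800, map_800 = sizes[800]
    deltas[config] = (round(recall_640 - recall_800, 3),
                      round(map_640 - map_800, 3))
    print(config, "recall delta:", deltas[config][0],
          "mAP@.5 delta:", deltas[config][1])
```

All three pairs come out positive for both metrics, which is consistent with preferring imgsz 640 on this dataset.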

In summary, model selection involves a trade-off between various performance metrics and practical considerations. Evaluate the models based on your specific application's needs and constraints to make an informed choice.
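
Since the problem statement puts avoiding false negatives first, one simple way to shortlist candidates is to sort the runs by recall and break ties with mAP@.5:.95. This is only a sketch of that selection rule, with values copied by hand from the table above; the labels are shorthand for model, imgsz, and "tf", and the rule ignores inference speed and resource constraints.

```python
# (label, all_R, all_mAP@.5:.95), copied from the table.
runs = [
    ("yolov5 800 no",             0.737, 0.467),
    ("yolov5 640 yolov5s",        0.822, 0.522),
    ("yolov5 800 yolov5s",        0.796, 0.517),
    ("yolov7 640 yolov7training", 0.901, 0.649),
    ("yolov7 800 yolov7training", 0.847, 0.616),
    ("yolov8 800 no",             0.776, 0.599),
    ("yolov8 800 yolov8l",        0.758, 0.597),
    ("yolov8 640 yolov8m",        0.962, 0.941),
    ("yolov8 640 yolov8s",        0.777, 0.623),
    ("yolov8 800 yolov8s",        0.756, 0.612),
]

# Highest recall first; mAP@.5:.95 decides between equal recalls.
ranked = sorted(runs, key=lambda run: (run[1], run[2]), reverse=True)

for label, recall, map5095 in ranked[:3]:
    print(f"{label:28s} recall={recall:.3f}  mAP@.5:.95={map5095:.3f}")
```

The top three by this rule are yolov8m at imgsz 640, then yolov7 at 640, then yolov7 at 800. The yolov8m run is the one trained for 73 epochs, so its lead is worth confirming on the other validation sets before relying on it.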