In [1]:
import pandas as pd
import os

In [2]:
OUTPUT_DIR = "output_csvs"
for filename in os.listdir(OUTPUT_DIR):
    if "results_val_vs" in filename and filename.endswith(".csv"):
        print(filename)

results_val_vs_randomized_videos.csv
results_val_vs_randomized_videos_no_empty_images.csv
results_val_vs_original_dataset.csv


# Metrics taken into account

## Accuracy

Accuracy is calculated as:

$$
\text{Accuracy} = \frac{\text{True Positive + True Negative}}{\text{True Positive + False Positive + True Negative + False Negative}}
$$


## Precision

The fraction of produced detections which are true positives.

$$
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive + False Positive}} = \frac{\text{True Positive}}{\text{Total Number of Predictions}}
$$


## Recall

The fraction of ground-truth boxes in the data that were matched to some produced detection.

$$
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{\text{True Positive}}{\text{Total Number of Ground Truths}}
$$
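
As a minimal sketch of the two formulas above, the helper below computes precision and recall from raw detection counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the evaluation results.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from detection counts.

    A zero denominator yields 0.0 instead of raising an error.
    """
    predictions = tp + fp    # total number of predictions
    ground_truths = tp + fn  # total number of ground-truth boxes
    precision = tp / predictions if predictions else 0.0
    recall = tp / ground_truths if ground_truths else 0.0
    return precision, recall


# Hypothetical counts: 80 matched detections, 20 spurious ones, 10 missed weapons.
precision, recall = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With these counts the missed weapons (false negatives) lower recall only, while the spurious detections (false positives) lower precision only.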

## F1 score

F1 is the harmonic mean of precision and recall, so it measures the balance between the two. F1 is high only when both precision and recall are high.

The F1 Score is calculated as:

$$
\text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}
$$
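
The formula can be checked with a short sketch; `f1_score` is a hypothetical helper, and the precision and recall values are those of the first yolov5 row in the result tables further down. Going by the numbers alone, the `all_F1` column in those tables appears to equal P·R/(P+R), half of what the formula gives; this is an inference from the values, not something the CSV documents. Because halving does not change the order, model rankings by F1 are unaffected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Precision and recall of the first yolov5 row (imgsz 800, no transfer learning).
p, r = 0.828, 0.737
f1 = f1_score(p, r)

print(f"F1 from the formula: {f1:.6f}")
print(f"Half of that value:  {f1 / 2:.6f}")  # compare with the all_F1 column
```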

# The problem

We are dealing with a weapon detection problem, where it is more important to avoid false negatives than false positives. It is better to produce more detections at lower precision than to miss a weapon. For this reason, a low confidence threshold will also be used.

Given the context of weapon detection and the emphasis on avoiding false negatives (missing weapon detections) at the expense of potentially having more false positives (incorrect weapon detections), **recall** is more important than precision in this scenario.

Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive cases (weapons) that were correctly identified by the model. In this problem, a higher recall means that the model captures a larger share of the actual weapons, minimizing the risk of missing a potentially dangerous object.

Precision (positive predictive value) also matters: it measures how many of the predicted positives are correct. Optimizing for precision, however, makes the model more conservative. It produces fewer false positives but can also miss actual weapons, which is the more critical error in a weapon detection scenario.

A higher **recall** rate is crucial for ensuring that potential weapons are not overlooked, even if it means accepting a higher number of false positives.
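
The effect of a low confidence threshold can be sketched with a toy example. The detections, their scores, and the number of ground-truth weapons below are all hypothetical. The point is only that keeping lower-scoring detections tends to raise recall at the cost of precision.

```python
from typing import List, Tuple

# Hypothetical detections: (confidence score, matched a ground-truth weapon?).
DETECTIONS: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.85, True),
    (0.70, True), (0.60, False),
    (0.45, True), (0.35, False), (0.30, True), (0.27, False),
]
NUM_GROUND_TRUTHS = 7  # hypothetical: one weapon is never detected at any score


def precision_recall_at(threshold: float) -> Tuple[float, float]:
    """Precision and recall when only detections scoring >= threshold are kept."""
    kept = [matched for score, matched in DETECTIONS if score >= threshold]
    tp = sum(kept)  # True counts as 1
    precision = tp / len(kept) if kept else 0.0
    recall = tp / NUM_GROUND_TRUTHS
    return precision, recall


for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"conf >= {threshold:.2f}: precision={p:.2f} recall={r:.2f}")
```

In this toy data, precision falls and recall rises at each lower threshold. Real detectors do not always behave this monotonically in precision, but recall can only stay the same or grow as the threshold drops.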

# Evaluating models on the small dataset
## Sort by Recall

In [8]:
df = pd.read_csv(f"{OUTPUT_DIR}/results_val_vs_original_dataset.csv")
# df.set_index('model_key', inplace=True)

# Per-class (pistol, knife) metrics, kept for reference; not displayed below.
per_class_columns = ['pistol_P', 'pistol_R', 'pistol_F1', 'pistol_mAP@.5', 'pistol_mAP@.5:.95',
                     'knife_P', 'knife_R', 'knife_F1', 'knife_mAP@.5', 'knife_mAP@.5:.95']

# Training settings plus the metrics aggregated over all classes.
columns_of_interest = ['model', 'imgsz', 'epochs', 'batch', 'tf', 'all_P', 'all_R',
                       'all_F1', 'all_mAP@.5', 'all_mAP@.5:.95']

# Keep the "v1" dataset runs and sort by recall, ascending (best model last).
df[df.dataset == "v1"][columns_of_interest].sort_values(by=['all_R'])

Unnamed: 0,model,imgsz,epochs,batch,tf,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
0,yolov5,800,100,16,no,0.828,0.737,0.389927,0.8,0.467
9,yolov8,800,100,16,yolov8s,0.877,0.756,0.406009,0.844,0.612
6,yolov8,800,100,16,yolov8l,0.867,0.758,0.404422,0.838,0.597
5,yolov8,800,100,16,no,0.871,0.776,0.41038,0.846,0.599
8,yolov8,640,100,16,yolov8s,0.888,0.777,0.4144,0.86,0.623
2,yolov5,800,100,16,yolov5s,0.852,0.796,0.411524,0.831,0.517
1,yolov5,640,100,16,yolov5s,0.844,0.822,0.416427,0.857,0.522
4,yolov7,800,100,8,yolov7training,0.883,0.847,0.432313,0.893,0.616
3,yolov7,640,100,16,yolov7training,0.886,0.901,0.446719,0.91,0.649


## Sort by mAP

In [14]:
df[df.dataset == "v1"][columns_of_interest].sort_values(by=['all_mAP@.5:.95'])

Unnamed: 0,model,imgsz,epochs,batch,tf,all_P,all_R,all_F1,all_mAP@.5,all_mAP@.5:.95
0,yolov5,800,100,16,no,0.828,0.737,0.389927,0.8,0.467
2,yolov5,800,100,16,yolov5s,0.852,0.796,0.411524,0.831,0.517
1,yolov5,640,100,16,yolov5s,0.844,0.822,0.416427,0.857,0.522
6,yolov8,800,100,16,yolov8l,0.867,0.758,0.404422,0.838,0.597
5,yolov8,800,100,16,no,0.871,0.776,0.41038,0.846,0.599
9,yolov8,800,100,16,yolov8s,0.877,0.756,0.406009,0.844,0.612
4,yolov7,800,100,8,yolov7training,0.883,0.847,0.432313,0.893,0.616
8,yolov8,640,100,16,yolov8s,0.888,0.777,0.4144,0.86,0.623
3,yolov7,640,100,16,yolov7training,0.886,0.901,0.446719,0.91,0.649


From the table, we can draw the following conclusions:

1. **Size Doesn't Always Mean Better:** A bigger model does not guarantee better performance. For instance, at imgsz 800 the yolov8l weights, which are larger than yolov8s, give almost the same recall (0.758 vs. 0.756) and lower precision and mAP.

2. **Impact of Image Size:** Model performance varies with image size (imgsz), which affects the model's ability to detect weapons. For every model trained at both sizes, imgsz 640 gives a better recall than imgsz 800. Note that the two yolov7 runs also differ in batch size (16 vs. 8), so that comparison is not controlled.

3. **Model Version:** Different YOLO versions (e.g., yolov5 vs. yolov8, both starting from their small "s" weights) differ in performance, and which one is best depends on the metric. In terms of recall, yolov7 performs better than yolov5 and yolov8. The highest precision (0.888) was achieved with yolov8, marginally ahead of yolov7 (0.886).

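4. **Transfer learning:** Starting from pre-trained weights (the tf column) clearly helps yolov5: at imgsz 800, precision (P), recall (R), F1-score (F1), and mean Average Precision (mAP) are all higher with the yolov5s weights than without them. For yolov8 at imgsz 800 the effect is mixed: the yolov8s weights improve precision and mAP@.5:.95, but the model trained without transfer learning has the highest recall (0.776 vs. 0.756 and 0.758). Leveraging pre-trained models is therefore often, but not always, beneficial in these runs.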

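5. **Precision and Recall Trade-off:** Several models trade precision against recall. At imgsz 640, yolov8s has the highest precision (0.888) but a recall of only 0.777, while yolov5s has lower precision (0.844) but higher recall (0.822). The yolov7 model at imgsz 640 is the exception, scoring high on both (0.886 and 0.901).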

Besides the metrics, practical aspects such as computational resources and inference speed will be considered when choosing a model for deployment.
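
Conclusions 2 and 4 can be checked directly against the numbers. The sketch below rebuilds a small DataFrame from the precision and recall values printed in the tables above, rather than reading the CSV, and then compares recall across image sizes and transfer-learning settings. It assumes that "no" in the tf column means training without pre-trained weights.

```python
import pandas as pd

# Precision and recall copied from the tables above (dataset "v1").
rows = [
    ("yolov5", 800, "no", 0.828, 0.737),
    ("yolov8", 800, "yolov8s", 0.877, 0.756),
    ("yolov8", 800, "yolov8l", 0.867, 0.758),
    ("yolov8", 800, "no", 0.871, 0.776),
    ("yolov8", 640, "yolov8s", 0.888, 0.777),
    ("yolov5", 800, "yolov5s", 0.852, 0.796),
    ("yolov5", 640, "yolov5s", 0.844, 0.822),
    ("yolov7", 800, "yolov7training", 0.883, 0.847),  # batch 8, unlike the rest
    ("yolov7", 640, "yolov7training", 0.886, 0.901),
]
results = pd.DataFrame(rows, columns=["model", "imgsz", "tf", "all_P", "all_R"])

# Recall per image size for each (model, pre-trained weights) pair.
recall_by_size = results.pivot_table(index=["model", "tf"], columns="imgsz", values="all_R")
both_sizes = recall_by_size.dropna()  # keep pairs trained at both 640 and 800
print(both_sizes)
print("640 beats 800 for every pair:", bool((both_sizes[640] > both_sizes[800]).all()))

# Transfer learning vs. no transfer learning at imgsz 800.
at_800 = results[results.imgsz == 800].set_index(["model", "tf"]).sort_index()
print(at_800[["all_P", "all_R"]])
```

Only three (model, weights) pairs were trained at both image sizes, so the image-size conclusion rests on a small sample.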

# Training models on a bigger dataset

Ideally, the model chosen for training on the bigger dataset would be yolov7, based on the evaluation above. However, with the available training resources (Google Colab) this was not possible because of the memory required. The next option was yolov8 with the following settings:

- image size: 640
- batch size: 16
- transfer learning: yolov8s
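
As a rough sketch, these settings could be passed to a training run as follows. It assumes the `ultralytics` package is used to train yolov8, which the text does not state. The dataset config path `weapons.yaml` is hypothetical, and the epoch count is assumed to match the 100 epochs of the earlier runs.

```python
# Settings from the list above; "data" and "epochs" are assumptions.
TRAIN_ARGS = {
    "data": "weapons.yaml",  # hypothetical dataset config path
    "imgsz": 640,            # image size
    "batch": 16,             # batch size
    "epochs": 100,           # assumed: same as the runs on the small dataset
}
PRETRAINED_WEIGHTS = "yolov8s.pt"  # transfer learning from yolov8s


def train_yolov8(**overrides):
    """Fine-tune yolov8s with the settings above (needs `pip install ultralytics`)."""
    # Imported lazily so the settings can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO(PRETRAINED_WEIGHTS)
    return model.train(**{**TRAIN_ARGS, **overrides})
```

Calling `train_yolov8()` would start training with the listed settings; a call such as `train_yolov8(batch=8)` overrides a single value, which can help if memory on Colab is tight.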