<h1>Object detection exercise</h1>

This exercise tests whether a Convolutional Neural Network (CNN) can detect objects within a picture. To keep the example simple, two classes are considered: cars and pedestrians.

Forty images containing cars and/or pedestrians (people) were downloaded from the internet. The images were labelled manually with the labelImg tool, and the annotations were exported in YOLO format. If the annotations are in XML format instead, they can be converted to YOLO format; a conversion script is supplied in script_converter.ipynb.
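The XML-to-YOLO conversion comes down to turning pixel corner coordinates into a normalised centre/size box. The sketch below is not the code in script_converter.ipynb; it assumes Pascal VOC style XML and a hypothetical class order of car = 0, pedestrian = 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical class order; it must match the order used in the dataset YAML.
CLASS_IDS = {"car": 0, "pedestrian": 1}


def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates to a normalised (xc, yc, w, h) box."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def voc_xml_to_yolo_lines(xml_text):
    """Return one 'class xc yc w h' line per <object> in a VOC-style XML string."""
    root = ET.fromstring(xml_text)
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        class_id = CLASS_IDS[obj.findtext("name")]
        box = obj.find("bndbox")
        corners = [float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        xc, yc, w, h = voc_box_to_yolo(*corners, img_w, img_h)
        lines.append(f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines


# Made-up annotation: one car covering the central half of a 640x480 image.
sample = """<annotation>
  <size><width>640</width><height>480</height></size>
  <object><name>car</name>
    <bndbox><xmin>160</xmin><ymin>120</ymin><xmax>480</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""
print(voc_xml_to_yolo_lines(sample))
```

Each returned line would go into a .txt file named after its image, one line per labelled object.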

Train, validation and test folders were then created in the layout that YOLOv5 expects, and the YOLOv5 library was downloaded to the local environment. Code blocks for downloading and installing YOLOv5 are also supplied in script_converter.ipynb. A YAML file giving the paths to the train, validation and test folders and the two classes was also prepared. The model was then trained with transfer learning, and the last step was to run object detection on the test images.
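For orientation, the dataset YAML could look like the output of the sketch below. The folder names and the class order are assumptions, not the contents of the actual pedestrian_and_car.yaml; the keys `path`, `train`, `val`, `test` and `names` follow the layout of the dataset files shipped with YOLOv5, such as coco128.yaml.

```python
def build_dataset_yaml(root, class_names):
    """Render a minimal YOLOv5-style dataset YAML as a string."""
    lines = [
        f"path: {root}  # dataset root folder (assumed name)",
        "train: images/train  # relative to path (assumed layout)",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    # Class indices must match the class ids used in the label files.
    for index, name in enumerate(class_names):
        lines.append(f"  {index}: {name}")
    return "\n".join(lines) + "\n"


yaml_text = build_dataset_yaml("datasets/pedestrian_and_car", ["car", "pedestrian"])
print(yaml_text)
```

Writing `yaml_text` to a file and passing that file to `--data` is how the dataset would be handed to train.py.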

Several transfer learning conditions are compared to see which performs best:

- Condition 1: 50 epochs, batch size 4, first 10 layers frozen, using yolov5s

- Condition 2: 50 epochs, batch size 4, first 10 layers frozen, using yolov5s6

- Condition 3: 50 epochs, batch size 4, first 23 layers frozen, using yolov5s
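The three conditions differ only in the weights file and the freeze count, so the training commands can be generated from a small table. This is an illustrative sketch: the `conditions` list and the `train_command` helper are names introduced here, while the flags are the ones used in the training cells of this notebook.

```python
conditions = [
    {"name": "Condition 1", "weights": "yolov5s.pt", "freeze": 10},
    {"name": "Condition 2", "weights": "yolov5s6.pt", "freeze": 10},
    {"name": "Condition 3", "weights": "yolov5s.pt", "freeze": 23},
]


def train_command(condition, data="pedestrian_and_car.yaml", epochs=50, batch=4):
    """Build the train.py command line for one condition."""
    return (
        f"python3 yolov5/train.py --data {data} --weights {condition['weights']} "
        f"--epochs {epochs} --batch {batch} --freeze {condition['freeze']}"
    )


for condition in conditions:
    print(condition["name"], "->", train_command(condition))
```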

<h2>Import libraries</h2>

In [9]:
from IPython.display import Image, clear_output

<h4>Additional notes</h4>

For each condition outlined, the model is trained first and then tested. One sample image is displayed per condition for comparison.

Training: 

- pedestrian_and_car.yaml is the dataset file containing the paths to the train, validation and test folders and the classes

- yolov5s.pt is the pre-trained weights file; the small model was chosen for faster training and lower memory use

- epochs is the number of training epochs; runs of 10, 30 and 50 epochs were tried against the test images, and 50 epochs is used here

- batch is the training batch size; since only 40 images are used for training and validation combined, a batch size of 4 is appropriate

- freeze is the number of leading layers whose weights are kept fixed during transfer learning. The yolov5s model definition has 25 modules, numbered 0 to 24; freeze 10 means that modules 0 to 9 are frozen. These first 10 modules constitute the "backbone" of YOLOv5.

- The trained weights are saved under yolov5/runs/train/expN/weights/best.pt, where N is the run number (the first run is saved in exp, later runs in exp2, exp3 and so on)
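How a freeze count turns into frozen parameters can be sketched without loading the model. The example below uses made-up parameter names in the `model.<index>.` style and assumes that freezing works by matching those name prefixes; it is an approximation of the idea, not the code in train.py.

```python
def frozen_prefixes(freeze_count):
    """Name prefixes of the modules frozen by a given freeze count."""
    # The trailing dot keeps "model.1." from also matching "model.10.".
    return [f"model.{index}." for index in range(freeze_count)]


def is_frozen(param_name, freeze_count):
    """True if the parameter belongs to one of the first freeze_count modules."""
    return any(param_name.startswith(prefix) for prefix in frozen_prefixes(freeze_count))


# Hypothetical parameter names, one per module of interest.
example_params = [
    "model.0.conv.weight",
    "model.9.cv1.conv.weight",
    "model.10.conv.weight",
    "model.23.cv1.conv.weight",
    "model.24.m.0.weight",
]

for count in (10, 23):
    trainable = [name for name in example_params if not is_frozen(name, count)]
    print(f"--freeze {count}: trainable -> {trainable}")
```

With a freeze count of 23, only the parameters of the last two modules in this list stay trainable.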

Test:
- The annotated outputs are saved under yolov5/runs/detect/expN/, where N is the number of the detection run

- Pass best.pt to detect.py to use the best-performing checkpoint, or last.pt to use the weights saved after the final epoch
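Because each run writes to a new folder (exp, exp2, exp3 and so on), it is easy to point a later cell at the wrong one. The hypothetical helper below reproduces that naming pattern, as seen in the paths used in this notebook, so that paths can be built from a run number.

```python
def run_folder(kind, run_number):
    """Folder of the n-th 'train' or 'detect' run; the first run has no numeric suffix."""
    suffix = "" if run_number == 1 else str(run_number)
    return f"yolov5/runs/{kind}/exp{suffix}"


def best_weights(run_number):
    """Path to the best checkpoint of the n-th training run."""
    return f"{run_folder('train', run_number)}/weights/best.pt"


print(best_weights(1))
print(best_weights(3))
print(run_folder("detect", 4))
```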

<h5>Condition 1</h5>

Training:

In [1]:
!python3 yolov5/train.py --data pedestrian_and_car.yaml --weights yolov5s.pt --epochs 50 --batch 4 --freeze 10

train: weights=yolov5s.pt, cfg=, data=pedestrian_and_car.yaml, hyp=yolov5/data/hyps/hyp.scratch-low.yaml, epochs=50, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=yolov5/data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=yolov5/runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[10], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight

Test:

In [10]:
!python3 yolov5/detect.py --weights yolov5/runs/train/exp/weights/best.pt --conf 0.40 --iou 0.50 --source yolov5/data/images/

detect: weights=['yolov5/runs/train/exp/weights/best.pt'], source=yolov5/data/images/, data=yolov5/data/coco128.yaml, imgsz=[640, 640], conf_thres=0.4, iou_thres=0.5, max_det=1000, device=, view_img=False, save_txt=False, save_csv=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=yolov5/runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

Fusing layers... 
Model summary: 157 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
image 1/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T1.jpg: 480x640 1 car, 1 pedestrian, 49.2ms
image 2/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T2.jpg: 352x640 4 cars, 3 pedestrians, 37.6ms
image 3/3 /home/han/Documents

Result:

The first condition performs well, as cars and pedestrians in the foreground and background can be detected.

In [12]:
Image(filename='yolov5/runs/detect/exp/T2.jpg', width=600)

<IPython.core.display.Image object>
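To compare conditions by numbers as well as by eye, the per-image summary lines printed by detect.py can be parsed into class counts. The parsing rules below are assumptions based on the log lines shown in this notebook (for example `352x640 4 cars, 3 pedestrians, 37.6ms`), not a documented output format.

```python
import re


def parse_counts(log_line):
    """Return {class_name: count} from one detect.py image summary line."""
    # Keep only the text after the file path, e.g. "352x640 4 cars, 3 pedestrians, 37.6ms".
    summary = log_line.rsplit(": ", 1)[1]
    fields = summary.split(", ")[:-1]  # drop the trailing timing field
    if not fields:
        return {}
    fields[0] = fields[0].split(" ", 1)[1]  # drop the "HxW" size prefix
    counts = {}
    for field in fields:
        match = re.fullmatch(r"(\d+) (.+)", field)
        if match is None:  # assumed form of an empty result, e.g. "(no detections)"
            continue
        count, name = int(match.group(1)), match.group(2)
        if count > 1 and name.endswith("s"):
            name = name[:-1]  # the logs appear to add a plural "s" for counts above 1
        counts[name] = count
    return counts


line = (
    "image 2/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T2.jpg: "
    "352x640 4 cars, 3 pedestrians, 37.6ms"
)
print(parse_counts(line))
```

Running the same function over the logs of each condition gives per-image counts that can be set side by side.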

In [13]:
!python3 yolov5/detect.py --weights yolov5/runs/train/exp/weights/best.pt --conf 0.40 --iou 0.50 --source yolov5/data/videos/

detect: weights=['yolov5/runs/train/exp/weights/best.pt'], source=yolov5/data/videos/, data=yolov5/data/coco128.yaml, imgsz=[640, 640], conf_thres=0.4, iou_thres=0.5, max_det=1000, device=, view_img=False, save_txt=False, save_csv=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=yolov5/runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

Fusing layers... 
Model summary: 157 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
video 1/1 (1/638) /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/videos/video_1.mp4: 384x640 3 cars, 6 pedestrians, 49.8ms
video 1/1 (2/638) /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/videos/video_1.mp4: 384x640 7 cars, 6 pedestrians, 45.9ms
v

<h5>Condition 2</h5>

Training:

In [4]:
!python3 yolov5/train.py --data pedestrian_and_car.yaml --weights yolov5s6.pt --epochs 50 --batch 4 --freeze 10

train: weights=yolov5s6.pt, cfg=, data=pedestrian_and_car.yaml, hyp=yolov5/data/hyps/hyp.scratch-low.yaml, epochs=50, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=yolov5/data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=yolov5/runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[10], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weigh

Test:

In [14]:
!python3 yolov5/detect.py --weights yolov5/runs/train/exp2/weights/best.pt --conf 0.40 --iou 0.50 --source yolov5/data/images/

detect: weights=['yolov5/runs/train/exp2/weights/best.pt'], source=yolov5/data/images/, data=yolov5/data/coco128.yaml, imgsz=[640, 640], conf_thres=0.4, iou_thres=0.5, max_det=1000, device=, view_img=False, save_txt=False, save_csv=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=yolov5/runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

Fusing layers... 
Model summary: 206 layers, 12312052 parameters, 0 gradients, 16.1 GFLOPs
image 1/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T1.jpg: 512x640 1 pedestrian, 67.4ms
image 2/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T2.jpg: 384x640 2 cars, 1 pedestrian, 54.8ms
image 3/3 /home/han/Documents/CV mo

Result:

The second condition performs slightly worse than the first; only the cars and pedestrians closer to the foreground are detected.

In [17]:
Image(filename='yolov5/runs/detect/exp3/T2.jpg', width=600)

<IPython.core.display.Image object>

<h5>Condition 3</h5>

Training:

In [6]:
!python3 yolov5/train.py --data pedestrian_and_car.yaml --weights yolov5s.pt --epochs 50 --batch 4 --freeze 23

train: weights=yolov5s.pt, cfg=, data=pedestrian_and_car.yaml, hyp=yolov5/data/hyps/hyp.scratch-low.yaml, epochs=50, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=yolov5/data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=yolov5/runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[23], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight

Test:

In [18]:
!python3 yolov5/detect.py --weights yolov5/runs/train/exp3/weights/best.pt --conf 0.40 --iou 0.50 --source yolov5/data/images/

detect: weights=['yolov5/runs/train/exp3/weights/best.pt'], source=yolov5/data/images/, data=yolov5/data/coco128.yaml, imgsz=[640, 640], conf_thres=0.4, iou_thres=0.5, max_det=1000, device=, view_img=False, save_txt=False, save_csv=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=yolov5/runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
git: 'models/CV_Yolo5_2/yolov5' is not a git command. See 'git --help'.
YOLOv5 🚀 2024-2-14 Python-3.8.10 torch-2.2.0+cu121 CPU

Fusing layers... 
Model summary: 157 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
image 1/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T1.jpg: 480x640 1 pedestrian, 63.8ms
image 2/3 /home/han/Documents/CV models/CV_Yolo5_2/yolov5/data/images/T2.jpg: 352x640 2 cars, 3 pedestrians, 38.2ms
image 3/3 /home/han/Documents/CV mo

Result:

The cars and pedestrians in the foreground are detected, as are the pedestrians in the background. Compared to Condition 1, the confidence scores are marginally higher.

In [19]:
Image(filename='yolov5/runs/detect/exp4/T2.jpg', width=600)

<IPython.core.display.Image object>

<h2>Discussion</h2>

The results for each condition are printed at the end of each training run. Of the three conditions, Condition 1 performs best, with mAP50 scores of 0.812 for car and 0.926 for pedestrian. Condition 3 is more balanced across the two classes, though with slightly lower mAP50 scores of 0.808 for car and 0.868 for pedestrian. Condition 2 is similarly balanced, with 0.774 for car and 0.725 for pedestrian, but its scores are the lowest.
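The comparison can be made explicit by computing, for each condition, the mean of the two per-class mAP50 scores and the gap between them. The numbers are the ones stated in this discussion; the `summarise` helper is only an illustrative sketch.

```python
# Per-class mAP50 scores as quoted in the discussion.
map50 = {
    "Condition 1": {"car": 0.812, "pedestrian": 0.926},
    "Condition 2": {"car": 0.774, "pedestrian": 0.725},
    "Condition 3": {"car": 0.808, "pedestrian": 0.868},
}


def summarise(scores):
    """Return (mean score, gap between best and worst class), rounded to 3 decimals."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    gap = max(values) - min(values)
    return round(mean, 3), round(gap, 3)


for name, scores in map50.items():
    mean, gap = summarise(scores)
    print(f"{name}: mean mAP50 = {mean:.3f}, class gap = {gap:.3f}")
```

Condition 1 has the highest mean, while Conditions 2 and 3 have the smaller gaps between the two classes.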

Condition 1 is therefore taken as the baseline, and the other two conditions are compared against it. The yolov5s model definition has 25 modules (numbered 0 to 24) covering the backbone, neck and head, and the first 10 modules form the backbone. According to this GitHub issue <https://github.com/ultralytics/yolov5/issues/1314>, transfer learning with only the backbone frozen performs better than freezing nearly all layers. The results here agree with this, although the gap between the car and pedestrian mAP scores is smaller in Condition 3 than in Condition 1.

Condition 2 was attempted because YOLOv5s6 is a larger model than YOLOv5s, with more parameters and higher reported mAP scores <https://github.com/ultralytics/yolov5#inference>; the aim was to determine experimentally whether better benchmark scores translate into better performance here. For this exercise, however, YOLOv5s is the better model for transfer learning than YOLOv5s6, given the same training conditions.

To conclude, YOLOv5 has proven to be a good model for transfer learning in computer vision, though there is still room for improvement. Cars and pedestrians could likely be detected with higher confidence scores through several options:

1. A larger training/validation dataset
2. A higher number of epochs, up to 150
3. A larger YOLOv5 model
4. Hyperparameter tuning