# Object Detection

This is an initial experiment to learn object detection using OpenCV.

The main idea of object detection is to locate an object in an image. It is not limited to a single object or a single type of object. Although humans do this naturally just by looking at an image, it is not a simple task for an algorithm, which only 'sees' the image's pixel values.

Most object detection strategies include the following steps: (1) selecting a region of interest (ROI), (2) feature extraction, and (3) post-processing. Each step can be carried out with many different techniques, and these choices are what distinguish one strategy from another.

The ROI can be a rectangular section of the image or the entire image itself. It can also be each position of a sliding box moved across the image, or several overlapping boxes of different sizes. Broadly speaking, it is just a way to select a section of the image to compare against a pattern that describes the object you are looking for.
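To make the sliding-box idea concrete, here is a minimal sketch in plain Python. The function name `sliding_windows` and its parameters are illustrative assumptions, not part of OpenCV; it only enumerates candidate ROIs and does not run any detector on them.

```python
def sliding_windows(image_width, image_height, box_sizes, step):
    """Yield (x, y, w, h) ROIs that fit entirely inside the image."""
    for w, h in box_sizes:
        for y in range(0, image_height - h + 1, step):
            for x in range(0, image_width - w + 1, step):
                yield (x, y, w, h)


# Toy 8x6 "image" scanned with two box sizes and a stride of 2 pixels.
rois = list(sliding_windows(8, 6, box_sizes=[(4, 4), (8, 6)], step=2))
print(len(rois))          # 7
print(rois[0], rois[-1])  # (0, 0, 4, 4) (0, 0, 8, 6)
```

The number of ROIs grows quickly with image size, number of box sizes, and smaller steps, which is one reason the later sections look at cheaper alternatives.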

The object you are locating needs to be described by a pattern defined by a set of features. A feature abstracts some possibly complex mathematical relation between the analyzed pixels, such as the two sharp edges of the nose in a frontal human face. Feature extraction is a vast subject in itself, with many proposals for each type of object, but basically you need a method to extract those features directly from a region of pixels. In face detection, for example, adjacent light and dark rectangles can be used to capture the nose and eyes pattern ([see Haar-like features](https://en.wikipedia.org/wiki/Haar-like_feature)). There are also methods that are more efficient and easier to use. For a good overview, check out the article [Feature Extraction for Object Recognition and Image Classification, by Aastha Tiwari, Anil Kumar Goswami, and Mansi Saraswat](https://www.ijert.org/research/feature-extraction-for-object-recognition-and-image-classification-IJERTV2IS100491.pdf).
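As a rough illustration of a Haar-like feature, the sketch below computes a two-rectangle (left minus right) response using an integral image, which lets any rectangle sum be read with four lookups. The helper names and the toy 4x4 patch are assumptions made for this example; real cascades combine thousands of such features.

```python
def integral_image(img):
    """Return a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_vertical_edge(ii, x, y, w, h):
    """Left half minus right half: a large magnitude suggests a vertical edge."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy grayscale patch: bright on the left, dark on the right.
patch = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
ii = integral_image(patch)
edge_response = two_rect_vertical_edge(ii, 0, 0, 4, 4)  # window spans the edge
flat_response = two_rect_vertical_edge(ii, 0, 0, 2, 4)  # window inside the bright area
print(edge_response, flat_response)  # 1520 0
```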

With the computational power now available on common devices, it is practical to let software optimize the parameters of the feature extraction model from a set of data, for example by using machine learning to find the best dimensions for those Haar-like features. We can also use a deep neural network (DNN) to model and optimize the feature extraction entirely, so that we don't need to engineer the features by 'hand' and can work at a higher level of abstraction.

Speaking of DNNs: since the most common operation in image processing is the convolution, the class of DNN used for image processing is usually called a convolutional neural network (CNN), so we can imagine a ton of matrix operations being processed here. [Read more about this architecture](https://en.wikipedia.org/wiki/Convolutional_neural_network#Architecture).
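For intuition about the convolution operation mentioned above, here is a minimal sketch in plain Python. It implements the sliding multiply-and-sum without kernel flipping (which is what CNN layers usually compute under the name convolution), with no padding and a stride of 1. The function name and the toy data are assumptions for illustration only.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out


image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
kernel = [[-1, 1]]  # horizontal difference: responds where brightness jumps
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 9, 0], [0, 9, 0], [0, 9, 0]]
```

In a CNN the kernel values are not chosen by hand like here; they are the weights learned during training.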

The following sections experiment with a few object detection methods.

# Detecting People with Haar Cascade Classifier

The idea here is to use a Haar Cascade Classifier algorithm by loading an XML file with the pre-trained parameters for full-body detection.

To understand what a Haar Cascade Classifier is and how it works, there is a good basic overview in the [OpenCV-Python Tutorials](https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_objdetect/py_face_detection/py_face_detection.html). A more in-depth explanation can be found in [Wikipedia's article on the Viola-Jones object detection framework](https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework).

The approach here is very simple:
- Open the input video, get its shape and FPS.
- Set up the output video with the same shape and FPS as the input video.
- From the input video, get the current frame's image.
- Run a full-body pre-trained Haar Cascade Classifier on that image, which should return a list of detected full bodies, each in the form of a rectangle.
- Draw the rectangles on the image.
- Display the image.
- Save the processed frames into the output video.

There are several similar tutorials with almost identical example code. I mainly followed the tutorial [Computer Vision — Detecting objects using Haar Cascade Classifier, from Towards Data Science](https://towardsdatascience.com/computer-vision-detecting-objects-using-haar-cascade-classifier-4585472829a9).

In [2]:
import numpy as np
import cv2  # or opencv-python
import time

# Create our body classifier
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_fullbody.xml'
#     cv2.data.haarcascades + 'haarcascade_upperbody.xml'
)

# Open the input video capture
#input_filename = './1080p_TownCentreXVID.mp4'
#input_filename = './720p_TownCentreXVID.mp4'
#input_filename = './480p_TownCentreXVID.mp4'
input_filename = './360p_TownCentreXVID.mp4'
# input_filename = '../videos/video_F_2.mp4'
vcap = cv2.VideoCapture(input_filename)

# Get video properties
frame_width = int(vcap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(vcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = vcap.get(cv2.CAP_PROP_FPS)
n_frames = int(vcap.get(cv2.CAP_PROP_FRAME_COUNT))

print("Frame width:", frame_width)
print("Frame height:", frame_height)
print("Video fps:", fps)

# Setup the output video file
output_filename = './output.mp4'
apiPreference = cv2.CAP_FFMPEG
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
vout = cv2.VideoWriter(
    filename=output_filename,
    apiPreference=apiPreference,
    fourcc=fourcc,
    fps=fps,
    frameSize=(frame_width, frame_height),
    params=[]
)

print(f"Processing \"{input_filename}\" ({int(n_frames)} frames)...")

# Start app
window_name = "People Detecting"
cv2.startWindowThread()
cv2.namedWindow(window_name)

# Loop each frame
frame_count = 0
frames_to_process = 1000
processed_frames = np.zeros(frames_to_process, dtype=object)

green = (0, 255, 0)
red = (0, 0, 255)  # OpenCV colors are in BGR order

# start timer
start = time.time()
fps_timer = [0, cv2.getTickCount()]
while vcap.isOpened():
    # Read a frame
    ret, frame = vcap.read()
    if not ret or frame_count == frames_to_process:
        break

    # Apply the body classifier
    bodies = detector.detectMultiScale(frame, scaleFactor=1.1, minNeighbors=3)

    # Extract bounding boxes for any bodies identified
    for (x, y, w, h) in bodies:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # Compute and put FPS on frame (kept separate from the video's own `fps`)
    current_fps = cv2.getTickFrequency() / (fps_timer[1] - fps_timer[0])
    fps_timer[0] = fps_timer[1]
    fps_timer[1] = cv2.getTickCount()
    cv2.putText(frame,
        text=f"FPS: {int(current_fps)}",
        org=(frame_width - 60, frame_height - 5),
        fontFace=cv2.FONT_HERSHEY_SIMPLEX,
        fontScale=0.3,
        color=green,
        thickness=1
    )
        
    # Save frame
    processed_frames[frame_count] = frame
    frame_count += 1

    # Show in app
    cv2.imshow(window_name, frame)
    cv2.waitKey(1)

# end timer
end = time.time()
overall_elapsed_time = end - start
elapsed_time_per_frame = overall_elapsed_time / frame_count

print("Done!")
print(f"{frame_count} frames processed in {overall_elapsed_time} seconds.")
print(f"({elapsed_time_per_frame}) seconds per frame.")
print(f"({1/elapsed_time_per_frame}) frames per second.")

# Write processed frames to file (only the slots that were actually filled)
for frame in processed_frames[:frame_count]:
    vout.write(frame)

print(f"Output saved to \"{output_filename}\".")

vcap.release()
vout.release()
cv2.destroyAllWindows()

Frame width: 640
Frame height: 360
Video fps: 25.0
Processing "./360p_TownCentreXVID.mp4" (7502 frames)...
Done!
1000 frames processed in 20.32015109062195 seconds.
(0.02032015109062195) seconds per frame.
(49.2122325045858) frames per second.
Output saved to "./output.mp4".


# Detecting people with Background Subtractor

The idea here is that if the camera is static, we can take advantage of that and use a background subtractor algorithm to get a mask of the moving regions, then filter it and locate the blobs. This should give us the moving objects on a static camera.
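To illustrate the principle, here is a minimal sketch of background subtraction using an exponential running average in plain Python. This is a deliberately simpler model than the MOG2 subtractor used in this section, and the names `update_background`, `foreground_mask`, `alpha`, and `threshold` are assumptions made for the example.

```python
def update_background(background, frame, alpha):
    """Exponential running average: the background slowly follows the scene."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold):
    """255 where the frame differs from the background model, else 0."""
    return [
        [255 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Toy grayscale frames: an empty scene, then a bright object in the middle column.
empty = [[50, 50, 50], [50, 50, 50]]
with_object = [[50, 200, 50], [50, 200, 50]]

# Learn the background from a few empty frames.
background = [[float(v) for v in row] for row in empty]
for _ in range(5):
    background = update_background(background, empty, alpha=0.1)

mask = foreground_mask(background, with_object, threshold=30)
print(mask)  # [[0, 255, 0], [0, 255, 0]]
```

A larger `alpha` adapts faster to lighting changes but also absorbs slow-moving objects into the background sooner.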

The OpenCV documentation has a good article called [How to Use Background Subtraction Methods](https://docs.opencv.org/master/d1/dc5/tutorial_background_subtraction.html) covering the basics. To understand how they work and to get a performance comparison, be sure to read the article [A Comparison between Background Modelling Methods for Vehicle Segmentation in Highway Traffic Videos, by L. A. Marcomini and A. L. Cunha](https://arxiv.org/pdf/1810.02835.pdf).

The approach here is also very simple:  
- Initialize the detector as a background subtractor.
- Open the input video, get its shape and FPS.
- Set up the output video with the same shape and FPS as the input video.
- From the input video, get the current frame's image.
- Apply the detector to the image, getting a mask with the background extracted (the background is black, the non-background is white).
- Filter with Erode, Dilate and Close to get a better separation of the detections.
- Find each blob in the image and get its bounding box.
- Draw the bounding box as rectangles on the image.
- Display the image.
- Save the processed frames into the output video.
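
The Erode and Dilate filters in the list above can be sketched in plain Python to show what they do to a binary mask. This version assumes a square kernel and simply ignores pixels outside the image; the helper names are hypothetical. Eroding and then dilating (an "opening") removes isolated noise while keeping larger blobs; dilating and then eroding (a "closing") fills small holes instead.

```python
def _morph(mask, size, pick):
    """Apply `pick` (min or max) over a size x size neighborhood of each pixel."""
    h, w = len(mask), len(mask[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = pick(window)
    return out


def erode(mask, size=3):
    """A pixel stays white only if its whole neighborhood is white."""
    return _morph(mask, size, min)


def dilate(mask, size=3):
    """A pixel becomes white if any pixel in its neighborhood is white."""
    return _morph(mask, size, max)


noisy = [
    [255, 0,   0,   0,   0, 0],  # isolated noise pixel in the corner
    [  0, 0, 255, 255, 255, 0],
    [  0, 0, 255, 255, 255, 0],  # a 3x3 blob
    [  0, 0, 255, 255, 255, 0],
    [  0, 0,   0,   0,   0, 0],
]
opened = dilate(erode(noisy))
print(opened[0][0], opened[2][3])  # 0 255
```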

There are some tutorials on this, and I mainly followed [Object Tracking with Opencv and Python, from PySource](https://pysource.com/2021/01/28/object-tracking-with-opencv-and-python).

In [3]:
import numpy as np
import cv2  # or opencv-python
import time

# Create our background subtractor (MOG2)
detector = cv2.createBackgroundSubtractorMOG2(history=150, varThreshold=50)

# Open the input video capture
#input_filename = './1080p_TownCentreXVID.mp4'
#input_filename = './720p_TownCentreXVID.mp4'
#input_filename = './480p_TownCentreXVID.mp4'
input_filename = './360p_TownCentreXVID.mp4'
# input_filename = '../videos/video_F_2.mp4'
vcap = cv2.VideoCapture(input_filename)

# Get video properties
frame_width = int(vcap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(vcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = vcap.get(cv2.CAP_PROP_FPS)
n_frames = int(vcap.get(cv2.CAP_PROP_FRAME_COUNT))

print("Frame width:", frame_width)
print("Frame height:", frame_height)
print("Video fps:", fps)

# Setup the output video file
output_filename = './output.mp4'
apiPreference = cv2.CAP_FFMPEG
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
vout = cv2.VideoWriter(
    filename=output_filename,
    apiPreference=apiPreference,
    fourcc=fourcc,
    fps=fps,
    frameSize=(frame_width, frame_height),
    params=[]
)

print(f"Processing \"{input_filename}\" ({int(n_frames)} frames)...")

# Start app
window_name = "People Detecting"
cv2.startWindowThread()
cv2.namedWindow(window_name)

# Loop each frame
frame_count = 0
frames_to_process = 1000
processed_frames = np.zeros(frames_to_process, dtype=object)

green = (0, 255, 0)
red = (0, 0, 255)  # OpenCV colors are in BGR order

# start timer
start = time.time()
fps_timer = [0, cv2.getTickCount()]
while vcap.isOpened():
    # Read a frame
    ret, frame = vcap.read()
    if not ret or frame_count == frames_to_process:
        break

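    # Filter image to get people blobs
    mask = detector.apply(frame)
    # MOG2 marks shadows in gray (127 by default); count them as foreground too
    mask[mask > 125] = 255
    # Erode with a thin kernel (1 wide, 7 tall) to remove small specks
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (1, 7))
    mask = cv2.erode(mask, kernel, iterations=1)
    # Dilate and close with a larger kernel to merge the parts of each blob
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.dilate(mask, kernel, iterations=2)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel, iterations=5)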

    # Consider each blob a person
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    bodies = []
    for contour in contours:
        # Calculate area and remove small elements
        area = cv2.contourArea(contour)
        if area > 100:
            x, y, w, h = cv2.boundingRect(contour)
            bodies.append((x, y, w, h))

    # Keep the original colors only where the mask marks foreground
    frame = cv2.bitwise_and(frame, frame, mask=mask)
    
    # Extract bounding boxes for any bodies identified
    for (x, y, w, h) in bodies:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        
    # Compute and put FPS on frame (kept separate from the video's own `fps`)
    current_fps = cv2.getTickFrequency() / (fps_timer[1] - fps_timer[0])
    fps_timer[0] = fps_timer[1]
    fps_timer[1] = cv2.getTickCount()
    cv2.putText(frame,
        text=f"FPS: {int(current_fps)}",
        org=(frame_width - 60, frame_height - 5),
        fontFace=cv2.FONT_HERSHEY_SIMPLEX,
        fontScale=0.3,
        color=green,
        thickness=1
    )

    # Save frame
    processed_frames[frame_count] = frame
    frame_count += 1

    # Show in app
    cv2.imshow(window_name, frame)
    cv2.waitKey(1)

# end timer
end = time.time()
overall_elapsed_time = end - start
elapsed_time_per_frame = overall_elapsed_time / frame_count

print("Done!")
print(f"{frame_count} frames processed in {overall_elapsed_time} seconds.")
print(f"({elapsed_time_per_frame}) seconds per frame.")
print(f"({1/elapsed_time_per_frame}) frames per second.")

# Write processed frames to file (only the slots that were actually filled)
for frame in processed_frames[:frame_count]:
    vout.write(frame)

print(f"Output saved to \"{output_filename}\".")

vcap.release()
vout.release()
cv2.destroyAllWindows()

Frame width: 640
Frame height: 360
Video fps: 25.0
Processing "./360p_TownCentreXVID.mp4" (7502 frames)...
Done!
1000 frames processed in 6.298488616943359 seconds.
(0.00629848861694336) seconds per frame.
(158.76824756179323) frames per second.
Output saved to "./output.mp4".


# Detecting people using YOLO

YOLO (You Only Look Once) is the idea that a CNN can process the entire image at once, instead of running a detector on a sliding window across the image or on a set of selected ROIs. It divides the image into an SxS grid and processes the entire image in a single pass.

For each cell, a fixed number of objects can be detected using something referred to as _Dimension Clusters_. These are similar to _anchor boxes_, but their dimensions are pre-fitted to give the best [IOU](https://en.wikipedia.org/wiki/Jaccard_index) with the bounding boxes found in the training dataset. Each detected object has its dimensions expressed relative to the cluster/box used for that detection, and its coordinates expressed as offsets relative to the corresponding grid cell.
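Since IOU (intersection over union) comes up repeatedly, here is a minimal sketch of how it can be computed for two axis-aligned boxes. The `(x, y, w, h)` box format with a top-left origin is an assumption chosen to match the rectangles drawn elsewhere in this notebook.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x, y, w, h) with a top-left origin."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (20, 20, 5, 5)))   # 0.0 (no overlap)
half_shifted = iou((0, 0, 10, 10), (5, 0, 10, 10))  # overlap 50, union 150
```

An IOU of 1 means a perfect match and 0 means no overlap, so it works both as a training target and as a similarity measure between detections.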

Each cell holds, for each of its bounding boxes, (1) the position, (2) the dimensions, and (3) a confidence based on IOU. Each cell also holds a list with the probability of each object class that can be identified (3 values if trained to detect person, cat, and dog).
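As a sketch of how such attributes can be read back, the function below decodes one detection row laid out as `[center_x, center_y, width, height, box_confidence, class scores...]`, with coordinates normalized to the image size. That layout is the one the code cell in this section indexes (`detection[:4]` and `detection[5:]`); the function name and the sample values are made up for illustration.

```python
def decode_detection(row, frame_width, frame_height, class_names):
    """Turn one normalized detection row into (label, score, (x, y, w, h)) in pixels."""
    center_x = row[0] * frame_width
    center_y = row[1] * frame_height
    w = row[2] * frame_width
    h = row[3] * frame_height
    scores = row[5:]  # row[4] is the box confidence; class scores follow it
    class_id = max(range(len(scores)), key=lambda i: scores[i])
    x = center_x - w / 2  # convert the box center to its top-left corner
    y = center_y - h / 2
    return class_names[class_id], scores[class_id], (int(x), int(y), int(w), int(h))


class_names = ["person", "cat", "dog"]
row = [0.5, 0.5, 0.25, 0.5, 0.9, 0.8, 0.1, 0.05]  # hypothetical detection
label, score, box = decode_detection(row, 640, 360, class_names)
print(label, box)  # person (240, 90, 160, 180)
```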

Since version 2, YOLO is trained with random multi-scale pre-processing. Because the network is fully convolutional (it has no fully connected layers), the weights don't need to change when the input is resized. Multi-scale training not only improves overall precision, it also lets the network process images of different sizes, which gives a trade-off between speed (FPS) and accuracy that can be adjusted at runtime. In OpenCV this is the `size` argument of `blobFromImage()`, the function responsible for normalizing the input image for the DNN (read more [here](https://www.pyimagesearch.com/2017/11/06/deep-learning-opencvs-blobfromimage-works)).
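To show roughly what this normalization involves, here is a simplified sketch in plain Python. It covers only mean subtraction, scaling, the optional red/blue swap, and the reordering into a batch x channels x height x width layout; the resize to `size` is left out. The function `to_blob` is hypothetical and is not the OpenCV implementation.

```python
def to_blob(image_bgr, scalefactor, mean=(0.0, 0.0, 0.0), swap_rb=True):
    """image_bgr: rows of (b, g, r) pixels. Returns nested lists shaped [1][3][H][W]."""
    height, width = len(image_bgr), len(image_bgr[0])
    # With swap_rb the output channels are R, G, B instead of B, G, R.
    channel_order = (2, 1, 0) if swap_rb else (0, 1, 2)
    planes = []
    for out_c, src_c in enumerate(channel_order):
        plane = [
            [(image_bgr[y][x][src_c] - mean[out_c]) * scalefactor for x in range(width)]
            for y in range(height)
        ]
        planes.append(plane)
    return [planes]  # a batch containing one image


# Toy 1x2 image with 8-bit BGR pixels.
image = [[(255, 0, 51), (0, 128, 255)]]
blob = to_blob(image, scalefactor=1 / 255)
print(len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))  # 1 3 1 2
```

With `scalefactor=1/255` and a zero mean, every value ends up between 0 and 1, which is the range the YOLO configuration in this notebook uses.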

The image is then processed by a single convolutional network, which outputs a list of bounding boxes relative to the image size.
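That list usually contains several overlapping boxes for the same object, which is why the code cell in this section passes it through `cv2.dnn.NMSBoxes`. Below is a minimal sketch of greedy non-maximum suppression in plain Python to show the idea. It is an illustration under simple assumptions (boxes as `(x, y, w, h)`, strict score comparison), not OpenCV's implementation, and the sample boxes are made up.

```python
def _iou(a, b):
    """IOU of two (x, y, w, h) boxes."""
    inter_w = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold, iou_threshold):
    """Return indices of the boxes to keep, best score first."""
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(_iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(100, 100, 50, 100), (104, 98, 50, 100), (300, 120, 40, 90), (10, 10, 20, 20)]
scores = [0.9, 0.75, 0.6, 0.2]
kept = non_max_suppression(boxes, scores, score_threshold=0.3, iou_threshold=0.4)
print(kept)  # [0, 2]
```

Here the second box is dropped because it overlaps the first one heavily, and the last box is dropped because its score is below the threshold.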

Since YOLO is open source, there are currently several releases, some of the newer ones from different authors, which makes the lineage confusing. Fortunately, [Towards Data Science](https://towardsdatascience.com/yolo-v4-or-yolo-v5-or-pp-yolo-dad8e40f7109) published a good overview of versions 1 to 5. Here we'll be experimenting with Darknet's YOLOv3 and AlexeyAB's YOLOv4.

The approach here is also very simple:  
- Initialize the detector by loading the YOLO weights, configuration, and class names into OpenCV's DNN module.
- Open the input video, get its shape and FPS.
- Set up the output video with the same shape and FPS as the input video.
- From the input video, get the current frame's image.
- Convert the image into a blob with `blobFromImage()` and run it through the network.
- Keep the detections whose confidence is above a threshold and convert them to bounding boxes in pixels.
- Remove overlapping detections with non-maximum suppression.
- Draw the bounding boxes as rectangles on the image, labeled with class and confidence.
- Display the image.
- Save the processed frames into the output video.

Refs:
 - [YOLOv1](https://pjreddie.com/media/files/papers/yolo_1.pdf)
 - [YOLOv2/YOLO9000](https://pjreddie.com/media/files/papers/YOLO9000.pdf)
 - [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf)
 - https://learnopencv.com/deep-learning-based-object-detection-using-yolov3-with-opencv-python-c
 - https://pysource.com/2019/06/27/yolo-object-detection-using-opencv-with-python
 - https://www.pyimagesearch.com/2017/11/06/deep-learning-opencvs-blobfromimage-works
 - https://opencv-tutorial.readthedocs.io/en/latest/yolo/yolo.html
 - https://blog.roboflow.com/yolov5-improvements-and-evaluation/

Downloads:
 - [Darknet models](https://pjreddie.com/darknet/imagenet/#pretrained)
  - [names-file](https://raw.githubusercontent.com/pjreddie/darknet/master/data/coco.names)
  - YOLOv3:
    - [cfg-file](https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg)
    - [weights-file](https://pjreddie.com/media/files/yolov3.weights)
  - YOLOv3-tiny:
    - [cfg-file](https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3-tiny.cfg)
    - [weights-file](https://pjreddie.com/media/files/yolov3-tiny.weights)
 - https://github.com/kiyoshiiriemon/yolov4_darknet
  - [names-file](https://raw.githubusercontent.com/AlexeyAB/darknet/darknet_yolo_v4_pre/cfg/coco.names)
  - YOLOv4:
    - [cfg-file](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.cfg)
    - [weights-file](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights)
  - YOLOv4-tiny:
    - [cfg-file](https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-tiny.cfg)
    - [weights-file](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.weights)

In [16]:
import numpy as np
import cv2  # or opencv-python
import time

# Define YOLO files to load
# weights_file, cfg_file, names_file = "pretrained_yolo/v3/yolov3-tiny.weights", "pretrained_yolo/v3/yolov3-tiny.cfg", "pretrained_yolo/v3/coco.names"
weights_file, cfg_file, names_file = "pretrained_yolo/v4/yolov4-tiny.weights", "pretrained_yolo/v4/yolov4-tiny.cfg", "pretrained_yolo/v4/coco.names"

# YOLO and DNN Configs
# ref: https://docs.opencv.org/4.5.1/d6/d0f/group__dnn.html
mean = (0, 0, 0)  # YOLO doesn't use mean subtraction.
scale_factor = 1 / 255  # pixel values are normalized to the range 0 to 1, so for 8-bit depth it should be 1/255.
blob_size = (256, 256)  # trades precision for FPS: smaller means faster. 320x320 is a common value.
confidence_threshold = 0.3  # 0 means no threshold. 0.5 is a common value.
supression_threshold = 0.4  # 1 means no suppression. 0.4 is a common value.

dnn_target, dnn_backend = cv2.dnn.DNN_TARGET_CPU, cv2.dnn.DNN_BACKEND_DEFAULT
# dnn_target, dnn_backend = cv2.dnn.DNN_TARGET_CPU, cv2.dnn.DNN_BACKEND_OPENCV
# dnn_target, dnn_backend = cv2.dnn.DNN_TARGET_OPENCL, cv2.dnn.DNN_BACKEND_DEFAULT
# dnn_target, dnn_backend = cv2.dnn.DNN_TARGET_OPENCL, cv2.dnn.DNN_BACKEND_OPENCV
# dnn_target, dnn_backend = cv2.dnn.DNN_TARGET_CUDA, cv2.dnn.DNN_BACKEND_CUDA
# dnn_target, dnn_backend = cv2.dnn.DNN_TARGET_VULKAN, cv2.dnn.DNN_BACKEND_VKCOM

# Load YOLO
net = cv2.dnn.readNet(weights_file, cfg_file)
net.setPreferableBackend(dnn_backend)
net.setPreferableTarget(dnn_target)

classes = []
with open(names_file, "r") as f:
    classes = [line.strip() for line in f.readlines()]
    print(f"{len(classes)} classes loaded:", end=' ')
    print(classes)
layer_names = net.getLayerNames()
# output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
output_layers = net.getUnconnectedOutLayersNames()
colors = np.random.uniform(0, 255, size=(len(classes), 3))

# Open the input video capture
#input_filename = './1080p_TownCentreXVID.mp4'
#input_filename = './720p_TownCentreXVID.mp4'
#input_filename = './480p_TownCentreXVID.mp4'
input_filename = './360p_TownCentreXVID.mp4'
# input_filename = '../videos/video_F_2.mp4'
vcap = cv2.VideoCapture(input_filename)

# Get video properties
frame_width = int(vcap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(vcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Note: CAP_PROP_CHANNEL is the capture input channel number, not the number of color channels.
channels = int(vcap.get(cv2.CAP_PROP_CHANNEL))
fps = vcap.get(cv2.CAP_PROP_FPS)
n_frames = int(vcap.get(cv2.CAP_PROP_FRAME_COUNT))

print("Frame width:", frame_width)
print("Frame height:", frame_height)
print("Video channels:", channels)
print("Video fps:", fps)

# Setup the output video file
output_filename = './output.mp4'
apiPreference = cv2.CAP_FFMPEG
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
vout = cv2.VideoWriter(
    filename=output_filename,
    apiPreference=apiPreference,
    fourcc=fourcc,
    fps=fps,
    frameSize=(frame_width, frame_height),
    params=[]
)
print(f"Processing \"{input_filename}\" ({int(n_frames)} frames)...")

# Start app
window_name = "People Detecting"
cv2.startWindowThread()
cv2.namedWindow(window_name)

# Loop each frame
frame_count = 0
frames_to_process = 1000
processed_frames = np.zeros(frames_to_process, dtype=object)

green = (0, 255, 0)
red = (0, 0, 255)  # OpenCV colors are in BGR order

# start timer
start = time.time()
fps_timer = [0, cv2.getTickCount()]
while vcap.isOpened():
    # Read a frame
    ret, frame = vcap.read()
    if not ret or frame_count == frames_to_process:
        break

    # Detecting objects
    blob = cv2.dnn.blobFromImage(frame,
        scalefactor=scale_factor,
        size=blob_size,
        mean=mean,
        swapRB=True,
        crop=False,
#         ddepth=cv2.CV_32F
    )
    net.setInput(blob)
    output_blobs = net.forward(output_layers)

    # Extract bounding boxes for any object detected
    class_ids = []
    confidences = []
    boxes = []
    for output_blob in output_blobs:
        for detection in output_blob:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > confidence_threshold:
                center_x, center_y, w, h = detection[:4] * np.array(
                    [frame_width, frame_height, frame_width, frame_height])
                x = center_x - (w / 2)
                y = center_y - (h / 2)
                boxes.append([int(x), int(y), int(w), int(h)])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    # Remove coincidental detections
    indices = cv2.dnn.NMSBoxes(
        bboxes=boxes, 
        scores=confidences, 
        score_threshold=confidence_threshold, 
        nms_threshold=supression_threshold
    )

    # Show the detections on the frame
    font = cv2.FONT_HERSHEY_SIMPLEX
    for i in range(len(boxes)):
        if i in indices:
            x, y, w, h = boxes[i]
            label = classes[class_ids[i]]
            confidence = round(confidences[i]*100)
            color = colors[class_ids[i]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(
                frame,
                text=f"{label} : {confidence}%",
                org=(x, y-5),
                fontFace=font,
                fontScale=0.3,
                color=color,
                thickness=1
            )
        
    # Compute and put FPS on frame (kept separate from the video's own `fps`)
    current_fps = cv2.getTickFrequency() / (fps_timer[1] - fps_timer[0])
    fps_timer[0] = fps_timer[1]
    fps_timer[1] = cv2.getTickCount()
    cv2.putText(
        frame,
        text=f"FPS: {int(current_fps)}",
        org=(frame_width - 60, frame_height - 5),
        fontFace=font,
        fontScale=0.3,
        color=green,
        thickness=1
    )

    # Save frame
    processed_frames[frame_count] = frame
    frame_count += 1

    # Show in app
    cv2.imshow(window_name, frame)
    cv2.waitKey(1)

# end timer
end = time.time()
overall_elapsed_time = end - start
elapsed_time_per_frame = overall_elapsed_time / frame_count

print("Done!")
print(f"{frame_count} frames processed in {overall_elapsed_time} seconds.")
print(f"({elapsed_time_per_frame}) seconds per frame.")
print(f"({1/elapsed_time_per_frame}) frames per second.")

# Write processed frames to file (only the slots that were actually filled)
for frame in processed_frames[:frame_count]:
    vout.write(frame)

print(f"Output saved to \"{output_filename}\".")

vcap.release()
vout.release()
cv2.destroyAllWindows()

80 classes loaded: ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

Next:
- Scaled-YOLOv4?
- Training with custom dataset?
- YOLOv5 with PyTorch?
- Semantic Segmentation?