## <font color='purple'> Assignment 04 - Transfer Learning and Bounding Boxes and YOLOV8

<b>Part1: Using available pre-trained models for object detection, conduct inference on a short video (5-10 seconds) of a street scene drawing bounding boxes around detected vehicles. </b>

<b>Step 1.</b> Collect a source video. It may be necessary to divide the video into discrete image frames. </br>
<b>Step 2.</b> Conduct inference on each frame of the video, drawing bounding boxes around detected vehicles.</br>
<b>Step 3.</b> Format the results back into a video.</br>

Use PyTorch.

<font color='blue'> I have used a pretrained TorchVision model for object detection here. RetinaNet is often faster than Faster R-CNN, which makes it attractive when the input is video rather than single images. Note, however, that the model actually loaded below is Faster R-CNN (<b>fasterrcnn_resnet50_fpn</b>).

In [1]:
import cv2
import os
import torch
import torchvision
from torchvision.transforms import functional as F

In [2]:
torch.cuda.is_available()

True

<font color='blue'> As a first step, I will extract frames as images from my raw video file. The video file is <b>LA_street_test.mp4</b>, a short 7-8 second clip of a busy LA street. It shows cars, buses, traffic lights and people, so it is well suited to our use case.

In [3]:
# Function: extracts frames from the raw video and saves each one as a JPEG image
# Parameters: Path of the raw video and output folder

def extract_frames(video_path, output_folder):
    cap = cv2.VideoCapture(video_path)
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame_path = os.path.join(output_folder, f'frame_{frame_count:04d}.jpg')
        cv2.imwrite(frame_path, frame)
        frame_count += 1

    cap.release()
    cv2.destroyAllWindows()

<font color='blue'> Start the frame extraction by calling the previously created function. The extracted frames will be available in the directory named <b>LA_street_frames</b>.

In [4]:
extract_frames('LA_street_test.mp4', 'LA_street_frames')

<font color='blue'> Next, load the pretrained "fasterrcnn_resnet50_fpn" model from the torchvision object detection models.

In [5]:
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')
model.to('cuda')
model.eval()

Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
100%|██████████| 160M/160M [00:02<00:00, 78.8MB/s]


FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(

<font color='blue'> We will need to perform inference and visualization on the extracted frames. So we will need a list of label names extracted from the COCO2017 dataset. We referenced the list in text format from the following GitHub URL and converted that into a simple list object.

<b> https://github.com/amikelive/coco-labels/blob/master/coco-labels-paper.txt

<font color='blue'> <b> Note: </b> The released COCO datasets (2014 and 2017) annotate 80 object categories, but their category IDs run from 1 to 90, because the 91-category list in the original paper includes classes that were never annotated. The torchvision detection models output these original IDs, so indexing an 80-entry label list gave me index-out-of-range errors. I have therefore used the 91-entry "paper" list from the above URL to prevent issues during evaluation.

<font color='blue'> The `__background__` class is also included at index 0, because torchvision detection models reserve label 0 for the background and start object labels at 1.

In [6]:
file_path = 'coco-labels-paper.txt'
COCO_INSTANCE_CATEGORY_NAMES = ["__background__"]

with open(file_path, 'r') as file:
    for line in file:
        category_name = line.strip()
        COCO_INSTANCE_CATEGORY_NAMES.append(category_name)
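<font color='blue'> The index alignment can be checked on a small scale. The sketch below is illustrative only: it uses a hypothetical four-line excerpt in place of <b>coco-labels-paper.txt</b>, and the helper name <b>build_category_names</b> is my own. Unlike the cell above, it also skips blank lines. </font>

```python
import io


def build_category_names(lines):
    """Prepend the background class so that list index == torchvision label id."""
    names = ["__background__"]
    # Blank lines are skipped here (an assumption; the original cell keeps them).
    names.extend(line.strip() for line in lines if line.strip())
    return names


# Hypothetical stand-in for the first lines of coco-labels-paper.txt.
sample_file = io.StringIO("person\nbicycle\ncar\nmotorcycle\n")
names = build_category_names(sample_file)

# With background at index 0, label 1 is "person" and label 3 is "car".
print(names[1], names[3])
```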

<font color='blue'> Define the color list used for the bounding boxes. I use green and white for easy viewing, and build a dictionary that assigns one of the two colors to each category label.

In [7]:
# White: 255, 255, 255
# Green: 0, 255, 0
colors = [(255, 255, 255), (0, 255, 0)]

class_colors = {i: colors[i % len(colors)] for i in range(len(COCO_INSTANCE_CATEGORY_NAMES))}
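<font color='blue'> With only two colors, the modulo rule alternates by label parity: even label ids get white and odd ids get green. A minimal sketch follows; the helper <b>color_for</b> is hypothetical, and the ids assume the background class sits at index 0. OpenCV reads color tuples as BGR, but white and green are the same in either channel order. </font>

```python
colors = [(255, 255, 255), (0, 255, 0)]  # white, green


def color_for(label_id):
    """Same rule as class_colors: index into the palette by label_id modulo its size."""
    return colors[label_id % len(colors)]


# Label ids with "__background__" at index 0: person=1, car=3, bus=6, truck=8.
for name, label_id in [("person", 1), ("car", 3), ("bus", 6), ("truck", 8)]:
    print(name, color_for(label_id))
```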

<font color='blue'> The following function will be used to perform the inference on the individual frames and will help with the bounding boxes and label assignment for detected objects.

In [8]:
def process_frame(frame, model):
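    # OpenCV loads frames as BGR; the torchvision model expects RGB input
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_tensor = F.to_tensor(rgb_frame).unsqueeze(0).to('cuda')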

    with torch.no_grad():
        outputs = model(frame_tensor)

    boxes = outputs[0]['boxes'].cpu().numpy()
    scores = outputs[0]['scores'].cpu().numpy()
    labels = outputs[0]['labels'].cpu().numpy()

    for box, score, label in zip(boxes, scores, labels):
        if score > 0.5:
            x1, y1, x2, y2 = map(int, box)
            if label < len(COCO_INSTANCE_CATEGORY_NAMES):
                label_name = COCO_INSTANCE_CATEGORY_NAMES[int(label)]
                label_text = f'{label_name}: {score:.2f}'
                color = class_colors[int(label)]

                cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
                cv2.putText(frame, label_text, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
            else:
                print(f"Warning: Label {label} is out of range.")

    return frame
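<font color='blue'> The assignment asks for boxes around vehicles, while <b>process_frame</b> draws every class that passes the score threshold. A minimal sketch of a vehicle-only filter follows. The set <b>VEHICLE_CLASSES</b> and the helper <b>keep_detection</b> are assumptions of mine, as is the choice of which COCO classes count as vehicles. </font>

```python
# Assumed set of COCO class names treated as vehicles (adjust as needed).
VEHICLE_CLASSES = {"bicycle", "car", "motorcycle", "bus", "train", "truck"}


def keep_detection(label_name, score, threshold=0.5):
    """Keep a detection only if it is confident enough and is a vehicle class."""
    return score > threshold and label_name in VEHICLE_CLASSES


# Hypothetical (label_name, score) pairs standing in for one frame's detections.
detections = [("car", 0.91), ("person", 0.88), ("bus", 0.42), ("truck", 0.77)]
kept = [d for d in detections if keep_detection(*d)]
print(kept)  # [('car', 0.91), ('truck', 0.77)]
```

<font color='blue'> Inside <b>process_frame</b>, the same check could replace the plain <b>score > 0.5</b> test once <b>label_name</b> has been looked up. </font>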

<font color='blue'> This function takes the extracted frames as input, runs inference on each one by calling <b>process_frame</b> above, and finally stitches the frames together to generate the output video.

In [9]:
def generate_video_from_frames(input_folder, output_video_path, model, fps=30):
    frame_files = [f for f in os.listdir(input_folder) if f.endswith('.jpg')]
    frame_files.sort()

    first_frame = cv2.imread(os.path.join(input_folder, frame_files[0]))
    height, width, layers = first_frame.shape

    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))

    for frame_file in frame_files:
        frame_path = os.path.join(input_folder, frame_file)
        frame = cv2.imread(frame_path)
        if frame is not None:
            processed_frame = process_frame(frame, model)
            out.write(processed_frame)

    out.release()
    cv2.destroyAllWindows()
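<font color='blue'> <b>generate_video_from_frames</b> relies on <b>frame_files.sort()</b> returning the frames in playback order. That holds here because the file names are zero-padded to four digits, so lexicographic order matches numeric order for up to 10,000 frames. The sketch below shows the difference; <b>frame_index</b> is a hypothetical helper for sorting by the embedded number when names are not padded. </font>

```python
import re


def frame_index(filename):
    """Extract the integer from names such as 'frame_0042.jpg' (-1 if none)."""
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else -1


# Zero-padded names: plain string sorting already gives playback order.
padded = [f"frame_{i:04d}.jpg" for i in (10, 2, 1)]
padded_sorted = sorted(padded)

# Unpadded names: string sorting puts frame_10 before frame_2.
unpadded = ["frame_10.jpg", "frame_2.jpg", "frame_1.jpg"]
lexicographic = sorted(unpadded)
numeric = sorted(unpadded, key=frame_index)

print(padded_sorted)
print(lexicographic)
print(numeric)
```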

<font color='blue'> Start Processing for Frames

In [10]:
input_folder = 'LA_street_frames'
output_video_path = 'LA_street_output.mp4'
generate_video_from_frames(input_folder, output_video_path, model)
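<font color='blue'> The call above uses the default <b>fps=30</b>. If the source video was recorded at a different frame rate, the output will play faster or slower than the original. The source rate can be read with <b>cap.get(cv2.CAP_PROP_FPS)</b> during extraction and passed through. The arithmetic is sketched below with hypothetical numbers (a 24 fps source is an assumption, not a property of my video). </font>

```python
def playback_seconds(num_frames, fps):
    """Duration of a video that shows num_frames frames at the given rate."""
    return num_frames / fps


source_fps = 24.0   # hypothetical source frame rate
num_frames = 192    # 8 seconds of footage at 24 fps

original = playback_seconds(num_frames, source_fps)
rewritten = playback_seconds(num_frames, 30)

print(original)   # 8.0
print(rewritten)  # 6.4
```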

<font color='blue'> End Processing for Frames

### <font color='purple'> Trying out the same evaluation using a Detectron2 model.

<font color='blue'> Pre-Requisite: Install Detectron2 using the following command.

<b> pip install 'git+https://github.com/facebookresearch/detectron2.git'

In [12]:
!pip install 'git+https://github.com/facebookresearch/detectron2.git'

Collecting git+https://github.com/facebookresearch/detectron2.git
  Cloning https://github.com/facebookresearch/detectron2.git to /tmp/pip-req-build-7v280a7y
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/detectron2.git /tmp/pip-req-build-7v280a7y
  Resolved https://github.com/facebookresearch/detectron2.git to commit ebe8b45437f86395352ab13402ba45b75b4d1ddb
  Preparing metadata (setup.py) ... done
Collecting yacs>=0.1.8 (from detectron2==0.6)
  Downloading yacs-0.1.8-py3-none-any.whl.metadata (639 bytes)
Collecting fvcore<0.1.6,>=0.1.5 (from detectron2==0.6)
  Downloading fvcore-0.1.5.post20221221.tar.gz (50 kB)
  Preparing metadata (setup.py) ... done
Collecting iopath<0.1.10,>=0.1.7 (from detectron2==0.6)
  Downloading iopath-0.1.9-py3-none-any.whl.metadata (370 bytes)
Collecting omegaconf<2.

In [13]:
import numpy as np
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2 import model_zoo

<font color='Blue'> Load the pretrained Detectron2 model from the installed library. <b>cfg.MODEL.DEVICE</b> controls where the model runs. The default configuration expects a GPU and gave CUDA-related errors without one, in which case setting the device to "cpu" forces CPU execution. The cell below sets it to "cuda", since a GPU is available in this session.

In [14]:
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cuda"
predictor = DefaultPredictor(cfg)

model_final_280758.pkl: 167MB [00:01, 140MB/s]                           


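<font color='blue'> Redefine the list of COCO category names for Detectron2. Its COCO models report classes as contiguous indices 0 to 79 over the 80 annotated categories, with no background entry, whereas torchvision uses the original category IDs (1 to 90) with index 0 reserved for background. The 91-entry paper list therefore does not line up with Detectron2's indices, so the names are taken from the model's dataset metadata instead.

In [15]:
from detectron2.data import MetadataCatalog

# thing_classes lists the 80 COCO class names in Detectron2's contiguous index order
COCO_INSTANCE_CATEGORY_NAMES_DT2 = MetadataCatalog.get(cfg.DATASETS.TRAIN[0]).thing_classes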

colors_DT2 = [(255, 255, 255), (0, 255, 0)]
class_colors_DT2 = {i: colors_DT2[i % len(colors_DT2)] for i in range(len(COCO_INSTANCE_CATEGORY_NAMES_DT2))}
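<font color='blue'> A small sketch shows why a 91-entry "paper" label list and a contiguous 80-class index drift apart. It assumes Detectron2's COCO models report <b>pred_classes</b> as contiguous indices over the annotated categories. Only the first 14 paper names are listed, and the variable names are illustrative. </font>

```python
# First 14 names of the 91-entry "paper" list (position i holds COCO id i + 1).
paper_names = ["person", "bicycle", "car", "motorcycle", "airplane", "bus",
               "train", "truck", "boat", "traffic light", "fire hydrant",
               "street sign", "stop sign", "parking meter"]

# Category ids annotated in the released datasets; id 12 ("street sign") is unused.
annotated_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14]

# Contiguous list: one entry per annotated id, with the gap closed up.
contiguous_names = [paper_names[i - 1] for i in annotated_ids]

idx = 11  # a contiguous class index as a Detectron2 COCO model would report it
print(paper_names[idx], "|", contiguous_names[idx])
```

<font color='blue'> The two lists agree for indices 0 to 10, which covers the vehicle classes, and disagree from index 11 onward. </font>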

<font color='blue'> Redefine the process_frame function to use the detectron2 model and new Category label list.

In [16]:
def process_frame_detectron(frame, predictor):
    outputs = predictor(frame)
    # Move the results to the CPU so the tensors can be converted to NumPy arrays
    instances = outputs["instances"].to("cpu")

    boxes = instances.pred_boxes.tensor.numpy()
    scores = instances.scores.numpy()
    labels = instances.pred_classes.numpy()

    for box, score, label in zip(boxes, scores, labels):
        if score > 0.5:
            x1, y1, x2, y2 = map(int, box)

            if label < len(COCO_INSTANCE_CATEGORY_NAMES_DT2):
                label_name = COCO_INSTANCE_CATEGORY_NAMES_DT2[int(label)]
                label_text = f'{label_name}: {score:.2f}'
                color = class_colors_DT2[int(label)]

                cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
                cv2.putText(frame, label_text, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
            else:
                print(f"Warning: Label {label} is out of range.")

    return frame
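<font color='blue'> Both frame-processing functions draw the label at <b>(x1, y1 - 10)</b>. When a box touches the top edge of the frame, that position is above the image and the text is clipped. A minimal sketch of one way to handle this follows; <b>label_origin</b> and its default margins are hypothetical choices. </font>

```python
def label_origin(x1, y1, offset=10, min_y=15):
    """Place the text above the box, or just inside it when the box is near the top."""
    y = y1 - offset
    if y < min_y:
        # Not enough room above the box: move the label inside the box instead.
        y = y1 + min_y
    return (max(x1, 0), y)


print(label_origin(120, 200))  # (120, 190)
print(label_origin(120, 4))    # (120, 19)
```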

<font color='blue'> Redefine the Output video function to use the detectron2 model

In [17]:
def generate_video_from_frames_detectron(input_folder, output_video_path, predictor, fps=30):
    frame_files = [f for f in os.listdir(input_folder) if f.endswith('.jpg')]
    frame_files.sort()

    first_frame = cv2.imread(os.path.join(input_folder, frame_files[0]))
    height, width, layers = first_frame.shape

    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))

    for frame_file in frame_files:
        frame_path = os.path.join(input_folder, frame_file)
        frame = cv2.imread(frame_path)
        if frame is not None:
            processed_frame = process_frame_detectron(frame, predictor)
            out.write(processed_frame)

    out.release()
    cv2.destroyAllWindows()

<font color='blue'> Start processing using the Detectron2 model. We will use the same set of frames extracted earlier.

In [18]:
input_folder = 'LA_street_frames'
output_video_path = 'LA_street_output_detectron.mp4'
generate_video_from_frames_detectron(input_folder, output_video_path, predictor)



<font color='blue'> End Processing for the frames using Detectron 2 Model

#### <font color = 'green'> Part 2 in another Notebook

## <font color='purple'> Thank You