# **NOTE: FOR EXAMPLE VIDEOS, USE realTime.py; THIS NOTEBOOK IS MEANT FOR WEBCAMS**

## Object detection using PyTorch (in real-time)

This example performs object detection (classifying and locating objects) on a video stream.

From the last example, which used baseball.jpg as the input image, we learned that:
- Faster R-CNN w/ ResNet50 backbone is, by design, really good at detecting small objects in images: the baseball and even the people in the background are detected (sometimes wrongly)
- Faster R-CNN w/ MobileNet backbone performs really well at catching the most prominent objects in the foreground (but gets confused when it comes to detecting the baseball and glove)
- RetinaNet detects both the bigger and smaller objects in the foreground, plus some of the blurry background objects (with lower confidence)

Much of the previous implementation can be reused for the real-time detection use case. Let's get started:

### Environment configuration

In [1]:
pip install numpy==1.24.1

Note: you may need to restart the kernel to use updated packages.





In [2]:
pip install torch==2.0.0 torchvision

Note: you may need to restart the kernel to use updated packages.





In [3]:
pip install opencv-contrib-python

Note: you may need to restart the kernel to use updated packages.





In [4]:
pip install imutils

Note: you may need to restart the kernel to use updated packages.


### Importing packages

In [1]:
from torchvision.models import detection
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import pickle
import torch
import time
import cv2

Two notable additions:

- VideoStream: Accesses our webcam
- FPS: Measures the approximate frames-per-second throughput of our object detection pipeline
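
To make the role of the FPS helper concrete, here is a minimal sketch of how such a counter could work. `SimpleFPS` is a hypothetical stand-in written for illustration, not the imutils implementation; it only mirrors the `start`, `update`, `stop`, `elapsed` and `fps` calls used later in this notebook.

```python
import time


class SimpleFPS:
    """Hypothetical stand-in for a frames-per-second counter."""

    def __init__(self):
        self._start = None
        self._end = None
        self._frames = 0

    def start(self):
        # Record the start time and return self so calls can be chained.
        self._start = time.perf_counter()
        return self

    def update(self):
        # Call once per processed frame.
        self._frames += 1

    def stop(self):
        self._end = time.perf_counter()

    def elapsed(self):
        # Seconds between start() and stop().
        return self._end - self._start

    def fps(self):
        # Average throughput over the whole run.
        return self._frames / self.elapsed()


counter = SimpleFPS().start()
for _ in range(5):
    time.sleep(0.01)  # placeholder for the per-frame work
    counter.update()
counter.stop()
print("elapsed: {:.2f}s, approx. FPS: {:.2f}".format(counter.elapsed(), counter.fps()))
```

The value reported is an average over the whole run, so slow start-up frames pull it down just as much as slow steady-state frames.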

### Model preparation - command line arguments

In [2]:
""" #uncomment to allow cl arguments, otherwise use default variables
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", type=str, default="frcnn-resnet",
	choices=["frcnn-resnet", "frcnn-mobilenet", "retinanet"],
	help="name of the object detection model")
ap.add_argument("-l", "--labels", type=str, default="coco_classes.pickle",
	help="path to file containing list of categories in COCO dataset")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
args = vars(ap.parse_args())
"""

defaultVideo = "videos/drone.mp4"
defaultModel = "retinanet"
defaultLabels = "resources/coco_classes.pickle"
defaultConfidence = 0.5

**Note: edit the defaults above for easier configuration without command-line arguments (in a Jupyter Notebook, for example)**

We have a number of command line arguments here, including:

- model: The type of PyTorch object detector we’ll be using (Faster R-CNN + ResNet, Faster R-CNN + MobileNet, or RetinaNet + ResNet)
- labels: The path to the COCO labels file, containing human readable class labels
- confidence: Minimum predicted probability to filter out weak detections
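
If you would rather keep the argument parser active inside a notebook, one option is to pass an explicit argument list instead of letting argparse read `sys.argv`, which in a Jupyter kernel holds the kernel's own launch arguments. The sketch below assumes the notebook defaults shown above; `parse_options` is a hypothetical helper name, not part of the original script.

```python
import argparse


def parse_options(argv=None):
    """Parse detector options; pass a list to bypass sys.argv."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-m", "--model", type=str, default="retinanet",
                    choices=["frcnn-resnet", "frcnn-mobilenet", "retinanet"],
                    help="name of the object detection model")
    ap.add_argument("-l", "--labels", type=str,
                    default="resources/coco_classes.pickle",
                    help="path to file containing list of categories in COCO dataset")
    ap.add_argument("-c", "--confidence", type=float, default=0.5,
                    help="minimum probability to filter weak detections")
    return vars(ap.parse_args(argv))


# An empty list yields the defaults, which is convenient in a notebook.
notebook_args = parse_options([])

# The same parser still accepts command-line style overrides.
cli_like_args = parse_options(["--model", "frcnn-mobilenet", "-c", "0.7"])
print(notebook_args)
print(cli_like_args)
```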

**Note: this example originally uses webcam footage, but an input video is included in the videos folder for debugging purposes**

### Initialization

In [3]:
# set the device we will be using to run the model
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load the list of categories in the COCO dataset and then generate a
# set of bounding box colors for each class

# uncomment below for cl arguments
# CLASSES = pickle.loads(open(args["labels"], "rb").read())
CLASSES = pickle.loads(open(defaultLabels, "rb").read())
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

- **DEVICE**: Sets the device we’ll be using for inference (either CPU or GPU).
- **CLASSES**: Loads our class labels from disk.
- **COLORS**: Initializes a random color for each unique label. We’ll use these colors when drawing predicted bounding boxes and labels on our output frames.
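
As a variation on the cell above, the sketch below loads a pickled label list inside a `with` block, so the file handle is closed promptly, and draws the colors from a seeded generator, so each class keeps the same color between runs. The seed, the helper names and the three-item demo label list are assumptions made for illustration; the notebook itself uses unseeded `np.random.uniform`.

```python
import os
import pickle
import random
import tempfile


def load_classes(path):
    """Read a pickled list of class names, closing the file afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def make_colors(num_classes, seed=42):
    """Return one (B, G, R) tuple per class, reproducible for a given seed."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(0, 255) for _ in range(3))
            for _ in range(num_classes)]


# Demo with a hypothetical label file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_classes.pickle")
    with open(demo_path, "wb") as handle:
        pickle.dump(["__background__", "person", "bicycle"], handle)
    classes = load_classes(demo_path)

colors = make_colors(len(classes))
print(classes)
print(len(colors), "colors generated")
```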

### **NOTES ON REAL-TIME DETECTION**

When performing object detection in video streams, I highly recommend using a GPU: a CPU will be too slow for anything close to real-time performance.

### Models dictionary

In [4]:
# initialize a dictionary containing model name and its corresponding 
# torchvision function call
MODELS = {
	"frcnn-resnet": detection.fasterrcnn_resnet50_fpn,
	"frcnn-mobilenet": detection.fasterrcnn_mobilenet_v3_large_320_fpn,
	"retinanet": detection.retinanet_resnet50_fpn
}
""" 
# uncomment below for cl arguments
# load the model and set it to evaluation mode
model = MODELS[args["model"]](pretrained=True, progress=True,
	num_classes=len(CLASSES), pretrained_backbone=True).to(DEVICE)"""

model = MODELS[defaultModel](pretrained=True, progress=True,
	num_classes=len(CLASSES), pretrained_backbone=True).to(DEVICE)
model.eval()



RetinaNet(
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): FrozenBatchNorm2d(256, eps=0.0)


We define a **MODELS** dictionary to map the name of a given object detector to its corresponding PyTorch function.

We load the model and send it to the appropriate DEVICE. We pass in a number of key parameters, including:

- pretrained: Tells PyTorch to load the model with weights pre-trained on the COCO dataset
- progress=True: Displays a download progress bar if the model has not already been downloaded and cached
- num_classes: Total number of unique classes (here, len(CLASSES))
- pretrained_backbone: Also loads pre-trained weights for the backbone network

**Important:** We place the model in evaluation mode with model.eval(). In this mode the torchvision detectors return predictions (boxes, labels and scores) instead of training losses.
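
The name-to-constructor mapping can be sketched in isolation. In the example below, `fake_frcnn` and `fake_retinanet` are hypothetical stand-ins for the torchvision constructors, so the sketch runs without downloading any weights, and `select_builder` is an illustrative helper that turns a mistyped model name into a readable error instead of a bare `KeyError`. The value 91 is only an example argument.

```python
def select_builder(registry, name):
    """Look up a model constructor by name with a readable error on typos."""
    try:
        return registry[name]
    except KeyError:
        options = ", ".join(sorted(registry))
        raise ValueError(
            "unknown model '{}'; choose one of: {}".format(name, options)
        ) from None


# Hypothetical stand-ins for the real constructors.
def fake_frcnn(**kwargs):
    return {"arch": "frcnn", **kwargs}


def fake_retinanet(**kwargs):
    return {"arch": "retinanet", **kwargs}


DEMO_MODELS = {
    "frcnn-resnet": fake_frcnn,
    "retinanet": fake_retinanet,
}

# Same call shape as MODELS[defaultModel](...) in the cell above.
demo_model = select_builder(DEMO_MODELS, "retinanet")(num_classes=91)
print(demo_model)
```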

### Access the footage

In [5]:
# initialize the video stream, allow the camera sensor to warmup,
# and initialize the FPS counter
print("[INFO] starting video stream...")

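# webcam source (default); comment out these two lines when using a video file
vs = VideoStream(src=0).start()
time.sleep(2.0)

# video file source: uncomment to read defaultVideo instead of the webcam
# vs = cv2.VideoCapture(defaultVideo)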
fps = FPS().start()

[INFO] starting video stream...


A short sleep gives the webcam sensor time to warm up (if applicable).

A call to the start method of FPS allows us to start timing our approximate frames per second throughput rate.
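
The two video sources return frames differently: `VideoStream.read()` returns the frame itself, while `cv2.VideoCapture.read()` returns a `(grabbed, frame)` tuple. A small helper can hide that difference so the rest of the loop does not change when you switch sources. `unwrap_frame` is a hypothetical name, and the strings below are placeholders standing in for real image arrays.

```python
def unwrap_frame(read_result):
    """Return the frame from either read() convention, or None if no frame."""
    if isinstance(read_result, tuple):
        # cv2.VideoCapture style: (grabbed, frame)
        grabbed, frame = read_result
        return frame if grabbed else None
    # VideoStream style: the frame itself (or None)
    return read_result


# Placeholder values standing in for real frames.
print(unwrap_frame((True, "frame-from-file")))
print(unwrap_frame((False, None)))
print(unwrap_frame("frame-from-webcam"))
```

With this helper, the loop could call `frame = unwrap_frame(vs.read())` and keep the `if frame is None: break` check unchanged for both sources.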

### Loop over video frames and processing predictions
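
Before the full loop, it may help to look at the preprocessing steps on their own. The sketch below applies the same channel reordering, batching and scaling to a dummy RGB image using NumPy only; `preprocess_frame` and the all-white 300x400 image are assumptions for illustration. In the notebook the result is additionally wrapped in `torch.FloatTensor` and moved to DEVICE.

```python
import numpy as np


def preprocess_frame(frame_rgb):
    """Convert an H x W x 3 uint8 RGB image to a 1 x 3 x H x W float32 array in [0, 1]."""
    chw = frame_rgb.transpose((2, 0, 1))        # channels last -> channels first
    batched = np.expand_dims(chw, axis=0)       # add a batch dimension
    return batched.astype(np.float32) / 255.0   # scale pixel intensities to [0, 1]


# Dummy all-white image standing in for a resized video frame.
dummy = np.full((300, 400, 3), 255, dtype=np.uint8)
batch = preprocess_frame(dummy)
print(batch.shape, batch.dtype)
```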

In [6]:
# loop over the frames from the video stream
while True:
	# grab the frame from the threaded video stream and resize it
	# to have a maximum width of 400 pixels
	frame = vs.read()
	
	# when reading from cv2.VideoCapture, read() returns a
	# (grabbed, frame) tuple; uncomment to keep only the frame
	# frame = frame[1]
	
	# if the frame could not be grabbed, then we have reached the end
	# of the video
	if frame is None:
		break
	
	frame = imutils.resize(frame, width=400)
	orig = frame.copy()
	# convert the frame from BGR to RGB channel ordering and change
	# the frame from channels last to channels first ordering
	frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
	frame = frame.transpose((2, 0, 1))
	# add a batch dimension, scale the raw pixel intensities to the
	# range [0, 1], and convert the frame to a floating point tensor
	frame = np.expand_dims(frame, axis=0)
	frame = frame / 255.0
	frame = torch.FloatTensor(frame)
	# send the input to the device and pass it through the
	# network to get the detections and predictions; gradients are
	# not needed for inference, so they are disabled
	frame = frame.to(DEVICE)
	with torch.no_grad():
		detections = model(frame)[0]

	# loop over the detections
	for i in range(0, len(detections["boxes"])):
		# extract the confidence (i.e., probability) associated with
		# the prediction
		confidence = detections["scores"][i]
		if confidence > defaultConfidence:	
			# extract the index of the class label from the
			# detections, then compute the (x, y)-coordinates of
			# the bounding box for the object
			idx = int(detections["labels"][i])
			box = detections["boxes"][i].detach().cpu().numpy()
			(startX, startY, endX, endY) = box.astype("int")
			# draw the bounding box and label on the frame
			label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
			cv2.rectangle(orig, (startX, startY), (endX, endY),
				COLORS[idx], 2)
			y = startY - 15 if startY - 15 > 15 else startY + 15
			cv2.putText(orig, label, (startX, y),
				cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)
	# show the output frame
	cv2.imshow("Frame", orig)
	key = cv2.waitKey(1) & 0xFF
	# if the 'q' key was pressed, break from the loop
	if key == ord("q"):
		break
	# update the FPS counter
	fps.update()
# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
# do a bit of cleanup (call vs.release() instead of vs.stop() when
# vs is a cv2.VideoCapture)
cv2.destroyAllWindows()
vs.stop()

[INFO] elapsed time: 57.54
[INFO] approx. FPS: 0.23
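
The per-detection logic inside the loop can be tried out without a model. The sketch below uses plain Python lists as a stand-in for the tensors in `detections`; the sample boxes, labels and scores are made up for illustration, and `filter_detections` and `label_y` are hypothetical helper names. It keeps only detections above the confidence threshold and applies the same rule the loop uses to place the label text.

```python
def filter_detections(detections, min_confidence):
    """Return (label_index, score, [startX, startY, endX, endY]) for strong detections."""
    kept = []
    for box, label, score in zip(detections["boxes"],
                                 detections["labels"],
                                 detections["scores"]):
        if score > min_confidence:
            kept.append((int(label), float(score), [int(v) for v in box]))
    return kept


def label_y(start_y, offset=15):
    """Put the label above the box unless it would run off the top of the frame."""
    return start_y - offset if start_y - offset > offset else start_y + offset


# Made-up detections standing in for real model output.
sample = {
    "boxes": [[10.4, 40.9, 120.0, 200.2], [5.0, 8.0, 50.0, 60.0], [0.0, 0.0, 9.0, 9.0]],
    "labels": [1, 18, 3],
    "scores": [0.91, 0.55, 0.30],
}

kept = filter_detections(sample, 0.5)
for idx, score, (start_x, start_y, end_x, end_y) in kept:
    print(idx, "{:.2f}%".format(score * 100), (start_x, label_y(start_y)))
```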


The loop keeps running, and the FPS counter keeps updating, until we click on the window opened by OpenCV and press the q key. After the loop exits, we stop the FPS timer and display (1) the elapsed time and (2) the approximate frames per second throughput. In the run shown above, the pipeline reached roughly 0.23 FPS over 57.54 seconds, which is far from real-time speed.
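
To put the reported numbers in perspective, the short calculation below converts them into a per-frame latency and compares that with a frame budget. The two input values are copied from the output above; the 30 FPS target is an assumption chosen only as a familiar reference point for "real time".

```python
elapsed_seconds = 57.54  # "[INFO] elapsed time" from the run above
measured_fps = 0.23      # "[INFO] approx. FPS" from the run above
target_fps = 30          # assumed real-time target, for comparison only

frames_processed = elapsed_seconds * measured_fps  # about 13 frames
seconds_per_frame = 1.0 / measured_fps             # about 4.35 seconds
frame_budget = 1.0 / target_fps                    # about 0.033 seconds
slowdown = seconds_per_frame / frame_budget        # about 130 times too slow

print("frames processed: ~{:.0f}".format(frames_processed))
print("time per frame:   {:.2f} s".format(seconds_per_frame))
print("budget at {} FPS: {:.3f} s".format(target_fps, frame_budget))
print("slowdown factor:  ~{:.0f}x".format(slowdown))
```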