**Run me first!!** 👇👇👇👇👇👇👇👇

In [1]:
# Install required pip packages
%pip -q install ultralytics opencv-python pyyaml opencv_jupyter_ui

Note: you may need to restart the kernel to use updated packages.


## Vision inferencing with a local model

In the first exercise, we can test out a basic [computer vision](https://www.microsoft.com/en-us/research/research-area/computer-vision/?msockid=22ee1fda33f46de00ef10b8532d86c89) inferencing task using a popular AI model called [YOLOv8](https://docs.ultralytics.com/models/yolov8/). YOLO (You Only Look Once) is a real-time object detection system that works by processing static images. It divides the image into a grid and predicts bounding boxes and probabilities for each grid cell, allowing it to detect multiple objects within a single image efficiently. 
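
As a rough illustration of the grid idea described above, the sketch below maps the center of a bounding box to the grid cell that contains it. The function name `grid_cell_for_point`, the 7x7 grid, and the example box are all assumptions made for illustration. This is not how the Ultralytics library implements YOLOv8 internally.

```python
def grid_cell_for_point(x, y, image_width, image_height, grid_size):
    """Return the (row, col) of the grid cell that contains pixel (x, y)."""
    cell_width = image_width / grid_size
    cell_height = image_height / grid_size
    # Clamp so points on the right or bottom edge stay inside the grid
    col = min(int(x // cell_width), grid_size - 1)
    row = min(int(y // cell_height), grid_size - 1)
    return row, col


# Hypothetical example: a 640x448 image divided into a 7x7 grid
x1, y1, x2, y2 = 100, 50, 300, 250
center_x = (x1 + x2) / 2
center_y = (y1 + y2) / 2
print(grid_cell_for_point(center_x, center_y, 640, 448, 7))
```

In a grid-based detector, the cell containing the center of an object is the one responsible for predicting that object's box and class probabilities.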

To get started, we will initialize the model with the Ultralytics Python library, which downloads the model automatically. YOLOv8 comes in several sizes, so you can trade accuracy against speed to suit the workload. Once the model is initialized in our code, we can label the detected objects using the [COCO dataset](https://cocodataset.org/#overview) class labels. The full list of class labels is in [coco.yaml](../artifacts/coco.yaml), where you can see the types of objects the model can identify.
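
The model sizes mentioned above map onto weight file names that differ only by a one-letter suffix. The helper below is a hypothetical convenience (it is not part of the Ultralytics API) that builds the file name for a given size, as a small sketch of how the size choice could be kept in one place.

```python
# Size suffixes used in YOLOv8 detection weight file names,
# ordered from smallest and fastest to largest and most accurate.
YOLOV8_SIZES = ("n", "s", "m", "l", "x")


def yolov8_weights_name(size="n"):
    """Return the weight file name for a YOLOv8 detection model size."""
    if size not in YOLOV8_SIZES:
        raise ValueError(f"Unknown size {size!r}; expected one of {YOLOV8_SIZES}")
    return f"yolov8{size}.pt"


for size in YOLOV8_SIZES:
    print(yolov8_weights_name(size))
```

The returned string can be passed to `YOLO(...)` in place of the hard-coded `'yolov8n.pt'` used in this notebook.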

Click on the Play icon to the left of the cell below to initialize the model.

In [2]:
import cv2, yaml
import opencv_jupyter_ui as jcv2
from ultralytics import YOLO
from pprint import pprint 

model = YOLO('yolov8n.pt')  # You can use 'yolov8s.pt', 'yolov8m.pt', etc. for different model sizes

# This code loads the class names from the COCO dataset yaml file. 
def load_class_names(yaml_file):
    with open(yaml_file, 'rb') as f:
        data = yaml.safe_load(f)
    return data['names']

class_names = load_class_names('../artifacts/coco.yaml')  # Adjust the path to your coco.yaml file if needed

pprint(class_names)


{0: 'person',
 1: 'bicycle',
 2: 'car',
 3: 'motorcycle',
 4: 'airplane',
 5: 'bus',
 6: 'train',
 7: 'truck',
 8: 'boat',
 9: 'traffic light',
 10: 'fire hydrant',
 11: 'stop sign',
 12: 'parking meter',
 13: 'bench',
 14: 'bird',
 15: 'cat',
 16: 'dog',
 17: 'horse',
 18: 'sheep',
 19: 'cow',
 20: 'elephant',
 21: 'bear',
 22: 'zebra',
 23: 'giraffe',
 24: 'backpack',
 25: 'umbrella',
 26: 'handbag',
 27: 'tie',
 28: 'suitcase',
 29: 'frisbee',
 30: 'skis',
 31: 'snowboard',
 32: 'sports ball',
 33: 'kite',
 34: 'baseball bat',
 35: 'baseball glove',
 36: 'skateboard',
 37: 'surfboard',
 38: 'tennis racket',
 39: 'bottle',
 40: 'wine glass',
 41: 'cup',
 42: 'fork',
 43: 'knife',
 44: 'spoon',
 45: 'bowl',
 46: 'banana',
 47: 'apple',
 48: 'sandwich',
 49: 'orange',
 50: 'broccoli',
 51: 'carrot',
 52: 'hot dog',
 53: 'pizza',
 54: 'donut',
 55: 'cake',
 56: 'chair',
 57: 'couch',
 58: 'potted plant',
 59: 'bed',
 60: 'dining table',
 61: 'toilet',
 62: 'tv',
 63: 'laptop',
 64: 'mou

### Basic object detection on a static image

The next code block will load an image from disk using the Python [OpenCV](https://opencv.org/) library and send it to the model for basic object detection. Any detected objects will be annotated with a box drawn around them.

>**Note**: The image will appear in a popup that may be displayed behind the Visual Studio Code window.

In [3]:
# Load image
image_path = '../media/image/people_on_street.jpg'
image = cv2.imread(image_path)

# Perform basic detection
results = model(image)

# Draw bounding boxes on the image and label objects by referencing the class names
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        confidence = box.conf[0]
        label = f'{class_names[class_id]} {confidence:.2f}'
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# Display the image with the bounding boxes
jcv2.imshow('Results', image)



0: 448x640 10 persons, 2 handbags, 35.2ms
Speed: 2.5ms preprocess, 35.2ms inference, 103.7ms postprocess per image at shape (1, 3, 448, 640)


HBox(children=(Button(button_style='danger', description='Stop', style=ButtonStyle()), HBox(children=(Label(va…


VBox(children=(HTML(value='<center>Results</center>'), Canvas()), layout=Layout(border_bottom='1.5px solid', b…
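
The cell above draws each label 10 pixels above the top-left corner of its box. Because `cv2.putText` treats the given point as the bottom-left corner of the text, a box that touches the top of the image can end up with a clipped label. The sketch below shows one possible way to handle that. The helper names and the 12-pixel text height are assumptions, not part of the notebook.

```python
def build_label(class_names, class_id, confidence):
    """Format a label such as 'person 0.88' from a class id and a confidence."""
    return f"{class_names[class_id]} {confidence:.2f}"


def label_origin(x1, y1, offset=10, text_height=12):
    """Pick the text origin: above the box if there is room, otherwise inside it."""
    above = y1 - offset
    if above >= text_height:
        return x1, above
    # Not enough room above the box, so place the text just below its top edge
    return x1, y1 + text_height + offset


names = {0: "person", 26: "handbag"}  # small excerpt of the COCO class names
print(build_label(names, 0, 0.8765), label_origin(50, 100))
print(build_label(names, 26, 0.5), label_origin(50, 5))
```

The tuple returned by `label_origin` could be passed as the position argument to `cv2.putText` instead of `(x1, y1 - 10)`.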

### Object detection in a video file

With a few changes to the code, we can also use YOLO to detect objects in a video file.

To use YOLO with a video file, we need to extract individual frames from the video and then apply the YOLO model to each frame separately. This process involves reading the video file, extracting frames at a specified frame rate, performing object detection on each frame, and then potentially reassembling the processed frames back into a video format. This approach allows us to leverage YOLO's capabilities for real-time object detection in video streams.

![A diagram illustrating the video-to-frame concept](./img/video_to_frame_diagram_small.png)

Another concept to consider is the rate at which frames are extracted from the video and sent to the model for inferencing. This can be measured in frames-per-second, also known as framerate. At 30 frames per second, we will need to extract 30 individual images from the video stream every second. 

![A diagram illustrating frames-per-second](./img/fps_diagram.png)

Framerate can be adjusted as needed to balance performance against cost. In our example we will extract 3 frames per second (the `frame_skip` value in the code below), which writes a moderate number of frames to disk for the included sample video. Fewer frames also means lower resource cost when running inferencing against the video.
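
The arithmetic behind this trade-off can be sketched as follows. The function name `sampling_plan` and the 300-frame clip are hypothetical. The interval calculation mirrors the approach of dividing the source framerate by the target framerate.

```python
def sampling_plan(source_fps, target_fps, total_frames):
    """Return (frame_interval, saved_frames) when keeping every Nth frame."""
    # Keep one frame out of every `frame_interval`, never less than every frame
    frame_interval = max(1, int(source_fps / target_fps))
    # Frames 0, N, 2N, ... are kept, which is a ceiling division
    saved_frames = (total_frames + frame_interval - 1) // frame_interval
    return frame_interval, saved_frames


# Hypothetical 10-second clip recorded at 30 frames per second
interval, saved = sampling_plan(source_fps=30, target_fps=3, total_frames=300)
print(f"Keep every {interval}th frame, saving {saved} of 300 frames")
```

Under these assumptions, sampling at 3 frames per second sends one tenth as many images to the model as processing every frame.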

As a first step, let's extract frames from a sample video file. Once they are extracted, they will be visible in the project in the [video frames](../video_frames/) folder.

Run the next cell using the Play button on the left.

In [4]:
import os

video_path = 'https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4'
#video_path = 'https://download.microsoft.com/download/a0ac5d61-60b6-4037-9555-ba5acefeb0c8/people-near-shop-counter-fruit.mp4'
video_filename = os.path.splitext(os.path.basename(video_path))[0]
output_folder='../video_frames/' + video_filename
os.makedirs(output_folder, exist_ok=True)

frame_skip = 3 # Target number of frames to save per second of video
cap = cv2.VideoCapture(video_path) # Open the video file

# Get the total number of frames in the video and calculate the interval between frames to capture
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
# Use an interval of at least 1 so the modulo check below never divides by zero
frame_interval = max(1, int(cap.get(cv2.CAP_PROP_FPS) / frame_skip))
frame_count = 0
saved_frame_count = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Save the frame if it is at the specified interval
    if frame_count % frame_interval == 0:
        frame_filename = os.path.join(output_folder, f'frame_{saved_frame_count:04d}.jpg')
        cv2.imwrite(frame_filename, frame)
        saved_frame_count += 1

    frame_count += 1
    print(f"Extracting frame {frame_count} from {video_path}.")
    
cap.release()
print(f"Extracted {saved_frame_count} frames from the video.")

Extracting frame 1 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 2 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 3 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 4 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 5 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 6 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 7 from https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4.
Extracting frame 8 from https://download.

## Perform object detection on the video frames

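We can also perform the detection in real time. The next cell reads the video directly, runs the model on each frame as it is decoded, and displays the annotated frames as they are processed. The frames saved to disk in the previous step are not used here.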

In [5]:
# Load the YOLOv8 model
model = YOLO('yolov8n.pt')

# Load video
video_path = 'https://download.microsoft.com/download/caaf80b6-2394-4fbc-8430-8b41a3206c64/people-are-pushing-carts-along.mp4'
cap = cv2.VideoCapture(video_path)

delay = 1

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Perform detection
    results = model(frame)

    # Draw bounding boxes on the frame
    for result in results:
        for box in result.boxes:
            class_id = int(box.cls[0])
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            confidence = box.conf[0]
            label = f'{class_names[class_id]} {confidence:.2f}'
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)


    # Display the annotated frame. The key-press check below is commented out, so the loop runs until the video ends.
    jcv2.imshow('Detected People (press Q to exit)', frame)
    # if jcv2.waitKey(delay) & 0xFF == ord('q'): 
    #     break

cap.release()
jcv2.destroyAllWindows()


0: 384x640 6 persons, 38.8ms
Speed: 3.7ms preprocess, 38.8ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)


HBox(children=(Button(button_style='danger', description='Stop', style=ButtonStyle()), HBox(children=(Label(va…

VBox(children=(HTML(value='<center>Detected People (press Q to exit)</center>'), Canvas()), layout=Layout(bord…


0: 384x640 8 persons, 1 boat, 5.1ms
Speed: 1.3ms preprocess, 5.1ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 6 persons, 1 boat, 5.3ms
Speed: 1.3ms preprocess, 5.3ms inference, 1.1ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 5 persons, 5.4ms
Speed: 1.6ms preprocess, 5.4ms inference, 1.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 4 persons, 5.9ms
Speed: 1.7ms preprocess, 5.9ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 6 persons, 1 apple, 5.5ms
Speed: 1.4ms preprocess, 5.5ms inference, 2.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 6 persons, 2 apples, 4.8ms
Speed: 1.3ms preprocess, 4.8ms inference, 1.1ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 5 persons, 2 apples, 7.8ms
Speed: 1.6ms preprocess, 7.8ms inference, 1.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 6 persons, 9.3ms
Speed: 1.9ms preprocess, 9.3ms inference, 1.5ms pos
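
Whether detection keeps up with playback depends on how the per-frame processing time compares with the time available for each frame. The sketch below makes that comparison using timings copied from the log lines above. The 30 frames-per-second playback rate and the helper names are assumptions, and the timings exclude video decoding and display.

```python
def frame_budget_ms(playback_fps):
    """Time available per frame, in milliseconds, at the given playback rate."""
    return 1000.0 / playback_fps


def keeps_up(preprocess_ms, inference_ms, postprocess_ms, playback_fps):
    """Return (total_ms, True if the total fits within the per-frame budget)."""
    total_ms = preprocess_ms + inference_ms + postprocess_ms
    return total_ms, total_ms <= frame_budget_ms(playback_fps)


# First frame in the log (includes model warm-up) and a later frame
for timings in [(3.7, 38.8, 1.3), (1.3, 5.1, 1.2)]:
    total, ok = keeps_up(*timings, playback_fps=30)
    print(f"{total:.1f} ms per frame, within budget: {ok}")
```

With these numbers the first frame exceeds the budget of roughly 33 ms, while later frames fit comfortably within it.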

### **Other object detection models**

YOLOv8 (You Only Look Once version 8) is a popular computer vision model known for its speed and accuracy in real-time object detection. It is designed to detect multiple objects within an image or video frame in a single pass, making it highly efficient for applications requiring quick and precise object identification. However, YOLOv8 is just one of many object detection models available. Other notable models include [Faster R-CNN](https://arxiv.org/abs/1506.01497), which provides high accuracy by using region proposal networks, and [SSD (Single Shot MultiBox Detector)](https://arxiv.org/abs/1512.02325), which balances speed and accuracy by detecting objects in a single shot without requiring a region proposal stage.

### **Other Vision Inferencing Tasks**

Beyond object detection, computer vision encompasses various other inferencing tasks such as image classification, semantic segmentation, and instance segmentation. Image classification involves categorizing an entire image into a predefined class, using models like [ResNet](https://arxiv.org/abs/1512.03385) and [Inception](https://arxiv.org/abs/1512.00567). Semantic segmentation assigns a class label to each pixel in an image, enabling detailed scene understanding, with models like [U-Net](https://arxiv.org/abs/1505.04597) and [DeepLab](https://arxiv.org/abs/1606.00915) excelling in this area. Instance segmentation combines object detection and semantic segmentation to identify and segment each object instance within an image, with models like [Mask R-CNN](https://arxiv.org/abs/1703.06870) being widely used for this purpose. These diverse inferencing tasks enable a broad range of applications, from medical imaging to autonomous driving.
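
The difference between image classification and semantic segmentation shows up in the shape of the output: one label for the whole image versus one label per pixel. The toy sketch below uses made-up scores and plain Python lists instead of a real model, so every value in it is an assumption chosen for illustration.

```python
def argmax(scores):
    """Return the index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classify_image(class_scores):
    """Image classification: one class label for the whole image."""
    return argmax(class_scores)


def segment_image(pixel_scores):
    """Semantic segmentation: one class label for every pixel."""
    return [[argmax(pixel) for pixel in row] for row in pixel_scores]


# Made-up scores for a 2x2 image with two classes (0 = background, 1 = object)
pixel_scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
]
print(classify_image([0.1, 0.7, 0.2]))
print(segment_image(pixel_scores))
```

Instance segmentation goes one step further by also separating the pixels of individual objects that share the same class.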

For more information, you can explore the following resources:
- [YOLO: You Only Look Once](https://pjreddie.com/darknet/yolo/)
- [Faster R-CNN](https://arxiv.org/abs/1506.01497)
- [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325)
- [ResNet: Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
- [Inception: Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/abs/1512.00567)
- [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
- [DeepLab: Semantic Image Segmentation with Deep Convolutional Nets](https://arxiv.org/abs/1606.00915)
- [Mask R-CNN](https://arxiv.org/abs/1703.06870)

## Complete Lab
Run the following cell to complete this lab.

In [None]:
%store -r userId
import requests

response = requests.post(
    "https://jsleaderboard001-cnece0effvapgbft.westus2-01.azurewebsites.net/complete_task",
    headers={"Content-Type": "application/json"},
    json={"user_id": userId, "task_id": 2},
)
print(response.json())

### Continue

[Notebook 3 - Counting objects](./3-CountingObjects.ipynb)