<a href="https://colab.research.google.com/github/donbcolab/composable_vlms/blob/main/notebooks/landingai_shark_video_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Landing AI - shark video detection

Can you detect any surfboards or sharks in the video, draw a green line between each shark and the nearest surfboard, and add the distance between them in meters, assuming 30 pixels is 1 meter? Make the line red if the shark is within 10 meters of a surfboard. Sample the video at 1 frame per second and save the output video as output.mp4.

## Creating instructions

Extract frames from the video '/home/user/tmp2dwnlz5z.mp4' at 1 frame per second using the 'extract_frames' tool.
For each extracted frame, use the 'owl_v2' tool with the prompt 'surfboard, shark' to detect surfboards and sharks in the frame.
For each frame, if both surfboards and sharks are detected, calculate the distance between each shark and the nearest surfboard using the 'closest_box_distance' tool. Assume 30 pixels is 1 meter.
Draw a green line between each shark and the nearest surfboard, and add the distance in meters. If the distance is within 10 meters, make the line red.
Save the modified frames as a video named 'output.mp4' using the 'save_video' tool.
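
The unit conversion and color rule in these steps can be sketched in a few lines. This is a minimal illustration rather than the agent's generated code: the helper names `pixels_to_meters` and `line_color` are hypothetical, and the colors are written as RGB tuples on the assumption that frames use RGB channel order.

```python
PIXELS_PER_METER = 30           # from the task: 30 pixels is 1 meter
DISTANCE_THRESHOLD_METERS = 10  # from the task: red within 10 meters

GREEN = (0, 255, 0)  # RGB channel order is an assumption
RED = (255, 0, 0)


def pixels_to_meters(distance_px: float) -> float:
    """Convert a pixel distance to meters using the fixed task scale."""
    return distance_px / PIXELS_PER_METER


def line_color(distance_px: float) -> tuple:
    """Return red when the distance is within the threshold, green otherwise."""
    if pixels_to_meters(distance_px) <= DISTANCE_THRESHOLD_METERS:
        return RED
    return GREEN


print(pixels_to_meters(150))  # 5.0
print(line_color(150))        # (255, 0, 0)
print(line_color(450))        # (0, 255, 0)
```

With this scale, the 10 meter threshold corresponds to 300 pixels, so the comparison can be done in either unit.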

## Retrieving tools

'extract_frames' extracts frames from a video, which can be a file path or a YouTube link, and returns a list of tuples (frame, timestamp), where timestamp is the relative time in seconds at which the frame was captured. Each frame is a NumPy array.
'save_video' is a utility function that saves a list of frames as an MP4 video file on disk.
'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names, and associated probability scores.
'ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom-right.
'closest_box_distance' calculates the closest distance between two bounding boxes.
'closest_mask_distance' calculates the closest distance between two masks.
'save_image' is a utility function that saves an image to a file path.
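
To make the coordinate handling concrete, here is a minimal sketch of converting a normalized box to pixels and measuring the gap between two axis-aligned boxes. It assumes boxes are given as `[x_min, y_min, x_max, y_max]`; the helper names `to_pixels` and `box_gap` are hypothetical, and `box_gap` illustrates one common definition of closest distance, not necessarily what `closest_box_distance` does internally.

```python
import math


def to_pixels(bbox, width, height):
    """Scale a normalized [x_min, y_min, x_max, y_max] box to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height)]


def box_gap(box_a, box_b):
    """Shortest distance between two axis-aligned boxes (0 if they overlap)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Horizontal and vertical gaps; each is 0 when the boxes overlap on that axis.
    dx = max(bx1 - ax2, ax1 - bx2, 0)
    dy = max(by1 - ay2, ay1 - by2, 0)
    return math.hypot(dx, dy)


# Made-up normalized boxes on a 1920x1080 frame, for illustration only
shark = to_pixels([0.125, 0.25, 0.25, 0.5], width=1920, height=1080)
board = to_pixels([0.375, 0.25, 0.5, 0.5], width=1920, height=1080)
print(shark)                  # [240, 270, 480, 540]
print(box_gap(shark, board))  # 240.0
```

At 30 pixels per meter, the 240 pixel gap in this made-up example would correspond to 8 meters.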

In [1]:
!pip install -qU vision_agent


In [2]:
import os
os.environ['XDG_RUNTIME_DIR'] = '/tmp/runtime-dir'

In [3]:
from vision_agent.tools import extract_frames, owl_v2, closest_box_distance, save_video
import cv2
import numpy as np

def detect_and_draw(video_uri: str, output_video_path: str, debug: bool = False):
    # Constants
    FPS = 1
    PIXELS_PER_METER = 30
    DISTANCE_THRESHOLD_METERS = 10
    DISTANCE_THRESHOLD_PIXELS = DISTANCE_THRESHOLD_METERS * PIXELS_PER_METER

    # Extract frames from the video
    frames = extract_frames(video_uri, fps=FPS)

    # List to store modified frames
    modified_frames = []

    for frame, timestamp in frames:
        if debug:
            print(f"Processing frame at {timestamp} seconds")

        # Detect surfboards and sharks
        detections = owl_v2("surfboard, shark", frame)

        # Separate detections into surfboards and sharks
        surfboards = [d for d in detections if d['label'] == 'surfboard']
        sharks = [d for d in detections if d['label'] == 'shark']

        if debug:
            print(f"Detected {len(surfboards)} surfboards and {len(sharks)} sharks")

        # Convert normalized coordinates to pixel coordinates
        height, width, _ = frame.shape
        for detection in detections:
            bbox = detection['bbox']
            detection['bbox'] = [
                int(bbox[0] * width), int(bbox[1] * height),
                int(bbox[2] * width), int(bbox[3] * height)
            ]

        # Draw lines between sharks and the nearest surfboards
        for shark in sharks:
            shark_bbox = shark['bbox']
            nearest_surfboard = None
            min_distance = float('inf')

            for surfboard in surfboards:
                surfboard_bbox = surfboard['bbox']
                distance = closest_box_distance(shark_bbox, surfboard_bbox, (height, width))

                if distance < min_distance:
                    min_distance = distance
                    nearest_surfboard = surfboard

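            if nearest_surfboard is not None:
                # Red when the shark is within the threshold, green otherwise
                color = (0, 255, 0) if min_distance > DISTANCE_THRESHOLD_PIXELS else (255, 0, 0)
                shark_center = ((shark_bbox[0] + shark_bbox[2]) // 2, (shark_bbox[1] + shark_bbox[3]) // 2)
                surfboard_center = ((nearest_surfboard['bbox'][0] + nearest_surfboard['bbox'][2]) // 2,
                                    (nearest_surfboard['bbox'][1] + nearest_surfboard['bbox'][3]) // 2)

                cv2.line(frame, shark_center, surfboard_center, color, 2)

                # Label the line at its midpoint with the distance in meters
                distance_meters = min_distance / PIXELS_PER_METER
                midpoint = ((shark_center[0] + surfboard_center[0]) // 2,
                            (shark_center[1] + surfboard_center[1]) // 2)
                cv2.putText(frame, f"{distance_meters:.1f} m", midpoint,
                            cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)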

                if debug:
                    print(f"Drew line from shark at {shark_center} to surfboard at {surfboard_center} with color {color}")

        # Append the modified frame to the list
        modified_frames.append(frame)

    # Save the modified frames as a video
    save_video(modified_frames, output_video_path, fps=FPS)


In [6]:
def test_detect_and_draw():
    # Define the input video path and output video path
    input_video_path = "/content/tmp2dwnlz5z.mp4"
    output_video_path = "/content/output.mp4"

    input_video_url = "https://github.com/donbcolab/composable_vlms/raw/main/videos/shark3_short.mp4"

    # Download the input video
    !wget -O $input_video_path $input_video_url

    # Call the function with the provided video and output path
    detect_and_draw(input_video_path, output_video_path, debug=True)

    # Print the output video path to verify the function ran successfully
    print(f"Output video saved at: {output_video_path}")

    # Return the output video path for further verification if needed
    return output_video_path
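
The test function above reports the output path but does not check the file itself. As a minimal sketch, a check along these lines could confirm that something was actually written. The helper name `assert_video_written` is hypothetical, and it only verifies that the file exists and is non-empty, not that its contents are a valid video.

```python
import os


def assert_video_written(path: str) -> int:
    """Fail unless `path` is an existing, non-empty file; return its size in bytes."""
    assert os.path.isfile(path), f"No file found at {path}"
    size = os.path.getsize(path)
    assert size > 0, f"File at {path} is empty"
    return size
```

It could be called as `assert_video_written(output_video_path)` just before the function returns.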

In [7]:
# Run the test function
test_detect_and_draw()

--2024-07-18 21:45:53--  https://github.com/donbcolab/composable_vlms/raw/main/videos/shark3_short.mp4
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/donbcolab/composable_vlms/main/videos/shark3_short.mp4 [following]
--2024-07-18 21:45:53--  https://raw.githubusercontent.com/donbcolab/composable_vlms/main/videos/shark3_short.mp4
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2226774 (2.1M) [application/octet-stream]
Saving to: ‘/content/tmp2dwnlz5z.mp4’


2024-07-18 21:45:53 (30.5 MB/s) - ‘/content/tmp2dwnlz5z.mp4’ saved [2226774/2226774]



Extracting frames from clip 0-8.01: 9it [00:02,  4.00it/s]
100%|██████████| 1/1 [00:02<00:00,  2.98s/it]


Processing frame at 0.033 seconds
Detected 6 surfboards and 2 sharks
Drew line from shark at (930, 415) to surfboard at (1161, 426) with color (255, 0, 0)
Drew line from shark at (1641, 788) to surfboard at (1574, 680) with color (255, 0, 0)
Processing frame at 1.001 seconds
Detected 6 surfboards and 1 sharks
Drew line from shark at (883, 404) to surfboard at (1170, 442) with color (255, 0, 0)
Processing frame at 2.002 seconds
Detected 4 surfboards and 2 sharks
Drew line from shark at (815, 399) to surfboard at (1190, 458) with color (255, 0, 0)
Drew line from shark at (969, 664) to surfboard at (1190, 458) with color (255, 0, 0)
Processing frame at 3.003 seconds
Detected 7 surfboards and 2 sharks
Drew line from shark at (767, 431) to surfboard at (950, 663) with color (255, 0, 0)
Drew line from shark at (959, 696) to surfboard at (950, 663) with color (255, 0, 0)
Processing frame at 4.004 seconds
Detected 10 surfboards and 1 sharks
Drew line from shark at (719, 415) to surfboard at (9



Moviepy - Done !
Moviepy - video ready /content/output.mp4


Output video saved at: /content/output.mp4


'/content/output.mp4'
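
As a rough sanity check on the log above, the center-to-center distances can be recomputed from the printed coordinates. This sketch uses only the complete lines of the output and the 30 pixels per meter scale from the task. The line color is decided by `closest_box_distance` between the boxes, not by the distance between their centers, so, assuming that tool returns the shortest gap between the two boxes, the values here are upper bounds on the distances that determined the colors.

```python
import math

PIXELS_PER_METER = 30  # scale given in the task

# (shark_center, surfboard_center) pairs copied from the debug output
logged_lines = [
    ((930, 415), (1161, 426)),
    ((1641, 788), (1574, 680)),
    ((883, 404), (1170, 442)),
    ((815, 399), (1190, 458)),
    ((969, 664), (1190, 458)),
    ((767, 431), (950, 663)),
    ((959, 696), (950, 663)),
]


def center_distance_m(a, b):
    """Euclidean distance between two pixel points, converted to meters."""
    return math.dist(a, b) / PIXELS_PER_METER


for shark, board in logged_lines:
    print(f"{shark} -> {board}: {center_distance_m(shark, board):.2f} m")
```

Five of the seven center distances come out under 10 meters, which is consistent with the red lines in the log. The remaining two are slightly above 10 meters between centers, so a red line there would imply that the box edges are closer together than the centers.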

In [8]:
# display output.mp4
from IPython.display import Video
Video("/content/output.mp4")