# Detection and Room Positioning Algorithm

The goal of the detection process is to determine where in the room museum visitors are located. The algorithm can be described as follows:

1. Each video frame is put through the YOLO neural network, which detects human bodies and heads, providing the corresponding bounding boxes.

2. If the full human body is visible, the pixel at the center of the bottom edge of the body bounding box is used as the pixel corresponding to the point on the floor.

3. If the full human body is not visible and only the head is detected, the body height is estimated from the head height using a head-to-body ratio derived from the training data. According to the [MRI-based anatomical model of the human head](https://pmc.ncbi.nlm.nih.gov/articles/PMC2828153/#:~:text=The%20general%20shape%20of%20the,ethnicity%2C%20sex%2C%20and%20age.), head size is generally proportional to body height. Based on this principle, a coefficient specific to our dataset was calculated (see `head_to_body_coef.ipynb`). The estimated height is then projected downward from the center of the top edge of the head bounding box, and the resulting pixel is used as the point on the floor.

4. To match body and head bounding boxes, the fraction of each head box that lies inside the top 20% of a body box is calculated, and the head with the largest fraction is assigned to that body.

5. The homography matrix is used to transform the resultant pixel coordinates into the actual positions of people in the room.
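Steps 2 and 3 can be illustrated with a minimal sketch. The helper names `floor_pixel_from_body` and `floor_pixel_from_head` and the example boxes are hypothetical; the coefficient is the `HEAD_TO_BODY_COEF` value used in the detection cell of this notebook. Boxes are assumed to be `(x1, y1, x2, y2)` in pixels, with the y axis pointing down.

```python
# Value taken from the detection cell of this notebook
HEAD_TO_BODY_COEF = 5.9437


def floor_pixel_from_body(box):
    """Return the center of the bottom edge of a body box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (int((x1 + x2) / 2), int(y2))


def floor_pixel_from_head(box, coef=HEAD_TO_BODY_COEF):
    """Project the estimated body height downward from the top edge of a head box."""
    x1, y1, x2, y2 = box
    estimated_height = (y2 - y1) * coef
    return (int((x1 + x2) / 2), int(y1 + estimated_height))


# Made-up boxes, for illustration only
body_box = (100, 50, 200, 450)
head_box = (130, 50, 170, 110)

print(floor_pixel_from_body(body_box))  # (150, 450)
print(floor_pixel_from_head(head_box))  # (150, 406)
```

In this example the head is 60 pixels tall, so the estimated body height is about 357 pixels and the floor point lands at y = 406.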

This notebook is designed to run locally.

In [None]:
# Install the library with YOLO models
%pip install ultralytics

# Install the computer vision library
%pip install opencv-python

# Homography

To determine positions within the room, a homography matrix is calculated. Because access to the museum was not available, a simplified example of the homography matrix calculation is presented here for demonstration and experimentation purposes.

The measurements were taken in the living room of my house. Empty water bottles were placed across the room to serve as reference points. The `room_coords` list contains the physical positions of these bottles in the room (in centimeters), while the `pixel_coords` list contains their corresponding pixel coordinates in the image.
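As a rough illustration of what applying a homography means, the sketch below maps one pixel to plane coordinates by hand: the pixel is written in homogeneous form `(u, v, 1)`, multiplied by the 3x3 matrix, and divided by the third component. The matrix `H_example` and the helper `apply_homography` are made up for this example and are not the matrix computed from the bottle measurements; `cv2.perspectiveTransform` performs the same mapping.

```python
def apply_homography(H, u, v):
    """Map pixel (u, v) to plane coordinates using a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    # Divide by w to return from homogeneous to ordinary coordinates
    return x / w, y / w


# Illustrative matrix only; a real one comes from cv2.findHomography
H_example = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 0.25, 1.0],
]

room_x, room_y = apply_homography(H_example, 20, 4)
print(room_x, room_y)  # 25.0 14.0
```

The non-zero entry in the last row is what makes the mapping perspective-aware: pixels further down the image are divided by a different `w` than pixels near the top.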

For validation, my actual position in the room, measured as `319x403`, was compared to the position predicted with the homography matrix, which was `316x396`. The prediction is off by 3 cm and 7 cm along the two axes, which indicates good accuracy in position estimation.
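The validation error can also be expressed as a single distance. The short calculation below uses only the two positions quoted above and assumes both are in centimeters, like `room_coords`; the variable names are chosen for this example.

```python
import math

measured = (319, 403)   # Measured position in the room
predicted = (316, 396)  # Position predicted by the homography

dx = predicted[0] - measured[0]
dy = predicted[1] - measured[1]

# Straight-line distance between the measured and predicted positions
error_cm = math.hypot(dx, dy)
print(f'Position error: {error_cm:.1f} cm')  # Position error: 7.6 cm
```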

<img alt="Detection Example" src="./data/images/me_in_the_room.jpg" width="400" hspace="0"/>

In [None]:
import numpy as np
import cv2

# Pixel coordinates for each bottle
pixel_coords = np.array([
    [413, 1872],
    [1215, 2270],
    [2696, 2929],
    [1154, 1907],
    [2218, 2216],
    [3237, 2580],
    [1705, 1684],
    [2222, 1900],
    [3243, 2023],
    [2254, 1653],
], dtype=np.float32)

# Room positions for each bottle
room_coords = np.array([
    [24, 240],
    [205, 238],
    [386, 238],
    [124, 298],
    [297, 309],
    [412, 321],
    [135, 403],
    [251, 380],
    [381, 437],
    [202, 473],
], dtype=np.float32)

# Compute the homography matrix H
H, _ = cv2.findHomography(pixel_coords, room_coords)

# Validate the resultant matrix on my position in the room
# cv2.perspectiveTransform expects an array of shape (1, N, 2)
val_pixel_coord = np.array([[[2673, 1999]]], dtype=np.float32)

predicted_room_position = cv2.perspectiveTransform(val_pixel_coord, H)
x, y = predicted_room_position[0][0]
print(f'Predicted room position: {round(x)}x{round(y)}')

# Detection

The following code brings all components together. Human bodies and heads are detected using the YOLO model. Height lines are drawn based on full-body bounding box predictions and estimated body heights derived from the head-to-body coefficient. Pixel coordinates of points on the floor are calculated and transformed into actual room positions using the homography matrix. Finally, the video outputs are generated, showcasing the results.

---

Detection Examples  
<img title="Detection Example 1" src="./data/images/detection_example_1.png" width="400" hspace="0"/>
<img title="Detection Example 2" src="./data/images/detection_example_2.png" width="400" hspace="0"/>  
Position Examples  
<img title="Position Example 1" src="./data/images/position_example_1.png" width="400" hspace="0"/>
<img title="Position Example 2" src="./data/images/position_example_2.png" width="400" hspace="0"/>  

---
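The head-to-body matching used in the cell below can be tried on its own. This is a minimal sketch with made-up boxes: `match_head` is a hypothetical helper that wraps the same idea (overlap between a head box and the top 20% of a body box), and it starts the best fraction at 0.0 so that a head with no overlap is never matched.

```python
def get_fraction_of_intersection(region, head):
    """Return the fraction of the head box area that lies inside the region."""
    x1 = max(region[0], head[0])
    y1 = max(region[1], head[1])
    x2 = min(region[2], head[2])
    y2 = min(region[3], head[3])

    intersection_area = max(0, x2 - x1) * max(0, y2 - y1)
    head_area = (head[2] - head[0]) * (head[3] - head[1])
    if head_area <= 0:
        return 0.0
    return intersection_area / head_area


def match_head(body, heads, top_fraction=0.2):
    """Hypothetical helper: return (index, fraction) of the best head, or (-1, 0.0)."""
    x1, y1, x2, y2 = body
    upper_body = (x1, y1, x2, y1 + (y2 - y1) * top_fraction)

    best_idx, best_fraction = -1, 0.0
    for idx, head in enumerate(heads):
        fraction = get_fraction_of_intersection(upper_body, head)
        if fraction > best_fraction:
            best_idx, best_fraction = idx, fraction
    return best_idx, best_fraction


# Made-up boxes in (x1, y1, x2, y2) pixel format
body = (100, 50, 200, 450)
heads = [
    (300, 50, 340, 110),  # Far to the right: no overlap
    (190, 60, 230, 120),  # Partly inside the upper body region
    (130, 50, 170, 110),  # Fully inside the upper body region
]

print(match_head(body, heads))      # (2, 1.0)
print(match_head(body, heads[:1]))  # (-1, 0.0)
```

Dividing by the head area rather than the union keeps the score at 1.0 whenever the head lies entirely inside the upper body region, regardless of how large the body box is.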

In [None]:
import cv2
import numpy as np
from ultralytics import YOLO

HEAD_CLASS_ID = 0
HEAD_TO_BODY_COEF = 5.9437 # Refer to head_to_body_coef.ipynb

# Load the trained YOLO model
model = YOLO('./data/best.pt')

# Replace with the desired paths
input_video_path = '/path/to/input/video.mp4'
detection_output_path = '/path/to/output/detection_video.mp4'
position_output_path = '/path/to/output/position_video.mp4'

def detect():
    print('Starting...')

    # Capture the input video and its properties
    video = cv2.VideoCapture(input_video_path)
    frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = video.get(cv2.CAP_PROP_FPS)  # Keep fractional frame rates such as 29.97

    # Create VideoWriter objects to save the output videos
    detection_out = cv2.VideoWriter(detection_output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))
    position_out = cv2.VideoWriter(position_output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))

    # Process each frame
    current_frame_count = 0
    # Report progress roughly every 1%; at least 1 to avoid a modulo by zero on short videos
    print_after_each_ith_frame = max(1, frame_count // 100)
    while video.isOpened():
        has_frame, detection_frame = video.read()
        if not has_frame:
            break
        
        # Create a fully white frame to output room positions
        position_frame = np.ones((frame_height, frame_width, 3), dtype=np.uint8) * 255

        # Print the processing percentages
        current_frame_count += 1
        if current_frame_count % print_after_each_ith_frame == 0:
            print(f'Processed {((current_frame_count / frame_count) * 100):.0f}%')

        # Run object detection with the YOLO model
        results = model.predict(detection_frame, conf=0.7, save=False, verbose=False)

        heads = []
        bodies = []
        pixel_coords = []

        # Loop through each detected object in the frame
        for result in results:
            for box in result.boxes:
                # Move tensors to the CPU first so this also works when the model runs on a GPU
                x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
                class_id = int(box.cls[0].item())

                if class_id == HEAD_CLASS_ID:
                    heads.append((x1, y1, x2, y2))
                else:
                    bodies.append((x1, y1, x2, y2))

        matched_head_indices = set()

        # Loop through each detected body, identify a matching head, draw the height line
        for body in bodies:
            x1, y1, x2, y2 = body

            upper_body = [x1, y1, x2, y1 + (y2 - y1) * 0.2]  # Top 20% of the body bounding box
            matched_head_idx = -1
            matched_fraction_of_intersection = 0.0  # Only heads with a non-zero overlap can match

            # Loop through each detected head, identify the best matching one
            for idx, head in enumerate(heads):
                fraction_of_intersection = get_fraction_of_intersection(upper_body, head)
                if fraction_of_intersection > matched_fraction_of_intersection:
                    matched_head_idx = idx
                    matched_fraction_of_intersection = fraction_of_intersection

            # Fix the best matching head
            if matched_head_idx != -1:
                matched_head_indices.add(matched_head_idx)
            
            height = y2 - y1
            middle_x = int((x2 + x1) / 2)
            start_y = int(y1)
            end_y = int(y1 + height)

            # Store pixel coords corresponding to the position in the room
            pixel_coords.append([middle_x, end_y])

            # Draw the height line (green)
            cv2.line(detection_frame, (middle_x, start_y), (middle_x, end_y), (100, 255, 110), 2)

        # Loop through each detected head, skip the matched ones, draw the estimated height line
        for idx, head in enumerate(heads):
            if idx in matched_head_indices:
                continue

            x1, y1, x2, y2 = head

            estimated_height = (y2 - y1) * HEAD_TO_BODY_COEF
            middle_x = int((x2 + x1) / 2)
            start_y = int(y1)
            end_y = int(y1 + estimated_height)

            # Store pixel coords
            pixel_coords.append([middle_x, end_y])

            # Draw the estimated height line (yellow; OpenCV colors are in BGR order)
            cv2.line(detection_frame, (middle_x, start_y), (middle_x, end_y), (100, 240, 255), 2)

        # Convert pixel coordinates into room positions
        for x, y in pixel_coords:
            # Shape (1, 1, 2), as expected by cv2.perspectiveTransform
            pixel_coord = np.array([[[x, y]]], dtype=np.float32)

            # --- Apply the real homography matrix for the museum ---
            # E.g., room_x, room_y = cv2.perspectiveTransform(pixel_coord, H)[0][0]
            # Until that matrix is available, the pixel coordinates are drawn unchanged.
            room_coord = (int(x), int(y))

            # Draw the circle to represent the estimated room position
            cv2.circle(position_frame, room_coord, radius=60, color=(0, 0, 0), thickness=1)

        # Write to the output videos
        detection_out.write(detection_frame)
        position_out.write(position_frame)

    print('Finishing...')

    video.release()
    detection_out.release()
    position_out.release()
    cv2.destroyAllWindows()
        

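def get_fraction_of_intersection(body, head):
    """Return the fraction of the head box area that overlaps the body box."""
    x1 = max(body[0], head[0])
    y1 = max(body[1], head[1])
    x2 = min(body[2], head[2])
    y2 = min(body[3], head[3])

    intersection_area = max(0, x2 - x1) * max(0, y2 - y1)
    head_area = (head[2] - head[0]) * (head[3] - head[1])

    # Guard against degenerate head boxes with zero area
    if head_area <= 0:
        return 0.0

    return intersection_area / head_area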

detect()