# Detection and Room Positioning Algorithm

The goal of the detection process is to determine where in the room museum visitors are located. The algorithm can be described as follows:

1. Each video frame is put through the YOLO neural network, which detects human bodies and heads, providing the corresponding bounding boxes.

2. If the full human body is visible, the pixel at the center of the bottom edge of the body bounding box is used as the pixel corresponding to the point on the floor.

3. If the full human body is not visible and only the head is detected, the body height is estimated from the head height using a head-to-body coefficient. According to the [MRI-based anatomical model of the human head](https://pmc.ncbi.nlm.nih.gov/articles/PMC2828153/#:~:text=The%20general%20shape%20of%20the,ethnicity%2C%20sex%2C%20and%20age.), head size is generally proportional to body height. Based on this principle, a coefficient specific to our dataset was calculated (see `head_to_body_coef.ipynb`). The estimated body height is then projected downward from the center of the top edge of the head bounding box, and the endpoint of that line is taken as the point on the floor.

4. To match body and head bounding boxes, the fraction of the head bounding box that overlaps the upper part (top 20%) of the body bounding box is calculated and used as the matching criterion.

5. The homography matrix is used to transform the resultant pixel coordinates into the actual positions of people in the room.
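
Steps 2 and 3 can be sketched as a pair of small helpers. This is a minimal illustration rather than the notebook's exact code: the function names `floor_point_from_body` and `floor_point_from_head` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels with the y axis pointing down, and the coefficient is the value used later in this notebook.

```python
# Dataset-specific ratio of body height to head height (value used later in this notebook)
HEAD_TO_BODY_COEF = 5.9437


def floor_point_from_body(box):
    """Return the center of the bottom edge of a full-body bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, y2)


def floor_point_from_head(box, coef=HEAD_TO_BODY_COEF):
    """Estimate the floor point when only the head bounding box is available."""
    x1, y1, x2, y2 = box
    estimated_body_height = (y2 - y1) * coef
    # Project the estimated height downward from the top edge of the head box
    return ((x1 + x2) / 2, y1 + estimated_body_height)


print(floor_point_from_body((100, 50, 200, 450)))           # (150.0, 450)
print(floor_point_from_head((140, 50, 160, 70), coef=6.0))  # (150.0, 170.0)
```

The head-based estimate can fall below the visible frame when a person is partly occluded or cut off by the image border, which is the case this fallback is meant to handle.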

This notebook is designed to run locally.

In [None]:
# Install the library with YOLO models
%pip install ultralytics

# Install the computer vision library
%pip install opencv-python

# Homography

To determine the positions of people within the museum, a homography matrix was calculated. Access to the museum map was obtained, and ten specific points on the map were cross-referenced with their corresponding points in the images captured by the cameras. Using this data, a homography matrix was calculated, enabling the transformation of pixel coordinates from the camera images into real-world positions within the museum.
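
To make the transformation concrete: a homography is a 3×3 matrix applied to the pixel coordinate in homogeneous form, followed by a division by the third component. The sketch below does this by hand with NumPy. `apply_homography` is a hypothetical helper written for illustration, and `H_example` is a made-up scale-and-shift matrix, not the museum's homography.

```python
import numpy as np


def apply_homography(H, x, y):
    """Map the pixel (x, y) through a 3x3 homography matrix H."""
    # Lift to homogeneous coordinates, multiply, then divide by the last component
    px, py, w = H @ np.array([x, y, 1.0])
    return px / w, py / w


# Illustrative matrix only: scales by 2 and shifts by (10, 20)
H_example = np.array([
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 0.0, 1.0],
])

map_x, map_y = apply_homography(H_example, 5, 7)
print(map_x, map_y)  # 20.0 34.0
```

`cv2.perspectiveTransform` performs the same computation for whole arrays of points.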

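Initially, the last (10th) point was excluded from the calculation and used for validation. The predicted position on the map was `1110x603`, while the actual position was estimated to be `1128x610`, an offset of 18 pixels horizontally and 7 pixels vertically. This is a good level of accuracy given the imposed limitations. The code below fits the final matrix on all ten points, so the position it prints for the 10th point is a sanity check rather than a held-out validation.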
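
The size of that error can be expressed as the Euclidean distance between the predicted and the reference position. A small sketch using the two coordinates quoted above; the unit is map-image pixels, and converting it to meters would require the map scale, which is not given here.

```python
import math

predicted = (1110, 603)  # position predicted with the 10th point held out
actual = (1128, 610)     # reference position of the 10th point on the map

# Straight-line distance between prediction and reference
error_px = math.hypot(predicted[0] - actual[0], predicted[1] - actual[1])
print(f'Validation error: {error_px:.1f} map pixels')  # Validation error: 19.3 map pixels
```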

<img alt="Museum Map" src="./data/images/map.png" width="500" hspace="0"/>

In [34]:
import numpy as np
import cv2

# Video coordinates
video_coords = np.array([
    [44, 549],
    [256, 540],
    [615, 398],
    [823, 311],
    [851, 300],
    [1064, 327],
    [862, 523],
    [486, 445],
    [750, 339],
    [1005, 385],
], dtype=np.float32)

# Map coordinates
map_coords = np.array([
    [630, 308],
    [708, 389],
    [980, 389],
    [1250, 388],
    [1294, 390],
    [1294, 602],
    [862, 610],
    [858, 390],
    [1131, 388],
    [1128, 610],
], dtype=np.float32)

# Compute the homography matrix H
H, _ = cv2.findHomography(video_coords, map_coords)

# Sanity-check the matrix on the 10th point (this point is also part of the fit above)
# perspectiveTransform expects an array of shape (N, 1, 2)
val_video_coord = np.array([[[1005, 385]]], dtype=np.float32)

predicted_map_coord = cv2.perspectiveTransform(val_video_coord, H)
x, y = predicted_map_coord[0][0]
print(f'Predicted map coordinate: {round(x)}x{round(y)}')

Predicted map coordinate: 1118x605


# Detection

The following code brings all components together. Human bodies and heads are detected using the YOLO model. Height lines are drawn based on full-body bounding box predictions and estimated body heights derived from the head-to-body coefficient. Pixel coordinates of points on the floor are calculated and transformed into actual room positions using the homography matrix. Finally, the video outputs are generated, showcasing the results.
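
The head-to-body matching step can be sketched in isolation. This is an illustrative variant, not necessarily the notebook's exact logic: `match_head` and the `min_fraction` threshold are assumptions introduced here so that a body with no sufficiently overlapping head stays unmatched.

```python
def fraction_of_intersection(region, head):
    """Return the fraction of the head box area that lies inside the region."""
    x1 = max(region[0], head[0])
    y1 = max(region[1], head[1])
    x2 = min(region[2], head[2])
    y2 = min(region[3], head[3])

    intersection_area = max(0, x2 - x1) * max(0, y2 - y1)
    head_area = (head[2] - head[0]) * (head[3] - head[1])
    # Guard against degenerate (zero-area) head boxes
    return intersection_area / head_area if head_area > 0 else 0.0


def match_head(body, heads, min_fraction=0.5):
    """Return the index of the best matching head for a body, or None."""
    x1, y1, x2, y2 = body
    upper_body = (x1, y1, x2, y1 + (y2 - y1) * 0.2)  # top 20% of the body box

    best_idx, best_fraction = None, min_fraction
    for idx, head in enumerate(heads):
        fraction = fraction_of_intersection(upper_body, head)
        if fraction >= best_fraction:
            best_idx, best_fraction = idx, fraction
    return best_idx


body = (100, 100, 200, 500)
heads = [(400, 90, 440, 130), (130, 100, 170, 140)]
print(match_head(body, heads))  # 1
```

Normalizing by the head area rather than the union keeps the score close to 1 whenever the head lies inside the upper body region, regardless of how large the body box is.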

---

Example 1: Detections and Estimated Positions  
<img src="./data/images/detection_example_1.png" width="500" hspace="0"/>
<img src="./data/images/position_example_1.png" width="480" hspace="0"/>

Example 2: Detections and Estimated Positions  
<img src="./data/images/detection_example_2.png" width="500" hspace="0"/>
<img src="./data/images/position_example_2.png" width="480" hspace="0"/>

---

In [None]:
from ultralytics import YOLO
import cv2
import numpy as np

HEAD_CLASS_ID = 0
HEAD_TO_BODY_COEF = 5.9437 # Refer to head_to_body_coef.ipynb
MAP_IMAGE_PATH = './data/images/map.png'

# Load the trained YOLO model
model = YOLO('./data/best.pt')

# Replace with the desired paths
input_video_path = '/Users/eakriulin/Downloads/museum_human_example.mp4'
detection_output_path = '/Users/eakriulin/Downloads/mh_with_map/detection_2.mp4'
position_output_path = '/Users/eakriulin/Downloads/mh_with_map/position_2.mp4'

def detect():
    print('Starting...')

    # Capture the input video and its properties
    video = cv2.VideoCapture(input_video_path)
    frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = video.get(cv2.CAP_PROP_FPS)  # Keep fractional frame rates such as 29.97

    # Read properties of the map image
    map_image = cv2.imread(MAP_IMAGE_PATH)
    map_height, map_width, _ = map_image.shape

    # Create VideoWriter objects to save the output videos
    detection_out = cv2.VideoWriter(detection_output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))
    position_out = cv2.VideoWriter(position_output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (map_width, map_height))

    # Process each frame
    current_frame_count = 0
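    # Report progress roughly every 1%; at least every frame to avoid a modulo by zero on short videos
    print_after_each_ith_frame = max(1, frame_count // 100)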
    while video.isOpened():
        has_frame, detection_frame = video.read()
        if not has_frame:
            break
        
        # Create a new map frame to output positions (copy instead of re-reading from disk)
        position_frame = map_image.copy()

        # Print the processing percentages
        current_frame_count += 1
        if current_frame_count % print_after_each_ith_frame == 0:
            print(f'Processed {((current_frame_count / frame_count) * 100):.0f}%')

        # Run object detection with the YOLO model
        results = model.predict(detection_frame, conf=0.7, save=False, verbose=False)

        heads = []
        bodies = []
        pixel_coords = []

        # Loop through each detected object in the frame
        for result in results:
            for box in result.boxes:
                # Move tensors to the CPU first so this also works when the model runs on a GPU
                x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
                class_id = int(box.cls[0].cpu().numpy())

                if class_id == HEAD_CLASS_ID:
                    heads.append((x1, y1, x2, y2))
                else:
                    bodies.append((x1, y1, x2, y2))

        matched_head_indices = set()

        # Loop through each detected body, identify a matching head, draw the height line
        for body in bodies:
            x1, y1, x2, y2 = body

            upper_body = [x1, y1, x2, y1 + (y2 - y1) * 0.2]  # Top 20% of the body bounding box
            matched_head_idx = -1
            # Start at 0 so that a head with no overlap is never treated as a match
            matched_fraction_of_intersection = 0.0

            # Loop through each detected head, identify the best matching one
            for idx, head in enumerate(heads):
                fraction_of_intersection = get_fraction_of_intersection(upper_body, head)
                if fraction_of_intersection > matched_fraction_of_intersection:
                    matched_head_idx = idx
                    matched_fraction_of_intersection = fraction_of_intersection

            # Fix the best matching head
            if matched_head_idx != -1:
                matched_head_indices.add(matched_head_idx)
            
            height = y2 - y1
            middle_x = int((x2 + x1) / 2)
            start_y = int(y1)
            end_y = int(y1 + height)

            # Store pixel coords corresponding to the position in the room
            pixel_coords.append([middle_x, end_y])

            # Draw the height line (green)
            cv2.line(detection_frame, (middle_x, start_y), (middle_x, end_y), (100, 255, 110), 2)

        # Loop through each detected head, skip the matched ones, draw the estimated height line
        for idx, head in enumerate(heads):
            if idx in matched_head_indices:
                continue

            x1, y1, x2, y2 = head

            estimated_height = (y2 - y1) * HEAD_TO_BODY_COEF
            middle_x = int((x2 + x1) / 2)
            start_y = int(y1)
            end_y = int(y1 + estimated_height)

            # Store pixel coords
            pixel_coords.append([middle_x, end_y])

            # Draw the estimated height line (OpenCV colors are BGR, so this renders light blue)
            cv2.line(detection_frame, (middle_x, start_y), (middle_x, end_y), (255, 240, 100), 2)

        # Convert pixel coordinates into room positions
        for x, y in pixel_coords:
            pixel_coord = np.array([[x, y]], dtype=np.float32)
            pixel_coord = np.array([pixel_coord])

            room_coord = cv2.perspectiveTransform(pixel_coord, H)
            room_x = int(room_coord[0][0][0])
            room_y = int(room_coord[0][0][1])

            # Draw the circle to represent the estimated room position
            cv2.circle(position_frame, (room_x, room_y), radius=40, color=(0, 0, 0), thickness=1)

        # Write to the output videos
        detection_out.write(detection_frame)
        position_out.write(position_frame)

    print('Finishing...')

    video.release()
    detection_out.release()
    position_out.release()
    cv2.destroyAllWindows()
        

def get_fraction_of_intersection(body, head):
    x1 = max(body[0], head[0])
    y1 = max(body[1], head[1])
    x2 = min(body[2], head[2])
    y2 = min(body[3], head[3])

    intersection_area = max(0, x2 - x1) * max(0, y2 - y1)
    head_area = (head[2] - head[0]) * (head[3] - head[1])

    return intersection_area / head_area

detect()