**WELCOME TO PERFECT-PULLUP**

The goal of this project is to estimate whether a pull-up captured on video is a full range-of-motion, or "perfect," pull-up. The project uses a pre-trained deep learning human pose estimation model to extract keypoints and assess the form of each pull-up.

The pre-trained model used is stored in **Caffe** format.

**Caffe = Convolutional Architecture for Fast Feature Embedding.**
Caffe is an open-source deep learning framework developed by the Berkeley Vision and Learning Center (BVLC).

**Layers:**
The network consists of multiple convolutional layers (Convolution), followed by ReLU activation layers (ReLU) and pooling layers (Pooling).
The convolutional layers perform feature extraction, while the ReLU layers introduce non-linearity.
The pooling layers downsample the feature maps, reducing their spatial dimensions.
The layers are described in the prototxt file in the models folder.
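
As a rough illustration of the first two layer types, the NumPy sketch below applies one 3×3 convolution followed by ReLU to a toy 5×5 input. The kernel values and the helper names `conv2d_valid` and `relu` are assumptions made for this example; they are not taken from the model's prototxt.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D cross-correlation, the operation CNN convolution layers apply."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window element-wise by the kernel and sum the result
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Zero out negative activations and keep positive ones unchanged."""
    return np.maximum(x, 0)

# Toy input whose values increase by 1 from left to right
image = np.arange(25, dtype=np.float32).reshape(5, 5)

# Hypothetical kernel that responds to left-to-right intensity increases
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

features = conv2d_valid(image, kernel)   # 3x3 map, every entry is 6.0
activated = relu(features)               # positive responses pass through
suppressed = relu(-features)             # negative responses become 0
```

The 5×5 input shrinks to 3×3 because no padding is used; real networks usually pad the input so the spatial size is preserved.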

**Hyper-Parameters:**
In the convolution layers' parameters, learning-rate multipliers of 4.0 and 8.0 are used with decay multipliers of 1.0 and 0, respectively.
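
In Caffe, each multiplier scales the solver's global value for that parameter, so the effective learning rate is `base_lr * lr_mult` and the effective weight decay is `weight_decay * decay_mult`. The sketch below works through that arithmetic. The `BASE_LR` and `WEIGHT_DECAY` values are hypothetical, and it assumes the first pair (4.0, 1.0) applies to the weights and the second (8.0, 0) to the biases, which is the usual Caffe ordering. These multipliers only matter during training; they have no effect on a forward pass.

```python
def effective_hyperparams(base_lr, weight_decay, lr_mult, decay_mult):
    """Return the (learning rate, weight decay) actually applied to one parameter blob."""
    return base_lr * lr_mult, weight_decay * decay_mult

BASE_LR = 1e-4        # hypothetical solver learning rate
WEIGHT_DECAY = 5e-4   # hypothetical solver weight decay

# Assumed mapping: first multiplier pair -> weights, second pair -> biases
weights_lr, weights_decay = effective_hyperparams(BASE_LR, WEIGHT_DECAY, lr_mult=4.0, decay_mult=1.0)
biases_lr, biases_decay = effective_hyperparams(BASE_LR, WEIGHT_DECAY, lr_mult=8.0, decay_mult=0.0)

print(f"weights: lr={weights_lr:g}, decay={weights_decay:g}")
print(f"biases:  lr={biases_lr:g}, decay={biases_decay:g}")
```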

**Architecture:**
The architecture follows a Convolutional Pose Machine (CPM) approach.
It consists of multiple stages, each containing several convolutional layers followed by ReLU activations.
Each stage refines the pose estimate progressively: a later stage takes the previous stage's keypoint predictions together with image features and produces a corrected prediction.

**Output:**
The network outputs one heatmap per keypoint; each heatmap gives the confidence that the keypoint lies at each location.
Example heatmaps for all 18 keypoints from the model:
<img src="media/heatmaps.png" alt="Screenshot" width="1180"/>

**Literature Review:**
The first piece of literature is "Caffe: Convolutional Architecture for Fast Feature Embedding," a report by Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama and Trevor Darrell at UC Berkeley.

The model I used is a network built with Caffe. Caffe "can also be used to extract semantic features from images using a pre-trained network" (pg. 678, para. 1).

A layer in a Caffe network takes in a blob as input and yields one as output. "Layers have two key responsibilities for the operation of the network as a whole: a forward pass that takes the inputs and produces the outputs, and a backward pass that takes the gradient with respect to the output, and computes the gradients with respect to the parameters and to the input" (pg. 677, para. 3).

Using a pre-trained model means the training, and therefore every backward pass, is already done. I only need to set a blob built from the frame as the input and run a forward pass to get the desired outputs.
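
A blob here is a 4-D array laid out as batch × channels × height × width. The NumPy sketch below shows roughly how a single frame becomes such an array. The helper name `image_to_blob` and the synthetic frame are assumptions for illustration, and the resize to the network's input size is left out.

```python
import numpy as np

def image_to_blob(image_hwc, scale=1.0 / 255):
    """Convert an H x W x C uint8 image into a 1 x C x H x W float32 blob."""
    scaled = image_hwc.astype(np.float32) * scale   # map pixel values to [0, 1]
    chw = np.transpose(scaled, (2, 0, 1))           # channels first
    return chw[np.newaxis, ...]                     # add the batch dimension

# Synthetic all-white frame, already at the 368 x 368 network input size
frame = np.full((368, 368, 3), 255, dtype=np.uint8)
blob = image_to_blob(frame)
print(blob.shape)
```

The scale factor of 1/255 matches the one the notebook uses when it builds the input blob for the network.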

Below is an example of extracting a single feature (left shoulder) when detecting keypoints in an image.

<img src="media/feature.png" alt="Screenshot" width="200"/>
<img src="media/with-feature.png" alt="Screenshot" width="200"/>
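
Extracting one keypoint amounts to selecting its channel in the output and taking the location of the highest value. The sketch below does this on a synthetic output tensor; the 46 × 46 heatmap size, the planted peak, and the helper name `locate_keypoint` are assumptions for the example, not values read from the model.

```python
import numpy as np

KEYPOINTS = [
    "Nose", "Neck", "RShoulder", "RElbow", "RWrist", "LShoulder", "LElbow", "LWrist",
    "RHip", "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle", "REye", "LEye", "REar", "LEar"
]

def locate_keypoint(output, name, mapping):
    """Return ((x, y), confidence) of the strongest response in one keypoint's heatmap."""
    heatmap = output[0, mapping.index(name)]
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(col), int(row)), float(heatmap[row, col])

# Synthetic network output: one 46 x 46 heatmap per keypoint (assumed size)
output = np.zeros((1, len(KEYPOINTS), 46, 46), dtype=np.float32)
output[0, KEYPOINTS.index("LShoulder"), 12, 30] = 0.9   # planted peak at row 12, column 30

point, confidence = locate_keypoint(output, "LShoulder", KEYPOINTS)
print(point, round(confidence, 2))
```

Note that the point is returned as (x, y), that is (column, row), which is the order OpenCV drawing functions expect.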

The model I chose has max pooling layers. The report "Pooling Methods in Deep Neural Networks, a Review," by Hossein Gholamalinezhad and Hossein Khosravi, says "In max pooling, the maximum activation is selected from each pooling region" (Section 2.5). 

Figure from section 2.2 of the report:

<img src="media/max_pooling.png" alt="Screenshot" width="400"/>

In keypoint detection, as in other vision tasks, max pooling is a method for downsampling feature maps. By selecting the maximum activation within each pooling region, max pooling retains the most prominent features while discarding less relevant information.
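
A small worked example may make this concrete. The sketch below applies 2×2 max pooling with stride 2 to a toy 4×4 feature map; the window size, the input values, and the helper name `max_pool_2x2` are assumptions for illustration and are not read from the model's prototxt.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D feature map with even height and width."""
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split the map into non-overlapping 2x2 blocks, then take each block's maximum
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 3]], dtype=np.float32)

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4. 2.]
#  [2. 7.]]
```

The 4×4 map shrinks to 2×2, and each output value is the largest activation in its region.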

In [2]:
import cv2
import numpy as np
from enum import Enum

In [3]:
# State of the pull-up
class PullUpState(Enum):
    BOTTOM = 0
    TOP = 1

In [4]:
# Halve the frame's width and height to speed up processing
def preprocess_frame(frame):
    frame = cv2.resize(frame, None, fx=0.5, fy=0.5)
    return frame

In [5]:
def detect_keypoints(frame, output, keypoints_mapping, threshold):
    H, W = frame.shape[:2]
    
    keypoints_of_interest = ["LShoulder", "RShoulder", "LElbow", "RElbow", "LWrist", "RWrist"]
    
    for keypoint in keypoints_of_interest:
        # Get the index of the keypoint in the keypoints mapping
        index = keypoints_mapping.index(keypoint)
        
        # Get the probability map for the keypoint
        prob_map = output[0, index, :, :]
        prob_map = cv2.resize(prob_map, (W, H))
        
        # Find the maximum confidence value and its location in the probability map
        _, confidence, _, point = cv2.minMaxLoc(prob_map)
        
        # Check if the confidence value exceeds the threshold
        if confidence > threshold:
            cv2.circle(frame, point, 5, (0, 255, 255), thickness=-1, lineType=cv2.FILLED)
            cv2.putText(frame, f"{keypoint}: {confidence:.2f}", (point[0] + 6, point[1] + 6),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1, lineType=cv2.LINE_AA)
    
    # Return the frame with keypoints drawn on it
    return frame


In [6]:
def draw_connections(frame, output, keypoints_pairs, keypoints_mapping, threshold):
    H, W = frame.shape[:2]
    
    for pair in keypoints_pairs:
        # Draw only pairs whose first keypoint is one of the arm keypoints of interest
        if pair[0] in [keypoints_mapping.index("LShoulder"), keypoints_mapping.index("RShoulder"),
                       keypoints_mapping.index("LElbow"), keypoints_mapping.index("RElbow"),
                       keypoints_mapping.index("LWrist"), keypoints_mapping.index("RWrist")]:
            # Get the indices of the keypoints in the pair
            index1, index2 = pair
            
            # Get the maximum confidence scores for both keypoints in the pair
            confidence1 = output[0, index1, :, :].max()
            confidence2 = output[0, index2, :, :].max()
            
            # Check if both keypoints have confidence scores above the threshold
            if confidence1 > threshold and confidence2 > threshold:
                # Get the probability maps for both keypoints
                prob_map1 = output[0, index1, :, :]
                prob_map2 = output[0, index2, :, :]
                
                # Find the location of the maximum confidence points for both keypoints
                _, _, _, point1 = cv2.minMaxLoc(prob_map1)
                _, _, _, point2 = cv2.minMaxLoc(prob_map2)
                
                # Calculate the pixel coordinates of the keypoints in the original frame
                x1, y1 = int(W * point1[0] / prob_map1.shape[1]), int(H * point1[1] / prob_map1.shape[0])
                x2, y2 = int(W * point2[0] / prob_map2.shape[1]), int(H * point2[1] / prob_map2.shape[0])
                
                # Draw a line connecting the keypoints on the frame
                cv2.line(frame, (x1, y1), (x2, y2), (0, 255, 0), 3)
    
    # Return the frame with connections drawn on it
    return frame
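
`draw_connections` finds each peak in heatmap coordinates and then rescales it to frame pixels. The sketch below isolates that rescaling step; the helper name `heatmap_to_frame` and the example sizes are assumptions for illustration.

```python
def heatmap_to_frame(point, heatmap_shape, frame_shape):
    """Scale an (x, y) heatmap location to pixel coordinates in the frame."""
    map_h, map_w = heatmap_shape
    frame_h, frame_w = frame_shape
    x, y = point
    return int(frame_w * x / map_w), int(frame_h * y / map_h)

# A peak in the middle of a 46 x 46 heatmap, mapped onto a 640 x 360 frame
frame_point = heatmap_to_frame((23, 23), heatmap_shape=(46, 46), frame_shape=(360, 640))
print(frame_point)  # (320, 180)
```

The x and y axes are scaled independently, so the mapping stays correct when the frame is not square even though the heatmap is.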


In [7]:
def detect_pull_ups(output, SHOULDER_INDEX, ELBOW_INDEX, WRIST_INDEX, threshold, ANGLE_THRESHOLD_BOTTOM, ANGLE_THRESHOLD_TOP, current_state, repetition_count, prev_angle):
    # Extract coordinates of shoulder, elbow, and wrist points from the output
    shoulder_point = output[0, SHOULDER_INDEX, :, :]
    elbow_point = output[0, ELBOW_INDEX, :, :]
    wrist_point = output[0, WRIST_INDEX, :, :]

    # Check if all keypoints have confidence scores above the threshold
    if all(output[0, index, :, :].max() > threshold for index in [SHOULDER_INDEX, ELBOW_INDEX, WRIST_INDEX]):
        # Find the location of maximum confidence for each keypoint
        _, _, _, shoulder_max_loc = cv2.minMaxLoc(shoulder_point)
        _, _, _, elbow_max_loc = cv2.minMaxLoc(elbow_point)
        _, _, _, wrist_max_loc = cv2.minMaxLoc(wrist_point)

        # Convert locations to numpy arrays
        shoulder_coords = np.array(shoulder_max_loc)
        elbow_coords = np.array(elbow_max_loc)
        wrist_coords = np.array(wrist_max_loc)

        # Calculate vectors representing upper arm and forearm
        upper_arm_vector = shoulder_coords - elbow_coords
        forearm_vector = wrist_coords - elbow_coords
        
        # Calculate dot product and magnitudes of vectors
        dot_product = np.dot(upper_arm_vector, forearm_vector)
        upper_arm_magnitude = np.linalg.norm(upper_arm_vector)
        forearm_magnitude = np.linalg.norm(forearm_vector)

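        # Calculate angle between upper arm and forearm vectors; clip the cosine
        # to [-1, 1] so floating-point rounding cannot push arccos into NaN
        cosine = dot_product / (upper_arm_magnitude * forearm_magnitude)
        angle_radians = np.arccos(np.clip(cosine, -1.0, 1.0))
        angle_degrees = np.degrees(angle_radians)

        # Remember the angle for display, then check for state transitions
        prev_angle = angle_degrees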
        
        if current_state == PullUpState.BOTTOM:
            if angle_degrees < ANGLE_THRESHOLD_TOP:
                print("Entered TOP state")
                current_state = PullUpState.TOP
        elif current_state == PullUpState.TOP:
            if angle_degrees > ANGLE_THRESHOLD_BOTTOM:
                print("Entered BOTTOM state")
                repetition_count += 1
                current_state = PullUpState.BOTTOM

    # Return updated state, repetition count, and previous angle
    return current_state, repetition_count, prev_angle
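
The elbow angle above comes from the dot-product identity θ = arccos(u · v / (|u| |v|)), where u points from the elbow to the shoulder and v from the elbow to the wrist. The sketch below isolates that computation so it can be checked on known geometry. The helper name `joint_angle_degrees` is hypothetical, and returning `None` for coincident points is a choice made for this example.

```python
import numpy as np

def joint_angle_degrees(shoulder, elbow, wrist):
    """Angle at the elbow, in degrees, between the upper arm and the forearm."""
    upper_arm = np.asarray(shoulder, dtype=float) - np.asarray(elbow, dtype=float)
    forearm = np.asarray(wrist, dtype=float) - np.asarray(elbow, dtype=float)
    norm_product = np.linalg.norm(upper_arm) * np.linalg.norm(forearm)
    if norm_product == 0:
        return None  # two keypoints coincide, so the angle is undefined
    # Clip to guard against rounding that lands slightly outside [-1, 1]
    cosine = np.clip(np.dot(upper_arm, forearm) / norm_product, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosine)))

right_angle = joint_angle_degrees((0, 0), (1, 0), (1, 1))    # bent arm
straight_arm = joint_angle_degrees((0, 0), (1, 0), (2, 0))   # fully extended arm
print(right_angle, straight_arm)
```

A fully extended arm gives an angle near 180 degrees and a tightly bent arm gives a small angle, which is why the bottom threshold is the larger of the two.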

In [8]:
def visualize_heatmaps(output, keypoints_mapping):
    num_keypoints = len(keypoints_mapping)
    heatmaps = []

    # Iterate through each keypoint
    for index in range(num_keypoints):
        # Get the probability map for the keypoint
        prob_map = output[0, index, :, :]

        # Normalize the probability map to the range [0, 255]
        prob_map = cv2.normalize(prob_map, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8U)

        # Apply a color map for better visualization
        heatmap_colored = cv2.applyColorMap(prob_map, cv2.COLORMAP_JET)
        
        # Append the colored heatmap to the list
        heatmaps.append(heatmap_colored)

    # Concatenate all the heatmaps horizontally for visualization
    heatmaps_combined = np.hstack(heatmaps)

    return heatmaps_combined
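
The normalization step stretches each heatmap so that its smallest value maps to 0 and its largest to 255. The NumPy sketch below approximates that min-max scaling; the helper name `normalize_to_uint8` and the toy values are assumptions for illustration, and rounding may differ slightly from OpenCV's.

```python
import numpy as np

def normalize_to_uint8(heatmap):
    """Min-max scale a float heatmap to the range [0, 255] as uint8."""
    lo, hi = float(heatmap.min()), float(heatmap.max())
    if hi == lo:
        # A constant map has no contrast to stretch
        return np.zeros(heatmap.shape, dtype=np.uint8)
    scaled = (heatmap - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

heatmap = np.array([[0.0, 0.2],
                    [0.6, 1.0]], dtype=np.float32)
normalized = normalize_to_uint8(heatmap)
print(normalized)
# [[  0  51]
#  [153 255]]
```

Because every map is stretched independently, a heatmap with a weak peak looks as bright as one with a strong peak, so the colored view shows where each peak is rather than how confident the model is.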


In [9]:
# Load the Caffe deep neural network for pose estimation
net = cv2.dnn.readNet('models/pose_iter_440000.caffemodel', 'models/pose_deploy_linevec.prototxt')
# caffemodel file stores the weights of the trained model
# prototxt file stores the architecture of the neural network, defining the layers and their connections

In [10]:
# Load the video file (pass a device index such as 1 instead to read from a camera)
# cap = cv2.VideoCapture(1)
cap = cv2.VideoCapture('media/cam.mp4')

if not cap.isOpened():
    print("Error: Could not open video file.")
    exit()

In [11]:
# Define the confidence threshold for keypoint detection
threshold = 0.1

In [12]:
# Define the keypoints names and the index pairs of keypoints that should be connected
keypoints_mapping = [
    "Nose", "Neck", "RShoulder", "RElbow", "RWrist", "LShoulder", "LElbow", "LWrist",
    "RHip", "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle", "REye", "LEye", "REar", "LEar"
]

keypoints_pairs = [
    (0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
    (1, 14), (14, 16), (14, 15), (1, 8), (8, 9), (9, 10),
    (11, 12), (12, 13)
]

In [13]:
# Define the starting state
current_state = PullUpState.BOTTOM
repetition_count = 0

prev_angle = None

In [14]:
# Define the indices of the left shoulder, elbow, and wrist keypoints
SHOULDER_INDEX = keypoints_mapping.index("LShoulder")
ELBOW_INDEX = keypoints_mapping.index("LElbow")
WRIST_INDEX = keypoints_mapping.index("LWrist")

# Threshold for arm extension and flexion angle (in degrees)
ANGLE_THRESHOLD_BOTTOM = 140  # Can adjust
ANGLE_THRESHOLD_TOP = 59  # Can adjust
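
Using two separate thresholds gives the counter hysteresis: a repetition is counted only after the elbow angle drops below the top threshold and then rises above the bottom threshold, so small fluctuations around a single cut-off cannot add repetitions. The sketch below runs the same two-state logic on a made-up sequence of angles. The function name `count_repetitions` and the angle values are assumptions for illustration.

```python
from enum import Enum

class PullUpState(Enum):
    BOTTOM = 0
    TOP = 1

def count_repetitions(angles, top_threshold=59, bottom_threshold=140):
    """Count full pull-ups in a sequence of elbow angles given in degrees."""
    state = PullUpState.BOTTOM
    reps = 0
    for angle in angles:
        if state is PullUpState.BOTTOM and angle < top_threshold:
            state = PullUpState.TOP          # arm bent far enough: top reached
        elif state is PullUpState.TOP and angle > bottom_threshold:
            state = PullUpState.BOTTOM       # arm extended again: one full rep
            reps += 1
    return reps

# Two full reps with a partial pull (down to 70 degrees only) in between
angles = [170, 120, 55, 90, 150, 100, 70, 130, 160, 50, 145]
total = count_repetitions(angles)
print(total)  # 2
```

The partial pull never goes below 59 degrees, so the state stays at `BOTTOM` and no repetition is counted for it.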

In [15]:
# Define frame counter to skip frames and improve speed
frame_counter = 0

# Start a loop to continuously capture frames
while True:
    # Capture frame-by-frame
    ret, frame = cap.read()

    if not ret:
        print("Error: Failed to capture frame.")
        break

    frame_counter += 1

    # Process every 6th frame
    if frame_counter % 6 != 0:
        continue  

    # Preprocess the frame
    frame = preprocess_frame(frame)

    # Generate blob from the frame, set it as the input to the network, and perform a forward pass
    blob = cv2.dnn.blobFromImage(frame, 1.0 / 255, (368, 368), (0, 0, 0), swapRB=False, crop=False)
    net.setInput(blob)
    output = net.forward()  # Only a forward pass is needed as the network is already trained

    # Visualize the heatmaps for each keypoint
    heatmap = visualize_heatmaps(output, keypoints_mapping)

    # Display the combined heatmaps
    cv2.imshow("Heatmaps", heatmap)

    # Detect keypoints and draw them on the frame, then draw connections between them
    frame = detect_keypoints(frame, output, keypoints_mapping, threshold)
    frame = draw_connections(frame, output, keypoints_pairs, keypoints_mapping, threshold)

    # Detect pull-ups and update state, repetition count, and previous angle
    current_state, repetition_count, prev_angle = detect_pull_ups(output, SHOULDER_INDEX, ELBOW_INDEX, WRIST_INDEX, threshold, ANGLE_THRESHOLD_BOTTOM, ANGLE_THRESHOLD_TOP, current_state, repetition_count, prev_angle)

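    # Add info to frame (the angle stays unknown until the arm keypoints are first detected)
    cv2.putText(frame, f"Repetitions: {repetition_count}", (10, 60), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)
    angle_text = f"Angle: {prev_angle:.2f}" if prev_angle is not None else "Angle: n/a"
    cv2.putText(frame, angle_text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)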

    # Display the frame
    cv2.imshow('Pullup Pose', frame)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Entered TOP state
Entered BOTTOM state
Entered TOP state
Entered BOTTOM state
Entered TOP state
Entered BOTTOM state
Error: Failed to capture frame.
