# Pose Estimation Related to Emotions

Before running the IPython notebook, install the required packages by running <code>pip install -r requirements.txt</code> in a terminal.

In [1]:
import cv2
from IPython.display import Image
import pandas as pd
from tools.detector import detect_poses
from tools.extractor import Extractor
from tools.metrics import label_probabilities
from ultralytics import YOLO



## Step 1: Preprocessing

### Data Preparation

Before working with the data, it helps to see what it looks like. To do so, load the <code>annotations.csv</code> file into a pandas DataFrame. The DataFrame has the following structure:

- Video Tag → The YouTube identifier of the source video. Use it to retrieve the source video.
In this version of the dataset, the videos are in the "/Videos" folder.
- Clip Id → Identifier of a clip within a source video. It is unique within that source video.
For a given “Video Tag” and “Clip Id”, each “Person Id” refers to exactly one person.
- Labels → An array of arrays containing the labels given by each annotator of the dataset.
- Frame Number → The frame that was used for that annotation.
- X → Starting position of the bounding box on the x-axis.
- Y → Starting position of the bounding box on the y-axis.
- Width → Percentage of the video width, used as the offset from “X”.
- Height → Percentage of the video height, used as the offset from “Y”.
- Person Id → Integer identifying a person within clips that share the same “Video Tag” and “Clip Id”.

In [2]:
df = pd.read_csv("assets/annotations/annotations.csv")
df.head()

| Unnamed: 0 | Video Tag | Clip Id | Labels | Frame Number | X | Y | Width | Height | Person Id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | aJKL0ahn1Dk | 1 | [['Happy'], ['Happy'], ['Happy']] | 19532 | 41.9652 | 4.873195 | 44.216991 | 94.802684 | 0 |
| 1 | aJKL0ahn1Dk | 1 | [['Happy'], ['Happy'], ['Happy']] | 19538 | 41.564836 | 4.87464 | 44.216991 | 94.802684 | 0 |
| 2 | aJKL0ahn1Dk | 1 | [['Happy'], ['Happy'], ['Happy']] | 19544 | 41.164472 | 4.876086 | 44.216991 | 94.802684 | 0 |
| 3 | aJKL0ahn1Dk | 1 | [['Happy'], ['Happy'], ['Happy']] | 19550 | 40.764108 | 4.877532 | 44.216991 | 94.802684 | 0 |
| 4 | aJKL0ahn1Dk | 1 | [['Happy'], ['Happy'], ['Happy']] | 19556 | 39.646728 | 5.014136 | 44.216991 | 94.802684 | 0 |


Each combination of <code>Video Tag</code>, <code>Clip Id</code> and <code>Person Id</code> identifies the emotion of one person in one clip. The rows can therefore be split into segments, one per combination.

In [3]:
extractor = Extractor(
    "/Users/deniskrylov/Developer/PosEmotion/assets/annotations/annotations.csv",
    "/Users/deniskrylov/Developer/PosEmotion/assets/videos",
    "/Users/deniskrylov/Developer/PosEmotion/assets/frames"
)

# Uncomment the line below to extract frames from the videos
# extractor.extract_frames()

# Extracting the segments from the CSV file 
# (each segment represents a unique person in the fragment of video)
segments = extractor.extract_segments()
print("Number of segments:", len(segments))
print("First 5 segments:", segments[:5])

Number of segments: 629
First 5 segments: [(0, 27), (28, 39), (40, 51), (52, 69), (70, 77)]
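
The output above suggests that each segment is an inclusive `(start, end)` range of row indices. As a minimal sketch of how such ranges could be derived, assuming that rows of the same segment are stored consecutively, the hypothetical helper below groups consecutive equal keys. It is an illustration only and not necessarily how `Extractor.extract_segments` works.

```python
def find_segments(keys):
    """Return inclusive (start, end) index ranges of consecutive equal keys.

    `keys` is a list with one hashable key per row, for example a
    (Video Tag, Clip Id, Person Id) tuple. Hypothetical helper for
    illustration; it assumes rows of one segment are adjacent.
    """
    segments = []
    start = 0
    for i in range(1, len(keys) + 1):
        # Close the current run at the end of the list or when the key changes.
        if i == len(keys) or keys[i] != keys[start]:
            segments.append((start, i - 1))
            start = i
    return segments
```

With the DataFrame loaded earlier, the keys could be built as `list(zip(df["Video Tag"], df["Clip Id"], df["Person Id"]))`.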


Before pose detection, we convert the dataset so that the <code>Labels</code> column is replaced by multiple columns, one per emotion. Each column holds the probability of that emotion, calculated as $i/n$, where $i$ is the number of times the emotion was assigned and $n$ is the total number of labels given by the annotators.

In [4]:
df = label_probabilities(df)
df.head()

| Unnamed: 0 | Video Tag | Clip Id | Frame Number | X | Y | Width | Height | Person Id | Happy | Sad | Fear | Neutral | Surprise | Disgust | Anger |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aJKL0ahn1Dk | 1 | 19532 | 41.9652 | 4.873195 | 44.216991 | 94.802684 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | aJKL0ahn1Dk | 1 | 19538 | 41.564836 | 4.87464 | 44.216991 | 94.802684 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | aJKL0ahn1Dk | 1 | 19544 | 41.164472 | 4.876086 | 44.216991 | 94.802684 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | aJKL0ahn1Dk | 1 | 19550 | 40.764108 | 4.877532 | 44.216991 | 94.802684 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | aJKL0ahn1Dk | 1 | 19556 | 39.646728 | 5.014136 | 44.216991 | 94.802684 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
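
To make the $i/n$ calculation concrete, here is a minimal sketch for a single <code>Labels</code> value. The function name is hypothetical and this is not the implementation in `tools.metrics`. It assumes that the labels are stored as the string form of a list of lists, as in the CSV shown earlier, and that $n$ counts every label given by every annotator.

```python
import ast
from collections import Counter

# Emotion columns as they appear in the converted DataFrame above.
EMOTIONS = ["Happy", "Sad", "Fear", "Neutral", "Surprise", "Disgust", "Anger"]


def row_label_probabilities(labels, emotions=EMOTIONS):
    """Return {emotion: i / n} for one Labels value (hypothetical helper)."""
    if isinstance(labels, str):
        # Assumption: the CSV stores the nested list as a Python literal string.
        labels = ast.literal_eval(labels)
    # Flatten the per-annotator lists into a single list of labels.
    flat = [label for annotator in labels for label in annotator]
    counts = Counter(flat)
    n = len(flat)
    return {emotion: (counts[emotion] / n if n else 0.0) for emotion in emotions}
```

For example, three annotators who all chose "Happy" give a probability of 1.0 for <code>Happy</code> and 0.0 for every other emotion, which matches the rows above.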


### Extract Key Points

To extract keypoints, several approaches will be used: YOLO-Pose, DeepPose and OpenPose. For each approach, a separate DataFrame with the keypoint coordinates will be created.

- For each frame, the person is located using the ground-truth bounding box and cropped out of the frame.
- A pose detection algorithm is then applied to each frame.
- Finally, a CSV file with the keypoints is created.

In [5]:
# for each row get X, Y, Width, Height
# create a cropped_image = image[y:y+h, x:x+w]
# use this cropped image to detect poses
# get the pose keypoints
# add them to the dataframe
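
The cropping step outlined in the comments above could look roughly like the sketch below. It assumes that `X`, `Y`, `Width` and `Height` are percentages (0 to 100) of the frame size, as the column descriptions suggest, and that the frame is a NumPy-style array such as the one returned by `cv2.imread`. Both function names are hypothetical.

```python
def percent_box_to_pixels(x, y, width, height, frame_w, frame_h):
    """Convert a box given in percent of the frame size to pixel bounds.

    Assumption: all four box values are percentages (0-100) of the
    frame width and height. Returns (x0, y0, x1, y1) in pixels.
    """
    x0 = int(round(x / 100 * frame_w))
    y0 = int(round(y / 100 * frame_h))
    x1 = int(round((x + width) / 100 * frame_w))
    y1 = int(round((y + height) / 100 * frame_h))
    # Clamp to the frame so the slice never leaves the image.
    x0, x1 = max(0, x0), min(frame_w, x1)
    y0, y1 = max(0, y0), min(frame_h, y1)
    return x0, y0, x1, y1


def crop_person(image, row):
    """Crop one annotated person from a frame (array of shape H x W x C)."""
    frame_h, frame_w = image.shape[:2]
    x0, y0, x1, y1 = percent_box_to_pixels(
        row["X"], row["Y"], row["Width"], row["Height"], frame_w, frame_h
    )
    # Rows are indexed by y and columns by x.
    return image[y0:y1, x0:x1]
```

If the box values turn out to be pixels rather than percentages, only `percent_box_to_pixels` would need to change.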

##### YOLO-Pose

In [6]:
def apply_yolo():
    keypoints = []
    model = YOLO("/Users/deniskrylov/Developer/PosEmotion/models/yolo-pose.pt")

    for index, row in df.iterrows():
        # Frames are stored as "<Video Tag>_<row index>.jpg"
        frame_path = "/Users/deniskrylov/Developer/PosEmotion/assets/frames/{}_{}.jpg".format(
            row["Video Tag"], index
        )
        try:
            result = detect_poses(frame_path, model)
            keypoints.append(result.to_dict())
            print("Progress: {}/{}".format(index + 1, len(df)))
        except Exception as exc:
            # Chain the original error so its cause is not hidden
            raise RuntimeError("Error in detecting poses for {}".format(frame_path)) from exc

    keypoints_df = pd.DataFrame(keypoints)
    keypoints_df.to_csv("/Users/deniskrylov/Developer/PosEmotion/assets/annotations/yolo_keypoints.csv", index=True)


# Apply YOLO to the frames (comment out the line below to skip this slow step)
apply_yolo()


image 1/1 /Users/deniskrylov/Developer/PosEmotion/assets/frames/aJKL0ahn1Dk_0.jpg: 768x1280 1 person, 1097.3ms
Speed: 2.4ms preprocess, 1097.3ms inference, 245.4ms postprocess per image at shape (1, 3, 768, 1280)
Progress: 1/8087

image 1/1 /Users/deniskrylov/Developer/PosEmotion/assets/frames/aJKL0ahn1Dk_1.jpg: 768x1280 1 person, 1045.9ms
Speed: 1.7ms preprocess, 1045.9ms inference, 0.5ms postprocess per image at shape (1, 3, 768, 1280)
Progress: 2/8087

image 1/1 /Users/deniskrylov/Developer/PosEmotion/assets/frames/aJKL0ahn1Dk_2.jpg: 768x1280 1 person, 1036.3ms
Speed: 1.6ms preprocess, 1036.3ms inference, 0.5ms postprocess per image at shape (1, 3, 768, 1280)
Progress: 3/8087

image 1/1 /Users/deniskrylov/Developer/PosEmotion/assets/frames/aJKL0ahn1Dk_3.jpg: 768x1280 1 person, 1047.6ms
Speed: 1.7ms preprocess, 1047.6ms inference, 0.7ms postprocess per image at shape (1, 3, 768, 1280)
Progress: 4/8087

image 1/1 /Users/deniskrylov/Developer/PosEmotion/assets/frames/aJKL0ahn1Dk_4.jpg

##### OpenPose

##### DeepPose

### Normalize key points

Normalization has two parts: per image and per segment.

- [WRONG] Per Image: all keypoints will be normalized according to the default size of the image $(w,h)$ and according to the size of a person on the image.
- Per Segment: all segment sizes will be normalized to the default segment size $x$.
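
One possible reading of the per-segment step is that every segment is resampled to the same number of frames $x$. The sketch below does this with linear interpolation between neighbouring frames. The function name, the choice of linear interpolation and the representation of a frame as a flat list of keypoint coordinates are all assumptions made for illustration.

```python
def resample_segment(frames, target_len):
    """Linearly resample per-frame keypoint vectors to target_len frames.

    `frames` is a list of equally long lists of numbers, one list per
    frame. Hypothetical helper; the notebook does not define how the
    default segment size is reached.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if not frames:
        raise ValueError("frames must not be empty")
    if len(frames) == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(target_len)]

    # Distance in source frames between two consecutive output frames.
    step = (len(frames) - 1) / (target_len - 1)
    resampled = []
    for k in range(target_len):
        pos = k * step
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        t = pos - lo  # Interpolation weight of the later frame.
        resampled.append(
            [(1 - t) * a + t * b for a, b in zip(frames[lo], frames[hi])]
        )
    return resampled
```

The same function both shortens long segments and lengthens short ones, so all segments end up with `target_len` frames.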

### Frame-wise Aggregation

## Step 2: Feature Extraction

### Pose Features

### Dimensionality Reduction

## Step 3: Clustering

### Choose Clustering Algorithm

### Cluster Poses

## Step 4: Emotion Label Association

### Associate Poses with Emotions

## Step 5: Evaluation and Refinement

### Evaluate Clusters

### Refine Clusters