<center><h1> Body Tracking Using MediaPipe  </h1>


<h3> 
    Wim Pouw ( wim.pouw@donders.ru.nl )<br>James Trujillo ( james.trujillo@donders.ru.nl )<br>
    18-11-2021 </h3>
    
<img src="./images/BOOTCAMP.png"> </center>

<h3> Info documents </h3>
This module provides a simple demonstration of how to use MediaPipe for motion tracking of a single person. The approach is lightweight, and the output it produces has several distinct advantages, which are discussed at the end of this module.
<br><br>

* location code: 
https://github.com/WimPouw/EnvisionBootcamp2021/tree/main/Python/MediaBodyTracking

* citation: 
Pouw, W. T. J. L., & Trujillo, J. P. (2021-11-18). <i>Body Tracking Using MediaPipe</i> \[day you visited the site]. Retrieved from: https://github.com/WimPouw/EnvisionBootcamp2021/tree/main/Python/MediaBodyTracking

<h4>resources</h4>
* https://github.com/google/mediapipe
<br><br>
* Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
<br>
<h4>Required</h4>
Before you start, make sure the following Python packages are installed:

* opencv-python
* mediapipe
* numpy
* pandas

In [1]:
%config Completer.use_jedi = False
import cv2
import mediapipe
import pandas as pd
import numpy as np
import csv
 
drawingModule = mediapipe.solutions.drawing_utils
poseModule = mediapipe.solutions.pose

In [2]:
#list all videos in mediafolder
from os import listdir
from os.path import isfile, join
mypath = "./MediaToAnalyze/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
#time series output folder
foldtime = "./Timeseries_Output/"

In [3]:
#some preparatory functions and lists for saving the data


#take some google classification object and convert it into a string
def makegoginto_str(gogobj):
    gogobj = str(gogobj).strip("[]")
    gogobj = gogobj.split("\n")
    return(gogobj[:-1]) #ignore last element as this has nothing

#landmarks 33x
markers = ['NOSE', 'LEFT_EYE_INNER', 'LEFT_EYE', 'LEFT_EYE_OUTER', 'RIGHT_EYE_INNER', 'RIGHT_EYE', 'RIGHT_EYE_OUTER',
          'LEFT_EAR', 'RIGHT_EAR', 'MOUTH_LEFT', 'MOUTH_RIGHT', 'LEFT_SHOULDER', 'RIGHT_SHOULDER', 'LEFT_ELBOW', 
          'RIGHT_ELBOW', 'LEFT_WRIST', 'RIGHT_WRIST', 'LEFT_PINKY', 'RIGHT_PINKY', 'LEFT_INDEX', 'RIGHT_INDEX',
          'LEFT_THUMB', 'RIGHT_THUMB', 'LEFT_HIP', 'RIGHT_HIP', 'LEFT_KNEE', 'RIGHT_KNEE', 'LEFT_ANKLE', 'RIGHT_ANKLE',
          'LEFT_HEEL', 'RIGHT_HEEL', 'LEFT_FOOT_INDEX', 'RIGHT_FOOT_INDEX']

#check if there are numbers in a string
def num_there(s):
    return any(i.isdigit() for i in s)

#make the stringified position traces into clean values
def listpostions(newsamplelmarks):
    tracking_p = []
    for value in newsamplelmarks: #loop over the argument, not over a global variable
        if num_there(value):
            stripped = value.split(':', 1)[1]
            stripped = stripped.strip() #remove spaces in the string if present
            tracking_p.append(stripped) #add to this list  
    return(tracking_p)
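
The helpers above recover the coordinates by parsing the printed text of the landmark object, so they depend on how that object happens to be formatted as a string. The sketch below shows one possible alternative that reads the values directly. It assumes each landmark exposes numeric `x`, `y`, `z`, and `visibility` attributes, as the field names in the printed output suggest. The function name `landmarks_to_row` is hypothetical, and `SimpleNamespace` objects stand in for real MediaPipe landmarks so that the snippet runs without a video.

```python
from types import SimpleNamespace

def landmarks_to_row(landmarks):
    """Flatten landmarks into [x, y, z, visibility, x, y, z, visibility, ...]."""
    row = []
    for lm in landmarks:
        row.extend([lm.x, lm.y, lm.z, lm.visibility])
    return row

# Stand-in landmarks (assumption: real landmarks expose the same attributes)
fake_landmarks = [
    SimpleNamespace(x=0.1, y=-0.2, z=0.3, visibility=0.99),
    SimpleNamespace(x=0.4, y=0.5, z=-0.6, visibility=0.95),
]
row = landmarks_to_row(fake_landmarks)
print(row)
```

With a real tracking result, this would presumably be called on `results.pose_world_landmarks.landmark`. That call has not been tested here, and the values would be floats rather than the strings produced by the helpers above.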

Once the preparatory functions are defined and the packages are loaded, we can get to tracking. The code block below does three things: it performs the actual tracking using MediaPipe (via <i>poseModule.Pose</i> and its <i>process</i> method), draws the tracked points back onto each frame of the video (using <i>drawingModule</i> together with <i>cv2</i>), and saves the coordinates of the tracked points to a CSV file (using <i>csv</i>), which can then be loaded as a dataframe (using <i>pandas</i>) for analysis or further processing. 

In [4]:

for ff in onlyfiles:
    capture = cv2.VideoCapture(mypath+ff)
    frameWidth = capture.get(cv2.CAP_PROP_FRAME_WIDTH)
    frameHeight = capture.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = capture.get(cv2.CAP_PROP_FPS)
    print(frameWidth, frameHeight, fps )
    #pose tracking with keypoints save!
    #make an 'empty' video file where we can store the visualized tracking
    samplerate = fps #make the same as current video
    fourcc = cv2.VideoWriter_fourcc(*'XVID') #(*'XVID')
    out = cv2.VideoWriter('Videotracking_output/'+ff[:-4]+'.avi', fourcc, fps= samplerate, frameSize = (int(frameWidth), int(frameHeight))) #frame size must match the frames we write


    #make a variable list with x, y, z, info where data is appended to
    markerxyz = []
    for mark in markers:
        for pos in ['X', 'Y', 'Z', 'visibility']:
            nm = pos + "_" + mark
            markerxyz.append(nm)
    addvariable = ['time']
    addvariable.extend(markerxyz)

    time = 0
    timeseries = [addvariable]
    #MAIN ROUTINE
    with poseModule.Pose(min_detection_confidence=0.5, model_complexity = 2, min_tracking_confidence=0.75, smooth_landmarks = True) as pose:
         while (True):
            ret, frame = capture.read()
            if ret == True:
                results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if results.pose_landmarks is not None:
                    newsamplelmarks = makegoginto_str(results.pose_world_landmarks)
                    newsamplelmarks = listpostions(newsamplelmarks)
                    fuldataslice = [str(time)]
                    fuldataslice.extend(newsamplelmarks) #add positions
                    timeseries.append(fuldataslice) #append to the timeseries data
                    #frame is still in BGR order (only a converted copy was passed to MediaPipe), so no conversion back is needed
                    drawingModule.draw_landmarks(frame, results.pose_landmarks, poseModule.POSE_CONNECTIONS)
                    #for point in handsModule.HandLandmark:
                        #normalizedLandmark = results.pose_landmarks.landmark[point]
                        #pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, frameWidth, frameHeight)
                        #cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)
                cv2.imshow('MediaPipe Pose', frame)
                out.write(frame) #comment this out if you don't want to save a video
                time = round(time+1000/samplerate)
                if cv2.waitKey(1) == 27:
                    break
            if ret == False:
                break
    out.release()
    capture.release()
    cv2.destroyAllWindows()

    #write the time series row-wise to a csv file
    data = timeseries

    #open the csv file in 'w+' mode and write all rows
    with open(foldtime + ff[:-4]+'.csv', 'w+', newline ='') as file:
        write = csv.writer(file)
        write.writerows(data)

544.0 362.0 30.013573403038997
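
The loop above advances the timestamp by `round(time+1000/samplerate)` on every frame, so a small rounding error is added at each step. The sketch below compares that stepwise approach with computing the timestamp once from the frame index. The function names `accumulated_ms` and `direct_ms` are hypothetical and only serve this illustration; the frame rate is the one printed above for the sample video.

```python
def accumulated_ms(n_frames, fps):
    """Timestamp after n_frames when rounding at every step, as in the loop above."""
    t = 0
    for _ in range(n_frames):
        t = round(t + 1000 / fps)
    return t

def direct_ms(n_frames, fps):
    """Timestamp computed from the frame index, rounding only once."""
    return round(n_frames * 1000 / fps)

fps = 30.013573403038997  # frame rate printed for the sample video
for n in (12, 300, 3000):
    # frame count, stepwise timestamp (ms), frame-index timestamp (ms)
    print(n, accumulated_ms(n, fps), direct_ms(n, fps))
```

At this frame rate each step is about 33.32 ms but is rounded down to 33 ms, so the stepwise timestamp falls behind by roughly 0.3 ms per frame: after 12 frames it reads 396 ms where the frame index gives 400 ms. For long recordings, deriving time from a frame counter may therefore be the safer choice.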


Here's a sample frame from the output video: <br>
<img src="./images/mediapipe_body.png"> <br>
And here is a sample of the data that we produced:<br>

In [5]:
df_body = pd.read_csv(foldtime + ff[:-4]+'.csv')
df_body.head()

Unnamed: 0,time,X_NOSE,Y_NOSE,Z_NOSE,visibility_NOSE,X_LEFT_EYE_INNER,Y_LEFT_EYE_INNER,Z_LEFT_EYE_INNER,visibility_LEFT_EYE_INNER,X_LEFT_EYE,...,Z_RIGHT_HEEL,visibility_RIGHT_HEEL,X_LEFT_FOOT_INDEX,Y_LEFT_FOOT_INDEX,Z_LEFT_FOOT_INDEX,visibility_LEFT_FOOT_INDEX,X_RIGHT_FOOT_INDEX,Y_RIGHT_FOOT_INDEX,Z_RIGHT_FOOT_INDEX,visibility_RIGHT_FOOT_INDEX
0,396,0.538114,-0.060236,-0.050362,0.99951,0.555659,-0.049128,-0.04591,0.999261,0.557287,...,-0.055976,0.933046,-0.283941,0.462588,-0.274494,0.989258,-0.450967,-0.337301,-0.104251,0.955913
1,429,0.443146,-0.034791,0.088246,0.999035,0.458518,-0.032522,0.10883,0.998529,0.46063,...,-0.109341,0.936138,-0.410189,0.583331,-0.182948,0.989362,-0.638469,-0.215301,-0.182827,0.95865
2,462,0.440275,0.01185,0.067064,0.996508,0.453032,0.007117,0.08474,0.994618,0.45507,...,-0.109471,0.928449,-0.477191,0.576702,-0.025761,0.980243,-0.694765,-0.208185,-0.178391,0.954599
3,627,0.129521,0.13649,0.031299,0.996851,0.119219,0.130574,-0.018655,0.995154,0.1158,...,-0.424263,0.929862,-0.453918,0.379679,-0.243619,0.975504,0.40891,0.111685,-0.371516,0.949324
4,660,0.139343,0.048282,0.262759,0.997156,0.129757,0.025985,0.275302,0.995634,0.127441,...,-0.327688,0.929971,-0.452408,0.286135,-0.220267,0.951644,0.571573,0.054125,-0.259778,0.913952
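
As an illustration of further processing, the sketch below estimates the speed of one marker from a time series with the same column naming as above (`time` in milliseconds and `X_`, `Y_`, `Z_` columns per marker). The function name `marker_speed` is hypothetical, and the small dataframe is made-up data rather than real tracking output. Because frames without a detected person are missing from the file (note the jump from 462 to 627 in the sample), the distance is divided by the actual time difference between rows rather than by a fixed frame duration.

```python
import numpy as np
import pandas as pd

def marker_speed(df, marker):
    """Speed (m/s) of one marker between consecutive rows of the time series."""
    xyz = df[[f"X_{marker}", f"Y_{marker}", f"Z_{marker}"]].to_numpy(dtype=float)
    dist = np.linalg.norm(np.diff(xyz, axis=0), axis=1)    # metres moved per step
    dt = np.diff(df["time"].to_numpy(dtype=float)) / 1000  # milliseconds -> seconds
    return dist / dt

# Synthetic example data (made up for illustration, not real tracking output)
demo = pd.DataFrame({
    "time": [0, 33, 66, 165],
    "X_RIGHT_WRIST": [0.00, 0.03, 0.03, 0.03],
    "Y_RIGHT_WRIST": [0.00, 0.04, 0.04, 0.04],
    "Z_RIGHT_WRIST": [0.00, 0.00, 0.00, 0.12],
})
speed = marker_speed(demo, "RIGHT_WRIST")
print(speed)
```

The result has one value fewer than the number of rows, since each speed describes the step between two consecutive samples. In practice some smoothing would likely be needed before interpreting such a speed trace.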


One advantage of this output is that we get 3D tracking coordinates even though we used a 2D video. This is possible because the underlying MediaPipe models are trained with data for which the depth is known. The authors describe this for the hand-tracking model as follows: <i>"Synthetic dataset: To even better cover the possible hand poses and provide additional supervision for depth, we render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates. We use a commercial 3D hand model that is rigged with 24 bones and includes 36 blendshapes, which control fingers and palm thickness. The model also provides 5 textures with different skin tones. We created video sequences of transformation between hand poses and sampled 100K images from the videos." Zhang et al., 2020 </i><br><br>
Additionally, the coordinates provided here are given in meters, with the origin (0,0,0) at the center between the hips. This is advantageous because it reduces variability between videos in which the distance to the camera differs. <br><br>
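
A simple way to see what this coordinate system implies is sketched below. If the origin is the center between the hips, the midpoint of the two hip landmarks should be close to (0,0,0), and the length of any landmark's position vector is its distance from the hip center in meters. The coordinates are made-up values for a single frame, not real tracking output, and the variable names are chosen for this illustration only.

```python
import numpy as np

# Made-up world coordinates in metres for one frame (not real tracking output)
left_hip = np.array([0.10, 0.00, 0.02])
right_hip = np.array([-0.10, 0.00, -0.02])
right_wrist = np.array([-0.30, -0.40, 0.00])

# Midpoint between the hips; for real data this should be close to the origin
hip_centre = (left_hip + right_hip) / 2

# Distance of the wrist from the hip centre, in metres
wrist_from_hips = np.linalg.norm(right_wrist - hip_centre)

print(hip_centre)
print(wrist_from_hips)
```

The same check could be run on the `X_LEFT_HIP`, `X_RIGHT_HIP`, and related columns of the saved time series to confirm the origin for a given video.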
The major disadvantage to this method is that it is only capable of tracking a single individual at a time. For videos of one speaker/actor, this isn't an issue of course. But if we're interested in multi-party interactions and cannot (or do not wish to) split the video into different individuals (e.g., because of overlapping space between them), we need to use a different solution. We discuss a couple of such options in the modules covering hand tracking with MediaPipe, tracking using DeepLabCut, and tracking using OpenPose.