# Using Envision's Automatic Hand Gesture Detection PyPI package (envisionhgdetector)

<br>
<div align="center">Wim Pouw (wim.pouw@donders.ru.nl)</div>

<img src="Images/envision_banner.png" alt="isolated" width="300"/>

<img src="Images/ex.gif">

## Info
In this notebook we use an envisionBOX Python package called `envisionhgdetector`, which contains functions to annotate gestures automatically. In another envisionBOX module on training a gesture classifier we showed an [end-to-end pipeline](https://github.com/WimPouw/envisionBOX_modulesWP/tree/main/UsingEnvisionHGdetector_package) for training a model on particular human behaviors (e.g., head nodding, clapping) and then running inference on new videos. This package builds on that work: we have trained a convolutional neural network to differentiate gestures from no gestures (the latter including self-adaptors). The model is trained on the SAGA dataset, the Zhubo dataset, and the TED M3D dataset. Because the training data spans several datasets and camera angles and contains more than 9000 gestures, the detector can be applied to somewhat more varied settings than a model trained on a single dataset.

Now, don't get too excited! The performance is not extraordinary, and the package still awaits proper testing and updates with better-trained models (we are working on it...). It currently does not differentiate types of gestures (as far as that is possible; we are working on that too). But it is good enough for a quick pass over a set of videos to pull out some prominent gestures. Once we have the gestures, we can do all kinds of other interesting things, e.g., generate gesture kinematic statistics or gesture networks. And now all automatically!

## Package info
https://pypi.org/project/envisionhgdetector/

### What does envisionhgdetector do
* It tracks upper-body, hand, and face landmarks (generating 29 features).
* It makes an inference based on 25 frames of data, assigning one of three labels: no gesture (the default, implicit label), gesture (label: Gesture), or movement that is not a gesture (label: Move).
* It outputs a labeled video, an ELAN file, a confidence time series, and a gesture segment list (with labels and start and end times).
* UPCOMING: it will add further analyses of the gestures it isolates.
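
To make the 25-frame inference a bit more concrete, here is a minimal sketch of how a per-frame feature sequence can be cut into 25-frame windows. This is an illustration only and not the package's actual code: the function name `sliding_windows`, the step size of one frame, and the all-zero placeholder features are assumptions.

```python
def sliding_windows(frames, window=25, step=1):
    """Cut a list of per-frame feature vectors into overlapping windows.

    `step=1` is an assumption for illustration; the package may use
    a different stride.
    """
    last_start = len(frames) - window
    return [frames[i:i + window] for i in range(0, last_start + 1, step)]


# Placeholder data: 60 frames, each with 29 features (all zeros here).
n_frames, n_features = 60, 29
frames = [[0.0] * n_features for _ in range(n_frames)]

windows = sliding_windows(frames)

# Each window holds 25 frames of 29 features; a classifier would
# produce one prediction per window.
print(len(windows), len(windows[0]), len(windows[0][0]))
```

A video shorter than 25 frames yields no windows in this sketch, so no prediction could be made for it.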

## Installation
It is best to install in a conda environment. 

conda create -n envision python=3.9

conda activate envision

Then proceed: 

pip install -r requirements.txt

## Citation for this notebook
* Pouw, W. (2024). EnvisionBOX modules for social signal processing (Version 1.0.0) [Computer software]. https://github.com/WimPouw/envisionBOX_modulesWP

## Citation
If you use this package, please cite:

* Pouw, W. (2024). envisionhgdetector: Hand Gesture Detection Using a Convolutional Neural Network (Version 0.0.2) [Computer software]. https://github.com/WimPouw/envisionhgdetector

### Citations for the packages and datasets
Original Nodding Pigeon training code:
* Yung, B. (2022). Nodding Pigeon (Version 0.6.0) [Computer software]. https://github.com/bhky/nodding-pigeon

Zhubo dataset (used for training):
* Bao, Y., Weng, D., & Gao, N. (2024). Editable Co-Speech Gesture Synthesis Enhanced with Individual Representative Gestures. Electronics, 13(16), 3315.

SAGA dataset (used for training):
* Lücking, A., Bergmann, K., Hahn, F., Kopp, S., & Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 workshop: Multimodal corpora–advances in capturing, coding and analyzing multimodality.

TED M3D dataset (used for training):
* Rohrer, P. (2022). A temporal and pragmatic analysis of gesture-speech association: A corpus-based approach using the novel MultiModal MultiDimensional (M3D) labeling system [Doctoral dissertation, Nantes Université; Universitat Pompeu Fabra].

MediaPipe:
* Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.

# Let's get started
For this tutorial, I have two videos that I would like to segment for hand gestures. Both live in the folder `./videos_to_label/`.

In [2]:
import os
import glob

videofoldertoday = './videos_to_label/'
outputfolder = './output/'

# make sure the folder for the re-rendered preview videos exists
os.makedirs('./temp/', exist_ok=True)

In [3]:
from moviepy import VideoFileClip

# list all videos in the folder
videos = glob.glob(videofoldertoday + '*.mp4')

# these are opencv videos, so we re-render them with moviepy to show them in this notebook
clip = VideoFileClip(videos[0])
clip.write_videofile("./temp/example_1.mp4")
clip = VideoFileClip(videos[1])
clip.write_videofile("./temp/example_2.mp4")

MoviePy - Building video ./temp/example_1.mp4.
MoviePy - Writing audio in example_1TEMP_MPY_wvf_snd.mp3
MoviePy - Done.
MoviePy - Writing video ./temp/example_1.mp4
MoviePy - Done !
MoviePy - video ready ./temp/example_1.mp4
MoviePy - Building video ./temp/example_2.mp4.
MoviePy - Writing audio in example_2TEMP_MPY_wvf_snd.mp3
MoviePy - Done.
MoviePy - Writing video ./temp/example_2.mp4
MoviePy - Done !
MoviePy - video ready ./temp/example_2.mp4

In [4]:
# now show the two videos
from IPython.display import Video
Video("./temp/example_1.mp4", width=640, height=480)


In [5]:
Video("./temp/example_2.mp4", width=640, height=480)

From the PyPI package info we see that we can get started with just this:

```
from envisionhgdetector import GestureDetector

# Initialize detector
detector = GestureDetector(
    motion_threshold=0.8,    # Sensitivity to motion
    gesture_threshold=0.8,   # Confidence threshold for gestures
    min_gap_s=0.3,          # Minimum gap between gestures
    min_length_s=0.3        # Minimum gesture duration
)

# Process videos
results = detector.process_folder(
    input_folder="path/to/videos",   # keyword used in the executed cell below
    output_folder="path/to/output"
)
```

## Play around
The gesture annotations can be fine-tuned with four settings:
1. `motion_threshold`: the confidence level required to count something as movement
2. `gesture_threshold`: given movement, the confidence level required for the Gesture or Move category
3. `min_gap_s`: gestures separated by a gap shorter than this many seconds are merged into one
4. `min_length_s`: the shortest gesture you want to keep (shorter ones are removed)
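
To get a feeling for what the last two settings do, here is a minimal sketch of merging and filtering segments. It is not the package's own implementation: the function name `merge_and_filter`, the toy segments, and the choice to treat a gap exactly equal to `min_gap_s` as mergeable are assumptions for illustration.

```python
def merge_and_filter(segments, min_gap_s=0.3, min_length_s=0.3):
    """Merge segments separated by short gaps, then drop short segments.

    `segments` is a list of (start, end) tuples in seconds.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= min_gap_s:
            # Gap to the previous segment is small: extend that segment.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    # Keep only segments that are long enough after merging.
    return [(s, e) for s, e in merged if e - s >= min_length_s]


# Toy segments (hypothetical): two bursts close together, one very
# short blip, and one isolated longer gesture.
raw = [(1.0, 1.4), (1.5, 2.0), (3.0, 3.1), (5.0, 6.0)]
cleaned = merge_and_filter(raw, min_gap_s=0.2, min_length_s=0.5)
print(cleaned)
```

With these toy values the first two segments are merged because the gap between them is about 0.1 s, and the short blip starting at 3.0 s is removed. Merging before filtering means that two short bursts can still survive together as one gesture.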


In [6]:
from envisionhgdetector import GestureDetector
import os

# absolute path 
videofoldertoday = os.path.abspath('./videos_to_label/')
outputfolder = os.path.abspath('./output/')

# create a detector object
detector = GestureDetector(motion_threshold=0.9, gesture_threshold=0.9, min_gap_s=0.2, min_length_s=0.5)

# just do the detection on the folder
detector.process_folder(
    input_folder=videofoldertoday,
    output_folder=outputfolder,
)


Successfully loaded weights from d:\Programs\Conda_packages\envs\envision\lib\site-packages\envisionhgdetector\model\SAGAplus_gesturenogesture_trained_binaryCNNmodel_weightsv1.h5

Processing videoplayback (2).mp4...
Generating labeled video...
Generating elan file...
Done processing videoplayback (2).mp4, go look in the output folder

Processing videoplayback (2)_2_1.mp4...
Generating labeled video...
Generating elan file...
Done processing videoplayback (2)_2_1.mp4, go look in the output folder


{'videoplayback (2).mp4': {'stats': {'average_motion': 0.7977464066799542,
   'average_gesture': 0.9636104600598113,
   'average_move': 0.036389540797428725},
  'output_path': 'd:\\Research_projects\\envisionBOX_modulesWP\\UsingEnvisionHGdetector_package\\output\\videoplayback (2).mp4.eaf'},
 'videoplayback (2)_2_1.mp4': {'stats': {'average_motion': 0.7232601379768716,
   'average_gesture': 0.9748914827380264,
   'average_move': 0.025108517972719773},
  'output_path': 'd:\\Research_projects\\envisionBOX_modulesWP\\UsingEnvisionHGdetector_package\\output\\videoplayback (2)_2_1.mp4.eaf'}}
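
The returned dictionary is keyed by video file name, so it is easy to loop over. The sketch below copies the `stats` values from the output above into a literal so that it runs on its own (the `output_path` entries are left out for brevity); in the notebook you would instead assign the return value of `detector.process_folder(...)` to a variable such as `results`, which is a name chosen here for illustration.

```python
# Stats copied from the output above; output_path entries omitted.
results = {
    "videoplayback (2).mp4": {
        "stats": {
            "average_motion": 0.7977464066799542,
            "average_gesture": 0.9636104600598113,
            "average_move": 0.036389540797428725,
        }
    },
    "videoplayback (2)_2_1.mp4": {
        "stats": {
            "average_motion": 0.7232601379768716,
            "average_gesture": 0.9748914827380264,
            "average_move": 0.025108517972719773,
        }
    },
}

# Collect one rounded summary line per video.
summary = {}
for video, info in results.items():
    stats = info["stats"]
    summary[video] = {name: round(value, 2) for name, value in stats.items()}
    print(video, summary[video])
```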

In [8]:
import pandas as pd
import os
# list the output files
outputfiles = glob.glob(outputfolder + '/*')
for file in outputfiles:
    print(os.path.basename(file))

# load one of the predictions
csvfilessegments = glob.glob(outputfolder + '/*segments.csv')
df = pd.read_csv(csvfilessegments[0])
df.head()

labeled_videoplayback (2).mp4
labeled_videoplayback (2)_2_1.mp4
videoplayback (2).mp4.eaf
videoplayback (2).mp4_predictions.csv
videoplayback (2).mp4_segments.csv
videoplayback (2)_2_1.mp4.eaf
videoplayback (2)_2_1.mp4_predictions.csv
videoplayback (2)_2_1.mp4_segments.csv


| Unnamed: 0 | start_time | end_time | labelid | label | duration |
|---|---|---|---|---|---|
| 0 | 1.310345 | 3.862069 | 1 | Gesture | 2.551724 |
| 1 | 5.034483 | 5.724138 | 2 | Gesture | 0.689655 |
| 2 | 7.551724 | 8.37931 | 3 | Gesture | 0.827586 |
| 3 | 8.586207 | 9.931034 | 4 | Gesture | 1.344828 |
| 4 | 11.689655 | 12.413793 | 5 | Gesture | 0.724138 |
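
Once the segments are loaded, simple summaries are easy to compute. The sketch below uses only the five rows shown above, pasted in as CSV text so that it runs on its own with the standard library (the empty first header name corresponds to the `Unnamed: 0` column). The variable names and the choice of summaries are mine; with pandas you could compute the same values from `df["duration"]`.

```python
import csv
import io

# The five rows shown above, as CSV text.
segments_csv = """\
,start_time,end_time,labelid,label,duration
0,1.310345,3.862069,1,Gesture,2.551724
1,5.034483,5.724138,2,Gesture,0.689655
2,7.551724,8.37931,3,Gesture,0.827586
3,8.586207,9.931034,4,Gesture,1.344828
4,11.689655,12.413793,5,Gesture,0.724138
"""

rows = list(csv.DictReader(io.StringIO(segments_csv)))
durations = [float(r["duration"]) for r in rows if r["label"] == "Gesture"]

n_gestures = len(durations)
total_gesture_time = sum(durations)
mean_duration = total_gesture_time / n_gestures

# Share of time spent gesturing between the first onset and last offset
# of these rows (not the full video duration, which is not in this file).
span = float(rows[-1]["end_time"]) - float(rows[0]["start_time"])
gesture_proportion = total_gesture_time / span

print(f"gestures: {n_gestures}")
print(f"total gesture time: {total_gesture_time:.2f} s")
print(f"mean duration: {mean_duration:.2f} s")
print(f"proportion of span gesturing: {gesture_proportion:.2f}")
```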


# Now assess the labeled video data


In [9]:
videoslabeled = glob.glob(outputfolder + '/*.mp4')

# re-render the second labeled video so we can show it in this notebook
clip = VideoFileClip(videoslabeled[1])
clip.write_videofile("./temp/example_2_labeled.mp4")
Video("./temp/example_2_labeled.mp4", width=640, height=480)


MoviePy - Building video ./temp/example_2_labeled.mp4.
MoviePy - Writing video ./temp/example_2_labeled.mp4
MoviePy - Done !
MoviePy - video ready ./temp/example_2_labeled.mp4


In [10]:
videoslabeled = glob.glob(outputfolder + '/*.mp4')

# and the first labeled video
clip = VideoFileClip(videoslabeled[0])
clip.write_videofile("./temp/example_1_labeled.mp4")
Video("./temp/example_1_labeled.mp4", width=640, height=480)


MoviePy - Building video ./temp/example_1_labeled.mp4.
MoviePy - Writing video ./temp/example_1_labeled.mp4
MoviePy - Done !
MoviePy - video ready ./temp/example_1_labeled.mp4




# Concluding remarks
It is important to test the accuracy of your classifier against hand-labeled data that was not used to train the model. You would then report a confusion matrix (e.g., false positive rate, hits, etc.) or the machine-human interrater reliability. In the future I would like to add such code to this module, as well as train a general model for detecting gestures in general. If you have data suitable for this and would like to use it for that purpose, you can contact me (wim.pouw@donders.ru.nl). In general it would be great to know whether this module is valuable for your behaviors of interest, and to learn the boundary conditions of this pipeline.
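
As a starting point for such a check, here is a minimal sketch that builds a confusion table and computes Cohen's kappa from two frame-by-frame label sequences, one hand-labeled and one from the detector. The label names, the toy sequences, and the function name `confusion_and_kappa` are made up for illustration; turning ELAN annotations or the segment list into per-frame labels is left out.

```python
from collections import Counter


def confusion_and_kappa(human, machine):
    """Return (confusion counts, Cohen's kappa) for two label sequences."""
    if len(human) != len(machine) or not human:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(human)
    labels = sorted(set(human) | set(machine))

    # confusion[(h, m)] = number of frames labeled h by hand and m by machine
    confusion = Counter(zip(human, machine))

    # Observed agreement: proportion of frames with identical labels.
    observed = sum(confusion[(lab, lab)] for lab in labels) / n

    # Chance agreement from the marginal label frequencies.
    h_counts, m_counts = Counter(human), Counter(machine)
    expected = sum(h_counts[lab] * m_counts[lab] for lab in labels) / n**2

    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return confusion, kappa


# Toy per-frame labels (hypothetical): the detector starts the gesture
# one frame too early compared to the human coder.
human = ["NoGesture"] * 4 + ["Gesture"] * 4 + ["NoGesture"] * 2
machine = ["NoGesture"] * 3 + ["Gesture"] * 5 + ["NoGesture"] * 2

confusion, kappa = confusion_and_kappa(human, machine)
for (h, m), count in sorted(confusion.items()):
    print(f"human={h:<10} machine={m:<10} frames={count}")
print(f"kappa = {kappa:.2f}")
```

In this toy case 9 of 10 frames agree and chance agreement is 0.5, which gives a kappa of 0.8. Frame-level agreement is only one option; you could also score agreement per gesture segment, e.g., by temporal overlap.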