# Deep Learning for Hand Gesture Recognition with PyTorch: Quickstart

This notebook is derived from a demo PyTorch implementation of the deep learning model for hand gesture recognition introduced in the article [Deep Learning for Hand Gesture Recognition on Skeletal Data](https://ieeexplore.ieee.org/document/8373818) by G. Devineau, F. Moutarde, W. Xi and J. Yang.

If you find this code useful in your research, please consider citing:

```
@inproceedings{devineau2018deep,
  title={Deep learning for hand gesture recognition on skeletal data},
  author={Devineau, Guillaume and Moutarde, Fabien and Xi, Wang and Yang, Jie},
  booktitle={2018 13th IEEE International Conference on Automatic Face \& Gesture Recognition (FG 2018)},
  pages={106--113},
  year={2018},
  organization={IEEE}
}
```



### 0. Understanding the model

You can find an intuitive explanation of the model at: https://github.com/guillaumephd/deep_learning_hand_gesture_recognition

* Training Dataset Specification from: http://www-rech.telecom-lille.fr/shrec2017-hand/


The dataset contains sequences of 14 hand gestures, each performed in two ways: using one finger and using the whole hand. Each gesture is performed between 1 and 10 times by 28 participants in both ways, resulting in 2800 sequences. All participants are right-handed. Sequences are labelled by gesture, number of fingers used, performer and trial. Each frame of a sequence contains a depth image and the coordinates of 22 joints, both in the 2D depth image space and in the 3D world space, forming a full hand skeleton. The dataset was collected with the Intel RealSense short-range depth camera. The depth images and hand skeletons were captured at 30 frames per second, with a depth image resolution of 640x480. The length of sample gestures ranges from 20 to 50 frames.


For a sequence of size N:

* depth_n.png contains the depth image of the nth frame of the sequence.
* general_informations.txt contains a matrix of size Nx5 (one line per frame). The format is as follows: timestamp in 10^-7 seconds, followed by the hand region of interest in the depth image (x, y, width, height).
* skeletons_image.txt contains a matrix of size Nx44. Each line contains the 2D hand joints coordinates in the depth image space. The format is as follows: x1 y1 - x2 y2 - ... - x22 y22.
* skeletons_world.txt contains a matrix of size Nx66. Each line contains the 3D hand joints coordinates in the world space. The format is as follows: x1 y1 z1 - x2 y2 z2 - ... - x22 y22 z22.

  The order of the joints in each line is:
  1. Wrist
  2. Palm
  3. thumb_base
  4. thumb_first_joint
  5. thumb_second_joint
  6. thumb_tip
  7. index_base
  8. index_first_joint
  9. index_second_joint
  10. index_tip
  11. middle_base
  12. middle_first_joint
  13. middle_second_joint
  14. middle_tip
  15. ring_base
  16. ring_first_joint
  17. ring_second_joint
  18. ring_tip
  19. pinky_base
  20. pinky_first_joint
  21. pinky_second_joint
  22. pinky_tip

* train_gestures.txt and test_gestures.txt contain information about the train and test sequences. These files contain 1960 (70%) and 840 (30%) lines respectively. Each line follows this pattern: id_gesture     id_finger    id_subject    id_essai    14_labels    28_labels    size_sequence
* display_sequence.m and display_sequence.py are, respectively, a MATLAB and a Python script that load and display a sequence. Dependencies for the Python script: SciPy, NumPy and Matplotlib.
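
As a rough illustration of the layout described above, the sketch below reshapes the rows of a `skeletons_world.txt`-style file into an array of shape (N, 22, 3). It assumes the values are whitespace-separated, which the specification does not state explicitly. `load_skeleton_world` and `JOINT_NAMES` are hypothetical names introduced here and are not part of the dataset tools.

```python
import io
import numpy as np

# Joint order as listed in the dataset specification above.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
PARTS = ["base", "first_joint", "second_joint", "tip"]
JOINT_NAMES = ["wrist", "palm"] + [f"{f}_{p}" for f in FINGERS for p in PARTS]

def load_skeleton_world(source):
    """Read an N x 66 text matrix and return an array of shape (N, 22, 3)."""
    flat = np.loadtxt(source, ndmin=2)  # assumption: whitespace-separated values
    if flat.shape[1] != 66:
        raise ValueError(f"expected 66 values per frame, got {flat.shape[1]}")
    return flat.reshape(-1, 22, 3)

# Usage with a synthetic two-frame sequence standing in for a real file.
fake_rows = [" ".join(str(float(i + 66 * n)) for i in range(66)) for n in range(2)]
skeleton = load_skeleton_world(io.StringIO("\n".join(fake_rows)))
index_tip = skeleton[:, JOINT_NAMES.index("index_tip"), :]  # (N, 3) trajectory of one joint
```

Keeping the joints on their own axis makes it easy to select a single joint trajectory, while `skeleton.reshape(len(skeleton), 66)` recovers the flat layout the model expects.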



### 1. Imports

In [None]:
from __future__ import unicode_literals, print_function, division
import sys
if sys.version_info.major < 3:
    print('You are using python 2, but you should rather use python 3.')
    print('    If you still want to use python 2, ensure you import:')
    print('    >> from __future__ import unicode_literals, print_function, division')

import numpy
import pandas as pd
import pickle
import torch
import itertools
import time
import math
from torch.utils.data import Dataset, DataLoader
from hashlib import new
import matplotlib.pyplot as plt 
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data.sampler import SubsetRandomSampler

If you encounter a Python error regarding a missing module at some point, uncomment the appropriate line in the cell below and run it.

In [None]:
# (bonus) plot acc with tensorboard
#   Command to start tensorboard if installed (requires tensorflow):
#   $  tensorboard --logdir ./runs
try:
    from tensorboardX import SummaryWriter
except ImportError:
    # tensorboardX is not installed: fall back to a no-op writer
    class SummaryWriter():
        def __init__(self):
            pass
        def add_scalar(self, tag, scalar_value, global_step=None, walltime=None):
            pass

### 2. Hyperparameters

If you use a custom dataset, you'll likely want to change `n_channels` and `n_classes` to match your values.

In [None]:
n_classes = 14
duration = 100
n_channels = 66
learning_rate = 1e-3

### 3. Create a model

In [None]:
class HandGestureNet(torch.nn.Module):
    """
    [Devineau et al., 2018] Deep Learning for Hand Gesture Recognition on Skeletal Data

    Summary
    -------
        Deep Learning Model for Hand Gesture classification using pose data only (no need for RGBD)
        The model computes a succession of [convolutions and pooling] over time independently on each of the 66 (= 22 * 3) sequence channels.
        Each of these computations is actually done at two different resolutions, which are later merged by concatenation
        with the (pooled) original sequence channel.
        Finally, a multi-layer perceptron merges all of the processed channels and outputs a classification.
    
    TL;DR:
    ------
        input ------------------------------------------------> split into n_channels channels [channel_i]
            channel_i ----------------------------------------> 3x [conv/pool/dropout] low_resolution_i
            channel_i ----------------------------------------> 3x [conv/pool/dropout] high_resolution_i
            channel_i ----------------------------------------> pooled_i
            low_resolution_i, high_resolution_i, pooled_i ----> output_channel_i
        MLP(n_channels x [output_channel_i]) -------------------------> classification

    Article / PDF:
    --------------
        https://ieeexplore.ieee.org/document/8373818

    Please cite:
    ------------
        @inproceedings{devineau2018deep,
            title={Deep learning for hand gesture recognition on skeletal data},
            author={Devineau, Guillaume and Moutarde, Fabien and Xi, Wang and Yang, Jie},
            booktitle={2018 13th IEEE International Conference on Automatic Face \& Gesture Recognition (FG 2018)},
            pages={106--113},
            year={2018},
            organization={IEEE}
        }
    """
    
    def __init__(self, n_channels=66, n_classes=14, dropout_probability=0.2):

        super(HandGestureNet, self).__init__()
        
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.dropout_probability = dropout_probability

        # Layers ----------------------------------------------
        self.all_conv_high = torch.nn.ModuleList([torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=1, out_channels=8, kernel_size=7, padding=3),
            torch.nn.ReLU(),
            torch.nn.AvgPool1d(2),

            torch.nn.Conv1d(in_channels=8, out_channels=4, kernel_size=7, padding=3),
            torch.nn.ReLU(),
            torch.nn.AvgPool1d(2),

            torch.nn.Conv1d(in_channels=4, out_channels=4, kernel_size=7, padding=3),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=self.dropout_probability),
            torch.nn.AvgPool1d(2)
        ) for joint in range(n_channels)])

        self.all_conv_low = torch.nn.ModuleList([torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.AvgPool1d(2),

            torch.nn.Conv1d(in_channels=8, out_channels=4, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.AvgPool1d(2),

            torch.nn.Conv1d(in_channels=4, out_channels=4, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=self.dropout_probability),
            torch.nn.AvgPool1d(2)
        ) for joint in range(n_channels)])

        self.all_residual = torch.nn.ModuleList([torch.nn.Sequential(
            torch.nn.AvgPool1d(2),
            torch.nn.AvgPool1d(2),
            torch.nn.AvgPool1d(2)
        ) for joint in range(n_channels)])

        self.fc = torch.nn.Sequential(
            torch.nn.Linear(in_features=9 * n_channels * 12, out_features=1936),  # <-- 12: depends on the sequence length (cf. forward() below)
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=1936, out_features=n_classes)
        )

        # Initialization --------------------------------------
        # Xavier init
        for module in itertools.chain(self.all_conv_high, self.all_conv_low, self.all_residual):
            for layer in module:
                if layer.__class__.__name__ == "Conv1d":
                    torch.nn.init.xavier_uniform_(layer.weight, gain=torch.nn.init.calculate_gain('relu'))
                    torch.nn.init.constant_(layer.bias, 0.1)

        for layer in self.fc:
            if layer.__class__.__name__ == "Linear":
                torch.nn.init.xavier_uniform_(layer.weight, gain=torch.nn.init.calculate_gain('relu'))
                torch.nn.init.constant_(layer.bias, 0.1)

    def forward(self, input):
        """
        This function performs the actual computations of the network for a forward pass.

        Arguments
        ---------
            input: a tensor of gestures of shape (batch_size, duration, n_channels)
                   (where n_channels = 3 * n_joints for 3D pose data)
        """

        # Work on each channel separately
        all_features = []

        for channel in range(0, self.n_channels):
            input_channel = input[:, :, channel]

            # Add a dummy (spatial) dimension for the time convolutions
            # Conv1D format : (batch_size, n_feature_maps, duration)
            input_channel = input_channel.unsqueeze(1)

            high = self.all_conv_high[channel](input_channel)
            low = self.all_conv_low[channel](input_channel)
            ap_residual = self.all_residual[channel](input_channel)

            # Time convolutions are concatenated along the feature maps axis
            output_channel = torch.cat([
                high,
                low,
                ap_residual
            ], dim=1)
            all_features.append(output_channel)

        # Concatenate along the feature maps axis
        all_features = torch.cat(all_features, dim=1)
        
        # Flatten for the Linear layers
        all_features = all_features.view(-1, 9 * self.n_channels * 12)  # <-- 12: depends on the initial sequence length (100).
        # If you have shorter/longer sequences, you probably do NOT even need to modify the network architecture:
        # resampling your input gesture from T timesteps to 100 timesteps will (surprisingly) probably work as well!

        # Fully-Connected Layers
        output = self.fc(all_features)

        return output
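
The constant `9 * n_channels * 12` used by the fully-connected layer follows from the layer definitions above: each channel yields 4 + 4 + 1 feature maps (high branch, low branch, pooled input), and three `AvgPool1d(2)` layers shrink 100 timesteps to 12, while the padded convolutions leave the length unchanged. A small sketch of that arithmetic, using the hypothetical helper names `pooled_length` and `flattened_size`:

```python
def pooled_length(duration, n_pools=3, kernel_size=2):
    """Length after repeated AvgPool1d(kernel_size) with default stride and no padding."""
    for _ in range(n_pools):
        duration = duration // kernel_size  # AvgPool1d floors the output length by default
    return duration

def flattened_size(duration=100, n_channels=66, maps_per_channel=4 + 4 + 1):
    """Number of input features expected by the first Linear layer."""
    return maps_per_channel * n_channels * pooled_length(duration)

print(pooled_length(100))  # 100 -> 50 -> 25 -> 12
print(flattened_size())    # 9 * 66 * 12 = 7128
```

If you change `duration`, the same computation gives the value that would have to replace `12` in both the `Linear` layer and the `view` call.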

In [None]:
# -------------
# Network instantiation
# -------------
model = HandGestureNet(n_channels=n_channels, n_classes=n_classes)
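
The comment in `forward` suggests resampling gestures of other lengths to 100 timesteps instead of changing the architecture. One possible way to do that is linear interpolation along the time axis. The sketch below uses `numpy.interp`; the function name `resample_sequence` is an assumption made for illustration and is not part of the original code.

```python
import numpy as np

def resample_sequence(sequence, target_length=100):
    """Linearly resample a (T, n_channels) gesture to (target_length, n_channels)."""
    sequence = np.asarray(sequence, dtype=np.float32)
    source_length = sequence.shape[0]
    old_t = np.linspace(0.0, 1.0, source_length)
    new_t = np.linspace(0.0, 1.0, target_length)
    # Interpolate every channel independently over normalized time.
    columns = [np.interp(new_t, old_t, sequence[:, c]) for c in range(sequence.shape[1])]
    return np.stack(columns, axis=1).astype(np.float32)

# Random data standing in for a 37-frame gesture with 66 channels.
short_gesture = np.random.rand(37, 66)
resampled = resample_sequence(short_gesture)
print(resampled.shape)  # (100, 66)
```

The first and last frames are preserved, and intermediate frames are blended from their two nearest neighbours in time.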

To use files from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!cp drive/MyDrive/dhg_data.pckl ./

In [None]:
# !cp drive/MyDrive/new_dataset_20_20.pkl ./
# !cp drive/MyDrive/new_dataset_100_20.pkl ./
!cp drive/MyDrive/new_dataset_100_30.pkl ./

In [None]:
!cp drive/MyDrive/gesture_pretrained_model.pt ./
# !cp drive/MyDrive/gesture_trained_20_20.pt ./
# !cp drive/MyDrive/gesture_trained_100_20.pt ./
# !cp drive/MyDrive/gesture_trained_100_30.pt ./

### 4. Load data

We load a dataset already created in numpy format.

In [None]:
# We load a gesture dataset:
#
#   x.shape should be (dataset_size, duration, channel)
#   y.shape should be (dataset_size, 1)


# If you want to use the DHG dataset, go to: https://colab.research.google.com/drive/1ggYG1XRpJ50gVgJqT_uoI257bspNogHj
use_dhg_dataset = True
# Uncomment the line below to load a new dataset
# use_dhg_dataset = False

if use_dhg_dataset:
    # ------------------------
    # DHG Dataset
    # ------------------------
    try:
        # Connect Google Colab instance to Google Drive
        from google.colab import drive
        drive.mount('/gdrive')
        # Load the dataset (that you have already created in the other notebook) from Google Drive
        !cp /gdrive/MyDrive/dhg_data.pckl ./
        #https://drive.google.com/file/d/1BUxbFwcPRrXn3-ZbwVIOTRgn8zkyBU0_/view?usp=sharing
    except:
        print("You're not in a Google Colab!")

    def load_data(filepath='./shrec_data.pckl'):
        """
        Returns hand gesture sequences (X) and their associated labels (Y).
        Each sequence has two different labels.
        The first label  Y describes the gesture class out of 14 possible gestures (e.g. swiping your hand to the right).
        The second label Y describes the gesture class out of 28 possible gestures (e.g. swiping your hand to the right with your index pointed, or not pointed).
        """
        # change 'latin1' to 'utf8' if the data does not load
        with open(filepath, 'rb') as file:
            data = pickle.load(file, encoding='latin1')

        return data['x_train'], data['x_test'], data['y_train_14'], data['y_train_28'], data['y_test_14'], data['y_test_28']

    # The unpacking order must match the order of the values returned by load_data
    x_train, x_test, y_train_14, y_train_28, y_test_14, y_test_28 = load_data('dhg_data.pckl')
    y_train_14, y_test_14 = numpy.array(y_train_14), numpy.array(y_test_14)
    y_train_28, y_test_28 = numpy.array(y_train_28), numpy.array(y_test_28)
    if n_classes == 14:
        y_train = y_train_14
        y_test = y_test_14
    elif n_classes == 28:
        y_train = y_train_28
        y_test = y_test_28

else:
    # ------------------------
    # Custom Dataset
    # ------------------------
    # On the left bar of this colaboratory notebook there is a section called "Files".
    # Upload your files there and use a path like "/content/each_file_you_just_uploaded" to load your data.
    #
    # Here we load a custom dataset that was previously saved as a pickle file.

    print("Loading the custom dataset.")

    def load_data(filepath='./new_dataset_100_30.pkl'):
 
        # change 'latin1' to 'utf8' if the data does not load
        with open(filepath, 'rb') as file:
            data = pickle.load(file, encoding='latin1')

        return data['x_train'], data['y_train'], data['x_test'], data['y_test']
    
    x_train, y_train, x_test, y_test = load_data('new_dataset_100_30.pkl')
    x_train, x_test = numpy.array(x_train), numpy.array(x_test)
    y_train, y_test = numpy.array(y_train), numpy.array(y_test)

Print the array shapes (DHG dataset only):

In [None]:
x_train.shape, x_test.shape, y_train_14.shape, y_test_14.shape, y_train_28.shape, y_test_28.shape

We now convert the NumPy data into the torch format, and create a PyTorch dataset.

In [None]:
class GestureDataset(Dataset):
 
    def __init__(self, x, y):
        self.x = x
        self.y = y
 
    def __len__(self):
        return len(self.x)
 
    def __getitem__(self, i):
        return self.x[i], self.y[i]

In [None]:
# ------------------------
# Create pytorch datasets and dataloaders:
# ------------------------
# Convert from numpy to torch format
x_train, x_test = torch.from_numpy(x_train), torch.from_numpy(x_test)
y_train, y_test = torch.from_numpy(y_train), torch.from_numpy(y_test)

# Ensure the label values are between 0 and n_classes-1
if y_train.min() > 0:
  y_train = y_train - 1
if y_test.min() > 0:
  y_test = y_test - 1

# Ensure the data type is correct
x_train, x_test = x_train.float(), x_test.float()
y_train, y_test = y_train.long(), y_test.long()

# Create the datasets
train_dataset = GestureDataset(x=x_train, y=y_train)
test_dataset = GestureDataset(x=x_test, y=y_test)

# Pytorch dataloaders are used to group dataset items into batches
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
test_dataloader  = DataLoader(test_dataset,  batch_size=32, shuffle=True, num_workers=4)

We define some functions we'll need later on.

In [None]:
def time_since(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '{:02d}m {:02d}s'.format(int(m), int(s))


def get_accuracy(model, x, y_ref):
    """Get the accuracy of the pytorch model on a batch"""
    acc = 0.
    model.eval()
    with torch.no_grad():
        predicted = model(x)
        _, predicted = predicted.max(dim=1)
        acc = 1.0 * (predicted == y_ref).sum().item() / y_ref.shape[0]
    return acc
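
`get_accuracy` scores a whole tensor in a single forward pass. For datasets too large to push through the model at once, a variant that iterates over a `DataLoader` may be preferable. The following is a minimal sketch under that assumption: `get_accuracy_batched` is a hypothetical helper, and the tiny linear model only stands in for `HandGestureNet`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def get_accuracy_batched(model, dataloader, device=torch.device("cpu")):
    """Accuracy of `model` accumulated over every batch yielded by `dataloader`."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            predicted = model(x).argmax(dim=1)
            correct += (predicted == y).sum().item()
            total += y.shape[0]
    return correct / max(total, 1)

# Stand-in data and model, only to show the call.
x_demo = torch.randn(10, 4)
y_demo = torch.randint(0, 3, (10,))
demo_loader = DataLoader(TensorDataset(x_demo, y_demo), batch_size=4)
demo_model = torch.nn.Linear(4, 3)
accuracy = get_accuracy_batched(demo_model, demo_loader)
```

Counting correct predictions and samples separately keeps the result exact even when the last batch is smaller than the others.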

### 5. Training the model

Note: reduce the learning rate (in the hyperparameters section) to get smoother accuracy curves.

In [None]:
# -----------------------------------------------------
# Loss function & Optimizer
# -----------------------------------------------------
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)

Note: this very basic training loop is intended for demonstration purposes only.

You should improve it with, at the very least, a proper evaluation step (missing here), plus whatever else you need: cross-validation, early stopping, model checkpoints, and so on.
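
As one example of such an improvement, early stopping can be sketched as a small helper that tracks the best validation loss. The class below is an illustrative assumption: the name `EarlyStopping` and its `patience` and `min_delta` parameters are not from the original code. Inside a training loop, one would call `step` once per epoch with a validation loss and save a checkpoint whenever it reports an improvement.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, loss):
        """Record one epoch and return (improved, should_stop)."""
        improved = loss < self.best_loss - self.min_delta
        if improved:
            self.best_loss = loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        should_stop = self.epochs_without_improvement >= self.patience
        return improved, should_stop

# Usage with made-up validation losses (not measured results).
stopper = EarlyStopping(patience=2)
history = []
for epoch, val_loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.5]):
    improved, should_stop = stopper.step(val_loss)
    history.append((epoch, improved, should_stop))
    if should_stop:
        break
```

With `patience=2`, the loop above stops after the second consecutive epoch without improvement and never reaches the final value.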

In [None]:
# -------------
# Training
# -------------


def train(model, criterion, optimizer, dataloader,
          x_train, y_train, x_test, y_test,
          force_cpu=False, num_epochs=5):
    
    # use a GPU (for speed) if you have one
    device = torch.device("cuda") if torch.cuda.is_available() and not force_cpu else torch.device("cpu")
    model = model.to(device)
    x_train, y_train = x_train.to(device), y_train.to(device)
    x_test, y_test = x_test.to(device), y_test.to(device)

    # (bonus) log accuracy values to visualize them in tensorboard:
    writer = SummaryWriter()
    
    # Training starting time
    start = time.time()

    print('[INFO] Started to train the model.')
    print('Training the model on {}.'.format('GPU' if device == torch.device('cuda') else 'CPU'))

    data = {
            "Epoch": [],
            "Time Elapsed": [],
            "Loss": [],
            "Accuracy Train": [],
            "Accuracy Test": []
        }
    
    for ep in range(num_epochs):

        # Ensure we're still in training mode
        model.train()

        current_loss = 0.0

        for idx_batch, batch in enumerate(dataloader):

            # Move data to GPU, if available
            x, y = batch
            x, y = x.to(device), y.to(device)

            # zero the gradient parameters
            optimizer.zero_grad()

            # forward
            y_pred = model(x)

            # backward + optimize
            # backward
            loss = criterion(y_pred, y)
            loss.backward()
            # optimize
            optimizer.step()
            # for an easy access
            current_loss += loss.item()
        
        train_acc = get_accuracy(model, x_train, y_train)
        test_acc = get_accuracy(model, x_test, y_test)
        
        writer.add_scalar('data/accuracy_train', train_acc, ep)
        writer.add_scalar('data/accuracy_test', test_acc, ep)
        print('Epoch #{:03d} | Time elapsed : {} | Loss : {:.4e} | Accuracy_train : {:.2f}% | Accuracy_test : {:.2f}% '.format(
                ep + 1, time_since(start), current_loss, 100 * train_acc, 100 * test_acc))
        data["Epoch"].append(ep + 1)
        data["Time Elapsed"].append(time_since(start))
        data["Loss"].append(current_loss)
        data["Accuracy Train"].append(100 * train_acc)
        data["Accuracy Test"].append(100 * test_acc)


    print('[INFO] Finished training the model. Total time : {}.'.format(time_since(start)))
    # Change the name of the .csv according to the dataset
    pd.DataFrame(data).to_csv('prediction_balanced_medium_dataset.csv')

You can now train the model on your dataset!

Note: You can use the GPUs provided by Google Colab to make the training faster: go to the “runtime” dropdown menu, select “change runtime type” and select GPU in the hardware accelerator drop-down menu.

In [None]:
# Please adjust the training epochs count, and the other hyperparams (lr, dropout, ...), for a non-overfitted training according to your own needs.
# tip: use tensorboard to display the accuracy (see cells above for tensorboard usage)

num_epochs = 20

train(model=model, criterion=criterion, optimizer=optimizer, dataloader=train_dataloader,
      x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test,
      num_epochs=num_epochs)

### 6. (When you're happy with the training:) Save the trained model

In [None]:
torch.save(model.state_dict(), '/content/drive/MyDrive/gesture_trained_100_30.pt')
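
If you plan to resume training later, it can help to store the optimizer state and the epoch count next to the model weights. The sketch below shows one way to do this with `torch.save` and `torch.load`; the dictionary keys, the temporary path and the small stand-in model are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile
import torch

demo_model = torch.nn.Linear(4, 2)  # stand-in for HandGestureNet
demo_optimizer = torch.optim.Adam(demo_model.parameters(), lr=1e-3)

# Hypothetical checkpoint location; in Colab this could be a Google Drive path instead.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pt")
torch.save({
    "epoch": 20,
    "model_state_dict": demo_model.state_dict(),
    "optimizer_state_dict": demo_optimizer.state_dict(),
}, checkpoint_path)

# map_location lets a checkpoint saved on a GPU be restored on a CPU-only machine.
checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu"))
restored_model = torch.nn.Linear(4, 2)
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-3)
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```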

### 7. Get a trained model

In [None]:
# Reminder: first redefine/load the HandGestureNet class before you use it, if you want to use it elsewhere
model = HandGestureNet(n_channels=n_channels, n_classes=14)
model.load_state_dict(torch.load('gesture_pretrained_model.pt'))
# If loading fails, check whether the current runtime has a GPU: if none is available, use the line below instead.
# model.load_state_dict(torch.load('gesture_pretrained_model.pt', map_location=torch.device('cpu')))
model.eval()

# make predictions
with torch.no_grad():
    demo_gesture_batch = torch.randn(32, duration, n_channels)
    predictions = model(demo_gesture_batch)
    _, predictions = predictions.max(dim=1)
    print("Predicted gesture classes: {}".format(predictions.tolist()))

In [None]:
# play with the model!
n_classes = 15
duration = 100
n_channels = 66
learning_rate = 1e-3
num_epochs = 25

In [None]:
for param in model.parameters():
    param.requires_grad = False

# Parameters of newly constructed modules have requires_grad=True by default
# num_ftrs = model.fc.in_features
model.fc = torch.nn.Sequential(
            torch.nn.Linear(in_features=9 * n_channels * 12, out_features=1936),  # <-- 12: depends on the sequence length (100 timesteps)
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=1936, out_features=n_classes)
        )

# Xavier initialization (copied from the main paper code)
for layer in model.fc:
    if layer.__class__.__name__ == "Linear":
        torch.nn.init.xavier_uniform_(layer.weight, gain=torch.nn.init.calculate_gain('relu'))
        torch.nn.init.constant_(layer.bias, 0.1)
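
After freezing the pretrained layers, only the parameters of the new `fc` head have `requires_grad=True`. A common pattern, sketched here on a small stand-in model rather than on `HandGestureNet`, is to count the trainable parameters and hand only those to the optimizer. The names `TinyNet`, `backbone` and `head_optimizer` are hypothetical.

```python
import torch

class TinyNet(torch.nn.Module):
    """Stand-in for a pretrained network with a replaceable `fc` head."""

    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(8, 16)
        self.fc = torch.nn.Linear(16, 14)

    def forward(self, x):
        return self.fc(torch.relu(self.backbone(x)))

net = TinyNet()

# Freeze everything, then attach a fresh head (trainable by default).
for param in net.parameters():
    param.requires_grad = False
net.fc = torch.nn.Linear(16, 15)

trainable = [p for p in net.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in net.parameters())
print("trainable parameters: {} / {}".format(n_trainable, n_total))

# Only the unfrozen parameters are given to the optimizer.
head_optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Checking the trainable count before training is a quick way to confirm that the freeze was applied to the layers you intended.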

In [None]:
# -----------------------------------------------------
# Loss function & Optimizer
# -----------------------------------------------------
optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)

# Please adjust the training epochs count, and the other hyperparams (lr, dropout, ...), for a non-overfitted training according to your own needs.
# tip: use tensorboard to display the accuracy (see cells above for tensorboard usage)

train(model=model, criterion=criterion, optimizer=optimizer, dataloader=train_dataloader,
      x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, num_epochs=num_epochs)

## Confusion Matrix Plots

In [None]:
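model.eval()
# make predictions on whichever device the model currently lives on
device = next(model.parameters()).device
with torch.no_grad():
    yhat_test = model(x_test.to(device))
    _, yhat_test = yhat_test.max(dim=1)
    print("Predicted gesture classes: {}".format(yhat_test.tolist()))

from sklearn.metrics import confusion_matrix
import seaborn as sn

# rows: true classes, columns: predicted classes
cf_matrix = confusion_matrix(y_test.cpu().numpy(), yhat_test.cpu().numpy())
plt.figure(figsize=(12, 7))
sn.heatmap(cf_matrix, annot=True)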
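
Raw counts can be hard to compare when classes have different numbers of test samples. A possible follow-up, sketched with NumPy on a made-up 3-class matrix, is to normalize each row so that the diagonal reads as per-class recall. `normalize_rows` is a hypothetical helper name.

```python
import numpy as np

def normalize_rows(confusion):
    """Divide each row by its sum so that entry (i, j) estimates P(predicted=j | true=i)."""
    confusion = np.asarray(confusion, dtype=np.float64)
    row_sums = confusion.sum(axis=1, keepdims=True)
    # Leave all-zero rows at zero to avoid dividing by zero for absent classes.
    return np.divide(confusion, row_sums, out=np.zeros_like(confusion), where=row_sums > 0)

# Illustrative counts only (rows: true class, columns: predicted class).
demo_counts = np.array([[8, 2, 0],
                        [1, 9, 0],
                        [0, 0, 0]])
normalized = normalize_rows(demo_counts)
per_class_recall = np.diag(normalized)
```

The normalized matrix can be passed to a heatmap in place of the raw counts.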

In [None]:
data= pd.read_csv("prediction_balanced_medium_dataset.csv")
data

## Training and Test Accuracy Plots

In [None]:
df = pd.read_csv("prediction_balanced_medium_dataset.csv")

properties = list(df.columns.values)
properties.remove('Unnamed: 0')

accuracy_train = df['Accuracy Train']
accuracy_test = df['Accuracy Test']


In [None]:
# print(accuracy_train)
# print(accuracy_test)
epochs = range(1, len(accuracy_train) + 1)  # one point per logged epoch
plt.plot(epochs, accuracy_train, 'g', label='Training accuracy')
plt.plot(epochs, accuracy_test, 'b', label='Test accuracy')
plt.title('Training and Test accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()