# Hand Gesture Recognition for Smart TVs

In this project, a neural network will be built to identify 5 different hand gestures performed by the user in order to control a specific function of a Smart TV.

**Libraries Used:** Numpy, Scipy, Keras, TensorFlow

## List of Contents

1. Introduction
2. Data Preparation
3. Model Building
4. Model Training and Evaluation

## 1. Introduction

### 1.1 - Understanding the Business Problem

A home electronics company manufactures state of the art smart televisions. They want to develop a new feature where the TV can recognize 5 different hand gestures that a user can perform in order to control the TV without the use of a remote. The gestures are continuously monitored by the webcam mounted on the TV. Each gesture corresponds to a specific command:

**Thumbs up:**  Increase the volume <br>
**Thumbs down:** Decrease the volume <br>
**Left swipe:** 'Jump' backwards 10 seconds <br>
**Right swipe:** 'Jump' forward 10 seconds <br>
**Stop:** Pause the movie

### 1.2 - Data Understanding

The training data consists of a few hundred videos categorized into one of the above mentioned 5 classes. Each video is around 2-3 seconds long and is divided into 30 frames or images. These videos have been performed by various people performing one of the five gestures infront of a webcam similar to the one mounted on the smart TV.

There are two data folders and two csv files. 

The first data folder *('train')* contains 5 subfolders for each of the gestures where each subfolder represents a video of a gesture. Each subfolder contains 30 frames representing a video. 

The second data folder *('val')* contains the same as above and is meant to be used for validation purposes.

All images in a particular subfolder have the same dimensions but different videos may have different dimensions. Videos have two type of dimensions, either 360x360 or 120x160. The different dimensions are due to using 2 different webcams.

The first csv file *('train.csv')* is associated with the train folder. Each row of the CSV file represents one video and contains three main pieces of information. The name of the subfolder which contains 30 images of the video, the name of the gesture, and the numeric label (0-4) of the video. The numeric label suggests different people taking the same video.

The second csv file *('val.csv')* is associated with the val folder and follows the same naming structure of its rows as the first csv file.

### 1.3 - Importing Modules and Data

In [1]:
# importing basic modules
import numpy as np
import os
from scipy.misc import imread, imresize
import datetime
import os

The random seed is set to ensure results dont drastically vary.

In [2]:
np.random.seed(30)
import random as rn
rn.seed(30)
from keras import backend as K
import tensorflow as tf
tf.random.set_seed(30)

Using TensorFlow backend.


In [3]:
# importing various keras modules
from keras.models import Sequential, Model
from keras.layers import Dense, GRU, Dropout, Flatten, BatchNormalization, Activation, TimeDistributed
from keras.layers.convolutional import Conv3D, MaxPooling3D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras import optimizers
from keras.applications.vgg16 import VGG16

In [4]:
# importing csv files
train_doc = np.random.permutation(open(r'C:/Users/Avinash Bandlapalli/Desktop/Main Folder/Academics/Post Grad Diploma in Data Science/Courses/Course 6 - Neural Networks and Deep Learning/Module 5 - Gesture Recognition Case Study/Project_data/Project_data/train.csv').readlines())
val_doc = np.random.permutation(open(r'C:/Users/Avinash Bandlapalli/Desktop/Main Folder/Academics/Post Grad Diploma in Data Science/Courses/Course 6 - Neural Networks and Deep Learning/Module 5 - Gesture Recognition Case Study/Project_data/Project_data/val.csv').readlines())
batch_size = 16

Above, the csv files which contain the folder names for training and validation.

The `batch_size` is initialized at .. to ensure GPU usage at maximum capacity.

## 2. Data Preparation

In this sections, we have to normalize and resize images to a specific size.

In [5]:
# initializing for preprocessing
rows = 120 #X 
cols = 160 #Y 
channel = 3 #number of channels in images 3 for color(RGB)
frames=10

In [6]:
#resizing all the images, so we can have all the images in a specific size
def crop_resize_img(img):
    if img.shape[0]!=img.shape[1]:
        img = img[0:120,10:150]
    resized_image = imresize(img,(rows,cols))
    return resized_image

In [7]:
#using percentile to deal with outliers in the data
def normalize_image(img):
    normalized_image= (img - np.min(img))/(np.max(img)- np.min(img))
    return normalized_image

## 3. Model Building

To analyze videos using neural networks, two types of architectures are commonly used. 

First is the standard CNN + RNN architecture in which the images of the videos are passed through a CNN which then extracts a feature vector for each image, and then the sequence of the feature vectors are passed through a RNN. The other architecture used is an extension of CNNs - the 3D convolutional network. 

In this project, both the architectures will be used.

### 3.1 - CNN + RNN architecture

#### 3.1.1 - Defining the Generator Function

In [8]:
# This function preprocesses the images and feeds the data to the model in batches
def fetch_aug_batchdata(source_path, folder_list, batch_num, batch_size, t,validation):
    batch_data = np.zeros((batch_size,frames,rows,cols,channel))
    batch_labels = np.zeros((batch_size,5))
    batch_data_aug,batch_label_aug = batch_data,batch_labels
    batch_data_flip,batch_label_flip = batch_data,batch_labels
    img_idx = [x for x in range(0, 30,2)] 
    for folder in range(batch_size): 
        imgs = sorted(os.listdir(source_path+'/'+ t[folder + (batch_num*batch_size)].split(';')[0]))
        dx, dy = np.random.randint(-1.7, 1.8, 2)
        M = np.float32([[1, 0, dx], [0, 1, dy]])
        for idx, item in enumerate(img_idx):             
            image = cv2.imread(source_path+'/'+ t[folder + (batch_num*batch_size)].strip().split(';')[0]+'/'+imgs[item], cv2.IMREAD_COLOR)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            resized_image=crop_resize_img(image)
            batch_data[folder,idx,:,:,0] = normalize_image(resized_image[:, : , 0])#normalise and feed in the image
            batch_data[folder,idx,:,:,1] = normalize_image(resized_image[:, : , 1])#normalise and feed in the image
            batch_data[folder,idx,:,:,2] = normalize_image(resized_image[:, : , 2])#normalise and feed in the image
            x =resized_image.shape[0]
            y =resized_image.shape[1]
            batch_data_aug[folder,idx] = (cv2.warpAffine(resized_image, M, (x,y)))
            batch_data_flip[folder,idx]= np.flip(resized_image,1)
        batch_labels[folder, int(t[folder + (batch_num*batch_size)].strip().split(';')[2])] = 1
        batch_label_aug[folder, int(t[folder + (batch_num*batch_size)].strip().split(';')[2])] = 1
        if int(t[folder + (batch_num * batch_size)].strip().split(';')[2])==0:
            batch_label_flip[folder, 1] = 1
        elif int(t[folder + (batch_num*batch_size)].strip().split(';')[2])==1:
            batch_label_flip[folder, 0] = 1                    
        else:
            batch_label_flip[folder, int(t[folder + (batch_num*batch_size)].strip().split(';')[2])] = 1
    batch_data_final = np.append(batch_data, batch_data_aug, axis = 0)
    batch_data_final = np.append(batch_data_final, batch_data_flip, axis = 0)
    batch_label_final = np.append(batch_labels, batch_label_aug, axis = 0) 
    batch_label_final = np.append(batch_label_final, batch_label_flip, axis = 0)  
    if validation:
        batch_data_final=batch_data
        batch_label_final= batch_labels       
    return batch_data_final,batch_label_final

In [9]:
def generator1(source_path, folder_list, batch_size, validation=False,ablation=None):
    print('Source path = ', source_path,'; batch size =',batch_size)
    if(ablation!=None):
        folder_list=folder_list[:ablation]
    while True:
        t = np.random.permutation(folder_list)
        num_batches = len(folder_list)//batch_size # calculate the number of batches
        for batch in range(num_batches): # we iterate over the number of batches
            # you yield the batch_data and the batch_labels, remember what does yield do
            yield fetch_aug_batchdata(source_path, folder_list, batch, batch_size, t,validation)
        # Code for the remaining data points which are left after full batches
        if (len(folder_list) != batch_size*num_batches):
            batch_size = len(folder_list) - (batch_size*num_batches)
            yield fetch_aug_batchdata(source_path, folder_list, batch, batch_size, t,validation)

In [10]:
# calculating number of training sequences, validation sequences, and epochs
curr_dt_time = datetime.datetime.now()
train_path = r'C:/Users/Avinash Bandlapalli/Desktop/Main Folder/Academics/Post Grad Diploma in Data Science/Courses/Course 6 - Neural Networks and Deep Learning/Module 5 - Gesture Recognition Case Study/Project_data/Project_data/train'
val_path = r'C:/Users/Avinash Bandlapalli/Desktop/Main Folder/Academics/Post Grad Diploma in Data Science/Courses/Course 6 - Neural Networks and Deep Learning/Module 5 - Gesture Recognition Case Study/Project_data/Project_data/val'
num_train_sequences = len(train_doc)
print('# training sequences =', num_train_sequences)
num_val_sequences = len(val_doc)
print('# validation sequences =', num_val_sequences)
num_epochs = 70
print ('# epochs =', num_epochs)

# training sequences = 663
# validation sequences = 100
# epochs = 70


#### 3.1.2 - Generating and Compling the Model

The model uses `TimeDistributed`, `GRU`, and other RNN structures after transfer learning. Last layer is softmax. The network is created to ensure the model is able to fit in the memory of the webcam.

In [11]:
base_model = VGG16(include_top=False, weights='imagenet', input_shape=(120,120,3))
x = base_model.output
x = Flatten()(x)
#x.add(Dropout(0.5))
features = Dense(64, activation='relu')(x)
conv_model = Model(inputs=base_model.input, outputs=features)
    
for layer in base_model.layers:
    layer.trainable = False
        
model = Sequential()
model.add(TimeDistributed(conv_model, input_shape=(15,120,120,3)))
model.add(GRU(32, return_sequences=True))
model.add(GRU(16))
model.add(Dropout(0.5))
model.add(Dense(8, activation='relu'))
model.add(Dense(5, activation='softmax'))

Now the model has been generated. Next step is to compile it.

In [12]:
# compiling the model and printing summary
sgd = optimizers.SGD(lr=0.001, decay=1e-6, momentum=0.7, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
print (model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
time_distributed_1 (TimeDist (None, 15, 64)            15009664  
_________________________________________________________________
gru_1 (GRU)                  (None, 15, 32)            9312      
_________________________________________________________________
gru_2 (GRU)                  (None, 16)                2352      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 45        
Total params: 15,021,509
Trainable params: 306,821
Non-trainable params: 14,714,688
____________________________________

We can see the total amount of parameters we have to train.

Below the `train_generator` and the `val_generator` will be created to be used in `.fit_generator`.

In [13]:
# creating train and val generators
train_generator = generator1(train_path, train_doc, batch_size)
val_generator = generator1(val_path, val_doc, batch_size)

In [14]:
model_name = 'model_init_conv_lstm' + '_' + str(curr_dt_time).replace(' ','').replace(':','_') + '/'
    
if not os.path.exists(model_name):
    os.mkdir(model_name)
        
filepath = model_name + 'model-{epoch:05d}-{loss:.5f}-{categorical_accuracy:.5f}-{val_loss:.5f}-{val_categorical_accuracy:.5f}.h5'

checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)

LR = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1, mode='min', epsilon=0.0001, cooldown=0, min_lr=0.00001)
callbacks_list = [checkpoint, LR]



In [15]:
if (num_train_sequences%batch_size) == 0:
    steps_per_epoch = int(num_train_sequences/batch_size)
else:
    steps_per_epoch = (num_train_sequences//batch_size) + 1

if (num_val_sequences%batch_size) == 0:
    validation_steps = int(num_val_sequences/batch_size)
else:
    validation_steps = (num_val_sequences//batch_size) + 1

#### 3.1.3 - Fitting the Model

The model will be fit in this section. Checkpoints will save the model at the end of each epoch.

In [None]:
model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, epochs=num_epochs, verbose=1, 
                    callbacks=callbacks_list, validation_data=val_generator, 
                    validation_steps=validation_steps, class_weight=None, workers=1, initial_epoch=0)

Epoch 1/70
Source path =  /notebooks/storage/Final_data/Project_data/train ; batch size = 16
Source path =  /notebooks/storage/Final_data/Project_data/val ; batch size = 16


`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
  del sys.path[0]
`imresize` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``skimage.transform.resize`` instead.


 4/42 [=>............................] - ETA: 1:45 - loss: 1.6539 - categorical_accuracy: 0.2031

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.


 5/42 [==>...........................] - ETA: 1:28 - loss: 1.6864 - categorical_accuracy: 0.2000

`imresize` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``skimage.transform.resize`` instead.



Epoch 00001: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00001-1.68776-0.21719-1.59857-0.28000.h5
Epoch 2/70

Epoch 00002: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00002-1.63751-0.23831-1.58996-0.24000.h5
Epoch 3/70

Epoch 00003: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00003-1.59086-0.24434-1.56217-0.27000.h5
Epoch 4/70

Epoch 00004: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00004-1.54839-0.29864-1.55305-0.30000.h5
Epoch 5/70

Epoch 00005: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00005-1.53755-0.31523-1.50906-0.40000.h5
Epoch 6/70

Epoch 00006: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00006-1.48738-0.34540-1.48694-0.40000.h5
Epoch 7/70

Epoch 00007: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00007-1.45203-0.36953-1.44123-0.40000.h5
Epoch 8/70

Epoch 00008: saving model to model_init_conv_lstm_2018

Epoch 29/70

Epoch 00029: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00029-0.78113-0.78582-1.07283-0.63000.h5
Epoch 30/70

Epoch 00030: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00030-0.78145-0.79336-1.07999-0.65000.h5
Epoch 31/70

Epoch 00031: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00031-0.74719-0.82504-1.08317-0.61000.h5

Epoch 00031: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 32/70

Epoch 00032: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00032-0.78666-0.77828-1.08582-0.63000.h5
Epoch 33/70

Epoch 00033: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00033-0.75921-0.79638-1.08120-0.62000.h5

Epoch 00033: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 34/70

Epoch 00034: saving model to model_init_conv_lstm_2018-10-0414_07_55.144483/model-00034-0.75076-0.81599-1.08205-0.62000.h5
Epoch 35/70

As we can see above, the accuracy of the training set is **81.28%**, while the accuracy of the validation set is **62.0%**.

### 3.2 - Conv3D architecture

The next architecture is on another link here: https://github.com/abandlap/Data-Science-Portfolio/blob/master/Post%20Grad%20Diploma/Hand_Gesture_Recognition_TV_project_part2.ipynb