# CNN to recognize sign language

#### Have you ever wondered how a person with a speech or hearing impairment communicates? Often, through sign language. But sign language only fulfils its purpose when both the signer and the observer know it. One can always learn sign language, but with the advent of machine learning, let's instead learn how to implement a sign language interpreter.

<img src='ASL.png'></img>


Post your doubts, feedback or discussion in the appropriate section of our FB group unit <a href='https://www.facebook.com/groups/colearninglounge/learning_content/?filter=465390340908184'>here</a>.

## Table of Contents:
<ul>
    <li>Introduction</li>
    <li>Import the libraries</li>
    <li>Load the dataset</li>
    <li>Explore the dataset</li>
    <li>Preprocess the data</li>
    <ul>
        <li> Read and resize the images</li>
        <li> Perform train, validation and test split</li>
    </ul>
    <li>Build the model</li>
    <li>Test the model</li>
    <li>Summary</li>
    <li>Credits</li>
    
</ul>

### Introduction

In this tutorial, we will learn how to classify images of sign language hand gestures using a convolutional neural network, `AlexNet`. We will start by importing and exploring a Kaggle dataset consisting of images of hand gestures in American Sign Language (ASL) for every letter of the English alphabet. For simplicity, we will consider only the first 3 letters, 'A', 'B' and 'C', to train, validate and test our network. Second, we will process the data to obtain train, validation and test sets in the format our neural network expects. Third, we will build and train an AlexNet to classify the data into the 3 categories. Finally, we will evaluate the model on the test data and obtain an accuracy score.

### Import the libraries
Python provides a variety of libraries that ease the computational challenges of coding and make relatively complex problems easier to handle. Here we import the essential libraries for the hand gesture image classification task.

<ul> 
    <li> <b> Cv2 </b> : This is the Python interface to OpenCV, an open source computer vision library which provides programming functionality for real-time computer vision. </li>
    <li> <b> OS </b> : This library allows you to interface with the Operating System (OS), and provides OS related functionality. </li>
    <li> <b> Sklearn </b> : Scikit-learn provides machine learning utilities; here we use it to split the data and to compute the accuracy score. </li>
    <li> <b> Random </b> : This library is used to generate pseudo random numbers for different distributions. </li>
    <li> <b> Numpy </b> : Numerical Python works on an N-dimensional array object and provides basic and complex mathematical functionality for it. </li>
    <li> <b> Pandas </b> : This library provides data analysis tools; here we use it to convert the letter labels into integer codes. </li>
    <li> <b> Matplotlib </b> : This library provides data visualization functionality. </li>
    <li> <b> Keras </b> : This library provides a convenient way of building neural network based models and uses TensorFlow, CNTK or Theano as the backend. </li>
    <li> <b> Shutil </b> : This library provides functionality to deal with files or collections of files. </li>
    <li> <b> Warnings </b> : This library controls how warnings are handled. </li>
</ul>

In [1]:
import cv2
import os
from sklearn import model_selection
import random
from random import shuffle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras
from keras.utils import np_utils
from shutil import unpack_archive

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

Using TensorFlow backend.


Now that the libraries have been imported successfully, let's move on to loading the data.


### Load the dataset

You can download the dataset from <a href='https://www.kaggle.com/grassknoted/asl-alphabet#asl_alphabet_test.zip'>here</a>. The dataset contains around 3000 images for each letter, covering variations such as different zoom levels, angles and lighting conditions. A diverse dataset like this one helps the trained model generalize to images captured under different conditions.
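Once the archive is downloaded, it has to be extracted before the images can be read. Below is a minimal sketch using `unpack_archive` from `shutil`. The helper name `extract_and_count`, the archive name and the target folder are assumptions made for illustration, and the folder layout of the real download may differ.

```python
import os
from shutil import unpack_archive


def extract_and_count(archive_path, target_dir):
    """Extract an archive and count the files in each top-level folder."""
    unpack_archive(archive_path, target_dir)
    counts = {}
    for name in sorted(os.listdir(target_dir)):
        folder = os.path.join(target_dir, name)
        if os.path.isdir(folder):
            counts[name] = len(os.listdir(folder))
    return counts


# Hypothetical usage; adjust the names to the file you downloaded:
# print(extract_and_count('asl_alphabet_train.zip', 'data'))
```

Counting the files per folder is a quick way to check that every class has roughly the same number of images before training.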

### Preprocess the data

For simplicity, we will only be considering 3 classes of the dataset, namely 'A', 'B' and 'C'. First, let's read the data and corresponding labels of the three classes.

In [21]:
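paths = ['A','B','C']
dataX, dataY = [], []
for p in paths:
    # Each folder holds the images of one letter
    for f in os.listdir(p):
        image_ = cv2.imread(os.path.join(p, f))
        if image_ is None:
            # Skip files that OpenCV cannot decode
            continue
        # Resize to the 224x224 input size used by the network
        image = cv2.resize(image_, (224,224))
        # The label is the first character of the file name
        label = f[0]
        dataX.append(image)
        dataY.append(label)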

Of the 9000 images across the three classes, we will first perform an 80-20 split to make train and test sets containing 7200 and 1800 images respectively. Then we will perform a 90-10 split on the train data to make train and validation sets containing 6480 and 720 images respectively.

In [22]:
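data = list(zip(dataX,dataY))
# 80-20 split into a temporary train set and the test set
train_, test = model_selection.train_test_split(data, test_size=0.2, random_state=1)
# 90-10 split of the temporary train set (not the full data) into train and validation sets
train, validation = model_selection.train_test_split(train_, test_size=0.1, random_state=1)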
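The set sizes quoted for the 80-20 and 90-10 splits follow from applying the two fractions one after the other. Here is a small sketch of that arithmetic; `split_sizes` is a hypothetical helper that uses plain rounding, which may differ slightly from how sklearn rounds when the sizes do not divide evenly.

```python
def split_sizes(total, test_fraction, val_fraction):
    """Return (train, validation, test) sizes for a two-stage split."""
    test = round(total * test_fraction)
    remaining = total - test
    validation = round(remaining * val_fraction)
    train = remaining - validation
    return train, validation, test


print(split_sizes(9000, 0.2, 0.1))  # (6480, 720, 1800)
```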

Now that we have separate train, validation and test sets, we will separate the image data and labels in each set.

In [23]:
trainX, trainY = zip(*train)
valX, valY = zip(*validation)
testX, testY = zip(*test)

Let's convert the labels into one-hot encoding format and save the data.

In [24]:
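# zip() returns tuples; convert them to lists
trainX = list(trainX)
trainY = list(trainY)
valX = list(valX)
valY = list(valY)
testX = list(testX)
testY = list(testY)
# Map the letters to integer codes, using the same class order for every set
trainY = pd.Categorical(trainY, categories=paths).codes
valY = pd.Categorical(valY, categories=paths).codes
testY = pd.Categorical(testY, categories=paths).codes
# One-hot encode the integer codes for use with categorical cross-entropy
trainY_ = np_utils.to_categorical(trainY, num_classes=len(paths))
valY_ = np_utils.to_categorical(valY, num_classes=len(paths))
testY_ = np_utils.to_categorical(testY, num_classes=len(paths))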
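To make the one-hot format concrete, here is a minimal pure-Python sketch of the same idea. The helper `one_hot` is hypothetical and only illustrates the encoding; it is not how Keras implements `to_categorical`.

```python
def one_hot(labels, classes):
    """Encode each label as a list with a 1 at its class index."""
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]


print(one_hot(['A', 'C', 'B'], ['A', 'B', 'C']))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```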

In [25]:
np.save('trainX.npy', trainX)
np.save('trainY.npy', trainY)
np.save('valX.npy', valX)
np.save('valY.npy', valY)
np.save('testX.npy', testX)
np.save('testY.npy', testY)

Here, we used the `train_test_split` function from the sklearn library to split the data; it shuffles the data by default. You can also choose to do the split without the library, as is done <a href='https://medium.com/free-code-camp/asl-using-alexnet-training-from-scratch-cfec9a8acf84'>here</a>.
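A split without the library could look roughly like the sketch below, which shuffles a copy of the data and cuts it at the desired size. The helper `manual_split` is hypothetical, and its shuffling will not reproduce the exact split that sklearn produces for the same seed.

```python
import random


def manual_split(items, test_fraction, seed=1):
    """Shuffle a copy of items and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_part, test_part = manual_split(range(10), 0.2)
print(len(train_part), len(test_part))  # 8 2
```

Using a seeded `random.Random` instance keeps the split reproducible without touching the global random state.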

### Build the model
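Now that we have the data in an appropriate format, let's build the CNN model, AlexNet.

AlexNet consists of 5 convolutional layers followed by 3 fully connected layers, as depicted in the architecture given below. The implementation in this tutorial follows that layout with a few changes, such as batch normalization after each block and an extra dense layer before the 3-class output.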

<img src='AlexNet.png'></img>

You can dive deeper into the architecture of AlexNet <a href='https://medium.com/@smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637'>here</a>.

In [7]:
train_X = np.load('trainX.npy')
train_Y = np.load('trainY.npy')
val_X = np.load('valX.npy')
val_Y = np.load('valY.npy')
test_X = np.load('testX.npy')
test_Y = np.load('testY.npy')

In [8]:
# The .npy label files hold integer class codes; training needs the one-hot versions
train_Y = trainY_
val_Y = valY_
test_Y = testY_

In [9]:
from keras.optimizers import SGD
from keras.models import Sequential
from keras.preprocessing import image
from keras.layers.normalization import BatchNormalization
from keras.layers import Dense, Activation, Dropout, Flatten,Conv2D, MaxPooling2D

model = Sequential()
# 1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11),strides=(4,4), padding='valid'))
model.add(Activation('relu'))
# Max Pooling 
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation before passing it to the next layer
model.add(BatchNormalization())
# 2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())
# 3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())
# 4th Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())
# 5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())
# Passing it to a dense layer
model.add(Flatten())
# 1st Dense Layer (its input size is inferred from the Flatten output)
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())
# 2nd Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.6))
# Batch Normalisation
model.add(BatchNormalization())
# 3rd Dense Layer
model.add(Dense(2048))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.5))
# Batch Normalisation
model.add(BatchNormalization())
# Output Layer
model.add(Dense(3))
model.add(Activation('softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 54, 54, 96)        34944     
_________________________________________________________________
activation_1 (Activation)    (None, 54, 54, 96)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 27, 27, 96)        0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 27, 27, 96)        384       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 17, 17, 256)       2973952   
_________________________________________________________________
activation_2 (Activation)    (None, 17, 17, 256)       0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 8, 8, 256)         0         
__________
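The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes 'valid' padding and square kernels, as used in the model; `conv_output` and `conv_params` are hypothetical helper names.

```python
def conv_output(size, kernel, stride):
    """Spatial output size of a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square 2D convolution."""
    return kernel * kernel * in_channels * filters + filters


size = conv_output(224, 11, 4)         # 1st convolution
print(size, conv_params(11, 3, 96))    # 54 34944
size = conv_output(size, 2, 2)         # max pooling
print(size)                            # 27
size = conv_output(size, 11, 1)        # 2nd convolution
print(size, conv_params(11, 96, 256))  # 17 2973952
size = conv_output(size, 2, 2)         # max pooling
print(size)                            # 8
```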

In [40]:
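sgd = SGD(lr=0.002)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# Make sure the output folders exist before Keras writes to them
os.makedirs("Checkpoint", exist_ok=True)
os.makedirs("Weights", exist_ok=True)
# Save the full model after every epoch; the file name records the epoch and validation loss
checkpoint = keras.callbacks.ModelCheckpoint("Checkpoint/weights.{epoch:02d}-{val_loss:.2f}.hdf5", monitor='val_loss', verbose=0, save_best_only=False, save_weights_only=False, mode='auto', period=1)
# Scale only the pixel values to [0, 1]; the one-hot labels are passed unchanged
model.fit(train_X/255.0, train_Y, batch_size=32, epochs=60, verbose=1, shuffle=True, validation_data=(val_X/255.0,val_Y), callbacks=[checkpoint])
model.save_weights("Weights/model_weights.h5")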

Train on 8100 samples, validate on 900 samples
Epoch 1/60
...
Epoch 60/60
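Keras fills the placeholders in the checkpoint path with the epoch number and the logged metrics, which is how one file per epoch ends up in the `Checkpoint` folder. Below is a small sketch of that formatting using plain `str.format`; the loss values are made up for illustration and are not taken from the training run.

```python
pattern = "Checkpoint/weights.{epoch:02d}-{val_loss:.2f}.hdf5"

# Example values chosen for illustration, not taken from the training log
print(pattern.format(epoch=5, val_loss=0.1234))  # Checkpoint/weights.05-0.12.hdf5
print(pattern.format(epoch=58, val_loss=0.004))  # Checkpoint/weights.58-0.00.hdf5
```

Because the loss is rounded to two decimals, a very small validation loss shows up as `0.00` in the file name.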




### Test the model

Now that we have built and validated the model and saved its weights, let's test it. We will quantify the performance of the model with the accuracy score.

Accuracy is the ratio of correctly classified samples to the total number of samples. A high accuracy is desirable in classification problems.

In [47]:
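from sklearn.metrics import accuracy_score 

# Load the checkpoint saved during training
model.load_weights('Checkpoint/weights.58-0.00.hdf5')
test_X=np.load("testX.npy")
# testY.npy holds integer class codes, which is what accuracy_score expects
test_Y=np.load("testY.npy")
# Scale the test images the same way as the training images
predict_Y = model.predict(test_X/255.0)
# Pick the class with the highest predicted probability for each image
predict_X = [np.argmax(r) for r in predict_Y]
acc_score = accuracy_score(test_Y, predict_X)

print("Accuracy: " + str(acc_score))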

Accuracy: 0.33944444444444444
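The accuracy computation can be reproduced by hand: take the most probable class for each sample and count how often it matches the true label. The sketch below uses made-up probabilities and hypothetical helper names. As a point of reference, a classifier that guesses uniformly at random among three equally frequent classes is expected to score about 1/3.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(true_labels, probabilities):
    """Fraction of rows whose most probable class equals the true label."""
    predicted = [argmax(row) for row in probabilities]
    correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return correct / len(true_labels)


# Made-up probabilities for four samples and three classes
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
print(accuracy([0, 1, 2, 1], probs))  # 0.75
```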


### Summary
Voila! You just implemented an AlexNet for image classification. It would be worthwhile to try other neural network architectures and see how they perform compared to AlexNet. Once you are fairly confident in this network, a more useful real-world approach may be to interpret sign language in a continuous video stream.

> This tutorial is intended to be a public resource. If you see any glaring inaccuracies or a missing critical topic, please feel free to point it out or submit a pull request to improve the tutorial. 
> Also, we are always looking to improve the scope of this article. For any suggestions and feedback, mail us at colearninglounge@gmail.com.

### Credits
> This article is authored by: <ul><li>Naveksha Sood : Follow her on <a href='https://www.linkedin.com/in/naveksha-sood-8b6824160/'>LinkedIn</a>, <a href='https://medium.com/@navekshasood'>Medium</a> and <a href='https://github.com/search?q=naveksha+sood'>GitHub</a>.</li><li>Vagdevi Kommineni : Follow her on <a href='https://www.linkedin.com/in/vagdevi-kommineni-427599114'>LinkedIn</a>, <a href='https://medium.com/@vagdevi.k15'>Medium</a>, <a href='https://vagdevik.wordpress.com'>WordPress</a> and <a href='https://github.com/vagdevik'>GitHub</a>.</li></ul>