In [1]:
import cv2
import os
import numpy as np

import pandas as pd
import imutils
import time
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from PIL import Image

from tensorflow.keras.models import load_model

Most of the code below has to be run as standalone scripts, because calling `cv2.imshow` from Jupyter tends to crash the whole notebook, which is not ideal. I have still put all of the code into this notebook because I wanted to show my thought process and explain clearly what is happening.

There isn't much actual code in this notebook, but the convolutional neural network will take a while to train. If you want to load the trained CNN model directly, I will include it in my GitHub repository.

## Get the data

In this mini project I really wanted to create my own dataset, mainly because I thought it would be really fun, but also to get more practice with the cv2 library, which was not something we used in our studies.

After many iterations I found a couple of different ways of isolating the hand. I have included the method that worked best for me.

In [None]:
video_file_path="/Users/danielchow/Downloads/videos_for_nueral_net/mp4s/1_finger_stationary.mp4"
cam = cv2.VideoCapture(video_file_path) 
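# build the folder name from the first and last parts of the video file name
video_name = os.path.splitext(os.path.basename(video_file_path))[0]
name_parts = video_name.split('_')
folder_name = 'data_' + name_parts[0] + '_' + name_parts[-1] + 'gaussian'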
fgbg= cv2.createBackgroundSubtractorMOG2()

# create the output folder if it does not already exist
try:
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
except OSError:
    # report the problem rather than stopping the script
    print('Error: Creating directory of data')

# frame counter used to number the saved images
currentframe = 1
  
while(True): 
      
    # reading from frame 
    ret,frame = cam.read() 
  
    if ret: 
        # if video is still left continue creating images
        name = './'+folder_name+'/frame' + str(currentframe) + '.jpg'
        print ('Creating...' + name)
        cropped = frame[100:1100,50:500]
        height , width , layers = cropped.shape
        new_h=int(height/2)
        new_w=int(width/2)
        resize = cv2.resize(cropped, (new_w, new_h))
        fgmask=fgbg.apply(resize)
        median=cv2.medianBlur(fgmask,5)
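        # use _ so the threshold return value does not overwrite ret from cam.read();
        # 255 is the largest value an 8-bit mask can hold
        _,thresh1 = cv2.threshold(median,20,255,cv2.THRESH_BINARY)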
        

        # writing the extracted images 
        cv2.imwrite(name, thresh1) 
  
        # advance the counter so each frame gets its own file name
        currentframe += 1
    else: 
        break
  
# release the video and close any windows once done
cam.release() 
cv2.destroyAllWindows()

For someone who has not seen the cv2 library before this may seem quite confusing, so I will break it down and explain some of the concepts.

In [None]:
while True:

    # read the next frame; ret is False once the video has ended
    ret,frame = cam.read()

    if ret:
        pass  # per-frame processing goes here
    else:
        break

The above code loops over the frames of a video (in this case) and breaks out of the loop when the video is finished.

In [None]:
if ret:
    # if video is still left continue creating images
    name = './'+folder_name+'/frame' + str(currentframe) + '.jpg'
    print ('Creating...' + name)
    cropped = frame[100:1100,50:500]
    height , width , layers = cropped.shape
    new_h=int(height/2)
    new_w=int(width/2)
    resize = cv2.resize(cropped, (new_w, new_h))

The above code takes a frame from the video, crops it so that only the region where the hand is remains, and then resizes the image to half its width and height. This is done so the neural network has fewer pixels to process.
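
As a rough illustration of the crop-and-halve step, the sketch below applies the same slice to a blank array. The 1280x720 frame size is an assumption made for the example, since the real video dimensions are not stated here. It shows that NumPy quietly clips a slice that runs past the edge of the frame, so the cropped shape is worth checking.

```python
import numpy as np

# Hypothetical 1280x720 frame laid out as (rows, columns, channels).
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# Same slice as the script: rows 100 up to 1100, columns 50 up to 500.
cropped = frame[100:1100, 50:500]

# Only 720 rows exist, so the row range is clipped instead of raising an error.
height, width, layers = cropped.shape

# Halve both dimensions; cv2.resize takes its size as (width, height).
new_h = int(height / 2)
new_w = int(width / 2)
print(cropped.shape, (new_w, new_h))
```

Halving both sides leaves a quarter of the original pixels.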

In [None]:
fgbg= cv2.createBackgroundSubtractorMOG2()
fgmask=fgbg.apply(resize)

This is where it gets slightly more complicated. The MOG2 background subtractor builds a model of the background from the previous frames of a video and marks where the current frame differs from it. Therefore, if the only thing moving is your hand, everything but your hand turns black. This is very useful as we only care about what the hand is doing. Below is an example from the OpenCV documentation of what happens.

![example](https://docs.opencv.org/3.4/Background_Subtraction_Tutorial_Scheme.png)
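
To make the idea concrete, here is a much simplified sketch of background subtraction in plain NumPy. It is not MOG2, which models each pixel with a mixture of Gaussians. It keeps a single running average per pixel, and the function name, `alpha` and `diff_threshold` are illustrative choices of mine.

```python
import numpy as np

def foreground_mask(frames, alpha=0.5, diff_threshold=30):
    """Yield a 0/255 mask per frame using a running-average background."""
    background = None
    for frame in frames:
        gray = frame.astype(np.float32)
        if background is None:
            background = gray.copy()  # the first frame seeds the background
        # Pixels that differ strongly from the background count as foreground.
        diff = np.abs(gray - background)
        mask = np.where(diff > diff_threshold, 255, 0).astype(np.uint8)
        # Blend the current frame into the background model.
        background = (1 - alpha) * background + alpha * gray
        yield mask

# Synthetic 4x4 grayscale frames: two static frames, then a bright block appears.
static = np.full((4, 4), 10, dtype=np.uint8)
moving = static.copy()
moving[1:3, 1:3] = 200
masks = list(foreground_mask([static, static, moving]))
print(masks[2])
```

Only the four pixels that changed in the last frame come out white, and the static scene stays black.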

In [None]:
median=cv2.medianBlur(fgmask,5)
# use _ so the return value does not overwrite ret; 255 is the 8-bit maximum
_,thresh1 = cv2.threshold(median,20,255,cv2.THRESH_BINARY)

The next step is to apply a median blur to the image. This computes the median of all the pixels in a window (5x5 here) and assigns that value to the centre pixel. This is very good at removing salt-and-pepper noise, which I found quite common when using the MOG2 background subtractor. Below is an example of median blur being used.

![example](https://blog.photopea.com/wp-content/uploads/2016/09/head.jpg)
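
The sketch below is a naive NumPy version of a median filter, written only to show why isolated noise pixels disappear. It is far slower than `cv2.medianBlur`, and the edge padding and 3x3 window are assumptions made for the example.

```python
import numpy as np

def median_blur(image, ksize=3):
    """Replace each pixel with the median of its ksize x ksize window."""
    pad = ksize // 2
    padded = np.pad(image, pad, mode="edge")  # repeat the border pixels
    out = np.empty_like(image)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.median(padded[r:r + ksize, c:c + ksize])
    return out

# A single white "salt" pixel on a black mask is removed entirely.
noisy = np.zeros((5, 5), dtype=np.uint8)
noisy[2, 2] = 255
cleaned = median_blur(noisy)

# A solid 3x3 white block keeps its centre, because most of that window is white.
block = np.zeros((5, 5), dtype=np.uint8)
block[1:4, 1:4] = 255
kept = median_blur(block)
print(cleaned.max(), kept[2, 2])
```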

Finally we apply a threshold to the image. Thresholding is quite a simple concept: we pick a cut-off value, and every pixel at or below that value becomes black while every pixel above it becomes white. Some examples of different types of thresholding are shown below, but in this case normal binary thresholding worked well. The only thing left to do now is save the image into a predefined folder.

![example](https://opencv-python-tutroals.readthedocs.io/en/latest/_images/threshold.jpg)
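
Binary thresholding is simple enough to write in one line of NumPy. The sketch below mimics the behaviour of `cv2.THRESH_BINARY` with a cut-off of 20. The helper name is made up for this example.

```python
import numpy as np

def binary_threshold(image, thresh=20, maxval=255):
    """Pixels strictly above thresh become maxval, everything else becomes 0."""
    return np.where(image > thresh, maxval, 0).astype(np.uint8)

blurred = np.array([[0, 20, 21],
                    [100, 127, 255]], dtype=np.uint8)
result = binary_threshold(blurred)
print(result)
```

Note that a pixel exactly equal to the cut-off (20 here) ends up black.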

Overall I recorded a couple of videos of myself doing various hand signals and obtained over 1300 pictures. The hand signals I picked were 1 finger pointing, the peace sign and the okay sign.

![h](https://raw.githubusercontent.com/danch12/Images_for_neural_hands/master/Data/Train/1_finger/frame134.jpg "1 finger")

![](https://raw.githubusercontent.com/danch12/Images_for_neural_hands/master/Data/Train/ok_sign/frame522.jpg)

![](https://raw.githubusercontent.com/danch12/Images_for_neural_hands/master/Data/Train/peace_sign/frame115.jpg)

## Running the neural network

Now that we have our dataset of images we can use a neural network to process them. I used the Xception CNN as a base and then put a new layer on top for my classes. I will go into more detail in the README. Additionally, I used Google Colab, which is why the training path may look a bit weird.

In [None]:
training_path='/content/drive/My Drive/pics_for_neural/Images_for_neural_hands/Data/Train'
train_datagen=ImageDataGenerator(rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    preprocessing_function=keras.applications.xception.preprocess_input)



train_generator=train_datagen.flow_from_directory(training_path,
                                                  target_size=(224,224),
                                                  color_mode='rgb',
                                                  batch_size=32,
                                                  class_mode='categorical',
                                                  seed=1,
                                                shuffle=True
                                                 )
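
As far as I know, `flow_from_directory` infers the classes from the subfolder names and assigns the label indices in alphanumeric order. Assuming the three class folders are `1_finger`, `ok_sign` and `peace_sign`, the short sketch below reproduces that ordering and recovers the index-to-name mapping used later as `class_dict`.

```python
# Folder names listed in arbitrary order on purpose.
class_folders = ["peace_sign", "1_finger", "ok_sign"]

# Sorting the names gives the order in which label indices are assigned.
class_indices = {name: index for index, name in enumerate(sorted(class_folders))}

# Invert the mapping to turn a predicted index back into a class name.
class_dict = {index: name for name, index in class_indices.items()}
print(class_dict)
```

On the real generator the same information should be available as `train_generator.class_indices`, which is worth printing once to confirm the mapping.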

In [None]:
validation_path='/content/drive/My Drive/pics_for_neural/Images_for_neural_hands/Data/Validation'
test_datagen = ImageDataGenerator(rescale=1./255,
                                  preprocessing_function=keras.applications.xception.preprocess_input)

validation_generator = test_datagen.flow_from_directory(
    validation_path,
    target_size=(224,224),
    batch_size=32,
    shuffle=True,
    class_mode='categorical',
    
    seed=1
    )

In [None]:
base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(3, activation="softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=output)

In [None]:
# Set up a model checkpoint to save our best model

checkpoint = tf.keras.callbacks.ModelCheckpoint(
        '/content/drive/My Drive/Capstone/models/{epoch:02d}-{val_accuracy:.2f}.h5',
        monitor=  'val_accuracy',
        verbose=1,
        save_best_only=True,
    )

# also set up early stopping; both callbacks help guard against overfitting
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                               min_delta=0,
                               patience=5,
                               verbose=1,
                               mode='auto',
                               restore_best_weights=True)

We need to freeze the base layers because the new output layer is initialized randomly, so it will make large errors at first, and the resulting large error gradients could destroy the reused weights. To avoid this we freeze the base weights until the new layer has had some time to learn reasonable weights.

In [None]:
#freezing the base layers
for layer in base_model.layers:
    layer.trainable = False

In [None]:
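#using SGD as it seems to have the best quality convergence
# learning_rate replaces the deprecated lr argument; decay is only accepted by
# older Keras optimizers, so this line may need adapting on newer versions
optimizer = keras.optimizers.SGD(learning_rate=0.2, momentum=0.9, decay=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
# model.fit accepts generators directly; fit_generator is deprecated
history = model.fit(
    train_generator,
    epochs=10,
    validation_data=validation_generator,
    callbacks=[checkpoint,early_stopping])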
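
To get a feel for what the `decay` argument does, the sketch below assumes the time-based rule used by the older Keras optimizers, where the learning rate shrinks with the number of batch updates. The `steps_per_epoch` value is hypothetical, roughly 1300 images at a batch size of 32.

```python
def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay: lr = initial_lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)

steps_per_epoch = 41  # hypothetical number of batches per epoch

# Print the effective learning rate at the start of a few epochs.
for epoch in (0, 1, 5, 10):
    iteration = epoch * steps_per_epoch
    print(epoch, round(decayed_learning_rate(0.2, 0.01, iteration), 4))
```

Under these assumptions the rate starts at 0.2 and has dropped to roughly a fifth of that after ten epochs.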

In [None]:
#unfreezing the base layers
for layer in base_model.layers:
    layer.trainable = True

In [None]:
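# a much lower learning rate now that the pretrained layers are being updated too
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, decay=0.001)
model.compile(loss="categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
# model.fit accepts generators directly; fit_generator is deprecated
history = model.fit(
    train_generator,
    epochs=10,
    validation_data=validation_generator,
    callbacks=[checkpoint,early_stopping])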

## Putting it all together

Doing predictions in real time was actually relatively easy: the script is very similar to the one used in the data gathering stage, with only a couple of changes.

In [None]:
class_dict={0:'1_finger',
           1:'ok_sign',
           2:'peace_sign'}

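def preprocess(image):
    # repeat the single-channel mask across three colour channels
    image = np.stack((image,) * 3, axis=-1)
    # resize to the 224x224 input size used during training
    resized_image = tf.image.resize(image, [224, 224])
    # apply the Xception-specific pixel scaling
    final_image = keras.applications.xception.preprocess_input(resized_image)
    return final_image


model=load_model("/Users/danielchow/Downloads/videos_for_nueral_net/models/best_home_model.h5")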


prob=None
cl=None
font                   = cv2.FONT_HERSHEY_SIMPLEX
bottomLeftCornerOfText = (100,200)
fontScale              = 1
fontColor              = (0,255,255)
lineType               = 2




cam = cv2.VideoCapture(0) 
fgbg= cv2.createBackgroundSubtractorMOG2()


  
# frame counter, used to run a prediction only on every 5th frame
currentframe = 0
  
while(True): 
      
    # reading from frame 
    ret,frame = cam.read() 

    if ret: 
        


        # crop to the hand region, then halve the width and height
        cropped = frame[290:1100,200:1000]
        height , width , layers = cropped.shape
        new_h=int(height/2)
        new_w=int(width/2)
        resize = cv2.resize(cropped, (new_w, new_h))
        fgmask=fgbg.apply(resize)
        median=cv2.medianBlur(fgmask,5)
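        # use _ so the threshold return value does not overwrite ret from cam.read();
        # 255 is the largest value an 8-bit mask can hold
        _,thresh1 = cv2.threshold(median,20,255,cv2.THRESH_BINARY)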




        if currentframe % 5 ==0:
            # run the model on every 5th frame only
            img=preprocess(thresh1)
            img=img/255
            im2 = np.expand_dims(img, axis=0)
            # predict once and reuse the result for the probability and the class
            probs=model.predict(im2)[0]
            prob=max(probs)
            cl=class_dict[np.argmax(probs)]

        # draw the latest prediction on the original frame
        cv2.putText(frame,
                    f'{prob}-{cl}',
                    bottomLeftCornerOfText,
                    font,
                    fontScale,
                    fontColor,
                    lineType)

        # show the segmented mask and the annotated frame
        cv2.imshow('segment', thresh1)
        cv2.imshow('real',frame)
        # advance the frame counter
        currentframe += 1

        # observe the keypress by the user
        keypress = cv2.waitKey(1) & 0xFF

        # if the user pressed "q", then stop looping
        if keypress == ord("q"):
            break 
    else: 
        break
  
# release the camera and close all windows once done
cam.release() 
cv2.destroyAllWindows()

There are two main changes in this script, the first being the if statement.

In [None]:
def preprocess(image):
    image = np.stack((image,) * 3, axis=-1)
    resized_image = tf.image.resize(image, [224, 224])
    final_image = keras.applications.xception.preprocess_input(resized_image)
    return final_image


if currentframe % 5 ==0:
    img=preprocess(thresh1)
    img=img/255
    im2 = np.expand_dims(img, axis=0)
    # predict once and reuse the result for the probability and the class
    probs=model.predict(im2)[0]
    prob=max(probs)
    cl=class_dict[np.argmax(probs)]

The preprocessing function takes the image and turns it into a format that is suitable for the model. First, np.stack copies the single-channel image into all three colour channels. Next, the image is resized to the 224x224 size the model was trained on. Then the image is put through the Xception preprocessing function, which scales the pixel values to between -1 and 1. Finally, we divide by 255 to match the `rescale=1./255` used by the training generator, then expand the dimensions so a batch of one image can be passed to the model.
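
The sketch below walks through the same steps in plain NumPy so the shapes and value ranges are easy to inspect. It skips the resize by starting from a 224x224 mask, and `xception_style_scale` is my own stand-in for the Xception preprocessing function, assuming it maps 0-255 pixels to the range -1 to 1.

```python
import numpy as np

def xception_style_scale(x):
    """Stand-in for the Xception preprocessing: map 0-255 pixels to [-1, 1]."""
    return x.astype(np.float32) / 127.5 - 1.0

# Stand-in for thresh1: a black mask with a white "hand" region.
mask = np.zeros((224, 224), dtype=np.uint8)
mask[60:160, 80:140] = 255

stacked = np.stack((mask,) * 3, axis=-1)  # (224, 224) -> (224, 224, 3)
scaled = xception_style_scale(stacked)    # values now between -1 and 1
scaled = scaled / 255                     # mirrors rescale=1./255 from training
batch = np.expand_dims(scaled, axis=0)    # add the batch axis: (1, 224, 224, 3)
print(batch.shape, batch.min(), batch.max())
```

The printed range shows how small the values become once both scalings are applied. That is fine as long as inference uses exactly the same steps as training.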

The model then outputs a probability for each class, and we keep the class with the highest probability along with that probability.
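
As a small illustration of that last step, the snippet below picks the most likely class from a made-up probability vector. The numbers are invented for the example and are not real model output.

```python
import numpy as np

class_dict = {0: '1_finger', 1: 'ok_sign', 2: 'peace_sign'}

# Made-up softmax output for one image, one value per class.
probabilities = np.array([0.10, 0.75, 0.15])

best = int(np.argmax(probabilities))  # index of the largest probability
prob = float(probabilities[best])
cl = class_dict[best]
print(f'{prob:.2f}-{cl}')
```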

The only other major change is the part below, which is needed for a live video feed as it lets you stop the video by pressing a certain key, in this case q.

In [None]:
# observe the keypress by the user
keypress = cv2.waitKey(1) & 0xFF

# if the user pressed "q", then stop looping
if keypress == ord("q"):
    break
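
The `& 0xFF` mask keeps only the lowest byte of the key code, which helps on platforms where `cv2.waitKey` returns extra high bits. The sketch below shows the comparison in isolation. The helper name and the example key code with high bits set are hypothetical.

```python
def is_quit_key(raw_key_code, quit_char="q"):
    """Return True when the low byte of the key code matches quit_char."""
    return (raw_key_code & 0xFF) == ord(quit_char)

print(is_quit_key(ord("q")))             # plain key code for q
print(is_quit_key(0x100000 | ord("q")))  # hypothetical code with extra high bits
print(is_quit_key(-1))                   # waitKey returns -1 when no key is pressed
```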