# Final Project
## Author: Yu Mi, yxm319; Boning Zhao, bxz213
Recognizing human actions is one of the most popular computer vision tasks, with applications in many fields such as video surveillance, customer attribute analysis and shopping behavior analysis.

In our final project, we consider the automated recognition of human actions in videos. We propose a 3D CNN model for action recognition. To capture the motion information encoded in multiple adjacent frames, the model extracts features from both the spatial and the temporal dimensions. A 3D convolutional neural network is built on this feature extractor. The CNN generates multiple channels of information and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining the information from all channels.

In [1]:
# Import standard and supportive libraries
import tensorflow as tf
import os
import matplotlib.pyplot as plt
import numpy as np
import cv2
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print("Available devices for training:", get_available_devices())

Available devices for training: ['/device:CPU:0', '/device:GPU:0']


## Neural network framework
In this project, we use [Keras](https://keras.io) as our neural network framework, since it was already introduced in Homework 3. Keras can run on top of TensorFlow, CNTK or Theano. It was developed with a focus on fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research.

In [2]:
# Try to run TensorFlow on the GPU: log device placement and allocate GPU memory on demand
config_tf = tf.ConfigProto(log_device_placement=True)
config_tf.gpu_options.allow_growth = True
session = tf.Session(config=config_tf)

# Import models and layers
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, Model
from keras.layers.core import Dense, Flatten, Activation, Dropout
from keras.layers.convolutional import Conv3D, MaxPooling3D

# Import utilities
from keras.optimizers import SGD,RMSprop
from keras.utils.vis_utils import plot_model
from keras.utils import np_utils, generic_utils
from keras.backend import set_session
from keras import backend as K
from keras.utils.generic_utils import get_custom_objects

set_session(session)

Using TensorFlow backend.


## KTH dataset
The [KTH dataset](http://www.nada.kth.se/cvap/actions/) is a database provided by the KTH Royal Institute of Technology. The current video database contains six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping. All the actions are performed several times by 25 different individuals in four scenarios: outdoors $s1$, outdoors with scale variation $s2$, outdoors with different clothes $s3$ and indoors $s4$, as illustrated below.
![KTH scenarios and actions](figure/KTH_Intro.gif)
Currently the dataset contains $600$ sequences. All sequences were taken over homogeneous backgrounds with a static camera at a frame rate of $25$ fps. The sequences were downsampled to a spatial resolution of $160\times120$ pixels and have a length of four seconds on average.
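
As a quick sanity check on these figures, the sequence count follows from the number of actions, subjects and scenarios, and the average length in frames follows from the frame rate. This is only a small sketch; the variable names are ours and are not part of any dataset tooling.

```python
# Figures quoted in the dataset description above
n_actions = 6      # walking, jogging, running, boxing, hand waving, hand clapping
n_subjects = 25
n_scenarios = 4    # s1 to s4
fps = 25
avg_seconds = 4

n_sequences = n_actions * n_subjects * n_scenarios
avg_frames = fps * avg_seconds

print(n_sequences)  # 600
print(avg_frames)   # 100
```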

### Loading the KTH data as input
When loading the KTH dataset, we treat roughly one second of video ($24$ consecutive frames) as one sequence. Since every video file is much longer than $1$ second, we take the first $4$ such sequences (about the first $4$ seconds) of each video, which gives nearly $2400$ sequences as our input:

In [3]:
inflation = 4  # number of clips taken from each video
# image attributes: frame width, frame height, frames per clip
img_r, img_c, img_d = 64, 48, 24
data_root = 'data/kth_database'
# (folder name, display name) of each class; this order fixes the label order used later
kth_classes = [('boxing', 'Boxing'), ('handclapping', 'Hand-clapping'),
               ('handwaving', 'Hand-waving'), ('jogging', 'Jogging'),
               ('running', 'Running'), ('walking', 'Walking')]

def load_clips(video_path):
    """Read the first `inflation` clips of `img_d` grayscale frames from one video."""
    clips = []
    capture = cv2.VideoCapture(video_path)
    for j in range(inflation):
        frame_list = []
        for i in range(img_d):
            success, frame = capture.read()
            if not success:
                # Fail with a clear message instead of passing None to cv2.resize
                capture.release()
                raise IOError("Could not read a frame from " + video_path)
            # cv2.resize takes (width, height), so each frame becomes an (img_c, img_r) array
            frame = cv2.resize(frame,(img_r,img_c),interpolation=cv2.INTER_AREA)
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frame_list.append(gray)
        ipt = np.asarray(frame_list)
        # Reorder the axes from (img_d, img_c, img_r) to (img_r, img_c, img_d)
        ipt = np.rollaxis(np.rollaxis(ipt,2,0),2,1)
        clips.append(ipt)
    capture.release()
    return clips

#Entire dataset
Training_set=[]
for folder, name in kth_classes:
    class_dir = os.path.join(data_root, folder)
    for video_name in os.listdir(class_dir):
        Training_set.extend(load_clips(os.path.join(class_dir, video_name)))
    print("{} class successfully loaded.".format(name))

Boxing class successfully loaded.
Hand-clapping class successfully loaded.
Hand-waving class successfully loaded.
Jogging class successfully loaded.
Running class successfully loaded.
Walking class successfully loaded.
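
The axis reordering done with the two `np.rollaxis` calls in the loading code can be checked on a dummy array. The sketch below assumes the stacked frames have the layout (frames, frame height, frame width), which is what `cv2.resize` with a (width, height) target of (64, 48) produces, and shows that the result is the same as a single transpose that reverses the axes.

```python
import numpy as np

img_r, img_c, img_d = 64, 48, 24

# Dummy clip laid out like the stacked frames: (frames, frame height, frame width)
frames = np.arange(img_d * img_c * img_r).reshape(img_d, img_c, img_r)

# The same two calls as in the loading code
rolled = np.rollaxis(np.rollaxis(frames, 2, 0), 2, 1)

# Equivalent single transpose that reverses the axis order
transposed = np.transpose(frames, (2, 1, 0))

print(rolled.shape)                        # (64, 48, 24)
print(np.array_equal(rolled, transposed))  # True
```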


In [4]:
#convert the frames into array
# Check that every clip has the expected shape before stacking the clips into one array
for i, clip in enumerate(Training_set):
    assert clip.shape == (img_r, img_c, img_d), "clip {} has shape {}".format(i, clip.shape)

Training_data=np.asarray(Training_set)
sample_num = len(Training_data)
#Assign Label
label = np.ones((sample_num,),dtype = int)
label[0:100*inflation] = 0
label[100*inflation:100*inflation+99*inflation] = 1
label[100*inflation+99*inflation:200*inflation+99*inflation] = 2
label[200*inflation+99*inflation:300*inflation+99*inflation] = 3
label[300*inflation+99*inflation:400*inflation+99*inflation] = 4
label[400*inflation+99*inflation:] = 5
#print(Training_data.shape)
#print(label.shape)
train = [Training_data,label]
# Add a trailing channel axis: (samples, img_r, img_c, img_d) -> (samples, img_r, img_c, img_d, 1)
train_set = Training_data.reshape(sample_num, img_r, img_c, img_d, 1).astype('float64')

#training parameter for CNN
classes = 6
epoch = 25
batch_size = 2
#number of frames used in each video
patch_size = 15

(X_train, y_train) = (train[0],train[1])
Y_train = np_utils.to_categorical(y_train, classes)

#number of convolutional filters
filt =[32, # 1st layer 
       32  # 2nd layer
      ]
#level of pooling 
pool = [3,3]
#level of convolution
conv = [5,5]
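
The label slices in the cell above can also be built from per-class video counts. The sketch below is an alternative formulation, not the code that was run: the counts are read off the slice boundaries (99 videos for the second class, 100 for the others, with the last count inferred from the 2396 clips reported in the training log), and `videos_per_class` is a name introduced here for illustration.

```python
import numpy as np

inflation = 4
# Videos per class in loading order (assumption: inferred from the label slices)
videos_per_class = [100, 99, 100, 100, 100, 100]

# One integer label per clip: class k is repeated once for every clip of that class
label = np.repeat(np.arange(len(videos_per_class)),
                  [n * inflation for n in videos_per_class])

# One-hot encoding, the same layout that np_utils.to_categorical produces
one_hot = np.eye(len(videos_per_class))[label]

print(label.shape)    # (2396,)
print(one_hot.shape)  # (2396, 6)
```

Building the labels from the counts keeps the class boundaries in one place, so a missing video only changes one number.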

In [5]:
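#preprocessing part
def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x))
    softmax_x = exp_x / np.sum(exp_x)
    return softmax_x
# Center the pixel values on zero, then scale by the largest centered value
train_set = train_set.astype('float32')
train_set -= np.mean(train_set)
train_set /= np.max(train_set)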
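
The effect of the centering and scaling steps can be seen on a handful of values. The sketch below uses made-up pixel values rather than KTH frames: after the two steps the mean is zero and the largest value is 1, while the smallest value depends on where the mean lies, so the range is not symmetric in general.

```python
import numpy as np

# Stand-in for 8-bit grayscale pixel values (hypothetical data, not KTH frames)
pixels = np.array([0, 64, 128, 255], dtype='float32')

centered = pixels - np.mean(pixels)   # the mean becomes 0
scaled = centered / np.max(centered)  # the largest value becomes 1

print(scaled.max())  # 1.0
print(scaled.min())  # negative, but not necessarily -1
```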

### 3D CNN
To get a better understanding of the 3D CNN, we compare it with the 2D CNN.

The usual way of applying a 2D CNN to video is to run the CNN on each frame separately. This approach does not take the inter-frame motion information along the time dimension into account. The following figure shows the traditional 2D CNN convolution, which applies a 2D convolution kernel to an image:
![2DCNN](figure/2D.PNG)
$$2D\ CNN$$
In a 2D CNN, the 2D convolution in a convolutional layer extracts features from a local neighborhood of the feature maps in the previous layer. A bias is then added and the result is passed through a hyperbolic tangent activation. Formally, the value of the unit at position $(x,y)$ in the $j$-th feature map of the $i$-th layer, denoted $v_{ij}^{xy}$, is given by:
$$v_{ij}^{xy}=\tanh\left(b_{ij}+\sum_m \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1}w^{pq}_{ijm}v^{(x+p)(y+q)}_{(i-1)m}\right)$$
Here $\tanh$ is the hyperbolic tangent function, $b_{ij}$ is the bias of this feature map, $m$ indexes the set of feature maps in layer $i-1$ that are connected to the current feature map, $w^{pq}_{ijm}$ is the weight at position $(p, q)$ of the kernel connected to the $m$-th feature map, and $P_i$ and $Q_i$ are the height and width of the kernel. In a downsampling layer, the resolution of the feature maps of the previous layer is reduced by pooling over local neighborhoods, which increases the invariance to distortions of the input. A CNN architecture can be constructed by stacking convolution and downsampling operations in an alternating manner. CNN parameters are usually trained using supervised or unsupervised methods.
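
To make the 2D formula concrete, here is a minimal NumPy sketch that evaluates it directly with loops. It is written for readability and is not how Keras computes convolutions; the function name `conv2d_unit` and the toy inputs are our own.

```python
import numpy as np

def conv2d_unit(prev_maps, weights, bias):
    """Evaluate v_ij^{xy} for every valid (x, y) of one output feature map.

    prev_maps: (M, H, W) feature maps of layer i-1 connected to this map
    weights:   (M, P, Q) one P x Q kernel per connected map
    bias:      scalar b_ij
    """
    M, H, W = prev_maps.shape
    _, P, Q = weights.shape
    out = np.zeros((H - P + 1, W - Q + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Local neighborhood v_{(i-1)m}^{(x+p)(y+q)} over all m, p, q
            patch = prev_maps[:, x:x + P, y:y + Q]
            out[x, y] = np.tanh(bias + np.sum(weights * patch))
    return out

# Toy example: one 4x4 input map of ones and a 3x3 kernel with all weights 0.1
prev_maps = np.ones((1, 4, 4))
weights = np.full((1, 3, 3), 0.1)
result = conv2d_unit(prev_maps, weights, bias=0.0)
print(result.shape)  # (2, 2)
```

Every output value in this toy case equals $\tanh(0.9)$, since nine weights of $0.1$ are summed over a patch of ones.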

In a 2D CNN, the convolution is applied only to 2D feature maps and therefore computes features from the spatial dimensions alone. For video analysis problems, it is also necessary to capture the motion information encoded in consecutive frames. For this purpose, we perform a 3D convolution in the convolutional stage of the CNN, which captures features from the temporal and spatial dimensions simultaneously. The 3D convolution is achieved by convolving a 3D kernel with the cube formed by stacking several consecutive frames. With this construction, a feature map in the convolutional layer is connected to multiple consecutive frames in the previous layer and thereby captures motion information. Formally, the value at position $(x, y, z)$ in the $j$-th feature map of the $i$-th layer is given by:
$$v_{ij}^{xyz}=\tanh\left(b_{ij}+\sum_m \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1}w^{pqr}_{ijm}v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right)$$
Here $R_i$ is the size of the 3D kernel along the time dimension, and $w^{pqr}_{ijm}$ is the weight at position $(p,q,r)$ of the kernel connected to the $m$-th feature map of the previous layer.
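
The 3D formula differs from the 2D one only by the extra sum over $r$. The following loop-based NumPy sketch evaluates it literally; it is meant as an illustration of the formula, with a hypothetical function name `conv3d_unit` and toy inputs, and not as the implementation used by Keras.

```python
import numpy as np

def conv3d_unit(prev_cubes, weights, bias):
    """Evaluate v_ij^{xyz} for every valid (x, y, z) of one output feature map.

    prev_cubes: (M, X, Y, Z) stacked-frame cubes of layer i-1
    weights:    (M, P, Q, R) one P x Q x R kernel per connected cube
    bias:       scalar b_ij
    """
    M, X, Y, Z = prev_cubes.shape
    _, P, Q, R = weights.shape
    out = np.zeros((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # Neighborhood spanning P x Q pixels and R consecutive frames
                patch = prev_cubes[:, x:x + P, y:y + Q, z:z + R]
                out[x, y, z] = np.tanh(bias + np.sum(weights * patch))
    return out

# Toy example: 7 frames of 6x6 pixels, one input channel, one 3x3x3 kernel
cube = np.ones((1, 6, 6, 7))
kernel = np.full((1, 3, 3, 3), 0.01)
out = conv3d_unit(cube, kernel, bias=0.0)
print(out.shape)  # (4, 4, 5)
```

The output shrinks along the time axis as well (7 frames become 5), because each output value mixes $R_i = 3$ consecutive frames.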

A 3D CNN can better capture the temporal and spatial information in a video. The following figure shows the 3D CNN convolution, which applies a 3D convolution kernel to an image sequence (video):
![3DCNN](figure/3D.PNG)
$$3D\ CNN$$
The temporal dimension of the convolution above is 3, that is, the convolution is performed on three consecutive frames. The 3D convolution stacks several consecutive frames into a cube and then applies a 3D convolution kernel inside the cube. In this structure, each feature map in the convolutional layer is connected to multiple adjacent frames in the previous layer and thus captures motion information. For example, in the left image above, the value at one position of a feature map is obtained by convolving the local receptive fields at the same position in three consecutive frames of the previous layer.

Note that a single 3D convolution kernel can extract only one type of feature from the cube, because the kernel weights are shared across the entire cube (connections with the same color in the figure share the same weight). To extract several kinds of features, we can use several convolution kernels.

### Our Model
![Model](figure/model.PNG)
This architecture consists of 8 layers including the input. There is one convolutional layer (C1), one rectification layer (R1) and one sub-sampling layer (S1), followed by four neuron layers (N1 to N4).
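
To see how large the feature volume is when it reaches the neuron layers, the shapes can be traced by hand. The sketch below uses the settings of this notebook ($64\times48\times24$ single-channel clips, $32$ filters of size $5\times5\times5$, pooling size $3$) and assumes the Keras defaults of "valid" padding, stride 1 for the convolution and a pooling stride equal to the pooling size. The helper names are ours.

```python
def conv_out(size, kernel):
    # 'valid' convolution with stride 1
    return size - kernel + 1

def pool_out(size, pool):
    # Non-overlapping pooling; incomplete windows at the border are dropped
    return size // pool

input_shape = (64, 48, 24)  # img_r, img_c, img_d
in_channels = 1
filters, kernel, pool = 32, 5, 3

after_conv = tuple(conv_out(s, kernel) for s in input_shape)
after_pool = tuple(pool_out(s, pool) for s in after_conv)
flat_size = after_pool[0] * after_pool[1] * after_pool[2] * filters
conv_params = filters * (kernel ** 3 * in_channels + 1)  # weights plus one bias per filter

print(after_conv)   # (60, 44, 20)
print(after_pool)   # (20, 14, 6)
print(flat_size)    # 53760
print(conv_params)  # 4032
```

Under these assumptions almost all trainable weights sit in the first fully connected layer, which receives 53760 inputs, rather than in the convolution itself.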

#### Swish activation function
Instead of the commonly used ReLU activation function, we tried a recently proposed activation function called ['swish'](https://arxiv.org/abs/1710.05941), which is defined as $$f(x)=x\cdot \text{sigmoid}(x),$$ and may perform even better than ReLU. The graph of swish is shown below:
![swish](figure/swish.png)
The swish function is unbounded above and bounded below, and it is its non-monotonic shape that actually sets it apart from ReLU. In our model, we apply this activation function to see how much it helps:

In [6]:
#Building the CNN model
model = Sequential()

def swish(x):
    return (K.sigmoid(x) * x)

get_custom_objects().update({'swish': Activation(swish)})

model.add(Conv3D(
        filters=filt[0],
        kernel_size = (5,5,5),
        input_shape=(img_r, img_c, img_d,1),
        activation='swish'
    ))
model.add(MaxPooling3D(pool_size=(pool[0], pool[0], pool[0])))
#model.add(Dropout(0.25))
model.add(Flatten())
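# Keras 2 names the initializer argument kernel_initializer (init is the Keras 1 spelling)
model.add(Dense(128, kernel_initializer='normal', activation='relu'))
#model.add(Dropout(0.25))
model.add(Dense(64, kernel_initializer='normal', activation='relu'))
#model.add(Dropout(0.25))
model.add(Dense(32, kernel_initializer='normal', activation='relu'))
#model.add(Dropout(0.25))
model.add(Dense(16, kernel_initializer='normal', activation='relu'))
model.add(Dense(classes, kernel_initializer='normal'))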
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='RMSprop', metrics=['mse', 'accuracy'])
print('Ready to Test')

Ready to Test
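
The properties of swish described above (bounded below, unbounded above, non-monotonic) can be checked numerically without Keras. This standalone sketch evaluates $f(x)=x\cdot\text{sigmoid}(x)$ on a grid and locates its minimum; the grid range and step are arbitrary choices made here.

```python
import math

def swish(x):
    # f(x) = x * sigmoid(x) = x / (1 + exp(-x))
    return x / (1.0 + math.exp(-x))

# Evaluate on a grid from -10 to 10 in steps of 0.01
xs = [i / 100.0 for i in range(-1000, 1001)]
ys = [swish(x) for x in xs]

y_min = min(ys)
x_at_min = xs[ys.index(y_min)]

# The minimum lies near x = -1.28 with a value of about -0.278
print(round(x_at_min, 2), round(y_min, 3))
# Non-monotonic: the function is lower at the minimum than further to the left
print(swish(-5.0) > y_min)  # True
```

Unlike ReLU, swish lets small negative inputs through with a small negative output and then returns toward zero for very negative inputs.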




### Experiment and Result
In this section, we verify the accuracy of the 3D CNN model on the KTH dataset. The training set contains $2156$ sequences and the held-out test set contains $240$.
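
These two sizes follow from the number of loaded clips and the `test_size=0.1` split. The sketch below reproduces the arithmetic, assuming 599 loaded videos (the count implied by the label slices) and relying on the fact that `train_test_split` rounds the test share up.

```python
import math

inflation = 4
n_videos = 599  # assumption: 100 + 99 + 4 * 100, as implied by the label slices
test_size = 0.1

n_clips = n_videos * inflation
n_test = int(math.ceil(test_size * n_clips))  # the test share is rounded up
n_train = n_clips - n_test

print(n_clips, n_train, n_test)  # 2396 2156 240
```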

In [7]:
#print(Y_train.shape)
#Split the data for Train and Test
X_train_new, X_val_new, y_train_new,y_val_new = train_test_split(train_set, Y_train, test_size=0.1, random_state=4)
#print(X_train_new.shape)
#print(y_train_new.shape)
#Training
hist = model.fit(
    X_train_new,
    y_train_new,
    validation_data=(X_val_new,y_val_new),
    batch_size=batch_size,
    epochs=epoch,  # Keras 2 name of the former nb_epoch argument
    shuffle=True
    )

#Testing
score = model.evaluate(
    X_val_new,
    y_val_new,
    batch_size=batch_size,
    #show_accuracy=True
    )

#print(model.metrics_names);

print('Test score:', score)

#print('History', hist.history)




Train on 2156 samples, validate on 240 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Test score: [2.586045242365005, 0.08206570617768945, 0.6916666666666667]
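
`model.evaluate` returns the loss first, followed by the metrics in the order given to `compile`, so the three numbers above are the cross-entropy loss, the mean squared error and the accuracy. The sketch below attaches labels to them (the label strings are chosen here for readability) and converts the accuracy into a number of correctly classified clips.

```python
score = [2.586045242365005, 0.08206570617768945, 0.6916666666666667]
# Labels chosen here; the order is loss, then the metrics passed to compile
metric_names = ['loss', 'mse', 'accuracy']
n_val = 240      # size of the held-out set
n_classes = 6

results = dict(zip(metric_names, score))
n_correct = int(round(results['accuracy'] * n_val))
chance_level = 1.0 / n_classes

print(n_correct)               # 166
print(round(chance_level, 3))  # 0.167
```

An accuracy of about $69\%$ therefore corresponds to 166 of the 240 held-out clips, well above the chance level of roughly $17\%$ for six classes.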


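After 25 epochs, the training accuracy is near $85\%$, while the accuracy on the held-out set is about $69\%$ (the last value of the test score above; the first two values are the loss and the mean squared error).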

### Conclusion

In our model, we adopted a 3D CNN to classify human activities and tried a recently proposed activation function, in order to verify that the model works. As a result, the training accuracy reaches $85\%$ and the validation accuracy reaches $70\%$, which means that our model can perform the classification with reasonable results. However, the accuracy can still be improved. Possible modifications include changing the model structure, for example by increasing the depth of the network, and enlarging the training dataset. Future work also includes selecting the training data more carefully and extending the model to multi-channel frames, which could provide more information for classification.

### Division of labor
#### Yu Mi:
+ Moving tensorflow from CPU to GPU
+ Finishing division of training dataset
+ Implementing and applying swish function
+ Training and modifying model
#### Boning Zhao:
+ Importing training dataset
+ Building initial model
+ Developing training and testing work flow
+ Text editing work