# Action Recognition @ UCF101  
**Due date: 11:59 pm on Mar. 2, 2018 (Friday)**

## Description
---
In this project, you will perform action recognition using a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network. You will be given a dataset called UCF101, which consists of 101 different actions/classes, each with multiple video samples. We tagged each sample as either training or testing. Each sample is supposed to be a short video, but we sampled 25 frames from each video to reduce the amount of data. Consequently, a training sample is a tuple of a 3D volume, with one dimension encoding the *temporal correlation* between frames, and a label indicating which action it is.

To tackle this problem, we aim to build a neural network that can not only capture spatial information of each frame but also temporal information between frames. Fortunately, you don't have to do this on your own. RNN — a type of neural network designed to deal with time-series data — is right here for you to use. In particular, you will be using LSTM for this task.

Training an end-to-end neural network from scratch is prohibitively expensive on CPUs, so we divide the task into two steps: feature extraction and modelling. Below are the things you need to implement for this homework:
- **{40 pts} Feature extraction**. Use the pretrained VGG network to extract features from each frame. Specifically, we recommend using the activations of the first fully connected layer of `torchvision.models.vgg16` (4096-dim) as the features of each video frame. This will result in a 4096x25 matrix for each video. (**hints**: use `scipy.io.savemat()` to save features to a '.mat' file and `scipy.io.loadmat()` to load them.)
- **{40 pts} Modelling**. With the extracted features, build an LSTM network which takes a 4096x25 sample as input, and outputs the action label of that sample.
- **{20 pts} Evaluation**. After training your network, you need to evaluate your model on the testing data by computing the prediction accuracy. Moreover, you need to compare the result of your network with that of a support vector machine (SVM) (stack the 4096x25 feature matrix into one long vector and train an SVM on it).
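
The stacking and accuracy steps of the evaluation can be sketched as follows. This is a minimal illustration on synthetic data, not a reference solution: the helper names `stack_features` and `accuracy` are hypothetical, the feature dimension is shrunk from 4096 to 8 to keep the example small, and the SVM itself is left to whichever library you choose.

```python
import numpy as np

def stack_features(feature_matrices):
    """Flatten each (frames x dims) feature matrix into one long row vector."""
    return np.stack([np.asarray(m).reshape(-1) for m in feature_matrices])

def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))

# Synthetic stand-ins: 3 videos, 25 frames each, 8-dim features (assumption:
# the real features would be 4096-dim, giving 25 * 4096 values per video).
rng = np.random.default_rng(0)
videos = [rng.standard_normal((25, 8)) for _ in range(3)]

X = stack_features(videos)
print(X.shape)  # one row of 25 * 8 = 200 values per video

# Hypothetical predictions vs. ground truth, just to exercise the metric.
acc = accuracy([1, 2, 2, 3], [1, 2, 3, 3])
print(acc)
```

The rows of `X` are what an SVM would be trained on, and the same `accuracy` function can score both the SVM and the LSTM so the two numbers are directly comparable.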

Notice that the size of the raw images is 256x340, whereas VGG16 takes 224x224 images as input. To solve this problem, instead of resizing the images, which unfavorably changes the aspect ratio, we take a better approach: crop five 224x224 images at the image center and the four corners, compute the 4096-dim VGG16 features for each of them, and average these five 4096-dim features to get the final feature representation of the raw image.
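
The five crop regions can be computed from the image size rather than hard-coded. The sketch below assumes the frames are 340 pixels wide and 256 pixels high and uses the `(left, upper, right, lower)` box convention of PIL's `Image.crop`; the function name `five_crop_boxes` is hypothetical.

```python
def five_crop_boxes(width, height, size):
    """Return (left, upper, right, lower) boxes for the four corners and the center."""
    if size > width or size > height:
        raise ValueError("crop size exceeds image size")
    right = width - size    # left edge of the right-hand crops
    bottom = height - size  # upper edge of the bottom crops
    cx = right // 2         # left edge of the centered crop
    cy = bottom // 2        # upper edge of the centered crop
    return [
        (0, 0, size, size),                  # top-left
        (right, 0, width, size),             # top-right
        (0, bottom, size, height),           # bottom-left
        (right, bottom, width, height),      # bottom-right
        (cx, cy, cx + size, cy + size),      # center
    ]

boxes = five_crop_boxes(340, 256, 224)
for box in boxes:
    print(box)
```

Each box can be passed directly to `Image.crop`, and the five resulting feature vectors are then averaged element-wise.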

To save you computation time, we have already extracted the features of most samples for you, except for class 1. For class 1, we provide the raw images, and you need to write code to extract the features of its samples. Besides, instead of training over the whole dataset on CPUs, which may cost you several days, you can use just the first 10 or 20 classes. Those who have access to GPUs can try more classes or even the whole dataset.

You may also notice that the dimension of the feature vector is high. You can reduce its dimension using PCA if you find that your SVM model overfits heavily or that your machine cannot afford the memory cost.
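
One way to apply PCA here is through a singular value decomposition of the centered training features. The sketch below is a minimal NumPy version on synthetic data; the function names, the sample counts, and the choice of 16 components are assumptions made for illustration only.

```python
import numpy as np

def pca_fit(X, n_components):
    """Learn the mean and the top principal directions from the rows of X."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project the rows of X onto the learned principal directions."""
    return (X - mean) @ components.T

# Synthetic stand-ins for stacked features (assumption: 200-dim, not 25 * 4096).
rng = np.random.default_rng(0)
train = rng.standard_normal((50, 200))
test = rng.standard_normal((10, 200))

# Fit on the training split only, then apply the same projection to both splits.
mean, comps = pca_fit(train, 16)
train_low = pca_transform(train, mean, comps)
test_low = pca_transform(test, mean, comps)
print(train_low.shape, test_low.shape)
```

Fitting on the training split alone keeps information from the test samples out of the projection.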


## Dataset
Download the dataset at [UCF101](http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_dimitris_course.zip). 

The dataset consists of the following two parts: video images and extracted features.

### 1. Video Images  

UCF101 dataset contains 101 actions and 13,320 videos in total.  

+ `annos/actions.txt`  
  + lists all the actions (`ApplyEyeMakeup`, .., `YoYo`)   
  
+ `annots/videos_labels_subsets.txt`  
  + lists all the videos (`v_000001`, .., `v_013320`)  
  + labels (`1`, .., `101`)  
  + subsets (`1` for train, `2` for test)  

+ `images_class1/`  
  + contains videos belonging to class 1 (`ApplyEyeMakeup`)  
  + each video folder contains 25 frames  
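
The train/test split can be read from `videos_labels_subsets.txt` with a small parser. The sketch below assumes each line holds three whitespace-separated fields (video name, label, subset); the exact file layout is not specified here, so check it against the real file. The sample rows and the function name `parse_annotations` are hypothetical.

```python
def parse_annotations(text):
    """Split annotation lines into (video, label) lists for train and test."""
    train, test = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        video, label, subset = parts[0], int(parts[1]), int(parts[2])
        if subset == 1:      # 1 marks a training sample
            train.append((video, label))
        elif subset == 2:    # 2 marks a testing sample
            test.append((video, label))
    return train, test

# Hypothetical rows for illustration; the real subset assignments may differ.
sample = "v_000001\t1\t1\nv_000002\t1\t2\n\nv_013320\t101\t2\n"
train_set, test_set = parse_annotations(sample)
print(train_set)
print(test_set)
```

To restrict training to the first 10 or 20 classes, filter both lists on the label before loading any features.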


### 2. Video Features 

+ `vgg16_relu6/`  
   + contains all the video features, EXCEPT those belonging to class 1 (`ApplyEyeMakeup`) 


## Some Tutorials
- Good materials for understanding RNN and LSTM
    - http://blog.echen.me
    - http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Implementing RNN and LSTM with PyTorch
    - [LSTM with PyTorch](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py)
    - [RNN with PyTorch](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)
- PyTorch tutorial
    - [60 min Blitz](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)
    
We also provide you with some framework code for feature extraction. You can complete and modify it accordingly or write from scratch. For any functions used in this framework code, please refer to [pytorch doc](http://pytorch.org/docs/0.3.1/) for details. If you are new to pytorch, you can also learn from this code.

In [62]:
# Feature extraction
import os
import sys
from PIL import Image
import scipy.io  # import the io submodule explicitly so savemat/loadmat are available

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.autograd import Variable

import glob

# load pretrained model: vgg16
model = torchvision.models.vgg16(pretrained=True)
model.eval()  # evaluation mode (Dropout, BatchNorm)

# use GPU ?
useGPU = False
if useGPU:
    model.cuda()

# register forward_hook_func to record layer's output
relu6_layer = model.classifier[1]
Buffer = {}
def relu6_hook(self, input, output):
    Buffer['relu6'] = output.data.clone()


# create folder to store features
arch = 'vgg16'
layer = 'relu6'
saveDir = arch + '_' + layer
if not os.path.exists(saveDir):
    os.makedirs(saveDir)

# define image preprocess
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
prep = transforms.Compose([ transforms.ToTensor(), normalize ])
#to_tensor = transforms.ToTensor()

# enumerate videos to extract features
bsdir = 'UCF101_release/'
videos = os.listdir(os.path.join(bsdir, 'images_class1/'))
for i, video in enumerate(videos):
    #print(bsdir+'images_class1/'+video)
    videodir = bsdir+'images_class1/'+video
    imgs = sorted(os.listdir(videodir))  # os.listdir order is arbitrary; sort to keep frames in temporal order
    for j, img in enumerate(imgs):
        
        cur = Image.open(videodir+'/'+img)
        # five 224x224 crop boxes (left, upper, right, lower):
        # top-left, top-right, center, bottom-left, bottom-right
        boxes = [(0,0,224,224), (116,0,340,224), (58,16,282,240),
                 (0,32,224,256), (116,32,340,256)]

        # register the hook once per frame and run all five crops through VGG16
        hook = relu6_layer.register_forward_hook(relu6_hook)
        crop_feats = []
        for box in boxes:
            pyvar = Variable(prep(cur.crop(box))).unsqueeze(0)
            if useGPU:
                pyvar = pyvar.cuda()
            model(pyvar)
            crop_feats.append(Buffer['relu6'].cpu())
        hook.remove()

        # average the five 1x4096 crop features into one 1x4096 frame feature
        finaltorch = torch.cat(crop_feats, 0).mean(0, keepdim=True)
        
        if j==0:
            temp = finaltorch
        else:
            temp = torch.cat((temp, finaltorch), 0)
            
    
    scipy.io.savemat(saveDir+"/"+video+".mat", {'feature':temp.numpy()})
    print("Done: "+video)
    #print(temp.size())
    #sys.exit()


Done: v_000001
Done: v_000002
Done: v_000003
...


In [90]:
class lstmMod(nn.Module):
    def __init__(self, ip_dim, hidden_dim, class_size):
        super(lstmMod, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(ip_dim, hidden_dim)
        self.hidden_to_tag = nn.Linear(hidden_dim, class_size)
        self.hidden = self.init_hidden()
    
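    def init_hidden(self):
        # nn.LSTM expects a (h_0, c_0) tuple, each of shape (num_layers, batch, hidden_dim)
        return (Variable(torch.zeros(1, 1, self.hidden_dim)),
                Variable(torch.zeros(1, 1, self.hidden_dim)))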

cur_lstm = nn.LSTM(4096,10)

tempdict = {}
inp = scipy.io.loadmat("vgg16_relu6/v_000001.mat")
cur_tor = torch.from_numpy(inp["feature"])
#dd = cur_tor.unsqueeze(0)
dd = cur_tor[:,None]
print(dd.size())
final_ip = torch.autograd.Variable(dd)
print(final_ip)


torch.Size([25, 1, 4096])
Variable containing:
( 0  ,.,.) = 
  0.0828  0.1058  0.0000  ...   0.0000  1.5073  0.8851

( 1  ,.,.) = 
  0.5592  0.0403  0.0000  ...   0.0000  1.4422  0.7082

( 2  ,.,.) = 
  0.5225  0.0685  0.0249  ...   0.0000  1.3261  0.9588
 ... 

( 22 ,.,.) = 
  0.3575  0.3818  0.0000  ...   0.0000  1.0930  1.0707

( 23 ,.,.) = 
  0.3264  0.2589  0.0000  ...   0.0000  1.0927  0.7916

( 24 ,.,.) = 
  0.2116  0.2557  0.0000  ...   0.0000  0.7866  1.2178
[torch.FloatTensor of size 25x1x4096]
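
With the input shaped as `(seq_len, batch, feature_dim)`, a complete classifier and one training step might look like the sketch below. It is a minimal illustration, not the reference solution: it assumes a recent PyTorch release in which plain tensors replace `Variable`, uses random data with a 32-dim input in place of the 4096-dim features, and the class name `LSTMClassifier` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, seq):
        # seq: (seq_len, batch, input_dim); the hidden state defaults to zeros
        out, _ = self.lstm(seq)
        # classify from the LSTM output at the last time step
        return self.fc(out[-1])  # (batch, num_classes)

torch.manual_seed(0)
model = LSTMClassifier(input_dim=32, hidden_dim=16, num_classes=10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random stand-ins: 4 videos of 25 frames each, with 0-based class labels.
seq = torch.randn(25, 4, 32)
labels = torch.tensor([0, 3, 3, 9])

# One training step: forward pass, loss, backward pass, parameter update.
logits = model(seq)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

pred = logits.argmax(dim=1)  # predicted class index per video
print(tuple(logits.shape), loss.item())
```

Note that `nn.CrossEntropyLoss` expects class indices starting at 0, so the dataset labels (`1` to `101`) need to be shifted by one before training.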



## Submission
---
**Runnable source code and a report are required**.

The report should be 4 to 6 pages long (depending on the number of images and experimental settings), describing what you have done in this project and reporting the performance of your model. If you have tried multiple methods, please compare your results. If you are using any external code, please indicate it in your report.