# Action Recognition @ UCF101  
**Due date: 11:59 pm on Nov. 19, 2019 (Tuesday)**

## Description
---
In this homework, you will perform action recognition using a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network. You will be given a dataset called UCF101, which consists of 101 different actions/classes, and for each action there will be 145 samples. We tagged each sample as either training or testing. Each sample is originally a short video, but we sampled 25 frames from each video to reduce the amount of data. Consequently, a training sample is a tuple of images forming a 3D volume, with one dimension encoding the *temporal correlation* between frames, together with a label indicating the action.

To tackle this problem, we aim to build a neural network that can not only capture spatial information of each frame but also temporal information between frames. Fortunately, you don't have to do this on your own. RNN — a type of neural network designed to deal with time-series data — is right here for you to use. In particular, you will be using LSTM for this task.

Instead of training an end-to-end neural network from scratch whose computation is prohibitively expensive, we divide this into two steps: feature extraction and modelling. Below are the things you need to implement for this homework:
- **{35 pts} Feature extraction**. Use any of the [pre-trained models](https://pytorch.org/docs/stable/torchvision/models.html) to extract features from each frame. We recommend not using the activations of the last layer, because features tend to become task-specific towards the end of the network. 
    **hints**: 
    - A good starting point is a pre-trained VGG16 network (`torchvision.models.vgg16`). We suggest using the output of its first fully connected layer (4096-dim) as the features of each video frame. This will result in a 4096x25 matrix for each video. 
    - Normalize your images using `torchvision.transforms` 
    ```
    # The mean and std below are specific to ImageNet data
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    prep = transforms.Compose([ transforms.ToTensor(), normalize ])
    prep(img)
    ```
    More details of image preprocessing in PyTorch can be found at http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    
- **{35 pts} Modelling**. With the extracted features, build an LSTM network which takes a **dx25** sample as input (where **d** is the dimension of the extracted feature for each frame), and outputs the action label of that sample.
- **{20 pts} Evaluation**. After training your network, evaluate your model on the testing data by computing the prediction accuracy **(5 points)**. The baseline test accuracy for this data is 75%, and **10 points** out of 20 are for achieving test accuracy greater than the baseline. Moreover, you need to compare **(5 points)** the result of your network with that of a support vector machine (SVM), trained by stacking the **dx25** feature matrix into one long vector.
- **{10 pts} Report**. Details regarding the report can be found in the submission section below.
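
The stacking step mentioned in the Evaluation item can be sketched as follows. This is a minimal illustration on random data, assuming the features are stored as an array of shape `(num_videos, 25, d)`; the array names and the small value of `d` are hypothetical.

```python
import numpy as np

# Hypothetical toy sizes: 4 videos, 25 frames each, d = 8 features per frame.
num_videos, num_frames, d = 4, 25, 8
rng = np.random.default_rng(0)
features = rng.standard_normal((num_videos, num_frames, d))

# Stack the 25 per-frame feature vectors of each video into one long vector,
# so that every video becomes a single row that an SVM can consume.
flat = features.reshape(num_videos, num_frames * d)

print(flat.shape)
```

With row-major reshaping, the first `d` entries of each row are the features of frame 1, the next `d` those of frame 2, and so on.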

Notice that the size of the raw images is 256x340, whereas your pre-trained model might take **nxn** images as inputs. Instead of resizing the images, which unfavorably changes the aspect ratio, we use a better solution: crop five **nxn** images, one at the image center and four at the corners, compute the **d**-dim features for each of them, and average these five **d**-dim features to get the final feature representation of the raw image.
For example, VGG takes 224x224 images as inputs, so we take the five 224x224 croppings of the image, compute 4096-dim VGG features for each of them, and then take the mean of these five 4096-dim vectors to be the representation of the image.
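
The cropping and averaging described above can be sketched with plain NumPy. The function names are hypothetical, and `toy_features` is only a stand-in for a pre-trained network (it returns the per-channel mean, so `d = 3` here instead of 4096).

```python
import numpy as np

def five_crops(img, n):
    """Return the four corner crops and the center crop of an H x W x C image."""
    h, w = img.shape[:2]
    top, left = (h - n) // 2, (w - n) // 2
    return [
        img[:n, :n],                      # top-left
        img[:n, w - n:],                  # top-right
        img[h - n:, :n],                  # bottom-left
        img[h - n:, w - n:],              # bottom-right
        img[top:top + n, left:left + n],  # center
    ]

def toy_features(crop):
    # Hypothetical stand-in for a pre-trained network: per-channel mean (d = 3).
    return crop.reshape(-1, crop.shape[-1]).mean(axis=0)

# Random image with the raw frame size used in this homework (256 x 340).
img = np.random.default_rng(0).random((256, 340, 3))
crops = five_crops(img, 224)

# Average the five d-dim features into one representation of the frame.
frame_feature = np.mean([toy_features(c) for c in crops], axis=0)
print([c.shape for c in crops], frame_feature.shape)
```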

In order to save you computational time, you need to do the classification task only for **the first 25** classes of the whole dataset. The same applies to those who have access to GPUs. **Bonus 10 points for running and reporting on the entire 101 classes.**


## Dataset
Download the **dataset** from [UCF101](http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_images.tar) (image data for each video). The **annos folder**, which contains the video labels and the label-to-class-name mapping, is included in the uploaded assignment folder. 


UCF101 dataset contains 101 actions and 13,320 videos in total.  

+ `annos/actions.txt`  
  + lists all the actions (`ApplyEyeMakeup`, .., `YoYo`)   
  
+ `annos/videos_labels_subsets.txt`  
  + lists all the videos (`v_000001`, .., `v_013320`)  
  + labels (`1`, .., `101`)  
  + subsets (`1` for train, `2` for test)  

+ `images/`  
  + each folder represents a video
  + the video/folder name to class mapping can be found using `annos/videos_labels_subsets.txt`, e.g. `v_000001` belongs to class 1, i.e. `ApplyEyeMakeup`
  + each video folder contains 25 frames  
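
As a rough sketch of how the label file could be split into training and testing videos, the snippet below parses a small inline excerpt. It assumes three tab-separated columns (video name, label, subset), which is how the loading code in this notebook reads the file; the sample rows and the helper name `read_split` are hypothetical.

```python
import csv
import io

# Hypothetical excerpt in the layout described above:
# video name, label (1..101), subset (1 = train, 2 = test), tab-separated.
sample = "v_000001\t1\t1\nv_000002\t1\t2\nv_000003\t26\t1\n"

def read_split(handle, max_label=25):
    """Group (video name, label) pairs by subset, keeping labels 1..max_label."""
    split = {"train": [], "test": []}
    for name, label, subset in csv.reader(handle, delimiter="\t"):
        if int(label) > max_label:
            continue  # outside the first `max_label` classes
        key = "train" if int(subset) == 1 else "test"
        split[key].append((name, int(label)))
    return split

split = read_split(io.StringIO(sample))
print(split)
```

To read the real file, the `io.StringIO` object would be replaced by an open file handle.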



## Some Tutorials
- Good materials for understanding RNN and LSTM
    - http://blog.echen.me
    - http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Implementing RNN and LSTM with PyTorch
    - [LSTM with PyTorch](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py)
    - [RNN with PyTorch](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
cd '/content/drive/My Drive/Sharma_Gaurav_112680958_hw5/'

/content/drive/My Drive/Sharma_Gaurav_112680958_hw5


In [0]:
#!wget http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_images.tar

In [0]:
#!tar -xvf 'UCF101_images.tar'

images/
images/v_000001/
images/v_000001/i_0001.jpg
images/v_000001/i_0002.jpg
images/v_000001/i_0003.jpg
images/v_000001/i_0004.jpg
images/v_000001/i_0005.jpg
images/v_000001/i_0006.jpg
images/v_000001/i_0007.jpg
images/v_000001/i_0008.jpg
images/v_000001/i_0009.jpg
images/v_000001/i_0010.jpg
images/v_000001/i_0011.jpg
images/v_000001/i_0012.jpg
images/v_000001/i_0013.jpg
images/v_000001/i_0014.jpg
images/v_000001/i_0015.jpg
images/v_000001/i_0016.jpg
images/v_000001/i_0017.jpg
images/v_000001/i_0018.jpg
images/v_000001/i_0019.jpg
images/v_000001/i_0020.jpg
images/v_000001/i_0021.jpg
images/v_000001/i_0022.jpg
images/v_000001/i_0023.jpg
images/v_000001/i_0024.jpg
images/v_000001/i_0025.jpg
images/v_000002/
images/v_000002/i_0001.jpg
images/v_000002/i_0002.jpg
images/v_000002/i_0003.jpg
images/v_000002/i_0004.jpg
images/v_000002/i_0005.jpg
images/v_000002/i_0006.jpg
images/v_000002/i_0007.jpg
images/v_000002/i_0008.jpg
images/v_000002/i_0009.jpg
images/v_000002/i_0010.jpg
images/v_0000

In [0]:
# import packages here
import cv2
import numpy as np
import matplotlib.pyplot as plt
import glob
import random 
import time
import datetime
from scipy import stats

import torch
import torchvision
import torchvision.transforms as transforms

from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import pandas as pd
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

from sklearn.utils import shuffle

---
---
## **Problem 1.** Feature extraction

In [0]:
from datetime import datetime
def load_dataset(model,batch_num=1, shuffle=False):
    
    model = model.cuda()
    train_data = []
    train_labels = []

    test_data = []
    test_labels = []


    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    prep = transforms.Compose([ transforms.ToTensor(), normalize])
    
    path = './images/'
    annos = pd.read_csv('./annos/videos_labels_subsets.txt', sep="\t",header = None)
    annos = np.array(annos)
    class_names = annos[:,0]

    w,h = 224,224


    with torch.no_grad():
      for i in range(0,3361):
        if i%20 == 0:
          current_time = datetime.now()
          print("\nIn folder %s\n", path + class_names[i], " at time: ", current_time)


        # Sort so that frames are processed in temporal order (glob order is arbitrary)
        img_path_class = sorted(glob.glob(path + class_names[i] + '/*.jpg'))

        
        count = 1
        pntr = 0
        batch_size = 125
        stacked_tensor = (torch.empty([batch_size,3,h,w], dtype=torch.float)).cuda()

        for filename in img_path_class:
              
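          # cv2.imread returns BGR; convert to RGB to match the ImageNet mean/std order
          img = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)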
          [img_row,img_col,channels] = np.shape(img)
          
          crop_img_1 = torch.Tensor(prep(img[:h, :w])).float().cuda()
          #crop_img_1 = crop_img_1.permute(2,0,1).float()

          crop_img_2 = torch.Tensor(prep(img[img_row-h:, :w])).float().cuda()
          #crop_img_2 = crop_img_2.permute(2,0,1).float()


          crop_img_3 = torch.Tensor(prep(img[:h, img_col-w:])).float().cuda()
          #crop_img_3 = crop_img_3.permute(2,0,1).float()


          crop_img_4 = torch.Tensor(prep(img[img_row-h:, img_col-w:])).float().cuda()
          #crop_img_4 = crop_img_4.permute(2,0,1).float()


          crop_img_5 = torch.Tensor(prep(img[(img_row-h)//2:(img_row - (img_row-h)//2), (img_col-w)//2:(img_col - (img_col-w)//2)])).float().cuda()
          #crop_img_5 = crop_img_5.permute(2,0,1).float()


          stacked_tensor[pntr:pntr+5,:,:,:] = (torch.stack([crop_img_1,crop_img_2,crop_img_3,crop_img_4,crop_img_5]))
          pntr += 5 

          if annos[i,2] == 2: # 2 means it is test data
            test_labels.append(annos[i,1])

          else:
            train_labels.append(annos[i,1])

          '''
          mean_out_vggs = (stacked_tensor[0,:] + stacked_tensor[1,:] + stacked_tensor[2,:] + stacked_tensor[3,:] + stacked_tensor[4,:])/5

          if annos[i,2] == 2: # 2 means it is test data
            test_labels.append(annos[i,1])
            test_data.append(mean_out_vggs)

          else:
            train_labels.append(annos[i,1])
            train_data.append(mean_out_vggs)
          '''
          #print("image %d done"%(count))
          #count+=1
        #stacked_tensor = torch.Tensor(stacked_tensor)
        stacked_tensor_out = model(stacked_tensor)

        for j in range(0,batch_size,5):
          mean_out_vggs = (stacked_tensor_out[j,:] + stacked_tensor_out[j+1,:] + stacked_tensor_out[j+2,:] + stacked_tensor_out[j+3,:] + stacked_tensor_out[j+4,:])/5

          if annos[i,2] == 2: # 2 means it is test data
            test_data.append(mean_out_vggs)

          else:
            train_data.append(mean_out_vggs)         

    return([train_data,train_labels,test_data,test_labels])



In [0]:
import torchvision.models as models

vgg16 = models.vgg16(pretrained=True)
# Keep only the first fully connected layer of the classifier (4096-dim output)
vgg16.classifier = nn.Sequential(*list(vgg16.classifier.children())[:-6])
[train_data_set,train_label_set,test_data_set,test_label_set] = load_dataset(vgg16)


In folder %s
 ./images/v_000001  at time:  2019-11-22 10:28:20.782643

In folder %s
 ./images/v_000021  at time:  2019-11-22 10:30:44.100761

In folder %s
 ./images/v_000041  at time:  2019-11-22 10:33:39.637369

In folder %s
 ./images/v_000061  at time:  2019-11-22 10:36:58.322371

In folder %s
 ./images/v_000081  at time:  2019-11-22 10:40:26.366408

In folder %s
 ./images/v_000101  at time:  2019-11-22 10:43:55.327236

In folder %s
 ./images/v_000121  at time:  2019-11-22 10:47:22.798022

In folder %s
 ./images/v_000141  at time:  2019-11-22 10:50:50.679376

In folder %s
 ./images/v_000161  at time:  2019-11-22 10:53:41.303933

In folder %s
 ./images/v_000181  at time:  2019-11-22 10:56:32.589021

In folder %s
 ./images/v_000201  at time:  2019-11-22 11:00:06.128317

In folder %s
 ./images/v_000221  at time:  2019-11-22 11:03:37.436804

In folder %s
 ./images/v_000241  at time:  2019-11-22 11:07:08.794713

In folder %s
 ./images/v_000261  at time:  2019-11-22 11:10:35.605146

In fo

In [0]:
'''
import pickle
file = open("./train_data.pkl",'wb')
pickle.dump(train_data_set,file)
file.close()
file = open("./train_label.pkl",'wb')
pickle.dump(train_label_set,file)
file.close()
file = open("./test_data.pkl",'wb')
pickle.dump(test_data_set,file)
file.close()
file = open("./test_label.pkl",'wb')
pickle.dump(test_label_set,file)
file.close()
'''


In [0]:
import pickle
file=open('./train_data.pkl', 'rb')
train_data = pickle.load(file)
file.close()

file=open('./train_label.pkl', 'rb')
train_label = pickle.load(file)
file.close()


In [0]:
import pickle
file=open('./test_data.pkl', 'rb')
test_data = pickle.load(file)
file.close()

file=open('./test_label.pkl', 'rb')
test_label = pickle.load(file)
file.close()

In [0]:
############## training data processing from pickle data ###############

n_train_frames = len(train_data)  # one 4096-dim feature per frame (60225 here)
train_data_np = np.zeros(shape=[n_train_frames,4096])

for i in range(n_train_frames):
  train_data_np[i,:]  = train_data[i].cpu().numpy()

# Group consecutive frames into videos of 25 frames each
train_data_np_batch = np.zeros(shape=[n_train_frames//25,25,4096])

pntr = 0
for i in range(n_train_frames//25):
  train_data_np_batch[i,:,:]  = train_data_np[pntr:pntr+25,:]
  pntr += 25


train_label_np = np.array(train_label)

train_label_np_batch = np.zeros(shape=[n_train_frames//25,25])

pntr = 0
for i in range(n_train_frames//25):
  train_label_np_batch[i,:]  = train_label_np[pntr:pntr+25]
  pntr += 25


X,y = shuffle(train_data_np_batch,train_label_np_batch)
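
The frame-to-video grouping done by the loops above can also be written as a single reshape. The sketch below uses small hypothetical arrays (`d = 8` rather than 4096) and assumes that frames are stored in video order with exactly 25 frames per video.

```python
import numpy as np

frames_per_video, d = 25, 8   # d is kept small here; the notebook uses 4096
num_videos = 3

# Hypothetical per-frame features and labels, stored in video order.
frame_feats = np.arange(num_videos * frames_per_video * d, dtype=np.float32)
frame_feats = frame_feats.reshape(num_videos * frames_per_video, d)
frame_labels = np.repeat([1, 2, 3], frames_per_video)

# One reshape groups consecutive frames into videos: (videos, 25, d).
video_feats = frame_feats.reshape(-1, frames_per_video, d)

# All frames of a video share a label, so keep the first one per video.
video_labels = frame_labels.reshape(-1, frames_per_video)[:, 0]

print(video_feats.shape, video_labels)
```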

In [0]:
############## testing data processing from pickle data ###############

test_data_np = np.zeros(shape=[len(test_data),4096])

for i in range(len(test_data)):
  test_data_np[i,:]  = torch.Tensor.cpu(test_data[i])

test_data_np_batch = np.zeros(shape=[len(test_data)//25,25,4096])

pntr = 0
for i in range(len(test_data)//25):
  test_data_np_batch[i,:,:]  = test_data_np[pntr:pntr+25,:]
  pntr += 25

test_label_np = np.array(test_label)

test_label_np_batch = np.zeros(shape=[len(test_data)//25,25])

pntr = 0
for i in range(len(test_data)//25):
  test_label_np_batch[i,:]  = test_label_np[pntr:pntr+25]
  pntr += 25

In [0]:
del train_data
del test_data
del train_label
del test_label
del train_data_np
del test_data_np
del train_label_np
del test_label_np

***
***
## **Problem 2.** Modelling

* ##### **Print the size of your training and test data**

In [10]:
# Don't hardcode the shape of train and test data
print('Shape of training data is :', np.shape(X))
print('Shape of test/validation data is :', np.shape(test_data_np_batch))

Shape of training data is : (2409, 25, 4096)
Shape of test/validation data is : (952, 25, 4096)


In [0]:
# *Write your code for modelling using the extracted features (you can use multiple cells; this is just a placeholder)


In [0]:
###################### LSTM class ####################

class LSTM(nn.Module):

    def __init__(self, input_dim, hidden_dim, batch_size, output_dim=1,
                    num_layers=2):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers

        # Define the LSTM layer
        self.lstm = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers)

        # Define the output layer
        self.linear = nn.Linear(self.hidden_dim, output_dim)

    def init_hidden(self):
        # This is what we'll initialise our hidden state as
        return (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim),
                torch.zeros(self.num_layers, self.batch_size, self.hidden_dim))

    def forward(self, input):
        # Forward pass through LSTM layer
        # shape of lstm_out: [seq_len, batch_size, hidden_dim]
        # shape of self.hidden: (h, c), where h and c both
        # have shape (num_layers, batch_size, hidden_dim).
        lstm_out, self.hidden = self.lstm(input.view(len(input), self.batch_size, -1))
        
        # Only take the output from the final timestep
        # Can pass on the entirety of lstm_out to the next layer if it is a seq2seq prediction
        y_pred = self.linear(lstm_out[-1].view(self.batch_size, -1))
        return y_pred.view(-1)

model = LSTM(4096, 200, batch_size=1, output_dim=25, num_layers = 1)
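
The class above processes one video at a time (`batch_size=1`). A possible batched variant is sketched below, assuming inputs of shape `(batch, 25, d)`; the class name `SequenceClassifier` and the toy sizes are hypothetical, and this is an illustration rather than the model that produced the results in this notebook.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """LSTM over frame features followed by a linear layer on the last step."""

    def __init__(self, input_dim, hidden_dim, num_classes, num_layers=1):
        super().__init__()
        # batch_first=True makes the input shape (batch, seq_len, input_dim).
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # lstm_out: (batch, seq_len, hidden_dim); initial states default to zeros.
        lstm_out, _ = self.lstm(x)
        # Classify from the output at the final time step.
        return self.linear(lstm_out[:, -1, :])

# Hypothetical toy sizes; the notebook uses input_dim=4096 and hidden_dim=200.
clf = SequenceClassifier(input_dim=16, hidden_dim=8, num_classes=25)
batch = torch.randn(4, 25, 16)   # 4 videos, 25 frames each
logits = clf(batch)
print(logits.shape)
```

The output has one row of class scores per video, so it can be passed to `nn.CrossEntropyLoss` together with a label tensor of shape `(batch,)`.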

In [13]:
################### Training #####################

criterion = nn.CrossEntropyLoss()
lr = 0.01
optimizer = optim.SGD(model.parameters(), lr)

start_time = time.time()

max_epoch = 10
model = model.cuda()

for epoch in range(max_epoch):
      torch.cuda.empty_cache()
      total, correct = 0, 0
      for i in range(len(X)):
          model.zero_grad()

          imgs = torch.Tensor(X[i,:,:]).float().cuda()
          labels = torch.Tensor([y[i,0]-1]).long().cuda()
          #imgs = torch.Tensor(X[i,:,:]).float()
          #labels = torch.Tensor(y[i,0:1] - 1).long()
        
          model.hidden = model.init_hidden()
          
          outputs = model(imgs)
          outputs = outputs[None,:]

          #print(outputs.shape)
          #print(labels.shape)
          loss = criterion(outputs, labels)
          
          optimizer.zero_grad()

          loss.backward()
          optimizer.step()

          _, predicted = torch.max(outputs.data, 1)

          total += labels.size(0)
          correct += (predicted == labels).sum().item()
          
      print("Epoch %d done"%(epoch), end=' || ')
      # Note: this prints the loss of the last sample in the epoch, not the epoch average
      print('Loss in this epoch: %f'%(loss.item()))
      accuracy_training = (correct/total)

print("\nTraining Done...\n")
print("Time consumption for training in seconds : %s seconds"%(time.time()-start_time))


Epoch 0 done || Loss in this epoch: 0.074263
Epoch 1 done || Loss in this epoch: 0.062681
Epoch 2 done || Loss in this epoch: 0.023163
Epoch 3 done || Loss in this epoch: 0.011352
Epoch 4 done || Loss in this epoch: 0.005188
Epoch 5 done || Loss in this epoch: 0.005598
Epoch 6 done || Loss in this epoch: 0.004407
Epoch 7 done || Loss in this epoch: 0.003889
Epoch 8 done || Loss in this epoch: 0.003494
Epoch 9 done || Loss in this epoch: 0.003200

Training Done...

Time consumption for training in seconds : 129.76984429359436 seconds
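
The loss printed above is the loss of the last sample seen in each epoch. One way to report a per-epoch mean instead, while also training on mini-batches, is sketched below. The data are random and all sizes and names are hypothetical, so the printed values say nothing about UCF101.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: 32 "videos", 25 frames, 16-dim features, 5 classes.
features = torch.randn(32, 25, 16)
labels = torch.randint(0, 5, (32,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

lstm = nn.LSTM(16, 8, batch_first=True)
head = nn.Linear(8, 5)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(3):
    running_loss, seen = 0.0, 0
    for xb, yb in loader:
        optimizer.zero_grad()
        out, _ = lstm(xb)
        loss = criterion(head(out[:, -1, :]), yb)
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size so the result is a per-sample mean.
        running_loss += loss.item() * yb.size(0)
        seen += yb.size(0)
    epoch_losses.append(running_loss / seen)
    print("Epoch %d mean loss: %f" % (epoch, epoch_losses[-1]))
```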


---
---
## **Problem 3.** Evaluation

In [0]:
# *Write your code for evaluation (you can use multiple cells; this is just a placeholder)

In [14]:
start_time = time.time()

model = model.cuda()
model.eval()  # evaluation mode (disables dropout-style layers, if any)

total = 0
correct = 0
with torch.no_grad():
      for i in range(len(test_label_np_batch)):

          imgs = torch.Tensor(test_data_np_batch[i,:,:]).float().cuda()
          labels = torch.Tensor([test_label_np_batch[i,0]-1]).long().cuda()

          outputs = model(imgs)
          outputs = outputs[None,:]


          _, predicted = torch.max(outputs.data, 1)
          #print('predicted = ',predicted)
          #print('actual = ',y[i,0])
          
          #print('for i = ',i)
          total += labels.size(0)
          correct += (predicted == labels).sum().item()
          accuracy_testing = correct/total
      

print("\nTesting Done...\n")
print("Time consumption for training in seconds : %s seconds"%(time.time()-start_time))
print('Testing Accuracy for epoch %d: %f'%((epoch + 1), (accuracy_testing) * 100))


Testing Done...

Time consumption for training in seconds : 2.120577335357666 seconds
Testing Accuracy for epoch 10: 79.936975
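
Besides the overall accuracy, a per-class breakdown can show which actions are confused. The following sketch computes both from a confusion matrix; the function name and the toy labels are hypothetical, and labels are assumed to be zero-based integers.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Return overall accuracy and an array of per-class accuracies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # confusion[i, j] counts samples of true class i predicted as class j.
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(confusion, (y_true, y_pred), 1)
    support = confusion.sum(axis=1)
    # Classes with no samples get accuracy 0 instead of a division by zero.
    per_class = np.divide(np.diag(confusion), support,
                          out=np.zeros(num_classes), where=support > 0)
    overall = np.trace(confusion) / confusion.sum()
    return overall, per_class

# Hypothetical labels and predictions for a 3-class toy problem.
overall, per_class = per_class_accuracy([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
print(overall, per_class)
```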


In [9]:
################### SVM MODEL #####################

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

clf = LinearSVC(C=0.0001, random_state = 1)

train_data_svm = np.reshape(X,(np.shape(X)[0],np.shape(X)[1]*np.shape(X)[2]))
train_label_svm = np.max(y,axis= 1)

test_X = np.reshape(test_data_np_batch,(np.shape(test_data_np_batch)[0],np.shape(test_data_np_batch)[1]*np.shape(test_data_np_batch)[2]))
test_Y = np.max(test_label_np_batch,axis= 1)

start_time =time.time()
clf.fit(train_data_svm, train_label_svm)
print("Time consumption for training: %s seconds" %(time.time()-start_time))

prediction_train =clf.predict(train_data_svm)
accuracy_train_svm = accuracy_score(prediction_train, train_label_svm)


start_time = time.time()
prediction_test =clf.predict(test_X)
print("Time consumption for testing: %s seconds" %(time.time()-start_time))

accuracy_test_svm = accuracy_score(prediction_test, test_Y)

print("Accuracy: ",accuracy_test_svm*100)

Time consumption for training: 522.8843989372253 seconds
Time consumption for testing: 0.27826905250549316 seconds
Accuracy:  83.40336134453781


* ##### **Print the train and test accuracy of your model** 

In [15]:
# Don't hardcode the train and test accuracy
print('Training accuracy is %2.3f' %(accuracy_training*100.00) )
print('Test accuracy is %2.3f' %(accuracy_testing*100.00) )

Training accuracy is 100.000
Test accuracy is 79.937


* ##### **Print the train and test accuracy of SVM** 

In [11]:
# Don't hardcode the train and test accuracy
print('Training accuracy is %2.3f' %(accuracy_train_svm*100.00) )
print('Test accuracy is %2.3f' %(accuracy_test_svm*100.00) )

Training accuracy is 100.000
Test accuracy is 83.403


## **Problem 4.** Report

## **Bonus**


* ##### **Print the size of your training and test data**

In [0]:
# Don't hardcode the shape of train and test data
print('Shape of training data is :', )
print('Shape of test/validation data is :', )

* ##### **Modelling and evaluation**

In [0]:
#Write your code for modelling and evaluation

## Submission
---
**Runnable source code in ipynb file and a pdf report are required**.

The report should be of 3 to 4 pages describing what you have done and learned in this homework and report performance of your model. If you have tried multiple methods, please compare your results. If you are using any external code, please cite it in your report. Note that this homework is designed to help you explore and get familiar with the techniques. The final grading will be largely based on your prediction accuracy and the different methods you tried (different architectures and parameters).

Please indicate clearly in your report what model you have tried, what techniques you applied to improve the performance and report their accuracies. The report should be concise and include the highlights of your efforts.
The naming convention for report is **Surname_Givenname_SBUID_report*.pdf**

When submitting your .zip file through blackboard, please
-- name your .zip file as **Surname_Givenname_SBUID_hw*.zip**.

This zip file should include:
```
Surname_Givenname_SBUID_hw*
        |---Surname_Givenname_SBUID_hw*.ipynb
        |---Surname_Givenname_SBUID_hw*.pdf
        |---Surname_Givenname_SBUID_report*.pdf
```

For instance, student Michael Jordan should submit a zip file named "Jordan_Michael_111134567_hw5.zip" for homework5 in this structure:
```
Jordan_Michael_111134567_hw5
        |---Jordan_Michael_111134567_hw5.ipynb
        |---Jordan_Michael_111134567_hw5.pdf
        |---Jordan_Michael_111134567_report*.pdf
```

The **Surname_Givenname_SBUID_hw*.pdf** should include a **google shared link**. To generate the **google shared link**, first create a folder named **Surname_Givenname_SBUID_hw*** in your Google Drive with your Stony Brook account. 

Then right-click this folder, click ***Get shareable link***, and in the People text field enter the two TAs' emails: ***bo.cao.1@stonybrook.edu*** and ***sayontan.ghosh@stonybrook.edu***. Make sure that TAs who have the link **can edit**, ***not just*** **can view**, and also **uncheck** the **Notify people** box.

Colab has a good feature of version control, you should take advantage of this to save your work properly. However, the timestamp of the submission made in blackboard is the only one that we consider for grading. To be more specific, we will only grade the version of your code right before the timestamp of the submission made in blackboard. 

You are encouraged to post and answer questions on Piazza. Based on the amount of email that we have received in past years, there might be delays in replying to personal emails. Please ask questions on Piazza and send emails only for personal issues.

Be aware that your code will undergo a plagiarism check both vertically and horizontally. Please do your own work.