<a href="https://colab.research.google.com/github/Vall98/Vall98.github.io/blob/master/ShortVideoClipsMemorabilityPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name, Surname: Valentin Lebon

Student ID: 19102232

Programme: ECSAOO

Module: 2019/2020 CA684 Machine Learning

# 1. Get data and define useful functions

### a. Import data in Google Colab

This code mounts the user's Google Drive in this IPython notebook and changes the working directory to Dev-set.

It assumes that the Dev-set directory is located directly under the "My Drive" directory.

Please also place the provided file "my_saved_model.h5" in a directory named "models" under the "My Drive" directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

import os
os.chdir("/content/drive/My Drive/Dev-set")

Mounted at /content/drive/


### b. Get ground truth
To evaluate our predictions against the ground truth, we first need to load it from the associated .csv file.

In [None]:
ground_truth_file = 'Ground-truth/ground-truth.csv'

(Optional) We are then able to examine the ground truth.

In [None]:
import pandas as pd

ground_truth = pd.read_csv(ground_truth_file)
ground_truth.head()
ground_truth.describe()


| | short-term_memorability | nb_short-term_annotations | long-term_memorability | nb_long-term_annotations |
|---|---|---|---|---|
| count | 6000.0 | 6000.0 | 6000.0 | 6000.0 |
| mean | 0.860243 | 36.2915 | 0.778942 | 12.764667 |
| std | 0.080655 | 8.356285 | 0.144692 | 3.544815 |
| min | 0.388 | 30.0 | 0.0 | 9.0 |
| 25% | 0.811 | 33.0 | 0.7 | 10.0 |
| 50% | 0.867 | 34.0 | 0.8 | 12.0 |
| 75% | 0.923 | 34.0 | 0.9 | 14.0 |
| max | 0.989 | 100.0 | 1.0 | 40.0 |


### c. Functions to read the selected features
These functions were provided for the assignment. They parse some of the provided files to extract the data related to the videos.

In [None]:
# data analysis packages
import pandas as pd  # imported again in case the optional cell above was skipped
import numpy as np

def read_C3D(fname):
    """Scan vectors from file"""
    C3D = []  # stays empty if the file has no lines
    with open(fname) as f:
        for line in f:
            C3D = [float(item) for item in line.split()]  # convert to float type, using default separator
    return C3D

def read_HMP(fname):
    """Scan HMP (Histogram of Motion Patterns) features from file"""
    HMP_temp = {}  # stays empty if the file has no lines
    with open(fname) as f:
        for line in f:
            pairs = line.split()
            # each pair has the form "bin_index:value", with 1-based bin indices
            HMP_temp = {int(p.split(':')[0]): float(p.split(':')[1]) for p in pairs}
    # there are 6075 bins, fill zeros
    HMP = np.zeros(6075)
    for idx in HMP_temp.keys():
        HMP[idx-1] = HMP_temp[idx]
    return HMP

def read_caps(fname):
    """Load the captions into a dataframe"""
    vn = []
    cap = []
    df = pd.DataFrame()
    with open(fname) as f:
        for line in f:
            pairs = line.split()
            vn.append(pairs[0])   # video file name
            cap.append(pairs[1])  # caption text
        df['video'] = vn
        df['caption'] = cap
    return df
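
To make the sparse HMP format concrete, here is a minimal, dependency-free sketch of the conversion that read_HMP performs on one line. The helper name `parse_sparse_line` and the sample line are hypothetical; only the 1-based `index:value` layout is taken from the function above.

```python
def parse_sparse_line(line, size):
    """Expand a line of 1-based 'index:value' pairs into a dense list of floats."""
    dense = [0.0] * size
    for pair in line.split():
        index, value = pair.split(':')
        dense[int(index) - 1] = float(value)  # indices in the file start at 1
    return dense

# Hypothetical 4-bin example (the real HMP files use 6075 bins).
print(parse_sparse_line("1:0.5 3:0.25", size=4))
```

Bins that are absent from the line simply stay at zero, which is why the dense vector always has the same length for every video.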

### d. Spearman Score
This function, provided for the assignment and modified to return the results, computes the Spearman score for short-term and long-term memorability from the predictions of the neural network and the ground truth.


In [None]:
def Get_score(Y_pred, Y_true):
    """Calculate Spearman's correlation coefficient"""
    result = []
    Y_pred = np.squeeze(Y_pred)
    Y_true = np.squeeze(Y_true)
    if Y_pred.shape != Y_true.shape:
        print('Input shapes don\'t match!')
    else:
        if len(Y_pred.shape) == 1:
            Res = pd.DataFrame({'Y_true':Y_true,'Y_pred':Y_pred})
            score_mat = Res[['Y_true','Y_pred']].corr(method='spearman',min_periods=1)
            print('The Spearman\'s correlation coefficient is: %.3f' % score_mat.iloc[1, 0])
            return '%.3f' % score_mat.iloc[1, 0]
        else:
            # one score per output column (short-term, then long-term)
            for ii in range(Y_pred.shape[1]):
                temp = Get_score(Y_pred[:,ii],Y_true[:,ii])
                if temp is not None:
                    result.append(temp)
    return result
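
As an illustration of what the Spearman score measures, the sketch below computes it by hand: both series are replaced by their ranks, and the Pearson correlation of the ranks is returned. The function names and toy values are hypothetical, and this is a dependency-free approximation of the idea, not the pandas implementation used by Get_score.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the group while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))

# Toy scores: the predictions swap the order of the two middle videos.
print(round(spearman([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.45, 0.1]), 3))
```

Because only the ranking matters, a model can obtain a high Spearman score even when its predicted values are offset from the true memorability scores.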

# 2. CAP Feature

### a. Load the feature's data
This feature has all its data in a single file. We extract the text caption from here.

In [None]:
df_cap=read_caps('Captions/dev-set_video-captions.txt')

We then remove the English stop words with the nltk (Natural Language ToolKit) library and the punctuation with the string library.

We create a counter that we are going to use later with the Keras Tokenizer.


In [None]:
"""Remove punctuation and stop words from the captions and initialize the Counter (used later to map each word to an integer index)."""

from string import punctuation
from collections import Counter

%pip install nltk
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords
common_words = set(stopwords.words('english'))

counts = Counter()

for i, cap in enumerate(df_cap['caption']):
    # replace punctuations with space
    # convert words to lower case
    text = ''.join([c if c not in punctuation else ' ' for c in cap]).lower()

    #remove common words
    querywords = text.split()
    text = ' '.join([word for word in querywords if word not in common_words]).lower()

    counts.update(text.split())



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We remove the words that appear only once in the counter, since a single occurrence is not enough to learn any relationship with short-term or long-term memorability. Furthermore, this reduces the risk of overfitting.

In [None]:
# keep only the words seen at least twice (loop variables renamed to avoid shadowing `counts`)
counts_captions = Counter({word: n for word, n in counts.items() if n >= 2})
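
The cleaning and filtering steps above can be tried on a few captions without downloading anything. In this sketch, `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK English stop-word list, and the three captions are made up for illustration.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOP_WORDS = {"a", "is", "the", "on", "in"}

def clean_caption(caption):
    """Replace punctuation with spaces, lower-case, and drop stop words."""
    text = "".join(" " if ch in punctuation else ch for ch in caption).lower()
    return [word for word in text.split() if word not in STOP_WORDS]

captions = [
    "blonde-woman-is-massaged",
    "woman-walking-on-the-beach",
    "dog-running-on-beach",
]

counts = Counter()
for caption in captions:
    counts.update(clean_caption(caption))

# Keep only the words that occur at least twice across all captions.
kept = Counter({word: n for word, n in counts.items() if n >= 2})
print(kept)
```

Only "woman" and "beach" survive here, since every other word occurs once.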

We use a tokenizer to convert the words in the counter into integers, calling its 'fit_on_texts' method to build an internal dictionary. This is required before using the 'texts_to_matrix' method.

That method transforms our data into a [one-hot-res matrix](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f), which provides better performance (see 6. Results).

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=len(counts_captions))

# Build the internal vocabulary (word -> integer index) from the retained words.
tokenizer.fit_on_texts(list(counts_captions))

# Convert the captions to a binary NumPy matrix: one row per caption, one column per word index.
one_hot_res_captions = tokenizer.texts_to_matrix(list(df_cap.caption.values), mode='binary')

# We don't need sequences if we use one_hot_res
#sequences = tokenizer.texts_to_sequences(list(df_cap.caption.values)) # word indices for each caption

one_hot_res_captions.shape

(6000, 3020)
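
To show what each row of this matrix contains, here is a simplified, dependency-free sketch of a binary encoding: every caption becomes a row with a 1 in the column of each vocabulary word it contains. The function `binary_matrix` is hypothetical and deliberately ignores Keras details such as its own index numbering and text filtering.

```python
def binary_matrix(token_lists, vocabulary):
    """Encode each token list as a 0/1 row over the given vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for token in tokens:
            if token in index:       # words outside the vocabulary are ignored
                row[index[token]] = 1
        matrix.append(row)
    return matrix

vocabulary = ["beach", "woman"]
captions = [["woman", "walking", "beach", "woman"], ["dog", "running"]]
print(binary_matrix(captions, vocabulary))
```

Repeated words still produce a single 1, and a caption made only of discarded words becomes an all-zero row.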

### b. Define the model
Here we define the model used for the Caption Feature. We use different regularizers to avoid over-fitting.

We use one layer with a number of nodes related to the size of the input (3020/10 => 302). See 6.Results to see how we ended up with this architecture.

In [None]:
def CAPModel():
  len_token = len(counts_captions)

  inputX = Input(shape=(len_token,))
  x = layers.Dense(int(len_token / 10), activation="relu", bias_regularizer=regularizers.l2(0.001), activity_regularizer=regularizers.l2(0.001), kernel_regularizer=regularizers.l2(0.001))(inputX)
  return Model(inputs=inputX, outputs=x)

# 3. C3D Feature

### a. Load the feature's data
To extract the data related to this feature, we need to parse one file per video. To do so, we use the video list generated while processing the Caption feature.

This process can take a long time to finish.

In [None]:
from tqdm.notebook import tqdm #load bar

def getC3D():
  C3D = []  # Python lists keep insertion order, so rows stay aligned with df_cap
  for video in tqdm(df_cap['video']):
    fname = video.split('.')[0] + '.txt'
    C3D.append(read_C3D('C3D/' + fname))
  return C3D

C3D = getC3D()





### b. Apply padding to the data
We apply padding to ensure that the extracted data has a homogeneous shape.

In [None]:
#padding C3D

C3D_len = max([len(i) for i in C3D])

C3D_seq = np.zeros((len(C3D), C3D_len))
for i in range(len(C3D)):
    n = len(C3D[i])
    if n==0:
        print(i)
    else:
        C3D_seq[i,-n:] = C3D[i]
C3D_seq.shape

(6000, 101)
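
The padding step can be sketched without NumPy as follows. Like the cell above, it pads on the left so that every row ends with its original values; the helper name `pad_left` and the sample rows are hypothetical.

```python
def pad_left(rows, fill=0.0):
    """Left-pad every row with `fill` so that all rows share the longest length."""
    width = max(len(row) for row in rows)
    return [[fill] * (width - len(row)) + list(row) for row in rows]

print(pad_left([[1.0, 2.0, 3.0], [4.0]]))
```

When all rows already have the same length, as the reported shape suggests is the case here, the function returns them unchanged.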

### c. Define the model
Here we define the model used for the C3D Feature. We use different regularizers to avoid over-fitting.

We use one layer with a number of nodes related to the size of the input (101/10 => 10). See 6.Results to see how we ended up with this architecture.

In [None]:
def C3DModel():

  inputX = Input(shape=(C3D_len,))
  x = layers.Dense(int(C3D_len / 10), activation="relu", bias_regularizer=regularizers.l2(0.001), activity_regularizer=regularizers.l2(0.001), kernel_regularizer=regularizers.l2(0.001))(inputX)
  return Model(inputs=inputX, outputs=x)

# 4. HMP Feature


### a. Load the feature's data
As with the C3D feature, extracting the data related to this feature requires parsing one file per video. To do so, we use the video list generated while processing the Caption feature.

This process can take a long time to finish.

In [None]:
def getHMP():
  HMP = []
  for video in tqdm(df_cap['video']):
    fname = video.split('.')[0] + '.txt'
    HMP.append(read_HMP('HMP/' + fname))
  return HMP

HMP = getHMP()






### b. Apply padding to the data
We apply padding to ensure that the extracted data has a homogeneous shape.

In [None]:
#padding HMP

HMP_len = max([len(i) for i in HMP])

HMP_seq = np.zeros((len(HMP), HMP_len))
for i in range(len(HMP)):
    n = len(HMP[i])
    if n==0:
        print(i)
    else:
        HMP_seq[i,-n:] = HMP[i]
HMP_seq.shape

(6000, 6075)

### c. Define the model
Here we define the model used for the HMP Feature. We use different regularizers to avoid over-fitting.

We use one layer with a number of nodes related to the size of the input (6075/10 => 607). See 6.Results to see how we ended up with this architecture.

In [None]:
def HMPModel():

  inputX = Input(shape=(HMP_len,))
  x = layers.Dense(int(HMP_len / 10), activation="relu", bias_regularizer=regularizers.l2(0.001), activity_regularizer=regularizers.l2(0.001), kernel_regularizer=regularizers.l2(0.001))(inputX)
  return Model(inputs=inputX, outputs=x)

# 5. Build Network and assemble features

### a. Define the model
Here we define the final model that we are going to use for our predictions. This model combines and encapsulates the models defined before. See 6. Results - Optimizing the model #2 for more details.

In [None]:
# Import everything from the public tensorflow.keras API rather than mixing it
# with the private tensorflow.python.keras package.
from tensorflow.keras import Input
from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras import regularizers
from tensorflow.keras import callbacks
from tensorflow.keras import optimizers

def finalModel(models):
  inputs = [] # Python lists keep the order
  outputs = []
  for model in models:
    inputs.append(model.input)
    outputs.append(model.output)
  combined = layers.concatenate(outputs)
  #z = layers.Dense(len(outputs) * 2, activation="relu", bias_regularizer=regularizers.l2(0.001), activity_regularizer=regularizers.l2(0.001), kernel_regularizer=regularizers.l2(0.001))(combined)
  z = layers.Dense(2, activation="sigmoid")(combined)
  return Model(inputs=inputs, outputs=z)
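
The final layer can be pictured with a small, dependency-free sketch: the branch outputs are concatenated into one vector, and each of the two output nodes applies a sigmoid to a weighted sum, which keeps both predictions between 0 and 1 like the memorability scores. The function names, weights, and branch values below are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def output_layer(branch_outputs, weights, biases):
    """Concatenate the branch outputs, then apply one sigmoid unit per weight row."""
    combined = [value for branch in branch_outputs for value in branch]
    return [
        sigmoid(sum(w * v for w, v in zip(row, combined)) + bias)
        for row, bias in zip(weights, biases)
    ]

# Three toy branches (standing in for CAP, C3D, HMP) with 2, 1 and 1 values.
branches = [[0.2, 0.0], [1.5], [0.7]]
weights = [[0.5, -0.3, 0.8, 0.1], [-0.2, 0.4, 0.1, 0.6]]  # one row per output node
biases = [0.0, 0.0]
print(output_layer(branches, weights, biases))
```

The two values correspond to the short-term and long-term outputs, in the order of the ground-truth columns.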

### b. Define the training and the test sets
Here we define the two sets (train and test) that we need to fit our model and to test it (See 1.d - Spearman Score).

- X_train is the data that we are going to feed our network with
- Y_train is the ground truth associated with the data in X_train
- X_test is the data that we are going to test our network with
- Y_test is the ground truth associated with the data in X_test

We split our video set so that we have 4800 videos to train on and 1200 videos reserved for testing the model.


In [None]:
from sklearn.model_selection import train_test_split

def getdata():
  GT = ground_truth[['short-term_memorability','long-term_memorability']].values
  CAP = one_hot_res_captions
  C3D = C3D_seq
  HMP = HMP_seq
  CAP_train, CAP_test, C3D_train, C3D_test, HMP_train, HMP_test, GT_train, GT_test = train_test_split(CAP, C3D, HMP, GT, test_size=0.2, random_state=42) #random_state for reproducibility

  X_train = [CAP_train, C3D_train, HMP_train]
  Y_train = GT_train
  X_test = [CAP_test, C3D_test, HMP_test]
  Y_test = GT_test
  return X_train, Y_train, X_test, Y_test
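
The important property of this split is that the same shuffled indices are applied to every array, so each feature row stays paired with its ground-truth row. The sketch below shows that idea with the standard library only; `aligned_split` is a hypothetical helper and does not reproduce the exact shuffling of scikit-learn's train_test_split.

```python
import random

def aligned_split(arrays, test_size=0.2, seed=42):
    """Split several equally long lists with one shared shuffle of the indices."""
    n = len(arrays[0])
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_test = int(round(n * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [
        ([array[i] for i in train_idx], [array[i] for i in test_idx])
        for array in arrays
    ]

# Toy data: feature "cap3" belongs with target 3, and so on.
features = [f"cap{i}" for i in range(10)]
targets = list(range(10))
(feat_train, feat_test), (tgt_train, tgt_test) = aligned_split([features, targets])
print(len(feat_train), len(feat_test))
print(list(zip(feat_test, tgt_test)))
```

With ten samples and test_size=0.2, eight samples go to training and two to testing, and every test feature is still matched with its own target.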

### c. Display the model history
To detect over-fitting, we plot the loss and the accuracy over the epochs, for both the training and the validation data.

In [None]:
import matplotlib.pyplot as plt

def plot_results(history):
  loss = history.history['loss']
  epochs = range(1,len(loss)+1)

  plt.figure() #loss plot
  val_loss = history.history['val_loss']
  plt.plot(epochs,loss,'bo',label='Training loss')
  plt.plot(epochs,val_loss,'b',label='Validation loss')
  plt.title('Training and validation loss')
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.legend()
  plt.show()

  plt.figure() #accuracy plot
  acc = history.history['accuracy']
  val_acc = history.history['val_accuracy']
  plt.plot(epochs, acc, 'bo', label='Training acc')
  plt.plot(epochs, val_acc, 'b', label='Validation acc')
  plt.title('Training and validation accuracy')
  plt.xlabel('Epochs')
  plt.ylabel('Acc')
  plt.legend()
  plt.show()

### d. Training the model

Here we define a function to train the model. We save the model's weights each time we train, so we can go back (in case we trained the model too much) or continue where we stopped.

In [None]:
def train(epochs_input, n):

  # compile the model
  model.compile(optimizer='rmsprop', loss='mse',metrics=['accuracy'])

  # This callback stops the training when the training loss ('loss') has not
  # improved for three consecutive epochs. Use monitor='val_loss' to watch the
  # validation loss instead.
  callback = callbacks.EarlyStopping(monitor='loss', patience=3)

  # training the model
  history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=epochs_input, callbacks=[callback])

  #display training history
  plot_results(history)

  test(f'my_model_{n}')

  model.save(f'../models/my_model_{n}.h5')
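
The patience mechanism used by the callback can be summarised with a short, dependency-free sketch. It is a simplified model of early stopping (no minimum delta, no weight restoring), and the function name and loss values are hypothetical.

```python
def stopping_epoch(losses, patience=3):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# The loss stops improving after epoch 3, so training halts three epochs later.
print(stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69], patience=3))
```

In this example the last value (0.69) is never reached, which shows that a short patience can end training just before a late improvement.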


### e. Test the model
We need to test the model to determine the prediction scores. We then create a .csv file associated with the model.

In [None]:
#from tensorflow.python.keras.utils.vis_utils import plot_model
import csv

def test(model_name):
  predictions = model.predict(X_test)
  results = Get_score(predictions, Y_test) #See 1.d Spearman Score
  with open(f'../models/{model_name}.csv', mode='w') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(['Memorability', 'Score'])
    csv_writer.writerow(['Short-Term', results[0]])
    csv_writer.writerow(['Long-Term', results[1]])
#  plot_model(model, to_file=f'../models/{model_name}.png', show_shapes=True, show_layer_names=True)

### f. (Optional) Script to train the model
This is the script we can use to train our model.
It prompts the user several times to choose the desired behaviour.

For a simpler script that only tests a pre-selected model and generates the .csv file with the results, see g. Test the saved model

In this script:

- First, the user can choose to load a model
- Next, the user can type a number. 0 will test the model, 1 or more will train the model the chosen number of times. In that case, a saved model is generated each time, with an associated .csv file.
- Finally, if the user chose to train the model, the script asks how many epochs to use per training.

In [None]:
X_train, Y_train, X_test, Y_test = getdata()
model = finalModel([CAPModel(), C3DModel(), HMPModel()])

n = int(input("number of the model to load (see MyDrive/models for a list - -1 for no model)\n"))
if (n >= 0):
  model.load_weights(f'../models/my_model_{n}.h5')
print("")

action = int(input("choose an action:\n0 => test the model - n => train the model n times\n"))
print("")
if (action <= 0):
  test(f'my_model_{n}')
else:
  epochs_input = int(input("number of epochs (>=10 && <=500)\n"))
  print("")
  if (epochs_input < 10):
    epochs_input = 10
  elif (epochs_input > 500):
    epochs_input = 500
  for i in range(0, action):
    n += 1
    train(epochs_input, n)


number of the model to load (see MyDrive/models for a list - -1 for no model)
22

choose an action:
0 => test the model - n => train the model n times
0

The Spearman's correlation coefficient is: 0.448
The Spearman's correlation coefficient is: 0.200


### g. Test the saved model
This script is the same as f. Script to train the model, but it doesn't require any user input.

Instead, the following actions are chosen:
- load the 'my_saved_model.h5' model
- test it

The 'my_saved_model.h5' file stores the weights that produced the best predictions among all the generated models.

In [None]:
X_train, Y_train, X_test, Y_test = getdata()
model = finalModel([CAPModel(), C3DModel(), HMPModel()])

model.load_weights(f'../models/my_saved_model.h5')
test('my_saved_model')

The Spearman's correlation coefficient is: 0.445
The Spearman's correlation coefficient is: 0.206


# 6. Results

This part reports the steps I took to complete this assignment.

## The first model
The first model I tried was the one provided as an example. It uses the captions feature.

After comparing the performance with and without "one-hot-res", I chose to transform my caption data into a "one-hot-res" matrix, which led the model to make better predictions.

## Choosing the features

I chose to implement three features based on the results of Rohit Gupta and Kush Motwani[[1]](http://ceur-ws.org/Vol-2283/MediaEval_18_paper_31.pdf) in the 2018 MediaEval competition on the same subject. Their best-performing features were, from best to worst: Captions, C3D, HMP, LBP.

I chose to use only the first three because their implementations were provided. Conversely, although the Color Histogram implementation was also provided, I chose not to use it because of its weak performance.

## Choosing a model

Because the first model was designed to treat only one feature, I had to change it.

I followed the method explained in this [tutorial](https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/). I created three models (one for each feature) to avoid mixing different kinds of input data, which also gives me the flexibility to choose a different architecture for each network. Then I joined their inputs and outputs in a final model. This resulted in three sub-networks within a single network.

- CAP-model, C3D-model and HMP-model describe the three feature models.
- Final-model describes the model where the other models are joined, producing the output.

Then I used the provided implementations of the different features (Captions, C3D, HMP) to extract the data concerning the videos, and I fed the model with it.

## Optimizing the model #1

I first tried several architectures before choosing one.

The first architectures I tried followed this layout:

- CAP-model, C3D-model and HMP-model had an input layer followed by three layers of n\*4, n\*2 and n nodes respectively, with n generally equal to 8.

- Final-model had two layers: one with (number of feature models \*2) nodes, and the output layer with 2 nodes.

## Optimizing the data

Like Rohit Gupta and Kush Motwani[[1]](http://ceur-ws.org/Vol-2283/MediaEval_18_paper_31.pdf), I chose to remove English stop words from the captions while generating the data for the Captions feature. I also chose to remove the words that appeared only once, since the model could not have learned a pattern from them.

The performance improved drastically, doubling the prediction accuracy.

## Optimizing the model #2

After conducting some research on the Internet, I found that ["One hidden layer is sufficient for the large majority of problems"](https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw).

For the number of nodes per layer, I chose, based on the same recommendations, to use (length of the input data / 2) nodes for CAP-model, C3D-model and HMP-model.

The architecture then followed this layout:
- an input layer, then one layer with (length of the input data / 2) nodes for CAP-model, C3D-model and HMP-model.
- Final-model was only composed of the output layer (2 nodes)

This architecture took a long time to train and overfitted very quickly (in fewer than ten epochs).

I decided to use the three types of [regularizers that Keras offers](https://keras.io/regularizers/) to address the overfitting. The performance of the network improved.

<br/>

I then decided to divide the size of the layers in the feature models by five. The combined inputs have a total length of 9196. Dividing the length of each input by two gives 4597 neurons, which was a lot for this task. After reducing the size of the layers, the model has 919 neurons.

Because the predictions improved, I kept this architecture, which follows this layout:
- an input layer, then one layer with (length of the input data / 10) nodes for CAP-model, C3D-model and HMP-model.
- Final-model was only composed of the output layer (2 nodes)
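
The neuron counts quoted above can be checked with a few lines of arithmetic. The input sizes come from the shapes printed earlier in the notebook; the dictionary and the helper `hidden_units` are only illustrative, and the integer division mirrors the `int(len / 10)` used in the model definitions.

```python
# Input lengths reported earlier: captions, C3D and HMP.
input_sizes = {"CAP": 3020, "C3D": 101, "HMP": 6075}

def hidden_units(divisor):
    """Hidden-layer size of each feature model for a given divisor."""
    return {name: size // divisor for name, size in input_sizes.items()}

total_inputs = sum(input_sizes.values())
half = hidden_units(2)    # first attempt: input length / 2
tenth = hidden_units(10)  # final choice: input length / 10

print(total_inputs, sum(half.values()), sum(tenth.values()))
print(tenth)
```

This reproduces the totals of 9196 inputs, 4597 hidden neurons with a divisor of 2, and 919 with a divisor of 10.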
