
In [0]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
 
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.models import Model, model_from_json
from tensorflow.keras.optimizers import Adam, RMSprop
from sklearn.model_selection import train_test_split

# Training deep autoencoders for collaborative filtering with the MovieLens dataset
Inspired by Zheda (Marco) Mai (RaptorMai): https://github.com/RaptorMai/Deep-AutoEncoder-Recommendation

## Data pre-processing:

1.   Clone the GitHub repository to retrieve the csv files of the MovieLens dataset.
2.   Read the csv file and keep the user id, movie id and rating columns.
3.   Split the data into train and test sets. The train set is split again to obtain a validation set. The same could be done by splitting the test data instead, since the data are not split by time. The validation set is used during training to see how the neural network behaves on data it has not been trained on.
4.   Create the sparse rating matrix R(i,j), filling the missing entries with different constants.
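
Step 4 can be sketched on a toy example. This is a minimal illustration, not the notebook's own helper: the function name `build_rating_matrix` and the toy triples are made up here, and it assumes zero-based user and item ids.

```python
import numpy as np

def build_rating_matrix(users, items, ratings, num_users, num_items,
                        init_value=0.0, avg=False):
    """Hypothetical helper: dense R(i,j) from (user, item, rating) triples."""
    users = np.asarray(users)
    items = np.asarray(items)
    ratings = np.asarray(ratings, dtype=float)

    # Mark which entries are observed, independently of the fill constant.
    observed = np.zeros((num_users, num_items), dtype=bool)
    observed[users, items] = True

    matrix = np.full((num_users, num_items), float(init_value))
    matrix[users, items] = ratings

    if avg:
        # Per-user mean of the observed ratings (0 for users with no rating).
        sums = np.where(observed, matrix, 0.0).sum(axis=1)
        counts = np.maximum(observed.sum(axis=1), 1)
        user_mean = sums / counts
        # Fill each missing entry with the mean of its row (user).
        matrix = np.where(observed, matrix, user_mean[:, None])
    return matrix

# Toy data: 3 users, 4 items, 5 known ratings.
users = [0, 0, 1, 2, 2]
items = [0, 2, 1, 0, 3]
ratings = [5, 3, 4, 2, 4]

R_zero = build_rating_matrix(users, items, ratings, 3, 4, init_value=0)
R_avg = build_rating_matrix(users, items, ratings, 3, 4, avg=True)
print(R_zero)
print(R_avg)
```

With `avg=False` and another constant, such as 3, only the fill value changes; the observed ratings are left untouched.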





In [3]:
! git clone https://github.com/cyprienmaes/Matrix_Completion_Algorithm.git

fatal: destination path 'Matrix_Completion_Algorithm' already exists and is not an empty directory.


In [4]:
def create_matrix_data(data, num_users, num_items, init_value=0, avg=False):
  """ Create a rating matrix, given the number of users and items.

      The matrix is indexed by the (zero-based) id of each user and each item.

      PARAMETERS:
      -----------
      data: pandas DataFrame.
          columns=['userID', 'itemID', 'rating' ...]
      num_users: int.
          number of users (matrix rows)
      num_items: int.
          number of items (matrix columns)
      init_value: float.
          constant placed in the missing entries
      avg: bool.
          if True, the missing entries are filled with the average rating of
          each user instead of the constant.

      RETURN:
      -------
      matrix: 2D numpy array.
          matrix R(i,j) fed to the autoencoder neural network.

  """
  if avg:
    matrix = np.full((num_users, num_items), 0.0)
    for (_, userID, itemID, rating, timestamp) in data.itertuples():
      matrix[userID, itemID] = rating
    average = np.true_divide(matrix.sum(1), np.maximum((matrix!=0).sum(1), 1))
    indx = np.where(matrix == 0)
    matrix[indx] = np.take(average, indx[0])
    
  else:
    matrix = np.full((num_users, num_items), float(init_value))
    for (_, userID, itemID, rating, timestamp) in data.itertuples():
      matrix[userID, itemID] = rating

  print("First row and 20 first columns :")
  print(matrix[0,0:20])
  return matrix

data = pd.read_csv('Matrix_Completion_Algorithm/ml1m_ratings.csv',sep='\t', encoding='latin-1', 
                      usecols=['user_emb_id', 'movie_emb_id', 'rating', 'timestamp'])

# The ids are zero-based, so the real size is the maximum id + 1.
num_users = data['user_emb_id'].unique().max() + 1
num_items = data['movie_emb_id'].unique().max() + 1

print('Data:')
print(data.head(5))

# 10% of the full data is used for testing.
# Similar in spirit to the Netflix Prize time-interval splits of the Deep-AutoRec paper.
# Stratifying on the user id splits each user's ratings proportionally between train and test.
train_data, test_data = train_test_split(data,
                                         stratify=data['user_emb_id'],
                                         test_size=0.1,
                                         random_state=999613182)

print('Train data:')
print(train_data.head(5))

# 10% of the train data is held out for validation and used to tune the
# parameters if necessary.
train_data, validate_data = train_test_split(train_data,
                                             stratify=train_data['user_emb_id'],
                                             test_size=0.1,
                                             random_state=999613182)

# Creating sparse matrix with different constants for missing entries.
train_zero = create_matrix_data(train_data, num_users, num_items, 0).T
train_one = create_matrix_data(train_data, num_users, num_items, 1).T
train_two = create_matrix_data(train_data, num_users, num_items, 2).T
train_three = create_matrix_data(train_data, num_users, num_items, 3).T
train_four = create_matrix_data(train_data, num_users, num_items, 4).T
train_five = create_matrix_data(train_data, num_users, num_items, 5).T
train_average = create_matrix_data(train_data, num_users, num_items, avg=True).T

validate = create_matrix_data(validate_data, num_users, num_items, 0).T
test = create_matrix_data(test_data, num_users, num_items, 0).T

Data:
   user_emb_id  movie_emb_id  rating  timestamp
0            0          1192       5  978300760
1            0           660       3  978302109
2            0           913       3  978301968
3            0          3407       4  978300275
4            0          2354       5  978824291
Train data:
        user_emb_id  movie_emb_id  rating  timestamp
339280         1997          1018       2  974679219
881301         5322          1293       4  960848231
926190         5597          2293       4  959214335
726548         4344            26       2  966273825
153039          983          3860       3  975115869
First row and 20 first columns :
[5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
First row and 20 first columns :
[5. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
First row and 20 first columns :
[5. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
First row and 20 first columns :
[5. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 

## Deep AutoEncoder model implementation
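
How the `layers` list is mapped onto encoder, latent and decoder layers in the cell below can be sketched without Keras. The helper `describe_layout` is hypothetical and only mirrors the index arithmetic of `deep_autoencoders_model` (`num_enc = int(len(layers)/2)`); `num_inputs` stands for `matrix.shape[1]`.

```python
def describe_layout(layers, num_inputs):
    """Hypothetical helper: list (role, units) in the order the layers are applied."""
    num_enc = len(layers) // 2  # same value as int(len(layers)/2)
    layout = [("encoder", units) for units in layers[:num_enc]]
    layout.append(("latent", layers[num_enc]))
    layout.append(("dropout", layers[num_enc]))   # dropout keeps the latent width
    layout += [("decoder", units) for units in layers[num_enc + 1:]]
    layout.append(("output", num_inputs))         # reconstructs the input vector
    return layout

print(describe_layout([512], 6040))
print(describe_layout([256, 512, 256], 6040))
```

A list of even length puts the latent layer just after the middle: `[512, 256, 512, 256]` gives two encoder layers, a 512-unit latent layer and a single decoder layer.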

In [0]:
def deep_autoencoders_model(matrix, layers, activation, last_activation, 
                            re_feeding, dropout, regularizer_encode, 
                            regularizer_decode):
  """ Construction of the deep autoencoder for collaborative filtering.

      The deep autoencoder is built from classical fully connected layers.
      The number of hidden layers can be chosen by the user, as can the
      activation function of each layer. A dropout rate can be applied
      after the latent layer to avoid overfitting. Moreover, the re-feeding
      algorithm is also available, with a chosen number of passes. The
      parameters are initialized with the Xavier initializer (or Glorot
      uniform), which is the default one in Keras.

      PARAMETERS:
      -----------
      matrix: 2D numpy array.
          Sparse matrix to complete.
      layers: List.
          each element is the number of neurons of a hidden layer
      activation: List.
          each element is the activation function of the corresponding hidden
          layer (the output layer uses last_activation).
      last_activation: str.
          activation function for the last dense layer
      re_feeding: int.
          number of passes through the layers (1 is a single forward pass,
          i.e. no re-feeding)
      dropout: float.
          dropout rate between 0 and 1
      regularizer_encode: float.
          L2 regularization factor for the encoder, added to the loss function.
      regularizer_decode: float.
          L2 regularization factor for the decoder, added to the loss function.

      RETURN:
      -------
      model: keras Model.
          the (uncompiled) model to use
  """

  x = [None]*(len(layers)+2)
  # Input
  input_layer = new_dense = Input(shape=(matrix.shape[1],), name='sparse_ratings')
  num_enc = int(len(layers)/2)

  # Encoder
  for i in range(num_enc):
    x[i] = Dense(layers[i], 
                 activation=activation[i],
                 name='encoded_layer{}'.format(i), 
                 kernel_regularizer=l2(regularizer_encode))
    
    
  # Latent layer
  x[num_enc] = Dense(layers[num_enc], 
                     activation=activation[num_enc], 
                     name='latent_layer', 
                     kernel_regularizer=l2(regularizer_encode))
  
  # Dropout rate
  x[num_enc+1] = Dropout(rate = dropout)
  
  # Decoder
  for i in range(num_enc+1,len(layers)):
    x[i+1] = Dense(layers[i], 
                   activation=activation[i], 
                   name='decoded_layerr{}'.format(i), 
                   kernel_regularizer=l2(regularizer_decode))
    
  # Output
  output_layer = x[len(layers)+1] = Dense(matrix.shape[1],
                                          activation=last_activation, 
                                          name='predict_ratings', 
                                          kernel_regularizer=l2(regularizer_decode))
  
  # Re-feeding algorithm
  for j in range(re_feeding):
    for layer in x:
      new_dense = layer(new_dense)

  model = Model(input_layer, new_dense)

  return model

def loss_function(y_true, y_pred):
  """ The loss function returns the value of the minimization objective.

      This loss is not directly implemented in Keras, as the empty entries of the
      matrix must not be taken into account. A mask with the same dimensions as
      y_true and y_pred is therefore used: it holds a 1 where the entry exists
      in y_true (non-zero rating) and a 0 otherwise.

      PARAMETERS:
      -----------
      y_true: 1D tensor.
          The user- or item-based vector of the input matrix.
      y_pred: 1D tensor.
          The user- or item-based vector predicted by the deep AutoEncoder.

      RETURN:
      -------
      loss : float.
          The objective function value at the considered iteration.
  """
  # mask
  mask = K.cast(K.not_equal(y_true, 0), K.floatx())

  # Loss
  squared_error = K.square(mask* (y_true - y_pred))
  loss = K.sum(squared_error, axis=-1) / K.maximum(K.sum(mask, axis=-1), 1)
  return loss

def rmse_function(y_true, y_pred):
  """ Computation of the RMSE normalized by the number of real entries.

      Like the loss function, the RMSE metric is not directly implemented in
      Keras, as the empty entries of the matrix must not be taken into account.
      The same mask is used (cf. loss_function). As the ratings in the output
      cannot fall outside the interval [1, 5], y_pred is clipped to this
      interval.

      PARAMETERS:
      -----------
      y_true: 1D tensor.
          The user- or item-based vector of the input matrix.
      y_pred: 1D tensor.
          The user- or item-based vector predicted by the deep AutoEncoder.

      RETURN:
      -------
      rmse : float.
          The RMSE metric value at the considered iteration.

  """
  # mask
  mask = K.cast(K.not_equal(y_true, 0), K.floatx())
  
  # clip
  y_pred = K.clip(y_pred, 1, 5)

  # RMSE
  squared_error = K.square(mask * (y_true - y_pred))
  rmse = K.sqrt(K.sum(squared_error, axis=-1) / K.maximum(K.sum(mask, axis=-1), 1))
  return rmse
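
The effect of the mask can be checked by hand on a small example. The NumPy functions below are an illustrative re-implementation of `loss_function` and `rmse_function` (the names `masked_mse` and `masked_rmse` and the toy values are assumptions made here), with 0 marking a missing rating.

```python
import numpy as np

def masked_mse(y_true, y_pred):
    """Per-row mean squared error over the observed (non-zero) entries."""
    mask = (y_true != 0).astype(float)
    squared_error = (mask * (y_true - y_pred)) ** 2
    # Rows without any rating divide by 1 instead of 0.
    return squared_error.sum(axis=-1) / np.maximum(mask.sum(axis=-1), 1)

def masked_rmse(y_true, y_pred):
    """Per-row RMSE over the observed entries, predictions clipped to [1, 5]."""
    return np.sqrt(masked_mse(y_true, np.clip(y_pred, 1, 5)))

y_true = np.array([[5.0, 0.0, 3.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])   # second row has no rating at all
y_pred = np.array([[4.0, 2.5, 5.0, 1.0],
                   [3.0, 3.0, 3.0, 3.0]])

print(masked_mse(y_true, y_pred))    # row 0: ((5-4)^2 + (3-5)^2) / 2 = 2.5
print(masked_rmse(y_true, y_pred))
```

The predictions at the two missing positions of the first row (2.5 and 1.0) have no influence on the result, which is the purpose of the mask.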

## Utility functions

In [0]:
def show_loss(history, skip):
  """ Show the loss function of the neural network history

  PARAMETERS:
  -----------
  history: keras model history.
      The history of the model.
  skip: int.
      The number of ignored iterations.
  """
  loss = history.history['loss']
  val_loss = history.history['val_loss']
  plt.plot(np.arange(skip, len(loss), 1), loss[skip:])
  plt.plot(np.arange(skip, len(loss), 1), val_loss[skip:])
  plt.title('model train vs validation loss')
  plt.ylabel('loss')
  plt.xlabel('epoch')
  plt.legend(['train', 'validation'], loc='best')
  plt.show()

def show_rmse(history, skip):
  """ Show the rmse function of the neural network history

    PARAMETERS:
    -----------
    history: keras model history.
        The history of the model.
    skip: int.
        The number of ignored iterations.
  """
  rmse = history.history['rmse_function']
  val_rmse = history.history['val_rmse_function']
  plt.plot(np.arange(skip, len(rmse), 1), rmse[skip:])
  plt.plot(np.arange(skip, len(val_rmse), 1), val_rmse[skip:])
  plt.title('model train vs validation masked_rmse')
  plt.ylabel('rmse')
  plt.xlabel('epoch')
  plt.legend(['train', 'validation'], loc='best')
  plt.show()

def load_model(name):
  """ Load a model (without history) from a json file.

  The weights are loaded from the H5 file of the same name.

  PARAMETER:
  ----------
  name: str.
      The name of the json file without extension.
  """
  # load json and create model
  with open('{}.json'.format(name), 'r') as model_file:
    loaded_model_json = model_file.read()
  loaded_model = model_from_json(loaded_model_json)
  # load weights into new model
  loaded_model.load_weights("{}.h5".format(name))
  print("Loaded model from disk")
  return loaded_model

def save_model(name, model):
  """ Save model without history inside a json file.

  Save weights in a H5 file.

  PARAMETER:
  ----------
  name: str.
      The name of the json file without extension.
  model: keras model.
  """
  # serialize model to JSON
  model_json = model.to_json()
  with open("{}.json".format(name), "w") as json_file:
      json_file.write(model_json)
  # serialize weights to HDF5
  model.save_weights("{}.h5".format(name))
  print("Saved model to disk")

def save_hist_model(name, hist_model):
  """ Save model history inside a npy file.

  PARAMETER:
  ----------
  name: str.
      The name of the npy file without extension.
  hist_model: keras model history.
  """
  np.save('{}.npy'.format(name), hist_model.history)
  print("Saved history model to disk")

def load_hist_model(name):
  """ Load model history from a npy file.

  PARAMETER:
  ----------
  name: str.
      The name of the npy file without extension.

  RETURN:
  -------
  history: dict.
      The history dictionary saved by save_hist_model.
  """
  history = np.load('{}.npy'.format(name), allow_pickle=True).item()
  print("Loaded history model from disk")
  return history
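
Since a Keras history is a plain dictionary of lists, the save/load round trip can be tried without training a model. This is a sketch under assumptions: the dictionary values are made up, and a temporary directory is used so that nothing is written next to the notebook.

```python
import os
import tempfile
import numpy as np

# Made-up history with the same shape as `History.history`.
fake_history = {'loss': [11.0, 8.1, 6.3], 'val_loss': [8.6, 6.1, 5.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'history.npy')
    # np.save wraps the dict in a 0-d object array, which is pickled on disk.
    np.save(path, fake_history)
    # allow_pickle=True is needed to read object arrays; .item() unwraps the dict.
    restored = np.load(path, allow_pickle=True).item()

print(restored['val_loss'])
```

The restored object is a dictionary, so it has to be indexed directly (`restored['loss']`) rather than through a `.history` attribute.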

## Model compilation
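
The `re_feeding` setting in the cell below controls how many times the same stack of layers is applied. A NumPy sketch of the idea follows, assuming a single hidden layer with `tanh` activations and random weights (all names and sizes here are illustrative, not the notebook's model). In `deep_autoencoders_model`, a value of 2 feeds the reconstruction back through the same, weight-sharing layers once more.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inputs, num_hidden = 6, 3

# One shared set of weights, reused on every pass.
W_enc = rng.normal(scale=0.5, size=(num_inputs, num_hidden))
W_dec = rng.normal(scale=0.5, size=(num_hidden, num_inputs))

def autoencoder(x):
    """One forward pass: encode to the latent size, then decode back."""
    return np.tanh(np.tanh(x @ W_enc) @ W_dec)

def forward(x, re_feeding=1):
    """Apply the same autoencoder re_feeding times in a row."""
    out = x
    for _ in range(re_feeding):
        out = autoencoder(out)
    return out

x = rng.normal(size=(2, num_inputs))
single_pass = forward(x, re_feeding=1)
re_fed = forward(x, re_feeding=2)

# Re-feeding twice is the same as passing the first output through again.
print(np.allclose(re_fed, autoencoder(single_pass)))
```

With `re_feeding=0` the loop never runs and the input comes back unchanged, so 1 is the value to use for a plain autoencoder.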

In [0]:
tf.compat.v1.disable_eager_execution()

layers = [512]
# layers = [512, 256, 512, 256]
dropout = 0.0
re_feeding = 1
activation = ['selu']
# activation = ['selu', 'selu', 'selu', 'selu']
last_activation = 'elu'
# last_activation = 'selu'
regularizer_encode = 0.0005
regularizer_decode = 0.0005

deep_ae_cf = deep_autoencoders_model(train_zero, 
                                     layers, 
                                     activation, 
                                     last_activation, 
                                     re_feeding,
                                     dropout, 
                                     regularizer_encode, 
                                     regularizer_decode)

deep_ae_cf.compile(optimizer=Adam(learning_rate=0.0001), loss=loss_function, metrics=[rmse_function])
deep_ae_cf.summary()


hist_deep_ae_cf = deep_ae_cf.fit(x=train_zero, y=train_zero,
                  epochs=500,
                  batch_size=256,
                  validation_data=(train_zero, validate), verbose=2)

show_loss(hist_deep_ae_cf,100)
show_rmse(hist_deep_ae_cf,100)

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sparse_ratings (InputLayer)  [(None, 6040)]            0         
_________________________________________________________________
latent_layer (Dense)         (None, 512)               3092992   
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
predict_ratings (Dense)      (None, 6040)              3098520   
Total params: 6,191,512
Trainable params: 6,191,512
Non-trainable params: 0
_________________________________________________________________
Train on 3952 samples, validate on 3952 samples
Epoch 1/500
3952/3952 - 2s - loss: 11.0831 - rmse_function: 2.2735 - val_loss: 8.6351 - val_rmse_function: 1.9609
Epoch 2/500
3952/3952 - 2s - loss: 8.1440 - rmse_function: 2.0482 - val_loss: 6.0911 
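
The parameter counts in the summary above can be reproduced by hand: a dense layer has one weight per input-output pair plus one bias per output. A short check, with `dense_params` as a helper name introduced here:

```python
def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

latent = dense_params(6040, 512)    # latent_layer
output = dense_params(512, 6040)    # predict_ratings
print(latent, output, latent + output)
```

The input size of 6040 is the number of users: the matrices are transposed before training, so each of the 3952 samples is one item's vector of user ratings.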

In [12]:
save_model('item_zero_256_512_256_Adam_09', deep_ae_cf)
save_hist_model('item_zero_256_512_256_Adam_09', hist_deep_ae_cf)

predict_deep = deep_ae_cf.predict(train_zero)
print(predict_deep[40,0:20])
deep_ae_cf.evaluate(train_zero, test)

Saved model to disk
Saved history model to disk
[3.9420488 3.4989903 3.31089   3.2247844 3.4711502 4.1326647 3.536756
 3.9780126 3.5331535 3.9746802 3.8072207 3.417676  3.4839902 3.4781642
 3.0274124 2.675367  4.3398833 4.2166114 3.2097883 3.242191 ]


[0.9656348262238599, 0.74664885]
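
The second value returned by `evaluate` is the metric averaged over the rows of the matrix, so it is a mean of per-row RMSEs rather than a single RMSE over all held-out ratings. The two can differ when rows hold different numbers of ratings, as this NumPy sketch with made-up values suggests (the function names are assumptions, and 0 again marks a missing rating).

```python
import numpy as np

def mean_of_row_rmse(y_true, y_pred):
    """Average of the per-row masked RMSEs."""
    mask = (y_true != 0).astype(float)
    err = (mask * (y_true - np.clip(y_pred, 1, 5))) ** 2
    row_rmse = np.sqrt(err.sum(axis=1) / np.maximum(mask.sum(axis=1), 1))
    return row_rmse.mean()

def global_rmse(y_true, y_pred):
    """One RMSE over every observed entry of the matrix."""
    mask = y_true != 0
    diff = y_true[mask] - np.clip(y_pred, 1, 5)[mask]
    return np.sqrt(np.mean(diff ** 2))

y_true = np.array([[4.0, 0.0, 0.0, 0.0],    # one rating
                   [5.0, 3.0, 1.0, 2.0]])   # four ratings
y_pred = np.array([[2.0, 3.0, 3.0, 3.0],
                   [5.0, 3.0, 1.0, 2.0]])

print(mean_of_row_rmse(y_true, y_pred))  # (2 + 0) / 2 = 1.0
print(global_rmse(y_true, y_pred))       # sqrt(4 / 5)
```

Rows without any held-out rating contribute an RMSE of 0 to the row average, which would also lower it, so the reported metric is best compared only with values computed the same way.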