
# Introduction:
At ****, we continuously strive to give better song recommendations to our users. As a part of this assignment, you will have to read the research paper, **[Long and Short-Term Recommendations with Recurrent Neural Networks](http://iridia.ulb.ac.be/~rdevooght/papers/UMAP__Long_and_short_term_with_RNN.pdf)**, and implement a recurrent neural network based collaborative filtering recommendation system. 

Collaborative filtering uses user-item interactions to recommend relevant items to a user, based on the interactions of similar users. There is growing interest in RNNs for collaborative filtering framed as sequence prediction. This paper shows how recurrent neural networks can be steered towards better short-term and long-term predictions.

The aim of this assignment is to reproduce **the SPS (short term prediction score) metric** on the test set, which is mentioned in **Table 3**. Please note that you are only expected to reproduce the **RNN-CCE** method, and only on the **Movielens** dataset.

---


# Dataset:
**Movielens 1 Million data**
- 1 Million user-movie interactions
- 6040 users
- 3706 unique movies

You are being provided with the preprocessed dataset. 

Each row of the data describes one user. The first element is a unique `user_id`; it is followed by alternating `movie_id` and `rating` values, one pair for each movie the user rated.

`i.e. user_id movie_id1 rating1 movie_id2 rating2 …`

For each user, the movies appear in chronological order of rating (by timestamp).
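To make the row layout concrete, here is a minimal sketch that splits one row into its parts. The sample row and the helper name `parse_row` are made up for illustration; they are not taken from the dataset.

```python
def parse_row(line):
    """Split 'user_id movie_id1 rating1 movie_id2 rating2 ...' into its parts."""
    values = [int(v) for v in line.split()]
    user_id = values[0]
    movie_ids = values[1::2]   # every second value, starting with the first movie
    ratings = values[2::2]     # the rating that follows each movie
    return user_id, movie_ids, ratings

# Hypothetical row: user 7 rated movies 12, 305 and 48, in that order.
user_id, movie_ids, ratings = parse_row("7 12 5 305 3 48 4")
print(user_id, movie_ids, ratings)
```

The two slices always have the same length as long as every movie id is followed by its rating.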

Note that you do not need to use the rating values in any way: you only have to predict which movie a user will rate next, based on the movies they rated before. If you find a way to use the ratings to improve the metric, such innovations are welcome.

**Training Data:** Movie interactions of 5040 users are provided for training purposes.

**Validation Data:** Movie interactions of 500 users are provided for validation purposes.

**Test Data:** Movie interactions of 500 users are provided for testing purposes.

---

# Submission Guidelines: 

1. You can use any programming language and Deep Learning framework of your choice for submission (preferred Keras).
2. Training Step: Create and train a model that matches the SPS metric that is mentioned in the paper. Also, the final model (or just model weights) should be saved so that it could be used for evaluating the metric later. The saved model (or model weights) should be named ‘best_model’ followed by a suitable extension, depending on the Deep Learning framework in use.
3. Model Loading Step: Load the saved model (or model weights). 
4. Prediction Step: Run the loaded model and evaluate it on the **test data**, printing the SPS metric. 
5. If you are using the GPU provided by Colab and training takes well over 1 hour, your implementation most likely has a problem.
6. Please ensure that the submitted code is well-commented and structured for easy comprehension.
7. Please put your Colab notebook and saved model (or model weights) file in a zipped file named after you. Also, share the Colab link along with the zipped file by email.
8. Please write the answers to the questions asked at the bottom of this notebook. 

---

# Evaluation Guidelines:

1. You are required to run your model on the test set of 500 users and print the SPS metric. The evaluation will be based on the reproducibility of the SPS metric as mentioned in the research paper. 



---
# NOTE:
The code provided with the paper is not easy to follow, and it is difficult to reproduce the results mentioned in the paper using it. We found it much easier to write our own code from scratch to reproduce those results.



---



We will be helping you with the preprocessed data. You just need to run the following few cells to download and unzip it. Once you have the data ready, you can take it on from there.  

# We wish you all the best! 😁



---



# Download and Unzip Data

The dataset for training and testing the model has been uploaded to Google Drive. Run the following cell to download the zipped data.  


In [None]:
!wget "https://drive.google.com/uc?export=download&id=1xUB9_bMs-PYnRbO7eedXFh0qS6P4bA-X" -O dataset_assignment.zip

--2022-05-16 00:51:29--  https://drive.google.com/uc?export=download&id=1xUB9_bMs-PYnRbO7eedXFh0qS6P4bA-X
Resolving drive.google.com (drive.google.com)... 142.250.148.102, 142.250.148.138, 142.250.148.139, ...
Connecting to drive.google.com (drive.google.com)|142.250.148.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-3s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/ksdla48ut19g57mb9vdeh6uf2rvc9je6/1652662275000/01098719071037536088/*/1xUB9_bMs-PYnRbO7eedXFh0qS6P4bA-X?e=download [following]
--2022-05-16 00:51:30--  https://doc-14-3s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/ksdla48ut19g57mb9vdeh6uf2rvc9je6/1652662275000/01098719071037536088/*/1xUB9_bMs-PYnRbO7eedXFh0qS6P4bA-X?e=download
Resolving doc-14-3s-docs.googleusercontent.com (doc-14-3s-docs.googleusercontent.com)... 142.250.159.132, 2607:f8b0:4001:c58::84
Connecting to doc-14-3s-docs.googleusercontent.com (doc-14-3s

Run the following command to unzip the zipped file. 



In [None]:
!unzip dataset_assignment.zip

Archive:  dataset_assignment.zip
  inflating: dataset_assignment/stats  
  inflating: dataset_assignment/test_set_sequences  
  inflating: dataset_assignment/train_set_sequences  
  inflating: dataset_assignment/val_set_sequences  


The unzipped folder contains the following files.


In [29]:
!ls dataset_assignment/

cat_rec_test.npy  LSTMRECModel.h5  test_set_sequences	val_set_sequences
cat_rec_val.npy   stats		   train_set_sequences


In [30]:
import os
os.getcwd()

'/content'

In [31]:
# To read the training data from the unzipped file-

'''
# train_set_sequences: File containing data of 5040 users and the movies they rated.
# val_set_sequences  : File containing data of 500  users and the movies they rated.
# test_set_sequences : File containing data of 500  users and the movies they rated.

# For each of the 3 files, data for each user is present in different lines, which follows the following format-
# user_id movie_id1 rating1 movie_id2 rating2 …

with open('dataset_assignment/train_set_sequences') as f:
   # do something
'''

In [41]:
import numpy as np
import random
from keras.utils import to_categorical
from keras.preprocessing import sequence
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM

def read_data(file_name):
  """
  Read one of the sequence files and split each row into movies and ratings.
  :param file_name: str --> name of a file inside dataset_assignment/
  :return: dict[int, list[int]] --> dict_seq, user_id -> movie ids in rating order
  :return: dict[int, list[int]] --> dict_rating, user_id -> ratings in the same order
  """
  dict_seq = {}
  dict_rating = {}
  with open('dataset_assignment/' + file_name) as f:
    for line in f:
      list_ = list(map(int, line.split()))
      if not list_:
        continue  # skip blank lines
      # Row layout: user_id movie_id1 rating1 movie_id2 rating2 ...
      dict_seq[list_[0]] = list_[1::2]
      dict_rating[list_[0]] = list_[2::2]
  return dict_seq, dict_rating

test = 'test_set_sequences'
train = 'train_set_sequences'
val = 'val_set_sequences'
seq_train, rat_train = read_data(train)
seq_test, rat_test = read_data(test)
seq_val, rat_val = read_data(val)


In [33]:
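# Example: a user sequence [1,3,4,567,7,4,2,24,6] has length 9 (indices 0..8).
# For each user we draw several cut points. The items up to and including the
# cut form the input, and the single item right after the cut is the target.
def creat_seqsamples_upd(dict_seq, min_window = 5, num_sampl = 5):
  """
  Draw random (input sub-sequence, next item) samples from each user's sequence.
  :param dict_seq: dict --> user id -> list of movie ids in rating order
  :param min_window: int --> smallest cut index, so every input has at least min_window+1 items
  :param num_sampl: int --> number of samples drawn per user
  :return: list[int], list[list[int]], list[list[int]] --> user, input_seq, recommendation
  Note: the cut points are random, so call random.seed(...) first for repeatable samples.
  """
  input1 = []  # user id of each sample
  output1 = [] # input sub-sequence of each sample
  output2 = [] # next item (the label to predict) of each sample
  for user in dict_seq:
    lis = dict_seq[user]
    seq_l = len(lis)
    if seq_l - 2 < min_window:
      continue  # too short to cut: random.randint would raise ValueError
    for j in range(num_sampl):
      indic = random.randint(min_window, seq_l-2)
      lis_x, lis_y = lis[0: indic+1], lis[indic+1:indic+2]
      output1.append(lis_x)
      output2.append(lis_y)
      input1.append(user)
  return input1, output1, output2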

In [34]:
usr_train, sam_seq_train, rec_train = creat_seqsamples_upd(seq_train)
usr_val, sam_seq_val, rec_val = creat_seqsamples_upd(seq_val)
usr_test, sam_seq_test, rec_test = creat_seqsamples_upd(seq_test)
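
The sampler above draws random cut points, so the validation and test samples change on every run. A deterministic split may be easier to compare against the paper. The sketch below assumes the protocol is "feed the first half of each sequence, then check the first item of the second half"; that is one reading of the SPS definition and should be checked against the paper. `split_half` is a hypothetical helper, not part of the original notebook.

```python
def split_half(dict_seq):
    """Return (users, inputs, targets) with one deterministic sample per user."""
    users, inputs, targets = [], [], []
    for user, items in dict_seq.items():
        cut = len(items) // 2
        if cut == 0:
            continue  # a sequence with fewer than two items cannot be split
        users.append(user)
        inputs.append(items[:cut])   # first half is fed to the model
        targets.append(items[cut])   # first item of the second half is the target
    return users, inputs, targets

# Toy data, made up for illustration.
toy = {1: [10, 11, 12, 13], 2: [20, 21, 22, 23, 24]}
print(split_half(toy))
```

Because there is exactly one sample per user, the resulting SPS is a per-user ratio rather than a per-sample one.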

In [35]:
def creat_num_data_upd(seq, rec, num_users=6040, num_items=3706, max_seq_len=10):
  """
  This function takes the input sequences and the recommendations in list form and returns
  the corresponding one-hot encoded arrays to be used by the RNN/LSTM model.
  :param :seq :list[list] --> seq_list
  :param :rec :list[list] --> rec_list
  :param :num_users :int --> number of distinct users (currently unused)
  :param :num_items :int --> number of items
  :param :max_seq_len :int --> maximum sequence length fed to the model
  :return :array --> categorical_label_x denoting sequence array
  :return :array --> categorical_label_y denoting label array [next prediction]
  """
  # pad_sequences keeps the last max_seq_len items and left-pads shorter sequences with 0
  pad_seq = sequence.pad_sequences(seq, maxlen= max_seq_len)
  categorical_label_x = to_categorical(pad_seq, num_classes=num_items)
  categorical_label_y = to_categorical(rec, num_classes=num_items)
  return categorical_label_x, categorical_label_y
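
One thing to watch in this encoding: `pad_sequences` pads with 0 by default, and one-hot encoding then turns each padded step into the vector for item 0. If the dataset uses id 0 for a real movie, padding and that movie look identical to the model. A minimal NumPy sketch of one possible workaround follows, in which padded steps become all-zero rows instead. The helper `one_hot_left_padded` is an assumption for illustration, not a Keras API.

```python
import numpy as np

def one_hot_left_padded(seqs, num_items, max_len):
    """One-hot encode the last `max_len` items of each sequence, left-padded with zero rows."""
    out = np.zeros((len(seqs), max_len, num_items), dtype=np.float32)
    for row, seq in enumerate(seqs):
        tail = np.asarray(seq[-max_len:], dtype=np.int64)  # keep the most recent items
        offset = max_len - len(tail)                       # leading steps stay all-zero
        out[row, offset + np.arange(len(tail)), tail] = 1.0
    return out

# Toy sequences, made up for illustration: the first one needs one step of padding.
x = one_hot_left_padded([[0, 2], [1, 2, 3, 0]], num_items=4, max_len=3)
print(x.shape, x[0, 0].sum())
```

All-zero steps can then be skipped by placing a Keras `Masking(mask_value=0.0)` layer in front of the LSTM.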

In [36]:
def batch_cat_upd(sam_seq_train, rec_train, step = 500):
  """
  This function encodes the data in chunks of `step` samples to limit peak memory
  during one-hot encoding.
  :param :list[list] --> sam_seq_train
  :param :list[list] --> rec_train
  :param :int --> step denoting batch size
  :return :array --> seq array from different batches combined
  :return :array --> label array from different batches combined
  """
  l = len(sam_seq_train)
  ite = int(np.ceil(l/step))
  seq, rec = [], []
  for i in range(0, ite):
    print(i)  # progress indicator
    start = i*step
    end = start + step  # slicing past the end is safe, so the last chunk needs no special case
    cat_seq_train, cat_rec_train = creat_num_data_upd(sam_seq_train[start:end], rec_train[start:end])
    seq.append(cat_seq_train)
    rec.append(cat_rec_train)
  seqf = np.concatenate(seq, axis=0)
  recf = np.concatenate(rec, axis=0)
  return seqf, recf
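
`batch_cat_upd` still concatenates every chunk, so the full one-hot array ends up in memory anyway. If memory is the bottleneck, one option is to encode batches lazily with a generator and hand them to training one at a time. The sketch below uses plain NumPy; `one_hot_batches` is a hypothetical helper that assumes item ids lie in `[0, num_items)` and that the sequences have already been padded or truncated to equal length.

```python
import numpy as np

def one_hot_batches(padded_seqs, targets, num_items, batch_size=500):
    """Yield (x, y) one-hot batches without building the full arrays."""
    padded_seqs = np.asarray(padded_seqs)
    targets = np.asarray(targets).reshape(-1)
    eye = np.eye(num_items, dtype=np.float32)  # row i is the one-hot vector of item i
    for start in range(0, len(padded_seqs), batch_size):
        stop = start + batch_size              # slicing clips at the end of the data
        yield eye[padded_seqs[start:stop]], eye[targets[start:stop]]

# Toy data, made up for illustration: 3 samples of length 2 over 4 items.
batches = list(one_hot_batches([[0, 1], [2, 3], [1, 1]], [2, 0, 3],
                               num_items=4, batch_size=2))
print([x.shape for x, _ in batches])
```

Each yielded pair has the same layout as the output of `creat_num_data_upd`, so only one batch of one-hot data is alive at a time.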

In [None]:
# Encoding the training set is memory-heavy; it is needed before the np.save cell below can run.
cat_seq_train, cat_rec_train = batch_cat_upd(sam_seq_train, rec_train)

In [None]:
cat_seq_test, cat_rec_test = batch_cat_upd(sam_seq_test, rec_test)

In [None]:
cat_seq_val, cat_rec_val = batch_cat_upd(sam_seq_val, rec_val)

In [None]:

#np.save("dataset_assignment/cat_usr_train.npy", cat_usr_train)
np.save("dataset_assignment/cat_seq_train.npy", cat_seq_train)
np.save("dataset_assignment/cat_rec_train.npy", cat_rec_train)

#np.save("dataset_assignment/cat_usr_test.npy", cat_usr_test)
np.save("dataset_assignment/cat_seq_test.npy", cat_seq_test)
np.save("dataset_assignment/cat_rec_test.npy", cat_rec_test)

#np.save("dataset_assignment/cat_usr_val.npy", cat_usr_val)
np.save("dataset_assignment/cat_seq_val.npy", cat_seq_val)
np.save("dataset_assignment/cat_rec_val.npy", cat_rec_val)

In [None]:
# load array
#cat_usr_train = np.load("dataset_assignment/cat_usr_train.npy")
cat_seq_train = np.load("dataset_assignment/cat_seq_train.npy")
cat_rec_train = np.load("dataset_assignment/cat_rec_train.npy")
#cat_usr_test = np.load("dataset_assignment/cat_usr_test.npy")
cat_seq_test = np.load("dataset_assignment/cat_seq_test.npy")
cat_rec_test = np.load("dataset_assignment/cat_rec_test.npy")
#cat_usr_val = np.load("dataset_assignment/cat_usr_val.npy")
cat_seq_val = np.load("dataset_assignment/cat_seq_val.npy")
cat_rec_val = np.load("dataset_assignment/cat_rec_val.npy")

In [None]:
#cat_usr_train, cat_seq_train, cat_rec_train = creat_num_data(usr_train[0:1000], sam_seq_train[0:1000], rec_train[0:1000]) 

# Training Step

Create and train a model that matches the SPS metric on the test set, that is mentioned in the paper. Also, the final model (or just model weights) should be saved so that it could be used for evaluating the metric later. The saved model (or model weights) should be named ‘best_model’ followed by a suitable extension, depending on the Deep Learning framework in use.

In [None]:
model = Sequential()
model.add(LSTM(50, input_shape=(cat_seq_train.shape[1], cat_seq_train.shape[2]), return_sequences=False))
model.add(Dense(cat_rec_train.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
model.fit(cat_seq_train, cat_rec_train, epochs=25, batch_size=50)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7f7c10253310>

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 50)                751400    
                                                                 
 dense (Dense)               (None, 3706)              189006    
                                                                 
Total params: 940,406
Trainable params: 940,406
Non-trainable params: 0
_________________________________________________________________
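
The parameter counts in the summary can be checked by hand, which is a quick way to confirm that the input really is a 3706-dimensional one-hot vector per time step. The sketch below uses the standard parameter formulas for an LSTM layer and a dense layer; the function names are made up for illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

print(lstm_params(3706, 50))   # 751400
print(dense_params(50, 3706))  # 189006
```

The two values add up to the 940,406 total reported by `model.summary()`.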


In [None]:
model.save('dataset_assignment/best_model.h5')
print('Model Saved!')

Model Saved!


# Load model (or model weights)

Load the saved model (or model weights). 

In [44]:
# load model
savedModel=load_model('dataset_assignment/best_model.h5')
savedModel.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 50)                751400    
                                                                 
 dense (Dense)               (None, 3706)              189006    
                                                                 
Total params: 940,406
Trainable params: 940,406
Non-trainable params: 0
_________________________________________________________________


In [45]:
val_pred_prob = savedModel.predict(cat_seq_val) # probabilities of the different items predicted

In [46]:
def trans_out(pred, no_of_rec = 10):
  """
  This function transforms the predicted probability array into a series of output vectors with 1s at
  the item positions of the best n recommended items
  :param :array --> pred array holding, for each sample, the probability vector over all items
  :param :int --> no_of_rec denoting the number of recommendations to take from the model
  :return :array --> one row per sample, with 1s at the indices of the recommended items
  """
  if no_of_rec == 1:
    pred1 = np.argmax(pred, axis=1)
    predf = to_categorical(pred1, num_classes=pred.shape[1])
  else:
    pred1 = np.argsort(pred, axis=1)                          # ascending order of probability
    pred2 = np.flip(pred1, axis=1)[:,:no_of_rec]              # top n item indices
    pred3 = to_categorical(pred2, num_classes=pred.shape[1])  # one-hot per recommended item
    predf = np.sum(pred3, axis = 1)                           # merge into one multi-hot vector
  return predf

In [47]:
val_pred_rec1 = trans_out(val_pred_prob, 1)
val_pred_rec3 = trans_out(val_pred_prob, 3)
val_pred_rec5 = trans_out(val_pred_prob, 5)
val_pred_rec10 = trans_out(val_pred_prob, 10)

In [48]:
def cal_sps(pred, y, sps_only=True):
  """
  This function compares the recommendation vectors with the one-hot ground truth to compute
  the sps metric. The element-wise product of pred and y, summed per row, is 1 when the true
  next item is among the recommended items and 0 otherwise. Averaging these hits over all
  samples gives the average sps as a percentage.
  """
  # Row-wise dot product; equal to the diagonal of pred @ y.T without the n x n matrix
  bin_vector = (pred * y).sum(axis=1).reshape(len(y),1)
  sps = bin_vector.sum(axis=0)[0]*100/len(y)
  if sps_only:
    return sps
  return bin_vector, sps
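
The same metric can also be computed straight from the probability matrix and integer labels, without building multi-hot recommendation vectors. The sketch below is one way to do it with NumPy; `sps_at_k` is a hypothetical name, and it assumes `true_items` holds one integer item id per row.

```python
import numpy as np

def sps_at_k(probs, true_items, k=10):
    """Percentage of rows whose true next item is among the k highest-scoring items."""
    probs = np.asarray(probs)
    true_items = np.asarray(true_items).reshape(-1)
    # argpartition places the k largest scores in the last k columns (unordered).
    top_k = np.argpartition(probs, -k, axis=1)[:, -k:]
    hits = (top_k == true_items[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy scores over 3 items, made up for illustration.
probs = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.2, 0.3, 0.5]])
print(sps_at_k(probs, [2, 1, 2], k=2))
```

With the one-hot labels used in this notebook, the integer labels can be recovered with `cat_rec_val.argmax(axis=1)`.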


In [49]:
print("sps @ 1 recommendation : {}%".format(cal_sps(val_pred_rec1, cat_rec_val)))
print("sps @ 3 recommendation : {}%".format(cal_sps(val_pred_rec3, cat_rec_val)))
print("sps @ 5 recommendation : {}%".format(cal_sps(val_pred_rec5, cat_rec_val)))
print("sps @ 10 recommendation : {}%".format(cal_sps(val_pred_rec10, cat_rec_val)))

sps @ 1 recommendation : 1.36%
sps @ 3 recommendation : 2.88%
sps @ 5 recommendation : 4.76%
sps @ 10 recommendation : 8.56%


# Prediction Step

Run the loaded model and evaluate it on the test data, printing the SPS metric. 

In [51]:
test_pred_prob  = savedModel.predict(cat_seq_test)
test_pred_rec1  = trans_out(test_pred_prob, 1)
test_pred_rec3  = trans_out(test_pred_prob, 3)
test_pred_rec5  = trans_out(test_pred_prob, 5)
test_pred_rec10 = trans_out(test_pred_prob, 10)

In [53]:
print("sps @ 1 recommendation : {}%".format(cal_sps(test_pred_rec1, cat_rec_test)))
print("sps @ 3 recommendation : {}%".format(cal_sps(test_pred_rec3, cat_rec_test)))
print("sps @ 5 recommendation : {}%".format(cal_sps(test_pred_rec5, cat_rec_test)))
print("sps @ 10 recommendation : {}%".format(cal_sps(test_pred_rec10, cat_rec_test)))

sps @ 1 recommendation : 1.0%
sps @ 3 recommendation : 3.2%
sps @ 5 recommendation : 4.92%
sps @ 10 recommendation : 8.48%


# QnA Section:
Please answer the following questions based on your understanding of the paper.

### Question 1:

You have a set of 3 sequences:
```
seq1 = ['item1', 'item5', 'item3', 'item2', 'item4']
seq2 = ['item10', 'item5', 'item4', 'item8', 'item2', 'item1', 'item3']
seq3 = ['item8', 'item2', 'item5', 'item3', 'item2', 'item10']
```

We take the first 3 items from each of these sequences and feed them into an RNN model.
The model generates outputs for 3 timesteps for each of the 3 input sequences.
The model outputs are listed below, where out1 is the output for input sequence seq1, and so on:
```
out1 = ['item2', 'item4', 'item1']   
out2 = ['item2', 'item10', 'item1']   
out3 = ['item2', 'item3', 'item5']   
```

What will be the sps@3 of this RNN model, for the given set of input sequences?

In [None]:
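# Answer 1:  
"""
The item that follows the three inputs is the target for each sequence:
seq 1 --> hit, item2 is present in the recommendations
seq 2 --> miss, item8 is not present in the recommendations
seq 3 --> hit, item3 is present in the recommendations

sps@3 = 2/3 = 66.67%
"""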
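
The count can be reproduced in a few lines, assuming (as the answer does) that the three outputs are treated as the top-3 recommendation set and that the target is the item right after the three that were fed in.

```python
seqs = [
    ['item1', 'item5', 'item3', 'item2', 'item4'],
    ['item10', 'item5', 'item4', 'item8', 'item2', 'item1', 'item3'],
    ['item8', 'item2', 'item5', 'item3', 'item2', 'item10'],
]
outs = [
    ['item2', 'item4', 'item1'],
    ['item2', 'item10', 'item1'],
    ['item2', 'item3', 'item5'],
]

n_fed = 3  # number of items fed to the model
# A hit means the next item of the sequence is among the recommended items.
hits = [seq[n_fed] in out for seq, out in zip(seqs, outs)]
sps_at_3 = 100 * sum(hits) / len(hits)
print(hits, round(sps_at_3, 2))
```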

### Question 2:

Which do you think is the most significant metric for a recommendation system: sps or recall?


In [None]:
# Answer 2: 
"""
Recall (or, even better, NDCG) is the more significant metric for a recommendation system overall.
sps is a good measure of short-term, next-item prediction.
"""

### Question 3:

How does the paper propose to measure the short term / long term profiling of a recommender system?

In [None]:
# Answer 3: 
"""
By averaging avg-r@N over all users of the test data. avg-r@N is low for small N, which shows the model is
most useful for short-term prediction, and it increases linearly as N increases: prediction quality drops
because long-term prediction is affected by changes in taste, among other things.
Note: the lower the avg-r@N, the lower the average rank over the next N items, so the true items are ranked
closer to the top, which means better predictions/recommendations.

A lower avg-r@N at higher N indicates a recommendation model that generalizes better.
"""

### Question 4: 
Suppose you have a user U who has watched 2*n movies, and you have trained RNN based recommender with first n movies. Now, you have a number N <= n, denoting the number of next items in the user sequence taken to do short-term/long-term profiling. 
As per the paper, what impact does increasing N have on this profiling?

In [None]:
# Answer 4:
"""
As N increases, avg-r@N increases smoothly/linearly for the various models: as the number of next items
considered grows, the average rank also grows, so the recommendations become far less viable. This
reflects the user's preferences changing over time.
"""