# Deep Learning for Movies Recommendation System(content_based filtering)

In [1]:
import numpy as np
import pandas as pd
import numpy.ma as ma
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Layer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
import pickle as pickle
from recsysNN_util import *

pd.set_option("display.precision", 1)


## Movie ratings dataset 
The data set is from the [MovieLens ml-latest-small](https://grouplens.org/datasets/movielens/latest/) dataset. 

[F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>]

Thes dataset has around 9000 movies rated by 600 users with ratings on a scale of 0.5 to 5 in 0.5 step increments. For each movie, the dataset provides a movie title, release date, and one or more genres. For example "Toy Story 3" was released in 2010 and has several genres: "Adventure|Animation|Children|Comedy|Fantasy". This dataset contains little information about users other than their ratings. This dataset is used to create training vectors for the neural networks. All the necessary cleaning and transforming the data is done in "guide to data prepation for user networn and item network.ipynb". You can find that in the same folder as this one.



##  Content-based filtering with a neural network

In the collaborative filtering lab, you generated two vectors, a user vector and an item/movie vector whose dot product would predict a rating. The vectors were derived solely from the ratings.   

Content-based filtering also generates a user and movie feature vector but recognizes there may be other information available about the user and/or movie that may improve the prediction. The additional information is provided to a neural network which then generates the user and movie vector as shown below.
<figure>
    <center> <img src="images/NN_architecture.png"   style="width:500px;height:280px;" ></center>
</figure>


### Training Data
The movie content provided to the network is a combination of the original data and some 'engineered features'. The original features are the year the movie was released and the movie's genre's presented as a one-hot vector. There are 14 genres. The engineered feature is an average rating derived from the user ratings. 

The user content is composed of engineered features. A per genre average rating is computed per user. Additionally, a user id, rating count and rating average are available but not included in the training or prediction content. They are carried with the data set because they are useful in interpreting data.

The training set consists of all the ratings made by the users in the data set.The training set is split into two arrays with the same number of entries, a user array and a movie/item array.  

Below, let's load and display some of the data.

In [2]:
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()
num_user_features = user_train.shape[1] - 3  # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1] - 1  # remove movie id at train time
uvs = 3  # user genre vector start
ivs = 3  # item genre vector start
u_s = 3  # start of columns to use in training, user
i_s = 1  # start of columns to use in training, items
print(f"Number of training vectors: {len(item_train)}")

Number of training vectors: 100631


In [4]:
pprint_train(user_train, user_features, uvs,  u_s, maxcount=5)

[user id],[rating count],[rating ave],Adve nture,Anim ation,Chil dren,Com edy,Fan tasy,Rom ance,Act ion,Crime,Thri ller,Mys tery,Hor ror,Drama,War,Sci -Fi,Mus ical,Docum entary
1,689,4.3,4.4,4.7,4.5,4.3,4.3,4.3,4.3,4.4,4.2,4.2,3.5,4.5,4.5,4.2,4.7,0.0
5,123,3.6,3.2,4.3,4.1,3.5,4.1,3.1,3.1,3.8,3.6,4.0,3.0,3.8,3.3,2.5,4.4,0.0
7,457,3.2,3.3,3.4,3.2,3.2,3.1,2.6,3.3,3.3,3.4,3.2,4.0,3.1,3.3,3.1,3.7,0.0
15,425,3.4,3.3,3.0,2.7,3.4,2.9,3.9,3.2,3.6,3.4,3.4,3.8,3.7,4.1,3.6,2.7,0.0
17,290,4.2,4.3,4.4,4.2,4.2,4.3,3.9,4.2,4.2,4.3,4.0,4.2,4.2,4.4,4.4,4.0,3.5


In [5]:
pprint_train(item_train, item_features, ivs, i_s, maxcount=5, user=False)

[mov ieId],Year,Avg_r ating,Adve nture,Anim ation,Chil dren,Com edy,Fan tasy,Rom ance,Act ion,Crime,Thri ller,Mys tery,Hor ror,Drama,War,Sci -Fi,Mus ical,Docum entary
1,1995,3.9,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
1,1995,3.9,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
1,1995,3.9,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
1,1995,3.9,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
1,1995,3.9,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0


In [6]:
print(f"y_train[:5]: {y_train[:5]}")

y_train[:5]: [4.  4.  4.5 2.5 4.5]


In [7]:
# scale training data
item_train_unscaled = item_train
user_train_unscaled = user_train
y_train_unscaled    = y_train

scalerItem = StandardScaler()
scalerItem.fit(item_train)
item_train = scalerItem.transform(item_train)

scalerUser = StandardScaler()
scalerUser.fit(user_train)
user_train = scalerUser.transform(user_train)


scalerTarget = MinMaxScaler((-1, 1))
scalerTarget.fit(y_train.reshape(-1, 1))
y_train = scalerTarget.transform(y_train.reshape(-1, 1))

print(np.allclose(item_train_unscaled, scalerItem.inverse_transform(item_train))) # just to verify the scaling
print(np.allclose(user_train_unscaled, scalerUser.inverse_transform(user_train)))

True
True


In [8]:
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test       = train_test_split(y_train,    train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test data shape: {item_test.shape}")

movie/item training data shape: (80504, 19)
movie/item test data shape: (20127, 19)


In [9]:
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

[user id],[rating count],[rating ave],Adve nture,Anim ation,Chil dren,Com edy,Fan tasy,Rom ance,Act ion,Crime,Thri ller,Mys tery,Hor ror,Drama,War,Sci -Fi,Mus ical,Docum entary
0,0,-1.8,-2.4,0.2,-0.5,-1.3,-1.0,-1.2,-2.2,-0.8,-1.9,-0.9,-1.0,-0.9,-0.1,-1.7,0.4,-1.4
-1,0,0.8,0.3,0.5,0.4,1.0,0.7,0.9,0.6,0.5,0.7,0.9,0.6,0.8,0.1,0.7,0.6,-0.0
1,2,-1.8,-1.4,-0.5,-1.1,-2.0,-1.3,-1.5,-1.3,-1.5,-1.5,-1.1,-0.9,-1.7,-1.0,-1.2,-0.7,0.3
1,0,-1.1,-0.6,-0.2,-0.4,-0.7,0.1,-0.9,-1.5,-1.3,-1.5,-1.1,-0.3,-1.5,-1.5,-0.9,0.0,0.7
1,0,-0.6,-0.4,-0.1,-0.1,-0.5,-0.3,-0.5,-0.5,-0.7,-0.7,-0.4,-0.4,-0.7,-0.6,-0.3,-0.0,0.8


In [10]:
class L2NormalizeLayer(Layer):
    def __init__(self, axis=-1, **kwargs):
        super().__init__(**kwargs)
        self.axis = axis

    def call(self, inputs):
        return tf.linalg.l2_normalize(inputs, axis=self.axis)


num_outputs = 32 
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=256, activation='relu'),
    tf.keras.layers.Dense(units=num_outputs)
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=256, activation='relu'),
    tf.keras.layers.Dense(units=num_outputs)
])

# creating the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,)) # Corrected shape here too
vu = user_NN(input_user)

# Apply L2 normalization using the custom layer
vu = L2NormalizeLayer(axis=1)(vu) 

# creating the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,)) # Corrected shape here
vm = item_NN(input_item)

# Apply L2 normalization using the custom layer
vm = L2NormalizeLayer(axis=1)(vm) # FIX IS HERE

# computing the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specifying the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()




In [11]:
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
              loss=cost_fn)

In [12]:
print(user_train.shape)
print(item_train.shape)
print(y_train.shape)
print(input_user.shape)


(80504, 19)
(80504, 19)
(80504, 1)
(None, 16)


In [13]:
print(f"Shape of user_input: {user_train[:, u_s:].shape}")
print(f"Shape of item_input: {item_train[:, i_s:].shape}")
print(f"Shape of y_train: {y_train.shape}")

Shape of user_input: (80504, 16)
Shape of item_input: (80504, 18)
Shape of y_train: (80504, 1)


In [14]:
tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], y_train, epochs=30)

Epoch 1/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 9ms/step - loss: 0.1350
Epoch 2/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m47s[0m 12ms/step - loss: 0.1285
Epoch 3/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 12ms/step - loss: 0.1261
Epoch 4/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 12ms/step - loss: 0.1248
Epoch 5/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 12ms/step - loss: 0.1235
Epoch 6/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 11ms/step - loss: 0.1221
Epoch 7/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 12ms/step - loss: 0.1207
Epoch 8/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 13ms/step - loss: 0.1193
Epoch 9/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 13ms/step - loss: 0.1181
Epoch 10/30
[1m2516/2516[0m [32m━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x1f18a52c830>

In [14]:
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], y_test)

[1m629/629[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - loss: 0.1199


0.11985336989164352

predictions 

In [16]:
new_user_id = 5000
new_rating_ave = 0.0
new_action = 0.0
new_adventure = 5.0
new_animation = 0.0
new_childrens = 0.0
new_comedy = 0.0
new_crime = 0.0
new_documentary = 0.0
new_drama = 0.0
new_fantasy = 5.0
new_horror = 0.0
new_mystery = 0.0
new_romance = 0.0
new_scifi = 0.0
new_thriller = 0.0
new_rating_count = 3
new_musical =0.0
new_war = 0.0

user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_adventure, new_animation, new_childrens,
                      new_comedy, new_fantasy, new_romance, new_action, 
                      new_crime, new_thriller, new_mystery, new_horror,
                      new_drama, new_war, new_scifi, new_musical, new_documentary                       
                       ]])


##  Predictions
This model is used to make predictions in a number of circumstances. 
<
###  Predictions for a new user
First, we'll create a new user and have the model suggest movies for that user. After tried this on the example user content, feel free to change the user content to match your own preferences and see what the model suggests. Note that ratings are between 0.5 and 5.0, inclusive, in half-step increments.

In [17]:
# generate and replicate the user vector to match the number movies in the data set.
user_vecs = gen_user_vecs(user_vec,len(item_vecs))

# scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# unscale y prediction 
y_pu = scalerTarget.inverse_transform(y_p)

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display

print_pred_movies(sorted_ypu, sorted_items, movie_dict, maxcount = 10)

[1m302/302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step


y_p,movie id,rating ave,title,genres
3.9,98809,3.8,"Hobbit: An Unexpected Journey, The (2012)",Adventure|Fantasy|IMAX
3.8,60818,4.0,Hogfather (Terry Pratchett's Hogfather) (2006),Adventure|Fantasy|Thriller
3.8,8368,3.9,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy|IMAX
3.8,180031,3.7,The Shape of Water (2017),Adventure|Drama|Fantasy
3.8,40815,3.8,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller|IMAX
3.8,128842,4.0,Dragonheart 3: The Sorcerer's Curse (2015),Action|Adventure|Fantasy
3.8,137857,3.6,The Jungle Book (2016),Adventure|Drama|Fantasy
3.8,5952,4.0,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
3.8,122926,4.2,Untitled Spider-Man Reboot (2017),Action|Adventure|Fantasy
3.7,81834,4.0,Harry Potter and the Deathly Hallows: Part 1 (2010),Action|Adventure|Fantasy|IMAX


In [18]:
# for existing user

uid = 2 
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, user_train_unscaled, item_vecs, user_to_genre)

# scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# unscale y prediction 
y_pu = scalerTarget.inverse_transform(y_p)

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display
sorted_user  = user_vecs[sorted_index]
sorted_y     = y_vecs[sorted_index]

#print sorted predictions for movies rated by the user
print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, ivs, uvs, movie_dict, maxcount = 50)

[1m302/302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step


y_p,y,user,user genre ave,movie rating ave,movie id,title,genres
4.4,3.0,2,"[3.8,3.9]",4.4,318,"Shawshank Redemption, The (1994)",Crime|Drama
4.3,4.5,2,"[4.0,3.8,3.9]",4.2,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
4.3,4.5,2,"[4.5,3.9]",4.1,1704,Good Will Hunting (1997),Drama|Romance
4.3,5.0,2,[4.3],5.0,131724,The Jinx: The Life and Deaths of Robert Durst (2015),Documentary
4.3,3.0,2,[3.9],4.0,109487,Interstellar (2014),Sci-Fi|IMAX
4.3,4.0,2,"[3.8,3.7,3.9]",4.3,48516,"Departed, The (2006)",Crime|Drama|Thriller
4.3,4.5,2,"[4.0,3.9,4.5]",4.1,68157,Inglourious Basterds (2009),Action|Drama|War
4.2,5.0,2,[4.3],4.3,80906,Inside Job (2010),Documentary
4.1,4.0,2,[3.9],4.0,112552,Whiplash (2014),Drama
4.1,3.5,2,"[4.2,4.0,3.8]",4.0,91529,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX


The model prediction is generally within 1 of the actual rating though it is not a very accurate predictor of how a user rates specific movies. This is especially true if the user rating is significantly different than the user's genre average. You can vary the user id above to try different users.  User id's were used in the training set are from 1-610.


### Finding Similar Items
The neural network above produces two feature vectors, a user feature vector $v_u$, and a movie feature vector, $v_m$. These are 32 entry vectors whose values are difficult to interpret. However, similar items will have similar vectors. This information can be used to make recommendations. For example, if a user has rated "Toy Story 3" highly, one could recommend similar movies by selecting movies with similar movie feature vectors.

A similarity measure is the squared distance between the two vectors

In [21]:
def sq_dist(a,b):
   
   # Returns the squared distance between two vectors
    # a adn b with vectors of n features
    
  
    d = 0.0 # distance
    for l in range(a.shape[0]):
        d = d + (a[l] - b[l])**2
        
    return d

A matrix of distances between movies can be computed once when the model is trained and then reused for new recommendations without retraining. The first step, once a model is trained, is to obtain the movie feature vector, $v_m$, for each of the movies. To do this, we will use the trained `item_NN` and build a small model to allow us to run the movie vectors through it to generate $v_m$.

In [19]:
input_item_m = tf.keras.layers.Input(shape=((num_item_features,)))    # input layer
vm_m = item_NN(input_item_m)                                       # use the trained item_NN
vm_m = L2NormalizeLayer(axis=1)(vm_m)                      # incorporate normalization as was done in the original model
model_m = tf.keras.Model(input_item_m, vm_m)                                
model_m.summary()

Once you have a movie model, you can create a set of movie feature vectors by using the model to predict using a set of item/movie vectors as input. `item_vecs` is a set of all of the movie vectors. It must be scaled to use with the trained model. The result of the prediction is a 32 entry feature vector for each movie.

In [20]:
scaled_item_vecs = scalerItem.transform(item_vecs)
vms = model_m.predict(scaled_item_vecs[:,i_s:])
print(f"size of all predicted movie feature vectors: {vms.shape}")

[1m302/302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step
size of all predicted movie feature vectors: (9664, 32)


Lets compute a matrix of the squared distance between each movie feature vector and all other movie feature vectors:
We can then find the closest movie by finding the minimum along each row. We will make use of [numpy masked arrays](https://numpy.org/doc/1.21/user/tutorial-ma.html) to avoid selecting the same movie. The masked values along the diagonal won't be included in the computation.

In [22]:
count = 50  # number of movies to display
dim = len(vms)
dist = np.zeros((dim,dim))

for i in range(dim):
    for j in range(dim):
        dist[i,j] = sq_dist(vms[i, :], vms[j, :])
        
m_dist = ma.masked_array(dist, mask=np.identity(dist.shape[0]))  # mask the diagonal

disp = [["movie1", "genres", "movie2", "genres"]]
for i in range(count):
    min_idx = np.argmin(m_dist[i])
    movie1_id = int(item_vecs[i,0])
    movie2_id = int(item_vecs[min_idx,0])
    disp.append( [movie_dict[movie1_id]['title'], movie_dict[movie1_id]['genres'],
                  movie_dict[movie2_id]['title'], movie_dict[movie1_id]['genres']]
               )
table = tabulate.tabulate(disp, tablefmt='html', headers="firstrow")
table

movie1,genres,movie2,genres.1
Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
Grumpier Old Men (1995),Comedy|Romance,Clueless (1995),Comedy|Romance
Heat (1995),Action|Crime|Thriller,Thursday (1998),Action|Crime|Thriller
Seven (a.k.a. Se7en) (1995),Mystery|Thriller,True Crime (1996),Mystery|Thriller
"Usual Suspects, The (1995)",Crime|Mystery|Thriller,Reservoir Dogs (1992),Crime|Mystery|Thriller
From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,Ichi the Killer (Koroshiya 1) (2001),Action|Comedy|Horror|Thriller
Bottle Rocket (1996),Adventure|Comedy|Crime|Romance,Annie Hall (1977),Adventure|Comedy|Crime|Romance
Braveheart (1995),Action|Drama|War,Saving Private Ryan (1998),Action|Drama|War
Rob Roy (1995),Action|Drama|Romance|War,Kingdom of Heaven (2005),Action|Drama|Romance|War
Canadian Bacon (1995),Comedy|War,Junior (1994),Comedy|War
