# Lab 6: Implement content-based filtering with tensorflow

In [1]:
import numpy as np
import numpy.ma as ma
import pandas as pd

import tensorflow as tf
from tensorflow import keras

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

import tabulate
from lib.recsysNN_utils import *
pd.set_option("display.precision", 1)

## Load Data

The data set is derived from the [MovieLens ml-latest-small](https://grouplens.org/datasets/movielens/latest/) dataset. [F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>]

* The original dataset has roughly 9000 movies rated by 600 users with ratings on a scale of 0.5 to 5 in 0.5 step increments. 
* The dataset has been reduced in size to focus on movies from the years since 2000 and popular genres.
* The reduced dataset has $n_u = 397$ users, $n_m= 847$ movies and 25521 ratings.
* For each movie, the dataset provides a movie title, release date, and one or more genres. For example "Toy Story 3" was released in 2010 and has several genres: "Adventure|Animation|Children|Comedy|Fantasy".
* This dataset contains little information about users other than their ratings. 

In [2]:
top10_df = pd.read_csv("./data/lab06/content_top10_df.csv")
top10_df

Unnamed: 0,movie id,num ratings,ave rating,title,genres
0,4993,198,4.1,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
1,5952,188,4.0,"Lord of the Rings: The Two Towers, The",Adventure|Fantasy
2,7153,185,4.1,"Lord of the Rings: The Return of the King, The",Action|Adventure|Drama|Fantasy
3,4306,170,3.9,Shrek,Adventure|Animation|Children|Comedy|Fantasy|Ro...
4,58559,149,4.2,"Dark Knight, The",Action|Crime|Drama
5,6539,149,3.8,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy
6,79132,143,4.1,Inception,Action|Crime|Drama|Mystery|Sci-Fi|Thriller
7,6377,141,4.0,Finding Nemo,Adventure|Animation|Children|Comedy
8,4886,132,3.9,"Monsters, Inc.",Adventure|Animation|Children|Comedy|Fantasy
9,7361,131,4.2,Eternal Sunshine of the Spotless Mind,Drama|Romance|Sci-Fi


In [3]:
bygenre_df = pd.read_csv("./data/lab06/content_bygenre_df.csv")
bygenre_df

Unnamed: 0,genre,num movies,ave rating/genre,ratings per genre
0,Action,321,3.4,10377
1,Adventure,234,3.4,8785
2,Animation,76,3.6,2588
3,Children,69,3.4,2472
4,Comedy,326,3.4,8911
5,Crime,139,3.5,4671
6,Documentary,13,3.8,280
7,Drama,342,3.6,10201
8,Fantasy,124,3.4,4468
9,Horror,56,3.2,1345


In [4]:
# Load Data, set configuration variables
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()

num_user_features = user_train.shape[1] - 3  # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1] - 1  # remove movie id at train time
uvs = 3  # user genre vector start
ivs = 3  # item genre vector start
u_s = 3  # start of columns to use in training, user
i_s = 1  # start of columns to use in training, items
print(f"Number of training vectors: {len(item_train)}")

Number of training vectors: 50884


### User Vector
* Some of the user and item/movie features are not used in training. In the table below, the features in brackets "[]" such as the "user id", "rating count" and "rating ave" are not included when the model is trained and used.
* The ratings represent how user (id2) rated movie genres on average.
* Entries of zero are genre's which the user has not yet rated any movie.
* The user vector is the same for all the movies rated by a user.
* User `2` rates action movies as 3.9 on average. I think this is a typo. I think it's `4.0` on average.

In [5]:
# Inspect loaded data
pprint_train(user_train, user_features, uvs,  u_s, maxcount=5)

[user id],[rating count],[rating ave],Act ion,Adve nture,Anim ation,Chil dren,Com edy,Crime,Docum entary,Drama,Fan tasy,Hor ror,Mys tery,Rom ance,Sci -Fi,Thri ller
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9


### Movie Vector
* The movie array contains the year the film was released, the average rating and an indicator for each potential genre.
* The indicator is one for each genre that applies to the movie.
* The movie id is not used in training but is useful when interpreting the data.
* Movie `6874` is an Action/Crime/Thriller movie released in 2003. MovieLens users gave the movie an average rating (`[rating ave]`) of `4`.

In [6]:
pprint_train(item_train, item_features, ivs, i_s, maxcount=5, user=False)

[movie id],year,ave rating,Act ion,Adve nture,Anim ation,Chil dren,Com edy,Crime,Docum entary,Drama,Fan tasy,Hor ror,Mys tery,Rom ance,Sci -Fi,Thri ller
6874,2003,4.0,1,0,0,0,0,1,0,0,0,0,0,0,0,1
8798,2004,3.8,1,0,0,0,0,1,0,1,0,0,0,0,0,1
46970,2006,3.2,1,0,0,0,1,0,0,0,0,0,0,0,0,0
48516,2006,4.3,0,0,0,0,0,1,0,1,0,0,0,0,0,1
58559,2008,4.2,1,0,0,0,0,1,0,1,0,0,0,0,0,0


### Target / Labelled Outcome
* The target (y) is the movie rating actually given by a single user. 
* The first training example `y` is 4. This indicates that user `2` rated movie `6874` as a `4`.
*  A single training example consists of a row from the user arrary (`user_train`), the item array (`item_train`; items are movies in this case) and a real rating from `y_train`.

In [7]:
print(f"y_train[:5]: {y_train[:5]}")

y_train[:5]: [4.  3.5 4.  4.  4.5]


## Prepare Data: Feature scaling and Breaking up the training sets

In [8]:
# FEATURE SCALING

# Setup variables
item_train_unscaled = item_train
user_train_unscaled = user_train
y_train_unscaled    = y_train


# Scale the input features using the scikit learn StandardScaler... first the user data and then the item (movie) data  
scalerUser = StandardScaler()
scalerUser.fit(user_train)
user_train = scalerUser.transform(user_train)

scalerItem = StandardScaler()
scalerItem.fit(item_train)
item_train = scalerItem.transform(item_train)


# Scale the target ratings using a Min Max Scaler which scales the target to be between -1 and 1.
scalerTarget = MinMaxScaler((-1, 1))
scalerTarget.fit(y_train.reshape(-1, 1))
y_train = scalerTarget.transform(y_train.reshape(-1, 1))
#ynorm_test = scalerTarget.transform(y_test.reshape(-1, 1))

# Confirm the inverse_transform produces the original inputs
print(np.allclose(item_train_unscaled, scalerItem.inverse_transform(item_train)))
print(np.allclose(user_train_unscaled, scalerUser.inverse_transform(user_train)))

True
True


In [9]:
# SPLIT INTO TRAINING AND TEST SETS

item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test       = train_test_split(y_train,    train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test data shape: {item_test.shape}")

movie/item training data shape: (40707, 17)
movie/item test data shape: (10177, 17)


In [10]:
# Show that the scaled, shuffled data now has a mean of 0.
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

[user id],[rating count],[rating ave],Act ion,Adve nture,Anim ation,Chil dren,Com edy,Crime,Docum entary,Drama,Fan tasy,Hor ror,Mys tery,Rom ance,Sci -Fi,Thri ller
1,0,-1.0,-0.8,-0.7,0.1,-0.0,-1.2,-0.4,0.6,-0.5,-0.5,-0.1,-0.6,-0.6,-0.7,-0.7
0,1,-0.7,-0.5,-0.7,-0.1,-0.2,-0.6,-0.2,0.7,-0.5,-0.8,0.1,-0.0,-0.6,-0.5,-0.4
-1,-1,-0.2,0.3,-0.4,0.4,0.5,1.0,0.6,-1.2,-0.3,-0.6,-2.3,-0.1,0.0,0.4,-0.0
0,-1,0.6,0.5,0.5,0.2,0.6,-0.1,0.5,-1.2,0.9,1.2,-2.3,-0.1,0.0,0.2,0.3
-1,0,0.7,0.6,0.5,0.3,0.5,0.4,0.6,1.0,0.6,0.3,0.8,0.8,0.4,0.7,0.7


## Build neural network to derive user and item vectors with the same desired number of output features

It's not necessary for the raw data to have the same number of users as it does items (e.g. movies). Usually those two data sets have very different dimensions.But, in order to calculate the loss, we need one row of user data to have the same number of features as one row of movie data so that we can take a dot product of those two vectors. We use two neural networks to achieve this transformation (prior to combining them by taking the dot product). We control the number of features in the output by specifiying it as the number of functions/nodes in the final output layer of the neural network

NB: If the raw user content was substantially larger than the raw movie content, you might elect to increase the complexity of the user network relative to the movie network. In this case, the content is similar, so the networks have the same number of functions/nodes at each layer.


Use a Keras sequential model:
* The first layer is a dense layer with 256 units and a relu activation.
* The second layer is a dense layer with 128 units and a relu activation.
* The third layer is a dense layer with num_outputs units and a linear or no activation.

In [11]:
# CONSTRUCT THE NEURAL NETWORKS

# GRADED_CELL
# UNQ_C1

num_outputs = 32
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs) # count of desired features in the derived user matrix (Vu)
    ### END CODE HERE ###  
])

item_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs) # count of desired features in the derived user matrix (Vm)
    ### END CODE HERE ###  
])

# The above is correct. It passes all tests in the course notebook.

In [16]:
# Raises exception here: tf.keras.layers.Input(shape=(num_user_features)) because "ValueError: Cannot convert '14' to a shape."
# But the types and values of num_user_features and num_item_features is the same in this notebook as it is in the course notebook
# print(num_user_features)
# print(type(num_user_features))
# print(num_item_features)
# print(type(num_item_features))

# Added this wrapping to get around errors that are raise using this notebook configuration, which are not raised in the course notebook.
class MyLayer(keras.Layer):
    def call(self, x):
        return tf.linalg.l2_normalize(x, axis=1)

# create the user input and point to the base network
# input_user = tf.keras.layers.Input(shape=(num_user_features)) # original code from course notebook; raises exception
input_user = tf.keras.layers.Input(shape=[num_user_features, ]) 
vu = user_NN(input_user)
# vu = tf.linalg.l2_normalize(vu, axis=1) # original code from course notebook; raises exception
vu = MyLayer()(vu)

# create the item input and point to the base network
# input_item = tf.keras.layers.Input(shape=(num_item_features))  # original code from course notebook; raises exception
input_item = tf.keras.layers.Input(shape=[num_item_features, ])
vm = item_NN(input_item)
# vm = tf.linalg.l2_normalize(vm, axis=1) # original code from course notebook; raises exception
vm = MyLayer()(vm)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()

In [17]:
# Public tests
from public_tests import *
print(user_NN)
print(user_NN.layers)

for layer in user_NN.layers:
    print(layer.output)
    print(layer.output.shape)
    print(type(layer.output.shape))

# Expected
# Tensor("sequential_12/dense_36/Identity:0", shape=(None, 256), dtype=float32)
# (None, 256)
# <class 'tensorflow.python.framework.tensor_shape.TensorShape'>
# Tensor("sequential_12/dense_37/Identity:0", shape=(None, 128), dtype=float32)
# (None, 128)
# <class 'tensorflow.python.framework.tensor_shape.TensorShape'>
# Tensor("sequential_12/dense_38/Identity:0", shape=(None, 32), dtype=float32)
# (None, 32)
# <class 'tensorflow.python.framework.tensor_shape.TensorShape'>

test_tower(user_NN)
test_tower(item_NN)

# Passing in course notebook. 
# The test that checked the values in the tuples that are the shapes of each layer was failing.
# It failed because it expected to see a tensor as the shape, but the shape in my layers is a tuple.
# I adjusted the test to inspect a tuple (as opposed to a tensor object) but I did not change the expected values.

<Sequential name=sequential, built=True>
[<Dense name=dense, built=True>, <Dense name=dense_1, built=True>, <Dense name=dense_2, built=True>]
<KerasTensor shape=(None, 256), dtype=float32, sparse=False, ragged=False, name=keras_tensor_2>
(None, 256)
<class 'tuple'>
<KerasTensor shape=(None, 128), dtype=float32, sparse=False, ragged=False, name=keras_tensor_3>
(None, 128)
<class 'tuple'>
<KerasTensor shape=(None, 32), dtype=float32, sparse=False, ragged=False, name=keras_tensor_4>
(None, 32)
<class 'tuple'>
[92mAll tests passed!
[92mAll tests passed!


## Define loss function

In [18]:
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
              loss=cost_fn)

## Fit model

In [19]:
tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], y_train, epochs=30)

Epoch 1/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 836us/step - loss: 0.1301
Epoch 2/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 825us/step - loss: 0.1159
Epoch 3/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 818us/step - loss: 0.1102
Epoch 4/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 799us/step - loss: 0.1060
Epoch 5/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 818us/step - loss: 0.1029
Epoch 6/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 843us/step - loss: 0.1003
Epoch 7/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 831us/step - loss: 0.0981
Epoch 8/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 839us/step - loss: 0.0962
Epoch 9/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 840us/step - loss: 0.0945
Epoch 10/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x3027f0380>

## Evaluate the model by determining loss on the test data set

In [21]:
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], y_test)

# Expected value roughly: 0.08

[1m319/319[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 389us/step - loss: 0.0867


0.0833355039358139

## Make predictions

We will create a new user and have the model suggest movies for that user. Ratings are between 0.5 and 5.0, inclusive, in half-step increments.

### Make predictions for a new user

* Construct the new user. This new user enjoys movies from the adventure and fantasy genres.
* Use a set of movie/item vectors (`item_vecs`) that have a vector for each movie in the training/test set. Match this with the new user vector and the scaled vectors to predict ratings for all the movies. This will find the top-rated movies for the new user.

In [27]:
new_user_id = 5000
new_rating_ave = 0.0
new_action = 0.0
new_adventure = 5.0
new_animation = 0.0
new_childrens = 0.0
new_comedy = 0.0
new_crime = 0.0
new_documentary = 0.0
new_drama = 0.0
new_fantasy = 5.0
new_horror = 0.0
new_mystery = 0.0
new_romance = 0.0
new_scifi = 0.0
new_thriller = 0.0
new_rating_count = 3

user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_action, new_adventure, new_animation, new_childrens,
                      new_comedy, new_crime, new_documentary,
                      new_drama, new_fantasy, new_horror, new_mystery,
                      new_romance, new_scifi, new_thriller]])

In [28]:
# generate and replicate the user vector to match the number movies in the data set.
user_vecs = gen_user_vecs(user_vec,len(item_vecs))

# scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# unscale y prediction 
y_pu = scalerTarget.inverse_transform(y_p)

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display

print_pred_movies(sorted_ypu, sorted_items, movie_dict, maxcount = 10)

[1m27/27[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 804us/step


y_p,movie id,rating ave,title,genres
4.3,5816,3.6,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy
4.2,98809,3.8,"Hobbit: An Unexpected Journey, The (2012)",Adventure|Fantasy
4.2,54001,3.9,Harry Potter and the Order of the Phoenix (2007),Adventure|Drama|Fantasy
4.2,106489,3.6,"Hobbit: The Desolation of Smaug, The (2013)",Adventure|Fantasy
4.2,8368,3.9,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy
4.2,4896,3.8,Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),Adventure|Children|Fantasy
4.2,40815,3.8,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller
4.1,81834,4.0,Harry Potter and the Deathly Hallows: Part 1 (2010),Action|Adventure|Fantasy
4.1,6539,3.8,Pirates of the Caribbean: The Curse of the Black Pearl (2003),Action|Adventure|Comedy|Fantasy
4.1,118696,3.4,The Hobbit: The Battle of the Five Armies (2014),Adventure|Fantasy


### Make predictions for existing user (user 2)

The predictions are generally within 1 of the actual rating though it is not a very accurate predictor of how a user rates specific movies. This is especially true if the user rating is significantly different than the user's genre average.

In [30]:
uid = 2 
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, user_train_unscaled, item_vecs, user_to_genre)

# scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# unscale y prediction 
y_pu = scalerTarget.inverse_transform(y_p)

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display
sorted_user  = user_vecs[sorted_index]
sorted_y     = y_vecs[sorted_index]

#print sorted predictions for movies rated by the user
print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, ivs, uvs, movie_dict, maxcount = 50)

[1m27/27[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 867us/step


y_p,y,user,user genre ave,movie rating ave,movie id,title,genres
4.5,5.0,2,[4.0],4.3,80906,Inside Job (2010),Documentary
4.3,4.0,2,"[4.0,4.1,3.9]",4.0,6874,Kill Bill: Vol. 1 (2003),Action|Crime|Thriller
4.3,4.0,2,"[4.0,4.1,4.0,4.0,3.9,3.9]",4.1,79132,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller
4.3,3.5,2,"[4.0,4.0]",3.9,99114,Django Unchained (2012),Action|Drama
4.2,3.5,2,"[4.0,4.1,4.0,3.9]",3.8,8798,Collateral (2004),Action|Crime|Drama|Thriller
4.2,4.5,2,"[4.0,4.0]",4.1,68157,Inglourious Basterds (2009),Action|Drama
4.2,5.0,2,"[4.0,4.1,4.0]",3.9,106782,"Wolf of Wall Street, The (2013)",Comedy|Crime|Drama
4.2,4.5,2,"[4.0,4.1,4.0]",4.2,58559,"Dark Knight, The (2008)",Action|Crime|Drama
4.1,4.0,2,"[4.1,4.0,3.9]",4.3,48516,"Departed, The (2006)",Crime|Drama|Thriller
4.1,4.5,2,"[4.1,4.0,3.9]",4.0,80489,"Town, The (2010)",Crime|Drama|Thriller


### Find similar items (movies)

The neural network above produces two feature vectors: a user feature vector (𝑣𝑢) and a movie feature vector (𝑣𝑚). These are 32-entry vectors whose values are difficult to interpret. However, similar items will have similar vectors. This information can be used to make recommendations. For example, if a user has rated "Toy Story 3" highly, one could recommend similar movies by selecting movies with similar movie feature vectors. Similarity can be measured at the sum of squared diffs between two vectors.

In [42]:
# GRADED_FUNCTION: sq_dist
# UNQ_C2
def sq_dist(a,b):
    """
    Returns the squared distance between two vectors
    Args:
      a (ndarray (n,)): vector with n features
      b (ndarray (n,)): vector with n features
    Returns:
      d (float) : distance
    """
    ### START CODE HERE ### 
    d = np.sum((a -b)**2)
    ### END CODE HERE ###     
    return d

In [43]:
a1 = np.array([1.0, 2.0, 3.0]); b1 = np.array([1.0, 2.0, 3.0])
a2 = np.array([1.1, 2.1, 3.1]); b2 = np.array([1.0, 2.0, 3.0])
a3 = np.array([0, 1, 0]);       b3 = np.array([1, 0, 0])
print(f"squared distance between a1 and b1: {sq_dist(a1, b1):0.3f}")
print(f"squared distance between a2 and b2: {sq_dist(a2, b2):0.3f}")
print(f"squared distance between a3 and b3: {sq_dist(a3, b3):0.3f}")

# Expected Output:
# squared distance between a1 and b1: 0.000
# squared distance between a2 and b2: 0.030
# squared distance between a3 and b3: 2.000

squared distance between a1 and b1: 0.000
squared distance between a2 and b2: 0.030
squared distance between a3 and b3: 2.000


In [44]:
# Public tests
test_sq_dist(sq_dist)

[92mAll tests passed!


In [47]:
# Once a model is trained, obtain the movie feature vector (𝑣𝑚) for each of the movies. 
# To do this, use the trained item_NN and build a small model to allow us to run the movie vectors through it to generate 𝑣𝑚.
# I had to make the same adjustments to the code below that I made in previous sections to accommodate differences in notebook environments.

input_item_m = tf.keras.layers.Input(shape=[num_item_features, ])    # input layer

vm_m = item_NN(input_item_m)                                       # use the trained item_NN


class MyLayer(keras.Layer):
    def call(self, x):
        return tf.linalg.l2_normalize(x, axis=1)

# vm_m = tf.linalg.l2_normalize(vm_m, axis=1)                        # incorporate normalization as was done in the original model
vm_m = MyLayer()(vm_m)

model_m = tf.keras.Model(input_item_m, vm_m)   

model_m.summary()

In [49]:
# Create a set of movie feature vectors by using the model to predict from an input which is a set of item/movie vectors. 
# item_vecs is a set of all of the movie vectors. It must be scaled to use with the trained model. The result of the prediction is a 32 entry feature vector for each movie.
scaled_item_vecs = scalerItem.transform(item_vecs)
vms = model_m.predict(scaled_item_vecs[:,i_s:])
print(f"size of all predicted movie feature vectors: {vms.shape}")
# Expected: (847, 32)

[1m27/27[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 760us/step
size of all predicted movie feature vectors: (847, 32)


In [51]:
# Identify the closest movie by finding the minimum along each row.
# Make use of numpy masked arrays to avoid selecting the same movie. The masked values along the diagonal won't be included in the computation.

count = 50  # number of movies to display
dim = len(vms)
dist = np.zeros((dim,dim))

for i in range(dim):
    for j in range(dim):
        dist[i,j] = sq_dist(vms[i, :], vms[j, :])
        
m_dist = ma.masked_array(dist, mask=np.identity(dist.shape[0]))  # mask the diagonal

disp = [["movie1", "genres", "movie2", "genres"]]
for i in range(count):
    min_idx = np.argmin(m_dist[i])
    movie1_id = int(item_vecs[i,0])
    movie2_id = int(item_vecs[min_idx,0])
    disp.append( [movie_dict[movie1_id]['title'], movie_dict[movie1_id]['genres'],
                  movie_dict[movie2_id]['title'], movie_dict[movie1_id]['genres']]
               )
table = tabulate.tabulate(disp, tablefmt='html', headers="firstrow")

# The results show the model will generally suggest a movie with the same genre.
table

movie1,genres,movie2,genres.1
Save the Last Dance (2001),Drama|Romance,Mona Lisa Smile (2003),Drama|Romance
"Wedding Planner, The (2001)",Comedy|Romance,"Sweetest Thing, The (2002)",Comedy|Romance
Hannibal (2001),Horror|Thriller,Final Destination 2 (2003),Horror|Thriller
Saving Silverman (Evil Woman) (2001),Comedy|Romance,"Wedding Planner, The (2001)",Comedy|Romance
Down to Earth (2001),Comedy|Fantasy|Romance,Bewitched (2005),Comedy|Fantasy|Romance
"Mexican, The (2001)",Action|Comedy,Rush Hour 2 (2001),Action|Comedy
15 Minutes (2001),Thriller,Panic Room (2002),Thriller
Enemy at the Gates (2001),Drama,Finding Neverland (2004),Drama
Heartbreakers (2001),Comedy|Crime|Romance,Fun with Dick and Jane (2005),Comedy|Crime|Romance
Spy Kids (2001),Action|Adventure|Children|Comedy,Scooby-Doo (2002),Action|Adventure|Children|Comedy
