Implement content-based filtering using a pair of neural networks (one for users, one for movies) to build a recommender system for movies.



In this lab, use NumPy and TensorFlow (main), scikit-learn (helpful subroutines), tabulate (neatly print tables), and Pandas (organize tabular data).

In [3]:
# Standard imports
import numpy as np
import numpy.ma as ma
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
from recsysNN_utils import *

pd.set_option("display.precision", 1)

Dataset derived from the MovieLens ml-latest-small dataset. Cool site.

The original dataset is fairly large, but has been limited for the scope of this lab - focusing on movies since 2000:

397 users ($n_u$), 847 movies ($n_m$), and 25521 ratings.

Each movie has a title, release date, and one or more genres.



But maybe we should analyze it a little.

In [4]:
# Display top 10 movies based on number of ratings.
top10_df = pd.read_csv("./data/content_top10_df.csv")
bygenre_df = pd.read_csv("./data/content_bygenre_df.csv")

top10_df

Unnamed: 0,movie id,num ratings,ave rating,title,genres
0,4993,198,4.1,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
1,5952,188,4.0,"Lord of the Rings: The Two Towers, The",Adventure|Fantasy
2,7153,185,4.1,"Lord of the Rings: The Return of the King, The",Action|Adventure|Drama|Fantasy
3,4306,170,3.9,Shrek,Adventure|Animation|Children|Comedy|Fantasy|Ro...
4,58559,149,4.2,"Dark Knight, The",Action|Crime|Drama
5,6539,149,3.8,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy
6,79132,143,4.1,Inception,Action|Crime|Drama|Mystery|Sci-Fi|Thriller
7,6377,141,4.0,Finding Nemo,Adventure|Animation|Children|Comedy
8,4886,132,3.9,"Monsters, Inc.",Adventure|Animation|Children|Comedy|Fantasy
9,7361,131,4.2,Eternal Sunshine of the Spotless Mind,Drama|Romance|Sci-Fi


In [6]:
# Info sorted by genre - movies can have multiple genres, so numbers may not add up
bygenre_df

Unnamed: 0,genre,num movies,ave rating/genre,ratings per genre
0,Action,321,3.4,10377
1,Adventure,234,3.4,8785
2,Animation,76,3.6,2588
3,Children,69,3.4,2472
4,Comedy,326,3.4,8911
5,Crime,139,3.5,4671
6,Documentary,13,3.8,280
7,Drama,342,3.6,10201
8,Fantasy,124,3.4,4468
9,Horror,56,3.2,1345


The movie content provided to the network includes engineered features.
Original features are the release year and the genres. The genre vector is usually described as one-hot, but since a movie can have several genres at once, it is really a multi-hot vector.

The engineered feature is an average rating derived from user ratings.
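To make the multi-hot idea concrete, here is a minimal sketch that turns a pipe-separated genre string into a 0/1 vector. The `GENRES` list follows the 14 genre columns shown in the item table of this lab; the helper name `multi_hot` is made up for illustration and is not part of the lab's utilities.

```python
import numpy as np

# Genre order taken from the item feature columns printed in this lab
GENRES = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
          "Documentary", "Drama", "Fantasy", "Horror", "Mystery", "Romance",
          "Sci-Fi", "Thriller"]

def multi_hot(genre_string):
    """Hypothetical helper: 1 for every genre present in the string, else 0."""
    present = set(genre_string.split("|"))
    return np.array([1 if g in present else 0 for g in GENRES])

vec = multi_hot("Action|Crime|Thriller")
print(vec)
```

More than one entry can be 1, which is exactly what distinguishes this from a true one-hot encoding.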



The user features also include engineered features, such as a per-genre average rating for each user. Some other components of the data, such as the user ID, are useful for interpreting the data but are not used in training or prediction.
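As a rough illustration of how a per-genre average rating per user could be engineered, here is a small pandas sketch on made-up ratings. The column names and values are hypothetical, and this is not necessarily how the lab's data was actually prepared.

```python
import pandas as pd

# Toy ratings table (hypothetical data, not from the lab)
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0],
    "genres": ["Action|Crime", "Comedy", "Action", "Comedy", "Comedy|Drama"],
})

# One row per (rating, genre) pair, so multi-genre movies count toward each genre
exploded = ratings.assign(genre=ratings["genres"].str.split("|")).explode("genre")

# Mean rating per user and genre; genres a user never rated become 0.0
user_genre_ave = exploded.pivot_table(
    index="user_id", columns="genre", values="rating", aggfunc="mean"
).fillna(0.0)

print(user_genre_ave)
```

Filling unrated genres with 0.0 mirrors the zeros visible in the user training array, though whether the lab used the same convention for the same reason is an assumption.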



In the case of underrepresented genres, some ratings are repeated to boost the number of training examples. I doubt this is 'good science', but whatever!


In [7]:
# Load data and set config vars
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()

In [9]:
num_user_features = user_train.shape[1] - 3 # drop ID, rating count, and ave rating during training
num_item_features = item_train.shape[1] - 1 # drop movie ID

uvs = 3 # user genre vector start
ivs = 3 # item genre ""
u_s = 3 # start of cols to use in training (user)
i_s = 1 # start of cols to use in training (item)

print(f"Number of training vectors: {len(item_train)}")

Number of training vectors: 50884


In [11]:
# First few entries in the user training array
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

[user id],[rating count],[rating ave],Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9
2,22,4.0,4.0,4.2,0.0,0.0,4.0,4.1,4.0,4.0,0.0,3.0,4.0,0.0,3.9,3.9


In [12]:
# First entries in the item training array
pprint_train(item_train, item_features, ivs, i_s, maxcount=5, user=False)

[movie id],year,ave rating,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
6874,2003,4.0,1,0,0,0,0,1,0,0,0,0,0,0,0,1
8798,2004,3.8,1,0,0,0,0,1,0,1,0,0,0,0,0,1
46970,2006,3.2,1,0,0,0,1,0,0,0,0,0,0,0,0,0
48516,2006,4.3,0,0,0,0,0,1,0,1,0,0,0,0,0,1
58559,2008,4.2,1,0,0,0,0,1,0,1,0,0,0,0,0,0


In [13]:
print(f"y_train[:5]: {y_train[:5]}")

y_train[:5]: [4.  3.5 4.  4.  4.5]


The target, y, is the movie rating given by the user.

To prepare the training data, we standardize the input features (StandardScaler) and apply min-max scaling to the target ratings, mapping them to the range -1 to 1 (MinMaxScaler).

In [14]:
item_train_unscaled = item_train
user_train_unscaled = user_train
y_train_unscaled = y_train

scalerItem = StandardScaler()
scalerItem.fit(item_train)
item_train = scalerItem.transform(item_train)

scalerUser = StandardScaler()
scalerUser.fit(user_train)
user_train = scalerUser.transform(user_train)

scalerTarget = MinMaxScaler((-1, 1))
scalerTarget.fit(y_train.reshape(-1, 1))
y_train = scalerTarget.transform(y_train.reshape(-1, 1))

# Verify
print(np.allclose(item_train_unscaled, scalerItem.inverse_transform(item_train)))
print(np.allclose(user_train_unscaled, scalerUser.inverse_transform(user_train)))

True
True
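For intuition, the two scalings can be written out by hand. The NumPy sketch below reproduces the formulas on toy values (the numbers are made up for illustration); it is meant to show what the scikit-learn scalers compute, not to replace them.

```python
import numpy as np

# Toy feature matrix: columns are (year, average rating)
X = np.array([[2003.0, 4.0], [2004.0, 3.8], [2006.0, 3.2], [2008.0, 4.2]])

# Standardization: subtract the column mean, divide by the column std.
# StandardScaler uses the population std (ddof=0), which is NumPy's default.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

# Min-max scaling of targets to the range (-1, 1)
y = np.array([0.5, 2.75, 5.0])
y_min, y_max = y.min(), y.max()
y_scaled = 2.0 * (y - y_min) / (y_max - y_min) - 1.0

# Both transforms are invertible, which is what inverse_transform relies on
X_back = X_scaled * sigma + mu
y_back = (y_scaled + 1.0) / 2.0 * (y_max - y_min) + y_min

print(X_scaled.mean(axis=0), y_scaled)
```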


As usual, we split the data into training and test sets.
scikit-learn's train_test_split helps with that.

The split also shuffles the data (shuffle=True).
Using the same random_state in each call ensures the shuffle is identical across the user, item, and y arrays, so the rows stay aligned.

In [17]:
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test = train_test_split(y_train, train_size=0.80, shuffle=True, random_state=1)

print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test data shape: {item_test.shape}")

movie/item training data shape: (40707, 17)
movie/item test data shape: (10177, 17)
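An alternative that avoids relying on matching seeds is to pass all three arrays to a single `train_test_split` call, which applies one shared shuffle to every array. The sketch below checks the alignment on toy arrays (the `_demo` names and values are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 10
# Column 0 of each toy array carries the original row index so alignment can be checked
user_demo = np.column_stack([np.arange(n), np.random.default_rng(0).normal(size=n)])
item_demo = np.column_stack([np.arange(n), np.arange(n) * 10.0])
y_demo = np.arange(n, dtype=float)

# One call, one shuffle: the outputs come back as train/test pairs per input array
(user_tr, user_te, item_tr, item_te, y_tr, y_te) = train_test_split(
    user_demo, item_demo, y_demo, train_size=0.80, shuffle=True, random_state=1
)

aligned = bool(np.all(user_tr[:, 0] == item_tr[:, 0]) and np.all(item_tr[:, 0] == y_tr))
print(f"rows aligned after split: {aligned}")
```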


In [18]:
# Now scaled, should have a mean of 0.
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

[user id],[rating count],[rating ave],Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
1,0,-1.0,-0.8,-0.7,0.1,-0.0,-1.2,-0.4,0.6,-0.5,-0.5,-0.1,-0.6,-0.6,-0.7,-0.7
0,1,-0.7,-0.5,-0.7,-0.1,-0.2,-0.6,-0.2,0.7,-0.5,-0.8,0.1,-0.0,-0.6,-0.5,-0.4
-1,-1,-0.2,0.3,-0.4,0.4,0.5,1.0,0.6,-1.2,-0.3,-0.6,-2.3,-0.1,0.0,0.4,-0.0
0,-1,0.6,0.5,0.5,0.2,0.6,-0.1,0.5,-1.2,0.9,1.2,-2.3,-0.1,0.0,0.2,0.3
-1,0,0.7,0.6,0.5,0.3,0.5,0.4,0.6,1.0,0.6,0.3,0.8,0.8,0.4,0.7,0.7


Now for the neural networks. The two networks do not need to be identical (only their output sizes must match), but they will be for this lab. In this case, user and movie content are similar.

Use Keras and its Sequential Model
- Layer 1: Dense, 256 units, relu activation
- Layer 2: Dense, 128 units, relu
- Layer 3 (output): Dense, num_outputs units, no activation (or linear)



The original lab also makes use of the functional API of Keras, apparently allowing flexibility in component interconnectivity.

In [19]:
num_outputs = 32
tf.random.set_seed(1)

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs)
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs)
])

# Create user input, point to base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))  # shape must be a tuple
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# Create item input and point to base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))  # shape must be a tuple
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# Compute dot product of vu and vm vectors
output = tf.keras.layers.Dot(axes=1)([vu, vm]) # I hate how this is written

# Specify inputs/outputs of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 14)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 16)]         0                                            
__________________________________________________________________________________________________
sequential (Sequential)         (None, 32)           40864       input_1[0][0]                    
__________________________________________________________________________________________________
sequential_1 (Sequential)       (None, 32)           41376       input_2[0][0]                    
______________________________________________________________________________________________
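Because both output vectors are L2-normalized before the dot product, the model's output is a cosine similarity and therefore always lies between -1 and 1, which lines up with scaling the targets to (-1, 1). A NumPy sketch of that final step, using random stand-ins for the two network outputs (the `_demo` arrays are not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
vu_demo = rng.normal(size=(4, 32))  # stand-in for user network outputs
vm_demo = rng.normal(size=(4, 32))  # stand-in for item network outputs

def l2_normalize(v, axis=1, eps=1e-12):
    """Divide each row by its Euclidean norm (eps guards against division by zero)."""
    norm = np.sqrt(np.maximum(np.sum(v ** 2, axis=axis, keepdims=True), eps))
    return v / norm

vu_n = l2_normalize(vu_demo)
vm_n = l2_normalize(vm_demo)

# Row-wise dot product, one score per (user, item) pair
scores = np.sum(vu_n * vm_n, axis=1, keepdims=True)
print(scores.shape, scores.min(), scores.max())
```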



In [20]:
# Use mean-squared error and Adam optimizer
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt, loss=cost_fn)

In [21]:
tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], y_train, epochs=30)



Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f6d4b379340>

In [23]:
# With this training loss (0.0727 when initially run), we evaluate the model on the test data
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], y_test)



0.08293633908033371

The test loss of 0.083 on the initial run is comparable to the training loss (0.0727), so there is no indication of substantial overfitting.
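The loss is measured on targets scaled to (-1, 1), so it is hard to read directly. The back-of-the-envelope sketch below converts the test MSE into an approximate RMSE in rating units. It assumes the ratings span 0.5 to 5.0 (the usual MovieLens range); that range is an assumption here and should be checked against `y_train_unscaled`.

```python
import math

test_mse_scaled = 0.08293633908033371  # value returned by model.evaluate above

# ASSUMPTION: ratings range from 0.5 to 5.0; verify with y_train_unscaled.min()/max()
rating_min, rating_max = 0.5, 5.0

# MinMaxScaler((-1, 1)) maps [rating_min, rating_max] onto an interval of width 2
stars_per_scaled_unit = (rating_max - rating_min) / 2.0

rmse_scaled = math.sqrt(test_mse_scaled)
rmse_stars = rmse_scaled * stars_per_scaled_unit
print(f"RMSE: {rmse_scaled:.3f} scaled units, about {rmse_stars:.2f} stars")
```

Under that assumption, a typical prediction is off by roughly two-thirds of a star.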

The model is now trained, so we try to make predictions.




First, create a new user to recommend movies to.

In [24]:
new_uid = 5000
new_rate_ave = 0.0
new_action = 0.0
new_adventure = 5.0 # But from whence? I guess the 'setup' stage of a new acct
new_animation = 0.0
new_childrens = 0.0
new_comedy = 0.0
new_crime = 0.0
new_documentary = 0.0
new_drama = 0.0
new_fantasy = 0.0
new_horror = 0.0
new_mystery = 0.0
new_romance = 0.0
new_scifi = 0.0
new_thriller = 0.0
new_rating_count = 1

user_vec = np.array([[new_uid, new_rating_count, new_rate_ave, new_action, new_adventure, 
                      new_animation, new_childrens, new_comedy, new_crime, new_documentary,
                     new_drama, new_fantasy, new_horror, new_mystery, new_romance,
                     new_scifi, new_thriller]])

In [25]:
# Generate and replicate the user vector to match the number of movies in the dataset
user_vecs = gen_user_vecs(user_vec, len(item_vecs))

# Scale user and item vecs
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# Make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# Unscale it
y_pu = scalerTarget.inverse_transform(y_p)

# Sort the results
sorted_index = np.argsort(-y_pu, axis=0).reshape(-1).tolist()
sorted_ypu = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index] # Unscaled vecs for display

print_pred_movies(sorted_ypu, sorted_items, movie_dict, maxcount=10)

y_p,movie id,rating ave,title,genres
4.0,98809,3.8,"Hobbit: An Unexpected Journey, The (2012)",Adventure|Fantasy
4.0,106489,3.6,"Hobbit: The Desolation of Smaug, The (2013)",Adventure|Fantasy
4.0,88125,3.9,Harry Potter and the Deathly Hallows: Part 2 (2011),Action|Adventure|Drama|Fantasy|Mystery
4.0,69844,3.9,Harry Potter and the Half-Blood Prince (2009),Adventure|Fantasy|Mystery|Romance
3.9,4896,3.8,Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),Adventure|Children|Fantasy
3.9,54001,3.9,Harry Potter and the Order of the Phoenix (2007),Adventure|Drama|Fantasy
3.9,137857,3.6,The Jungle Book (2016),Adventure|Drama|Fantasy
3.9,118696,3.4,The Hobbit: The Battle of the Five Armies (2014),Adventure|Fantasy
3.8,40815,3.8,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller
3.8,59387,4.0,"Fall, The (2006)",Adventure|Drama|Fantasy
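The two mechanical steps in that cell, replicating the user vector and ranking the predictions, can be sketched on toy data. `gen_user_vecs` comes from the lab's utility module, so the `np.tile` call below is only a guess at what it does; all values are made up.

```python
import numpy as np

# Replicate one user row so there is one copy per movie
# (assumed to be roughly what gen_user_vecs does)
user_vec_demo = np.array([[5000, 1, 0.0, 0.0, 5.0]])
num_items_demo = 4
user_vecs_demo = np.tile(user_vec_demo, (num_items_demo, 1))

# Toy unscaled predictions, one per movie, shaped (n, 1) like model.predict output
y_pu_demo = np.array([[3.1], [4.0], [2.5], [3.9]])

# Negating before argsort gives indices in descending order of predicted rating
order = np.argsort(-y_pu_demo, axis=0).reshape(-1).tolist()
ranked = y_pu_demo[order]
print(order, ranked.ravel())
```

The same `order` list can index any per-movie array (item vectors, titles), which keeps all of them in the same ranking.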


How about predictions for an existing user?

In [26]:
uid = 2
user_vecs, y_vecs = get_user_vecs(uid, user_train_unscaled, item_vecs, user_to_genre)

# Scale
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# Predict
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# Unscale
y_pu = scalerTarget.inverse_transform(y_p)

# Sort
sorted_index = np.argsort(-y_pu, axis=0).reshape(-1).tolist()
sorted_ypu = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]
sorted_user = user_vecs[sorted_index]
sorted_y = y_vecs[sorted_index]

print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, ivs, uvs, movie_dict, maxcount = 50)

y_p,y,user,user genre ave,movie rating ave,movie id,title,genres
4.4,5.0,2,[4.0],4.3,80906,Inside Job (2010),Documentary
4.3,3.5,2,"[4.0,4.1,4.0,3.9]",3.8,8798,Collateral (2004),Action|Crime|Drama|Thriller
4.2,3.5,2,"[4.0,4.0]",3.9,99114,Django Unchained (2012),Action|Drama
4.2,4.0,2,"[4.0,4.1,3.9]",4.0,6874,Kill Bill: Vol. 1 (2003),Action|Crime|Thriller
4.2,4.0,2,"[4.1,4.0,3.9]",4.3,48516,"Departed, The (2006)",Crime|Drama|Thriller
4.2,4.0,2,"[4.0,4.1,4.0,4.0,3.9,3.9]",4.1,79132,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller
4.1,4.5,2,"[4.0,4.0]",4.1,68157,Inglourious Basterds (2009),Action|Drama
4.1,3.5,2,"[4.0,3.9,3.9]",3.9,115713,Ex Machina (2015),Drama|Sci-Fi|Thriller
4.0,4.5,2,"[4.1,4.0,3.9]",4.0,80489,"Town, The (2010)",Crime|Drama|Thriller
4.0,5.0,2,"[4.0,4.1,4.0]",3.9,106782,"Wolf of Wall Street, The (2013)",Comedy|Crime|Drama


Not perfect, but generally in the ballpark. 1.1 difference at most, at a glance. The biggest misses seem to happen when the user's rating of a movie is far from their average rating for that genre.



Next, we try to find similar items by simply computing the squared distance between their feature vectors.

In [29]:
def sq_dist(a, b):
    """
    Returns the squared distance between two vectors
    Args:
        a (ndarray (n,)): vector with n features
        b (ndarray (n,)): vector with n features
    Returns:
        d (float): squared distance
    """
    d = np.subtract(a, b)
    d = np.dot(d, d)
    return d
    

In [32]:
sq_dist([0, 1, 0], [1, 0, 0]) # Expect 2, but also lol

2
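One useful identity when the vectors are L2-normalized, as the outputs of the two networks are in the model above: expanding the squared distance gives

$$\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\, a \cdot b \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1.$$

So for unit vectors, the smallest squared distance corresponds to the largest dot product (cosine similarity). The test call above fits this: both inputs are unit vectors with $a \cdot b = 0$, giving $2 - 0 = 2$.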

To save computation, a matrix of pairwise distances can be computed once after the model is trained and then reused for new recommendations without retraining.

Step 1: once the model is trained, gather the feature vector for each movie. Then use the trained item NN to build a small model that maps movie vectors to $v_m$.

In [33]:
input_item_m = tf.keras.layers.Input(shape=(num_item_features,))  # shape must be a tuple
vm_m = item_NN(input_item_m)
vm_m = tf.linalg.l2_normalize(vm_m, axis=1)
model_m = tf.keras.Model(input_item_m, vm_m)
model_m.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 16)]              0         
_________________________________________________________________
sequential_1 (Sequential)    (None, 32)                41376     
_________________________________________________________________
tf.math.l2_normalize_2 (TFOp (None, 32)                0         
Total params: 41,376
Trainable params: 41,376
Non-trainable params: 0
_________________________________________________________________


Once you have a movie model, you can create a set of movie feature vectors by using the model to predict using a set of item/movie vectors as input.
item_vecs is a set of all movie vectors, and it must be scaled to use with the trained model.

The result of the prediction in this case is a 32-entry feature vector for each movie:

In [34]:
scaled_item_vecs = scalerItem.transform(item_vecs)
vms = model_m.predict(scaled_item_vecs[:, i_s:])

print(f"size of all predicted movie feature vectors: {vms.shape}")

size of all predicted movie feature vectors: (847, 32)


Now, compute a matrix of squared distances between each movie feature vector and all other movie feature vectors.

From there, find the closest movie by looking at the minimum along each row.

NumPy's masked arrays help us avoid selecting the same movie (its distance to itself is obviously 0): masked entries on the diagonal are simply excluded from the calculation.

In [36]:
count = 50 # display this many movies
dim = len(vms)
dist = np.zeros((dim, dim))

for i in range(dim):
    for j in range(dim):
        dist[i, j] = sq_dist(vms[i, :], vms[j, :])
        
m_dist = ma.masked_array(dist, mask=np.identity(dist.shape[0]))

disp = [["movie1", "genres", "movie2", "genres"]]
for i in range(count):
    min_idx = np.argmin(m_dist[i])
    movie1_id = int(item_vecs[i, 0])
    movie2_id = int(item_vecs[min_idx, 0])
    disp.append([
        movie_dict[movie1_id]['title'], movie_dict[movie1_id]['genres'],
        movie_dict[movie2_id]['title'], movie_dict[movie2_id]['genres']
    ])
    

In [37]:
table = tabulate.tabulate(disp, tablefmt="html", headers="firstrow")
table

movie1,genres,movie2,genres.1
Save the Last Dance (2001),Drama|Romance,Mona Lisa Smile (2003),Drama|Romance
"Wedding Planner, The (2001)",Comedy|Romance,"Sweetest Thing, The (2002)",Comedy|Romance
Hannibal (2001),Horror|Thriller,Final Destination 2 (2003),Horror|Thriller
Saving Silverman (Evil Woman) (2001),Comedy|Romance,"Wedding Planner, The (2001)",Comedy|Romance
Down to Earth (2001),Comedy|Fantasy|Romance,Bewitched (2005),Comedy|Fantasy|Romance
"Mexican, The (2001)",Action|Comedy,Rush Hour 2 (2001),Action|Comedy
15 Minutes (2001),Thriller,Panic Room (2002),Thriller
Enemy at the Gates (2001),Drama,"Aviator, The (2004)",Drama
Heartbreakers (2001),Comedy|Crime|Romance,Fun with Dick and Jane (2005),Comedy|Crime
Spy Kids (2001),Action|Adventure|Children|Comedy,Scooby-Doo (2002),Adventure|Children|Comedy|Fantasy|Mystery


Obviously shaky ground to base actual recommendations on, but there you go!