# Movie Recommender with Content-Based Filtering and Neural Networks

This notebook implements a neural network with ***TensorFlow*** to create a content-based movie recommender. Unlike ***collaborative filtering***, which only leverages user-item interactions (e.g., ratings) to generate user and item vectors, **content-based filtering** utilizes additional information about both the user and the movie to enhance predictions.

**User and Movie Feature Vectors:**  The model generates a feature vector for each user and movie. The user feature vector captures the user’s preferences based on their past interactions with the movies. The movie feature vector represents the movie’s attributes, like genre or release date.

**Prediction Through Vector Similarity**: Predictions are made by calculating the dot product between the user and movie vectors ($v_u \cdot v_m$). Because both vectors are L2-normalized to unit length, this dot product equals the cosine of the angle between them. It indicates how well a movie aligns with a user’s preferences and serves as a personalized recommendation score.
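
As a small illustration of the relationship between the dot product and the cosine, the sketch below compares the two on toy vectors. The vector values and the helper `l2_normalize` are assumptions made for this example; they are not outputs of the model in this notebook.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    return v / np.linalg.norm(v)

# Toy stand-ins for the raw outputs of the user and movie networks.
v_u = np.array([1.0, 2.0, 2.0])
v_m = np.array([2.0, 0.0, 1.0])

# Cosine similarity computed directly from the raw vectors.
cosine = np.dot(v_u, v_m) / (np.linalg.norm(v_u) * np.linalg.norm(v_m))

# Dot product of the L2-normalized vectors.
score = np.dot(l2_normalize(v_u), l2_normalize(v_m))

print(f"cosine similarity:      {cosine:.4f}")
print(f"normalized dot product: {score:.4f}")
```

Both printed values agree, and the score always lies between -1 and 1.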

**Finding Similar Items**
To identify similar items, we can analyze only the item/movie vectors, without needing the user vector. Movies with similar feature vectors are considered alike, allowing us to make recommendations based on item similarity. For example, if a user rates a specific movie highly, we can recommend other movies with closely matching feature vectors, suggesting content that aligns with the user’s interests.

A similarity measure is the squared distance between the two vectors of movies/items $ \mathbf{v_m^{(k)}}$ and $\mathbf{v_m^{(i)}}$ :
$$\left\Vert \mathbf{v_m^{(k)}} - \mathbf{v_m^{(i)}}  \right\Vert^2 = \sum_{l=1}^{n}(v_{m_l}^{(k)} - v_{m_l}^{(i)})^2$$







In [None]:
import numpy as np
import numpy.ma as ma
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

from numpy import genfromtxt
import csv
from collections import defaultdict
pd.set_option("display.precision", 1)

from IPython.display import display, HTML
import pickle  # the standard library supports pickle protocol 5 from Python 3.8, so the pickle5 backport is not needed



### Preparing the dataset

#### Dataset
The data set is derived from the [MovieLens ml-latest-small](https://grouplens.org/datasets/movielens/latest/) dataset.

The original dataset has roughly 9000 movies rated by 600 users, with ratings on a scale of 0.5 to 5 in steps of 0.5. The dataset has been reduced in size to focus on movies released since 2000 and on popular genres. The reduced dataset has $n_u = 397$ users, $n_m = 847$ movies, and 25521 ratings.
For each movie, the dataset provides a movie title, release date, and one or more genres. For example "Lord of the Rings: The Two Towers" was released in 2002 and has two genres: "Adventure|Fantasy". This dataset contains little information about users other than their ratings.

<br></br>
#### Training data
**Movie content** includes original features like the release year and a ***one-hot encoded vector*** for 14 genres, along with an ***engineered feature***: the average user rating.
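
A minimal sketch of how a pipe-separated genre string could be turned into such a vector is shown here. The helper `encode_genres` is hypothetical and only illustrates the idea; the genre order follows the item feature header displayed later in this notebook.

```python
import numpy as np

# The 14 genres, in the column order of the item feature header.
GENRES = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
          "Documentary", "Drama", "Fantasy", "Horror", "Mystery", "Romance",
          "Sci-Fi", "Thriller"]

def encode_genres(genre_string):
    """Map a string such as 'Adventure|Fantasy' to a 0/1 vector (hypothetical helper)."""
    present = set(genre_string.split("|"))
    return np.array([1.0 if genre in present else 0.0 for genre in GENRES])

vec = encode_genres("Adventure|Fantasy")
print(vec)
```

Since a movie can belong to several genres, more than one entry of the vector can be 1.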

**User content:** consists of ***engineered features***, including average ratings per genre, a user ID, rating count, and rating average (some of these are not used in training or predictions but are helpful for data interpretation).
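
The engineered user features could be derived roughly as sketched below. This is an assumption about the preprocessing, which is not part of this notebook; the toy ratings are invented for the example.

```python
from collections import defaultdict

# Toy ratings for one hypothetical user: (rating, pipe-separated genres).
toy_ratings = [(4.0, "Action|Crime"), (3.0, "Action"), (5.0, "Drama|Crime")]

totals = defaultdict(float)  # sum of ratings per genre
counts = defaultdict(int)    # number of ratings per genre
for rating, genres in toy_ratings:
    for genre in genres.split("|"):
        totals[genre] += rating
        counts[genre] += 1

# Average rating per genre; genres the user never rated do not appear here.
genre_avg = {genre: totals[genre] / counts[genre] for genre in totals}
rating_count = len(toy_ratings)
rating_ave = sum(rating for rating, _ in toy_ratings) / rating_count

print(genre_avg, rating_count, rating_ave)
```

A movie with several genres contributes its rating to each of them.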

**Note:** The training set comprises all user ratings, with some repeated to balance underrepresented genres. This set is split into two arrays: one for users and one for movies/items.


**Target y:** The movie rating given by the user.

In the example shown below, the user with id 2 rated the movie with id 6874 with a 4 out of 5.

<br></br>
#### Scaling the training data

**Feature scaling** is crucial for improving convergence in machine learning.

- **Input features** are scaled using the ***StandardScaler from scikit-learn***, which standardizes them by removing the mean and scaling to unit variance.
- **Target ratings** are scaled using the ***MinMaxScaler from scikit-learn***, which maps them to a range between -1 and 1. Scaling both the inputs and the targets helps the model learn and keeps training stable.
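
To make the two transformations concrete, the sketch below reproduces the underlying arithmetic with plain NumPy on toy values. It assumes the default behaviour of the two scalers (column-wise mean and population standard deviation for standardization, column-wise minimum and maximum for min-max scaling) and is not the code used in this notebook.

```python
import numpy as np

# Toy feature matrix (year, average rating) and toy ratings; illustrative values only.
X = np.array([[2000.0, 3.5],
              [2010.0, 4.0],
              [2020.0, 4.5]])
y = np.array([0.5, 2.75, 5.0])

# Standardization: subtract the column mean, divide by the column standard deviation.
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std

# Min-max scaling of the target to the range [-1, 1].
y_min, y_max = y.min(), y.max()
y_scaled = 2.0 * (y - y_min) / (y_max - y_min) - 1.0

# Inverse transform: map scaled values back to the original rating scale.
y_restored = (y_scaled + 1.0) / 2.0 * (y_max - y_min) + y_min

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(y_scaled)
```

The inverse step matters at prediction time, because the model outputs values in the scaled range and they have to be converted back to ratings for interpretation.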

<br></br>
#### Splitting the data into training and test sets

scikit-learn's *train_test_split* is used to **split and shuffle** the data. Setting `random_state` to the same value in each call ensures that ***item, user, and y are shuffled identically***, which maintains the correct correspondence between them.
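
The alignment can be checked on toy arrays, as in the sketch below. It also shows an alternative that is not used in this notebook: `train_test_split` accepts several arrays in a single call and splits them with the same indices. The toy arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays whose rows correspond one-to-one (row i of each belongs to the same rating).
users = np.arange(10).reshape(-1, 1)         # stand-in for user features
items = np.arange(10).reshape(-1, 1) + 100   # stand-in for item features
y = np.arange(10) + 1000                     # stand-in for ratings

# Separate calls that share one random_state, as done in this notebook.
u_tr, u_te = train_test_split(users, train_size=0.8, shuffle=True, random_state=1)
i_tr, i_te = train_test_split(items, train_size=0.8, shuffle=True, random_state=1)

# Alternative: one call with all arrays, so they are split with the same indices.
u_tr2, u_te2, i_tr2, i_te2, y_tr2, y_te2 = train_test_split(
    users, items, y, train_size=0.8, shuffle=True, random_state=1)

# Rows still line up: the item value is the user value plus 100 in every split.
print(np.array_equal(i_tr[:, 0], u_tr[:, 0] + 100))
print(np.array_equal(i_tr2[:, 0], u_tr2[:, 0] + 100))
print(np.array_equal(y_tr2, u_tr2[:, 0] + 1000))
```

The single-call form removes the risk of forgetting the shared seed in one of the calls.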

In [None]:
## Auxiliary functions

def load_data():
    item_train = genfromtxt('./content_item_train.csv', delimiter=',')
    user_train = genfromtxt('./content_user_train.csv', delimiter=',')
    y_train    = genfromtxt('./content_y_train.csv', delimiter=',')
    with open('./content_item_train_header.txt', newline='') as f:    #csv reader handles quoted strings better
        item_features = list(csv.reader(f))[0]
    with open('./content_user_train_header.txt', newline='') as f:
        user_features = list(csv.reader(f))[0]
    item_vecs = genfromtxt('./content_item_vecs.csv', delimiter=',')

    movie_dict = defaultdict(dict)
    count = 0
    with open('./content_movie_list.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for line in reader:
            if count == 0:
                count += 1  #skip header
                #print(line) print
            else:
                count += 1
                movie_id = int(line[0])
                movie_dict[movie_id]["title"] = line[1]
                movie_dict[movie_id]["genres"] = line[2]

    with open('./content_user_to_genre.pickle', 'rb') as f:
        user_to_genre = pickle.load(f)

    return(item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre)


def pprint_manual_train(x_train, features, maxcount=5, show_id=True):
    # Start building the HTML table
    html = "<table style='border-collapse: collapse; width: 90%;'>"
    html += "<thead><tr>"

    # Create table headers, optionally excluding the ID
    if not show_id:
        headers = features[1:]  # Skip the first feature (ID)
    else:
        headers = features  # Include all features

    for feature in headers:
        html += f"<th style='border: 1px solid black; padding: 4px;'>{feature}</th>"
    html += "</tr></thead><tbody>"

    # Fill in the rows of the table
    for i in range(min(maxcount, x_train.shape[0])):
        html += "<tr>"
        for j, value in enumerate(x_train[i]):
            # Skip the ID column if show_id is False
            if j == 0 and not show_id:
                continue  # Skip ID column

            # Format the first column (if ID is shown) as an integer
            if j == 0:
                html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{int(value)}</td>"
            else:
                html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{value}</td>"
        html += "</tr>"

    html += "</tbody></table>"
    return html

def print_pred_movies(y_p, item, movie_dict, maxcount=10):
    """Print results of prediction of a new user. Inputs are expected to be in
    sorted order, unscaled."""

    # Start building the HTML table
    html = "<table style='border-collapse: collapse; width: 90%;'>"
    html += "<thead><tr>"

    # Create table headers
    headers = ["Predicted Rating", "Movie ID", "Average Rating", "Title", "Genres"]
    for header in headers:
        html += f"<th style='border: 1px solid black; padding: 4px;'>{header}</th>"
    html += "</tr></thead><tbody>"

    # Fill in the rows of the table
    count = 0
    for i in range(0, y_p.shape[0]):
        if count == maxcount:
            break
        count += 1
        movie_id = item[i, 0].astype(int)
        html += "<tr>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{np.around(y_p[i, 0], 1)}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_id}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{np.around(item[i, 2].astype(float), 1)}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie_id]['title']}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie_id]['genres']}</td>"
        html += "</tr>"

    html += "</tbody></table>"
    return html

def gen_user_vecs(user_vec, num_items):
    """ given a user vector this function creates a matrix of user vectors that can be used to match user preferences against multiple items."""
    user_vecs = np.tile(user_vec, (num_items, 1)) # It repeats user_vec num_items times along the first dimension (rows), while keeping the second dimension (columns) unchanged
    return user_vecs

def get_user_vecs(user_id, user_train, item_vecs, user_to_genre):
    """ given a user_id, return:
        user train/predict matrix to match the size of item_vecs
        y vector with ratings for all rated movies and 0 for others of size item_vecs """

    if not user_id in user_to_genre:
        print("error: unknown user id")
        return None
    else:
        user_vec_found = False
        for i in range(len(user_train)):
            if user_train[i, 0] == user_id:
                user_vec = user_train[i]
                user_vec_found = True
                break
        if not user_vec_found:
            print("error in get_user_vecs, did not find uid in user_train")
            return None  # user_vec is undefined here, so stop instead of failing below
        num_items = len(item_vecs)
        user_vecs = np.tile(user_vec, (num_items, 1))

        y = np.zeros(num_items)
        for i in range(num_items):  # walk through movies in item_vecs and get the movies, see if user has rated them
            movie_id = item_vecs[i, 0]
            if movie_id in user_to_genre[user_id]['movies']:
                rating = user_to_genre[user_id]['movies'][movie_id]
            else:
                rating = 0
            y[i] = rating
    return(user_vecs, y)

def print_existing_user(y_p, y, user, items, ivs, uvs, movie_dict, maxcount=10):
    """Print results of prediction for an existing user. Inputs are expected to be in sorted order, unscaled."""

    # Start building the HTML table
    html = "<table style='border-collapse: collapse; width: 90%;'>"
    html += "<thead><tr>"

    # Define table headers
    headers = ["Predicted Rating", "Actual Rating", "User ID", "User Genre Avg", "Movie Rating Avg", "Movie ID", "Title", "Genres"]
    for header in headers:
        html += f"<th style='border: 1px solid black; padding: 4px;'>{header}</th>"
    html += "</tr></thead><tbody>"

    count = 0
    for i in range(y.shape[0]):
        if y[i, 0] != 0:  # Skip if not rated
            if count == maxcount:
                break
            count += 1
            movie_id = items[i, 0].astype(int)

            # Get the user's genre average for rated genres
            offsets = np.nonzero(items[i, ivs:] == 1)[0]
            genre_ratings = user[i, uvs + offsets]
            genre_ratings_str = ", ".join([f"{rating:.1f}" for rating in genre_ratings])

            # Add a row with data for each rated movie
            html += "<tr>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{y_p[i, 0]:.1f}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{y[i, 0]:.1f}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{int(user[i, 0])}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{genre_ratings_str}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{items[i, 2]:.1f}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_id}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie_id]['title']}</td>"
            html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie_id]['genres']}</td>"
            html += "</tr>"

    html += "</tbody></table>"
    return html



In [None]:
# Load Data, set configuration variables
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()

num_user_features = user_train.shape[1] - 3  # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1] - 1  # remove movie id at train time
print(f"Number of user features: {num_user_features}")
print(f"Number of item features: {num_item_features}")
print(f"Number of training vectors: {len(item_train)}")


Number of user features: 14
Number of item features: 16
Number of training vectors: 50884


In [None]:
#First row of user_train data
display(HTML(pprint_manual_train(user_train, user_features, maxcount=1)))

#First row of item_train data
display(HTML(pprint_manual_train(item_train, item_features, maxcount=1)))

#Target of that (user,item) pair
print(f"y_train[:1]: {y_train[:1]}")



user id,rating count,rating ave,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
2,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89


movie id,year,ave rating,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
6874,2003.0,3.9618320610687023,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


y_train[:1]: [4.]


In [None]:
# scale training data
item_train_unscaled = item_train
user_train_unscaled = user_train
y_train_unscaled    = y_train

scalerItem = StandardScaler()
scalerItem.fit(item_train)
item_train = scalerItem.transform(item_train)

scalerUser = StandardScaler()
scalerUser.fit(user_train)
user_train = scalerUser.transform(user_train)

scalerTarget = MinMaxScaler((-1, 1))
scalerTarget.fit(y_train.reshape(-1, 1))
y_train = scalerTarget.transform(y_train.reshape(-1, 1))
#ynorm_test = scalerTarget.transform(y_test.reshape(-1, 1))

# Check scaling doing the inverse
print(np.allclose(item_train_unscaled, scalerItem.inverse_transform(item_train)))
print(np.allclose(user_train_unscaled, scalerUser.inverse_transform(user_train)))

True
True


In [None]:
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test       = train_test_split(y_train,    train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test data shape: {item_test.shape}")

#Show scaled data
display(HTML(pprint_manual_train(user_train, user_features, maxcount=1, show_id=False)))

movie/item training data shape: (40707, 17)
movie/item test data shape: (10177, 17)


rating count,rating ave,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
-0.0821578588532831,-0.9734314607219016,-0.7500437305379585,-0.7120370523253821,0.0544552941202423,-0.0486580466176729,-1.2007356134851437,-0.413507997185553,0.638343600863053,-0.5454291584413995,-0.4652740925492035,-0.0630374514812024,-0.5693510969927813,-0.6377189361688048,-0.6547691297657673,-0.7338245274481356


### Neural Networks for content-based filtering

#### Architecture for both Neural Networks
- First Layer: A dense layer with 256 units and a ReLU activation function. This layer introduces non-linearity to the model, enabling it to learn complex patterns in the input data.

- Second Layer: A dense layer with 128 units and a ReLU activation function. Similar to the first layer, this layer continues to build non-linear representations of the data.

- Third Layer: A dense layer with num_outputs units, using a linear activation function. This layer produces the final output of the network, suitable for regression tasks where the output can take any real value.

- Normalization Layer: The output of the third layer is passed through a Lambda layer that applies L2 normalization. This ensures that the output vectors of both the user and item networks have unit norm, so their dot product lies in $[-1, 1]$, the same range as the scaled target ratings. It also helps stabilize training by controlling the scale of the outputs.

- Loss Function: The model uses Mean Squared Error (MSE) as the loss function. MSE measures the average squared difference between the predicted and actual values.

- Optimizer: The model employs the Adam optimizer, a popular choice for training neural networks.


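Note: The user and item networks do not need to share the same architecture. Only their output vectors must have the same size (`num_outputs`) so that the dot product is defined.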
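
The note above can be illustrated with a NumPy-only sketch of two towers that have different hidden layers but the same output size. The weights are random and untrained, the smaller item tower is a hypothetical variant, and `make_tower` and `forward` are helper names invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tower(layer_sizes):
    """Random weights for a small dense network (untrained, for shape illustration only)."""
    return [(rng.normal(size=(n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(tower, x):
    """ReLU on hidden layers, linear output layer, then L2 normalization of each row."""
    for W, b in tower[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = tower[-1]
    out = x @ W + b
    return out / np.linalg.norm(out, axis=1, keepdims=True)

num_outputs = 32
user_tower = make_tower([14, 256, 128, num_outputs])  # layer sizes of the user network
item_tower = make_tower([16, 64, num_outputs])        # hypothetical smaller item network

v_u = forward(user_tower, rng.normal(size=(5, 14)))
v_m = forward(item_tower, rng.normal(size=(5, 16)))

# Row-wise dot product: one scaled rating prediction per (user, item) pair.
pred = np.sum(v_u * v_m, axis=1)
print(v_u.shape, v_m.shape, pred.shape)
```

The towers take inputs of different widths (14 user features, 16 item features) and differ in depth, yet both produce 32-entry unit vectors that can be combined.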

In [None]:
num_outputs = 32
tf.random.set_seed(1)

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs, activation='linear'),
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs, activation='linear'),
])

# Create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)

# Normalize using a Lambda layer
vu = tf.keras.layers.Lambda(lambda x: tf.linalg.l2_normalize(x, axis=1))(vu)

# Create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)

# Normalize using a Lambda layer
vm = tf.keras.layers.Lambda(lambda x: tf.linalg.l2_normalize(x, axis=1))(vm)

# Compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# Specify the inputs and output of the model
model = tf.keras.Model(inputs=[input_user, input_item], outputs=output)

model.summary()


In [None]:
# Compile and train the model

tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
              loss=cost_fn)

tf.random.set_seed(1)

u_s = 3  # start of columns to use in training, user (avoid id, rating count and rating average)
i_s = 1  # start of columns to use in training, items (avoid id)

model.fit([user_train[:, u_s:], item_train[:, i_s:]], y_train, epochs=30)

Epoch 1/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step - loss: 0.1303
Epoch 2/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 3ms/step - loss: 0.1150
Epoch 3/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step - loss: 0.1103
Epoch 4/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 5ms/step - loss: 0.1054
Epoch 5/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 6ms/step - loss: 0.1016
Epoch 6/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 3ms/step - loss: 0.0990
Epoch 7/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 4ms/step - loss: 0.0967
Epoch 8/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 3ms/step - loss: 0.0944
Epoch 9/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 3ms/step - loss: 0.0924
Epoch 10/30
[1m1273/1273[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

<keras.src.callbacks.history.History at 0x7be4b9fc0c40>

In [None]:
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], y_test)

319/319 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0872


0.08373275399208069

### Predictions

#### 1) Predictions for a new user

**A new user** will be created with the attributes below. This user prefers Adventure, Science Fiction, Fantasy and Action genres.

**item_vecs** is a set of movie/item vectors containing one vector for each movie in the training/test set.
Each of them is paired with the new user vector above, and the ***scaled vectors*** are used to predict ratings for all the movies.

In [None]:
# Creation of new user with some values.

new_user_id = 5000
new_rating_ave = 0.0
new_action = 3.0
new_adventure = 5.0
new_animation = 0.5
new_childrens = 0.5
new_comedy = 1.5
new_crime = 1.5
new_documentary = 0.5
new_drama = 2.5
new_fantasy = 5.0
new_horror = 0.5
new_mystery = 0.0
new_romance = 0.0
new_scifi = 5.0
new_thriller = 1.0
new_rating_count = 3

user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_action, new_adventure, new_animation, new_childrens,
                      new_comedy, new_crime, new_documentary,
                      new_drama, new_fantasy, new_horror, new_mystery,
                      new_romance, new_scifi, new_thriller]])

# generate and replicate the user vector to match the number movies in the data set.
user_vecs = gen_user_vecs(user_vec,len(item_vecs))

# scale user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# unscale y prediction, in order to have the values for interpretation
y_pu = scalerTarget.inverse_transform(y_p)

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display

display(HTML(print_pred_movies(sorted_ypu, sorted_items, movie_dict, maxcount = 10)))

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


Predicted Rating,Movie ID,Average Rating,Title,Genres
4.6,40815,3.8,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller
4.6,5816,3.6,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy
4.5,57640,3.8,Hellboy II: The Golden Army (2008),Action|Adventure|Fantasy|Sci-Fi
4.5,98809,3.8,"Hobbit: An Unexpected Journey, The (2012)",Adventure|Fantasy
4.5,106489,3.6,"Hobbit: The Desolation of Smaug, The (2013)",Adventure|Fantasy
4.4,81834,4.0,Harry Potter and the Deathly Hallows: Part 1 (2010),Action|Adventure|Fantasy
4.4,4896,3.8,Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),Adventure|Children|Fantasy
4.4,122886,3.9,Star Wars: Episode VII - The Force Awakens (2015),Action|Adventure|Fantasy|Sci-Fi
4.4,8368,3.9,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy
4.4,98243,3.8,Rise of the Guardians (2012),Adventure|Animation|Children|Fantasy


#### 2) Predictions for an existing user

Below are the predictions for user 2, one of the existing users in the data set.

In [None]:
uid = 2
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, user_train_unscaled, item_vecs, user_to_genre)

# scale our user and item vectors
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# make a prediction
y_p = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]])

# unscale y prediction
y_pu = scalerTarget.inverse_transform(y_p)

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu,axis=0).reshape(-1).tolist()  #negate to get largest rating first
sorted_ypu   = y_pu[sorted_index]
sorted_items = item_vecs[sorted_index]  #using unscaled vectors for display
sorted_user  = user_vecs[sorted_index]
sorted_y     = y_vecs[sorted_index]

uvs = 3  # index where the genre columns start in the user array (after id, rating count, and rating average)
ivs = 3  # index where the genre columns start in the item array (after id, year, and average rating)

#print sorted predictions for movies rated by the user
display(HTML(print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, ivs, uvs, movie_dict, maxcount = 20)))

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


Predicted Rating,Actual Rating,User ID,User Genre Avg,Movie Rating Avg,Movie ID,Title,Genres
4.5,5.0,2,4.0,4.3,80906,Inside Job (2010),Documentary
4.3,4.0,2,"4.0, 4.1, 3.9",4.0,6874,Kill Bill: Vol. 1 (2003),Action|Crime|Thriller
4.2,3.5,2,"4.0, 4.1, 4.0, 3.9",3.8,8798,Collateral (2004),Action|Crime|Drama|Thriller
4.2,4.0,2,"4.0, 4.1, 4.0, 4.0, 3.9, 3.9",4.1,79132,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller
4.2,4.5,2,"4.0, 4.0",4.1,68157,Inglourious Basterds (2009),Action|Drama
4.1,4.5,2,"4.0, 4.1, 4.0",4.2,58559,"Dark Knight, The (2008)",Action|Crime|Drama
4.1,3.5,2,"4.0, 4.0",3.9,99114,Django Unchained (2012),Action|Drama
4.0,4.0,2,"4.1, 4.0, 3.9",4.3,48516,"Departed, The (2006)",Crime|Drama|Thriller
4.0,5.0,2,"4.0, 4.1, 4.0",3.9,106782,"Wolf of Wall Street, The (2013)",Comedy|Crime|Drama
3.9,3.5,2,"4.0, 3.9, 3.9",3.9,115713,Ex Machina (2015),Drama|Sci-Fi|Thriller


#### 3) Finding similar items

As noted at the beginning, similar items can be identified from the item/movie vectors alone, without the user vector. Movies with similar feature vectors are considered alike, which allows recommendations based on item similarity.

A similarity measure is the squared distance between the two vectors of movies/items $ \mathbf{v_m^{(k)}}$ and $\mathbf{v_m^{(i)}}$ :
$$\left\Vert \mathbf{v_m^{(k)}} - \mathbf{v_m^{(i)}}  \right\Vert^2 = \sum_{l=1}^{n}(v_{m_l}^{(k)} - v_{m_l}^{(i)})^2$$

A ***matrix of distances between movies*** can be computed once when the model is trained and ***then reused for new recommendations without retraining***. The first step, once a model is trained, is to obtain the movie feature vector, $v_m$, for each of the movies. To do this, `item_NN` can be used to build a small model to run the movie vectors through it to generate $v_m$.

Once the movie model is created, it can be used to build a set of movie feature vectors. `item_vecs` contains one vector for every movie and must be scaled before it is passed to the trained model. The prediction yields a ***32-entry feature vector for each movie***.

The closest movie to each movie can be found by taking the minimum along each row of the distance matrix. To avoid selecting the movie itself (the diagonal holds zero distances), ***numpy masked arrays*** are used: the masked values along the diagonal are excluded from the computation.
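
The effect of the mask can be seen on a toy example. The sketch below computes all pairwise squared distances with broadcasting (a vectorized alternative to a double loop) and compares the row-wise minimum with and without the masked diagonal. The toy vectors `vms_toy` are invented for illustration and are not model output.

```python
import numpy as np
import numpy.ma as ma

# Toy movie vectors (illustrative values only).
vms_toy = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 3.0],
                    [1.0, 1.5]])

# Pairwise squared distances via broadcasting: dist[i, j] = ||v_i - v_j||^2.
diff = vms_toy[:, None, :] - vms_toy[None, :, :]
dist = np.sum(diff ** 2, axis=2)

# Without a mask, every movie's nearest neighbour is itself (distance 0).
nearest_unmasked = np.argmin(dist, axis=1)

# Masking the diagonal excludes the self-match from the minimum.
m_dist = ma.masked_array(dist, mask=np.identity(len(dist)))
nearest = m_dist.argmin(axis=1)

print(nearest_unmasked)
print(nearest)
```

Because the distance matrix depends only on the movie vectors, it can be stored and reused for later lookups.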

In [None]:
def sq_dist(a,b):
    """
    Returns the squared distance between two vectors
    Args:
      a (ndarray (n,)): vector with n features
      b (ndarray (n,)): vector with n features
    Returns:
      d (float) : squared distance
    """
    d = np.sum(np.square(a - b))  # NumPy sum instead of Python's built-in sum
    return d

In [None]:
input_item_m = tf.keras.layers.Input(shape=(num_item_features,))   # Input layer
vm_m = item_NN(input_item_m)                                       # Use the trained item_NN
vm_m = tf.keras.layers.LayerNormalization(axis=1)(vm_m)            # Standardizes each vector (roughly zero mean, unit variance); note that training used L2 normalization instead
model_m = tf.keras.Model(input_item_m, vm_m)                       # Define the model
model_m.summary()


In [None]:
scaled_item_vecs = scalerItem.transform(item_vecs)
vms = model_m.predict(scaled_item_vecs[:,i_s:])
print(f"size of all predicted movie feature vectors: {vms.shape}")

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
size of all predicted movie feature vectors: (847, 32)


In [None]:
def print_movie_distances(vms, item_vecs, movie_dict, count=10):
    """
    Display a table showing the closest movies to each movie based on calculated distances.
    """
    dim = len(vms)
    dist = np.zeros((dim, dim))

    # Calculate the squared distances between movies
    for i in range(dim):
        for j in range(dim):
            dist[i, j] = sq_dist(vms[i, :], vms[j, :])

    # Mask the diagonal to ignore self-comparisons
    m_dist = ma.masked_array(dist, mask=np.identity(dist.shape[0]))

    # Begin building the HTML table
    html = "<table style='border-collapse: collapse; width: 90%;'>"
    html += "<thead><tr>"
    headers = ["Movie 1", "Genres", "Related Movie", "Genres"]
    for header in headers:
        html += f"<th style='border: 1px solid black; padding: 4px;'>{header}</th>"
    html += "</tr></thead><tbody>"

    # Fill in the rows for the closest movies
    for i in range(count):
        min_idx = np.argmin(m_dist[i])
        movie1_id = int(item_vecs[i, 0])
        movie2_id = int(item_vecs[min_idx, 0])

        html += "<tr>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie1_id]['title']}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie1_id]['genres']}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie2_id]['title']}</td>"
        html += f"<td style='border: 1px solid black; padding: 4px; text-align: center;'>{movie_dict[movie2_id]['genres']}</td>"
        html += "</tr>"

    html += "</tbody></table>"
    return html

In [None]:
display(HTML(print_movie_distances(vms, item_vecs, movie_dict)))

Movie 1,Genres,Related Movie,Genres
Save the Last Dance (2001),Drama|Romance,Mona Lisa Smile (2003),Drama|Romance
"Wedding Planner, The (2001)",Comedy|Romance,Mr. Deeds (2002),Comedy|Romance
Hannibal (2001),Horror|Thriller,Final Destination 2 (2003),Horror|Thriller
Saving Silverman (Evil Woman) (2001),Comedy|Romance,"Sweetest Thing, The (2002)",Comedy|Romance
Down to Earth (2001),Comedy|Fantasy|Romance,Bewitched (2005),Comedy|Fantasy|Romance
"Mexican, The (2001)",Action|Comedy,Rush Hour 2 (2001),Action|Comedy
15 Minutes (2001),Thriller,Panic Room (2002),Thriller
Enemy at the Gates (2001),Drama,"Aviator, The (2004)",Drama
Heartbreakers (2001),Comedy|Crime|Romance,Fun with Dick and Jane (2005),Comedy|Crime
Spy Kids (2001),Action|Adventure|Children|Comedy,Scooby-Doo (2002),Adventure|Children|Comedy|Fantasy|Mystery
