# Recommender systems. Content-based filtering

## Motivation

With collaborative filtering, the recommendations were based on the ratings only. Content-based filtering recognizes that other information about users or items may be available and may improve the prediction. That information is represented as features of the items and the users.

## Notation

We have the following:

* Features for user j: $x_u^{(j)}$
* Features for item i: $x_m^{(i)}$

We predict the rating of user j for item i as:

$$v_u^{(j)} \cdot v_m^{(i)}$$

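where $v_u^{(j)}$ and $v_m^{(i)}$ are vectors of the same length, derived from $x_u^{(j)}$ and $x_m^{(i)}$ respectively. The feature vectors $x_u^{(j)}$ and $x_m^{(i)}$ themselves may have different lengths.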
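As a minimal illustration of the prediction, the sketch below computes the dot product of two vectors with NumPy. The three-dimensional values are made up for the example; in practice both vectors are produced by the learned model.

```python
import numpy as np

# Hypothetical vectors of the same length (values invented for illustration)
v_u = np.array([0.5, -1.0, 2.0])   # vector for user j
v_m = np.array([1.0, 0.5, 0.25])   # vector for item i

# Predicted rating of user j for item i
predicted_rating = float(np.dot(v_u, v_m))
print(predicted_rating)
```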

## Implementation with deep learning

The vectors $v_u^{(j)}$ and $v_m^{(i)}$ are the outputs of two neural networks that take $x_u^{(j)}$ and $x_m^{(i)}$ as their respective inputs. The two networks are trained together by minimizing the cost function:

$$J = \sum_{(i, j):r(i,j)=1} (v_u^{(j)} \cdot v_m^{(i)}-y^{(i, j)})^2 + \text{neural network regularization term}$$
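The squared-error part of this cost can be sketched in NumPy as follows. This is only an illustration under assumptions: the function name `cost`, the matrix layout (items in rows, users in columns) and the toy values are not from the original, and the regularization term is left out.

```python
import numpy as np

def cost(V_u, V_m, Y, R):
    """Squared-error part of J (regularization term omitted).

    V_u: (n_users, d) user vectors, V_m: (n_items, d) item vectors,
    Y:   (n_items, n_users) ratings,
    R:   (n_items, n_users) with 1 where user j rated item i, else 0.
    """
    predictions = V_m @ V_u.T          # entry (i, j) is v_u^(j) . v_m^(i)
    errors = (predictions - Y) * R     # keep only the rated pairs
    return float(np.sum(errors ** 2))

# Toy data: 2 users, 2 items, 2-dimensional vectors
V_u = np.array([[1.0, 0.0], [0.0, 1.0]])
V_m = np.array([[2.0, 1.0], [0.0, 3.0]])
Y = np.array([[3.0, 0.0], [0.0, 1.0]])
R = np.array([[1, 0], [0, 1]])

J = cost(V_u, V_m, Y, R)
print(J)
```

Only the pairs with $r(i,j)=1$ contribute, which is what the multiplication by `R` implements.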

To find items similar to item i, look for the items k whose vector $v_m^{(k)}$ has a small squared distance to $v_m^{(i)}$:

$$\left\Vert v_m^{(k)} - v_m^{(i)}\right\Vert^2 = \sum_{l=1}^n (v_{m_l}^{(k)} - v_{m_l}^{(i)})^2$$
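A possible NumPy sketch of this search is shown below. The helper name `most_similar_items` and the toy vectors are assumptions made for the example. Since the item vectors do not depend on the user, these distances can be computed ahead of time.

```python
import numpy as np

def most_similar_items(V_m, i, top_k=3):
    """Return the indices of the top_k items closest to item i."""
    V_m = np.asarray(V_m, dtype=float)
    sq_dist = np.sum((V_m - V_m[i]) ** 2, axis=1)  # squared distance to item i
    sq_dist[i] = np.inf                            # exclude the item itself
    return np.argsort(sq_dist)[:top_k]

# Toy item vectors, one row per item
V_m = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [0.0, 3.0],
                [0.5, 0.5]])

nearest = most_similar_items(V_m, i=0, top_k=2)
print(nearest)
```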

## Recommending from a large set of items

Two steps:

* Retrieval: compile a list of plausible items by, for example (in a movie recommendation application):
    * For each of the last 10 movies watched by the user, select the 10 most similar movies
    * For the 3 most viewed genres by the user, find the top 10 movies
    * Top 20 movies in the country
* Ranking
    * Rank the list using the learned model
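The two steps can be sketched in plain Python as below. Everything here is hypothetical: the function names, the lookup dictionaries and the scores are placeholders for precomputed similarity lists, genre charts and the learned model. Removing already watched movies from the candidates is an extra assumption, not part of the list above.

```python
def retrieve(last_watched, similar_items, favourite_genres, top_by_genre, top_in_country):
    """Retrieval: collect a set of plausible candidates from cheap lookups."""
    candidates = set()
    for movie in last_watched:
        candidates.update(similar_items.get(movie, []))
    for genre in favourite_genres:
        candidates.update(top_by_genre.get(genre, []))
    candidates.update(top_in_country)
    candidates.difference_update(last_watched)  # assumption: skip watched movies
    return candidates

def rank(candidates, score):
    """Ranking: order the candidates by the predicted rating, best first."""
    return sorted(candidates, key=score, reverse=True)

# Toy lookups (invented values)
similar_items = {"A": ["B", "C"]}
top_by_genre = {"Drama": ["C", "D"]}
top_in_country = ["A", "E"]
predicted = {"B": 3.1, "C": 4.5, "D": 2.0, "E": 3.9}  # stands in for the model

candidates = retrieve(["A"], similar_items, ["Drama"], top_by_genre, top_in_country)
ranked = rank(candidates, score=predicted.get)
print(ranked)
```

Retrieval keeps the candidate list short, so the comparatively expensive model only has to score a small number of items per user.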

## TensorFlow implementation

### Input data format

The TensorFlow implementation requires the input as follows:

- Items: numpy array with one row per rating (combination of item and user), with the features of the item corresponding to that rating
- Users: numpy array with one row per rating (combination of item and user), with the features of the user corresponding to that rating
- Training targets: one-dimensional numpy array with the sequence of ratings, one for each row of the inputs
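As an illustration of this layout, the sketch below builds the three aligned arrays from per-user features, per-item features and a list of ratings. The dictionaries, ids and values are invented for the example.

```python
import numpy as np

# Hypothetical feature tables, keyed by id
user_features = {1: [4.0, 0.0], 2: [1.5, 5.0]}
item_features = {10: [2001.0, 3.9], 20: [1995.0, 4.2]}

# One entry per rating: (userId, itemId, rating)
ratings = [(1, 10, 4.0), (1, 20, 5.0), (2, 10, 2.5)]

# One row per rating; row k of each array refers to the same rating
users = np.array([user_features[u] for u, _, _ in ratings])
items = np.array([item_features[i] for _, i, _ in ratings])
y = np.array([r for _, _, r in ratings])

print(users.shape, items.shape, y.shape)
```

The features of a user or an item are repeated once for every rating they take part in, so all three arrays have the same number of rows.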

### Code example

``` python
import tensorflow as tf

# num_user_features and num_item_features are the number of feature columns
# in the user and item input arrays.

# User neural network
user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32)
])

# Item neural network (same output size as the user network)
item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32)
])

# Input layer for the user features, mapped to the vector vu
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)   # Scale vu to unit length

# Input layer for the item features, mapped to the vector vm
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)   # Scale vm to unit length

# The prediction is the dot product of vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

model = tf.keras.Model([input_user, input_item], output)

# Specify the cost function
cost_fn = tf.keras.losses.MeanSquaredError()
```


In [1]:
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import tensorflow as tf
from tensorflow import keras



In [2]:
pd.set_option("display.precision", 2)

In [3]:

movies_vectors = pd.read_csv(
    './data/MovieLensSimplified/content_item_train.csv', 
    header=None, 
    index_col=0,  # The movieId
    names= ['year', 'ave rating', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Horror', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller'],
    delimiter=',')
users_vectors = pd.read_csv(
    './data/MovieLensSimplified/content_user_train.csv', 
    header=None, 
    index_col=0, # The userId
    names= ['rating count', 'rating ave', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Horror', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller'],
    delimiter=',')
y_df = pd.read_csv('./data/MovieLensSimplified/content_y_train.csv', header=None, delimiter=',')

users_vectors.drop(columns=['rating count', 'rating ave'], inplace=True)

movies = pd.read_csv(
    "./data/MovieLensSimplified/content_movie_list.csv",
    index_col='movieId',
).rename(
    columns={'title': 'title old'}
)

In [4]:
# Re-structure the info about movies

# Extract year
movies[['title', 'year']] = movies['title old'].str.extract(r'(.*)\(([0-9]*)\)$')

# One hot encoding of genres
genres = movies['genres'].str.get_dummies('|')
movies = movies.join(genres)

movies.drop(columns=['title old', 'genres'], inplace=True)


In [5]:
print(movies_vectors.shape)
print(users_vectors.shape)

(50884, 16)
(50884, 14)


In [6]:
# Parameters
num_user_features = users_vectors.shape[1] 
num_movie_features = movies_vectors.shape[1]

In [7]:
# Scaling 

movies_scaler = StandardScaler()
movies_scaled = pd.DataFrame(
    movies_scaler.fit_transform(movies_vectors),
    index=movies_vectors.index,
    columns=movies_vectors.columns
)

users_scaler = StandardScaler()
users_scaled = pd.DataFrame(
    users_scaler.fit_transform(users_vectors),
    index=users_vectors.index,
    columns=users_vectors.columns
)


y_scaler = MinMaxScaler((-1,1))
y_scaled = pd.DataFrame(
    y_scaler.fit_transform(y_df.to_numpy().reshape(-1,1)),
    index=y_df.index,
    columns=['rating']
)
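A likely reason for scaling the ratings to the range [-1, 1] is that the model output is the dot product of two unit-length vectors, which always lies in that range. The sketch below reproduces the min-max mapping and its inverse by hand in NumPy; the helper names and the sample ratings are assumptions made for illustration, and the notebook itself relies on `MinMaxScaler`.

```python
import numpy as np

def scale_to_range(values, low=-1.0, high=1.0):
    """Map values linearly so that min -> low and max -> high."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    return (values - v_min) / (v_max - v_min) * (high - low) + low

def unscale_from_range(scaled, v_min, v_max, low=-1.0, high=1.0):
    """Inverse mapping, back to the original rating scale."""
    scaled = np.asarray(scaled, dtype=float)
    return (scaled - low) / (high - low) * (v_max - v_min) + v_min

ratings = np.array([0.5, 2.75, 5.0])   # sample ratings on a 0.5 to 5 scale
scaled = scale_to_range(ratings)
restored = unscale_from_range(scaled, ratings.min(), ratings.max())
print(scaled, restored)
```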


In [8]:
# Split train, test

movies_train, movies_test = train_test_split(movies_scaled, train_size=0.80, shuffle=True, random_state=1)
users_train, users_test = train_test_split(users_scaled, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test = train_test_split(y_scaled, train_size=0.80, shuffle=True, random_state=1)

print(movies_train.shape)
print(users_train.shape)
print(y_train.shape)

(40707, 16)
(40707, 14)
(40707, 1)


In [9]:
# Define model

num_outputs = 32
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs)
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs)
])

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_movie_features))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()




Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 14)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 16)]         0           []                               
                                                                                                  
 sequential (Sequential)        (None, 32)           40864       ['input_1[0][0]']                
                                                                                                  
 sequential_1 (Sequential)      (None, 32)           41376       ['input_2[0][0]']                
                                                                                              

In [10]:
tf.random.set_seed(1)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss=tf.keras.losses.MeanSquaredError()
)

In [11]:
tf.random.set_seed(1)
model.fit([users_train, movies_train], y_train, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f0bfa7b35e0>

In [12]:
model.evaluate([users_test, movies_test], y_test)



0.08724342286586761

In [13]:
# Calculate accuracy on train set
# (share of predictions within one star of the actual rating)

y_train_hat = model.predict([users_train, movies_train])
ratings_train_hat = pd.DataFrame(
    y_scaler.inverse_transform(y_train_hat),
    index=y_train.index,
    columns=['rating']
)

ratings_train = pd.DataFrame(
    y_scaler.inverse_transform(y_train),
    index=y_train.index,
    columns=['rating']
)

accuracy_train = np.sum(np.abs(ratings_train['rating'] - ratings_train_hat['rating']) <= 1)/ratings_train.shape[0]

print(f"Accuracy train: {accuracy_train:0.2f}")

Accuracy train: 0.90


In [14]:
# Calculate accuracy on test set

y_test_hat = model.predict([users_test, movies_test])
ratings_test_hat = pd.DataFrame(
    y_scaler.inverse_transform(y_test_hat),
    index=y_test.index,
    columns=['rating']
)

ratings_test = pd.DataFrame(
    y_scaler.inverse_transform(y_test),
    index=y_test.index,
    columns=['rating']
)

accuracy_test = np.sum(np.abs(ratings_test['rating'] - ratings_test_hat['rating']) <= 1)/ratings_test.shape[0]

print(f"Accuracy test: {accuracy_test:0.2f}")

Accuracy test: 0.89
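The accuracy reported here counts a prediction as correct when it is within one star of the actual rating. A small standalone sketch of that metric follows; the function name `within_one_star` and the sample values are hypothetical.

```python
import numpy as np

def within_one_star(actual, predicted, tolerance=1.0):
    """Share of predictions whose absolute error is at most `tolerance`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) <= tolerance))

actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 4.5, 4.0, 1.2]

accuracy = within_one_star(actual, predicted)
print(f"Accuracy: {accuracy:0.2f}")
```

This is a looser measure than the mean squared error used for training, since any error of up to one star counts as a hit.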


In [15]:
# Recommendations for a user
# The user likes adventure and fantasy movies

user1_features = np.array([[0,5,0,0,0,0,0,0,5,0,0,0,0,0]])

# One row of features per distinct movie
movies1_vectors = movies_vectors.drop_duplicates()

# To predict a rating for every movie, repeat the user's features once per movie
# so that both inputs have the same number of rows
user1_vectors = pd.DataFrame(np.repeat(user1_features, movies1_vectors.shape[0], axis=0))

user1_vectors_s = users_scaler.transform(user1_vectors)
movies1_vector_s = movies_scaler.transform(movies1_vectors)



In [16]:

y_ps = model.predict([user1_vectors_s, movies1_vector_s])



In [17]:

y_p = pd.DataFrame(
    y_scaler.inverse_transform(y_ps),
    columns=['rating'],
    index = movies1_vectors.index
)

y_p = y_p.join(movies)

In [18]:
y_p[['title', 'rating']].sort_values(by='rating', ascending=False).head()

| movieId | title | rating |
|---|---|---|
| 81834 | Harry Potter and the Deathly Hallows: Part 1 | 4.09 |
| 108932 | The Lego Movie | 4.07 |
| 122926 | Untitled Spider-Man Reboot | 4.01 |
| 8368 | Harry Potter and the Prisoner of Azkaban | 4.0 |
| 6283 | Cowboy Bebop: The Movie (Cowboy Bebop: Tengoku... | 3.99 |
