<a href="https://colab.research.google.com/github/dzxzlyp/Data-Mining/blob/main/Projet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data
- movie_ratings_500_id.pkl contains the interactions between users and movies
- movie_metadata.pkl contains detailed information about movies, e.g. genres, actors and directors of the movies.

# Goal

- Construct your own recommender systems
- Compare the performances of at least one of the baselines



# Baselines

## User-Based Collaborative Filtering
This approach predicts $\hat{r}_{(u,i)}$ by leveraging the ratings given to $i$ by $u$'s similar users. Formally, it is written as:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{v \in \mathcal{N}_i(u)}sim_{(u,v)}r_{vi}}{\sum\limits_{v \in \mathbf{N}_i(u)}|sim_{(u,v)}|}
\end{equation}
where $sim_{(u,v)}$ is the similarity between user $u$ and $v$. Usually, $sim_{(u,v)}$ can be computed by Pearson Correlation or Cosine Similarity.

## Item-Based Collaborative Filtering
This approach exploits the ratings given to similar items by the target user. The idea is formalized as follows:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{j \in \mathcal{N}_u(i)}sim_{(i,j)}r_{ui}}{\sum\limits_{j \in \mathbf{N}_u(i)}|sim_{(i,j)}|}
\end{equation}
where $sim_{(i,j)}$ is the similarity between item $i$ and $j$. Usually, $sim_{(i,j)}$ can be computed by Pearson Correlation or Cosine Similarity.

## Vanilla MF (You may use the package Surprise if you do not want to write the training function by your self)
Vanilla MF is the inner product of vectors that represent users and items. Each user is represented by a vector $\textbf{p}_u \in \mathbb{R}^d$, each item is represented by a vector $\textbf{q}_i \in \mathbb{R}^d$, and $\hat{r}_{(u,i)}$ is computed by the inner product of $\textbf{p}_u $ and $\textbf{q}_i$. The core idea of Vanilla MF is depicted in the followng figure and follows the idea of SVD as we have seen during the TD.

![picture](https://drive.google.com/uc?export=view&id=1EAG31Qw9Ti6hB7VqdONUlijWd4rXVobC)

\begin{equation}
\hat{r}_{(u,i)} = \textbf{p}_u{\textbf{q}_i}^T
\end{equation}

## Some variants of SVD



-  SVD with bias: $\hat{r_{ui}} = \mu + b_u + b_i + {q_i}^Tp_u$
- SVD ++: $\hat{r_{ui}} = \mu + b_u + b_i + {q_i}^T(p_u + |I_u|^{\frac{-1}{2}}\sum\limits_{j \in I_u}y_j)$

## Factorization machine (FM)

FM takes into account user-item interactions and other features, such as users' contexts and items' attributes. It captures the second-order interactions of the vectors representing these features , thereby enriching FM's expressiveness. However, interactions involving less relevant features may introduce noise, as all interactions share the same weight. e.g. You may use FM to consider the features of items.

\begin{equation}
\hat{y}_{FM}(\textbf{X}) = w_0 + \sum\limits_{j =1}^nw_jx_j + \sum\limits_{j=1}^n\sum\limits_{k=j+1}^n\textbf{v}_j^T\textbf{v}_kx_jx_k
\end{equation}

where $\textbf{X} \in \mathbb{R}^n$ is the feature vector, $n$ denotes the number of features, $w_0$ is the global bias, $w_j$ is the bias of the $j$-th feature and $\textbf{v}_j^T\textbf{v}_k$ denotes the bias of interaction between $j$-th feature and $k$-th feature, $\textbf{v}_j \in \mathbb{R}^d$ is the vector representing $j$-th feature.

## MLP

You may also represent users and items by vectors and them feed them into a MLP to make prediction.

## Metrics

- \begin{equation}
RMSE = \sqrt{\frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{(\hat{r}_{(u,i)}-r_{ui})}^2}
\end{equation}

- \begin{equation}
MAE = \frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{|\hat{r}_{(u,i)}-r_{ui}|}
\end{equation}
-  Bonnus: you may also consider NDCG and HR under the top-k setting


In [2]:
%pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/772.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.4/772.0 kB[0m [31m1.6 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m768.0/772.0 kB[0m [31m11.4 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m772.0/772.0 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163702 sha256=e124e74311a70808071cc970a600a6188193d48a7741c91ed39f4e932b05da58
  Stored in directory: /root/.cache/pip/wheels/

In [3]:
# Import the data
import numpy as np
import random
from scipy.stats import pearsonr
from math import sqrt
import pandas as pd
import surprise as sp
import surprise.model_selection as spms
import sklearn.model_selection as skms
from sklearn.metrics import mean_squared_error, mean_absolute_error

meta = pd.read_pickle('movie_metadata.pkl') ##Dictionnary
rating = pd.read_pickle('movie_ratings_500_id.pkl') ##Dictionnary

# **User-based**

In [None]:
## ## Separate train and test
partition = 0.8
testSet = {}
trainSet = {}

for movieId,ratings in rating.items():
  for record in ratings :
    if random.random() < partition:
      if movieId not in trainSet:
        trainSet[movieId] = [record]
      else:
        trainSet[movieId].append(record)
    else:
      if movieId not in testSet:
        testSet[movieId] = [record]
      else:
        testSet[movieId].append(record)


User-movie matrix

In [None]:
def build_user_movie_matrix(dataset):
  users = set()
  movies = set()

  for movie, ratings in dataset.items():
    for rating in ratings:
      user_id = rating['user_id']
      users.add(user_id)
      movies.add(movie)
  # Id of users
  user_list = sorted(list(users))
  # Id of movies
  movie_list = sorted(list(movies))

  matrix = np.zeros((len(user_list),len(movie_list)))
  user_map = {user:index for index, user in enumerate(user_list)}
  movie_map = {movie:index for index, movie in enumerate(movie_list)}

  for movie, ratings in dataset.items():
    movie_index = movie_map[movie]
    for rating in ratings:
      user_id = rating['user_id']
      user_index = user_map[user_id]
      score = float(rating['user_rating'])
      # # Verify that the min of rating is not 0
      # if score < 1:
      #   print('!')
      matrix[user_index,movie_index] = score

  return matrix, user_list, movie_list

Pearson

In [None]:
def pearson_corr(line_u,line_v):
  commun = np.logical_and(line_u != 0, line_v != 0)
  # The PC is meaningful if and only if u and v have >=2 movies in commun
  if np.sum(commun) < 2:
    return 0

  # Idem if an user gives always a same score
  if np.std(line_u[commun]) == 0 or np.std(line_v[commun]) == 0:
    return 0

  corr, _ = pearsonr(line_u[commun], line_v[commun])
  return corr

def corr_u_vs(u,matrix):
  nb_user = matrix.shape[0]
  matrix_pc = np.zeros((1, nb_user))
  line_u = matrix[u,:]

  for i in range(nb_user):
    if i == u:
      matrix_pc[0,i] = 1
    else:
      line_v = matrix[i,:]
      pc = pearson_corr(line_u, line_v)
      matrix_pc[0,i] = pc

  return matrix_pc

Predictions

In [None]:
# Predict rating of each movie for u
def predict_u_rating(threshold, matrix, matrix_pc, u):
  unrated_movie =  np.where(matrix[u, :] == 0)[0] # Movies not rated by u
  rated_movie = np.where(matrix[u, :] != 0)[0] # Movies rated by u

  nb_movies = matrix.shape[1]

  # Find the most similar users, predict the rating of u
  # We care only about users who have a 0.7 < PC < 1 with the user u
  filtered_user = np.where((matrix_pc[0, :] > threshold) & (matrix_pc[0, :] < 1))[0]

  prediction_list = []

  for movie in range(nb_movies):
    left = 0
    right = 0
    for v in filtered_user:
      if matrix[v,movie] != 0:
        pc = matrix_pc[0,v]
        left += pc * matrix[v, movie]
        right += np.abs(pc)

    if right != 0:
      prediction = left/right
      prediction_list.append((movie, prediction))

  return unrated_movie, rated_movie, prediction_list

# Keep only unrated movies
def recommend_movies(unrated_movie, prediction_list, nbRmd):
  filtered_movies_pred = [pred for pred in prediction_list if pred[0] in unrated_movie]
  filtered_movies_pred.sort(key = lambda x:x[1], reverse = True)

  if len(filtered_movies_pred) >= nbRmd:
    recommendation = [item[0] for item  in filtered_movies_pred[:nbRmd]]
    return recommendation
  else:
    return filtered_movies_pred


# Keep only rated movies
def predict_rated_movies(rated_movie, prediction_list):
  filtered_rated_pred = [pred for pred in prediction_list if pred[0] in rated_movie]
  return filtered_rated_pred

# Find the user rated list
def get_rated_list(matrix, u, filtered_rated_pred):
  rated_list = []

  for pred in filtered_rated_pred:
    index = pred[0]
    rated_list.append((index, matrix[u][index]))
  return rated_list

Movie index -> id -> info

In [None]:
def movie_index_to_id(filtered_movies_pred, movies):
  return [movies[index] for index in filtered_movies_pred]

def movie_id_to_info(movie_ids, meta):
  return [meta[movie_id] for movie_id in movie_ids]

User-based function

In [None]:
# u : id of a user
# dataset : the movie - ratings dataset, same structure as the movie_ratings_500_id
# threshold : min of PC of two users
# nbRecommend : number of movies that you want to recommend to the user
def user_based(u, dataset, threshold, nbRecommend):
  matrix, user_list, movie_list = build_user_movie_matrix(dataset)
  # Find the index of the user
  u_index = [index for index, id in enumerate(user_list) if id == u][0]

  matrix_pc = corr_u_vs(u_index, matrix)
  unrated_movie, _, prediction_list = predict_u_rating(threshold, matrix, matrix_pc, u_index)
  filtered_movies_pred = recommend_movies(unrated_movie, prediction_list, nbRecommend)
  recommend_id = movie_index_to_id(filtered_movies_pred, movie_list)
  recommendation = movie_id_to_info(recommend_id, meta)
  return recommendation

User-based results

In [None]:
user_based('1005851', trainSet, 0.7, 10)

[{'director': "Thaddeus O'Sullivan",
  'genre': ['Comedy', 'Crime'],
  'actors': ['Kevin Spacey',
   'Linda Fiorentino',
   'Peter Mullan',
   'Stephen Dillane',
   'Gerard McSorley',
   'Colin Farrell'],
  'title': 'Ordinary Decent Criminal'},
 {'director': 'Dominic Anciano Ray Burdis',
  'genre': ['Drama', 'Thriller'],
  'actors': ['Ray Winstone', 'Jude Law', 'Sadie Frost'],
  'title': 'Final Cut'},
 {'director': 'John Lasseter Ash Brannon',
  'genre': ['Animation', 'Adventure', 'Comedy'],
  'actors': ['Tom Hanks', 'Tim Allen', 'Joan Cusack', 'Annie Potts'],
  'title': 'Toy Story 2'},
 {'director': 'Anthony Drazan',
  'genre': ['Comedy', 'Drama'],
  'actors': ['Sean Penn',
   'Kevin Spacey',
   'Chazz Palminteri',
   'Robin Wright',
   'Anna Paquin',
   'Meg Ryan'],
  'title': 'Hurlyburly'},
 {'director': 'Nick Cassavetes',
  'genre': ['Drama', 'Romance'],
  'actors': ['Gena Rowlands',
   'James Garner',
   'Rachel McAdams',
   'Ryan Gosling'],
  'title': 'The Notebook'},
 {'director

The number of users in the trainSet is really large (around 40000) which takes a long time to run all the users. We pick randomly 10 users and for one threshold, repeat 10 times the function. Tried threshold = 0.9/0.8/0.7/0.6, 0.7 has the best result

In [None]:
matrix, user_list, movie_list = build_user_movie_matrix(trainSet)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

def rmse_mae(u, threshold, matrix):
  cpt_is_null = 0

  matrix_pc = corr_u_vs(u,matrix)

  _, rated_movie, prediction_list = predict_u_rating(threshold, matrix, matrix_pc, u)
  index_prediction = predict_rated_movies(rated_movie, prediction_list)

  if len(index_prediction) == 0:
    rmse = 0
    mae = 0
    cpt_is_null += 1

  else:
    index_reality = get_rated_list(matrix, u, index_prediction)

    reality = [rating[1] for rating in index_reality]
    prediction = [prediction[1] for prediction in index_prediction]

    rmse = np.sqrt(mean_squared_error(reality, prediction))
    mae = mean_absolute_error(reality, prediction)

  return rmse, mae, cpt_is_null

In [None]:
rmse_total = 0
mae_total = 0
cpt = 0

nb_sample = random.sample(range(len(user_list)), 100)

for i in range(len(nb_sample)):
  user = nb_sample[i]
  rmse, mae, cpt = rmse_mae(user, 0.7, matrix)
  rmse_total += rmse
  mae_total += mae

In [None]:
avg_rmse = rmse_total/(len(nb_sample) - cpt)
avg_mae = mae_total/(len(nb_sample) - cpt)

print(avg_rmse, avg_mae)

0.1781692559612347 0.1536907551295547


# **Item-based**

In [7]:
##Item-based
def create_ratings_df(ratings_data):
    data = {'user_id': [], 'movie_id': [], 'rating': []}
    for movie_id, ratings in ratings_data.items():
        for info in ratings:
            data['user_id'].append(info['user_id'])
            data['movie_id'].append(movie_id)
            data['rating'].append(int(info['user_rating']))

    return pd.DataFrame(data)

#movies info
def get_movie_details(movie_id):
    return metadata.get(movie_id, {})

ratings_df = create_ratings_df(rating)
#print(ratings_df)

train_data, test_data = skms.train_test_split(ratings_df, test_size=0.2)

# Training data to dictionary format
train_ratings_data = train_data.groupby('user_id').apply(lambda x: dict(zip(x['movie_id'], x['rating']))).to_dict()

# Test data to dictionary format
test_ratings_data = test_data.groupby('user_id').apply(lambda x: dict(zip(x['movie_id'], x['rating']))).to_dict()

def similarityItemCF(data, a=0.3):
    N = {}  # Total number of users who rated item i
    C = {}  # Number of users who rated both items i and j

    # Iterate over users and their ratings
    for user, items in data.items():
        for i, score in items.items():
            N.setdefault(i, 0)
            N[i] += 1
            C.setdefault(i, {})
            for j, scores in items.items():
                if j != i:
                    C[i].setdefault(j, 0)
                    C[i][j] += 1

    # Calculate item-item similarity matrix W
    W = {}
    for i, item in C.items():
        W.setdefault(i, {})
        for j, item2 in item.items():
            W[i].setdefault(j, 0)
            W[i][j] = C[i][j] / sqrt(N[i] * N[j])

    for i,item in W.items():
      for j, score in item.items():
        W[i][j] = W[i][j] * (1 - a) + a

    return W

def recommand_movies(ratings, movie_metadata, item_similarity, user_id, N, k):
    ranked_movies = {}

    # Iterate over movies the user has rated
    for movie_id, user_rating in ratings[user_id].items():
        # Iterate over the top k similar movies to the rated movie
        for similar_movie_id, similarity_score in sorted(item_similarity[movie_id].items(), key=lambda x: x[1], reverse=True)[:k]:
            # Check if the user has not rated the similar movie
            if similar_movie_id not in ratings[user_id]:
                ranked_movies.setdefault(similar_movie_id, {'score': 0, 'details': {}})
                ranked_movies[similar_movie_id]['score'] += user_rating * similarity_score
                ranked_movies[similar_movie_id]['details'] = movie_metadata.get(similar_movie_id, {})

    # Return the top N recommendations with movies details
    recommendations = sorted(ranked_movies.items(), key=lambda x: x[1]['score'], reverse=True)[:N]
    return recommendations

In [5]:
##Item-based results
uid = '1380819'
N = 10
item_similar = similarityItemCF(train_ratings_data)
recommendations = recommand_movies(test_ratings_data, meta, item_similar, uid, N, k=3)
print(f"Top {N} Recommendations for User {uid}:")
for movie_id, details in recommendations:
    print(f"Movie ID: {movie_id}")
    print(f"Title: {details['details']['title']}")
    print(f"Genre: {details['details']['genre']}")
    print("---")

Top 10 Recommendations for User 1380819:
Movie ID: tt0261392
Title: Jay and Silent Bob Strike Back
Genre: ['Comedy']
---
Movie ID: tt0286106
Title: Signs
Genre: ['Drama', 'Mystery', 'Sci-Fi']
---
Movie ID: tt0264464
Title: Catch Me If You Can
Genre: ['Biography', 'Crime', 'Drama']
---
Movie ID: tt0327056
Title: Mystic River
Genre: ['Crime', 'Drama', 'Mystery']
---
Movie ID: tt0311113
Title: Master and Commander: The Far Side of the World
Genre: ['Action', 'Adventure', 'Drama']
---
Movie ID: tt0259711
Title: Vanilla Sky
Genre: ['Fantasy', 'Mystery', 'Romance']
---
Movie ID: tt0208092
Title: Snatch
Genre: ['Comedy', 'Crime']
---
Movie ID: tt0183505
Title: Me, Myself & Irene
Genre: ['Comedy']
---
Movie ID: tt0177971
Title: The Perfect Storm
Genre: ['Action', 'Adventure', 'Drama']
---
Movie ID: tt0257044
Title: Road to Perdition
Genre: ['Crime', 'Drama', 'Thriller']
---


In [8]:
true_ratings = []
predicted_ratings = []

for user_id, movies in test_ratings_data.items():
    for movie_id, true_rating in movies.items():
      # Get the predicted rating from recommendations
      predicted_rating = next((reco[1]['score'] for reco in recommendations if reco[0] == movie_id), 0)
      true_ratings.append(true_rating)
      predicted_ratings.append(predicted_rating)

# Calculate RMSE and MAE
rmse = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
mae = mean_absolute_error(true_ratings, predicted_ratings)

print(f"RMSE: {rmse}")
print(f"MAE: {mae}")

RMSE: 3.449661981555301
MAE: 3.2610468548585843


# **Vanilla MF**

In [None]:
## Vanilla MF
data = []
movies = {}

for movie, metadata in meta.items():
  title = metadata['title']
  movies[movie] = title

for movie, ratings in rating.items():
    for rat in ratings:
        movie_id = movie
        user_id = rat['user_id']
        user_rating = rat['user_rating']
        data.append((user_id, movie_id, user_rating))

# Define the Reader with your rating scale and other options
reader = sp.Reader(rating_scale=(1, 5))
custom_datasets = pd.DataFrame(data, columns=['user_id', 'movie_id', 'user_rating'])

custom_dataset = sp.Dataset.load_from_df(custom_datasets, reader)

# # # Step 3: Create a Trainset
trainset, testset = spms.train_test_split(custom_dataset, test_size=0.2, random_state=42, shuffle=True)

algo = sp.SVD()
algo.fit(trainset)
pred = algo.test(testset)
sp.accuracy.rmse(pred)
sp.accuracy.mae(pred)

RMSE: 0.9518
MAE:  0.7518


0.7518028762832114

In [None]:
##Vanilla MF results
user_id = '185150'  # Replace with the user ID for whom you want to make recommendations
user_ratings = custom_dataset.raw_ratings
user_ratings = [(uid, iid, r) for (uid, iid, r, _) in user_ratings if uid == user_id]
user_ratings = sorted(user_ratings, key=lambda x: x[2], reverse=True)
already_rated = {iid: r for (uid, iid, r) in user_ratings}

# # Get the items that the user hasn't rated yet
# unrated_items = [item for item in trainset.all_items() if item not in already_rated]
unrated_items = [item for item in movies if item not in already_rated]

# # Make predictions for the unrated items
predictions = [algo.predict(user_id, item) for item in unrated_items]

# Get the top N recommendations
top_n = [(iid, est) for (uid, iid, true_r, est, _) in predictions]
top_n = sorted(top_n, key=lambda x: x[1], reverse=True)[:10]

print(f"Top 10 Recommendations for User {user_id}:")
for iid, est in top_n:
    print(f"{movies[iid]}: Estimated rating - {est}")


Top 10 Recommendations for User 185150:
Saving Private Ryan: Estimated rating - 4.215646182155594
Gladiator: Estimated rating - 3.962867269805805
Toy Story 2: Estimated rating - 3.9470806961577036
Remember the Titans: Estimated rating - 3.9253485767454577
Fight Club: Estimated rating - 3.9190645930473766
Elizabeth: Estimated rating - 3.8321405639439696
October Sky: Estimated rating - 3.7724821850676817
The Green Mile: Estimated rating - 3.7684793352017807
We Were Soldiers: Estimated rating - 3.755327231481414
O Brother, Where Art Thou?: Estimated rating - 3.692665514322229


# Requirements
- Minimizing the RMSE and MAE
- Try to compare different methods that you have adopted and interpret the results that you have obtained
- Construct a recommender system that returns the top 10 movies that the users have not watched in the past

# For the same user 185150
# User-based :

In [None]:
# For the same user 185150
# User-based
recomend_user_based = user_based('185150', rating, 0.7, 10)

titles_user_based = [movie['title'] for movie in recomend_user_based]

titles_user_based

['Toy Story 2',
 'Kingdom Come',
 'All About My Mother',
 'My Life So Far',
 'Pirates of the Caribbean: The Curse of the Black Pearl',
 'Almost Famous',
 'Erin Brockovich',
 'Harry Potter and the Chamber of Secrets',
 'Along Came a Spider',
 'The Patriot']

# Item-based :

In [None]:
ratings_data = ratings_df.groupby('user_id').apply(lambda x: dict(zip(x['movie_id'], x['rating']))).to_dict()

In [None]:
uid = '185150'
N = 10
item_similar = similarityItemCF(ratings_data)
recommendations = recommand_movies(ratings_data, meta, item_similar, uid, N, k=3)
print(f"Top {N} Recommendations for User {uid}:")
for movie_id, details in recommendations:
    print(f"Title: {details['details']['title']}")

Top 10 Recommendations for User 185150:
Title: A Beautiful Mind
Title: Collateral
Title: The Bourne Identity
Title: Catch Me If You Can
Title: The Patriot
Title: Mona Lisa Smile
Title: The Aviator
Title: Training Day
Title: Harry Potter and the Chamber of Secrets
Title: The Last Samurai


The same recommendation for user-based and item-based is Harry Potter and the Chamber of Secrets.
The same recommendation for user-based and Vanilla MF is Toy Story 2.
There is no same recommendation for item-based and Vanilla MF.



The recommendations from the three algorithms are significantly different, as the user similarity method may prefer recommending items similar to the user's historical behavior, while the movie similarity method may lean towards suggesting movies similar to those previously liked by the user. Matrix factorization, on the other hand, takes into account the latent features of users and movies, potentially resulting in diverse recommendations in certain scenarios.
In terms of minimizing rmse and mae, the user-based algorithm seems to be the best one