<div style="background-color: #7ed957; padding: 20px; text-align: center; font-size: 22px;">
    <strong>Implementation of the Recommendation System</strong><br>
    <em>(Part I)</em>
</div>

<h1 align="left"><strong style="color: #7ed957;">1. Importing modules</strong></h1>

In [46]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from pprint import pprint
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import DistanceMetric

I loaded here the libraries that we might potentially need during the implementation of our models.

<h1 align="lefth"><strong style="color: #7ed957;">2. Data import</strong></h1>

I will go quickly over this part since we already covered it in the first section.

In [47]:
movies = pd.read_excel('movies.xlsx', header=0)
ratings = pd.read_excel('ratings.xlsx', header=0)
users = pd.read_excel('users.xlsx', header=0)

In [48]:
import re

movies["Genres"] = movies["Genres"].astype(str)
movies["Genres"] = movies["Genres"].apply(
    lambda x: [g.strip() for g in re.split(r"[|;]", x) if g.strip() != ""]
)
all_genres = sorted(
    set(
        g
        for genres in movies["Genres"]
        for g in genres
    )
)
def normalize_genres(movies_df):
    movies_df = movies_df.copy()
    movies_df["Genres"] = movies_df["Genres"].apply(
        lambda x: x.split("|") if isinstance(x, str) else x
    )
    return movies_df

In [49]:
def extract_year(Title):
    if isinstance(Title, str):
        matches = re.findall(r'\((\d{4})\)', Title)
        if matches:
            return int(matches[-1])
    return None
movies['Year'] = movies['Title'].apply(extract_year)
def rearrange_title(Title):
    if not isinstance(Title, str):
        # If the input is not a string, return it as is
        return Title
    
    pattern = r'^(.*?)(,\s*(The|A))?\s*\((\d{4})\)$'
    match = re.match(pattern, Title, re.IGNORECASE)
    
    if match:
        determinant = match.group(3) or ""  # Handle the case where there is no "The" or "A"
        year = match.group(4)
        return f'{determinant} {match.group(1)} ({year})'
    return Title

movies['Title'] = movies['Title'].apply(rearrange_title)
ratings['MovieID'] = ratings['MovieID'].astype(int)
movies['MovieID'] = movies['MovieID'].astype(int)
movies['Title'] = movies['Title'].str.strip()
movies.loc[movies['MovieID'] == 2382, 'Title'] = "Police Academy 5: Assignment Miami Beach (1988)"
movies.loc[movies['MovieID'] == 2382, 'Genres'] = "Comedy"
movies.loc[movies['MovieID'] == 2382, 'Year'] = 1988

<h1 align="left"><strong style="color: #7ed957;">3.  The simple recommendations</strong></h1>

A knowledge-based movie recommendation system would use specific information about movies and users to suggest suitable films. For example, if a user has positively rated several movies from a particular genre, the system could recommend other films from the same genre. Similarly, if the user has indicated specific preferences for an actor or actress, the system could suggest movies featuring that person. In summary, this type of system leverages detailed information about movies and user preferences to generate personalized recommendations.

## 1.1 Recommendation for a Chosen Genre

#### <span style="color: #7ed957;">1.1.a Creating the Combined Dataset</span>

We will merge the *ratings* table with the *movies* table in order to obtain a single dataset containing all the information we need. 

In [50]:
combined_table = pd.merge(ratings, movies, on='MovieID', how='inner')
combined_table.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Year
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),[Drama],1975.0
1,1,661,3,978302109,James and the Giant Peach (1996),"[Animation, Children's, Musical]",1996.0
2,1,914,3,978301968,My Fair Lady (1964),"[Musical, Romance]",1964.0
3,1,3408,4,978300275,Erin Brockovich (2000),[Drama],2000.0
4,1,2355,5,978824291,A Bug's Life (1998),"[Animation, Children's, Comedy]",1998.0


#### <span style="color: #7ed957;">1.1.b The most appreciated films</span>

Now let’s calculate the highest-rated movies so we can recommend the best-rated ones.

In [51]:
# Compute the median rating for each movie
average_rating = combined_table.groupby('Title')['Rating'].median()

# Sort movies from highest to lowest rating
top_average_rating = average_rating.sort_values(ascending=False)
print("Top-rated movies:")
print(top_average_rating.head())

Top-rated movies:
Title
Schindler's List (1993)                            5.0
Bicycle Thief, The (Ladri di biciclette) (1948)    5.0
The Maltese Falcon (1941)                          5.0
Song of Freedom (1936)                             5.0
The Gate of Heavenly Peace (1995)                  5.0
Name: Rating, dtype: float64


Next, let’s calculate the most popular movies so we can recommend the most popular ones.

In [52]:
# Compute how many times each movie was watched
most_viewed_movies = combined_table.groupby('Title')['UserID'].count()

# Sort movies from most viewed to least viewed
top_most_viewed = most_viewed_movies.sort_values(ascending=False)

print("Most viewed movies:")
print(top_most_viewed.head())

Most viewed movies:
Title
American Beauty (1999)                                  3428
Star Wars Episode IV - A New Hope (1977)                2991
Star Wars Episode V - The Empire Strikes Back (1980)    2990
Star Wars Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                    2672
Name: UserID, dtype: int64


#### <span style="color: #7ed957;">1.1.c New combined table</span>

In [53]:
popular_movies = pd.concat([average_rating, most_viewed_movies], axis=1, join='inner')
# Display the recommended movies
print(popular_movies.head())

                               Rating  UserID
Title                                        
$1,000,000 Duck (1971)            3.0      37
'Night Mother (1986)              3.0      70
'Til There Was You (1997)         3.0      52
...And Justice for All (1979)     4.0     199
1-900 (1994)                      2.5       2


#### <span style="color: #7ed957;">1.1.d let’s create the function</span>

In [54]:
def compute_weighted_scores(ratings_df):
    """
    Computes a weighted score per movie based on:
    - the average rating
    - the number of times the movie has been rated (viewed)

    Returns a DataFrame with:
    movieId | mean_rating | n_ratings | weighted_score
    """

    movie_stats = (
        ratings_df
        .groupby("MovieID")
        .agg(
            mean_rating=("Rating", "mean"),
            n_ratings=("Rating", "count")
        )
        .reset_index()
    )

    movie_stats["weighted_score"] = (
        movie_stats["mean_rating"] * np.log1p(movie_stats["n_ratings"])
    )

    return movie_stats

This function filters the movie list to keep only movies that belong to the selected genres.\
It splits multi-genre movies, checks if any genre matches the user’s choice, and returns each matching movie only once.

In [55]:
def filter_movies_by_genres(movie_df, selected_genres):
    movie_df_exploded = movie_df.copy()

    movie_df_exploded["Genres"] = movie_df_exploded["Genres"].apply(
        lambda x: x.split("|") if isinstance(x, str) else x
    )

    movie_df_exploded = movie_df_exploded.explode("Genres")

    movie_df_exploded["Genres"] = movie_df_exploded["Genres"].apply(
        lambda x: x.strip() if isinstance(x, str) else x
    )

    return (
        movie_df_exploded[
            movie_df_exploded["Genres"].isin(selected_genres)
        ]
        .drop_duplicates(subset="MovieID")
    )


This function removes all movies that the user has already watched or rated, so only new, unseen movies are kept for recommendation.

In [56]:
def remove_seen_movies(movie_df, ratings_df, user_id):
    """
    Removes movies already seen by the user.
    """

    seen_movies = ratings_df[
        ratings_df["UserID"] == user_id
    ]["MovieID"].unique()

    return movie_df[
        ~movie_df["MovieID"].isin(seen_movies)
    ]

This function recommends the top 5 movies a user has not seen yet. It asks for the user’s ID, lets them choose preferred genres, and applies either AND or OR filtering. It then computes a weighted score for each movie based on ratings and popularity, removes movies already watched, filters by genre if needed, and finally returns the five best-scored movies.

In [57]:
def recommend_top5_for_user(movies_df, ratings_df):
    """
    Recommends the top 5 movies for a user based on:
    - selected genres (optional, multiple allowed)
    - AND / OR logic for genre filtering
    - movies not yet seen by the user
    - a weighted score combining ratings and popularity
    """

    # User ID input
    while True:
        user_id_input = input("Enter your user ID (required): ").strip()
        if user_id_input == "":
            print("User ID cannot be empty.")
            continue
        try:
            user_id = int(user_id_input)
            break
        except ValueError:
            print("User ID must be a number.")

    # Display available genres
    # Normalize genres first
    movies_df = normalize_genres(movies_df)
    
    # Display available genres
    all_genres = sorted(
        {genre for genres in movies_df["Genres"] for genre in genres}
    )
    
    print("\nAvailable genres:")
    for i, g in enumerate(all_genres, 1):
        print(f"{i}. {g}")


    genre_input = input(
        "\nSelect one or more genres (e.g. 1,3,5) or press Enter to skip: "
    ).strip()

    selected_genres = []

    if genre_input != "":
        try:
            selected_genres = [
                all_genres[int(i) - 1]
                for i in genre_input.split(",")
            ]
        except (ValueError, IndexError):
            print("Invalid genre selection. No genre filter will be applied.")
            selected_genres = []

    genre_mode = None
    if selected_genres:
        genre_mode = input(
            "Type 'AND' to require all genres or 'OR' to require at least one (default OR): "
        ).strip().upper()
        if genre_mode not in ["AND", "OR"]:
            genre_mode = "OR"

    # Compute weighted scores
    movie_weighted = compute_weighted_scores(ratings_df)

    movie_table = pd.merge(
        movies_df,
        movie_weighted,
        on="MovieID",
        how="inner"
    )

    # Remove movies already seen
    candidate_movies = remove_seen_movies(
        movie_table,
        ratings_df,
        user_id
    )

    # Apply genre filter
    if selected_genres:
        if genre_mode == "AND":
            candidate_movies = candidate_movies[
                candidate_movies["Genres"].apply(
                    lambda g: set(selected_genres).issubset(set(g))
                )
            ]
        else:  # OR
            candidate_movies = candidate_movies[
                candidate_movies["Genres"].apply(
                    lambda g: any(s in g for s in selected_genres)
                )
            ]

    # Select top 5 movies
    top5 = (
        candidate_movies
        .sort_values("weighted_score", ascending=False)
        .head(5)
    )

    return top5[
        ["MovieID", "Title", "Genres", "mean_rating", "n_ratings", "weighted_score"]
    ]

In [58]:
recommend_top5_for_user(movies, ratings)

Enter your user ID (required):  15



Available genres:
1. Action
2. Adventure
3. Animation
4. Children's
5. Comedy
6. Crime
7. Documentary
8. Drama
9. Fantasy
10. Film-Noir
11. Horror
12. Musical
13. Mystery
14. Romance
15. Sci-Fi
16. Thriller
17. War
18. Western



Select one or more genres (e.g. 1,3,5) or press Enter to skip:  12,11
Type 'AND' to require all genres or 'OR' to require at least one (default OR):  AND


Unnamed: 0,MovieID,Title,Genres,mean_rating,n_ratings,weighted_score
2455,2657,The Rocky Horror Picture Show (1975),"[Comedy, Horror, Musical, Sci-Fi]",3.29116,1233,23.426529
2541,2746,Little Shop of Horrors (1986),"[Comedy, Horror, Musical]",3.332265,936,22.801634


Here are the most recommended movies for user 56, after removing the films they have already seen and selecting the genres they are interested in, based on the highest-rated and most popular movies (wich hold more weight cause it's more important).

<h1 align="left"><strong style="color: #7ed957;">4.  Collaborative Recommendation System Using KNN</strong></h1>

The simple recommendation method we used earlier works, but it is too basic and, in practice, not very effective because it does not take into account user characteristics or movie features to produce more personalized recommendations.\
To achieve better personalization, we will use an item-based collaborative filtering system with KNN, which relies on similarities between items to recommend movies similar to those the user has already enjoyed.

## 4.1 Recommendation for a Chosen Genre

First, let’s combine all the data from the ratings table into a single matrix. This matrix will include the columns ‘UserID’, ‘MovieID’, and ‘Rating’, and it will be used for the calculations that follow.

In [59]:
Matrice_Utilisateur_Films = ratings.pivot(
    index="UserID",
    columns="MovieID",
    values="Rating"
).fillna(0)

Matrice_Utilisateur_Films.head()


MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After some research, I found a way to speed up the computations. Working with such a large matrix can take a lot of time and memory, so I decided to use the <span style="color: #4CAF50;">SciPy</span> library and import <span style="color: #4CAF50;">csr_matrix</span>, which allows me to create a sparse matrix.A sparse matrix is a much more efficient representation because it contains mostly zeros. It uses special data structures that store only the non-zero values, which helps reduce memory usage and improves calculation performance.


In [60]:
Matrice_Utilisateur_Films_Creuse = csr_matrix(Matrice_Utilisateur_Films.values)
Matrice_Utilisateur_Films_Creuse

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1000209 stored elements and shape (6040, 3706)>

Now I have transformed my user–movie matrix into a sparse matrix.

## 4.2 User-based Recommendation

In this section, we will use a KNN model to recommend a movie based on the preferences of users who are similar to the target user.

#### <span style="color: #7ed957;">4.2.a Fit our K-Nearest Neighbors model to our sparse matrix</span>

In [61]:
model = NearestNeighbors(metric="cosine",algorithm="brute")
model.fit(Matrice_Utilisateur_Films_Creuse)

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them. A score of 1 indicates maximum similarity, 0 means no similarity, and –1 represents maximum dissimilarity.\
The KNN “brute force” algorithm checks all possible neighbor combinations without using advanced data structures, ensuring that the closest neighbors are found accurately, although at the cost of higher computation time.

#### <span style="color: #7ed957;">4.2.b Closest Users Function</span>

We will now create a function that will try to find the closest users to the one we choose.

In [62]:
def Utilisateurs_Proches(user, n_closest_users):
    knn_user = np.asarray([Matrice_Utilisateur_Films.values[user - 1]])
    distances, indices = model.kneighbors(knn_user, n_neighbors=n_closest_users + 1)

    print("Top", n_closest_users, "users most similar to User", user, "are:")
    print(" ")

    for i in range(1, len(distances[0])):
        print(i, ". User:", indices[0][i] + 1, "with a distance of", distances[0][i])

    return indices.flatten()[1:] + 1, distances.flatten()[1:]

Our Utilisateurs_Proches function takes as input a user and the desired number of similar users to find. It uses the K-nearest neighbors (KNN) model we previously built to identify and display the users most similar to the input user, along with the distances that separate them.

In [63]:
UserID = 3
Film_vu_utilisateur = ratings[ratings['UserID'] == UserID]['MovieID'].tolist()
Titre_Film_vu_utilisateur = movies[movies['MovieID'].isin(Film_vu_utilisateur)]['Title'].tolist()

print("Some of the movies watched by the user:")
pprint(Titre_Film_vu_utilisateur[:10])

Utilisateur_similaires, distance_list = Utilisateurs_Proches(UserID, 5)

Some of the movies watched by the user:
['Happy Gilmore (1996)',
 'Star Wars Episode IV - A New Hope (1977)',
 'Jurassic Park (1993)',
 'The Three Musketeers (1993)',
 'Dances with Wolves (1990)',
 'The Silence of the Lambs (1991)',
 'Mission Impossible (1996)',
 'Dragonheart (1996)',
 'The Rock (1996)',
 'The Ghost and the Darkness (1996)']
Top 5 users most similar to User 3 are:
 
1 . User: 3000 with a distance of 0.6466231943633549
2 . User: 479 with a distance of 0.6495265347681085
3 . User: 5691 with a distance of 0.6511320944181338
4 . User: 3500 with a distance of 0.662900068768379
5 . User: 1904 with a distance of 0.672531578083694


Now that we have identified the users most similar to the given user, we can recommend movies by taking their preferences into account. To do this, we will identify the films most appreciated by these similar users. In addition, we will give more weight to the movies preferred by the users who are closest to the target user, in order to refine the recommendations based on that user's specific preferences.

#### <span style="color: #7ed957;">4.2.c Recomendation films partie 1</span>

We will normalize the distances to obtain weights. In practice, the closer a user is, the higher their weight will be in the recommendation process.

In [64]:
Weights = distance_list / np.sum(distance_list)
Weights

array([0.19697826, 0.1978627 , 0.19835179, 0.20193662, 0.20487063])

We will retrieve the ratings of all the movies watched by the similar users.

In [65]:
Similar_users_movie_titles = Matrice_Utilisateur_Films.values[Utilisateur_similaires]
Similar_users_movie_titles

array([[4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [5., 3., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

We cannot see it directly here, but the values shown are the ratings given by the similar users. Therefore, we can deduce that any position where the rating is not equal to 0 corresponds to a movie that has already been watched.

In [66]:
print("Number of titles in the database:", len(Matrice_Utilisateur_Films.columns))
print("Number of similar users:", len(Weights))
print("Number of movies watched by all similar users:", np.count_nonzero(Similar_users_movie_titles))

Number of titles in the database: 3706
Number of similar users: 5
Number of movies watched by all similar users: 721


We will now modify the <strong><span style="color: #4CAF50;">Weights</span></strong> matrix so that each user's weight influences the rating they assign to a movie. Then, we will obtain a final list with a score for each film, calculated based on the weights assigned by each user in the <strong><span style="color: #4CAF50;">Weights</span></strong> matrix.

In [67]:
Film_note_pondérer = Weights[:, np.newaxis] + np.zeros(len(Matrice_Utilisateur_Films.columns))
Film_note_pondérer.shape

(5, 3706)

In [68]:
Film_note_pondérer_version2 = Film_note_pondérer * Film_note_pondérer
note_pondérer = Film_note_pondérer_version2.sum(axis=0)
note_pondérer.shape

(3706,)

In [69]:
def Film_recommander(n):
    n = min(len(note_pondérer), n)
    return list(Matrice_Utilisateur_Films.columns[np.argsort(note_pondérer)[::-1][:n]])

In [70]:
print("Movies recommended based on similar users are:")
list(movies[movies['MovieID'].isin(list(Film_recommander(10)))]['Title'].tolist())

Movies recommended based on similar users are:


['Toy Story (1995)',
 'Bootmen (2000)',
 'Digimon The Movie (2000)',
 'Get Carter (2000)',
 'Get Carter (1971)',
 'Meet the Parents (2000)',
 'Requiem for a Dream (2000)',
 'Tigerland (2000)',
 'Two Family House (2000)',
 'The Contender (2000)']

#### <span style="color: #4CAF50;">4.2.d Movie Recommendation – Part 2</span>

Let's build a function that performs all these steps at once.

In [71]:
def film_de_recomandation():
    print("Enter your user ID:")
    ID = int(input())
    print(" ")
    print("How many similar users would you like to consider?")
    Utilisateur_similaire = int(input())
    print(" ")
    print("Enter the number of movies you want to be recommended:")
    nbr_movies = int(input())
    print(" ")

    def Utilisateurs_Proches(utilisateur, nbr_Utilisateur_proches=5):
        knn_utilisateur = np.asarray([Matrice_Utilisateur_Films.values[utilisateur - 1]])
        distances, indices = model.kneighbors(knn_utilisateur, n_neighbors=nbr_Utilisateur_proches + 1)
        print("Top", nbr_Utilisateur_proches, "users most similar to User", utilisateur, "are:")
        for i in range(1, len(distances[0])):
            print(i, ". User:", indices[0][i] + 1, "with a distance of", distances[0][i])
        print("")
        return indices.flatten()[1:] + 1, distances.flatten()[1:]

    def Film_recommander(n=10):
        premier_zero_index = np.where(note_pondérer == 0)[0][-1]
        index_trier = np.argsort(note_pondérer)[::-1]
        index_trier = index_trier[:list(index_trier).index(premier_zero_index)]
        n = min(len(index_trier), n)

        film_visionner = ratings[ratings['UserID'] == ID]['MovieID'].tolist()
        liste_de_film_trier = list(Matrice_Utilisateur_Films.columns[index_trier])

        count = 0
        liste_final = []

        for i in liste_de_film_trier:
            if i not in film_visionner:
                count += 1
                liste_final.append(i)
            if count == n:
                break

        if count == 0:
            print("There are no more movies that the user has not watched and that were also rated by similar users. Increasing the number of similar users may help generate new recommendations.")
        else:
            pprint(list(movies[movies['MovieID'].isin(list(liste_final))]['Title'].tolist()))
        pass

    similar_user_list, distance_list = Utilisateurs_Proches(ID, Utilisateur_similaire)

    weightage_list = distance_list / np.sum(distance_list)
    mov_rtngs_sim_users = Matrice_Utilisateur_Films.values[similar_user_list]
    movies_list = Matrice_Utilisateur_Films.columns

    weightage_list = weightage_list[:, np.newaxis] + np.zeros(len(movies_list))
    new_rating_matrix = weightage_list * mov_rtngs_sim_users
    note_pondérer = new_rating_matrix.sum(axis=0)

    print("Some movies watched by the user:")
    pprint(movies[movies['MovieID'].isin(ratings[ratings['UserID'] == ID]['MovieID'])]['Title'][0:10])

    print("")
    print("Movies recommended based on similar users:")
    Film_recommander(nbr_movies)

# Call the function
film_de_recomandation()

Enter your user ID:


 15


 
How many similar users would you like to consider?


 5


 
Enter the number of movies you want to be recommended:


 10


 
Top 5 users most similar to User 15 are:
1 . User: 4217 with a distance of 0.5188282857866172
2 . User: 746 with a distance of 0.5330772019192221
3 . User: 3625 with a distance of 0.5360025768167026
4 . User: 2559 with a distance of 0.5370558754281303
5 . User: 774 with a distance of 0.5390526064973175

Some movies watched by the user:
5                     Heat (1995)
31          Twelve Monkeys (1995)
46           Seven (Se7en) (1995)
49      The Usual Suspects (1995)
69     From Dusk Till Dawn (1996)
72         Misérables, Les (1995)
102          Happy Gilmore (1996)
108             Braveheart (1995)
139           The Birdcage (1996)
158                  Congo (1995)
Name: Title, dtype: object

Movies recommended based on similar users:
['Apollo 13 (1995)',
 'Like Water for Chocolate (Como agua para chocolate) (1992)',
 'Dances with Wolves (1990)',
 'Grease (1978)',
 'Jerry Maguire (1996)',
 'Titanic (1997)',
 'Robocop (1987)',
 'Ready to Rumble (2000)',
 'U-571 (2000)',
 'X-Men (2

I implemented here a user-based collaborative filtering recommendation system using a machine learning approach with k-Nearest Neighbors (k-NN). First, the user selects an ID, the number of similar users to consider, and the number of movie recommendations to generate. The model computes similarity between users based on their rating vectors in the user–movie matrix, using distance metrics learned by the k-NN model. The closest users are identified as the most similar profiles. Then, we apply a weighted aggregation of ratings, where each similar user’s contribution is scaled by their distance to the target user. This produces a predicted score for each movie that the user has not yet watched. Finally, movies are ranked by these predicted scores and the top recommendations are returned.

Overall, this approach feels very natural and human: it mimics how we take advice from people who share similar tastes, while relying on machine-learning distance metrics and inference to make the recommendations meaningful.

## 4.3 Movie-Based Recommendation

Let’s review the model structure by representing each movie as a row and each user as a column in an n-dimensional rating space, where *n* is the total number of ratings.

Then, as before, we will build a KNN model to detect movies similar to the one provided as input, and recommend the most suitable films that closely match our selection.

Additionally, we will make a small adjustment. Until now, our user–movie matrix only contained MovieIDs and not the actual movie titles. We will correct this to simplify the code, which was sometimes more complex due to this limitation.

#### <span style="color: #4CAF50;">4.3.a Movie/User Matrix</span>

In [72]:
Table_Tampon = pd.merge(ratings, movies, how='inner', on='MovieID')
Ratings = Table_Tampon.groupby(by=['UserID','Title'], as_index=False).agg({"Rating":"mean"})
Ratings.head()

Unnamed: 0,UserID,Title,Rating
0,1,A Bug's Life (1998),5.0
1,1,A Christmas Story (1983),5.0
2,1,A Close Shave (1995),3.0
3,1,Airplane! (1980),4.0
4,1,Aladdin (1992),4.0


In [73]:
Matrice_Films_Utilisateur = Ratings.pivot(
     index='Title',
       columns='UserID',
          values='Rating').fillna(0)

Matrice_Films_Utilisateur.head()

UserID,1,2,3,4,5,6,7,8,9,10,...,6031,6032,6033,6034,6035,6036,6037,6038,6039,6040
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"$1,000,000 Duck (1971)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Night Mother (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...And Justice for All (1979),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1-900 (1994),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As before, we will convert this new matrix into a sparse matrix.

In [74]:
Matrice_Films_Utilisateur_creuse = csr_matrix(Matrice_Films_Utilisateur.values)
Matrice_Films_Utilisateur_creuse

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1000209 stored elements and shape (3706, 6040)>

In [75]:
liste_titre = list(Matrice_Films_Utilisateur.index)
liste_titre[:10]

['$1,000,000 Duck (1971)',
 "'Night Mother (1986)",
 "'Til There Was You (1997)",
 '...And Justice for All (1979)',
 '1-900 (1994)',
 '10 Things I Hate About You (1999)',
 '101 Dalmatians (1961)',
 '101 Dalmatians (1996)',
 '12 Angry Men (1957)',
 '187 (1997)']

For the next step, we will proceed a bit differently. We will now create a dictionary with movie titles as keys and the number of times they have been rated as values.

In [76]:
movie_dict = {movie : index for index, movie in enumerate(liste_titre)}
dict(list(movie_dict.items())[:5])

{'$1,000,000 Duck (1971)': 0,
 "'Night Mother (1986)": 1,
 "'Til There Was You (1997)": 2,
 '...And Justice for All (1979)': 3,
 '1-900 (1994)': 4}

Now let’s create a list that ignores case, which will be useful when searching for the title.

In [77]:
Liste_titre_sans_case = [str(i).lower() if isinstance(i, str) else i for i in liste_titre]

Build our model as we did previously

In [78]:
model_f = NearestNeighbors(metric='cosine', algorithm='brute')
model_f.fit(Matrice_Films_Utilisateur_creuse)

In [79]:
def film_proche(movie, n = 10):
    index = movie_dict[movie]
    knn_film = np.asarray([Matrice_Films_Utilisateur.values[index]])
    n = min(len(liste_titre) - 1, n)
    distances, indices = model_f.kneighbors(knn_film, n_neighbors=n + 1)

    print("Top", n, "movies that are very similar to the movie", movie, "are:")
    for i in range(1, len(distances[0])):
        print(liste_titre[indices[0][i]])

In [80]:
film_proche('Addams Family Values (1993)',5)

Top 5 movies that are very similar to the movie Addams Family Values (1993) are:
The Addams Family (1991)
Death Becomes Her (1992)
Ace Ventura Pet Detective (1994)
Batman Returns (1992)
Robin Hood Men in Tights (1993)


In [81]:
def Proposition_films(Movie):
    temp = ''
    films_possibles = Liste_titre_sans_case.copy()
    for lettre in Movie:
        resultats = []
        temp += lettre
        for film in films_possibles:
            if temp in film:
                resultats.append(film)
        if len(resultats) == 0:
            return films_possibles
        resultats.sort()
        films_possibles = resultats.copy()
    return films_possibles

In [82]:
class invalid(Exception):
    pass

def Recomandatio_film_V2():    
    try:
        Title = input("Enter the name of a movie to get similar recommendations: ")
        print("")
        Title_min = Title.lower()

        if Title_min not in Liste_titre_sans_case:
            raise invalid
        else:
            nbr = int(input("Enter the number of movies you want to be recommended: "))
            film_proche(liste_titre[Liste_titre_sans_case.index(Title_min)], nbr)

    except invalid:
        possible_movies = Proposition_films(Title_min)

        if len(possible_movies) == len(liste_titre):
            print("The movie you requested is not in our list of available movies. Please try another one.")
            Recomandatio_film_V2()
        else:
            indices = [Liste_titre_sans_case.index(i) for i in possible_movies]
            print("Maybe you meant:\n", [liste_titre[i] for i in indices])
            print("")
            Recomandatio_film_V2()


In [83]:
Recomandatio_film_V2()

Enter the name of a movie to get similar recommendations:  add



Maybe you meant:
 ['Addams Family Values (1993)', 'Addicted to Love (1997)', 'Aladdin (1992)', 'Aladdin and the King of Thieves (1996)', 'Big Daddy (1999)', 'Blazing Saddles (1974)', 'Braddock Missing in Action III (1988)', 'Caddyshack (1980)', 'Daddy Long Legs (1919)', "Jacob's Ladder (1990)", 'The Addams Family (1991)', 'The Addiction (1995)']



Enter the name of a movie to get similar recommendations:  Addams Family Values (1993)





Enter the number of movies you want to be recommended:  10


Top 10 movies that are very similar to the movie Addams Family Values (1993) are:
The Addams Family (1991)
Death Becomes Her (1992)
Ace Ventura Pet Detective (1994)
Batman Returns (1992)
Robin Hood Men in Tights (1993)
Wayne's World 2 (1993)
The Brady Bunch Movie (1995)
Liar Liar (1997)
The Flintstones (1994)
Mrs. Doubtfire (1993)


Here, I implemented a content-based movie recommendation system that focuses on movie similarity rather than user behavior. The user enters a movie title, and I first check whether it exists in the dataset. If the title is not found, I handle the situation smoothly by suggesting close matches based on textual similarity, which avoids abrupt errors and makes the interaction more natural.

Once a valid movie is selected, I compute similarities between movies using their feature representations and retrieve the closest ones. These similarities come from a machine-learning distance model, allowing me to recommend films that are truly close in terms of genre, style, or overall characteristics. This approach clearly highlights the machine-learning aspect, as it relies on vector representations and nearest-neighbor similarity, while keeping the recommendation process intuitive and easy to use.

<h1 align="left"><strong style="color: #7ed957;">5.  Evaluation of the two models</strong></h1>

## 5.1 Evaluation of the user-based model

Pour tester notre model ici j'ai divier mon dataset en trai et test comme nous avions l'habitude de le faire en cours.

In [84]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from tqdm.auto import tqdm   # ⚠️ IMPORTANT pour affichage notebook
import numpy as np


To evaluate our user-based recommendation model, I first split the ratings dataset into a training set and a test set, following the same methodology used during lectures.
The training set is used to build the recommendation model, while the test set allows us to measure how well the model generalizes to unseen data.
To reduce computation time while keeping realistic results, the evaluation is performed on a random subset of the test set.

In [85]:
ratings_train, ratings_test = train_test_split(ratings,test_size=0.2,random_state=42)
ratings_test_eval = ratings_test.sample(n=min(3000, len(ratings_test)), random_state=42)

In this step, I build the user–item matrix from the training data, where each row represents a user, each column represents a movie, and each value corresponds to the rating given by the user.
Missing ratings are filled with zeros, meaning that the user has not rated the movie.
This matrix is the core representation used by the collaborative filtering model. \
Since the user–item matrix is very sparse, I convert it into a compressed sparse row (CSR) matrix to improve both memory usage and computational efficiency.
I then train a K-Nearest Neighbors model using cosine similarity, which measures how similar two users are based on their rating patterns.
The brute-force algorithm is chosen to ensure exact similarity computation.

In [86]:
Matrice_Utilisateur_Films_train = ratings_train.pivot(index="UserID",columns="MovieID",values="Rating").fillna(0)

Matrice_creuse_train = csr_matrix(Matrice_Utilisateur_Films_train.values)
model_knn = NearestNeighbors(metric="cosine", algorithm="brute")
model_knn.fit(Matrice_creuse_train)

This function predicts the rating that a user would give to a specific movie.
First, it checks whether the user and the movie exist in the training matrix. \
Then, it finds the most similar users using the KNN model and computes a weighted average of their ratings for the target movie, where the weights are based on cosine similarity. \
If none of the neighbors has rated the movie, the function returns no prediction.

In [88]:
def predict_rating_user_based(user_id, movie_id, k=5):
    if user_id not in Matrice_Utilisateur_Films_train.index:
        return np.nan
    if movie_id not in Matrice_Utilisateur_Films_train.columns:
        return np.nan

    user_vec = Matrice_Utilisateur_Films_train.loc[user_id].values.reshape(1, -1)
    distances, indices = model_knn.kneighbors(user_vec, n_neighbors=k+1)

    neighbors = indices.flatten()[1:]
    similarities = 1 - distances.flatten()[1:]

    neighbor_ratings = Matrice_Utilisateur_Films_train.iloc[neighbors][movie_id]
    mask = neighbor_ratings > 0

    if mask.sum() == 0:
        return np.nan

    return np.dot(similarities[mask], neighbor_ratings[mask]) / similarities[mask].sum()

This function generates top-N movie recommendations for a given user. \
It identifies the most similar users, aggregates their ratings, and assigns a score to each movie.
Movies already seen by the user are excluded to avoid recommending items the user has already watched. \
Finally, the movies with the highest scores are returned as recommendations.

In [89]:
def recommend_user_based(user_id, k=10, n_neighbors=5):
    if user_id not in Matrice_Utilisateur_Films_train.index:
        return []

    user_vec = Matrice_Utilisateur_Films_train.loc[user_id].values.reshape(1, -1)
    distances, indices = model_knn.kneighbors(user_vec, n_neighbors=n_neighbors+1)

    neighbors = indices.flatten()[1:]
    similarities = 1 - distances.flatten()[1:]

    scores = np.dot(similarities, Matrice_Utilisateur_Films_train.iloc[neighbors].values)
    seen = Matrice_Utilisateur_Films_train.loc[user_id] > 0
    scores[seen.values] = 0

    return list(Matrice_Utilisateur_Films_train.columns[np.argsort(scores)[::-1][:k]])


For the content-based baseline, the predicted rating of a movie is simply its average weighted rating computed from the training set. \
This approach does not depend on user similarity and serves as a simple reference model. \
If a movie is not found in the training statistics, no prediction is returned. \
This function recommends movies to a user based on overall popularity and weighted scores.
Movies already rated by the user are removed, and the remaining ones are ranked according to their weighted score.
The top-K movies are then returned as content-based recommendations.

In [90]:
movie_scores_train = compute_weighted_scores(ratings_train)

def predict_rating_content_based(movie_id):
    row = movie_scores_train[movie_scores_train["MovieID"] == movie_id]
    return row["mean_rating"].values[0] if not row.empty else np.nan

def recommend_content_based(user_id, k=10):
    seen = ratings_train[ratings_train["UserID"] == user_id]["MovieID"].unique()
    candidates = movie_scores_train[~movie_scores_train["MovieID"].isin(seen)]
    return candidates.sort_values("weighted_score", ascending=False).head(k)["MovieID"].tolist()


To evaluate the quality of the recommendation lists, I use Precision@K and Recall@K.\
Precision@K measures how many recommended movies are actually relevant, while Recall@K measures how many relevant movies were successfully retrieved by the model.
These metrics focus on ranking quality rather than rating accuracy.

In [91]:
def precision_recall_at_k(recommended, relevant, k):
    recommended = set(recommended[:k])
    relevant = set(relevant)
    tp = len(recommended & relevant)
    return tp / k, tp / len(relevant) if relevant else 0

In this block, I evaluate the user-based model on the test subset. \
For each rating in the test set, I compare the predicted rating with the true rating to compute MAE and RMSE. \
In parallel, Precision@10 and Recall@10 are computed by comparing the recommended movies with the movies that the user actually rated positively in the test set.

In [92]:
y_true_ub, y_pred_ub = [], []
prec_ub, rec_ub = [], []

for _, row in tqdm(ratings_test_eval.iterrows(),
                   total=len(ratings_test_eval),
                   desc="User-Based RMSE/MAE"):
    pred = predict_rating_user_based(row["UserID"], row["MovieID"])
    if not np.isnan(pred):
        y_true_ub.append(row["Rating"])
        y_pred_ub.append(pred)

for user_id in tqdm(ratings_test_eval["UserID"].unique(),
                    desc="User-Based Precision/Recall"):
    relevant = ratings_test_eval[
        (ratings_test_eval["UserID"] == user_id) &
        (ratings_test_eval["Rating"] >= 4)
    ]["MovieID"].tolist()

    if not relevant:
        continue

    recommended = recommend_user_based(user_id, k=10)
    p, r = precision_recall_at_k(recommended, relevant, 10)
    prec_ub.append(p)
    rec_ub.append(r)

User-Based RMSE/MAE:   0%|          | 0/3000 [00:00<?, ?it/s]

User-Based Precision/Recall:   0%|          | 0/1932 [00:00<?, ?it/s]

The same evaluation procedure is applied to the content-based model to ensure a fair comparison. \
Rating prediction errors (MAE and RMSE) as well as ranking metrics (Precision@10 and Recall@10) are computed using the same test data. \
This allows us to directly compare both approaches under identical conditions.

In [93]:
y_true_cb, y_pred_cb = [], []
prec_cb, rec_cb = [], []

for _, row in tqdm(ratings_test_eval.iterrows(),
                   total=len(ratings_test_eval),
                   desc="Content-Based RMSE/MAE"):
    pred = predict_rating_content_based(row["MovieID"])
    if not np.isnan(pred):
        y_true_cb.append(row["Rating"])
        y_pred_cb.append(pred)

for user_id in tqdm(ratings_test_eval["UserID"].unique(),
                    desc="Content-Based Precision/Recall"):
    relevant = ratings_test_eval[
        (ratings_test_eval["UserID"] == user_id) &
        (ratings_test_eval["Rating"] >= 4)
    ]["MovieID"].tolist()

    if not relevant:
        continue

    recommended = recommend_content_based(user_id, k=10)
    p, r = precision_recall_at_k(recommended, relevant, 10)
    prec_cb.append(p)
    rec_cb.append(r)

Content-Based RMSE/MAE:   0%|          | 0/3000 [00:00<?, ?it/s]

Content-Based Precision/Recall:   0%|          | 0/1932 [00:00<?, ?it/s]

In [94]:
print("\n===== USER-BASED =====")
print("MAE :", mean_absolute_error(y_true_ub, y_pred_ub))
print("RMSE:", np.sqrt(mean_squared_error(y_true_ub, y_pred_ub)))
print("Precision@10:", np.mean(prec_ub))
print("Recall@10   :", np.mean(rec_ub))

print("\n===== CONTENT-BASED =====")
print("MAE :", mean_absolute_error(y_true_cb, y_pred_cb))
print("RMSE:", np.sqrt(mean_squared_error(y_true_cb, y_pred_cb)))
print("Precision@10:", np.mean(prec_cb))
print("Recall@10   :", np.mean(rec_cb))


===== USER-BASED =====
MAE : 0.8651288461578492
RMSE: 1.13841359740478
Precision@10: 0.015535444947209653
Recall@10   : 0.13303167420814477

===== CONTENT-BASED =====
MAE : 0.7862627910813956
RMSE: 0.9842620863128162
Precision@10: 0.008295625942684766
Recall@10   : 0.062305178481649064


Evaluating recommendation systems, especially user-based collaborative filtering models, is inherently challenging. Unlike classical supervised learning models, the objective here is not to predict an exact value with high precision, but to approximate user preferences, which are subjective, sparse, and highly variable. As a result, standard evaluation metrics must be interpreted with caution. \

Looking at the results, the user-based model achieves a MAE of approximately 0.87 and an RMSE of around 1.14. While these values may seem relatively high at first glance, they are actually common in collaborative filtering settings. User ratings are noisy by nature: different users interpret rating scales differently, and the same user may rate similar movies inconsistently over time. Moreover, the sparsity of the user–item matrix makes it difficult to find highly similar users with enough overlapping ratings, which directly impacts prediction accuracy. \

Regarding ranking metrics, the Precision@10 and Recall@10 values are relatively low. This outcome is expected in large recommendation spaces where users typically interact with only a very small subset of available items. Recommending ten items among thousands, while matching a few highly rated movies in the test set, is statistically difficult. In this context, a low Precision@10 does not necessarily indicate a poor model, but rather reflects the intrinsic difficulty of the task and the strictness of the metric.

The content-based model, on the other hand, shows slightly better RMSE and MAE values, which indicates more stable rating predictions. However, its Precision@10 and Recall@10 are even lower than those of the user-based model. This behavior is coherent with the nature of content-based approaches: they rely on global popularity or item features and do not adapt strongly to individual user preferences. As a result, they tend to predict ratings more consistently but fail to capture personalized relevance.

Overall, these results highlight an important point: recommendation systems are not evaluated solely on numerical performance. Metrics such as RMSE, Precision, and Recall provide useful insights, but they do not fully capture user satisfaction or perceived relevance. Given these limitations, the goal of this project is not to optimize a single metric, but to understand the behavior, strengths, and weaknesses of different recommendation paradigms. This justifies continuing the analysis by focusing not only on numerical scores, but also on qualitative interpretation and model design choices.