# Experimentation and Results

## Objective of the project 

This study conducts a comparative analysis of three collaborative filtering models (KNNBasic,
SVD, and CoClustering), focusing on their accuracy, computational cost, scalability, and
effectiveness in handling data sparsity and dynamically changing environments. By evaluating
these aspects, the research aims to identify the operational strengths and weaknesses of each
model and to provide insights that can guide the development and deployment of future
recommender systems. Through this comparative framework, we aim to answer which model, under
what conditions, provides the most reliable and robust recommendations, with the broader goal
of improving digital services.

In [3]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter, defaultdict
from surprise import Dataset, Reader, KNNBasic, SVD, CoClustering, accuracy
#from surprise.model_selection import train_test_split
from sklearn.model_selection import train_test_split
from surprise.accuracy import rmse, mae

In [4]:
links_df = pd.read_csv('MovieLens_100k/links.csv')
movies_df = pd.read_csv('MovieLens_100k/movies.csv')
ratings_df = pd.read_csv('MovieLens_100k/ratings.csv')
tags_df = pd.read_csv('MovieLens_100k/tags.csv')

datasets = {
    "Links": links_df,
    "Movies": movies_df,
    "Ratings": ratings_df,
    "Tags": tags_df
}

datasets_info = {name: df.head() for name, df in datasets.items()}
datasets_info

{'Links':    movieId  imdbId   tmdbId
 0        1  114709    862.0
 1        2  113497   8844.0
 2        3  113228  15602.0
 3        4  114885  31357.0
 4        5  113041  11862.0,
 'Movies':    movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  ,
 'Ratings':    userId  movieId  rating  timestamp
 0       1        1     4.0  964982703
 1       1        3     4.0  964981247
 2       1        6     4.0  964982224
 3       1       47     5.0  9

## Dataset structure

In [6]:
# Check for missing values in each dataset
missing_values = {name: df.isnull().sum() for name, df in datasets.items()}

# Print the information about missing values
for name, missing in missing_values.items():
    print(f"Missing values in {name} dataset:\n{missing}\n")

Missing values in Links dataset:
movieId    0
imdbId     0
tmdbId     8
dtype: int64

Missing values in Movies dataset:
movieId    0
title      0
genres     0
dtype: int64

Missing values in Ratings dataset:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Missing values in Tags dataset:
userId       0
movieId      0
tag          0
timestamp    0
dtype: int64



In [7]:
# Print the shape of each DataFrame
for name, df in datasets.items():
    print(f"The shape of the {name} DataFrame is: {df.shape}")

The shape of the Links DataFrame is: (9742, 3)
The shape of the Movies DataFrame is: (9742, 3)
The shape of the Ratings DataFrame is: (100836, 4)
The shape of the Tags DataFrame is: (3683, 4)


In [8]:
distribution_of_ratings = ratings_df.groupby('rating').size().reset_index(name='count')
distribution_of_ratings

   rating  count
0     0.5   1370
1     1.0   2811
2     1.5   1791
3     2.0   7551
4     2.5   5550
5     3.0  20047
6     3.5  13136
7     4.0  26818
8     4.5   8551
9     5.0  13211


# Collaborative Filtering Algorithms

### For collaborative filtering, we implemented the three algorithms listed below:

### a. KNNBasic (K-Nearest Neighbors)
The KNNBasic algorithm leverages the k-nearest neighbors technique to predict user ratings
based on the weighted average of ratings from similar users or items. For KNN, we have chosen 3 different similarity measures to test: Pearson, Pearson baseline and Mean squared difference. Refer to the technical report for more detail. 
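
The weighted average described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Surprise implementation: the function name `knn_predict`, the neighbor list, and the choice to ignore non-positive similarities are assumptions made for the example.

```python
def knn_predict(neighbors, k=40):
    """Predict a rating as the similarity-weighted average of neighbor ratings.

    neighbors: list of (similarity, rating) pairs for the users (or items)
    that are candidates for the neighborhood. Hypothetical input format.
    """
    # Keep the k most similar neighbors, ignoring non-positive similarities.
    usable = [pair for pair in neighbors if pair[0] > 0]
    top_k = sorted(usable, key=lambda pair: pair[0], reverse=True)[:k]

    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbor; a real system would fall back to a default

    return sum(sim * rating for sim, rating in top_k) / total_sim


# Hypothetical neighbors of a target user for one movie: (similarity, rating)
neighbors = [(0.9, 4.0), (0.5, 3.0), (0.1, 5.0), (-0.2, 1.0)]
prediction = knn_predict(neighbors, k=2)
print(round(prediction, 3))
```

With `k=2`, only the two most similar neighbors contribute, so the estimate is pulled toward the rating of the neighbor with similarity 0.9.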

### b. SVD (Singular Value Decomposition)
SVD is a matrix factorization technique that decomposes the user-item rating matrix into
latent factors and predicts ratings from those factors.
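
As a rough illustration of predicting from latent factors, the sketch below uses the common biased form: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors. All numeric values and the function name `svd_predict` are hypothetical, and the clipping to the 0.5 to 5.0 range is an assumption based on the rating scale of this dataset.

```python
def svd_predict(global_mean, user_bias, item_bias, user_factors, item_factors,
                low=0.5, high=5.0):
    """Estimate a rating from biases and latent factors, clipped to the rating scale."""
    # Interaction term: dot product of the user and item latent vectors
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    return min(max(raw, low), high)


# Hypothetical learned parameters for one (user, item) pair
estimate = svd_predict(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.3, -0.5],
    item_factors=[0.4, 0.2],
)
print(round(estimate, 2))
```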

### c. CoClustering
CoClustering simultaneously clusters users and items to uncover hidden relationships
in the data, which are then used to predict ratings.
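
One way to see how the clusters feed into a prediction is the usual co-clustering estimate: start from the average rating of the co-cluster that contains the (user, item) pair, then adjust by how far the user and the item deviate from their own cluster averages. The sketch below assumes the cluster assignments and averages are already known; the function name and all numbers are hypothetical.

```python
def cocluster_predict(cocluster_mean, user_mean, user_cluster_mean,
                      item_mean, item_cluster_mean):
    """Co-cluster average, corrected by the user's and the item's deviation
    from the average rating of their respective clusters."""
    user_offset = user_mean - user_cluster_mean
    item_offset = item_mean - item_cluster_mean
    return cocluster_mean + user_offset + item_offset


# Hypothetical averages for one (user, item) pair
estimate = cocluster_predict(
    cocluster_mean=3.8,      # mean rating inside the (user cluster, item cluster) block
    user_mean=4.1,           # this user's mean rating
    user_cluster_mean=3.9,   # mean rating of the user's cluster
    item_mean=3.2,           # this item's mean rating
    item_cluster_mean=3.5,   # mean rating of the item's cluster
)
print(round(estimate, 2))
```

In this toy case the user rates slightly above their cluster and the item is rated below its cluster, so the two corrections partly cancel.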

### Useful functions 

In [5]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Map the predictions to only the top N items
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

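def get_top_n_recommendations(model, user_id, n=10):
    """Return the top-N (title, estimated rating) pairs for movies the user has not rated.

    `model` must be a fitted Surprise algorithm (for example, SVD).
    """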
    # Get a list of all movies in the dataset
    all_movies = movies_df['movieId'].unique()
    
    # Get movies that the user has already rated
    rated_movies = ratings_df[ratings_df['userId'] == user_id]['movieId'].tolist()
    
    # Predict ratings for all movies the user hasn't rated yet
    predictions = []
    for movie_id in set(all_movies) - set(rated_movies):
        pred = model.predict(uid=user_id, iid=movie_id)
        predictions.append((movie_id, pred.est))
    
    # Sort the predictions by estimated rating in descending order and select the top N
    top_n = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    
    # Map the movie IDs back to titles
    top_n_movies = [(movies_df[movies_df['movieId'] == mid]['title'].values[0], est) for mid, est in top_n]
    
    return top_n_movies

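def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return per-user precision@k and recall@k dictionaries."""
    # Group (estimated, true) rating pairs by user
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Rank the user's items by estimated rating, highest first
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Relevant items: true rating at or above the threshold
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Recommended items in the top k: estimated rating at or above the threshold
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Items in the top k that are both relevant and recommended
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold)) for (est, true_r) in user_ratings[:k])
        # Both metrics default to 0 when their denominator is 0
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls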

def compute_mse(predictions):
    """Compute Mean Squared Error (MSE)."""
    mse = np.mean([(true_r - est) ** 2 for (_, _, true_r, est, _) in predictions])
    return mse

def compute_rmse(predictions):
    """Compute Root Mean Squared Error (RMSE)."""
    mse = compute_mse(predictions)
    rmse = np.sqrt(mse)
    return rmse

def compute_mae(predictions):
    """Compute Mean Absolute Error (MAE)."""
    mae = np.mean([abs(true_r - est) for (_, _, true_r, est, _) in predictions])
    return mae

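def get_sparse_data(ratings, frac=0.1):
    """Return a random sample containing `frac` of the rating rows."""
    sparse_ratings_df = ratings.sample(frac=frac, random_state=42)
    return sparse_ratings_df

def get_new_user_data(ratings, frac=0.1):
    """Return all ratings of the users whose IDs appear in a random sample of `frac` of the rating rows."""
    # The sample is drawn from the userId column, one entry per rating row,
    # so users with many ratings are more likely to be included.
    sampled_user_ids = ratings['userId'].sample(frac=frac, random_state=42)
    new_user_ratings_df = ratings[ratings['userId'].isin(sampled_user_ids)]
    return new_user_ratings_df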
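
`get_new_user_data` samples from the `userId` column of the ratings table, so every rating row is a candidate and the selected IDs can cover a large share of the users. If the intent is to keep a fixed fraction of distinct users, one possible variant is sketched below. It is an assumption about intent, not the code that produced the results reported in this notebook, and `sample_distinct_users` is a hypothetical name.

```python
import pandas as pd


def sample_distinct_users(ratings, frac=0.1, random_state=42):
    """Keep all ratings that belong to a random fraction of the distinct users."""
    # One entry per user, so every user has the same chance of being drawn
    distinct_users = ratings['userId'].drop_duplicates()
    chosen = distinct_users.sample(frac=frac, random_state=random_state)
    return ratings[ratings['userId'].isin(chosen)]


# Toy ratings: 10 users, where user 1 has rated far more movies than the others
toy_ratings = pd.DataFrame({
    'userId': [1] * 20 + list(range(2, 11)),
    'movieId': list(range(1, 21)) + [1] * 9,
    'rating': [4.0] * 29,
})

subset = sample_distinct_users(toy_ratings, frac=0.2)
print(subset['userId'].nunique(), "of", toy_ratings['userId'].nunique(), "users kept")
```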

### In this part, we use the Surprise library, which implements a range of collaborative filtering algorithms, to evaluate different recommender models. Specifically, we evaluate the KNNBasic, SVD, and CoClustering algorithms, which are widely used for collaborative filtering tasks.

In [9]:
#Function to evaluate a model with a given algorithm and similarity measure
def evaluate_algorithm(algo_name, similarity_measure, train_set, test_set, user_based=True):
    if algo_name == 'KNNBasic':
        sim_options = {
            'name': similarity_measure,
            'user_based': user_based
        }
        model = KNNBasic(sim_options=sim_options)
    elif algo_name == 'SVD':
        model = SVD()
    elif algo_name == 'CoClustering':
        model = CoClustering()
    else:
        raise ValueError(f"Unknown algorithm: {algo_name}")

    # Measure start time
    start_time = time.time()
    
    # Train the model
    model.fit(train_set)
    
    # Make predictions on the test set
    predictions = model.test(test_set)

    # Measure end time
    end_time = time.time()
    # Calculate running time
    running_time = end_time - start_time
    
    # Evaluate accuracy
    mse_score = accuracy.mse(predictions, verbose=False)
    rmse_score = accuracy.rmse(predictions, verbose=False)
    mae_score = accuracy.mae(predictions, verbose=False)

    
    # Compute precision and recall
    precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=3.5)
    precision_avg = sum(prec for prec in precisions.values()) / len(precisions)
    recall_avg = sum(rec for rec in recalls.values()) / len(recalls)
    
    return algo_name, similarity_measure, user_based, mse_score, rmse_score, mae_score, precision_avg, recall_avg, running_time

# Function to evaluate all scenarios
def evaluate_all_scenarios(train_set, test_set, scenario_name):
    results_combined = []
    for algo_name, similarity_measure in algorithms:
        for user_based in [True, False]:
            algo_name, similarity_measure, user_based, mse_score, rmse_score, mae_score, precision_avg, recall_avg, running_time = evaluate_algorithm(algo_name, similarity_measure, train_set, test_set, user_based)
            results_combined.append({
                'Scenario': scenario_name,
                'Algorithm': algo_name,
                'Similarity Measure': similarity_measure if similarity_measure else 'N/A',
                'User-Based': user_based,
                'MSE': mse_score,
                'RMSE': rmse_score,
                'MAE': mae_score,
                'Precision@10': precision_avg,
                'Recall@10': recall_avg,
                'Running Time (s)': running_time
            })
    return results_combined


reader = Reader(rating_scale=(ratings_df['rating'].min(), ratings_df['rating'].max()))

# Step 1: Split the data into training and test sets (80% training, 20% testing)
train_df, test_df = train_test_split(ratings_df, test_size=0.20, random_state=42)

# Convert the training set into a Surprise dataset
train_data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']], reader)
train_set = train_data.build_full_trainset()

# Convert the test set into a Surprise dataset for later use
test_data = Dataset.load_from_df(test_df[['userId', 'movieId', 'rating']], reader)
test_set = test_data.build_full_trainset().build_testset()

# Step 2: Create the sparse training set from the 80% training data
sparse_train_df = get_sparse_data(train_df, frac=0.1)
sparse_data = Dataset.load_from_df(sparse_train_df[['userId', 'movieId', 'rating']], reader)
sparse_train_set = sparse_data.build_full_trainset()

# Step 3: Create the new user training set from the 80% training data
new_user_train_df = get_new_user_data(train_df, frac=0.1)
new_user_data = Dataset.load_from_df(new_user_train_df[['userId', 'movieId', 'rating']], reader)
new_user_train_set = new_user_data.build_full_trainset()


# Output to check
train_set.n_ratings, len(test_set)

# List of algorithms and their similarity measures to evaluate
algorithms = [
    ('KNNBasic', 'pearson'),
    ('KNNBasic', 'pearson_baseline'),
    ('KNNBasic', 'msd'),
    ('SVD', None),  # SVD does not use similarity measures
    ('CoClustering', None)  # CoClustering does not use similarity measures
]
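
The Precision@10 and Recall@10 values reported by `evaluate_algorithm` are per-user quantities averaged over all users. For a single user, the logic of `precision_recall_at_k` reduces to the following self-contained sketch; the toy (estimated, true) rating pairs and the function name `precision_recall_single_user` are hypothetical.

```python
def precision_recall_single_user(user_ratings, k=10, threshold=3.5):
    """Compute precision@k and recall@k for one user.

    user_ratings: list of (estimated rating, true rating) pairs.
    """
    # Rank items by estimated rating, highest first
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)

    # Relevant: true rating meets the threshold (counted over all items)
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended: estimated rating meets the threshold (counted in the top k)
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])
    # Hits: relevant and recommended, within the top k
    n_hits = sum(true_r >= threshold and est >= threshold for est, true_r in ranked[:k])

    precision = n_hits / n_rec_k if n_rec_k else 0
    recall = n_hits / n_rel if n_rel else 0
    return precision, recall


# Hypothetical (estimated, true) pairs for one user
toy_user = [(4.5, 5.0), (4.0, 3.0), (3.8, 4.0), (3.0, 4.5), (2.5, 2.0)]
precision, recall = precision_recall_single_user(toy_user, k=3)
print(f"precision@3 = {precision:.3f}, recall@3 = {recall:.3f}")
```

In the toy data, three items are recommended in the top 3 and two of them are relevant, while one relevant item (estimated at 3.0) falls below the threshold and is missed.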

In [10]:
# Evaluate normal scenario
results_normal = evaluate_all_scenarios(train_set, test_set, "Normal")

# Evaluate sparse data scenario
results_sparse = evaluate_all_scenarios(sparse_train_set, test_set, "Sparse")

# Evaluate new user data scenario
results_new_user = evaluate_all_scenarios(new_user_train_set, test_set, "New User")

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Compu

In [11]:
results_CF = pd.DataFrame(results_normal + results_sparse + results_new_user)
results_CF

   Scenario     Algorithm  Similarity Measure  User-Based       MSE      RMSE       MAE  Precision@10  Recall@10  Running Time (s)
0    Normal      KNNBasic             pearson        True  0.960516  0.980059  0.756248      0.765217   0.538993          1.725298
1    Normal      KNNBasic             pearson       False  0.956461  0.977988   0.75942      0.656542   0.475766         16.990574
2    Normal      KNNBasic    pearson_baseline        True   0.96051  0.980056  0.754453      0.770664   0.546996          1.766859
3    Normal      KNNBasic    pearson_baseline       False  0.849695  0.921789  0.697694      0.731489   0.541961         12.290272
4    Normal      KNNBasic                 msd        True  0.911588  0.954771  0.731834      0.758854   0.555813          1.513052
5    Normal      KNNBasic                 msd       False   0.83648  0.914593  0.702499      0.670066   0.496486         11.247583
6    Normal           SVD                 N/A        True  0.779164  0.882703  0.676092      0.746351   0.518142          1.232592
7    Normal           SVD                 N/A       False  0.781063  0.883777  0.675254      0.743975   0.518712          1.204111
8    Normal  CoClustering                 N/A        True  0.899857  0.948608  0.733647      0.717745   0.510991          2.545786
9    Normal  CoClustering                 N/A       False  0.908735  0.953276  0.737809      0.708916   0.502505          2.607994
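
The MSE and RMSE columns are related by RMSE = sqrt(MSE), which gives a quick consistency check on the table. The short sketch below applies it to three (MSE, RMSE) pairs copied from the Normal scenario rows above; the dictionary labels are only for readability.

```python
import math

# (MSE, RMSE) pairs copied from the results table, Normal scenario
reported = {
    "KNNBasic msd, item-based": (0.83648, 0.914593),
    "SVD, first run": (0.779164, 0.882703),
    "CoClustering, first run": (0.899857, 0.948608),
}

# Compare the square root of each reported MSE with the reported RMSE
for label, (mse, rmse) in reported.items():
    print(f"{label}: sqrt(MSE) = {math.sqrt(mse):.6f}, reported RMSE = {rmse:.6f}")

# Largest absolute difference, expected to be at the level of rounding error
max_gap = max(abs(math.sqrt(mse) - rmse) for mse, rmse in reported.values())
```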


In [13]:
# Group by Algorithm and User-Based, and calculate the average RMSE
avg_rmse = results_CF.groupby(["Algorithm", "User-Based"])["RMSE"].mean().reset_index()

# Find the algorithm with the lowest average RMSE
best_algorithm = avg_rmse.loc[avg_rmse["RMSE"].idxmin()]

# Filter the main DataFrame to get the RMSE values for the best algorithm in each scenario
best_algorithm_performance = results_CF[(results_CF["Algorithm"] == best_algorithm["Algorithm"]) & (results_CF["User-Based"] == best_algorithm["User-Based"])]

best_algorithm, best_algorithm_performance[["Scenario", "Algorithm", "RMSE"]]


(Algorithm          SVD
 User-Based        True
 RMSE          0.903691
 Name: 5, dtype: object,
     Scenario Algorithm      RMSE
 6     Normal       SVD  0.882703
 16    Sparse       SVD  0.946178
 26  New User       SVD  0.882193)

In [16]:
best_algorithm_df = best_algorithm_performance[["Scenario", "Algorithm", "MSE", "RMSE", "MAE", "Precision@10", "Recall@10", "Running Time (s)"]].reset_index(drop=True)
best_algorithm_df

   Scenario  Algorithm       MSE      RMSE       MAE  Precision@10  Recall@10  Running Time (s)
0    Normal        SVD  0.779164  0.882703  0.676092      0.746351   0.518142          1.232592
1    Sparse        SVD  0.895253  0.946178  0.735762      0.718301   0.509393          0.324698
2  New User        SVD  0.778264  0.882193  0.674224       0.74578   0.525009          1.197195
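
To make the gap between scenarios easier to read, the relative change of each metric from the Normal to the Sparse scenario can be computed directly from the SVD rows above. The sketch below only does arithmetic on the reported values; the variable names are illustrative.

```python
# SVD metrics copied from the table above
normal = {"RMSE": 0.882703, "MAE": 0.676092, "Precision@10": 0.746351, "Recall@10": 0.518142}
sparse = {"RMSE": 0.946178, "MAE": 0.735762, "Precision@10": 0.718301, "Recall@10": 0.509393}

# Relative change in percent, Sparse versus Normal
relative_change = {
    metric: (sparse[metric] - normal[metric]) / normal[metric] * 100
    for metric in normal
}

for metric, change in relative_change.items():
    print(f"{metric}: {change:+.2f}%")
```

Positive values for RMSE and MAE indicate larger errors under sparsity, while negative values for Precision@10 and Recall@10 indicate a loss in ranking quality.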
