# Experimentation and Results

## Objective of the project 

This study conducts a comparative analysis of these three models, focusing on their
accuracy, computational complexity, scalability, and effectiveness in handling data
sparsity and dynamically changing environments. By evaluating these aspects, the research
aims to identify the operational strengths and weaknesses of each model and to provide
insights that can guide the development and deployment of future recommender systems.
Through this comparative framework, we aim to determine which model provides the most
reliable and robust recommendations, and under what conditions, in support of better
digital services.

In [13]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter, defaultdict
from surprise import Dataset, Reader, KNNBasic, SVD, CoClustering, accuracy
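# train_test_split comes from scikit-learn because the split below is applied to a pandas
# DataFrame; surprise.model_selection.train_test_split expects a surprise Dataset instead.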
from sklearn.model_selection import train_test_split
from surprise.accuracy import rmse, mae

In [5]:
links_df = pd.read_csv('MovieLens_100k/links.csv')
movies_df = pd.read_csv('MovieLens_100k/movies.csv')
ratings_df = pd.read_csv('MovieLens_100k/ratings.csv')
tags_df = pd.read_csv('MovieLens_100k/tags.csv')

datasets = {
    "Links": links_df,
    "Movies": movies_df,
    "Ratings": ratings_df,
    "Tags": tags_df
}

datasets_info = {name: df.head() for name, df in datasets.items()}
datasets_info

{'Links':    movieId  imdbId   tmdbId
 0        1  114709    862.0
 1        2  113497   8844.0
 2        3  113228  15602.0
 3        4  114885  31357.0
 4        5  113041  11862.0,
 'Movies':    movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  ,
 'Ratings':    userId  movieId  rating  timestamp
 0       1        1     4.0  964982703
 1       1        3     4.0  964981247
 2       1        6     4.0  964982224
 3       1       47     5.0  9

## Dataset structure

In [6]:
# Check for missing values in each dataset
missing_values = {name: df.isnull().sum() for name, df in datasets.items()}

# Print the information about missing values
for name, missing in missing_values.items():
    print(f"Missing values in {name} dataset:\n{missing}\n")

Missing values in Links dataset:
movieId    0
imdbId     0
tmdbId     8
dtype: int64

Missing values in Movies dataset:
movieId    0
title      0
genres     0
dtype: int64

Missing values in Ratings dataset:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Missing values in Tags dataset:
userId       0
movieId      0
tag          0
timestamp    0
dtype: int64



In [7]:
# Print the shape of each DataFrame
for name, df in datasets.items():
    print(f"The shape of the {name} DataFrame is: {df.shape}")

The shape of the Links DataFrame is: (9742, 3)
The shape of the Movies DataFrame is: (9742, 3)
The shape of the Ratings DataFrame is: (100836, 4)
The shape of the Tags DataFrame is: (3683, 4)


In [8]:
distribution_of_ratings = ratings_df.groupby('rating').size().reset_index(name='count')
distribution_of_ratings

   rating  count
0     0.5   1370
1     1.0   2811
2     1.5   1791
3     2.0   7551
4     2.5   5550
5     3.0  20047
6     3.5  13136
7     4.0  26818
8     4.5   8551
9     5.0  13211
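
The distribution above can be summarised in a few numbers. The following is a minimal sketch that recomputes the total, the mean rating, and the share of ratings at or above 3.5 directly from the counts in the table. The 3.5 cutoff is an assumption borrowed from the default `threshold` of `precision_recall_at_k` defined later in this notebook, and the variable names are illustrative only.

```python
# Rating counts copied from the distribution table above.
rating_counts = {
    0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
    3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211,
}

total = sum(rating_counts.values())

# Mean rating, weighting each rating value by how often it occurs.
mean_rating = sum(r * c for r, c in rating_counts.items()) / total

# Share of ratings that would count as "relevant" under a 3.5 cutoff
# (assumed cutoff, matching the default threshold used later for precision/recall).
relevant_share = sum(c for r, c in rating_counts.items() if r >= 3.5) / total

print(f"total ratings: {total}")
print(f"mean rating: {mean_rating:.3f}")
print(f"share of ratings >= 3.5: {relevant_share:.3f}")
```

The total matches the 100836 rows reported for the Ratings DataFrame, and the distribution is skewed towards the upper half of the scale.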


# Hypergraph-based Models: HyperGCN Algorithm

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import hypernetx as hnx
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from collections import defaultdict
from joblib import Parallel, delayed
import time
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt


# Load the dataset
ratings_df = pd.read_csv('MovieLens_100k/ratings.csv')
movies_df = pd.read_csv('MovieLens_100k/movies.csv')

train_df, test_df = train_test_split(ratings_df, test_size=0.20, random_state=42)

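# Build the hypergraph: one hyperedge per (movie, rating value) pair, containing the
# movie node and every user who gave that movie that rating.
# Note: iterrows() upcasts each row to float, so node names look like 'user_509.0'.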
edges = defaultdict(list)
for _, row in train_df.iterrows():
    user_node = f'user_{row["userId"]}'
    movie_node = f'movie_{row["movieId"]}'
    rating = row["rating"]
    hyperedge = f'{movie_node}_rating_{rating}'
    edges[hyperedge].append(user_node)
    edges[hyperedge].append(movie_node)

H = hnx.Hypergraph(edges)
print(f"Hypergraph created with {len(H.nodes)} nodes and {len(H.edges)} edges.")
print(f"Number of nodes in hypergraph: {len(H.nodes)}")
print(f"Sample nodes: {list(H.nodes)[:5]}")


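# Create adjacency matrix for hypergraph (clique expansion): every pair of nodes that
# shares a hyperedge is connected. csr_matrix sums duplicate (row, col) entries, so a
# pair that co-occurs in several hyperedges receives a weight greater than 1.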
def create_hypergraph_adjacency_matrix(hypergraph):
    node_list = list(hypergraph.nodes)
    node_idx = {node: idx for idx, node in enumerate(node_list)}
    n = len(node_list)
    
    data = []
    row = []
    col = []

    for edge in hypergraph.edges:
        edge_nodes = list(hypergraph.edges[edge])
        for i in range(len(edge_nodes)):
            for j in range(i + 1, len(edge_nodes)):
                node_i = node_idx[edge_nodes[i]]
                node_j = node_idx[edge_nodes[j]]
                data.append(1)
                row.append(node_i)
                col.append(node_j)
                data.append(1)
                row.append(node_j)
                col.append(node_i)

    adj_matrix = csr_matrix((data, (row, col)), shape=(n, n))
    print(f"Adjacency matrix created with shape {adj_matrix.shape}")
    return adj_matrix, node_idx

adj_matrix, node_to_idx = create_hypergraph_adjacency_matrix(H)
adj_matrix_sparse = tf.sparse.SparseTensor(indices=np.array([adj_matrix.nonzero()[0], adj_matrix.nonzero()[1]]).T,
                                           values=adj_matrix.data.astype(np.float32),
                                           dense_shape=adj_matrix.shape)
adj_matrix_sparse = tf.sparse.reorder(adj_matrix_sparse)


# Evaluation metrics functions
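def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return per-user precision@k and recall@k dicts.

    An item counts as relevant when its true rating is >= threshold and as
    recommended when its estimated rating is >= threshold.
    """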
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold)) for (est, true_r) in user_ratings[:k])
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls

def compute_mse(predictions):
    """Compute Mean Squared Error (MSE)."""
    mse = np.mean([(true_r - est) ** 2 for (_, _, true_r, est, _) in predictions])
    return mse

def compute_rmse(predictions):
    """Compute Root Mean Squared Error (RMSE)."""
    mse = compute_mse(predictions)
    rmse = np.sqrt(mse)
    return rmse

def compute_mae(predictions):
    """Compute Mean Absolute Error (MAE)."""
    mae = np.mean([abs(true_r - est) for (_, _, true_r, est, _) in predictions])
    return mae

# Functions to generate sparse and new user data
def get_sparse_data(ratings, frac=0.1):
    sparse_ratings_df = ratings.sample(frac=frac, random_state=42) 
    return sparse_ratings_df

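def get_new_user_data(ratings, frac=0.1):
    # Note: the sample is drawn from the per-rating 'userId' column, not from the set
    # of distinct users, so active users are almost always picked and nearly every
    # rating is retained (80184 of 80668 training rows in the run reported below).
    new_user_ratings_df = ratings[ratings['userId'].isin(ratings['userId'].sample(frac=frac, random_state=42))]
    return new_user_ratings_df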

def evaluate_model(test, embeddings, user_mapping, movie_mapping, scenario, algorithm):
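    def predict_rating(user, movie):
        # The estimate is the raw dot product of the two embedding vectors; it is not
        # rescaled to the 0.5-5.0 rating range. Users or movies that were not seen
        # during training fall back to an estimate of 0.
        if user in user_mapping and movie in movie_mapping:
            user_idx = user_mapping[user]
            movie_idx = movie_mapping[movie]
            if user_idx >= embeddings.shape[0] or movie_idx >= embeddings.shape[0]:
                return 0
            user_emb = embeddings[user_idx]
            movie_emb = embeddings[movie_idx]
            return np.dot(user_emb, movie_emb)
        else:
            return 0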

    predictions = []
    for _, row in test.iterrows():
        uid = row['userId']
        mid = row['movieId']
        true_r = row['rating']
        est = predict_rating(uid, mid)
        predictions.append((uid, mid, true_r, est, None))

    mse = compute_mse(predictions)
    rmse = compute_rmse(predictions)
    mae = compute_mae(predictions)
    precisions, recalls = precision_recall_at_k(predictions, k=10)

    avg_precision = np.mean(list(precisions.values()))
    avg_recall = np.mean(list(recalls.values()))

    results = pd.DataFrame({
        'Scenario': [scenario],
        'Algorithm': [algorithm],
        'MSE':[mse],
        'RMSE': [rmse],
        'MAE': [mae],
        'Precision@10': [avg_precision],
        'Recall@10': [avg_recall]
    })
    
    return results


Hypergraph created with 9593 nodes and 26961 edges.
Number of nodes in hypergraph: 9593
Sample nodes: ['user_509.0', 'movie_7347.0', 'user_380.0', 'user_274.0', 'user_474.0']
Adjacency matrix created with shape (9593, 9593)
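
To make the construction concrete, here is a minimal sketch of the same clique expansion on a toy hypergraph, written with the standard library only. The hyperedges and the helper name `clique_expansion` are hypothetical. They follow the notebook's naming scheme, where each hyperedge groups a movie with the users who gave it one particular rating.

```python
from collections import defaultdict
from itertools import combinations

# Toy hyperedges (hypothetical data): one hyperedge per (movie, rating) pair,
# holding the movie node and the users who gave that rating.
toy_edges = {
    "movie_1_rating_4.0": ["user_1", "user_2", "movie_1"],
    "movie_2_rating_5.0": ["user_1", "movie_2"],
    "movie_1_rating_3.0": ["user_3", "movie_1"],
}

def clique_expansion(edges):
    """Return symmetric pair weights; weights[(a, b)] counts shared hyperedges."""
    weights = defaultdict(int)
    for members in edges.values():
        # Deduplicate and sort so each unordered pair is visited once per hyperedge.
        for a, b in combinations(sorted(set(members)), 2):
            weights[(a, b)] += 1
            weights[(b, a)] += 1
    return dict(weights)

weights = clique_expansion(toy_edges)

# user_1 and user_2 gave movie_1 the same rating, so they are directly connected.
print(weights[("user_1", "user_2")])
# user_1 and user_3 rated movie_1 differently, so they share no hyperedge.
print(("user_1", "user_3") in weights)
```

In this toy case, two users are linked only when they agree on a rating. Users who rated the same movie differently remain connected indirectly through the movie node.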


# HyperGCN

In [8]:
class ImprovedHyperGCN(tf.keras.Model):
    def __init__(self, num_nodes, num_features, hidden_dims, output_dim, dropout_rate):
        super(ImprovedHyperGCN, self).__init__()
        self.embedding = layers.Embedding(num_nodes, num_features)
        self.hidden_layers = [layers.Dense(dim, activation='relu', kernel_regularizer=regularizers.l2(0.01)) for dim in hidden_dims]
        self.output_layer = layers.Dense(output_dim, activation='linear')
        self.dropout = layers.Dropout(dropout_rate)

    def call(self, adj_matrix):
        x = self.embedding(tf.range(adj_matrix.shape[0]))
        x = tf.sparse.sparse_dense_matmul(adj_matrix, x)
        for layer in self.hidden_layers:
            x = layer(x)
            x = self.dropout(x)
        x = self.output_layer(x)
        return x

def train_hypergcn_model(H, train, test, scenario):
    user_ids = train['userId'].unique()
    movie_ids = train['movieId'].unique()

    user_mapping = {user_id: idx for idx, user_id in enumerate(user_ids)}
    movie_mapping = {movie_id: idx + len(user_ids) for idx, movie_id in enumerate(movie_ids)}

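    # Work on copies so the index columns are not written into the callers'
    # DataFrames; this also avoids pandas' SettingWithCopyWarning on sliced input.
    train = train.copy()
    test = test.copy()

    train['user_idx'] = train['userId'].map(user_mapping)
    train['movie_idx'] = train['movieId'].map(movie_mapping)

    test['user_idx'] = test['userId'].map(user_mapping)
    test['movie_idx'] = test['movieId'].map(movie_mapping)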

    num_nodes = len(user_ids) + len(movie_ids)
    num_features = 128
    hidden_dims = [256, 128]
    output_dim = 1
    dropout_rate = 0.5

    model = ImprovedHyperGCN(num_nodes, num_features, hidden_dims, output_dim, dropout_rate)
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

    def pairwise_hinge_loss(positive_scores, negative_scores, margin=1.0):
        return tf.reduce_mean(tf.maximum(0.0, margin - positive_scores + negative_scores))

    def improved_sample_negative_edges(train_df, user_mapping, movie_mapping, num_neg_samples):
        users = train_df['userId'].unique()
        movies = train_df['movieId'].unique()
        positive_samples = train_df[['userId', 'movieId']].values
        positive_samples_set = set((user_mapping[user], movie_mapping[movie]) for user, movie in positive_samples)
        
        neg_samples = []
        while len(neg_samples) < num_neg_samples:
            user = np.random.choice(users)
            user_idx = user_mapping[user]
            negative_movies = np.setdiff1d(movies, train_df[train_df['userId'] == user]['movieId'].values)
            neg_movie = np.random.choice(negative_movies)
            neg_movie_idx = movie_mapping[neg_movie]
            if (user_idx, neg_movie_idx) not in positive_samples_set:
                neg_samples.append((user_idx, neg_movie_idx))
        
        neg_df = pd.DataFrame(neg_samples, columns=['user_idx', 'movie_idx'])
        print(f"Generated {len(neg_df)} negative samples")
        return neg_df

    positive_samples = train[['user_idx', 'movie_idx']]
    neg_train_samples = improved_sample_negative_edges(train, user_mapping, movie_mapping, num_neg_samples=len(positive_samples))
    print(f"Number of positive samples: {len(positive_samples)}")
    print(f"Number of negative samples: {len(neg_train_samples)}")

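    epochs = 200
    batch_size = 256  # Note: not used below; each epoch is a single full-batch update

    def train_step(model, adj_matrix, optimizer, positive_samples, negative_samples):
        # Note: scores are computed directly from the embedding table. adj_matrix and
        # the model's call() (graph propagation and dense layers) are not used here.
        with tf.GradientTape() as tape: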
            positive_user_embeddings = model.embedding(positive_samples['user_idx'].values)
            positive_movie_embeddings = model.embedding(positive_samples['movie_idx'].values)
            negative_user_embeddings = model.embedding(negative_samples['user_idx'].values)
            negative_movie_embeddings = model.embedding(negative_samples['movie_idx'].values)

            positive_scores = tf.reduce_sum(positive_user_embeddings * positive_movie_embeddings, axis=1)
            negative_scores = tf.reduce_sum(negative_user_embeddings * negative_movie_embeddings, axis=1)

            loss = pairwise_hinge_loss(positive_scores, negative_scores)

        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        return loss

    start_time = time.time()
    for epoch in range(epochs):
        loss = train_step(model, adj_matrix_sparse, optimizer, positive_samples, neg_train_samples)
        if epoch % 10 == 0:
            print(f'Epoch {epoch}, Loss: {loss.numpy()}')

    end_time = time.time()
    embeddings = model.embedding(tf.range(num_nodes)).numpy()

    results = evaluate_model(test, embeddings, user_mapping, movie_mapping, scenario, "HyperGCN")
    results['Running Time (s)'] = end_time - start_time
    
    return results

# Evaluate HyperGCN model for different scenarios
results_hypergcn_normal = train_hypergcn_model(H, train_df, test_df, "Normal")

Generated 80668 negative samples
Number of positive samples: 80668
Number of negative samples: 80668
Epoch 0, Loss: 1.0000461339950562
Epoch 10, Loss: 0.9871677160263062
Epoch 20, Loss: 0.9546469449996948
Epoch 30, Loss: 0.826619029045105
Epoch 40, Loss: 0.4226799011230469
Epoch 50, Loss: 0.19554905593395233
Epoch 60, Loss: 0.11421488225460052
Epoch 70, Loss: 0.06767608225345612
Epoch 80, Loss: 0.039076920598745346
Epoch 90, Loss: 0.022246282547712326
Epoch 100, Loss: 0.012394175864756107
Epoch 110, Loss: 0.006759673822671175
Epoch 120, Loss: 0.0034875241108238697
Epoch 130, Loss: 0.0017413144232705235
Epoch 140, Loss: 0.0008235073764808476
Epoch 150, Loss: 0.0003586069797165692
Epoch 160, Loss: 0.0001706021575955674
Epoch 170, Loss: 8.574590174248442e-05
Epoch 180, Loss: 3.9966966141946614e-05
Epoch 190, Loss: 1.586773396411445e-05
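
The shape of this loss curve follows from the definition of the pairwise hinge loss. The sketch below is a plain-Python restatement of `pairwise_hinge_loss` with made-up scores, intended only to illustrate the two ends of the curve. It does not reproduce the training run.

```python
def pairwise_hinge_loss(positive_scores, negative_scores, margin=1.0):
    """Mean of max(0, margin - positive + negative) over paired scores."""
    terms = [
        max(0.0, margin - p + n)
        for p, n in zip(positive_scores, negative_scores)
    ]
    return sum(terms) / len(terms)

# Scores near zero, as with freshly initialised embeddings: loss equals the margin.
start = pairwise_hinge_loss([0.0, 0.0], [0.0, 0.0])
print(start)  # 1.0

# Every positive beats its paired negative by at least the margin: loss is 0.
end = pairwise_hinge_loss([3.0, 2.5], [1.0, 0.5])
print(end)  # 0.0

# Mixed case: only the second pair still violates the margin.
mixed = pairwise_hinge_loss([2.0, 0.5], [0.0, 0.0])
print(mixed)  # 0.25
```

A loss close to zero therefore says only that positive pairs outscore their paired negatives by the margin. It does not, by itself, say anything about how close the scores are to the true ratings.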


In [9]:
results_hypergcn_normal

   Scenario  Algorithm       MSE      RMSE      MAE  Precision@10  Recall@10  Running Time (s)
0    Normal   HyperGCN  5.608074  2.368137  2.05015      0.116924   0.009341         86.924503
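
When reading the Precision@10 and Recall@10 columns, note that `precision_recall_at_k` applies the 3.5 threshold to the estimated score as well as to the true rating, while `evaluate_model` passes raw embedding dot products as estimates. The sketch below is a single-user restatement of the same counting logic, using hypothetical scores. It suggests one possible contributor to low values: a ranking can be identical and still score zero if the estimates are not on the rating scale.

```python
def precision_recall_at_k_single(user_ratings, k=10, threshold=3.5):
    """user_ratings: list of (estimated, true) pairs for one user."""
    ranked = sorted(user_ratings, key=lambda x: x[0], reverse=True)
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])
    n_rel_and_rec_k = sum(
        (true_r >= threshold) and (est >= threshold) for est, true_r in ranked[:k]
    )
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall

# Hypothetical estimates that live on the 0.5-5.0 rating scale.
calibrated = [(4.5, 5.0), (4.0, 3.0), (3.6, 4.0), (2.0, 4.5), (1.0, 2.0)]
calibrated_result = precision_recall_at_k_single(calibrated, k=3)
print(calibrated_result)

# Same ordering of items, but scores on a small dot-product-like scale:
# no estimate reaches the threshold, so nothing counts as recommended.
uncalibrated = [(0.9, 5.0), (0.8, 3.0), (0.7, 4.0), (0.4, 4.5), (0.2, 2.0)]
uncalibrated_result = precision_recall_at_k_single(uncalibrated, k=3)
print(uncalibrated_result)
```

Whether this effect accounts for the figures reported here would need to be checked against the actual score distribution, which is not shown in the notebook output.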


# Sparse data HyperGCN 

In [10]:
sparse_train_df = get_sparse_data(train_df, frac=0.1)

# Build the hypergraph from the 10% sample of the training ratings
edges_sparse = defaultdict(list)
for _, row in sparse_train_df.iterrows():
    user_node = f'user_{row["userId"]}'
    movie_node = f'movie_{row["movieId"]}'
    rating = row["rating"]
    hyperedge = f'{movie_node}_rating_{rating}'
    edges_sparse[hyperedge].append(user_node)
    edges_sparse[hyperedge].append(movie_node)

H_sparse = hnx.Hypergraph(edges_sparse)

adj_matrix, node_to_idx = create_hypergraph_adjacency_matrix(H_sparse)
adj_matrix_sparse = tf.sparse.SparseTensor(indices=np.array([adj_matrix.nonzero()[0], adj_matrix.nonzero()[1]]).T,
                                           values=adj_matrix.data.astype(np.float32),
                                           dense_shape=adj_matrix.shape)
adj_matrix_sparse = tf.sparse.reorder(adj_matrix_sparse)


results_hypergcn_sparse = train_hypergcn_model(H_sparse, sparse_train_df, test_df, "Sparse")
results_hypergcn_sparse

Adjacency matrix created with shape (3794, 3794)
Generated 8067 negative samples
Number of positive samples: 8067
Number of negative samples: 8067
Epoch 0, Loss: 1.000066876411438
Epoch 10, Loss: 0.959601104259491
Epoch 20, Loss: 0.8961366415023804
Epoch 30, Loss: 0.7816119194030762
Epoch 40, Loss: 0.5799558162689209
Epoch 50, Loss: 0.27321043610572815
Epoch 60, Loss: 0.04732617735862732
Epoch 70, Loss: 0.0027934403624385595
Epoch 80, Loss: 2.025742651312612e-05
Epoch 90, Loss: 0.0
Epoch 100, Loss: 0.0
Epoch 110, Loss: 0.0
Epoch 120, Loss: 0.0
Epoch 130, Loss: 0.0
Epoch 140, Loss: 0.0
Epoch 150, Loss: 0.0
Epoch 160, Loss: 0.0
Epoch 170, Loss: 0.0
Epoch 180, Loss: 0.0
Epoch 190, Loss: 0.0


   Scenario  Algorithm        MSE      RMSE       MAE  Precision@10  Recall@10  Running Time (s)
0    Sparse   HyperGCN  13.239963  3.638676  3.480585           0.0        0.0         11.528216
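
In `evaluate_model`, any test row whose user or movie is missing from the training mappings receives an estimate of 0. The sparse hypergraph has far fewer nodes (3794 versus 9593), so it is plausible that many more test rows hit this fallback. The following sketch shows one way to measure that. The helper `mapping_coverage` and the toy pairs are hypothetical and are not part of the notebook.

```python
def mapping_coverage(train_pairs, test_pairs):
    """Fraction of test (user, movie) pairs whose user AND movie occur in training."""
    train_users = {user for user, _ in train_pairs}
    train_movies = {movie for _, movie in train_pairs}
    covered = sum(
        1 for user, movie in test_pairs
        if user in train_users and movie in train_movies
    )
    return covered / len(test_pairs) if test_pairs else 0.0

# Toy (user, movie) pairs, hypothetical data.
train_pairs = [(1, 10), (1, 11), (2, 10)]
test_pairs = [(1, 10), (2, 11), (3, 10), (2, 12)]

coverage = mapping_coverage(train_pairs, test_pairs)
print(coverage)  # 0.5: user 3 and movie 12 never appear in training
```

With the notebook's DataFrames, the inputs could be built as `list(zip(df['userId'], df['movieId']))` for the sparse training set and the test set respectively. A low coverage value would mean that a large part of the error comes from the fallback rather than from the learned embeddings.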


# New User HyperGCN

In [12]:
new_user_train_df = get_new_user_data(train_df, frac=0.1)

# Build the hypergraph
edges_new_user = defaultdict(list)
for _, row in new_user_train_df.iterrows():
    user_node = f'user_{row["userId"]}'
    movie_node = f'movie_{row["movieId"]}'
    rating = row["rating"]
    hyperedge = f'{movie_node}_rating_{rating}'
    edges_new_user[hyperedge].append(user_node)
    edges_new_user[hyperedge].append(movie_node)

H_new_user = hnx.Hypergraph(edges_new_user)

adj_matrix, node_to_idx = create_hypergraph_adjacency_matrix(H_new_user)
adj_matrix_sparse = tf.sparse.SparseTensor(indices=np.array([adj_matrix.nonzero()[0], adj_matrix.nonzero()[1]]).T,
                                           values=adj_matrix.data.astype(np.float32),
                                           dense_shape=adj_matrix.shape)
adj_matrix_sparse = tf.sparse.reorder(adj_matrix_sparse)



results_hypergcn_new_user = train_hypergcn_model(H_new_user, new_user_train_df, test_df, "New User")
results_hypergcn_new_user

Adjacency matrix created with shape (9564, 9564)


Generated 80184 negative samples
Number of positive samples: 80184
Number of negative samples: 80184
Epoch 0, Loss: 0.9998919367790222
Epoch 10, Loss: 0.9869893193244934
Epoch 20, Loss: 0.9545198678970337
Epoch 30, Loss: 0.8279085159301758
Epoch 40, Loss: 0.4215397834777832
Epoch 50, Loss: 0.19156445562839508
Epoch 60, Loss: 0.11105109006166458
Epoch 70, Loss: 0.06559161841869354
Epoch 80, Loss: 0.037892796099185944
Epoch 90, Loss: 0.021464204415678978
Epoch 100, Loss: 0.01199073065072298
Epoch 110, Loss: 0.0064708152785897255
Epoch 120, Loss: 0.0033071571961045265
Epoch 130, Loss: 0.0016385489143431187
Epoch 140, Loss: 0.000772208790294826
Epoch 150, Loss: 0.00035553640918806195
Epoch 160, Loss: 0.00015765165153425187
Epoch 170, Loss: 5.20551620866172e-05
Epoch 180, Loss: 1.1776611245295499e-05
Epoch 190, Loss: 9.506505875833682e-07


   Scenario  Algorithm       MSE     RMSE       MAE  Precision@10  Recall@10  Running Time (s)
0  New User   HyperGCN  5.691181  2.38562  2.063831      0.105558   0.008382          88.79288
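
These figures are close to the Normal scenario, which is consistent with the sample sizes in the log: 80184 of the 80668 training ratings were kept. `get_new_user_data` samples from the per-rating `userId` column, so users with many ratings are very likely to be selected. If the intent is to keep a fraction of distinct users, one option is to sample from the set of unique user ids instead. The sketch below illustrates that idea with the standard library on toy tuples. The helper name and the data are hypothetical.

```python
import random

def sample_distinct_users(rows, frac=0.1, seed=42):
    """Keep all rows belonging to a random fraction of distinct users.

    rows: list of (user_id, movie_id, rating) tuples (hypothetical layout).
    """
    users = sorted({user_id for user_id, _, _ in rows})
    k = round(frac * len(users))
    chosen = set(random.Random(seed).sample(users, k))
    return [row for row in rows if row[0] in chosen]

# Toy ratings (hypothetical): 10 users, where user u has u ratings.
toy_rows = [(u, m, 4.0) for u in range(1, 11) for m in range(u)]

subset = sample_distinct_users(toy_rows, frac=0.2)
kept_users = {user_id for user_id, _, _ in subset}
print(len(kept_users), "of 10 users kept")
```

In pandas, the equivalent change would be to draw the sample from `ratings['userId'].drop_duplicates()` rather than from `ratings['userId']`. Doing so would change the training set and therefore the results reported in this section.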


In [13]:
# Combine HyperGCN results into a single DataFrame
results_hypergcn_combined = pd.concat([results_hypergcn_normal, results_hypergcn_sparse, results_hypergcn_new_user], ignore_index=True)
results_hypergcn_combined

   Scenario  Algorithm        MSE      RMSE       MAE  Precision@10  Recall@10  Running Time (s)
0    Normal   HyperGCN   5.608074  2.368137   2.05015      0.116924   0.009341         86.924503
1    Sparse   HyperGCN  13.239963  3.638676  3.480585           0.0        0.0         11.528216
2  New User   HyperGCN   5.691181   2.38562  2.063831      0.105558   0.008382          88.79288
