In [1]:
import sys
import os

current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, '..'))
sys.path.append(os.path.join(project_root, 'src'))

In [2]:
import pandas as pd
import numpy as np
import networkx as nx


from metrics import map_score, mrr_score, ndcg_score, rmse_score, average_precision
from utils import train_test_split, to_user_movie_matrix, make_binary_matrix, RatingMatrix
from models.pagerank import create_transition_matrix, personalized_page_rank

In [3]:
ratings = pd.read_csv('../data/ratings.dat', sep='::', engine='python', names=['UserID', 'MovieID', 'Rating', 'Timestamp'])
users = pd.read_csv('../data/users.dat', sep='::', engine='python', names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'])
movies = pd.read_csv('../data/movies.dat', sep='::', engine='python', names=['MovieID', 'Title', 'Genres'], encoding='latin1')

data = ratings.merge(users, on='UserID').merge(movies, on='MovieID')

In [4]:
#train / test split by time 
train_ratings, test_ratings = train_test_split(ratings, 'Timestamp')

#train / test matrix creation
user_movie_train = to_user_movie_matrix(train_ratings)
user_movie_test = to_user_movie_matrix(test_ratings) 
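
A time-based split keeps the oldest ratings for training and the newest for testing, so the model never sees ratings made after the ones it is evaluated on. The project's own `train_test_split` helper is not shown here, so the following is only a minimal sketch of the idea; the name `split_by_time` and the `test_fraction` parameter are hypothetical and may not match the helper's actual behavior.

```python
import pandas as pd

def split_by_time(df, time_col, test_fraction=0.2):
    """Hypothetical helper: oldest rows go to train, newest rows go to test."""
    ordered = df.sort_values(time_col, kind="stable")
    # Always hold out at least one row for testing
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]

# Toy ratings table with the same column names as ratings.dat
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "MovieID": [10, 20, 10, 30, 20],
    "Rating": [5, 3, 4, 2, 5],
    "Timestamp": [100, 400, 200, 500, 300],
})
toy_train, toy_test = split_by_time(toy, "Timestamp")
```

With this kind of split some test users or movies may never appear in train, which is why the test matrix has to be filtered before evaluation.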

### Graph-based analysis with a custom implementation of Personalized PageRank

Represent part of the dataset as a connected graph and set up a meaningful experiment with the PageRank algorithm.

In [5]:
# Drop movies from the test matrix that are absent from the train matrix, since such cases cannot be evaluated

user_movie_test.matrix = user_movie_test.matrix.loc[user_movie_test.matrix.index.isin(user_movie_train.matrix.index), :].copy()

In [6]:
# Likewise drop users from the test matrix that are absent from the train matrix

user_movie_test.matrix = user_movie_test.matrix.drop(columns=user_movie_test.matrix.columns[~user_movie_test.matrix.columns.isin(user_movie_train.matrix.columns)]).copy()

In [7]:
# Fill missing ratings with 0 before building the transition matrix needed for PageRank
train_matrix = user_movie_train.matrix.fillna(0)

In [8]:
# Custom weights that bias the walk towards 'better' films

custom_weights = {
    5.0: 100,
    4.0: 50,
    3.0: 1,
    2.0: 0,
    1.0: 0,
    0.0: 0
}

In [9]:
train_matrix = train_matrix.replace(custom_weights)

In [10]:
transition_matrix = create_transition_matrix(train_matrix)
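
The source of `create_transition_matrix` is not shown in this notebook. As a rough sketch of what such a step could look like, the block below builds a column-normalized transition matrix for the bipartite user-movie graph. It assumes the layout implied by the rest of the notebook (rows are movies, columns are users, and user nodes come before movie nodes); the function name `build_transition_matrix` and the column-stochastic convention are assumptions, not the project's actual code.

```python
import numpy as np

def build_transition_matrix(weights):
    """Sketch: weights is a (num_movies, num_users) array of non-negative edge weights."""
    num_movies, num_users = weights.shape
    n = num_users + num_movies
    adjacency = np.zeros((n, n))
    # Edges between user nodes (first block) and movie nodes (second block)
    adjacency[:num_users, num_users:] = weights.T
    adjacency[num_users:, :num_users] = weights
    # Normalize each column so that column j holds the probabilities of leaving node j
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # nodes without edges keep an all-zero column
    return adjacency / col_sums

# Toy example: 3 movies x 2 users, using weights like those in custom_weights
toy_weights = np.array([[100.0, 0.0],
                        [50.0, 1.0],
                        [0.0, 0.0]])
toy_P = build_transition_matrix(toy_weights)
```

Because ratings of 1 and 2 map to weight 0, users or movies with only low ratings end up as isolated nodes in this construction, which is worth keeping in mind when reading the scores.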

In [11]:
num_users = user_movie_train.matrix.shape[1]
num_movies = user_movie_train.matrix.shape[0]

In [12]:
# Prediction template: the first 30 users of the test matrix (use the full matrix to score every user)
y_pred = user_movie_test.matrix[user_movie_test.matrix.columns[0:30]].copy()

In [13]:
# Position of each selected user among the train matrix columns (-1 if the user is missing)
indeces_in_train = user_movie_train.matrix.columns.get_indexer(y_pred.columns)

In [15]:
for user, user_idx in zip(y_pred.columns, indeces_in_train):

    # Personalization vector: puts all restart mass on the current user's node
    personalization_vector = np.zeros(transition_matrix.shape[0])
    personalization_vector[user_idx] = 1
    # highly_rated_movies = train_ratings[(train_ratings['UserID'] == user) & (train_ratings['Rating'] >= 4)]['MovieID']

    # for movie_id in highly_rated_movies:
    #     movie_index = num_users + train_matrix.index.get_loc(movie_id)
    #     personalization_vector[movie_index] = 1

    # #normalization
    # personalization_vector /= personalization_vector.sum()

    # Compute Personalized PageRank scores
    ppr_vector = personalized_page_rank(transition_matrix, personalization_vector, alpha=0.85)

    # Scores after the first num_users entries belong to the movie nodes
    movie_scores = ppr_vector[num_users:]

    movie_recommendations = pd.DataFrame({'PPR_Score': movie_scores})
    movie_recommendations = movie_recommendations.set_index(user_movie_train.matrix.index)

    # Copy the scores of the test movies into this user's prediction column
    y_pred[user] = movie_recommendations.loc[y_pred.index, 'PPR_Score'].values
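
The body of `personalized_page_rank` lives in `models/pagerank.py` and is not reproduced in this notebook. For readers who want to see the idea, here is a minimal power-iteration sketch under stated assumptions: the matrix `P` is column-stochastic, `alpha` is the probability of following an edge (so `1 - alpha` is the restart probability), and probability mass lost at nodes without edges is returned to the restart distribution. The function name `ppr_power_iteration` and these conventions are assumptions and may differ from the project's implementation.

```python
import numpy as np

def ppr_power_iteration(P, p, alpha=0.85, tol=1e-10, max_iter=200):
    """Sketch of Personalized PageRank: r = alpha * P @ r + (1 - alpha) * p."""
    p = p / p.sum()
    r = p.copy()
    for _ in range(max_iter):
        r_next = alpha * (P @ r)
        # Restart mass, plus any mass lost at all-zero columns, goes back to p
        r_next += (1.0 - r_next.sum()) * p
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy graph with nodes [user0, user1, movie0, movie1]:
# user0 - movie0, user1 - movie0, user1 - movie1
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
toy_P = A / A.sum(axis=0)          # column-stochastic transition matrix
toy_p = np.array([1.0, 0.0, 0.0, 0.0])  # restart at user0
toy_scores = ppr_power_iteration(toy_P, toy_p)
```

In the toy graph, `movie0` is one step from `user0` while `movie1` is only reachable through `user1`, so `movie0` receives the higher score for `user0`.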

In [26]:
y_pred.to_csv('../artifacts/pagerank_results_20249616_1.csv')

In [16]:
y_pred = RatingMatrix(y_pred)

In [17]:
map_score_value = map_score(RatingMatrix(user_movie_test.get_rating_matrix()[y_pred.get_users()]), y_pred, top=30)
mrr_score_value = mrr_score(RatingMatrix(user_movie_test.get_rating_matrix()[y_pred.get_users()]), y_pred, top=30)
ndcg_score_value = ndcg_score(RatingMatrix(user_movie_test.get_rating_matrix()[y_pred.get_users()]), y_pred, top=30)

print(f'PageRank MAP: {map_score_value}')
print(f'PageRank MRR: {mrr_score_value}')
print(f'PageRank NDCG: {ndcg_score_value}')

PageRank MAP: 0.1769805265726722
PageRank MRR: 0.22
PageRank NDCG: 0.18374997127433737


MAP is the mean, over all evaluated users, of the average precision of each user's recommendation list. A value of ~0.18 is poor and suggests that, on average, the precision of the recommended items is about 18%.

An MRR of 0.22 is also low. MRR averages the reciprocal rank of the first relevant item, so 0.22 corresponds to the first relevant movie appearing, in harmonic-mean terms, at around position 4-5 of the recommendation list (1/0.22 ≈ 4.5).

NDCG evaluates the quality of the ranking by considering the position of relevant items, discounting those that appear lower in the list. An NDCG of 0.18 suggests that the overall ranking of relevant items is not effective.
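
To make the three metrics concrete, the sketch below computes them for a single toy ranking with binary relevance. These are common textbook definitions written with hypothetical `toy_` names; the project's `metrics` module may use different normalizations or cut-offs, so the code illustrates the concepts rather than reproducing the reported numbers.

```python
import math

def toy_average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def toy_reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def toy_dcg(relevance):
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))

def toy_ndcg(relevance):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = toy_dcg(sorted(relevance, reverse=True))
    return toy_dcg(relevance) / ideal if ideal else 0.0

# Relevant items sit at ranks 3 and 5 of a five-item list
ranking = [0, 0, 1, 0, 1]
ap = toy_average_precision(ranking)   # (1/3 + 2/5) / 2
rr = toy_reciprocal_rank(ranking)     # 1/3
ndcg = toy_ndcg(ranking)
```

MAP and MRR are then the means of `ap` and `rr` over all users, which is why a single user with an early hit can lift MRR noticeably when only 30 users are evaluated.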

### NetworkX implementation to check our results

To check our Personalized PageRank implementation, we compare its results with those of NetworkX.

In [18]:
import networkx as nx

In [19]:
G = nx.Graph()

for user, movie, rating in zip(train_ratings['UserID'], train_ratings['MovieID'], train_ratings['Rating']):
    G.add_edge(f'user_{user}', f'movie_{movie}', weight=rating)  #G.add_edge(f'{user}', f'{movie}', weight=rating)

In [20]:
#this will be a draft of our prediction
y_pred = user_movie_test.matrix[user_movie_test.matrix.columns[0:30]].copy() #user_movie_test.matrix.copy()

In [21]:
indeces_in_train = [user_movie_train.matrix.columns.to_list().index(user) if user in user_movie_train.matrix.columns.to_list() else -1 for user in y_pred.columns]

In [22]:
# NOTE: only the first 25 of the 30 selected users are scored in this loop;
# the remaining columns keep the values copied from the test matrix
for user in y_pred.columns[:25]:

    # Personalization vector: all restart mass on the current user's node
    # (nodes missing from the dict are treated as zero by NetworkX)
    personalization = {f'user_{user}': 1}

    # Compute Personalized PageRank scores (default alpha=0.85, edge weights read from 'weight')
    pagerank_scores = nx.pagerank(G, personalization=personalization)

    # Keep only the movie nodes and index their scores by movie ID (as strings)
    movie_recommendations = pd.DataFrame(
        [(node, score) for node, score in pagerank_scores.items() if node.startswith('movie_')],
        columns=['MovieID', 'PPR_Score'],
    )
    movie_recommendations['MovieID'] = movie_recommendations['MovieID'].str.replace('movie_', '')
    movie_recommendations = movie_recommendations.set_index('MovieID')

    # Copy the scores of the test movies into this user's prediction column
    y_pred[user] = movie_recommendations.loc[y_pred.index.astype(str), 'PPR_Score'].values

In [23]:
y_pred = RatingMatrix(y_pred)

In [24]:
map_score_value = map_score(RatingMatrix(user_movie_test.get_rating_matrix()[y_pred.get_users()]), y_pred, top=30)
mrr_score_value = mrr_score(RatingMatrix(user_movie_test.get_rating_matrix()[y_pred.get_users()]), y_pred, top=30)
ndcg_score_value = ndcg_score(RatingMatrix(user_movie_test.get_rating_matrix()[y_pred.get_users()]), y_pred, top=30)

print(f'PageRank MAP: {map_score_value}')
print(f'PageRank MRR: {mrr_score_value}')
print(f'PageRank NDCG: {ndcg_score_value}')

PageRank MAP: 0.1743260222284341
PageRank MRR: 0.21583333333333332
PageRank NDCG: 0.18165560677695508


The results are similar to ours, which suggests that our implementation has no major technical mistakes. The two setups are not identical, though: the NetworkX graph uses the raw ratings as edge weights, whereas our run starts from the custom-weighted matrix.
Overall, recommendation quality is poor, which suggests that PageRank on its own is not well suited to this problem. Still, several steps could improve the approach:

* Personalization vector enhancement: instead of putting all the restart mass on the user's node, we can also include the films the user rated highly.
* Incorporate additional features into the adjacency matrix: for example, film similarities, movie genres or other attributes describing content. **However,** this also inflates the adjacency matrix and significantly increases the time needed to compute recommendations.
* Apply regularization techniques: a better normalization mechanism for the transition matrix could handle the sparsity of the rating matrix and, in theory, capture user preferences better.
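
The first improvement in the list can be sketched as follows: part of the restart mass stays on the user's node and the rest is shared equally among the movies that user rated highly. The function name `build_personalization` and the `user_mass` parameter are hypothetical, and the node ordering (users first, then movies) is assumed from the way the notebook slices the score vector.

```python
import numpy as np

def build_personalization(user_idx, liked_movie_idx, num_users, num_movies, user_mass=0.5):
    """Sketch: restart distribution over [user nodes..., movie nodes...].

    liked_movie_idx holds unique positions of highly rated movies among the movie nodes.
    """
    p = np.zeros(num_users + num_movies)
    if len(liked_movie_idx) == 0:
        # No liked movies known: fall back to the user-only vector
        p[user_idx] = 1.0
        return p
    p[user_idx] = user_mass
    # Movie nodes are offset by num_users in the combined node ordering
    p[num_users + np.asarray(liked_movie_idx)] = (1.0 - user_mass) / len(liked_movie_idx)
    return p

# User 1 out of 3 users liked movies 0 and 2 out of 4 movies
toy_personalization = build_personalization(1, [0, 2], num_users=3, num_movies=4)
```

The split between the user node and the liked movies is a tuning choice; `user_mass=0.5` is only a placeholder and would need to be validated against the ranking metrics.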

Ultimately, however, PageRank is not suitable for this problem as a standalone solution. It considers only the link structure (i.e., the user-movie interactions) without understanding the content or features of the movies or users. Our dataset contains information about movies (genres) and users (demographics, occupation) that the algorithm does not leverage.

Another problem is that the dataset is sparse: most users have rated only a small subset of movies (this was established during the exploratory data analysis). PageRank relies on the link structure, and sparse data can lead to a poor representation of user preferences and weak connections in the bipartite graph.

Last but not least, PageRank by design tends to favor nodes with more connections, which in the context of movies can lead to popular movies being recommended repeatedly, even if they are not relevant to a specific user's preferences.