<a href="https://www.kaggle.com/code/gpreda/content-based-recommender-system-evaluation?scriptVersionId=128328186" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>


# Introduction

This notebook evaluates a Content Based Recommender System by comparing it against a random recommender and a naive top 10 recommender.

The content is as follows:

- Ingestion of data, using the utility script for reading MovieLens data (movie_lens_data).  
- Initialization of the evaluator object, from the utility script evaluator.   
- Definition of the baseline random recommender.  
- Definition of the Content Based Recommender System. It uses genre information to calculate the similarity between movies (items).
- Addition of the baseline recommender, the naive top 10 recommender and the Content Based Recommender System to the evaluator.  
- Evaluation, comparing the three algorithms.  
- Analysis of the results.




# Analysis preparation

In [1]:
import numpy as np
import pandas as pd
import re
import os
import heapq
from surprise import accuracy
from surprise import Dataset
from surprise import Reader
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise.model_selection import LeaveOneOut
from surprise import NormalPredictor, SVD

In [2]:
from recommender_metrics import RecommenderMetrics
from movie_lens_data import MovieLensData
from evaluator import Evaluator

# Read data

We read the MovieLens data using the MovieLensData class from the movie_lens_data utility script.

First, we initialize the object.

In [3]:
path = "/kaggle/input/movielens-100k-dataset/ml-100k"
movie_lens_data = MovieLensData(
    users_path = os.path.join(path, "u.user"),
    ratings_path = os.path.join(path, "u.data"), 
    movies_path = os.path.join(path, "u.item"), 
    genre_path = os.path.join(path, "u.genre") 
    )

Then, we read:
- evaluation data   
- movie data  
- popularity rankings  

In [4]:
evaluation_data = movie_lens_data.read_ratings_data()
movie_data = movie_lens_data.read_movies_data()
popularity_rankings = movie_lens_data.get_popularity_ranks()
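
The movie_lens_data utility script is not shown in this notebook. As a rough sketch of what reading ratings and ranking movies by popularity could involve, the snippet below parses a few illustrative rows in the MovieLens 100K u.data layout (tab-separated user id, item id, rating, timestamp, no header). The column names user_id and timestamp, the sample rows, and the definition of popularity as "number of ratings" are assumptions made for illustration, not the utility script's actual behavior.

```python
import io
import pandas as pd

# Illustrative rows in the u.data layout:
# user id <TAB> item id <TAB> rating <TAB> timestamp (no header row).
sample_u_data = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t242\t1\t878887116\n"

# Column names are hypothetical; the utility script may use different ones.
ratings_sample = pd.read_csv(
    io.StringIO(sample_u_data),
    sep="\t",
    names=["user_id", "movie_id", "rating", "timestamp"],
)

# Assumption: popularity rank = rank by number of ratings (1 = most rated).
rating_counts = ratings_sample.groupby("movie_id")["rating"].count()
popularity_ranks_sample = rating_counts.rank(method="first", ascending=False).astype(int)
```

Here movie 242 has two ratings and movie 302 has one, so movie 242 gets rank 1.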

# Prepare evaluator


The evaluator will use the evaluation data to test the algorithms.

Evaluator works with two components:  
- EvaluationData, which defines the data used to test the algorithm(s)   
- EvaluatedAlgorithm, which defines the approach for testing an algorithm  

The metrics used in testing are defined in RecommenderMetrics, from the recommender_metrics utility script.

In [5]:
evaluator = Evaluator(evaluation_data, popularity_rankings)

Number of full trainset users: 943
Number of full trainset items: 1682
Number of trainset users: 943
Number of trainset items: 1641
Size of testset: 25000
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


# Add random recommender to evaluator

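We start by adding a random recommender to the evaluator as a baseline. Surprise's NormalPredictor predicts a random rating drawn from a normal distribution whose mean and standard deviation are estimated from the training ratings.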

In [6]:
algo_np = NormalPredictor()
evaluator.add_algorithm(algo_np, "Random")
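
To make the baseline concrete, here is a minimal sketch of the idea behind a normal-distribution random predictor, written with NumPy rather than Surprise. The toy ratings, the fixed seed, and the helper name random_rating are assumptions for illustration; Surprise's own implementation may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training ratings on a 1-5 scale (made-up data).
train_ratings = np.array([5, 3, 4, 1, 4, 5, 2, 3], dtype=float)

# Estimate the parameters of the normal distribution from the training ratings.
mu = train_ratings.mean()
sigma = train_ratings.std()  # population standard deviation (ddof=0)

def random_rating(low=1.0, high=5.0):
    """Draw one rating from N(mu, sigma^2) and clip it to the rating scale."""
    return float(np.clip(rng.normal(mu, sigma), low, high))

random_predictions = [random_rating() for _ in range(5)]
```

Because the prediction ignores both the user and the item, any algorithm that uses real signal should beat it.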

# Prepare content based recommender system


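Here we define our content based recommender system as a subclass of Surprise's AlgoBase.

We define the following methods:
- init  
- fit  
- estimate  
- compute_genre_similarities (helper called by fit)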

In [7]:
from surprise import AlgoBase
from surprise import PredictionImpossible

class ContentBasedRecSysKNN(AlgoBase):

    def __init__(self, k=40, movie_data=None, sim_options=None):
        AlgoBase.__init__(self)
        self.k = k
        self.movie_data = movie_data

    def fit(self, trainset):
        """
        Fit function - calculate genre similarities
        Args
            trainset: the training set
        Returns
            self
        """
        AlgoBase.fit(self, trainset)
        
        self.compute_genre_similarities()
    
        return self
    
    def estimate(self, u, i):
        """
        Estimate the rating that user u would give to item i.
        Process is:
            - gather the ratings given by user u
            - look up the genre similarity between item i and each item
            the user rated (the neighboring items)
            - keep the k most similar neighbors
            - return the average of their ratings, weighted by similarity
        Args
            u: current user (inner id)
            i: current item (inner id)
        Returns
            predicted rating
        """
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')
        
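        # Build up similarity scores between this item and everything the user rated.
        # NOTE: i and rating[0] are Surprise inner ids; indexing genre_similarities
        # with them assumes the rows of movie_data follow the same order.
        neighbors = []
        for rating in self.trainset.ur[u]:
            genre_similarity = self.genre_similarities[i, rating[0]]
            neighbors.append((genre_similarity, rating[1]))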
        
        # Extract the top-K most-similar ratings (k is set at init)
        k_neighbors = heapq.nlargest(self.k, neighbors, key=lambda t: t[0])
        
        # Compute average sim score of K neighbors weighted by user ratings
        sim_total = weighted_sum = 0
        for (sim_score, rating) in k_neighbors:
            if (sim_score > 0):
                sim_total += sim_score
                weighted_sum += sim_score * rating
            
        if (sim_total == 0):
            raise PredictionImpossible('No neighbors')

        predicted_rating = weighted_sum / sim_total

        return predicted_rating
    
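    def compute_genre_similarities(self):
        # select the genre columns: column 6 up to, but excluding, the last column
        genre_columns = self.movie_data.columns[6:-1]
        genres = self.movie_data[genre_columns]
        # linear_kernel computes the plain dot product between genre vectors,
        # i.e. the number of shared genres for 0/1 genre indicators; it equals
        # cosine similarity only when the rows are L2-normalized
        self.genre_similarities = linear_kernel(genres, genres)
        return self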
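
The following toy example sketches the two steps of this class on made-up data: building an item-item similarity matrix from binary genre vectors, then predicting a rating as a similarity-weighted average over the k most similar movies the user rated. It uses NumPy only; the dot product `genres @ genres.T` is what sklearn's linear_kernel computes, and the cosine variant is shown for comparison. The genre matrix, the ratings, and the function name predict are hypothetical.

```python
import heapq
import numpy as np

# Toy binary genre matrix: rows are movies, columns are genres (made-up data).
genres = np.array([
    [1, 1, 0, 0],   # movie 0: genres A, B
    [1, 1, 1, 0],   # movie 1: genres A, B, C
    [0, 1, 1, 1],   # movie 2: genres B, C, D
    [0, 0, 0, 1],   # movie 3: genre D
], dtype=float)

# Dot product of every pair of rows: the number of shared genres.
dot_sim = genres @ genres.T

# Cosine similarity divides each dot product by the two row norms.
norms = np.linalg.norm(genres, axis=1)
cos_sim = dot_sim / np.outer(norms, norms)

def predict(sim, target, user_ratings, k=2):
    """Similarity-weighted average over the k most similar rated movies."""
    neighbors = [(sim[target, item], rating) for item, rating in user_ratings]
    top_k = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    sim_total = sum(s for s, _ in top_k if s > 0)
    if sim_total == 0:
        return None  # no similar movie among the user's ratings
    return sum(s * r for s, r in top_k if s > 0) / sim_total

# The user rated movie 0 with 5 stars, movie 2 with 2 stars, movie 3 with 1 star.
user_ratings = [(0, 5.0), (2, 2.0), (3, 1.0)]
prediction_dot = predict(dot_sim, 1, user_ratings)
prediction_cos = predict(cos_sim, 1, user_ratings)
```

With the dot product, movies 0 and 2 each share two genres with movie 1, so the prediction is the plain average 3.5. With cosine similarity, movie 2 is penalized for having more genres, which moves the prediction toward the rating of movie 0 (about 3.65). The choice of kernel therefore affects the predictions.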

# Prepare naive top 10 RecSys

In [8]:
class Top10dRecSys(AlgoBase):

    def __init__(self, ratings=None):
        AlgoBase.__init__(self)
        self.ratings = ratings
        
        self.items = []

    def fit(self, trainset):
        """
        Fit function - calculate top 10
        Args
            trainset: the training set
        Returns
            self
        """
        AlgoBase.fit(self, trainset)
        
        self.prepare_rankings()
        return self
    
    def estimate(self, u, i):
        """
        Estimate the rating of item i for user u.
        Process is:
            - look up the rating score of item i in the rankings dataframe
            (the score does not depend on the user)
            - fall back to the average rating score if the item has no score

        Args
            u: current user
            i: current item
        Returns
            predicted rating
        """
        
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')

        # the predicted rating is the same for every user: it is the item's
        # average rating over all users, taken from self.rankings (if it exists)
        # NOTE: i is a Surprise inner id, while self.rankings["item"] holds
        # movie_id values; this lookup assumes the two coincide
        predicted_rating = self.rankings.loc[self.rankings["item"]==i]["rating_score"].values
        # test the size explicitly; the truth value of an empty array is ambiguous
        if predicted_rating.size > 0:
            return predicted_rating[0]
        
        # if no user rated the item, return the mean of the per-item average ratings
        return self.average_rating
    
    def prepare_rankings(self):
        # Collector
        self.rankings = pd.DataFrame(self.ratings.groupby('movie_id')['rating'].mean())
        self.rankings['rating_count'] = self.ratings.groupby('movie_id')['rating'].count()
        self.rankings = self.rankings.reset_index()
        self.rankings.columns = ["item", "rating_score", "rating_count"]

        # Ranking
        self.rankings = self.rankings.sort_values(ascending=False,by=['rating_score', 'rating_count'])
        self.average_rating = self.rankings["rating_score"].mean()

    def get_top_n(self, top_n=10):
        # Serving
        return self.rankings.head(top_n)
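
The ranking logic can be traced on a small made-up ratings table. This is a sketch under assumptions: the toy data and the helper name lookup are hypothetical, and the snippet mirrors only the groupby, sort and fallback steps of the class above, not its Surprise integration.

```python
import pandas as pd

# Toy ratings: movie 1 has three ratings, movie 2 has two, movie 3 has one.
toy_ratings = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2, 3],
    "rating":   [4, 5, 3, 5, 5, 2],
})

# Mean rating and number of ratings per movie, best-rated first.
rankings = (
    toy_ratings.groupby("movie_id")["rating"]
    .agg(rating_score="mean", rating_count="count")
    .reset_index()
    .rename(columns={"movie_id": "item"})
    .sort_values(by=["rating_score", "rating_count"], ascending=False)
)

# Fallback score: the mean of the per-movie mean ratings.
global_mean = rankings["rating_score"].mean()

def lookup(item, fallback):
    """Return the item's mean rating, or the fallback when the item is unrated."""
    match = rankings.loc[rankings["item"] == item, "rating_score"].to_numpy()
    return float(match[0]) if match.size > 0 else fallback
```

Note that the fallback is the average of the per-movie averages (here 11/3), which generally differs from the average over all individual ratings (here 4.0), because every movie counts equally regardless of how many ratings it received.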

# Add popularity ranking (naive top10) algo to evaluator

In [9]:
ratings = movie_lens_data.get_ratings()
top_10_rec_sys = Top10dRecSys(ratings)
evaluator.add_algorithm(top_10_rec_sys, "Top10")

# Add content based recommender system to evaluator

In [10]:
algo_content = ContentBasedRecSysKNN(movie_data=movie_data)
evaluator.add_algorithm(algo_content, "ContentKNN")

# Evaluate the algorithms

With the evaluator initialized and the three algorithms added to it, we now run the evaluation. With do_top_n=False, the output below reports only the accuracy metrics (RMSE and MAE).

In [11]:
evaluator.evaluate(do_top_n=False)

Evaluating  Random ...
Evaluating accuracy...
Analysis complete.
Evaluating  Top10 ...
Evaluating accuracy...
Analysis complete.
Evaluating  ContentKNN ...
Evaluating accuracy...
Analysis complete.


Algorithm  RMSE       MAE       
Random     1.5258     1.2250    
Top10      1.2698     1.0215    
ContentKNN 1.0735     0.8536    

Legend:

RMSE:      Root Mean Squared Error. Lower values mean better accuracy.
MAE:       Mean Absolute Error. Lower values mean better accuracy.
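
As a reminder of what the two metrics measure, the sketch below computes them with NumPy on a handful of made-up true and predicted ratings. The arrays are illustrative only and are unrelated to the results above.

```python
import numpy as np

# Made-up true and predicted ratings for four (user, item) pairs.
true_ratings = np.array([4.0, 3.0, 5.0, 1.0])
predicted_ratings = np.array([3.5, 3.0, 4.0, 2.5])

errors = predicted_ratings - true_ratings

# RMSE squares the errors before averaging, so large errors weigh more.
rmse = float(np.sqrt(np.mean(errors ** 2)))

# MAE averages the absolute errors, so every error weighs the same.
mae = float(np.mean(np.abs(errors)))
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the table above, where each algorithm's RMSE exceeds its MAE.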


# Analysis of results


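The content based recommender system achieves the lowest error on both metrics: RMSE 1.0735 and MAE 0.8536, compared with 1.2698 and 1.0215 for Top10 and 1.5258 and 1.2250 for the random recommender. Since lower values mean better accuracy for both metrics, ContentKNN is the most accurate of the three algorithms, followed by Top10.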