# Evaluating a Recommender System

## Recommender System
A recommender system predicts the ratings and preferences of users for products. Its main purpose is to find relationships between users and products in order to maximise user-product engagement. A common application is suggesting related videos or music, for example to generate a playlist while the user is engaged with a related item. In this project, we use movie data to build a recommender system and evaluate its performance. 

In [1]:
#load libraries and modules
from surprise import Reader, Dataset, SVD, accuracy, KNNBaseline
from surprise.model_selection import train_test_split, KFold, LeaveOneOut
from collections import defaultdict
import pandas as pd
import itertools as it
import csv

## Dataset
In this project we evaluate the recommender system on five metrics: accuracy, hit rate, coverage, diversity and novelty. The dataset is called MovieLens and can be found [here](https://grouplens.org/datasets/movielens/25m/). The MovieLens 25M dataset contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. This report uses only the 'ratings.csv' file, which has four columns: user ID, movie ID, movie rating and timestamp. 

In [2]:
#Dataset path
ratingsPath = '/MyRecSystemProject/Data/ratings.csv'
#Load data
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file(ratingsPath, reader)

## Rating Prediction
In this section, we split the dataset into a train set and a test set, with the test set holding 25% of the ratings. We use the SVD algorithm for the model. The surprise module offers other algorithms as well, but the main focus of this project is understanding different evaluation techniques. 

In [3]:
#Preparing train and test sets
train_data, test_data = train_test_split(data, test_size=.25, random_state=1)

#Building recommendation model
algo = SVD(random_state=10)
algo.fit(train_data)

#Computing recommendations
predictions = algo.test(test_data)

## Evaluating our model

### 1. Accuracy

Two common measures of the accuracy of a model's rating predictions are:

- MAE (Mean Absolute Error) 

$$ \frac{\sum_{i=1}^n |y_i - x_i|}{n}, $$ where $y_i$ is the prediction and $x_i$ is the actual rating.

- RMSE (Root Mean Square Error)

$$ \sqrt{\frac{\sum_{i=1}^n (y_i - x_i)^2}{n}} $$
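To make the two formulas concrete, here is a minimal pure-Python sketch applied to a handful of made-up ratings. The helper names `mean_absolute_error` and `root_mean_square_error` are illustrative only; the project itself relies on `accuracy.mae` and `accuracy.rmse` from surprise.

```python
import math

def mean_absolute_error(predicted, actual):
    """Average of |y_i - x_i| over all prediction/rating pairs."""
    return sum(abs(y - x) for y, x in zip(predicted, actual)) / len(actual)

def root_mean_square_error(predicted, actual):
    """Square root of the average of (y_i - x_i)^2 over all pairs."""
    squared_errors = sum((y - x) ** 2 for y, x in zip(predicted, actual))
    return math.sqrt(squared_errors / len(actual))

# Made-up ratings, for illustration only
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 3.0, 3.0]

mae_example = mean_absolute_error(predicted, actual)
rmse_example = root_mean_square_error(predicted, actual)
print("MAE: ", mae_example)
print("RMSE:", rmse_example)
```

The absolute errors here are 0.5, 0, 1 and 2. RMSE comes out larger than MAE because squaring gives the single error of 2 more weight than the smaller ones.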


In [4]:
#Evaluating accuracy of model
rmse = accuracy.rmse(predictions, verbose=False)
mae = accuracy.mae(predictions, verbose=False)
print("RMSE: ", rmse)
print("MAE: ", mae)

RMSE:  0.9033701087151801
MAE:  0.6977882196132263


### Cross Validation

In [5]:
#Creating n Folds
kf = KFold(n_splits=5)

c=0 
rmse_list=[] #empty list for storing rmse of each fold
mae_list=[] #empty list for storing mae of each fold

for traindata, testdata in kf.split(data):
    c+=1
    print("\nFold:", c)
    algo.fit(traindata)
    predictions = algo.test(testdata)
    rmse = accuracy.rmse(predictions, verbose=False)
    mae = accuracy.mae(predictions, verbose=False)
    rmse_list.append(rmse)
    mae_list.append(mae)
    print("RMSE: ", rmse)
    print("MAE: ", mae)
    
print("\nMean RMSE: ", sum(rmse_list) / len(rmse_list))
print("Mean MAE: ", sum(mae_list) / len(mae_list))


Fold: 1
RMSE:  0.8972015162077746
MAE:  0.6914804528784515

Fold: 2
RMSE:  0.9033188926872354
MAE:  0.6941806379758839

Fold: 3
RMSE:  0.8977920573434492
MAE:  0.6906677490570959

Fold: 4
RMSE:  0.8917878167131393
MAE:  0.6862092651080182

Fold: 5
RMSE:  0.8885238587079475
MAE:  0.6851851525669552

Mean RMSE:  0.895724828331909
Mean MAE:  0.689544651517281


### 2. Hit Rate

The idea is to generate top-N recommendations for every user in the test set. If one of the recommendations in a user's top-N list is something the user actually rated, we count that as a hit. Summing the hits in the top-N recommendations over all users in the test set and dividing by the number of users gives the hit rate.
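The arithmetic can be sketched on toy data as follows. The function name `toy_hit_rate`, the user IDs and the movie IDs are all made up for illustration and are not part of the project code.

```python
def toy_hit_rate(top_n_by_user, held_out_by_user):
    """Fraction of users whose held-out item appears in their top-N list."""
    hits = sum(
        1
        for user, held_out_item in held_out_by_user.items()
        if held_out_item in top_n_by_user.get(user, [])
    )
    return hits / len(held_out_by_user)

# Made-up top-3 lists and one held-out movie per user
toy_top_n = {1: [10, 20, 30], 2: [40, 50, 60], 3: [70, 80, 90], 4: [10, 40, 70]}
toy_held_out = {1: 20, 2: 99, 3: 70, 4: 55}

toy_rate = toy_hit_rate(toy_top_n, toy_held_out)
print("Toy hit rate:", toy_rate)  # users 1 and 3 are hits, so 2 / 4
```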

In [6]:
def GetTopN(predictions, n=10, minimumRating=4.0):
    topN = defaultdict(list)

    for userID, movieID, actualRating, estimatedRating, _ in predictions:
        if (estimatedRating >= minimumRating):
            topN[int(userID)].append((int(movieID), estimatedRating))

    for userID, ratings in topN.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topN[int(userID)] = ratings[:n]

    return topN

To measure this, we use Leave-One-Out Cross-Validation, or LOOCV. We intentionally remove one rated item from each user's training data, train the model on the rest, and generate top-N recommendations for each user. We then test whether the item that was left out appears in the top-N list the recommender creates for that user. In other words, we measure the recommender's ability to place a held-out item in each user's top-N recommendations. 
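The splitting step can be sketched in plain Python as below. This is only an illustration of the idea, not how surprise's `LeaveOneOut` is implemented; the function name `leave_one_out_split` is hypothetical, and keeping single-rating users entirely in the training set is an assumption of this sketch.

```python
import random
from collections import defaultdict

def leave_one_out_split(ratings, seed=1):
    """Hold out one randomly chosen rating per user; keep the rest for training.

    `ratings` is a list of (user, item, rating) tuples. Users with a single
    rating stay entirely in the training set (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train_rows, test_rows = [], []
    for rows in by_user.values():
        if len(rows) < 2:
            train_rows.extend(rows)
            continue
        held_out_index = rng.randrange(len(rows))
        test_rows.append(rows[held_out_index])
        train_rows.extend(rows[:held_out_index] + rows[held_out_index + 1:])
    return train_rows, test_rows

# Made-up (user, movie, rating) tuples
toy_ratings = [
    (1, 10, 4.0), (1, 20, 3.5), (1, 30, 5.0),
    (2, 10, 2.0), (2, 40, 4.5),
    (3, 50, 3.0),
]
toy_train, toy_test = leave_one_out_split(toy_ratings)
print(len(toy_train), "training ratings,", len(toy_test), "held-out ratings")
```

Each user with at least two ratings contributes exactly one rating to the test set, so the hit rate computed afterwards is an average over those users.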

In [7]:
loocv = LeaveOneOut(n_splits=1, random_state=1)

for traindata, testdata in loocv.split(data):
    #Train model without left-out ratings
    algo.fit(traindata)
    #Predicts ratings for left-out ratings only
    leftout_predictions = algo.test(testdata)
    #Build predictions for all ratings not in the training set
    big_test_set = traindata.build_anti_testset()  
    all_predictions = algo.test(big_test_set)

    #Compute top 10 recs for each user
    top_N_predicted = GetTopN(all_predictions, n=10)

In [9]:
print("Top-N movies with ratings for each user....\n ")
top_n_recommendations = pd.DataFrame.from_dict(top_N_predicted, orient='index', 
                           columns=['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X'])
                           
top_n_recommendations.index.name = "User_id"
top_n_recommendations.head()

Top-N movies with ratings for each user....
 


| User_id | I | II | III | IV | V | VI | VII | VIII | IX | X |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | (46578, 4.439161446506118) | (2064, 4.41610463184011) | (969, 4.391065310537569) | (1147, 4.390537149207635) | (926, 4.38618890237767) | (6807, 4.359389472173928) | (1945, 4.348071719852454) | (48516, 4.325914625773898) | (1228, 4.306747177463255) | (111, 4.302574693285729) |
| 3 | (3462, 4.306963213695675) | (905, 4.2795135127296255) | (969, 4.273121970391215) | (69844, 4.25212438541356) | (3683, 4.246527927502107) | (46578, 4.228819648953076) | (86882, 4.194861845324719) | (926, 4.191485407784973) | (50, 4.188700290029376) | (3435, 4.165398859879112) |
| 4 | (318, 5) | (2318, 5) | (2762, 5) | (1035, 5) | (1193, 5) | (1221, 5) | (1247, 5) | (111, 5) | (1250, 5) | (745, 5) |
| 5 | (1217, 4.748149172147203) | (969, 4.707653668367146) | (7502, 4.674733786162311) | (922, 4.659033619452552) | (908, 4.6447331321575795) | (6016, 4.640703426996425) | (905, 4.6334480044019335) | (318, 4.6331733296216235) | (904, 4.624272736284148) | (3035, 4.5814425881237) |
| 6 | (318, 4.318653035294874) | (858, 4.131556078432512) | (898, 4.119206097367941) | (1172, 4.07033775580625) | (1207, 4.0539398583844495) | (2318, 4.031520010772444) | (1228, 4.026020833632793) | (3462, 4.006775577984113) | (926, 4.000805310451805) | |


In [10]:
top_n_recommendations.shape

(639, 10)

In [11]:
def HitRate(topNPredicted, leftOutPredictions):
    hits = 0
    total = 0

    #For each left-out rating
    for leftOut in leftOutPredictions:
        userID = leftOut[0]
        leftOutMovieID = leftOut[1]
        #Is it in the predicted top 10 for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutMovieID) == int(movieID)):
                hit = True
                break
        if (hit) :
            hits += 1

        total += 1

    #Compute overall hit rate
    return hits/total

#### Cumulative Hit Rate 
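The cumulative hit rate discards hits that fall below a rating threshold, so that we are not credited for recommending items the user would not actually enjoy. In the function below, the threshold (`ratingCutoff`) is applied to the actual rating of the left-out movie: left-out ratings below the cutoff are excluded from both the hits and the total.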

In [12]:
def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
    hits = 0
    total = 0

    #For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        #Only look at ability to recommend things the users actually liked...
        if (actualRating >= ratingCutoff):
            #Is it in the predicted top 10 for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if (int(leftOutMovieID) == movieID):
                    hit = True
                    break
            if (hit) :
                hits += 1

            total += 1

    #Compute cumulative hit rate
    return hits/total

#### Rating Hit Rate
This is the breakdown of the hit rate by rating value. In the function below, hits are grouped by the actual rating the user gave the left-out movie, which gives an idea of how the hits are distributed between movies the users rated low and movies they rated high.

In [13]:
def RatingHitRate(topNPredicted, leftOutPredictions):
    hits = defaultdict(float)
    total = defaultdict(float)

    #For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        #Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutMovieID) == movieID):
                hit = True
                break
        if (hit) :
            hits[actualRating] += 1

        total[actualRating] += 1

    #Print the hit rate for each rating value
    for rating in sorted(hits.keys()):
        print (rating, hits[rating] / total[rating])

#### Average Reciprocal Hit Rank
This metric accounts for where in the top-N list the hits appear: we get more credit for recommending a left-out movie in the top slot than in the bottom slot. This is an important, user-focused metric, since users tend to focus on the beginning of lists. The only difference from the hit rate is that instead of adding up the number of hits, we add up the reciprocal rank of each hit, so a hit at rank 1 counts as 1, a hit at rank 2 as 1/2, a hit at rank 3 as 1/3, and so on.
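A small worked example of the reciprocal-rank arithmetic, using made-up top-3 lists. The name `toy_arhr` and all IDs are hypothetical and serve only to illustrate the calculation.

```python
def toy_arhr(top_n_by_user, held_out_by_user):
    """Mean of 1/rank of the held-out item, counting 0 when it is not in the list."""
    total = 0.0
    for user, held_out_item in held_out_by_user.items():
        ranked_items = top_n_by_user.get(user, [])
        if held_out_item in ranked_items:
            rank = ranked_items.index(held_out_item) + 1  # ranks start at 1
            total += 1.0 / rank
    return total / len(held_out_by_user)

# User 1: hit at rank 1, user 2: hit at rank 3, user 3: no hit
arhr_top_n = {1: [10, 20, 30], 2: [40, 50, 60], 3: [70, 80, 90]}
arhr_held_out = {1: 10, 2: 60, 3: 99}

toy_score = toy_arhr(arhr_top_n, arhr_held_out)
print("Toy ARHR:", toy_score)  # (1 + 1/3 + 0) / 3
```

The plain hit rate for the same data would be 2/3; the lower reciprocal-rank score reflects that one of the two hits sits at the bottom of its list.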

In [14]:
def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
    summation = 0
    total = 0
    #For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        #Is it in the predicted top N for this user?
        hitRank = 0
        rank = 0
        for movieID, predictedRating in topNPredicted[int(userID)]:
            rank = rank + 1
            if (int(leftOutMovieID) == movieID):
                hitRank = rank
                break
        if (hitRank > 0) :
            summation += 1.0 / hitRank

        total += 1

    return summation / total

In [15]:
#Check how often we recommended a movie the user actually rated
print("\nHit Rate: ", HitRate(top_N_predicted, leftout_predictions))


Hit Rate:  0.029806259314456036


In [16]:
#Break down hit rate by rating value
print("\nrHR (Hit Rate by Rating value): ")
RatingHitRate(top_N_predicted, leftout_predictions)


rHR (Hit Rate by Rating value): 
3.5 0.017241379310344827
4.0 0.0425531914893617
4.5 0.020833333333333332
5.0 0.06802721088435375


In [17]:
#See how often we recommended a movie the user actually liked
print("\ncHR (Cumulative Hit Rate, rating >= 4): ", CumulativeHitRate(top_N_predicted, leftout_predictions, 4.0))


cHR (Cumulative Hit Rate, rating >= 4):  0.04960835509138381


In [18]:
#Compute ARHR
print("\nARHR (Average Reciprocal Hit Rank): ", AverageReciprocalHitRank(top_N_predicted, leftout_predictions))


ARHR (Average Reciprocal Hit Rank):  0.0111560570576964


In [19]:
#Computing item similarities 
full_train_data = data.build_full_trainset()
sim_algo = KNNBaseline(sim_options={'name': 'pearson_baseline', 'user_based': False})
sim_algo.fit(full_train_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x13b910940>

In [20]:
#Computing complete recommendations, no hold outs
algo.fit(full_train_data)
big_test_data = full_train_data.build_anti_testset()
predictions = algo.test(big_test_data)
top_N_predicted = GetTopN(predictions, n=10)

In [22]:
print("Top-N movies with ratings for each user....\n ")
top_n_recommendations = pd.DataFrame.from_dict(top_N_predicted, orient='index', 
                           columns=['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X'])
                           
top_n_recommendations.index.name = "User_id"
top_n_recommendations.shape

Top-N movies with ratings for each user....
 


(641, 10)

### 3. Coverage

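Coverage is the percentage of possible recommendations the system is able to provide. The function below measures user coverage: the fraction of users who have at least one top-N recommendation with a predicted rating at or above a given threshold.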
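As a toy illustration of one way to measure this per user, the sketch below counts the users who receive at least one recommendation predicted at or above a threshold. The function name `toy_user_coverage` and the data are made up for illustration.

```python
def toy_user_coverage(top_n_by_user, num_users, rating_threshold):
    """Share of users with at least one recommendation at or above the threshold."""
    covered = sum(
        1
        for recs in top_n_by_user.values()
        if any(predicted >= rating_threshold for _, predicted in recs)
    )
    return covered / num_users

# Made-up (movie, predicted rating) lists; a fourth user has no list at all
coverage_top_n = {
    1: [(10, 4.6), (20, 4.1)],
    2: [(30, 3.9), (40, 3.2)],
    3: [(50, 4.0)],
}
toy_coverage = toy_user_coverage(coverage_top_n, num_users=4, rating_threshold=4.0)
print("Toy user coverage:", toy_coverage)  # users 1 and 3 qualify, so 2 / 4
```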

In [23]:
def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
    hits = 0
    for userID in topNPredicted.keys():
        hit = False
        for movieID, predictedRating in topNPredicted[userID]:
            if (predictedRating >= ratingThreshold):
                hit = True
                break
        if (hit):
            hits += 1

    return hits / numUsers

In [24]:
#Print user coverage with a minimum predicted rating of 4.0:
print("\nUser coverage: ", UserCoverage(top_N_predicted, full_train_data.n_users, ratingThreshold=4.0))


User coverage:  0.9552906110283159


### 4. Diversity
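Diversity is a measure of how broad a variety of items the recommender system is putting in front of users. It is computed here as $1 - S$, where $S$ is the average similarity between every pair of movies in each user's top-N list.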
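One common way to quantify this, sketched here on toy data, is one minus the average similarity over all pairs of items within each user's list. The similarity values, the `frozenset` lookup and the name `toy_diversity` are assumptions made for this example only.

```python
from itertools import combinations

def toy_diversity(top_n_by_user, similarity):
    """1 minus the average similarity over all item pairs within each user's list."""
    total, pair_count = 0.0, 0
    for items in top_n_by_user.values():
        for item_a, item_b in combinations(items, 2):
            total += similarity[frozenset((item_a, item_b))]
            pair_count += 1
    return 1 - total / pair_count

# Made-up symmetric similarities between movie pairs
toy_similarity = {
    frozenset((10, 20)): 0.9,
    frozenset((10, 30)): 0.1,
    frozenset((20, 30)): 0.2,
    frozenset((40, 50)): 0.4,
}
diversity_top_n = {1: [10, 20, 30], 2: [40, 50]}

toy_div = toy_diversity(diversity_top_n, toy_similarity)
print("Toy diversity:", toy_div)  # 1 - (0.9 + 0.1 + 0.2 + 0.4) / 4
```

A very high diversity score is not automatically good: it can also mean the recommendations are close to random.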

In [27]:
def Diversity(topNPredicted, simsAlgo):
    n = 0
    total = 0
    simsMatrix = simsAlgo.compute_similarities()
    for userID in topNPredicted.keys():
        
        pairs = it.combinations(topNPredicted[userID], 2)
        for pair in pairs:
            movie1 = pair[0][0]
            movie2 = pair[1][0]
            
            innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
            innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
            
            similarity = simsMatrix[innerID1][innerID2]
            
            total += similarity
            n += 1

    S = total / n
    return (1-S)

In [28]:
#Measure diversity of recommendations:
print("\nDiversity: ", Diversity(top_N_predicted, sim_algo))

Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

Diversity:  0.9665208258150911


### 5. Novelty
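Novelty is a measure of how popular the movies are that are being recommended. It is computed here as the average popularity rank of the recommended movies, where rank 1 is the most-rated movie, so a higher value means the recommendations are more obscure.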
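A toy sketch of the idea: rank movies by how many ratings they received, then average the ranks of everything that gets recommended. The names `popularity_ranks` and `toy_novelty` and the data are hypothetical.

```python
from collections import Counter

def popularity_ranks(rated_item_ids):
    """Rank items by how often they were rated; rank 1 is the most-rated item."""
    counts = Counter(rated_item_ids)
    return {item: rank for rank, (item, _) in enumerate(counts.most_common(), start=1)}

def toy_novelty(top_n_by_user, ranks):
    """Average popularity rank over every recommended item."""
    all_ranks = [ranks[item] for items in top_n_by_user.values() for item in items]
    return sum(all_ranks) / len(all_ranks)

# Movie 10 was rated three times, movie 20 twice, movie 30 once
toy_rated_ids = [10, 10, 10, 20, 20, 30]
toy_ranks = popularity_ranks(toy_rated_ids)

novelty_top_n = {1: [10, 30], 2: [20, 30]}
toy_nov = toy_novelty(novelty_top_n, toy_ranks)
print("Toy novelty:", toy_nov)  # (1 + 3 + 2 + 3) / 4
```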

In [29]:
def getPopularityRanks():
    ratings = defaultdict(int)
    rankings = defaultdict(int)
    with open(ratingsPath, newline='') as csvfile:
        ratingReader = csv.reader(csvfile)
        next(ratingReader)
        for row in ratingReader:
            movieID = int(row[1])
            ratings[movieID] += 1
    rank = 1
    for movieID, ratingCount in sorted(ratings.items(), key=lambda x: x[1], reverse=True):
        rankings[movieID] = rank
        rank += 1
    return rankings

In [30]:
def Novelty(topNPredicted, rankings):
    n = 0
    total = 0
    for userID in topNPredicted.keys():
        for rating in topNPredicted[userID]:
            movieID = rating[0]
            rank = rankings[movieID]
            total += rank
            n += 1
    return total / n  

In [31]:
#Computing movie popularity ranks so we can measure novelty
rankings = getPopularityRanks()
#Measure novelty (average popularity rank of recommendations):
print("\nNovelty (average popularity rank): ", Novelty(top_N_predicted, rankings))


Novelty (average popularity rank):  491.5767777960256
