# Week 1 - Evaluating Recommender Systems 

## Surprise 

[Surprise](http://surpriselib.com/) is a Python package that is designed to aid the testing of Recommender Systems. We will use this, along with some extra packages developed from the Frank Kane book, to explore some popular datasets and learn about the different evaluation metrics we saw in the lecture.



In [None]:
#Install 
!pip install scikit-surprise

In [None]:
#Code examples are developed from materials for 
#Building Recommender Systems Book by Frank Kane https://sundog-education.com/recsys/
from MovieLens import MovieLens
from surprise import SVD
from surprise import KNNBaseline
from surprise.model_selection import train_test_split
from surprise.model_selection import LeaveOneOut
from RecommenderMetrics import RecommenderMetrics

## MovieLens

The [MovieLens dataset](https://grouplens.org/datasets/movielens/) is a collection of film ratings from the [MovieLens](https://movielens.org/) site. 

It contains information on ratings, genres and tags and comes in various sizes, up to __25 million__ ratings. 



### Explore the .csv files

Use __Pandas__ to investigate the 4 datasets in the folder ``ml-latest-small`` and preview the data in each __.csv__ file. 

Think about:


1. What data is contained?

2. How much variation is there? 

3. How would this data be useful for recommending new films to users?
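
One way to approach the first two questions is sketched below. To keep it runnable on its own, it reads a tiny made-up table from a string rather than from disk; the rows are invented for illustration, and only the column names are meant to match ``ratings.csv``. In the notebook you would pass the path to the real file instead (the exact path depends on where your copy of ``ml-latest-small`` lives).

```python
import io
import pandas as pd

# Toy stand-in for ratings.csv: invented rows, illustrative only.
toy_csv = """userId,movieId,rating,timestamp
1,10,4.0,964982703
1,20,5.0,964981247
2,10,3.0,1445714835
2,30,0.5,1445714885
3,20,4.5,1305696483
"""

# In the notebook, replace the StringIO object with a file path.
ratings = pd.read_csv(io.StringIO(toy_csv))

# 1. What data is contained? Preview rows, column types and table size.
print(ratings.head())
print(ratings.dtypes)
print(ratings.shape)

# 2. How much variation is there? Count distinct users and films,
#    summarise the rating values, and count ratings per user.
n_users = ratings["userId"].nunique()
n_movies = ratings["movieId"].nunique()
rating_summary = ratings["rating"].describe()
ratings_per_user = ratings.groupby("userId").size()

print("Users:", n_users, "Films:", n_movies)
print(rating_summary)
print(ratings_per_user)
```

The same calls can be repeated for each of the other files to compare what they contain.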


In [None]:
import pandas as pd

In [None]:
#Explore the files 
...

### Wrapper 

We have a smaller version that's easier and quicker to test with. We'll use this ``MovieLens`` wrapper class to load the data and pass it to the Recommenders we build, and then for use with the evaluation metrics we get from ``Surprise``.

In [None]:
ml = MovieLens()
data = ml.loadMovieLensLatestSmall()

### Running some real tests!

For the questions below, run the code and then see if you can understand what is being calculated and what the results mean in terms of whether the Recommender System is working well. 

If you want to look at the code for the __Recommender Metrics__, you can look in the ``RecommenderMetrics.py`` file. You might not understand it all, but it's good practice to start exploring other people's libraries!


### Accuracy 

The code below uses an ``SVD`` model (more on that later) to build a Recommender system and uses ``Surprise`` to evaluate the accuracy of the trained model. 

Run the code and see if you can answer these questions:


1. What does the RMSE score mean?

2. What does the MAE score mean?

3. Why do you think they are different? What does this tell us?
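
To make the two scores concrete, here is a minimal sketch of how RMSE and MAE are usually defined, written in plain Python. The function names and the four example ratings are made up for illustration; this is not the code inside ``RecommenderMetrics.py`` or ``Surprise``.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square each error, average, then take the root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: average of the absolute size of each error."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Invented example: true ratings vs. a model's predicted ratings.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 4.0]

rmse_score = rmse(actual, predicted)
mae_score = mae(actual, predicted)

print("RMSE:", rmse_score)
print("MAE: ", mae_score)
```

Because the errors are squared before averaging, one large miss (here the prediction of 4.0 for a rating of 2.0) pulls RMSE up more than it pulls MAE up.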



In [None]:
print("\nComputing movie popularity ranks so we can measure novelty later...")
rankings = ml.getPopularityRanks()

print("\nComputing item similarities so we can measure diversity later...")
fullTrainSet = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
simsAlgo = KNNBaseline(sim_options=sim_options)
simsAlgo.fit(fullTrainSet)

print("\nBuilding recommendation model...")
trainSet, testSet = train_test_split(data, test_size=.25, random_state=1)

algo = SVD(random_state=10)
algo.fit(trainSet)

print("\nComputing recommendations...")
predictions = algo.test(testSet)

print("\nEvaluating accuracy of model...")
print("RMSE: ", RecommenderMetrics.RMSE(predictions))
print("MAE: ", RecommenderMetrics.MAE(predictions))

### Hit Rate

Below, we use the ``LeaveOneOut`` splitter from ``Surprise`` together with ``RecommenderMetrics`` to calculate the __Hit Rate__ for our model. 

Can you describe what each step does and how to interpret the results at the end? You might want to consider:

1. What does the hit rate evaluate?

2. What is in the trainSet and testSet?
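
The general idea behind a leave-one-out hit rate can be sketched in a few lines of plain Python. This is an illustrative version under simple assumptions (one held-out item per user, recommendations stored in dictionaries); the function name and data are hypothetical, and it is not the implementation in ``RecommenderMetrics.py``.

```python
def hit_rate(top_n, left_out):
    """Fraction of users whose held-out item appears in their top-N list.

    top_n:    dict mapping user -> list of recommended item ids
    left_out: dict mapping user -> the single item id held out for testing
    """
    hits = 0
    for user, held_out_item in left_out.items():
        # A "hit" means we recommended the item we hid from the model.
        if held_out_item in top_n.get(user, []):
            hits += 1
    return hits / len(left_out)


# Invented example: three users, top-3 recommendations each.
top_n = {"alice": [1, 2, 3], "bob": [4, 5, 6], "carol": [7, 8, 9]}
left_out = {"alice": 2, "bob": 9, "carol": 7}

rate = hit_rate(top_n, left_out)
print("Hit Rate:", rate)
```

In this toy case two of the three held-out items were recommended, so the hit rate is 2/3. On a real dataset with thousands of films and only ten recommendations per user, much smaller values are normal.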


In [None]:

# Set aside one rating per user for testing
LOOCV = LeaveOneOut(n_splits=1, random_state=1)

for trainSet, testSet in LOOCV.split(data):
    print("Computing recommendations with leave-one-out...")
    
    # Train model without left-out ratings
    algo.fit(trainSet)

    # Predict ratings for the left-out ratings only
    print("Predict ratings for left-out set...")
    leftOutPredictions = algo.test(testSet)

    # Build predictions for all ratings not in the training set
    print("Predict all missing ratings...")
    bigTestSet = trainSet.build_anti_testset()
    allPredictions = algo.test(bigTestSet)

    # Compute top 10 recs for each user
    print("Compute top 10 recs per user...")
    topNPredicted = RecommenderMetrics.GetTopN(allPredictions, n=10)

    # See how often we recommended a movie the user actually rated
    print("\nHit Rate: ", RecommenderMetrics.HitRate(topNPredicted, leftOutPredictions))


### Diversity and Novelty

Finally, let's look at some metrics beyond performance. Can you answer these questions? 

1. What does the diversity score tell us? Is this particular value a good or a bad thing?

2. What does the novelty score tell us? Again, is this a good or a bad thing for our recommender?
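
As a rough guide to what these two numbers measure, the sketch below assumes diversity is defined as one minus the average pairwise similarity between the items recommended to each user, and novelty as the average popularity rank of the recommended items (as the label in the code cell below suggests). The function names, similarity scores and ranks are invented for illustration, and this is not the code in ``RecommenderMetrics.py``.

```python
from itertools import combinations


def diversity(top_n, similarity):
    """1 - average similarity over every pair of items in each user's list."""
    total, pairs = 0.0, 0
    for items in top_n.values():
        for a, b in combinations(items, 2):
            total += similarity[frozenset((a, b))]
            pairs += 1
    return 1 - total / pairs


def novelty(top_n, popularity_rank):
    """Average popularity rank of recommended items (rank 1 = most popular)."""
    ranks = [popularity_rank[item] for items in top_n.values() for item in items]
    return sum(ranks) / len(ranks)


# Invented example: two users with two recommendations each.
top_n = {"alice": ["A", "B"], "bob": ["B", "C"]}
similarity = {frozenset(("A", "B")): 0.9, frozenset(("B", "C")): 0.1}
popularity_rank = {"A": 1, "B": 2, "C": 50}

diversity_score = diversity(top_n, similarity)
novelty_score = novelty(top_n, popularity_rank)

print("Diversity:", diversity_score)
print("Novelty:  ", novelty_score)
```

Under these definitions, a high diversity score means the recommended films are unlike each other, and a high novelty score means the recommendations sit far down the popularity list. Neither is good or bad on its own: very high values can also come from recommending obscure or effectively random films.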

In [None]:
print("\nComputing complete recommendations, no hold outs...")
algo.fit(fullTrainSet)
bigTestSet = fullTrainSet.build_anti_testset()
allPredictions = algo.test(bigTestSet)
topNPredicted = RecommenderMetrics.GetTopN(allPredictions, n=10)

# Measure diversity of recommendations:
print("\nDiversity: ", RecommenderMetrics.Diversity(topNPredicted, simsAlgo))

# Measure novelty (average popularity rank of recommendations):
print("\nNovelty (average popularity rank): ", RecommenderMetrics.Novelty(topNPredicted, rankings))
