# Recommendation Engines - MovieLens Data

## Tuesday June 20 2017

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set. Detailed descriptions of the data file can be found at the end of this file.

### Tasks

1. Load the data into the recommendation format
2. Build and assess model accuracy
3. Make individual recommendations
4. Try multiple models and compare accuracy
5. Consider how a company could use this

In [1]:
# Install Surprise - a useful library for recommendation engines
!pip install scikit-surprise



In [2]:
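# Load Surprise
# Note: evaluate() and print_perf() belong to the older Surprise API used here;
# later releases replace them with surprise.model_selection.cross_validate.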
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
from surprise import Reader

In [3]:
# 1. Load the data into the recommendation format

# As we're loading a custom dataset, we need to define a reader. In the
# movielens dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path='../../data/u.data', reader=reader)
data.split(n_folds=5)
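`data.split(n_folds=5)` partitions the ratings into five folds so that every rating is held out for testing exactly once. The sketch below illustrates that idea in plain Python. It is not Surprise's implementation, and the helper name `k_fold_indices` and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_ratings, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_ratings))
    # Shuffle so folds do not follow the order of the ratings file.
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th shuffled index
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx


# Hypothetical tiny dataset of 10 ratings.
folds = list(k_fold_indices(n_ratings=10, n_folds=5))
for train_idx, test_idx in folds:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Each model is then trained on the training part of a fold and scored on the held-out part, which is why the output below reports one RMSE and MAE per fold.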

In [4]:
# 2. Build and assess model accuracy

# We'll use the famous SVD algorithm.
algo = SVD()


# Evaluate the performance of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])  # root mean squared error, mean absolute error

print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9377
MAE:  0.7414
------------
Fold 2
RMSE: 0.9395
MAE:  0.7431
------------
Fold 3
RMSE: 0.9287
MAE:  0.7312
------------
Fold 4
RMSE: 0.9301
MAE:  0.7330
------------
Fold 5
RMSE: 0.9397
MAE:  0.7382
------------
------------
Mean RMSE: 0.9352
Mean MAE : 0.7374
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9377  0.9395  0.9287  0.9301  0.9397  0.9352  
MAE     0.7414  0.7431  0.7312  0.7330  0.7382  0.7374  
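To make the two measures concrete, here is a minimal sketch of how RMSE and MAE are computed from true and predicted ratings. The function names and the four example ratings are hypothetical and are not taken from `u.data`.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))


def mae(true_ratings, predicted_ratings):
    """Mean absolute error: the average size of a miss, in rating points."""
    absolute = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute) / len(absolute)


# Hypothetical ratings, not taken from u.data.
true_ratings = [4, 3, 5, 1]
predicted_ratings = [3.0, 4.0, 5.0, 3.0]

rmse_value = rmse(true_ratings, predicted_ratings)
mae_value = mae(true_ratings, predicted_ratings)
print(round(rmse_value, 4), mae_value)
```

Read this way, the mean MAE of about 0.74 reported above means the SVD predictions miss the true rating by roughly three quarters of a star on average.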


In [5]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# Get a prediction for a specific user and item.
# r_ui is the true rating; it is passed only so it appears next to the estimate.
pred = algo.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.26   {'was_impossible': False}
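The `est` value is the model's predicted rating. As described in the Surprise documentation, the SVD estimate is the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors, and predictions are clipped to the rating scale. The sketch below reproduces only that formula. The function name `svd_estimate` and all parameter values are hypothetical; a trained model learns them from the ratings.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(1, 5)):
    """Biased matrix-factorisation estimate, clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    return min(high, max(low, raw))


# All values below are made up for illustration.
est = svd_estimate(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=0.4,
    user_factors=[0.5, -0.25],
    item_factors=[0.5, 0.25],
)
print(round(est, 2))
```

To recommend films to one user, a company would compute this estimate for every film the user has not rated and show the highest-scoring ones.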


In [12]:
# 4. Try multiple models and compare accuracy

# Try at least 3 of the models mentioned below:

# random_pred.NormalPredictor      Predicts a random rating based on the distribution of the training set, which is assumed to be normal.
# baseline_only.BaselineOnly       Predicts the baseline estimate for a given user and item.
# knns.KNNBasic                    A basic collaborative filtering algorithm.
# knns.KNNWithMeans                A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
# knns.KNNBaseline                 A basic collaborative filtering algorithm taking into account a baseline rating.
# matrix_factorization.SVD         The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
# matrix_factorization.SVDpp       The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
# matrix_factorization.NMF         A collaborative filtering algorithm based on Non-negative Matrix Factorization.
# slope_one.SlopeOne               A simple yet accurate collaborative filtering algorithm.
# co_clustering.CoClustering       A collaborative filtering algorithm based on co-clustering.

# Lower RMSE and MAE are better.

# Here's how to run Non-negative Matrix Factorization
from surprise import NMF

# Now we will try Non-negative Matrix Factorization (a form of collaborative filtering)
algo.NMF = NMF()

# Evaluate the performance of our algorithm on the dataset.
perf.NMF = evaluate(algo.NMF, data, measures=['RMSE', 'MAE'])

print_perf(perf.NMF)

Evaluating RMSE, MAE of algorithm NMF.

------------
Fold 1
RMSE: 0.9665
MAE:  0.7623
------------
Fold 2
RMSE: 0.9711
MAE:  0.7642
------------
Fold 3
RMSE: 0.9608
MAE:  0.7544
------------
Fold 4
RMSE: 0.9582
MAE:  0.7523
------------
Fold 5
RMSE: 0.9649
MAE:  0.7562
------------
------------
Mean RMSE: 0.9643
Mean MAE : 0.7579
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9665  0.9711  0.9608  0.9582  0.9649  0.9643  
MAE     0.7623  0.7642  0.7544  0.7523  0.7562  0.7579  


In [6]:
# Run the NormalPredictor: an algorithm predicting a random rating based on the
# distribution of the training set, which is assumed to be normal.
from surprise import NormalPredictor

# Now we will try the NormalPredictor
algo.NormalPredictor = NormalPredictor()

# Evaluate the performance of our algorithm on the dataset.
perf.NormalPredictor = evaluate(algo.NormalPredictor, data, measures=['RMSE', 'MAE'])

print_perf(perf.NormalPredictor)


Evaluating RMSE, MAE of algorithm NormalPredictor.

------------
Fold 1
RMSE: 1.5260
MAE:  1.2250
------------
Fold 2
RMSE: 1.5227
MAE:  1.2202
------------
Fold 3
RMSE: 1.5256
MAE:  1.2214
------------
Fold 4
RMSE: 1.5203
MAE:  1.2206
------------
Fold 5
RMSE: 1.5344
MAE:  1.2304
------------
------------
Mean RMSE: 1.5258
Mean MAE : 1.2235
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    1.5260  1.5227  1.5256  1.5203  1.5344  1.5258  
MAE     1.2250  1.2202  1.2214  1.2206  1.2304  1.2235  
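The NormalPredictor ignores who the user is and which film is being rated, which is why it scores so much worse than the other models. A minimal sketch of the idea follows, assuming the mean and standard deviation are estimated from the training ratings and the draw is clipped to the 1-5 scale. The function name `normal_predictor` and the sample ratings are hypothetical.

```python
import random
import statistics


def normal_predictor(train_ratings, rating_scale=(1, 5), rng=random):
    """Draw a rating from a normal fitted to the training ratings, then clip."""
    mu = statistics.mean(train_ratings)
    sigma = statistics.pstdev(train_ratings)  # population standard deviation
    low, high = rating_scale
    return min(high, max(low, rng.gauss(mu, sigma)))


# Hypothetical training ratings, not taken from u.data.
train_ratings = [5, 3, 4, 4, 2, 5, 1, 3]
rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [normal_predictor(train_ratings, rng=rng) for _ in range(5)]
print([round(s, 2) for s in samples])
```

Its scores are best read as a floor: any model that uses user or item information should beat them.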


In [13]:
# Baseline model - algorithm predicting the baseline estimate for a given user and item.
from surprise import BaselineOnly

algo.Baseline = BaselineOnly()

perf.BaselineOnly = evaluate(algo.Baseline, data, measures=['RMSE', 'MAE'])

# Re-run the NormalPredictor; its output follows the BaselineOnly folds below.
perf.NormalPredictor = evaluate(algo.NormalPredictor, data, measures=['RMSE', 'MAE'])

print_perf(perf.BaselineOnly)


  

Evaluating RMSE, MAE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.9463
MAE:  0.7516
------------
Fold 2
Estimating biases using als...
RMSE: 0.9476
MAE:  0.7523
------------
Fold 3
Estimating biases using als...
RMSE: 0.9403
MAE:  0.7452
------------
Fold 4
Estimating biases using als...
RMSE: 0.9409
MAE:  0.7442
------------
Fold 5
Estimating biases using als...
RMSE: 0.9467
MAE:  0.7495
------------
------------
Mean RMSE: 0.9444
Mean MAE : 0.7485
------------
------------
Evaluating RMSE, MAE of algorithm NormalPredictor.

------------
Fold 1
RMSE: 1.5267
MAE:  1.2273
------------
Fold 2
RMSE: 1.5239
MAE:  1.2232
------------
Fold 3
RMSE: 1.5137
MAE:  1.2142
------------
Fold 4
RMSE: 1.5199
MAE:  1.2210
------------
Fold 5
RMSE: 1.5151
MAE:  1.2141
------------
------------
Mean RMSE: 1.5199
Mean MAE : 1.2200
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9463  0.9476  0.9403  0.9409  0.9467  0.9444  
MAE     0.7516  0.7523  0.7452  0.7442  0.7495  0.7485  
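The "Estimating biases using als..." lines refer to alternating least squares. The baseline estimate for a user and item is the global mean plus a user bias plus an item bias, and the two sets of biases are re-estimated in turn. The sketch below is a simplified illustration of that alternating scheme, not Surprise's code. The function names, the regularisation values and the toy ratings are assumptions.

```python
from collections import defaultdict


def fit_baseline(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Alternately re-estimate item and user biases, shrinking them towards zero."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Item step: regularised average of what the user biases leave unexplained.
        sums, counts = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            sums[i] += r - mu - b_u[u]
            counts[i] += 1
        for i in sums:
            b_i[i] = sums[i] / (reg_i + counts[i])
        # User step: the same, holding the item biases fixed.
        sums, counts = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            sums[u] += r - mu - b_i[i]
            counts[u] += 1
        for u in sums:
            b_u[u] = sums[u] / (reg_u + counts[u])
    return mu, dict(b_u), dict(b_i)


def baseline_estimate(mu, b_u, b_i, user, item):
    """Unknown users or items fall back to a bias of zero."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Hypothetical (user, item, rating) triples, not taken from u.data.
ratings = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
           ("u2", "m2", 2), ("u3", "m1", 5)]
mu, b_u, b_i = fit_baseline(ratings)
print(round(baseline_estimate(mu, b_u, b_i, "u1", "m1"), 3))
```

Because the estimate uses no interaction between a particular user and a particular film, it is a natural reference point for the factorisation models.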

In [32]:
#matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

from surprise import SVDpp

algo3 = SVDpp()

perf.algo3 = evaluate(algo3, data, measures=['RMSE', 'MAE'])

print_perf(perf.algo3)


Evaluating RMSE, MAE of algorithm SVDpp.

------------
Fold 1
RMSE: 0.9201
MAE:  0.7221
------------
Fold 2
RMSE: 0.9210
MAE:  0.7232
------------
Fold 3


KeyboardInterrupt: 

In [35]:
#slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.

from surprise import SlopeOne

algo.Slope_one = SlopeOne()

perf.Slope_one = evaluate(algo.Slope_one, data, measures=['RMSE', 'MAE'])

print_perf(perf.Slope_one)

Evaluating RMSE, MAE of algorithm SlopeOne.

------------
Fold 1
RMSE: 0.9423
MAE:  0.7430
------------
Fold 2
RMSE: 0.9506
MAE:  0.7490
------------
Fold 3
RMSE: 0.9437
MAE:  0.7413
------------
Fold 4
RMSE: 0.9421
MAE:  0.7393
------------
Fold 5
RMSE: 0.9474
MAE:  0.7419
------------
------------
Mean RMSE: 0.9452
Mean MAE : 0.7429
------------
------------
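To finish task 4, the mean scores printed by the runs above can be collected and ranked. The snippet below copies those means by hand. SVD++ is left out because its run was interrupted, and the NormalPredictor figures are from its first run. The dictionary layout is just one convenient choice.

```python
# Mean 5-fold scores copied from the outputs above.
mean_scores = {
    "SVD": {"RMSE": 0.9352, "MAE": 0.7374},
    "NMF": {"RMSE": 0.9643, "MAE": 0.7579},
    "NormalPredictor": {"RMSE": 1.5258, "MAE": 1.2235},  # first run
    "BaselineOnly": {"RMSE": 0.9444, "MAE": 0.7485},
    "SlopeOne": {"RMSE": 0.9452, "MAE": 0.7429},
}

# Lower is better, so sort ascending by mean RMSE.
ranked = sorted(mean_scores.items(), key=lambda kv: kv[1]["RMSE"])
for name, scores in ranked:
    print(f"{name:<16} RMSE {scores['RMSE']:.4f}   MAE {scores['MAE']:.4f}")

best_model = ranked[0][0]
```

Of the completed runs, SVD has the lowest mean RMSE and MAE. BaselineOnly and SlopeOne are close behind, and the random NormalPredictor is far worse.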

##### 5. Consider how a company could use this

How might a company use a recommendation engine like this in practice? Write a few paragraphs on how they could apply the models above, covering:
- How does the algorithm work?

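 Each algorithm arrives at a predicted rating in a different way. For example, the NormalPredictor assigns a rating at random, drawn from a normal distribution fitted to the existing ratings. The baseline model instead predicts the overall average rating adjusted by a bias for the user and a bias for the film.

 SVD++ is similar to SVD but also takes implicit ratings into account. In Surprise the implicit signal is the fact that a user rated a film at all, regardless of the score. A streaming company could use richer signals, e.g. whether the customer watched a show to the end or switched off after a short time. It is computationally expensive and takes a long time to run (31 min).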
 
- What data would be used?
- How would we know if it's working?
- What is the benefit of using an algorithm like this over just recommending the most popular films overall?