# Week 8 - Advanced Machine Learning

## Recommendation systems

Recommendation systems attempt to predict a user's preference for an item. The goal is typically to present the user with the items they are most likely to prefer.

### Uses

Recommendation systems have been increasingly used in online retailing in recent years. Perhaps the most well-known example is movie recommendation, due to the [Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize). Films, books, music, food and other purchasable items are all commonly the subject of recommendation systems. The same concepts have also been applied to research articles, collaborators, romantic partners, news items, travel routes, and many other areas.

### Types

There are two main types of recommendation systems:

**Collaborative filtering** utilizes the preferences of many users, basing recommendations on what other users with similar preferences to you have liked previously. Collaborative filtering systems do not need to know anything about the users or items, eliminating the need to develop features that accurately capture differences between users or items.
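As a rough illustration of the idea, the sketch below predicts a rating as a similarity-weighted average of the ratings given by other users. It is a minimal, user-based variant under stated assumptions: the `ratings` dictionary, the user and item names, and the helper functions `cosine_similarity` and `predict_rating` are all hypothetical, and cosine similarity is only one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "C": 1, "D": 4},
    "carol": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        total_weight += sim
    # No neighbour has rated the item: no prediction is possible.
    return weighted_sum / total_weight if total_weight else None


# alice has not rated item D; bob (similar tastes) liked it, carol did not.
prediction = predict_rating("alice", "D", ratings)
print(prediction)
```

Note that no item or user features are used: the prediction for alice leans towards bob's rating only because their past ratings agree.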

**Content-based filtering** utilizes item descriptions and user profiles or past/current preferences to identify similar items to recommend.
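A minimal sketch of content-based filtering follows, assuming each item is described by a small feature vector and the user profile is the average of the vectors of items the user liked. The film names, the three genre features, and the functions `build_profile` and `recommend` are hypothetical choices made for illustration.

```python
from math import sqrt

# Hypothetical item descriptions as genre feature vectors:
# (action, comedy, romance).
items = {
    "Film A": (1.0, 0.0, 0.0),
    "Film B": (0.9, 0.1, 0.0),
    "Film C": (0.0, 1.0, 0.5),
    "Film D": (0.0, 0.2, 1.0),
}


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(liked, items):
    """Average the feature vectors of the items the user liked."""
    vectors = [items[name] for name in liked]
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def recommend(liked, items, top_n=2):
    """Rank unseen items by similarity to the user's profile."""
    profile = build_profile(liked, items)
    candidates = [name for name in items if name not in liked]
    ranked = sorted(candidates,
                    key=lambda name: cosine(profile, items[name]),
                    reverse=True)
    return ranked[:top_n]


recommendations = recommend(["Film A"], items)
print(recommendations)
```

Unlike collaborative filtering, this needs no ratings from other users, but it depends entirely on how well the features describe the items.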

Both approaches have limitations, and an increasingly popular approach is **hybrid recommendation systems**, which combine collaborative and content-based filtering. Content-based filtering can aid collaborative filtering by providing initial recommendations when there is insufficient data on a user's preferences, giving a coarse starting point while the set of labels on user preferences is still sparse.
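One simple way to combine the two approaches is a weighted blend that relies on the content-based score for new users and shifts towards the collaborative score as ratings accumulate. The sketch below is only one possible scheme; the function `hybrid_score`, its `min_ratings` threshold, and the linear weighting are assumptions made for illustration.

```python
def hybrid_score(collab_score, content_score, n_user_ratings, min_ratings=5):
    """Blend two scores, leaning on the content-based score for new users."""
    if collab_score is None:
        # Collaborative filtering could not produce a prediction at all.
        return content_score
    # Weight grows linearly from 0 to 1 as the user accumulates ratings.
    weight = min(n_user_ratings / min_ratings, 1.0)
    return weight * collab_score + (1.0 - weight) * content_score


# A brand-new user: only the content-based score is used.
print(hybrid_score(4.0, 2.0, n_user_ratings=0))
# An established user: only the collaborative score is used.
print(hybrid_score(4.0, 2.0, n_user_ratings=10))
```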

Another way to supplement user preferences is to use several different types of user feedback. In addition to explicit ratings for items, implicit feedback such as viewing duration can be used.

### Evaluation

Performance can be evaluated with metrics we are already familiar with, such as root mean square error (RMSE). However, accuracy is not the only factor in determining user satisfaction or the utility of the system. For example, a recommendation system can return several almost identical items and achieve a very low RMSE, but a user will likely prefer a more diverse set of recommendations.
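For reference, the two accuracy metrics reported later in this section, RMSE and mean absolute error (MAE), can be computed as in the short sketch below. The `actual` and `predicted` ratings are made-up values used only to show the arithmetic.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error: penalises large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical true ratings and predictions for four user-item pairs.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

Both metrics look only at rating accuracy, so neither reflects how diverse a list of recommendations is.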

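The [Surprise](http://surprise.readthedocs.io/en/latest/index.html) scikit package implements several strategies for creating recommendation systems. The cells below use the `evaluate`, `print_perf`, `GridSearch` and `data.split` interfaces from early Surprise releases; these were removed in version 1.1, where `cross_validate` and `GridSearchCV` from `surprise.model_selection` take their place.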

We can install the package by running the following in the terminal:

`pip install scikit-surprise`

In [3]:
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf


# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

In [4]:
# Split the data into 3 folds for cross-validation.
data.split(n_folds=3)

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9447
MAE:  0.7453
------------
Fold 2
RMSE: 0.9458
MAE:  0.7471
------------
Fold 3
RMSE: 0.9469
MAE:  0.7474
------------
------------
Mean RMSE: 0.9458
Mean MAE : 0.7466
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
RMSE    0.9447  0.9458  0.9469  0.9458  
MAE     0.7453  0.7471  0.7474  0.7466  
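The `SVD` algorithm used above is a matrix factorization model trained with stochastic gradient descent, in the style popularized during the Netflix prize, rather than a literal singular value decomposition. The sketch below is a simplified pure-Python illustration of that idea and not Surprise's implementation. The parameter names `n_epochs`, `lr_all` and `reg_all` mirror the ones tuned in the grid search in this section; `train_mf`, `n_factors`, `seed`, the default values and the toy ratings are assumptions.

```python
import random
from math import sqrt


def train_mf(ratings, n_factors=2, n_epochs=200, lr_all=0.01, reg_all=0.02, seed=0):
    """Fit r_ui ~ mean + b_u + b_i + p_u . q_i by stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mean = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mean + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error gradient; reg_all shrinks it.
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr_all * (err * q - reg_all * p)
                qi[i][f] += lr_all * (err * p - reg_all * q)
    return predict


# Hypothetical (user, item, rating) triples.
ratings = [("u1", "A", 5), ("u1", "B", 4), ("u2", "A", 4),
           ("u2", "C", 1), ("u3", "B", 2), ("u3", "C", 1)]
predict = train_mf(ratings)

mean = sum(r for _, _, r in ratings) / len(ratings)
baseline_rmse = sqrt(sum((r - mean) ** 2 for _, _, r in ratings) / len(ratings))
train_rmse = sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))
print(f"baseline RMSE: {baseline_rmse:.3f}, training RMSE: {train_rmse:.3f}")
```

The comparison here is against always predicting the global mean on the training data, which only shows that the model fits; cross-validation, as in the cell above, is what estimates performance on unseen ratings.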


In [5]:
import pandas as pd
from surprise import GridSearch

param_grid = {'n_epochs': [20, 50], 'lr_all': [0.002, 0.005, 0.01],
              'reg_all': [0.01, 0.02, 0.04]}

grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])

grid_search.evaluate(data)

results_df = pd.DataFrame.from_dict(grid_search.cv_results)

[{'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.01}, {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.02}, {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.04}, {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.01}, {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}, {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.04}, {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.01}, {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.02}, {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.04}, {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0.01}, {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0.02}, {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0.04}, {'n_epochs': 50, 'lr_all': 0.005, 'reg_all': 0.01}, {'n_epochs': 50, 'lr_all': 0.005, 'reg_all': 0.02}, {'n_epochs': 50, 'lr_all': 0.005, 'reg_all': 0.04}, {'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.01}, {'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.02}, {'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.04}]
------------
Parameters combination 1 of 18
params:  {'n_epochs': 20,
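The list of 18 dictionaries printed above is the Cartesian product of the values in `param_grid`: 2 values of `n_epochs` times 3 of `lr_all` times 3 of `reg_all`. The sketch below reproduces that enumeration with the standard library; it illustrates what a grid search iterates over and is not taken from Surprise's source. The variable names `names` and `combinations` are assumptions.

```python
from itertools import product

param_grid = {'n_epochs': [20, 50], 'lr_all': [0.002, 0.005, 0.01],
              'reg_all': [0.01, 0.02, 0.04]}

# Build one dict per combination of parameter values.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print(len(combinations))  # 18
print(combinations[0])    # {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.01}
```

Each combination is then trained and scored on every cross-validation fold, so the cost of the search grows multiplicatively with each parameter added to the grid.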

In [6]:
results_df

| | FCP | RMSE | lr_all | n_epochs | params | scores |
|---|---|---|---|---|---|---|
| 0 | 0.689385 | 0.9562 | 0.002 | 20 | {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0... | {'RMSE': 0.956199561864, 'FCP': 0.689385127325} |
| 1 | 0.691522 | 0.955674 | 0.002 | 20 | {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0... | {'RMSE': 0.955674251885, 'FCP': 0.69152234133} |
| 2 | 0.69281 | 0.955013 | 0.002 | 20 | {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0... | {'RMSE': 0.955012512668, 'FCP': 0.69280957293} |
| 3 | 0.692512 | 0.952101 | 0.005 | 20 | {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0... | {'RMSE': 0.952101277563, 'FCP': 0.692512032007} |
| 4 | 0.697247 | 0.946653 | 0.005 | 20 | {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0... | {'RMSE': 0.946652912928, 'FCP': 0.697246582457} |
| 5 | 0.703818 | 0.941046 | 0.005 | 20 | {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0... | {'RMSE': 0.941045595886, 'FCP': 0.703818050753} |
| 6 | 0.675296 | 0.985058 | 0.01 | 20 | {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.01} | {'RMSE': 0.985058268042, 'FCP': 0.675295843597} |
| 7 | 0.6896 | 0.961507 | 0.01 | 20 | {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.02} | {'RMSE': 0.961506702427, 'FCP': 0.689600093792} |
| 8 | 0.705742 | 0.938785 | 0.01 | 20 | {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.04} | {'RMSE': 0.938784610728, 'FCP': 0.705741924275} |
| 9 | 0.692528 | 0.952523 | 0.002 | 50 | {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0... | {'RMSE': 0.952523432599, 'FCP': 0.692527663259} |


In [8]:
results_df[['FCP', 'RMSE', 'n_epochs', 'lr_all', 'params']].sort_values('RMSE')

| | FCP | RMSE | n_epochs | lr_all | params |
|---|---|---|---|---|---|
| 8 | 0.705742 | 0.938785 | 20 | 0.01 | {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.04} |
| 11 | 0.704059 | 0.940875 | 50 | 0.002 | {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0... |
| 5 | 0.703818 | 0.941046 | 20 | 0.005 | {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0... |
| 14 | 0.702433 | 0.943777 | 50 | 0.005 | {'n_epochs': 50, 'lr_all': 0.005, 'reg_all': 0... |
| 10 | 0.697953 | 0.945962 | 50 | 0.002 | {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0... |
| 4 | 0.697247 | 0.946653 | 20 | 0.005 | {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0... |
| 17 | 0.698317 | 0.951989 | 50 | 0.01 | {'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.04} |
| 3 | 0.692512 | 0.952101 | 20 | 0.005 | {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0... |
| 9 | 0.692528 | 0.952523 | 50 | 0.002 | {'n_epochs': 50, 'lr_all': 0.002, 'reg_all': 0... |
| 2 | 0.69281 | 0.955013 | 20 | 0.002 | {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0... |
