# Recommender with GraphLab Create

BitTiger DS501

Sept 2017

## 0. Register, install and launch

* Register an account with [GraphLab](https://turi.com/)
* Follow the instructions in the email you receive to install GraphLab Create
* Launch GraphLab Create

In [None]:
import numpy as np
import graphlab
import pandas as pd
import matplotlib.pyplot as plt

## 1. Load your data into Dato's SFrame type.

In [None]:
df = pd.read_table('data/u.data',
                   names=["user", "movie", "rating", "timestamp"])
sf = graphlab.SFrame(df[['user', 'movie', 'rating']])

## 2. Create a matrix factorization model.

In [None]:
rec = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False)
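
As a rough sketch of what this model learns, consistent with the manual calculation in step 7 (an intercept plus a dot product of factor vectors), the predicted rating of movie $m$ by user $u$ can be written as below. The symbols are notation introduced here for illustration, not GraphLab names, and any per-user or per-item bias terms the library may also store are left out to match step 7.

$$\hat{r}_{um} = \mu + \mathbf{p}_u^{\top}\mathbf{q}_m$$

Here $\mu$ is the intercept, $\mathbf{p}_u$ is the factor vector for user $u$, and $\mathbf{q}_m$ is the factor vector for movie $m$.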

## 3. Call the `predict` method on your input data to get the predicted rating for user 1 of movie 100.

In [None]:
one_datapoint_sf = graphlab.SFrame({'user': [1], 'movie': [100]})

In [None]:
one_datapoint_sf

In [None]:
print "rating:", rec.predict(one_datapoint_sf)[0]

## 4. On the returned model object, call the `list_fields` method to see what kind of data is stored for your model.

In [None]:
rec.list_fields()

## 5. Inspect the output of `get('coefficients')` to see what information your model uses.

In [None]:
rec['coefficients'] 

## 6. There should be a `movie` and a `user` array in the coefficients. What are the dimensions of this data?

In [None]:
movie_sf = rec['coefficients']['movie']
print len(movie_sf)
print len(movie_sf['factors'][0])
user_sf = rec['coefficients']['user']
print len(user_sf)
print len(user_sf['factors'][0])

## 7. Without using the `predict` method, compute the predicted rating of movie 100 for user 1.

In [None]:
movie_array = movie_sf[movie_sf['movie'] == 100]['factors'][0]
user_array = user_sf[user_sf['user'] == 1]['factors'][0]
intercept = rec['coefficients']['intercept']
print "rating:", np.dot(movie_array, user_array) + intercept    # 4.879

## 8. What is the intercept term? Can you reproduce the calculation of this value on your own?

*The intercept term is a global offset (bias) added to every prediction, not a scaling factor. We can reproduce its value by taking the average of all the ratings in the original dataset.*

In [None]:
print "intercept:", intercept
print "average:", np.average(sf['rating'])

## 9. Call the `predict` method on your input data to get the predicted ratings, and verify that the RMSE reported by the model diagnostics is correct.

In [None]:
sf

In [None]:
from sklearn.metrics import mean_squared_error

predictions = rec.predict(sf)
rmse = np.sqrt(mean_squared_error(sf['rating'], predictions))

print "graphlab's reported rmse:", rec['training_rmse']
print "calculated rmse:", rmse  
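
For reference, the RMSE can also be computed without scikit-learn. The following is a minimal sketch in plain Python; the function name `rmse` and the toy ratings are made up for illustration and are not taken from the MovieLens data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Toy ratings, invented for illustration only.
toy_actual = [5.0, 3.0, 4.0, 1.0]
toy_predicted = [4.0, 3.0, 5.0, 3.0]

# Squared errors are 1, 0, 1, 4, so the mean is 1.5.
toy_rmse = rmse(toy_actual, toy_predicted)
```

Passing `list(sf['rating'])` and `list(predictions)` to such a function should, under these assumptions, give the same number as the scikit-learn calculation above.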

## 10. Compare the summary statistics of the original data with your predictions (use `pd.Series(ratings).describe()` to do this).

Does anything stand out about the min/max?

In [None]:
print pd.Series(sf['rating']).describe()
print pd.Series(predictions).describe()   # `predictions` comes from step 9
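
One thing worth checking in the min/max: the original ratings live on a 1 to 5 scale, but a factorization model's raw scores are not constrained to that range, so predictions can fall below 1 or above 5. If that matters for your application, a simple post-processing step is to clip the scores. This is a minimal sketch; `clip_rating` is a hypothetical helper and the scores are invented.

```python
def clip_rating(score, low=1.0, high=5.0):
    """Clamp a raw predicted score to the valid rating range [low, high]."""
    return max(low, min(high, score))


# Toy raw scores, invented for illustration only.
toy_scores = [0.3, 2.7, 5.6]
clipped = [clip_rating(s) for s in toy_scores]
```

Clipping can never increase the squared error against a true rating that lies inside the range, which is why it is a common and cheap adjustment.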

## 11. Regularization - graphlab provides two regularization parameters. 

The parameter `regularization` controls the value of lambda. Using what you know about regularization from linear regression, what effect would you expect this to have on solutions? What would you expect to see in the difference of training RMSE between setting this parameter to 0 or 0.1? Try it.

In [None]:
random_seed = 0
rec2 = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False,
            regularization=0,
            random_seed=random_seed)
print "training rmse with regularization 0:", rec2['training_rmse']   # 0.725

regularization_param = 1e-4
rec3 = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False,
            regularization=regularization_param,
            random_seed=random_seed) 
print "training rmse with regularization %s:"%regularization_param, rec3['training_rmse']
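
To build intuition for the expected effect, here is a minimal sketch using one-dimensional ridge regression rather than the factorization model itself. It assumes an L2 penalty of the form `lam * w**2`; the data and function names are invented. A larger penalty shrinks the weight toward zero, and the training RMSE can only go up as the penalty grows, because the unregularized solution is by definition the best fit to the training data.

```python
import math


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2 for one weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def training_rmse(xs, ys, w):
    """RMSE of the predictions w*x on the same data used for fitting."""
    residuals = [(y - w * x) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(residuals) / float(len(residuals)))


# Toy data, invented for illustration only.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [1.1, 1.9, 3.2, 3.8]

for lam in (0.0, 0.1, 10.0):
    w = ridge_slope(toy_x, toy_y, lam)
    print("lambda=%s  w=%.4f  training rmse=%.4f"
          % (lam, w, training_rmse(toy_x, toy_y, w)))
```

The same reasoning suggests that the training RMSE reported with `regularization=0` should be no higher than with a positive value, while the regularized model may generalize better to held-out ratings.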

## Extra Point #1. Tune your model to find the best parameters. 

What parameters are being tuned by this procedure?

In [None]:
kfolds = graphlab.cross_validation.KFold(sf, 5)
params = dict(user_id='user', 
              item_id='movie', 
              target='rating',
              solver='als', 
              side_data_factorization=False)
paramsearch = graphlab.model_parameter_search.create(
                    kfolds,
                    graphlab.recommender.factorization_recommender.create,
                    params)

In [None]:
paramsearch.get_status()
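
The `KFold(sf, 5)` call above splits the data into five folds so that each candidate parameter set is scored on held-out rows. The idea can be sketched in plain Python as follows; this illustrates the general technique under simple assumptions (shuffled, nearly equal folds) and is not GraphLab's implementation. `kfold_indices` is a hypothetical helper.

```python
import random


def kfold_indices(n_rows, n_folds, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, valid


# Ten toy rows split into five folds of two validation rows each.
splits = list(kfold_indices(n_rows=10, n_folds=5))
```

A model would be trained on each `train` list and evaluated on the matching `valid` list, and the per-fold metrics averaged, which is what the `mean_validation_*` names in the search results refer to.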

#### Best models by different metrics

In [None]:
from pprint import pprint

print "best params by recall@5:"
pprint(paramsearch.get_best_params('mean_validation_recall@5'))
print

print "best params by precision@5:"
pprint(paramsearch.get_best_params('mean_validation_precision@5'))
print

print "best params by rmse:"
pprint(paramsearch.get_best_params('mean_validation_rmse'))
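
The three metrics can disagree because RMSE measures how close the predicted ratings are, while precision@5 and recall@5 measure how good the top five recommendations are. A minimal sketch of the two ranking metrics, using their common definitions and invented item ids (`precision_recall_at_k` is a hypothetical helper, not a GraphLab function):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one user.

    recommended: item ids ordered from best to worst predicted score.
    relevant:    item ids the user actually liked in the held-out data.
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


# Toy example: two of the five recommendations are among three relevant items.
toy_recommended = [10, 20, 30, 40, 50]
toy_relevant = [20, 50, 60]
p_at_5, r_at_5 = precision_recall_at_k(toy_recommended, toy_relevant, 5)
```

Here precision@5 is 2/5 and recall@5 is 2/3, since two hits are divided by the number of recommendations and by the number of relevant items respectively.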

## What are the latent features?

In [None]:
lf_df = df.set_index(['user', 'movie'])[['rating']].unstack().fillna(0)
lf_df

In [None]:
from scipy.spatial.distance import cdist

lf_df = df.set_index(['user', 'movie'])[['rating']].unstack().fillna(0)
user_df = user_sf[['user', 'factors']].sort('user').unpack('factors').to_dataframe()
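# cdist's 'correlation' metric is a distance (1 - Pearson r), so convert it
# back to a correlation. Column 0 of the result corresponds to the 'user' id
# column of user_df; columns 1-8 correspond to the eight latent factors.
corr = 1 - cdist(lf_df.values.T, user_df.values.T, 'correlation')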
corr_df = pd.DataFrame(corr)
corr_df.index = lf_df.columns.get_loc_level('rating')[1]

movies = pd.read_table('data/u.item', sep='|', index_col=0, header=None,
                       names=['movie id', 'movie title', 'release date',
                              'video release date', 'imdb url', 'unknown',
                              'action', 'adventure', 'animation',
                              'children\'s', 'comedy', 'crime',
                              'documentary', 'drama', 'fantasy',
                              'film-noir', 'horror', 'musical', 'mystery',
                              'romance', 'sci-fi', 'thriller', 'war',
                              'western'])
movies_with_corr = pd.concat([movies, corr_df], axis=1)

for i in xrange(1, 9):
    print "TOP MOVIES FOR FACTOR {0}:".format(i)
    # DataFrame.sort was removed from pandas; sort_values is the replacement.
    top_five_movies = movies_with_corr.sort_values(by=i, ascending=False)['movie title'][:5]
    print '    ' + '\n    '.join(top_five_movies)
    print
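
The quantity behind this ranking is the Pearson correlation between one movie's column of ratings (one value per user) and one latent factor's column of user values. SciPy's `'correlation'` metric returns the correlation distance, which is 1 minus that Pearson correlation. A minimal sketch with invented numbers for four users:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# One movie's ratings from four users (0 stands for "not rated", as in lf_df).
toy_movie_ratings = [5.0, 4.0, 0.0, 1.0]
# The value of one latent factor for the same four users.
toy_user_factor = [0.9, 0.7, -0.2, 0.1]

r = pearson(toy_movie_ratings, toy_user_factor)
correlation_distance = 1.0 - r   # what the 'correlation' metric reports
```

A movie whose ratings rise and fall with a factor has a correlation near 1 and a distance near 0. Keep in mind that the sign of a latent factor is arbitrary, so strongly negative correlations can be just as informative about what a factor captures.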

## Top topics for each latent feature

In [None]:
from collections import Counter

print "TOP TOPICS FOR EACH FACTOR:"
for i in xrange(1, 9):
    scores = Counter()
    for topic in ['action', 'adventure', 'animation', 'children\'s',
                  'comedy', 'crime', 'documentary', 'drama', 'fantasy',
                  'film-noir', 'horror', 'musical', 'mystery', 'romance',
                  'sci-fi', 'thriller', 'war', 'western']:
        scores[topic] = np.dot(movies_with_corr[i], movies_with_corr[topic]) / np.sum(movies_with_corr[topic])
    top_topics = [topic for topic, score in scores.most_common(3)]
    print "    FACTOR {0}:  {1}".format(i, ', '.join(top_topics))
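
The score computed for each topic is just the average of the factor column over the movies that carry that genre flag: the dot product with a 0/1 column sums the values for flagged movies, and dividing by the column sum divides by how many there are. A minimal sketch with invented values (`genre_mean` is a hypothetical helper):

```python
def genre_mean(values, flags):
    """Average of `values` over the positions where `flags` is 1."""
    total = sum(v * f for v, f in zip(values, flags))
    return total / float(sum(flags))


# Toy factor-column values for four movies and a 0/1 genre indicator.
toy_values = [0.8, 0.2, -0.4, 0.6]
toy_is_comedy = [1, 0, 0, 1]

# Only the first and last movies are comedies, so this averages 0.8 and 0.6.
score = genre_mean(toy_values, toy_is_comedy)
```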