# Collaborative Filtering in Dato
This tutorial explains collaborative filtering methods for recommender systems using the GraphLab Create package (from the company Dato). Many of the examples are modified versions of the following basic tutorials:
- https://dato.com/learn/gallery/notebooks/basic_recommender_functionalities.html 
- https://dato.com/learn/gallery/notebooks/five_line_recommender.html

Furthermore, Dato has plenty of IPython notebook examples that go beyond recommendation systems, including classification, clustering, and graph analytics. 
- https://dato.com/learn/gallery/index.html

## The five line recommendation system (user-item)
This example builds a recommendation system from a dataset of users and their movie ratings. It is explained in detail at https://dato.com/learn/gallery/notebooks/five_line_recommender.html. The example hides much of the available functionality and fine-tuning, but it is a good starting point.

The dataset in this example comes from ~330 users that have rated ~7700 movies (a total of ~82,000 ratings).

In [68]:
# This is a well known graphlab example that builds a recommendation system in 5 lines of code

import graphlab as gl

data = gl.SFrame.read_csv("http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv", column_type_hints={"rating":int})
model = gl.recommender.create(data, user_id="user", item_id="movie", target="rating")
results = model.recommend(users=None, k=5)
model.save("my_model")

results.head() # the recommendation output


PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv
PROGRESS: Parsing completed. Parsed 82068 lines in 0.13068 secs.
PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 82068 observations with 334 users and 7714 items.
PROGRESS:     Data prepared in: 0.175847s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: 

user,movie,score,rank
Weston Smith,The Bourne Supremacy,9.50957940615,1
Weston Smith,Tears of the Sun,9.49613879717,2
Weston Smith,Maverick,9.45537732637,3
Weston Smith,Good Will Hunting,9.36947272814,4
Weston Smith,The Terminal,9.34242509401,5
Richard Smith,Grease,9.14557321465,1
Richard Smith,Bowling for Columbine,7.46110065377,2
Richard Smith,Super Size Me,7.34739025033,3
Richard Smith,Training Day,7.15685255921,4
Richard Smith,There's Something About Mary: Special Edition ...,6.90998847878,5


In [69]:
data.head()

user,movie,rating
Jacob Smith,Flirting with Disaster,4
Jacob Smith,Indecent Proposal,3
Jacob Smith,Runaway Bride,2
Jacob Smith,Swiss Family Robinson,1
Jacob Smith,The Mexican,2
Jacob Smith,Maid in Manhattan,4
Jacob Smith,A Charlie Brown Thanksgiving / The ...,3
Jacob Smith,Brazil,1
Jacob Smith,Forrest Gump,3
Jacob Smith,It Happened One Night,4


That's great!! But we really do not know how good these results are, so let's keep moving and we will come back to using cross-validation. 


## The item-item recommendation system

In [71]:
# from graphlab.recommender import item_similarity_recommender

item_item = gl.recommender.item_similarity_recommender.create(data, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=3,
                                  similarity_type="cosine")

results = item_item.get_similar_items(k=3)
results.head()

PROGRESS: Recsys training: model = item_similarity
PROGRESS: Preparing data set.
PROGRESS:     Data has 82068 observations with 334 users and 7714 items.
PROGRESS:     Data prepared in: 0.196451s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 7714 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 0.613749        |
PROGRESS: | 2000            | 0.646487        |
PROGRESS: | 3000            | 0.679481        |
PROGRESS: | 4000            | 0.705158        |
PROGRESS: | 5000            | 0.745915        |
PROGRESS: | 6000            | 0.785288        |
PROGRESS: | 7000            | 0.828979        |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 1.02818s
PROGRESS: Finished prediction in 0.081459s
PROGRESS: Getting similar items completed in 0.026761


movie,similar,score,rank
The Recruit,The Bourne Identity,0.540380563409,1
The Recruit,The Sum of All Fears,0.526702083345,2
The Recruit,Ocean's Eleven,0.522604756892,3
What a Girl Wants,Uptown Girls,0.482187108255,1
What a Girl Wants,Freaky Friday,0.446730732337,2
What a Girl Wants,Maid in Manhattan,0.44203776479,3
The Stepford Wives,Shrek 2,0.559092313958,1
The Stepford Wives,Ocean's Eleven,0.55390536512,2
The Stepford Wives,50 First Dates,0.545006166953,3
Tomb Raider,XXX: Special Edition,0.558090690752,1
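
To make the `similarity_type="cosine"` option a bit more concrete, here is a minimal sketch of item-item cosine similarity in plain Python. It is an illustration only and not graphlab's implementation, which may treat missing ratings or normalization differently. The function names and the toy ratings are hypothetical.

```python
import math
from collections import defaultdict

def cosine_item_similarity(ratings):
    """Cosine similarity between items, from (user, item, rating) tuples."""
    # One sparse vector per item: item -> {user: rating}
    item_vectors = defaultdict(dict)
    for user, item, rating in ratings:
        item_vectors[item][user] = float(rating)

    # Euclidean norm of each item vector over all of its ratings
    norms = {item: math.sqrt(sum(r * r for r in vec.values()))
             for item, vec in item_vectors.items()}

    sims = {}
    items = sorted(item_vectors)
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            shared = set(item_vectors[a]) & set(item_vectors[b])
            if not shared or norms[a] == 0 or norms[b] == 0:
                continue  # no overlapping users (or all-zero ratings): skip the pair
            dot = sum(item_vectors[a][u] * item_vectors[b][u] for u in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims

def top_k_similar(sims, item, k=3):
    """Return the k items most similar to `item` as (other_item, score) pairs."""
    scored = []
    for (a, b), score in sims.items():
        if a == item:
            scored.append((b, score))
        elif b == item:
            scored.append((a, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy ratings, not taken from the movie dataset
toy_ratings = [
    ("u1", "A", 4), ("u1", "B", 4),
    ("u2", "A", 2), ("u2", "B", 2),
    ("u3", "A", 1), ("u3", "C", 5),
]
toy_sims = cosine_item_similarity(toy_ratings)
nearest_to_a = top_k_similar(toy_sims, "A", k=2)
```

In the toy data, items A and B are rated by the same users in the same proportions, so they come out as the most similar pair, while B and C share no users and get no score at all.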


___
So now we can make subjective judgments about the item-item affiliations, but we likely need a more "user-centric" way to measure quality, such as precision and recall. So let's now create a holdout set and see if we can judge the precision and recall on a per-user basis:

In [72]:
train, test = gl.recommender.util.random_split_by_user(data,
                                                    user_id="user", item_id="movie",
                                                    max_num_users=100, item_test_proportion=0.2)
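
As a rough mental model of a per-user holdout split, the sketch below picks a limited number of users and moves a fraction of each chosen user's ratings into the test set. This is an assumption about the general idea, written in plain Python, and graphlab's own sampling may differ in its details. The function `split_by_user` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def split_by_user(rows, max_num_users=100, item_test_proportion=0.2, seed=0):
    """Split (user, item, rating) rows into train and test lists.

    Only up to `max_num_users` randomly chosen users contribute test rows;
    each of their rows goes to the test set with probability
    `item_test_proportion`. Every other row stays in the training set.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    users = sorted(by_user)
    test_users = set(rng.sample(users, min(max_num_users, len(users))))

    train, test = [], []
    for user in users:
        for row in by_user[user]:
            if user in test_users and rng.random() < item_test_proportion:
                test.append(row)
            else:
                train.append(row)
    return train, test

# Hypothetical toy data: 10 users who each rated 20 movies
rows = [("user%d" % u, "movie%d" % m, 3) for u in range(10) for m in range(20)]
train_rows, test_rows = split_by_user(rows, max_num_users=3,
                                      item_test_proportion=0.2, seed=42)
```

Because the held-out ratings come from users who also appear in the training set, the model can still be asked for recommendations for every test user.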

In [73]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


item_item = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

rmse_results = item_item.evaluate(test)


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Preparing data set.
PROGRESS:     Data has 76968 observations with 334 users and 7438 items.
PROGRESS:     Data prepared in: 0.161531s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 7438 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 0.54796         |
PROGRESS: | 2000            | 0.584768        |
PROGRESS: | 3000            | 0.613789        |
PROGRESS: | 4000            | 0.640471        |
PROGRESS: | 5000            | 0.674442        |
PROGRESS: | 6000            | 0.713059        |
PROGRESS: | 7000            | 0.766604        |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 0.971355s
PROGRESS: Finished prediction in 0.122341s

Precision and recall summary statistics by cutoff
+--------+----------------+-----------

In [74]:
print rmse_results.viewkeys()
rmse_results['rmse_by_item'].show()

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])



In [75]:
rmse_results['rmse_by_user'].show()


In [76]:
import graphlab.aggregate as agg
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
15,0.02,0.0382970843103,0.008049075855,0.0169209269991
5,0.03,0.0768114574787,0.00312077027263,0.0102042761167
10,0.026,0.0558927544499,0.00635326280729,0.0147637237638
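
For reference, precision and recall at a cutoff can be computed per user roughly as in the sketch below: precision is the share of the top-k recommendations that appear in the user's held-out items, and recall is the share of the held-out items that made it into the top k. This is a plain Python illustration under that common definition, not graphlab's code, and the function names and toy data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

def mean_precision_recall(recs_by_user, relevant_by_user, k):
    """Average the per-user precision and recall at cutoff k."""
    pairs = [precision_recall_at_k(recs_by_user[user],
                                   relevant_by_user.get(user, set()), k)
             for user in recs_by_user]
    count = float(len(pairs))
    return (sum(p for p, _ in pairs) / count,
            sum(r for _, r in pairs) / count)

# Hypothetical ranked recommendations and held-out items for two users
toy_recs = {
    "u1": ["A", "B", "C", "D", "E"],
    "u2": ["F", "G", "H", "I", "J"],
}
toy_held_out = {
    "u1": {"B", "E", "Y", "Z"},  # 2 of the 5 recommendations are hits
    "u2": {"K", "L"},            # no hits
}
u1_precision, u1_recall = precision_recall_at_k(toy_recs["u1"], toy_held_out["u1"], 5)
avg_precision, avg_recall = mean_precision_recall(toy_recs, toy_held_out, 5)
```

Note that when a user has many more held-out items than the cutoff, recall is capped well below 1 even for a perfect top-k list.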


Wow... these results are not so great. Let's try something a little different and see if the results get better. Let's fit a factorization model that learns latent factors from the user-item matrix. 

___
## Cross Validated Collaborative Filtering

In [77]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating")

rmse_results = rec1.evaluate(test)

PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 76968 observations with 334 users and 7438 items.
PROGRESS:     Data prepared in: 0.160514s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Line

In [82]:
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
15,0.116,0.120598507453,0.0370674239186,0.0443078892704
5,0.126,0.17811232411,0.014889168283,0.0231724346618
10,0.119,0.144703144403,0.0245949402795,0.0352278956535


___
Okay, so we are getting better, but we might need to tune the recommender by regularizing...
Remember that we need a good estimate of the latent factors, and the matrix reconstructed from those factors needs to be a good estimate of the given ratings. We can control this trade-off with the regularization constants and by increasing or decreasing the number of latent factors.
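
To see what the number of factors and the regularization constant do, here is a minimal sketch of matrix factorization trained with stochastic gradient descent and an L2 penalty on the factors. It is deliberately simpler than graphlab's ranking factorization model (no bias terms, no ranking loss, so nothing corresponds to `linear_regularization`), and the function names, learning rate, and toy ratings are all hypothetical.

```python
import math
import random

def factorize(ratings, num_factors=2, regularization=0.01,
              learning_rate=0.02, epochs=2000, seed=0):
    """Learn user and item factor vectors from (user, item, rating) tuples."""
    rng = random.Random(seed)
    users = sorted(set(u for u, _, _ in ratings))
    items = sorted(set(i for _, i, _ in ratings))
    # Small random starting values for every latent factor
    P = dict((u, [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]) for u in users)
    Q = dict((i, [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]) for i in items)

    for _ in range(epochs):
        for u, i, r in ratings:
            p_u, q_i = P[u], Q[i]
            err = r - sum(a * b for a, b in zip(p_u, q_i))
            for f in range(num_factors):
                p_f, q_f = p_u[f], q_i[f]
                # Gradient step on the squared error plus the L2 penalty
                p_u[f] += learning_rate * (err * q_f - regularization * p_f)
                q_i[f] += learning_rate * (err * p_f - regularization * q_f)
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of the reconstructed ratings."""
    squared = [(r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
               for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy ratings, not taken from the movie dataset
toy = [
    ("ann", "Alien", 5), ("ann", "Brazil", 4), ("ann", "Grease", 1),
    ("bob", "Alien", 4), ("bob", "Brazil", 5), ("bob", "Grease", 2),
    ("cat", "Alien", 1), ("cat", "Grease", 5),
]
P_weak, Q_weak = factorize(toy, regularization=0.001)
P_strong, Q_strong = factorize(toy, regularization=1.0)
rmse_weak = rmse(toy, P_weak, Q_weak)
rmse_strong = rmse(toy, P_strong, Q_strong)
```

On the training ratings, the weakly regularized fit should reproduce the data more closely than the strongly regularized one. That is exactly why training error alone is a poor guide and a holdout set is needed to choose these settings.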

In [81]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  num_factors=16, 
                                  regularization=1e-02,
                                  linear_regularization = 1e-3)

rmse_results = rec1.evaluate(test)

PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 76968 observations with 334 users and 7438 items.
PROGRESS:     Data prepared in: 0.157766s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 16       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 0.01     |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Line

## Parameters, Parameters
There are so many parameters to search through here. It would be great if there was something we could do to vary the parameters automatically and search for the best ones...

In [83]:

job = gl.model_parameter_search(gl.recommender.ranking_factorization_recommender.create,
                             training_set=train, 
                             validation_set=test,
                             user_id="user", 
                             item_id="movie", 
                             target="rating",
                             num_factors=[8, 16, 32],
                             regularization=[0.001, 0.01, 0.1],
                             linear_regularization = [0.001, 0.01, 0.1])

[INFO] Validating job.
[INFO] Validation complete. Job: 'Model-Parameter-Search-Apr-21-2015-15-10-59' ready for execution
[INFO] Job: 'Model-Parameter-Search-Apr-21-2015-15-10-59' scheduled.
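
To get a feel for the size of the search space, the sketch below enumerates every combination of the candidate values passed to the search above. It is plain Python and independent of graphlab. Whether the library trains all of these combinations or only a sample is not something this sketch assumes, and the helper `expand_grid` is hypothetical.

```python
import itertools

# Candidate values copied from the parameter search call above
param_grid = {
    "num_factors": [8, 16, 32],
    "regularization": [0.001, 0.01, 0.1],
    "linear_regularization": [0.001, 0.01, 0.1],
}

def expand_grid(grid):
    """Return one dict per combination of the candidate values."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values))
            for values in itertools.product(*value_lists)]

combinations = expand_grid(param_grid)
num_combinations = len(combinations)  # 3 * 3 * 3 = 27
```

Each extra parameter multiplies the number of models to train, which is why this kind of search is usually run as a background job.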


In [90]:
job.get_status()
job_result = job.get_results()

In [91]:
models = job_result['models']
summary_sframe = job_result['summary']
summary_sframe 

model_id,linear_regularization,num_factors,regularization,training_precision@5,training_recall@5
0,0.01,16,0.01,0.352095808383,0.00873531821555
1,0.001,16,0.01,0.341317365269,0.00853496742723
2,0.001,32,0.01,0.341317365269,0.00853496742723
3,0.001,32,0.1,0.341317365269,0.00853496742723
4,0.1,32,0.001,0.352095808383,0.00873531821555
5,0.01,32,0.1,0.352095808383,0.00873531821555
6,0.1,16,0.1,0.352095808383,0.00873531821555
7,0.001,16,0.1,0.352095808383,0.00873531821555
8,0.1,16,0.001,0.352095808383,0.00873531821555
9,0.01,16,0.1,0.352095808383,0.00873531821555

training_rmse,validation_precision@5,validation_recall@5,validation_rmse
1.03607226313,0.126,0.0145494475462,1.01919849161
1.0290447655,0.124,0.0147939301878,1.01582478013
1.0291817364,0.124,0.0147939301878,1.01592078246
1.04173655542,0.126,0.0147639900299,1.02332404246
1.06912211793,0.132,0.0154109952375,1.08439684786
1.04138758258,0.126,0.0146842514678,1.02426594328
1.06965520034,0.13,0.0153129560218,1.08636715833
1.04181420261,0.128,0.0148925848011,1.02306595698
1.06974260063,0.126,0.0146842514678,1.08692593382
1.0417441106,0.126,0.0147945455854,1.02449380671
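
One way to read the summary is to pick the row with the best validation metric. The sketch below copies a few columns from the ten rows shown above into plain Python, assuming that the rows of the second block line up with `model_id` 0 to 9 in order. The variable names are hypothetical.

```python
columns = ("model_id", "linear_regularization", "num_factors",
           "regularization", "validation_precision@5", "validation_rmse")

# Values copied from the summary output above (row order assumed to match)
values = [
    (0, 0.01, 16, 0.01, 0.126, 1.01919849161),
    (1, 0.001, 16, 0.01, 0.124, 1.01582478013),
    (2, 0.001, 32, 0.01, 0.124, 1.01592078246),
    (3, 0.001, 32, 0.1, 0.126, 1.02332404246),
    (4, 0.1, 32, 0.001, 0.132, 1.08439684786),
    (5, 0.01, 32, 0.1, 0.126, 1.02426594328),
    (6, 0.1, 16, 0.1, 0.13, 1.08636715833),
    (7, 0.001, 16, 0.1, 0.128, 1.02306595698),
    (8, 0.1, 16, 0.001, 0.126, 1.08692593382),
    (9, 0.01, 16, 0.1, 0.126, 1.02449380671),
]
summary_rows = [dict(zip(columns, row)) for row in values]

# Lower RMSE is better; higher precision is better
best_by_rmse = min(summary_rows, key=lambda row: row["validation_rmse"])
best_by_precision = max(summary_rows, key=lambda row: row["validation_precision@5"])
```

Under that assumption, the lowest validation RMSE belongs to model 1 while the highest validation precision@5 belongs to model 4, so the "best" model depends on which metric matters for the application.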
