# Collaborative Filtering in Dato
This tutorial explains collaborative filtering methods for recommender systems using the GraphLab Create package (from the company Dato). Many of the examples are modified versions of the following basic tutorials:
- https://dato.com/learn/gallery/notebooks/basic_recommender_functionalities.html 
- https://dato.com/learn/gallery/notebooks/five_line_recommender.html

Furthermore, Dato has plenty of IPython notebook examples that go beyond recommendation systems, including classification, clustering, and graph analytics:
- https://dato.com/learn/gallery/index.html

## The five line recommendation system (user-item)
This example builds a recommendation system from a dataset of users and their movie ratings. It is explained in detail at https://dato.com/learn/gallery/notebooks/five_line_recommender.html. The example hides much of the available functionality and fine-tuning, but it is a good starting point.

The dataset in this example comes from ~330 users that have rated ~7700 movies (a total of ~82,000 ratings).

In [40]:
# This is a well-known GraphLab example that builds a recommendation system in 5 lines of code

import graphlab as gl

data = gl.SFrame.read_csv("http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv", column_type_hints={"rating":int})
model = gl.recommender.create(data, user_id="user", item_id="movie", target="rating")
results = model.recommend(users=None, k=5)
model.save("my_model")

results.head() # the recommendation output


PROGRESS: Read 100 lines. Lines per second: 550.264
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.182151 secs.
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv
PROGRESS: Parsing completed. Parsed 82068 lines in 0.141652 secs.
PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 82068 observations with 334 users and 7714 items.
PROGRESS:     Data prepared in: 0.405367s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGR

user,movie,score,rank
Jacob Smith,Mr. Smith Goes to Washington ...,4.95828794039,1
Jacob Smith,The Shield: Season 1,4.78038095033,2
Jacob Smith,When Harry Met Sally,4.57454680002,3
Jacob Smith,Good Morning,4.32713316477,4
Jacob Smith,Three Men and a Baby,4.26816438234,5
Mason Smith,The Hours,5.11884258783,1
Mason Smith,24: Season 2,4.9505128721,2
Mason Smith,Adaptation,4.80303000009,3
Mason Smith,The Fisher King,4.5128459791,4
Mason Smith,Welcome to the Dollhouse,4.47025106943,5


In [41]:
data.head()

user,movie,rating
Jacob Smith,Flirting with Disaster,4
Jacob Smith,Indecent Proposal,3
Jacob Smith,Runaway Bride,2
Jacob Smith,Swiss Family Robinson,1
Jacob Smith,The Mexican,2
Jacob Smith,Maid in Manhattan,4
Jacob Smith,A Charlie Brown Thanksgiving / The ...,3
Jacob Smith,Brazil,1
Jacob Smith,Forrest Gump,3
Jacob Smith,It Happened One Night,4


That's great! But we do not yet know how good these results are, so let's keep moving; we will come back to evaluating them with cross-validation.


## The item-item recommendation system

In [42]:
# from graphlab.recommender import item_similarity_recommender

item_item = gl.recommender.item_similarity_recommender.create(data, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=3,
                                  similarity_type="cosine")

results = item_item.get_similar_items(k=3)
results.head()

PROGRESS: Recsys training: model = item_similarity
PROGRESS: Preparing data set.
PROGRESS:     Data has 82068 observations with 334 users and 7714 items.
PROGRESS:     Data prepared in: 0.373976s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 7714 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.68271         |
PROGRESS: | 2000            | 1.78203         |
PROGRESS: | 3000            | 1.85044         |
PROGRESS: | 4000            | 1.94641         |
PROGRESS: | 5000            | 1.99907         |
PROGRESS: | 6000            | 2.07812         |
PROGRESS: | 7000            | 2.18435         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.64978s
PROGRESS: Finished prediction in 0.131659s
PROGRESS: Getting similar items completed in 0.032769


movie,similar,score,rank
Flirting with Disaster,Martin Lawrence: You So Crazy ...,0.561863587262,1
Flirting with Disaster,Shadow Magic,0.535303379031,2
Flirting with Disaster,Seinfeld: Season 4,0.507150516208,3
Indecent Proposal,Cocktail,0.568772522656,1
Indecent Proposal,Beverly Hills Cop,0.516246885143,2
Indecent Proposal,Flatliners,0.513955034568,3
Runaway Bride,Notting Hill,0.61341356583,1
Runaway Bride,Sleepless in Seattle,0.609021736748,2
Runaway Bride,Maid in Manhattan,0.608688789629,3
Swiss Family Robinson,Armed and Dangerous,0.483493778415,1
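
The `similarity_type="cosine"` option used above scores two movies by the cosine of the angle between their rating vectors. Here is a minimal sketch of that idea in plain Python, assuming each item's ratings are stored as a `{user: rating}` dictionary and that a missing rating counts as zero. The toy ratings and the `cosine_similarity` helper are illustrative assumptions, not GraphLab's implementation:

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items given {user: rating} dicts."""
    # only users who rated both items contribute to the dot product
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# hypothetical toy ratings, keyed by item and then by user
item_ratings = {
    "Movie A": {"u1": 4, "u2": 3, "u3": 5},
    "Movie B": {"u1": 4, "u2": 3, "u3": 5},
    "Movie C": {"u4": 2},
}

sim_ab = cosine_similarity(item_ratings["Movie A"], item_ratings["Movie B"])
sim_ac = cosine_similarity(item_ratings["Movie A"], item_ratings["Movie C"])
```

Items rated identically by the same users score close to 1, while items with no raters in common score 0.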


___
So now we can make subjective judgments about the item-item similarities, but we need a more "user-centric" way of measuring precision and recall. Let's create a holdout set and judge precision and recall on a per-user basis:

In [43]:
train, test = gl.recommender.util.random_split_by_user(data,
                                                    user_id="user", item_id="movie",
                                                    max_num_users=100, item_test_proportion=0.2)
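
To make the idea of a per-user holdout concrete, here is a rough sketch in plain Python: sample a limited number of users, then move a proportion of each sampled user's ratings into the test set. The `split_by_user` helper, its `seed` argument, and the toy rows are assumptions for illustration; the sketch mirrors the parameter names used above but is not how GraphLab implements the split.

```python
import random

def split_by_user(rows, max_num_users, item_test_proportion, seed=0):
    """Split (user, item, rating) rows into train and test lists.

    Only ratings from a random sample of users can land in the test set;
    every other rating stays in the training set.
    """
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in rows})
    test_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, test = [], []
    for row in rows:
        if row[0] in test_users and rng.random() < item_test_proportion:
            test.append(row)
        else:
            train.append(row)
    return train, test

# hypothetical toy data: 4 users who each rated 10 movies
rows = [("u%d" % u, "m%d" % m, (u + m) % 5 + 1)
        for u in range(4) for m in range(10)]

train_rows, test_rows = split_by_user(rows, max_num_users=2,
                                      item_test_proportion=0.2)
```

Because only some ratings are removed from each sampled user, those users still appear in the training data, which is what lets the model make recommendations for them at evaluation time.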

In [44]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


item_item = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

rmse_results = item_item.evaluate(test)


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Preparing data set.
PROGRESS:     Data has 76920 observations with 334 users and 7554 items.
PROGRESS:     Data prepared in: 0.300585s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 7554 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.49467         |
PROGRESS: | 2000            | 1.61319         |
PROGRESS: | 3000            | 1.70744         |
PROGRESS: | 4000            | 1.80876         |
PROGRESS: | 5000            | 1.94854         |
PROGRESS: | 6000            | 2.09927         |
PROGRESS: | 7000            | 2.22566         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.61122s
PROGRESS: Finished prediction in 0.150258s

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------

In [46]:
print rmse_results.viewkeys()
rmse_results['rmse_by_item'].show()

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])
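
The `rmse_*` keys all refer to root-mean-square error, computed over different groupings of the held-out ratings. As a small sketch of the difference between an overall and a per-user RMSE, assuming a list of hypothetical `(user, actual, predicted)` triples:

```python
import math
from collections import defaultdict

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / float(len(pairs)))

# hypothetical (user, actual rating, predicted rating) triples
observations = [("u1", 4, 3.5), ("u1", 2, 2.5), ("u2", 5, 3.0)]

# group the (actual, predicted) pairs by user
pairs_by_user = defaultdict(list)
for user, actual, predicted in observations:
    pairs_by_user[user].append((actual, predicted))

rmse_overall = rmse([(a, p) for _, a, p in observations])
rmse_by_user = {user: rmse(pairs) for user, pairs in pairs_by_user.items()}
```

A single badly predicted user can dominate the overall number, which is why the per-user and per-item breakdowns are worth a look.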


In [47]:
rmse_results['rmse_by_user'].show()

In [49]:
import graphlab.aggregate as agg

# we will be using these aggregations
agg_list = [agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')]

# apply these functions to each group (we group the results by 'cutoff')
# the cutoff is the number of top-ranked items considered; see the following URL for the actual equation
# https://dato.com/products/create/docs/generated/graphlab.recommender.util.precision_recall_by_user.html#graphlab.recommender.util.precision_recall_by_user
rmse_results['precision_recall_by_user'].groupby('cutoff',agg_list)

# the groups are not sorted

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
36,0.00861111111111,0.0165248596864,0.0083330127044,0.0191417649042
2,0.04,0.135646599663,0.00180530139116,0.00838103556208
46,0.00760869565217,0.0142138009029,0.00872417687967,0.0192290797484
31,0.00967741935484,0.0185308472469,0.00817428254567,0.0190123325833
26,0.0111538461538,0.0205652375158,0.00798560330039,0.0186203624165
8,0.01875,0.0480071609242,0.0033474229103,0.011572768336
5,0.02,0.06,0.00204277170737,0.00851278341827
16,0.015,0.0320156211872,0.00515254398971,0.0132418071712
41,0.00780487804878,0.0149876014593,0.00841175286188,0.0191559046221
4,0.025,0.075,0.00204277170737,0.00851278341827
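
For a single user, precision at a cutoff is the fraction of the top-ranked recommendations that appear in that user's held-out items, and recall is the fraction of the held-out items that were recovered. A minimal sketch, assuming a ranked list of recommendations and a set of held-out items (the helper name and the toy data are illustrative only):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# hypothetical ranked recommendations and held-out items for one user
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "Z"}

# evaluate at a few cutoffs, in sorted order
by_cutoff = {k: precision_recall_at_k(recommended, relevant, k)
             for k in (1, 2, 5)}
```

Averaging these per-user values at each cutoff gives the kind of summary shown in the table above. With thousands of candidate movies and only a handful of held-out items per user, small values are expected.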


Wow... these results are not so great. Let's try something a little different and see if the results get better. Let's start with collaborative filtering that factorizes the user-item matrix.

___
## Cross Validated Collaborative Filtering

In [50]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating")

rmse_results = rec1.evaluate(test)

PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 76920 observations with 334 users and 7554 items.
PROGRESS:     Data prepared in: 0.342325s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Line

In [51]:
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
36,0.115555555556,0.0997465925035,0.0955101006921,0.0826102887799
2,0.135,0.263201443765,0.00725364845256,0.0167298746533
46,0.11347826087,0.0970052519382,0.115406140462,0.0863003531057
31,0.120967741935,0.103565469398,0.0869336939133,0.0820137179062
26,0.121923076923,0.108657079598,0.0684396021845,0.0668505402303
8,0.12,0.136839321834,0.0251780443925,0.054698065745
5,0.128,0.173251262622,0.0184579137724,0.052910725483
16,0.11875,0.112326254723,0.0463560825875,0.0598657822605
41,0.113902439024,0.0986557477974,0.105131130548,0.0850026978874
4,0.1175,0.174839211849,0.0157019921138,0.0525806611901


___
Okay, so we are getting better, but we might need to tweak the model by regularizing...
Remember that we need to come up with a good estimate of the latent factors, and the matrix reconstructed from those factors needs to be a good estimate of the given ratings. We can control this using regularization constants and by increasing or decreasing the number of latent factors.
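
To see what these knobs do, here is a rough sketch of a regularized matrix factorization trained with stochastic gradient descent in plain Python. It assumes a plain squared-error objective with user and item bias terms, which is simpler than the ranking objective GraphLab's recommender optimizes; the `train_mf` helper, the `step` and `epochs` settings, and the toy ratings are all hypothetical. The parameter names `num_factors`, `regularization`, and `linear_regularization` are borrowed from the model only to show which part of the update each one touches.

```python
import random

def train_mf(ratings, num_factors=2, regularization=0.01,
             linear_regularization=0.001, step=0.05, epochs=200, seed=0):
    """Fit rating ~ mean + user bias + item bias + dot(user, item factors)."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mean = sum(r for _, _, r in ratings) / float(len(ratings))
    # small random latent factors, zero biases
    P = {u: [rng.gauss(0, 0.1) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(num_factors)] for i in items}
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}

    def predict(u, i):
        return mean + bu[u] + bi[i] + sum(a * b for a, b in zip(P[u], Q[i]))

    def training_rmse():
        total = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
        return (total / float(len(ratings))) ** 0.5

    history = [training_rmse()]
    for _ in range(epochs):
        order = ratings[:]
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict(u, i)
            # linear_regularization shrinks the bias (linear) terms
            bu[u] += step * (err - linear_regularization * bu[u])
            bi[i] += step * (err - linear_regularization * bi[i])
            # regularization shrinks the latent factors
            for f in range(num_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += step * (err * qi - regularization * pu)
                Q[i][f] += step * (err * pu - regularization * qi)
        history.append(training_rmse())
    return predict, history

# hypothetical toy ratings: (user, movie, rating)
toy_ratings = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
               ("u2", "m3", 1), ("u3", "m2", 2), ("u3", "m3", 5)]
predict, history = train_mf(toy_ratings)
```

More factors and weaker regularization let the model fit the training ratings more closely, at the risk of overfitting; the held-out test set is what tells us whether a given setting actually helps.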

In [52]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  num_factors=16,                 # override the default value
                                  regularization=1e-02,           # override the default value
                                  linear_regularization = 1e-3)   # override the default value

rmse_results = rec1.evaluate(test)

PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 76920 observations with 334 users and 7554 items.
PROGRESS:     Data prepared in: 0.34795s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 16       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 0.01     |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linea

# Is this better than the item-item matrix?

In [53]:
comparison = gl.recommender.util.compare_models(test, [item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |       0.05      | 0.00140463846829 |
|   2    |      0.035      | 0.00157953247192 |
|   3    | 0.0266666666667 | 0.00173826263065 |
|   4    |      0.025      | 0.00206867564693 |
|   5    |      0.024      | 0.00223793814604 |
|   6    | 0.0216666666667 | 0.00300716891527 |
|   7    | 0.0185714285714 | 0.00300716891527 |
|   8    |     0.02125     | 0.00394896694119 |
|   9    |       0.02      | 0.00413764618647 |
|   10   |       0.02      | 0.00426938930574 |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Finished prediction in 0.020145s

Overall RMSE:  1.23378688949

Per User RMSE (best)
+-------------+-------+---------------+
|     user    | count |      rmse     |
+-------------+-------+---------------+
| Kaden Smi

In [55]:
comparisonstruct = gl.compare(test,[item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |       0.05      | 0.00140463846829 |
|   2    |      0.035      | 0.00157953247192 |
|   3    | 0.0266666666667 | 0.00173826263065 |
|   4    |      0.025      | 0.00206867564693 |
|   5    |      0.024      | 0.00223793814604 |
|   6    | 0.0216666666667 | 0.00300716891527 |
|   7    | 0.0185714285714 | 0.00300716891527 |
|   8    |     0.02125     | 0.00394896694119 |
|   9    |       0.02      | 0.00413764618647 |
|   10   |       0.02      | 0.00426938930574 |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|  

In [56]:
gl.show_comparison(comparisonstruct,[item_item, rec1])

## Parameters, Parameters
There are so many parameters to search through here. It would be great if there was something we could do to vary the parameters automatically and search for the best ones...
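
As a sketch of what an exhaustive search over three candidate values for each of three parameters looks like, the following plain-Python snippet enumerates every combination with `itertools.product`. This is only an illustration of a full grid; an automated search tool may sample combinations differently or add parameters of its own.

```python
import itertools

# candidate values for each parameter we want to tune
search_space = {
    "num_factors": [8, 16, 32],
    "regularization": [0.001, 0.01, 0.1],
    "linear_regularization": [0.001, 0.01, 0.1],
}

# fix an order for the parameter names, then take the Cartesian product
names = sorted(search_space)
grid = [dict(zip(names, values))
        for values in itertools.product(*(search_space[n] for n in names))]

num_combinations = len(grid)  # 3 * 3 * 3 = 27
```

Each dictionary in `grid` is one candidate setting that would be trained on the training set and scored on the held-out set.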

In [57]:
params = {'user_id': 'user', 
          'item_id': 'movie', 
          'target': 'rating',
          'num_factors': [8, 16, 32], 
          'regularization':[0.001, 0.01, 0.1] ,
          'linear_regularization': [0.001, 0.01, 0.1]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=27,
        environment=None)

# note that this evaluator also supports sklearn
# https://dato.com/products/create/docs/generated/graphlab.toolkits.model_parameter_search.create.html?highlight=model_parameter_search

[INFO] Validating job.
[INFO] Validation complete. Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800000' ready for execution
[INFO] Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800000' scheduled.
[INFO] Validating job.
[INFO] A job with name 'Model-Parameter-Search-Dec-03-2015-21-01-3800000' already exists. Renaming the job to 'Model-Parameter-Search-Dec-03-2015-21-01-3800000-7704e'.
[INFO] Validation complete. Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800000-7704e' ready for execution
[INFO] Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800000-7704e' scheduled.
[INFO] Validating job.
[INFO] Validation complete. Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800001' ready for execution
[INFO] Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800001' scheduled.
[INFO] Validating job.
[INFO] Validation complete. Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800002' ready for execution
[INFO] Job: 'Model-Parameter-Search-Dec-03-2015-21-01-3800002' scheduled.


In [62]:
job.get_status()

{'Canceled': 0, 'Completed': 27, 'Failed': 0, 'Pending': 0, 'Running': 0}

In [63]:
job_result = job.get_results()

job_result.head()

model_id,item_id,linear_regularization,max_iterations,num_factors,num_sampled_negative_examples ...,ranking_regularization
9,movie,0.001,50,32,4,0.1
8,movie,0.001,25,16,4,0.5
1,movie,0.01,25,32,8,0.1
0,movie,0.1,50,8,8,0.25
3,movie,0.01,50,8,8,0.1
2,movie,0.1,50,32,4,0.5
5,movie,0.1,50,32,8,0.5
4,movie,0.1,25,8,8,0.25
7,movie,0.1,50,16,8,0.25
6,movie,0.1,50,16,8,0.5

regularization,target,user_id,training_precision@5,training_recall@5,training_rmse,validation_precision@5
0.001,rating,user,0.34371257485,0.00852223684674,0.959666683084,0.104
0.001,rating,user,0.341916167665,0.00840717693956,1.14363014926,0.106
0.1,rating,user,0.34371257485,0.00852223684674,1.02348663246,0.106
0.1,rating,user,0.34371257485,0.00852223684674,1.06988409636,0.108
0.001,rating,user,0.34371257485,0.00852223684674,1.0122506198,0.11
0.1,rating,user,0.380838323353,0.00965870798895,1.07085996458,0.108
0.1,rating,user,0.449700598802,0.0143505274348,1.07151793159,0.13
0.1,rating,user,0.380838323353,0.00965870798895,1.06961665252,0.112
0.001,rating,user,0.34371257485,0.00852223684674,1.06939204287,0.11
0.01,rating,user,0.449700598802,0.0143505274348,1.07130771472,0.134

validation_recall@5,validation_rmse
0.0110219819853,0.967239299379
0.0127751080693,1.13830603918
0.0107128182318,1.02766546406
0.0105432420489,1.05595323353
0.010992908912,1.01729189571
0.0132905134316,1.05817341507
0.0203961570044,1.0588949583
0.0136440507124,1.05607633004
0.0110697731382,1.05615296339
0.0208903430509,1.05888211208


In [64]:
bst_prms = job.get_best_params()
bst_prms

{'item_id': 'movie',
 'linear_regularization': 0.001,
 'max_iterations': 25,
 'num_factors': 8,
 'num_sampled_negative_examples': 4,
 'ranking_regularization': 0.1,
 'regularization': 0.001,
 'target': 'rating',
 'user_id': 'user'}

In [65]:
models = job.get_models()
models

[Class                           : RankingFactorizationRecommender
 
 Schema
 ------
 User ID                         : user
 Item ID                         : movie
 Target                          : rating
 Additional observation features : 0
 Number of user side features    : 0
 Number of item side features    : 0
 
 Statistics
 ----------
 Number of observations          : 76920
 Number of users                 : 334
 Number of items                 : 7554
 
 Training summary
 ----------------
 Training time                   : 28.3346
 
 Model Parameters
 ----------------
 Model class                     : RankingFactorizationRecommender
 num_factors                     : 8
 binary_target                   : 0
 side_data_factorization         : 1
 solver                          : auto
 nmf                             : 0
 max_iterations                  : 50
 
 Regularization Settings
 -----------------------
 regularization                  : 0.1
 regularization_type            

In [66]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |       0.11      | 0.00236911585321 |
|   2    |       0.08      | 0.00296058317008 |
|   3    | 0.0866666666667 | 0.00546313884775 |
|   4    |      0.095      | 0.00797855342139 |
|   5    |      0.108      | 0.0105432420489  |
|   6    |       0.12      | 0.0172256258502  |
|   7    |       0.12      | 0.0221724765359  |
|   8    |     0.12375     | 0.0298003456622  |
|   9    |  0.121111111111 | 0.0311270455731  |
|   10   |      0.115      | 0.0323140842826  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|  

In [69]:
models[13]

Class                           : RankingFactorizationRecommender

Schema
------
User ID                         : user
Item ID                         : movie
Target                          : rating
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 76920
Number of users                 : 334
Number of items                 : 7554

Training summary
----------------
Training time                   : 18.8711

Model Parameters
----------------
Model class                     : RankingFactorizationRecommender
num_factors                     : 8
binary_target                   : 0
side_data_factorization         : 1
solver                          : auto
nmf                             : 0
max_iterations                  : 25

Regularization Settings
-----------------------
regularization                  : 0.001
regularization_type             : normal
linear_regularization  