# Collaborative Filtering in Dato
This tutorial explains collaborative filtering methods for recommender systems using the GraphLab Create package (from the company Dato). Many of the examples are adapted from the following basic tutorials:
- https://dato.com/learn/gallery/notebooks/basic_recommender_functionalities.html 
- https://dato.com/learn/gallery/notebooks/five_line_recommender.html

Dato also has plenty of IPython notebook examples that go beyond recommendation systems, including classification, clustering, and graph analytics.
- https://dato.com/learn/gallery/index.html

## The five line recommendation system (user-item)
This example builds a recommendation system from a dataset of users and their movie ratings. It is explained in detail at https://dato.com/learn/gallery/notebooks/five_line_recommender.html. The example hides much of the available functionality and fine-tuning, but it is a good starting point.

The dataset in this example comes from ~330 users that have rated ~7700 movies (a total of ~82,000 ratings).

In [1]:
# This is a well known graphlab example that builds a recommendation system in 5 lines of code

import graphlab as gl

data = gl.SFrame.read_csv("http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv", 
                          column_type_hints={"rating":int})
model = gl.recommender.create(data, user_id="user", item_id="movie", target="rating")
results = model.recommend(users=None, k=5)
model.save("my_model")

results.head() # the recommendation output


2016-04-19 20:27:38,784 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1461122855.log




user,movie,score,rank
Jacob Smith,Citizen Kane,5.27867387331,1
Jacob Smith,Sex and the City: Season 1 ...,4.85848735369,2
Jacob Smith,Moonstruck,4.81308291948,3
Jacob Smith,The Jerk,4.72002028024,4
Jacob Smith,Six Feet Under: Season 2,4.70141337908,5
Mason Smith,Welcome to the Dollhouse,6.11764071977,1
Mason Smith,Best in Show,5.45646928346,2
Mason Smith,Six Feet Under: Season 1,5.41805718935,3
Mason Smith,Napoleon Dynamite,5.25122045076,4
Mason Smith,Election,5.01047347582,5


In [2]:
data.head()

user,movie,rating
Jacob Smith,Flirting with Disaster,4
Jacob Smith,Indecent Proposal,3
Jacob Smith,Runaway Bride,2
Jacob Smith,Swiss Family Robinson,1
Jacob Smith,The Mexican,2
Jacob Smith,Maid in Manhattan,4
Jacob Smith,A Charlie Brown Thanksgiving / The ...,3
Jacob Smith,Brazil,1
Jacob Smith,Forrest Gump,3
Jacob Smith,It Happened One Night,4


That's great! But we do not yet know how good these recommendations are, so let's keep moving; we will come back to this with cross-validation.


## The item-item recommendation system

In [3]:
# from graphlab.recommender import item_similarity_recommender

item_item = gl.recommender.item_similarity_recommender.create(data, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=3,
                                  similarity_type="cosine")

results = item_item.get_similar_items(k=3)
results.head()

movie,similar,score,rank
Flirting with Disaster,Martin Lawrence: You So Crazy ...,0.561863587262,1
Flirting with Disaster,Shadow Magic,0.535303379031,2
Flirting with Disaster,Seinfeld: Season 4,0.507150516208,3
Indecent Proposal,Cocktail,0.568772522656,1
Indecent Proposal,Beverly Hills Cop,0.516246885143,2
Indecent Proposal,Flatliners,0.513955034568,3
Runaway Bride,Notting Hill,0.61341356583,1
Runaway Bride,Sleepless in Seattle,0.609021736748,2
Runaway Bride,Maid in Manhattan,0.608688789629,3
Swiss Family Robinson,Armed and Dangerous,0.483493778415,1
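
The cell above requests `similarity_type="cosine"`. As a rough illustration of what a cosine score between two movies means, here is a minimal sketch on a toy rating table. The data, the `cosine_similarity` helper, and the `most_similar` helper are all hypothetical, and this is the textbook formula; GraphLab Create's internal computation may differ in its details.

```python
from math import sqrt

# Hypothetical toy ratings: {movie: {user: rating}}
ratings = {
    "Runaway Bride": {"ann": 4, "bob": 5, "cat": 1},
    "Notting Hill": {"ann": 5, "bob": 4},
    "Brazil": {"cat": 5, "dan": 4},
}

def cosine_similarity(a, b):
    """Cosine between two sparse rating vectors keyed by user."""
    shared = set(a) & set(b)                      # users who rated both movies
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(item, k=3):
    """Return the k other movies with the highest cosine score."""
    scores = [(other, cosine_similarity(ratings[item], ratings[other]))
              for other in ratings if other != item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

neighbours = most_similar("Runaway Bride", k=2)
```

Movies rated similarly by the same users score close to 1, while movies with few raters in common score close to 0.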


___
So now we can make subjective judgments about the item-item similarities, but we need a more "user-centric" way of measuring quality, such as precision and recall. So let's create a holdout set and measure precision and recall on a per-user basis:

In [4]:
train, test = gl.recommender.util.random_split_by_user(data,
                                                    user_id="user", item_id="movie",
                                                    max_num_users=100, item_test_proportion=0.2)
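
As a rough mental model of this kind of split, the sketch below picks a limited number of users and moves a fraction of each chosen user's ratings into the test set, leaving everything else for training. The `split_by_user` function and the toy rows are hypothetical; this is an assumption about the general idea, not GraphLab Create's actual implementation.

```python
import random

def split_by_user(rows, max_num_users=100, item_test_proportion=0.2, seed=0):
    """rows: list of (user, item, rating) tuples. Returns (train, test)."""
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in rows})
    # Only a sample of users contributes ratings to the test set
    test_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, test = [], []
    for row in rows:
        if row[0] in test_users and rng.random() < item_test_proportion:
            test.append(row)      # held out for evaluation
        else:
            train.append(row)     # everything else is used for training
    return train, test

# Hypothetical data: 10 users who each rated 20 movies
toy_rows = [("u%d" % u, "m%d" % m, 3) for u in range(10) for m in range(20)]
toy_train, toy_test = split_by_user(toy_rows, max_num_users=2,
                                    item_test_proportion=0.5, seed=1)
```

With `max_num_users=100`, the precision and recall averages reported in the following cells are taken over at most 100 test users.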

In [5]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


item_item = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

rmse_results = item_item.evaluate(test)



Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |       0.07      | 0.00167502523854 |
|   2    |       0.06      | 0.00339865660984 |
|   3    | 0.0433333333333 | 0.00346532327651 |
|   4    |      0.035      | 0.00409032327651 |
|   5    |      0.028      | 0.00409032327651 |
|   6    | 0.0283333333333 | 0.00495249922797 |
|   7    | 0.0257142857143 | 0.00508407817534 |
|   8    |      0.025      | 0.00548592739298 |
|   9    | 0.0222222222222 | 0.00548592739298 |
|   10   |       0.02      | 0.00548592739298 |
+--------+-----------------+------------------+
[10 rows x 3 columns]



('\nOverall RMSE: ', 1.2730448475220915)

Per User RMSE (best)
+------------+-------+----------------+
|    user    | count |      rmse      |
+------------+-------+----------------+
| Jose Smith |   19  | 0.592108011181 |
+------------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------------+-------+---------------+
|      user     | count |      rmse     |
+---------------+-------+---------------+
| Grayson Smith |   39  | 1.95480523054 |
+---------------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+--------------------------+-------+------+
|          movie           | count | rmse |
+--------------------------+-------+------+
| The Purple Rose of Cairo |   1   | 0.0  |
+--------------------------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------------------------+-------+------+
|           movie           | count | rmse |
+---------------------------+-------+------+
| Hello Kitty Saves the Day |   1   |

In [6]:
print rmse_results.viewkeys()
print rmse_results['rmse_by_item']

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])
+---------------------+-------+----------------+
|        movie        | count |      rmse      |
+---------------------+-------+----------------+
|   Steel Magnolias   |   5   | 0.931786461299 |
|    Donnie Brasco    |   2   | 1.32198224309  |
|       Eurotrip      |   3   | 0.549175402409 |
| Cast a Giant Shadow |   1   |      2.0       |
|    Reindeer Games   |   1   |      0.25      |
|   Tequila Sunrise   |   1   |      0.25      |
|     Brian's Song    |   2   |      1.0       |
|      Scooby-Doo     |   1   |      0.0       |
|      Idle Hands     |   1   |      0.5       |
|     Nurse Betty     |   1   |      2.0       |
+---------------------+-------+----------------+
[2399 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [7]:
rmse_results['rmse_by_user']

user,count,rmse
Tucker Smith,12,1.05807333148
Patrick Smith,96,1.47448487699
Robert Smith,58,1.11030633547
Donovan Smith,37,0.91441088558
Alan Smith,45,1.0359922563
Angelo Smith,63,1.22826784068
Jeremy Smith,10,1.74003199395
Oliver Smith,14,1.63560861794
Seth Smith,43,1.47705005756
Nicholas Smith,60,1.01889507116
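
To make the RMSE columns concrete, here is a minimal sketch of overall and per-user RMSE on a handful of made-up predictions. The triples and helper names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user, actual rating, predicted rating) triples
predictions = [
    ("ann", 4, 3.5), ("ann", 2, 3.0),
    ("bob", 5, 5.0), ("bob", 1, 3.0),
]

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs)
                / float(len(pairs)))

overall_rmse = rmse([(a, p) for _, a, p in predictions])

# Group the pairs by user, then score each group separately
pairs_by_user = defaultdict(list)
for user, actual, predicted in predictions:
    pairs_by_user[user].append((actual, predicted))
rmse_by_user = {user: rmse(pairs) for user, pairs in pairs_by_user.items()}
```

Note that a group with a single rating has an RMSE equal to one absolute error, which is why the per-item rows with `count` 1 above (including the "best" item at 0.0) say little about the model.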


In [8]:
rmse_results['precision_recall_by_user']

user,cutoff,precision,recall,count
Aaron Smith,1,0.0,0.0,126
Aaron Smith,2,0.0,0.0,126
Aaron Smith,3,0.0,0.0,126
Aaron Smith,4,0.0,0.0,126
Aaron Smith,5,0.0,0.0,126
Aaron Smith,6,0.0,0.0,126
Aaron Smith,7,0.0,0.0,126
Aaron Smith,8,0.0,0.0,126
Aaron Smith,9,0.0,0.0,126
Aaron Smith,10,0.0,0.0,126
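
For reference, precision and recall at a cutoff can be sketched for a single user as follows. This assumes the usual definitions (hits in the top k divided by k, and hits divided by the number of held-out items); the movie lists and the `precision_recall_at_k` helper are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of items; relevant: set of held-out items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and held-out (test) movies for one user
recommended = ["Citizen Kane", "Moonstruck", "The Jerk", "Brazil", "Election"]
held_out = {"Moonstruck", "Forrest Gump", "Election", "Best in Show"}

p2, r2 = precision_recall_at_k(recommended, held_out, k=2)
p5, r5 = precision_recall_at_k(recommended, held_out, k=5)
```

A user like Aaron Smith above, with 0.0 at every cutoff, is one for whom none of the top 10 recommendations appeared among the held-out movies.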


In [9]:
import graphlab.aggregate as agg

# we will be using these aggregations
agg_list = [agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')]

# apply these functions to each group (we group the results by 'cutoff')
# the cutoff is the number of top-ranked recommendations considered; see the following URL for the exact equation
# https://dato.com/products/create/docs/generated/graphlab.recommender.util.precision_recall_by_user.html#graphlab.recommender.util.precision_recall_by_user
rmse_results['precision_recall_by_user'].groupby('cutoff',agg_list)

# the groups are not sorted

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
36,0.0108333333333,0.0199903526115,0.00795582580529,0.018400391833
2,0.06,0.162480768093,0.00339865660984,0.0117618101546
46,0.0095652173913,0.0169286281165,0.00877324720762,0.0188483196983
31,0.0109677419355,0.0215043010484,0.0073990079169,0.0182554982479
26,0.0119230769231,0.0247442540099,0.00696086091897,0.0179926973068
8,0.025,0.0586301969978,0.00548592739298,0.017160308033
5,0.028,0.0749399759808,0.00409032327651,0.0131740705058
16,0.015625,0.036843206633,0.00591304592739,0.0174174684844
41,0.01,0.0182829242263,0.00819302257079,0.0185967212991
4,0.035,0.093674969976,0.00409032327651,0.0131740705058
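
The aggregation above can be mimicked in plain Python. In this sketch the per-user values are hypothetical: they are chosen so that 7 of 100 users get a hit at cutoff 1 and 12 of 100 users get one hit at cutoff 2, which happens to reproduce the averages printed above for those cutoffs. The real per-user values are not shown in this notebook.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user, cutoff, precision) rows
rows = [("user%d" % i, 2, 0.5 if i < 12 else 0.0) for i in range(100)]
rows += [("user%d" % i, 1, 1.0 if i < 7 else 0.0) for i in range(100)]

def mean(values):
    return sum(values) / float(len(values))

def pstdev(values):
    """Population standard deviation (divides by n, not n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / float(len(values)))

# Group precision values by cutoff
precision_by_cutoff = defaultdict(list)
for _, cutoff, precision in rows:
    precision_by_cutoff[cutoff].append(precision)

# Unlike the groupby output above, sort the groups by cutoff
summary = [(cutoff, mean(vals), pstdev(vals))
           for cutoff, vals in sorted(precision_by_cutoff.items())]
```

With these values the cutoff-2 group has mean 0.06 and population standard deviation of about 0.1625, matching the printed row, which suggests (but does not prove) that `agg.STD` divides by n.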


Wow... these results are not so great: mean precision never exceeds 0.07. Let's try something a little different and see if the results get better. Let's start with collaborative filtering that factorizes the user-item matrix.

___
## Cross Validated Collaborative Filtering

In [10]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating")

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.1       | 0.00164457567063 |
|   2    |      0.1       | 0.00324990192487 |
|   3    |      0.11      | 0.00524744901874 |
|   4    |     0.1075     | 0.00722252830053 |
|   5    |     0.114      | 0.0101597468743  |
|   6    | 0.111666666667 | 0.0124705998099  |
|   7    |      0.11      | 0.0147964372292  |
|   8    |    0.10875     | 0.0177830369973  |
|   9    |      0.11      |  0.021903039232  |
|   10   |      0.11      | 0.0239068227756  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.6327680464610126)

Per User RMSE (best)
+--------------+-------+---------------+
|     user     | count |      rmse     |
+--------------+-------+---------------+
| Kayden Smith |   4   | 0.53980226414 |
+--------------+-------+---------------+


In [11]:
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
36,0.103055555556,0.0858234442344,0.078448511722,0.0566539178444
2,0.1,0.22360679775,0.00324990192487,0.00842681757058
46,0.100217391304,0.0793936734729,0.100202361362,0.0690719043425
31,0.103548387097,0.0859899807827,0.0689621680753,0.0509598713691
26,0.101538461538,0.0884648828799,0.0544845102407,0.0434752150222
8,0.10875,0.124568003516,0.0177830369973,0.0226663330593
5,0.114,0.167940465642,0.0101597468743,0.0164956996785
16,0.10625,0.104395581803,0.0362750500144,0.0363623315924
41,0.101951219512,0.0821614608646,0.0907498556074,0.0632092369036
4,0.1075,0.174122801494,0.00722252830053,0.013473467884


___
Okay, so precision and recall are getting better, but the overall RMSE got worse (1.63 versus 1.27 for the item-item model), so we might need to tweak the recommender by regularizing...
Remember that we need a good estimate of the latent factors, and the matrix reconstructed from those factors needs to be a good estimate of the given ratings. We can control this trade-off using regularization constants and by increasing or decreasing the number of latent factors.
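
To illustrate what latent factors and regularization constants do, here is a minimal sketch of a regularized matrix factorization trained with stochastic gradient descent. It assumes a common formulation (global mean plus user and item bias terms plus a dot product of factor vectors), with one penalty on the factors and a separate penalty on the bias terms. The toy data, the `train` function, and the `step` and `epochs` settings are hypothetical, and this is not GraphLab Create's solver.

```python
import random
from math import sqrt

# Hypothetical toy ratings: (user, item, rating)
ratings = [
    ("ann", "A", 5), ("ann", "B", 4), ("ann", "C", 1),
    ("bob", "A", 4), ("bob", "C", 2), ("bob", "D", 1),
    ("cat", "B", 1), ("cat", "C", 5), ("cat", "D", 4),
]

def train(ratings, num_factors=2, regularization=1e-2,
          linear_regularization=1e-3, step=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / float(len(ratings))   # global mean
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(num_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(num_factors)] for i in items}
    bu = dict.fromkeys(users, 0.0)   # user bias terms
    bi = dict.fromkeys(items, 0.0)   # item bias terms

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Bias terms are shrunk by linear_regularization
            bu[u] += step * (err - linear_regularization * bu[u])
            bi[i] += step * (err - linear_regularization * bi[i])
            # Latent factors are shrunk by regularization
            for f in range(num_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += step * (err * qi - regularization * pu)
                q[i][f] += step * (err * pu - regularization * qi)
    return predict

def rmse(predict, ratings):
    return sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
                / float(len(ratings)))

mean_rating = sum(r for _, _, r in ratings) / float(len(ratings))
baseline_rmse = rmse(lambda u, i: mean_rating, ratings)   # always predict the mean
trained_rmse = rmse(train(ratings), ratings)
```

Larger penalties pull the factors and biases toward zero, which pushes predictions toward the global mean; more factors give the model more freedom to fit the training ratings, at the risk of overfitting.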

In [12]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  num_factors=16,                 # override the default value
                                  regularization=1e-02,           # override the default value
                                  linear_regularization = 1e-3)   # override the default value

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.14      | 0.00218019543867 |
|   2    |      0.15      | 0.0046812193381  |
|   3    | 0.126666666667 | 0.00564054321138 |
|   4    |      0.13      | 0.00882771199905 |
|   5    |      0.13      | 0.0116111166767  |
|   6    |      0.13      | 0.0137949815879  |
|   7    | 0.121428571429 | 0.0145997317546  |
|   8    |     0.115      | 0.0156712188371  |
|   9    | 0.117777777778 | 0.0180958384498  |
|   10   |     0.117      | 0.0197284137623  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.0096853787623923)

Per User RMSE (best)
+--------------+-------+----------------+
|     user     | count |      rmse      |
+--------------+-------+----------------+
| Cooper Smith |   52  | 0.588395768232 |
+--------------+-------+-------------

# Is this better than the item-item recommender?

In [13]:
comparison = gl.recommender.util.compare_models(test, [item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff


+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |       0.05      | 0.00129847962596 |
|   2    |      0.045      | 0.00263372821883 |
|   3    |       0.04      | 0.00323276513697 |
|   4    |      0.035      | 0.00345952250659 |
|   5    |      0.034      | 0.00445221578126 |
|   6    | 0.0316666666667 | 0.00517925659758 |
|   7    |       0.03      | 0.0054646816988  |
|   8    |      0.0275     | 0.00559626064617 |
|   9    | 0.0244444444444 | 0.00559626064617 |
|   10   |      0.023      | 0.00575755096875 |
+--------+-----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.2730448475220915)

Per User RMSE (best)
+------------+-------+----------------+
|    user    | count |      rmse      |
+------------+-------+----------------+
| Jose Smith |   19  | 0.592108011181 |
+------------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (wor

In [14]:
comparisonstruct = gl.compare(test,[item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |       0.05      | 0.00129847962596 |
|   2    |      0.045      | 0.00263372821883 |
|   3    |       0.04      | 0.00323276513697 |
|   4    |      0.035      | 0.00345952250659 |
|   5    |      0.034      | 0.00445221578126 |
|   6    | 0.0316666666667 | 0.00517925659758 |
|   7    |       0.03      | 0.0054646816988  |
|   8    |      0.0275     | 0.00559626064617 |
|   9    | 0.0244444444444 | 0.00559626064617 |
|   10   |      0.023      | 0.00575755096875 |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1 

In [15]:
gl.show_comparison(comparisonstruct,[item_item, rec1])

## Parameters, Parameters
There are so many parameters to search through here. It would be great if there was a way to vary the parameters automatically and search for the best ones...
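
Conceptually, an automated search enumerates or samples combinations from a dictionary of candidate values, then trains and scores one model per combination. The sketch below shows only the enumeration step. The `expand` helper and `search_space` name are hypothetical, and this is not a description of how GraphLab Create chooses its combinations.

```python
import itertools
import random

# Values given as lists are swept; scalar values are held fixed
search_space = {'user_id': 'user',
                'item_id': 'movie',
                'target': 'rating',
                'num_factors': [8, 12, 16, 24, 32],
                'regularization': [0.001],
                'linear_regularization': [0.001]}

def expand(space):
    """Yield one keyword-argument dict per combination of swept values."""
    keys = sorted(space)
    choices = [space[k] if isinstance(space[k], list) else [space[k]]
               for k in keys]
    for combo in itertools.product(*choices):
        yield dict(zip(keys, combo))

grid = list(expand(search_space))                    # full grid of combinations
rng = random.Random(0)
sampled = rng.sample(grid, min(3, len(grid)))        # or a random subset of it
```

Each dictionary in `grid` could be passed as keyword arguments to a model's `create` function, with the resulting model evaluated on a validation set.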

In [16]:
params = {'user_id': 'user', 
          'item_id': 'movie', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)

# note that this evaluator also supports sklearn
# https://dato.com/products/create/docs/generated/graphlab.toolkits.model_parameter_search.create.html?highlight=model_parameter_search

2016-04-19 20:28:54,925 [INFO] graphlab.deploy.job, 22: Validating job.
2016-04-19 20:28:54,955 [INFO] graphlab.deploy.job, 36: Creating a LocalAsync environment called 'async'.
2016-04-19 20:28:54,964 [INFO] graphlab.deploy.map_job, 186: Validation complete. Job: 'Model-Parameter-Search-Apr-19-2016-20-28-5400000' ready for execution
2016-04-19 20:28:55,048 [INFO] graphlab.deploy.map_job, 192: Job: 'Model-Parameter-Search-Apr-19-2016-20-28-5400000' scheduled.
2016-04-19 20:29:11,239 [INFO] graphlab.deploy.job, 22: Validating job.
2016-04-19 20:29:11,250 [INFO] graphlab.deploy.map_job, 220: A job with name 'Model-Parameter-Search-Apr-19-2016-20-28-5400000' already exists. Renaming the job to 'Model-Parameter-Search-Apr-19-2016-20-28-5400000-340b1'.
2016-04-19 20:29:11,260 [INFO] graphlab.deploy.map_job, 186: Validation complete. Job: 'Model-Parameter-Search-Apr-19-2016-20-28-5400000-340b1' ready for execution
2016-04-19 20:29:11,371 [INFO] graphlab.deploy.map_job, 192: Job: 'Model-Param

In [17]:
job.get_status()

In [18]:
job_result = job.get_results()

job_result.head()

model_id,item_id,linear_regularization,max_iterations,num_factors,num_sampled_negative_examples,ranking_regularization
1,movie,0.001,25,32,8,0.5
0,movie,0.001,50,8,4,0.25
3,movie,0.001,25,24,8,0.5
2,movie,0.001,50,16,4,0.25
4,movie,0.001,25,24,8,0.5

regularization,target,user_id,training_precision@5,training_recall@5,training_rmse,validation_precision@5
0.001,rating,user,0.352694610778,0.0088527862985,1.16371146071,0.128
0.001,rating,user,0.340119760479,0.00850918000996,1.0232845689,0.13
0.001,rating,user,0.352694610778,0.0088527862985,1.16351770394,0.128
0.001,rating,user,0.340119760479,0.00850918000996,1.02328612882,0.136
0.001,rating,user,0.352694610778,0.0088527862985,1.16375120468,0.128

validation_recall@5,validation_rmse
0.011426144642,1.13377883636
0.0113212256343,1.00580952846
0.0116075548984,1.1336006249
0.0121267153261,1.0057794781
0.0116075548984,1.1336521705


In [19]:
bst_prms = job.get_best_params()
bst_prms

{'item_id': 'movie',
 'linear_regularization': 0.001,
 'max_iterations': 50,
 'num_factors': 16,
 'num_sampled_negative_examples': 4,
 'ranking_regularization': 0.25,
 'regularization': 0.001,
 'target': 'rating',
 'user_id': 'user'}
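
The notebook does not show which metric `get_best_params` ranks by, but the choice can be checked by hand against the `job_result` table. The sketch below transcribes a few columns from that table, assuming the three wrapped column blocks line up row by row, and picks the best row under two plausible criteria.

```python
# Transcribed from the job_result output above (only the columns used here)
rows = [
    {"model_id": 1, "num_factors": 32, "validation_precision@5": 0.128,
     "validation_rmse": 1.13377883636},
    {"model_id": 0, "num_factors": 8, "validation_precision@5": 0.13,
     "validation_rmse": 1.00580952846},
    {"model_id": 3, "num_factors": 24, "validation_precision@5": 0.128,
     "validation_rmse": 1.1336006249},
    {"model_id": 2, "num_factors": 16, "validation_precision@5": 0.136,
     "validation_rmse": 1.0057794781},
    {"model_id": 4, "num_factors": 24, "validation_precision@5": 0.128,
     "validation_rmse": 1.1336521705},
]

# Highest validation precision at cutoff 5
best_by_precision = max(rows, key=lambda r: r["validation_precision@5"])
# Lowest validation RMSE
best_by_rmse = min(rows, key=lambda r: r["validation_rmse"])
```

In this run both criteria select model 2 (`num_factors` of 16), which agrees with the parameters returned above, although the margin over model 0 is small on both metrics.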

In [20]:
models = job.get_models()
models

[Class                           : RankingFactorizationRecommender
 
 Schema
 ------
 User ID                         : user
 Item ID                         : movie
 Target                          : rating
 Additional observation features : 0
 Number of user side features    : 0
 Number of item side features    : 0
 
 Statistics
 ----------
 Number of observations          : 76833
 Number of users                 : 334
 Number of items                 : 7447
 
 Training summary
 ----------------
 Training time                   : 5.7784
 
 Model Parameters
 ----------------
 Model class                     : RankingFactorizationRecommender
 num_factors                     : 8
 binary_target                   : 0
 side_data_factorization         : 1
 solver                          : auto
 nmf                             : 0
 max_iterations                  : 50
 
 Regularization Settings
 -----------------------
 regularization                  : 0.001
 regularization_type           

In [21]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.14      | 0.00218019543867 |
|   2    |     0.145      | 0.00457252368593 |
|   3    | 0.126666666667 | 0.00574200752592 |
|   4    |     0.1275     | 0.00859985816089 |
|   5    |      0.13      | 0.0113212256343  |
|   6    | 0.126666666667 | 0.0130407981865  |
|   7    | 0.117142857143 |  0.014268686809  |
|   8    |     0.1175     | 0.0160946235916  |
|   9    | 0.113333333333 | 0.0173600979603  |
|   10   |     0.112      |  0.019061632731  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.14

In [27]:
models[2]

Class                           : RankingFactorizationRecommender

Schema
------
User ID                         : user
Item ID                         : movie
Target                          : rating
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 77297
Number of users                 : 334
Number of items                 : 7469

Training summary
----------------
Training time                   : 7.4378

Model Parameters
----------------
Model class                     : RankingFactorizationRecommender
num_factors                     : 8
binary_target                   : 0
side_data_factorization         : 1
solver                          : auto
nmf                             : 0
max_iterations                  : 50

Regularization Settings
-----------------------
regularization                  : 0.1
regularization_type             : normal
linear_regularization     