# Collaborative Filtering in Turi (formerly Dato, formerly GraphLab)

This tutorial explains collaborative filtering methods for recommender systems using the GraphLab Create package (from the company Dato). Many of the examples are adapted from the following basic tutorials:
- https://dato.com/learn/gallery/notebooks/basic_recommender_functionalities.html 
- https://dato.com/learn/gallery/notebooks/five_line_recommender.html

Furthermore, Dato has plenty of IPython notebook examples that go beyond recommendation systems, including classification, clustering, and graph analytics. 
- https://dato.com/learn/gallery/index.html

## The five line recommendation system (user-item)
This example builds a recommendation system from a dataset of users and their movie ratings. It is explained in detail at https://dato.com/learn/gallery/notebooks/five_line_recommender.html. The example hides much of the available functionality and fine-tuning, but it is a good starting point.

The dataset in this example comes from ~330 users that have rated ~7700 movies (a total of ~82,000 ratings).

In [1]:
# This is a well known graphlab example that builds a recommendation system in 5 lines of code

import graphlab as gl

data = gl.SFrame.read_csv("http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv", 
                          column_type_hints={"rating":int})
model = gl.recommender.create(data, user_id="user", item_id="movie", target="rating")
results = model.recommend(users=None, k=5)
model.save("my_model")

results.head() # the recommendation output


This non-commercial license of GraphLab Create for academic use is assigned to eclarson@smu.edu and will expire on October 27, 2017.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1484158681.log


user,movie,score,rank
Jacob Smith,Sabrina,4.86910174883,1
Jacob Smith,Doctor Zhivago,4.79589484728,2
Jacob Smith,West Side Story,4.58079622782,3
Jacob Smith,Bridget Jones's Diary,4.20802472627,4
Jacob Smith,Legends of the Fall,4.16532920397,5
Mason Smith,The Adventures of Priscilla ...,6.47308109796,1
Mason Smith,Roger & Me,5.23513935602,2
Mason Smith,Some Like It Hot,5.03575180567,3
Mason Smith,Best in Show,5.02817009485,4
Mason Smith,Cool Hand Luke,5.01376794374,5


In the above model creation, we found the five highest-ranking items for each user. Two users are shown with their corresponding highest-ranking items (chosen from the items they have not rated).
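To make the "top five unrated items" idea concrete, here is a minimal sketch in plain Python. It is not GraphLab's implementation: the function name `recommend_top_k`, the `predicted` dictionary, and the toy scores are all hypothetical, and it only illustrates filtering out already-rated items before ranking.

```python
import heapq

def recommend_top_k(predicted, already_rated, k=5):
    """Return the k highest-scoring items the user has not rated yet.

    predicted: dict mapping item -> predicted score (hypothetical input)
    already_rated: set of items the user has already rated
    """
    # Keep only candidate items the user has not rated.
    candidates = [(score, item) for item, score in predicted.items()
                  if item not in already_rated]
    top = heapq.nlargest(k, candidates)
    # Attach a 1-based rank, mirroring the score/rank columns in the output above.
    return [(item, score, rank) for rank, (score, item) in enumerate(top, start=1)]

# Toy scores for one user (illustrative values only).
predicted = {"Sabrina": 4.87, "Brazil": 3.10, "Doctor Zhivago": 4.80, "Forrest Gump": 4.95}
already_rated = {"Brazil", "Forrest Gump"}
top2 = recommend_top_k(predicted, already_rated, k=2)
```

Note that "Forrest Gump" has the highest toy score but is excluded because the user already rated it.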
___

In [2]:
data.head()

user,movie,rating
Jacob Smith,Flirting with Disaster,4
Jacob Smith,Indecent Proposal,3
Jacob Smith,Runaway Bride,2
Jacob Smith,Swiss Family Robinson,1
Jacob Smith,The Mexican,2
Jacob Smith,Maid in Manhattan,4
Jacob Smith,A Charlie Brown Thanksgiving / The ...,3
Jacob Smith,Brazil,1
Jacob Smith,Forrest Gump,3
Jacob Smith,It Happened One Night,4


That's great! But we do not yet know how good these results are, so let's keep moving. We will come back to this question using cross-validation. 


## The item-item recommendation system
Now let's look at creating the item-item similarity matrix. That is, for each item, which items are closest based upon user ratings?

In [3]:
# from graphlab.recommender import item_similarity_recommender

item_item = gl.recommender.item_similarity_recommender.create(data, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=3,
                                  similarity_type="cosine")

results = item_item.get_similar_items(k=3)
results.head()

movie,similar,score,rank
Flirting with Disaster,Martin Lawrence: You So Crazy ...,0.561863601208,1
Flirting with Disaster,Shadow Magic,0.535303354263,2
Flirting with Disaster,Seinfeld: Season 4,0.507150530815,3
Indecent Proposal,Cocktail,0.568772494793,1
Indecent Proposal,Beverly Hills Cop,0.516246914864,2
Indecent Proposal,Flatliners,0.513955056667,3
Runaway Bride,Notting Hill,0.613413572311,1
Runaway Bride,Sleepless in Seattle,0.60902172327,2
Runaway Bride,Maid in Manhattan,0.608688771725,3
Swiss Family Robinson,Armed and Dangerous,0.483493804932,1


The item-item matrix is typically a good baseline. However, we can do better with a more personalized system: one that takes into account the preferences of specific users, rather than only how all users rated specific items. 
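For intuition about what an item-item cosine score measures, the following is a small pure-Python sketch. It assumes that a missing rating counts as zero in each item's rating vector; GraphLab's exact handling may differ, and the function name `item_cosine_similarity` and the toy triples are hypothetical.

```python
import math
from collections import defaultdict

def item_cosine_similarity(ratings):
    """ratings: iterable of (user, item, rating) triples.

    Returns {(item_a, item_b): cosine similarity} for every item pair,
    treating unrated entries as 0 (an assumption of this sketch).
    """
    by_item = defaultdict(dict)
    for user, item, rating in ratings:
        by_item[item][user] = float(rating)
    # Euclidean norm of each item's rating vector.
    norms = {item: math.sqrt(sum(r * r for r in users.values()))
             for item, users in by_item.items()}
    sims = {}
    items = sorted(by_item)
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            # Only users who rated both items contribute to the dot product.
            shared = set(by_item[a]) & set(by_item[b])
            dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b]) if dot else 0.0
    return sims

toy = [("u1", "A", 4), ("u1", "B", 4), ("u2", "A", 2), ("u2", "B", 2), ("u3", "C", 5)]
sims = item_cosine_similarity(toy)
```

In the toy data, items A and B are rated identically by the same users, so their similarity is (up to rounding) 1, while C shares no users with either and gets 0.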
___
Moreover, we need to cross-validate to see which model and model parameters actually generalize well on our dataset. That also means we need a set of evaluation criteria. The first and very common measure is the root mean squared error, RMSE. It takes into account the difference between the predicted rating and the actual rating of items. However, we can calculate it under a number of different splits and aggregations. For instance, we could compute a single RMSE over every entry in the dataset. Or we could compute an RMSE for each user, or an RMSE for each item. It really depends on what we are most interested in (i.e., our business case). RMSE can be calculated in the following ways:

$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^N (\hat{y}_i-y_i)^2}$$

Or we can calculate the RMSE for each user, U, in our data:

$$\underbrace{RMSE(U)}_{\text{user=U}}=\sqrt{\frac{1}{|U|}\sum_{u\in U} (\hat{y}_u-y_u)^2}$$

Or we can calculate the RMSE for each item, J, in our data:

$$\underbrace{RMSE(J)}_{\text{item=J}}=\sqrt{\frac{1}{|J|}\sum_{j\in J} (\hat{y}_j-y_j)^2}$$
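The three formulas above can be sketched in a few lines of plain Python. The function names and the toy rows are hypothetical; the point is only that the same squared errors are pooled three different ways (all entries, per user, per item).

```python
import math
from collections import defaultdict

def rmse(errors):
    """Root mean squared error of a non-empty list of errors."""
    return math.sqrt(sum(e * e for e in errors) / float(len(errors)))

def rmse_report(rows):
    """rows: iterable of (user, item, actual, predicted).

    Returns (overall RMSE, {user: RMSE}, {item: RMSE}).
    """
    all_err, by_user, by_item = [], defaultdict(list), defaultdict(list)
    for user, item, actual, predicted in rows:
        err = predicted - actual
        all_err.append(err)
        by_user[user].append(err)
        by_item[item].append(err)
    return (rmse(all_err),
            dict((u, rmse(e)) for u, e in by_user.items()),
            dict((i, rmse(e)) for i, e in by_item.items()))

# Toy data: errors are -1, 0 and -3.
rows = [("u1", "A", 4, 3), ("u1", "B", 2, 2), ("u2", "A", 5, 2)]
overall, per_user, per_item = rmse_report(rows)
```

Here user `u2` has a single badly predicted rating, so the per-user RMSE is 3.0 for `u2` even though the overall RMSE is lower, which is exactly the kind of detail a single pooled number hides.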

It's important to understand that RMSE(U) and RMSE(J), computed for every user or every item, form arrays whose sizes are the number of unique users or the number of unique items, respectively. Therefore, an approach that visualizes the distribution of values is a nice evaluation technique. It also means that statistical tests on the distributions can be used to evaluate the differences between models. That is, "Model A has statistically smaller (with 95% confidence) per-user RMSE than model B; therefore, we conclude that model A has superior performance."
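One way to make such a statement, sketched here under stated assumptions, is a percentile bootstrap on the paired per-user differences. This is only one of several reasonable tests (a paired t-test or a sign test would also work), and the function name, the toy RMSE values, and the defaults below are all hypothetical.

```python
import random

def bootstrap_mean_diff_ci(rmse_a, rmse_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(A - B).

    rmse_a, rmse_b: per-user RMSE lists, aligned so index i is the same user.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(rmse_a, rmse_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample users with replacement and record the mean difference.
        sample = [rng.choice(diffs) for _ in range(n)]
        means.append(sum(sample) / float(n))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-user RMSE values for two models (illustrative only).
rmse_a = [0.9, 1.1, 1.0, 0.8, 1.2, 0.95]
rmse_b = [1.4, 1.3, 1.6, 1.1, 1.5, 1.45]
lo, hi = bootstrap_mean_diff_ci(rmse_a, rmse_b)
a_is_better = hi < 0  # the whole interval lies below zero
```

If the interval for the mean difference excludes zero, that supports the claim that one model has lower per-user RMSE at roughly the chosen confidence level.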


So let's now create a holdout set and see if we can judge the RMSE on a per-user and per-item basis:

In [4]:
train, test = gl.recommender.util.random_split_by_user(data,
                                                    user_id="user", item_id="movie",
                                                    max_num_users=100, item_test_proportion=0.2)
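
As a rough mental model of this kind of split (not GraphLab's actual implementation), the sketch below samples up to `max_num_users` users and moves roughly `item_test_proportion` of each sampled user's ratings into the test set, leaving everything else in training. The function `split_by_user` and the synthetic rows are hypothetical.

```python
import random
from collections import defaultdict

def split_by_user(rows, max_num_users=100, item_test_proportion=0.2, seed=0):
    """rows: list of (user, item, rating) tuples. Returns (train, test)."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    users = sorted(by_user)
    # Only a random subset of users contributes rows to the test set.
    test_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, test = [], []
    for user in users:
        for row in by_user[user]:
            if user in test_users and rng.random() < item_test_proportion:
                test.append(row)
            else:
                train.append(row)
    return train, test

# Synthetic data: 10 users, each rating 20 movies.
rows = [("u%d" % u, "m%d" % m, (u + m) % 5 + 1) for u in range(10) for m in range(20)]
train, test = split_by_user(rows, max_num_users=3, item_test_proportion=0.2)
```

Because every test user keeps most of their ratings in the training set, the model still knows something about each user it is evaluated on.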

In [5]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


item_item = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

rmse_results = item_item.evaluate(test)



Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.18      | 0.00851461400757 |
|   2    |      0.17      | 0.0114813179455  |
|   3    |      0.19      | 0.0203175491454  |
|   4    |     0.1975     | 0.0275973558931  |
|   5    |     0.194      | 0.0327684686723  |
|   6    | 0.186666666667 | 0.0354018807673  |
|   7    | 0.195714285714 | 0.0452641935336  |
|   8    |     0.1975     | 0.0512779743288  |
|   9    | 0.196666666667 |  0.055312349506  |
|   10   |     0.186      | 0.0609766860575  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.633760811056906)

Per User RMSE (best)
+------------+-------+---------------+
|    user    | count |      rmse     |
+------------+-------+---------------+
| Zion Smith |  408  | 1.80170257405 |
+------------+-------+---------------+
[1 rows x 3

In [6]:
print rmse_results.viewkeys()
print rmse_results['rmse_by_item']

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])
+------------------------------+-------+---------------+
|            movie             | count |      rmse     |
+------------------------------+-------+---------------+
|       Steel Magnolias        |   2   | 4.97048890098 |
|        The Hard Word         |   1   |      2.0      |
| Monty Python's Life of Brian |   1   | 3.99193571548 |
|   Crimes and Misdemeanors    |   2   | 3.53526905472 |
|    The Mothman Prophecies    |   1   |      3.0      |
|         ER: Season 1         |   1   |      3.0      |
|        Donnie Brasco         |   2   | 4.12310562562 |
|           Eurotrip           |   3   | 3.46410161514 |
|     Cast a Giant Shadow      |   1   |      2.0      |
|         Brian's Song         |   3   | 2.93285791593 |
+------------------------------+-------+---------------+
[2293 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use prin

In [7]:
rmse_results['rmse_by_user']

user,count,rmse
Dominick Smith,63,3.64281968766
Jeremy Smith,14,3.79007923299
Andre Smith,55,3.19524303916
Beckett Smith,60,3.5033588031
Martin Smith,17,3.99576273485
Silas Smith,25,3.62394758786
Jared Smith,40,3.58626854502
Easton Smith,3,4.0
Joshua Smith,11,3.65339726056
Max Smith,16,3.91823737974


___
Another evaluation criterion is the per-user-recall or the per-user-precision. These are typically smaller values because they require users with a large number of ratings. The idea behind them is that, given a number of highly rated items for a user, how many of them did my model also recommend. This is inherently difficult to calculate because the user has not rated every item in the dataset---we may have found 10 items that the user would have chosen and rated highly, but if the user never rated them, we can't be sure how well we are recommending them. 

Even so, it's a good measure of how well you are rating the items that are most important to the user (assuming the user rated items they had strong opinions about). It's not perfect, but it's the best we have to work with.

We define the per user measures as follows: Let $p_k$ be a vector of the $k$ highest ranked recommendations for a particular user and let $a$ be the set of all positively ranked items for that user in the test set. 

The per-user-recall for k-items is given by:

$$R(k)=\frac{|a \cap p_k|}{|a|} $$

Which means, intuitively, "of all the items rated positively by the user, how many did your recommender find?"

The per-user-precision for k-items is given by:

$$P(k)=\frac{|a \cap p_k|}{k} $$

Which means, intuitively, "of the k items found by your recommender, how many were rated positively by the user?"

These, like per-user RMSE, are arrays the same size as the number of unique users in the dataset. Therefore, statistical comparisons can be used to find superior-performing models. 
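The definitions of $R(k)$ and $P(k)$ above translate directly into code. The following sketch computes both for a single user; the function name and the toy lists are hypothetical, with `recommended` playing the role of $p_k$ (before truncation to $k$) and `relevant` the role of $a$.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of items (best first).
    relevant: set of items the user rated positively in the test set.
    Returns (precision at k, recall at k).
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    # Recall is undefined for a user with no relevant items; report 0.0 here.
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "Z"}
p3, r3 = precision_recall_at_k(recommended, relevant, 3)
p5, r5 = precision_recall_at_k(recommended, relevant, 5)
```

With these toy inputs only "B" is found in the top 3, and "B" and "E" in the top 5, while "Z" is never recommended, so recall cannot reach 1 at either cutoff.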

In [8]:
rmse_results['precision_recall_by_user']

user,cutoff,precision,recall,count
Abel Smith,1,0.0,0.0,18
Abel Smith,2,0.0,0.0,18
Abel Smith,3,0.0,0.0,18
Abel Smith,4,0.0,0.0,18
Abel Smith,5,0.2,0.0555555555556,18
Abel Smith,6,0.166666666667,0.0555555555556,18
Abel Smith,7,0.285714285714,0.111111111111,18
Abel Smith,8,0.25,0.111111111111,18
Abel Smith,9,0.222222222222,0.111111111111,18
Abel Smith,10,0.2,0.111111111111,18


In [9]:
import graphlab.aggregate as agg

# we will be using these aggregations
agg_list = [agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')]

# apply these functions to each group (we group the results by 'cutoff', which is k)
# the cutoff is the number of top items to look for; see the following URL for the actual equation
# https://dato.com/products/create/docs/generated/graphlab.recommender.util.precision_recall_by_user.html#graphlab.recommender.util.precision_recall_by_user
rmse_results['precision_recall_by_user'].groupby('cutoff',agg_list)

# the groups are not sorted

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
36,0.138888888889,0.0881041751509,0.144857844577,0.133618138546
2,0.17,0.284780617318,0.0114813179455,0.0365112180182
46,0.12847826087,0.0853157706309,0.167042809694,0.137559587788
31,0.148387096774,0.0922603202973,0.133727168643,0.124622650674
26,0.153076923077,0.0960738128461,0.11446827711,0.0968703334485
8,0.1975,0.156304990323,0.0512779743288,0.0860494554747
5,0.194,0.192779666978,0.0327684686723,0.0710232780272
16,0.1775,0.12365526677,0.0840212470585,0.093881973756
41,0.13487804878,0.086470552528,0.159192878914,0.136140556075
4,0.1975,0.229932055182,0.0275973558931,0.0708021891243
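
Since the grouped rows above come back unsorted, it can help to see the same aggregation written out in plain Python with an explicit sort. This sketch assumes a population standard deviation; whether `agg.STD` uses the population or sample formula is not confirmed here. The function name and toy rows are hypothetical.

```python
import math
from collections import defaultdict

def mean_std_by_cutoff(rows):
    """rows: iterable of (cutoff, value) pairs.

    Returns a list of (cutoff, mean, population std), sorted by cutoff.
    """
    groups = defaultdict(list)
    for cutoff, value in rows:
        groups[cutoff].append(value)
    summary = []
    for cutoff in sorted(groups):
        vals = groups[cutoff]
        mean = sum(vals) / float(len(vals))
        var = sum((v - mean) ** 2 for v in vals) / float(len(vals))
        summary.append((cutoff, mean, math.sqrt(var)))
    return summary

# Toy per-user precision values at two cutoffs, deliberately out of order.
rows = [(2, 0.5), (1, 0.0), (2, 0.0), (1, 1.0)]
summary = mean_std_by_cutoff(rows)
```

In GraphLab itself, appending `.sort('cutoff')` to the groupby result should give the same ordering, assuming the SFrame `sort` method is available in your version.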


Wow... these results do not look great. Let's try something a little different and see whether the results get better. Let's start with collaborative filtering to create the user-item matrix. 

___
## Cross Validated Collaborative Filtering

In [10]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating")

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.12      | 0.00313351460276 |
|   2    |     0.115      | 0.00614740877219 |
|   3    | 0.116666666667 | 0.00805033682915 |
|   4    |     0.1175     | 0.0108879659345  |
|   5    |     0.116      | 0.0134032387767  |
|   6    | 0.113333333333 | 0.0148098125264  |
|   7    | 0.107142857143 | 0.0163383198215  |
|   8    |     0.115      | 0.0196304263069  |
|   9    | 0.115555555556 | 0.0225160882759  |
|   10   |     0.115      |  0.024344252415  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.7433494498977482)

Per User RMSE (best)
+-------------+-------+----------------+
|     user    | count |      rmse      |
+-------------+-------+----------------+
| Amari Smith |   2   | 0.446885842126 |
+-------------+-------+----------------+


In [11]:
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
36,0.100833333333,0.0863933546541,0.0789151469344,0.0567947995142
2,0.115,0.243464576479,0.00614740877219,0.015865395657
46,0.0947826086957,0.0771980978639,0.0967662571927,0.0663275344235
31,0.102580645161,0.0882692880998,0.0704756250516,0.0542113218014
26,0.109615384615,0.09582636541,0.0639982590249,0.050569056255
8,0.115,0.161709616288,0.0196304263069,0.0288967783974
5,0.116,0.185860162488,0.0134032387767,0.0226925346777
16,0.115625,0.119365182842,0.0403995983877,0.0404722147898
41,0.0963414634146,0.081652693791,0.0861951625754,0.0627664049257
4,0.1175,0.207530118296,0.0108879659345,0.0208591980978


___
Okay, so the overall RMSE is better (1.74 versus 3.63 for the item-item model), although precision and recall dropped. We might improve the model further by regularizing...
Remember that we need to come up with a good estimate of the latent factors, and the matrix they reconstruct needs to be a good estimate of the given ratings. We can control this using regularization constants and by increasing or decreasing the number of latent factors.
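
To illustrate what "latent factors" and "regularization" mean here, the following is a bare-bones matrix factorization trained with stochastic gradient descent. It is a plain rating-prediction factorization, not GraphLab's ranking factorization model, and the function name, learning rate, epoch count, and toy ratings are all assumptions of this sketch.

```python
import math
import random

def factorize(ratings, num_factors=2, regularization=0.01, lr=0.05, epochs=200, seed=0):
    """ratings: list of (user, item, rating).

    Returns (predict function, RMSE before training, RMSE after training).
    """
    rng = random.Random(seed)
    users = sorted(set(u for u, _, _ in ratings))
    items = sorted(set(i for _, i, _ in ratings))
    # Small random latent vectors for every user (P) and item (Q).
    P = dict((u, [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]) for u in users)
    Q = dict((i, [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]) for i in items)
    mu = sum(r for _, _, r in ratings) / float(len(ratings))  # global mean rating

    def predict(u, i):
        return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

    def train_rmse():
        return math.sqrt(sum((predict(u, i) - r) ** 2 for u, i, r in ratings) / len(ratings))

    start = train_rmse()
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            for f in range(num_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error plus an L2 penalty on the factors.
                P[u][f] += lr * (err * qi - regularization * pu)
                Q[i][f] += lr * (err * pu - regularization * qi)
    return predict, start, train_rmse()

toy = [("u1", "A", 5), ("u1", "B", 1), ("u2", "A", 4),
       ("u2", "B", 2), ("u3", "A", 1), ("u3", "B", 5)]
predict, start, end = factorize(toy)
```

Raising `regularization` shrinks the factors toward zero (predictions drift toward the global mean), while raising `num_factors` gives the model more capacity to fit, and potentially overfit, the observed ratings.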

In [12]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="user", 
                                  item_id="movie", 
                                  target="rating",
                                  num_factors=16,                 # override the default value
                                  regularization=1e-02,           # override the default value
                                  linear_regularization = 1e-3)   # override the default value

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.2       | 0.00335050849505 |
|   2    |      0.15      | 0.00535421397352 |
|   3    | 0.143333333333 | 0.00846222045566 |
|   4    |      0.14      | 0.0109455059885  |
|   5    |     0.128      | 0.0127051484801  |
|   6    | 0.121666666667 | 0.0147067887307  |
|   7    |      0.12      | 0.0175809582432  |
|   8    |    0.12125     | 0.0205166948919  |
|   9    |      0.12      | 0.0225858038243  |
|   10   |     0.118      | 0.0247895619665  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.0348045549757179)

Per User RMSE (best)
+--------------+-------+----------------+
|     user     | count |      rmse      |
+--------------+-------+----------------+
| Andres Smith |   3   | 0.351964396006 |
+--------------+-------+-------------

# Is this better than the item-item matrix?

In [13]:
comparison = gl.recommender.util.compare_models(test, [item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.18      | 0.00851461400757 |
|   2    |      0.17      | 0.0114813179455  |
|   3    |      0.19      | 0.0203175491454  |
|   4    |     0.1975     | 0.0275973558931  |
|   5    |     0.194      | 0.0327684686723  |
|   6    | 0.186666666667 | 0.0354018807673  |
|   7    | 0.195714285714 | 0.0452641935336  |
|   8    |     0.1975     | 0.0512779743288  |
|   9    | 0.196666666667 |  0.055312349506  |
|   10   |     0.186      | 0.0609766860575  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.633760811056906)

Per User RMSE (best)
+------------+-------+---------------+
|    user    | count |      rmse     |
+------------+-------+---------------+
| Zion Smith |  408  | 1.80170257405 |
+------------+-------+

In [14]:
comparisonstruct = gl.compare(test,[item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.18      | 0.00851461400757 |
|   2    |      0.17      | 0.0114813179455  |
|   3    |      0.19      | 0.0203175491454  |
|   4    |     0.1975     | 0.0275973558931  |
|   5    |     0.194      | 0.0327684686723  |
|   6    | 0.186666666667 | 0.0354018807673  |
|   7    | 0.195714285714 | 0.0452641935336  |
|   8    |     0.1975     | 0.0512779743288  |
|   9    | 0.196666666667 |  0.055312349506  |
|   10   |     0.186      | 0.0609766860575  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.2 

In [15]:
gl.show_comparison(comparisonstruct,[item_item, rec1])

## Parameters, Parameters
There are so many parameters to search through here. It would be great if there were something that could vary the parameters automatically and search for the best ones...
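
Conceptually, an automated search just tries combinations of parameter values and keeps the one with the best validation score. The sketch below shows an exhaustive grid version in plain Python; it is not how GraphLab's search is implemented, and `grid_search`, `toy_validation_error`, and the grid values are hypothetical stand-ins (a real scoring function would train a model and evaluate it on held-out data).

```python
import itertools

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid.

    param_grid: dict mapping parameter name -> list of candidate values
    score_fn: callable taking a params dict and returning a validation
              error, where lower is better (hypothetical interface)
    Returns (best_params, best_score, all_results).
    """
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_params, best_score, results

def toy_validation_error(params):
    # Stand-in for "train a model and measure validation RMSE":
    # this made-up function is smallest at 16 factors and 0.001 regularization.
    return abs(params["num_factors"] - 16) / 16.0 + abs(params["regularization"] - 0.001)

grid = {"num_factors": [8, 12, 16, 24, 32], "regularization": [0.001, 0.01]}
best_params, best_score, results = grid_search(grid, toy_validation_error)
```

The cost grows multiplicatively with each added parameter list, which is why capping the number of models tried is a common compromise.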

In [16]:
params = {'user_id': 'user', 
          'item_id': 'movie', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)

# note that this evaluator also supports sklearn
# https://dato.com/products/create/docs/generated/graphlab.toolkits.model_parameter_search.create.html?highlight=model_parameter_search

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.job: Creating a LocalAsync environment called 'async'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Jan-11-2017-12-19-4400000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Jan-11-2017-12-19-4400000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Jan-11-2017-12-19-4400000' already exists. Renaming the job to 'Model-Parameter-Search-Jan-11-2017-12-19-4400000-06dbb'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Jan-11-2017-12-19-4400000-06dbb' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Jan-11-2017-12-19-4400000-06dbb' scheduled.


In [24]:
job.get_status()

{'Canceled': 0, 'Completed': 5, 'Failed': 0, 'Pending': 0, 'Running': 0}

In [25]:
job_result = job.get_results()

job_result.head()

model_id,item_id,linear_regularization,max_iterations,num_factors,num_sampled_negative_examples,ranking_regularization
1,movie,0.001,50,16,4,0.25
0,movie,0.001,50,24,8,0.1
3,movie,0.001,50,16,4,0.1
2,movie,0.001,50,16,4,0.1
4,movie,0.001,25,12,4,0.25

regularization,target,user_id,training_precision@5,training_recall@5,training_rmse,validation_precision@5
0.001,rating,user,0.343113772455,0.00876171469214,1.0212219536,0.136
0.001,rating,user,0.343113772455,0.00876171469214,0.963738769195,0.13
0.001,rating,user,0.343113772455,0.00876171469214,0.95876258495,0.13
0.001,rating,user,0.343113772455,0.00876171469214,0.958716636348,0.128
0.001,rating,user,0.343113772455,0.00876171469214,1.02112743337,0.132

validation_recall@5,validation_rmse
0.013476647209,1.03276903381
0.0127847906784,0.979829082521
0.0132007997687,0.975865328724
0.0126452442132,0.975950897341
0.0129745104568,1.03281822189


In [26]:
bst_prms = job.get_best_params()
bst_prms

{'item_id': 'movie',
 'linear_regularization': 0.001,
 'max_iterations': 50,
 'num_factors': 16,
 'num_sampled_negative_examples': 4,
 'ranking_regularization': 0.1,
 'regularization': 0.001,
 'target': 'rating',
 'user_id': 'user'}

In [27]:
models = job.get_models()
models

[Class                            : RankingFactorizationRecommender
 
 Schema
 ------
 User ID                          : user
 Item ID                          : movie
 Target                           : rating
 Additional observation features  : 0
 User side features               : []
 Item side features               : []
 
 Statistics
 ----------
 Number of observations           : 77252
 Number of users                  : 334
 Number of items                  : 7474
 
 Training summary
 ----------------
 Training time                    : 6.7992
 
 Model Parameters
 ----------------
 Model class                      : RankingFactorizationRecommender
 num_factors                      : 24
 binary_target                    : 0
 side_data_factorization          : 1
 solver                           : auto
 nmf                              : 0
 max_iterations                   : 50
 
 Regularization Settings
 -----------------------
 regularization                   : 0.001
 regulari

In [28]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.2       | 0.00370998561923 |
|   2    |     0.165      | 0.0063380268465  |
|   3    | 0.146666666667 | 0.00871312091682 |
|   4    |     0.1375     | 0.0107087972136  |
|   5    |      0.13      | 0.0127847906784  |
|   6    | 0.123333333333 | 0.0146200367876  |
|   7    | 0.118571428571 | 0.0173911945869  |
|   8    |      0.11      | 0.0184748874076  |
|   9    | 0.113333333333 | 0.0219346654043  |
|   10   |     0.108      | 0.0230003206733  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.19

In [29]:
models[1]

Class                            : RankingFactorizationRecommender

Schema
------
User ID                          : user
Item ID                          : movie
Target                           : rating
Additional observation features  : 0
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 77252
Number of users                  : 334
Number of items                  : 7474

Training summary
----------------
Training time                    : 5.0819

Model Parameters
----------------
Model class                      : RankingFactorizationRecommender
num_factors                      : 16
binary_target                    : 0
side_data_factorization          : 1
solver                           : auto
nmf                              : 0
max_iterations                   : 50

Regularization Settings
-----------------------
regularization                   : 0.001
regularization_type              : normal
l