In [1]:
import pandas as pd

# Load movie metadata and user ratings
items = pd.read_csv('movies.csv', encoding='latin-1')
ratings = pd.read_csv('ratings.csv', encoding='latin-1')

# Count the distinct users and movies that appear in the ratings
number_of_users = ratings.userId.nunique()
number_of_items = ratings.movieId.nunique()
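
The user and item counts computed above are usually used to gauge how sparse the rating matrix is. The following is a minimal sketch of that calculation; the helper name `rating_density` and the toy numbers are hypothetical and are not part of the original notebook.

```python
def rating_density(num_ratings, num_users, num_items):
    """Fraction of the user-item matrix that has an observed rating."""
    if num_users <= 0 or num_items <= 0:
        raise ValueError("num_users and num_items must be positive")
    return float(num_ratings) / (num_users * num_items)


# Hypothetical toy numbers: 6 ratings spread over 3 users and 4 items
density = rating_density(6, 3, 4)
sparsity = 1.0 - density
print("density = {:.2f}, sparsity = {:.2f}".format(density, sparsity))
```

With the notebook's own variables, the call would be `rating_density(len(ratings), number_of_users, number_of_items)`.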

In [2]:
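# sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

# Hold out 25% of the ratings for testing
training_set, testing_set = train_test_split(ratings, test_size=0.25)

# Convert the data to SFrames for GraphLab Create
import graphlab
item_data = graphlab.SFrame.read_csv('movies.csv')
train_data = graphlab.SFrame(training_set)
test_data = graphlab.SFrame(testing_set)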

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1525656170.log


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
# Create a content-based recommender from the movie features and the training ratings
item_content_model = graphlab.item_content_recommender.create(
    item_data, item_id='movieId',
    observation_data=train_data, user_id='userId', target='rating')

('Applying transform:\n', Class             : AutoVectorizer

Model Fields
------------
Features          : ['title', 'genre', 'genre.1', 'genre.2', 'genre.3', 'genre.4', 'genre.5', 'genre.6', 'genre.7', 'genre.8', 'genre.9']
Excluded Features : ['movieId']

Column   Type  Interpretation  Transforms                         Output Type
-------  ----  --------------  ---------------------------------  -----------
title    str   short_text      3-Character NGram Counts -> TFIDF  dict       
genre    str   categorical     None                               str        
genre.1  str   categorical     None                               str        
genre.2  str   categorical     None                               str        
genre.3  str   categorical     None                               str        
genre.4  str   categorical     None                               str        
genre.5  str   categorical     None                               str        
genre.6  str   categorical     None    

Defaulting to brute force instead of ball tree because there are multiple distance components.


In [4]:
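# Evaluate the content-based recommender on the held-out ratings.
# Note: this model scores items by content similarity, so its predictions may not
# be on the rating scale, and its RMSE may not be directly comparable to other models.
item_content_model.evaluate(test_data, metric='rmse')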

('\nOverall RMSE: ', 3.6439167668741383)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  579   |   4   | 1.08430300936 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  114   |   7   | 4.86973158545 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+---------+-------+----------------+
| movieId | count |      rmse      |
+---------+-------+----------------+
|  113345 |   1   | 0.424698862713 |
+---------+-------+----------------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------+-------+------+
| movieId | count | rmse |
+---------+-------+------+
|   2267  |   1   | 5.0  |
+---------+-------+------+
[1 rows x 3 columns]



{'rmse_by_item': Columns:
 	movieId	int
 	count	int
 	rmse	float
 
 Rows: 5393
 
 Data:
 +---------+-------+---------------+
 | movieId | count |      rmse     |
 +---------+-------+---------------+
 |   7899  |   1   | 2.97408619549 |
 |   5288  |   1   | 3.99274527807 |
 |   3143  |   1   | 3.90706007972 |
 |   6769  |   1   | 4.95779619534 |
 |   2779  |   5   | 3.87808028801 |
 |   3988  |   4   | 2.60710347744 |
 |   2847  |   1   | 3.98464985781 |
 |  64614  |   7   | 3.77664281953 |
 |   2925  |   1   | 4.98490713525 |
 |   2871  |   13  | 4.01522722571 |
 +---------+-------+---------------+
 [5393 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
 	userId	int
 	count	int
 	rmse	float
 
 Rows: 671
 
 Data:
 +--------+-------+---------------+
 | userId | count |      rmse     |
 +--------+-------+---------------+
 |  118   |   44  | 4.04345737371 |
 |  435 
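
The evaluation output above reports an overall RMSE plus per-user and per-item breakdowns. As a minimal sketch of how such numbers can be computed from (actual, predicted) pairs: the function names and the toy rows below are hypothetical, and this is not GraphLab's internal implementation.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


def rmse_by_key(rows, key_index):
    """Group (userId, movieId, actual, predicted) rows by one column.

    Returns {key: (count, rmse)}, mirroring the count and rmse columns above.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append((row[2], row[3]))
    return {key: (len(vals), rmse(vals)) for key, vals in groups.items()}


# Hypothetical toy rows: (userId, movieId, actual rating, predicted rating)
rows = [
    (1, 10, 4.0, 3.5),
    (1, 20, 3.0, 3.0),
    (2, 10, 5.0, 3.0),
]

overall = rmse((r[2], r[3]) for r in rows)
per_user = rmse_by_key(rows, 0)  # keyed by userId
per_item = rmse_by_key(rows, 1)  # keyed by movieId
```

For a group with a single rating, the RMSE reduces to the absolute error of that one prediction, which is consistent with the best and worst per-item rows above both having a count of 1.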

In [5]:
# Create a factorization recommender, passing the movie metadata as item side data
factorization_model = graphlab.factorization_recommender.create(
    train_data, user_id='userId', item_id='movieId',
    target='rating', item_data=item_data)

In [6]:
# Evaluate the factorization model on the held-out ratings
factorization_model.evaluate(test_data, metric='rmse')

('\nOverall RMSE: ', 1.0967028807754542)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|   50   |   6   | 0.123159515573 |
+--------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  348   |   12  | 3.01754769951 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+---------+-------+------------------+
| movieId | count |       rmse       |
+---------+-------+------------------+
|   3990  |   1   | 0.00102790242491 |
+---------+-------+------------------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------+-------+--------------+
| movieId | count |     rmse     |
+---------+-------+--------------+
|   3696  |   1   | 5.6056562209 |
+---------+-------+--------------+
[1 rows x 3 columns]



{'rmse_by_item': Columns:
 	movieId	int
 	count	int
 	rmse	float
 
 Rows: 5393
 
 Data:
 +---------+-------+----------------+
 | movieId | count |      rmse      |
 +---------+-------+----------------+
 |   7899  |   1   | 0.385133612469 |
 |   5288  |   1   | 0.086001780721 |
 |   3143  |   1   | 0.494912838526 |
 |   6769  |   1   | 1.59656408739  |
 |   2779  |   5   | 0.371201619234 |
 |   3988  |   4   | 0.865027622605 |
 |   2847  |   1   | 0.095394471638 |
 |  64614  |   7   | 1.19945122268  |
 |   2925  |   1   | 0.333615951525 |
 |   2871  |   13  | 1.03122261501  |
 +---------+-------+----------------+
 [5393 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
 	userId	int
 	count	int
 	rmse	float
 
 Rows: 671
 
 Data:
 +--------+-------+----------------+
 | userId | count |      rmse      |
 +--------+-------+----------------+
 |  118   |   44  | 0.75246

In [7]:
# Create a popularity-based recommender as a baseline
popularity_model = graphlab.popularity_recommender.create(
    train_data, user_id='userId', item_id='movieId',
    target='rating', item_data=item_data)

In [8]:
# Evaluate the popularity model on the held-out ratings
popularity_model.evaluate(test_data, metric='rmse')

('\nOverall RMSE: ', 1.176296438567179)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|   14   |   2   | 0.122159768359 |
+--------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  609   |   30  | 2.64633859938 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+---------+-------+------+
| movieId | count | rmse |
+---------+-------+------+
|   4976  |   1   | 0.0  |
+---------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------+-------+------+
| movieId | count | rmse |
+---------+-------+------+
|   6769  |   1   | 5.0  |
+---------+-------+------+
[1 rows x 3 columns]



{'rmse_by_item': Columns:
 	movieId	int
 	count	int
 	rmse	float
 
 Rows: 5393
 
 Data:
 +---------+-------+----------------+
 | movieId | count |      rmse      |
 +---------+-------+----------------+
 |   7899  |   1   |      0.25      |
 |   5288  |   1   |      0.0       |
 |   3143  |   1   |      4.0       |
 |   6769  |   1   |      5.0       |
 |   2779  |   5   | 0.392308870662 |
 |   3988  |   4   | 0.442605919527 |
 |   2847  |   1   |      0.5       |
 |  64614  |   7   | 0.892944619462 |
 |   2925  |   1   | 0.916666666667 |
 |   2871  |   13  | 0.488848584348 |
 +---------+-------+----------------+
 [5393 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
 	userId	int
 	count	int
 	rmse	float
 
 Rows: 671
 
 Data:
 +--------+-------+----------------+
 | userId | count |      rmse      |
 +--------+-------+----------------+
 |  118   |   44  | 0.77489
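
To compare the three models side by side, the overall RMSE values printed by the evaluation cells above can be collected and ranked. This is a small sketch in plain Python; the dictionary keys are illustrative labels, and the numbers are copied from the outputs above for this particular random split.

```python
# Overall RMSE values copied from the evaluation outputs above
overall_rmse = {
    "item_content": 3.6439167668741383,
    "factorization": 1.0967028807754542,
    "popularity": 1.176296438567179,
}

# Rank models from lowest (best) to highest (worst) RMSE
ranked = sorted(overall_rmse.items(), key=lambda kv: kv[1])
for name, value in ranked:
    print("{:<14} {:.4f}".format(name, value))

best_name, best_value = ranked[0]
```

Because the train/test split is random and no seed is set, rerunning the notebook can be expected to give somewhat different values.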