#  Probabilistic Matrix Factorization of the MovieLens Ratings

Something something . Matrix factorization is a simple and powerful technique for predicting user's ratings of items using embedding in a latent space.

Modern methods have many useful adaptations, but the general system predicts user $u$'s rating of item $i$, $r_{u,i}$  as 
  *  $r_{u,i} = \mu + \beta_u + \beta_i + \vec{v}_u^T \cdot \vec{v}_i^T$
  *  $\mu$, $\beta_u$, and $\beta_i$ are overall mean ratings and offsets for users and movies
  *  $\vec{v}_i, \vec{v}_u \in \mathbb{R}^K$ are the user's and movie's embeddings in a $K$ dimensional space.
  *  We regularize user and item vectors by modeling them as coming from a multivariate gaussian, which can be learned or predetermined $\vec{v}_u \sim N(0, \Lambda_{\text{Users}})$ and $\vec{v}_i \sim N(0, \Lambda_{\text{Movies}})$.

The most famous application of PMF for recomendation systems is probably Netflix's.  Their system is described in [this paper](https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf) and you can also find a readable overview [here](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf).  Netflix took down the original dataset due to privacy concerns, but you can still run the same model using the [MovieLens data](https://grouplens.org/datasets/movielens/20m/).

Need to cite `MovieLens` with ```F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872```.

The MovieLens data comes contains 20,000,263 ratings of 27,278 movies by 138,493 users. It also contains free text tags, but we will not use them here. 

##  Read in the Data

Download [the data](https://grouplens.org/datasets/movielens/) and unzip.

We will only use `ml-20/movies.csv` and `ml-20/ratings.csv`.

In [29]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd

%pylab inline

movies = pd.read_csv('ml-20m/movies.csv')
ratings = pd.read_csv('ml-20m/ratings.csv')

In [187]:
print(movies.shape)
print(movies.head())

(27278, 3)
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [188]:
print(ratings.shape)
print(ratings.head())

(20000263, 4)
   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580


Let's cherry pick a few movies whose parameters we will track.  We'll pick the ones with the most ratings - hopefully our latent dimensions will be interpretable. 

We'll also bring in the titles now.  We'll need to remap the user and movie IDs, and bringing in the titles helps make sure we don't make mistakes mapping them back later on.

In [190]:
ratings = ratings.merge(movies, on = 'movieId', how = 'left')

In [192]:
print(ratings.head())

   userId  movieId  rating   timestamp  \
0       1        2     3.5  1112486027   
1       1       29     3.5  1112484676   
2       1       32     3.5  1112484819   
3       1       47     3.5  1112484727   
4       1       50     3.5  1112484580   

                                               title  \
0                                     Jumanji (1995)   
1  City of Lost Children, The (Cité des enfants p...   
2          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   
3                        Seven (a.k.a. Se7en) (1995)   
4                         Usual Suspects, The (1995)   

                                   genres  
0              Adventure|Children|Fantasy  
1  Adventure|Drama|Fantasy|Mystery|Sci-Fi  
2                 Mystery|Sci-Fi|Thriller  
3                        Mystery|Thriller  
4                  Crime|Mystery|Thriller  


In [194]:
ratings['title'].value_counts()[0:5]

Pulp Fiction (1994)                 67310
Forrest Gump (1994)                 66172
Shawshank Redemption, The (1994)    63366
Silence of the Lambs, The (1991)    63299
Jurassic Park (1993)                59715
Name: title, dtype: int64

In [197]:
ratings.loc[ratings['rating'] >= 4.0]['title'].value_counts()[0:5]

Shawshank Redemption, The (1994)             55807
Pulp Fiction (1994)                          52353
Silence of the Lambs, The (1991)             50114
Forrest Gump (1994)                          47331
Star Wars: Episode IV - A New Hope (1977)    42612
Name: title, dtype: int64

In [198]:
ratings.loc[ratings['rating'] <= 1.0]['title'].value_counts()[0:5]

Dumb & Dumber (Dumb and Dumber) (1994)    4578
Ace Ventura: Pet Detective (1994)         4323
Ace Ventura: When Nature Calls (1995)     3976
Waterworld (1995)                         3013
Blair Witch Project, The (1999)           2992
Name: title, dtype: int64

In [200]:
_, _, _ = plt.hist(np.log(ratings['title'].value_counts()), 30)

<matplotlib.figure.Figure at 0x189e1c250>

In [206]:
print(ratings['rating'].value_counts())

4.0    5561926
3.0    4291193
5.0    2898660
3.5    2200156
4.5    1534824
2.0    1430997
2.5     883398
1.0     680732
1.5     279252
0.5     239125
Name: rating, dtype: int64


In [None]:
easier_ratings = ratings.loc[ratings

##  Model Definition

Let's go ahead and define our model before we reformat our data.

We will use a vanilla PMF - exactly the one defined above, with predetermined $\Lambda_{\text{movies}} = \Lambda_{\text{users}} = I_K$ and $K=3$.

In [217]:
from edward.models import Normal
#  This will need to change if you train/test split
N_users = len(set(ratings.userId))
N_movies = len(set(ratings.movieId))
N_ratings = ratings.shape[0]
K = 2

lnvar_users = Normal(loc=tf.zeros([1]), scale=tf.ones([1]))
lnvar_movies = Normal(loc=tf.zeros([1]), scale=tf.ones([1]))
sigma_users = tf.sqrt(tf.exp(lnvar_users))
sigma_movies = tf.sqrt(tf.exp(lnvar_users))

user_vecs = Normal(loc = tf.zeros([N_users, K]), 
                   scale = sigma_users * tf.ones([N_users, K]))
movie_vecs = Normal(loc = tf.zeros([N_movies, K]), 
                    scale = sigma_movies * tf.ones([N_movies, K]))

#  Somewhat hacky prior on mu
mu = Normal(loc = 2.5*tf.ones([1]), 
            scale = tf.ones([1]))

user_betas = Normal(loc = tf.zeros([N_users]), 
                    scale = sigma_users * tf.ones([N_users]))
movie_betas = Normal(loc = tf.zeros([N_movies]), 
                     scale = sigma_movies * tf.ones([N_movies]))

#  Placeholders for data inputs
user_ids = tf.placeholder(tf.int32, [N_ratings])
movie_ids = tf.placeholder(tf.int32, [N_ratings])

predicted_ratings = tf.reduce_sum(tf.multiply(
    tf.gather(user_vecs, user_ids),
    tf.gather(movie_vecs, movie_ids)
)) + \
    tf.gather(user_betas, user_ids) + \
    tf.gather(movie_betas, movie_ids) + \
    mu

obs_ratings = Normal(loc=predicted_ratings, scale = tf.ones([N_ratings]))

##  Inference Definition

We have now sret up a probablistic graph for generating ratings from our learnable paramters (mu, offsets, and vectors).  We now explicitly define our inference.  Edward makes it easy to swap out sampling, ML, and variational methods.  We'll use simple MFVI.  

In [218]:
user_ids_train = ratings['userId']
movie_ids_train = ratings['movieId'].astype('category').cat.codes

user_ids_train = user_ids_train.values.astype(int) - 1
movie_ids_train = movie_ids_train.values.astype(int)
ratings_train = ratings['rating'].values.astype(float)

In [219]:
# INFERENCE
q_user_vecs = Normal(loc=tf.Variable(tf.random_normal([N_users, K])),
                     scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_users, K]))))
q_movie_vecs = Normal(loc=tf.Variable(tf.random_normal([N_movies, K])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_movies, K]))))
q_user_betas = Normal(loc=tf.Variable(tf.random_normal([N_users])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_users]))))
q_movie_betas = Normal(loc=tf.Variable(tf.random_normal([N_movies])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_movies]))))

q_mu = Normal(loc=tf.Variable(tf.random_normal([1])),
              scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
q_lnvar_users = Normal(loc=tf.Variable(tf.random_normal([1])),
                       scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
q_lnvar_movies = Normal(loc=tf.Variable(tf.random_normal([1])),
                        scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))

                                    
parameter_inferences = {
    user_vecs: q_user_vecs,
    movie_vecs: q_movie_vecs,
    user_betas: q_user_betas,
    movie_betas: q_movie_betas,
    mu: q_mu,
    lnvar_users: q_lnvar_users,
    lnvar_movies: q_lnvar_movies
}
train_data = {
    user_ids: user_ids_train,
    movie_ids: movie_ids_train,
    obs_ratings: ratings_train
}

inference = ed.KLqp(parameter_inferences,
                    train_data)


This needs to go further up.

Now let's plug in our data into the data placeholders.

We also want to check that the IDs make don't skip any indices, which can cause problems.  We need our actual IDs to match up with the shapes of our parameters.

#  Run the Inference

We now have everything set up and can let SGD run. 

In [260]:
optimizer = tf.train.AdamOptimizer(learning_rate=1e-9)

inference.initialize(optimizer = optimizer,
                     n_print=10, n_iter=2000)
tf.global_variables_initializer().run()

In [261]:
for _ in range(inference.n_iter):
    info_dict = inference.update()
    #inference.print_progress(info_dict)
    if info_dict['t'] % inference.n_print == 0:
        l_per_rating = info_dict['loss'] / (1.0 * N_ratings)
        print('Iter: {}    Loss per rating: {} \n'.format(info_dict['t'], l_per_rating))
    

KeyboardInterrupt: 

Iter: 1000    Loss per rating: 121133678.13 



Iter: 990    Loss per rating: 41896925.4922 



Iter: 980    Loss per rating: 58324680.8726 



Iter: 970    Loss per rating: 506141549.864 



Iter: 960    Loss per rating: 872019.147271 



Iter: 950    Loss per rating: 50374391.59 



Iter: 940    Loss per rating: 28899612.6886 



Iter: 930    Loss per rating: 9485984.13775 



Iter: 920    Loss per rating: 6305734.37649 



Iter: 910    Loss per rating: 67237674.4151 



Iter: 900    Loss per rating: 30284201.3406 



Iter: 890    Loss per rating: 200744145.051 



Iter: 880    Loss per rating: 64640716.3243 



Iter: 870    Loss per rating: 4783786.172 



Iter: 860    Loss per rating: 63228536.2258 



Iter: 850    Loss per rating: 143258.920856 



Iter: 840    Loss per rating: 35758112.1536 



Iter: 830    Loss per rating: 75124123.875 



Iter: 820    Loss per rating: 59113199.6556 



Iter: 810    Loss per rating: 22813915.3174 



Iter: 800    Loss per rating: 82531986.0537 



Iter: 790    Loss per rating: 148167739.263 



Iter: 780    Loss per rating: 45525809.7258 



Iter: 770    Loss per rating: 21037677.7745 



Iter: 760    Loss per rating: 91955470.0117 



Iter: 750    Loss per rating: 99319946.8804 



Iter: 740    Loss per rating: 2019697.33085 



Iter: 730    Loss per rating: 62466639.1791 



Iter: 720    Loss per rating: 8434857.5626 



Iter: 710    Loss per rating: 3767208.04401 



Iter: 700    Loss per rating: 151569724.701 



Iter: 690    Loss per rating: 11029.6915548 



Iter: 680    Loss per rating: 35703825.1519 



Iter: 670    Loss per rating: 165008124.845 



Iter: 660    Loss per rating: 89387985.7464 



Iter: 650    Loss per rating: 219809415.231 



Iter: 640    Loss per rating: 1509138.02481 



Iter: 630    Loss per rating: 470044112.913 



Iter: 620    Loss per rating: 8964706.95307 



Iter: 610    Loss per rating: 47918534.7411 



Iter: 600    Loss per rating: 27056215.1555 



Iter: 590    Loss per rating: 181949521.361 



Iter: 580    Loss per rating: 124899496.2 



Iter: 570    Loss per rating: 25436317.3994 



Iter: 560    Loss per rating: 11170972.4416 



Iter: 550    Loss per rating: 21073402.7084 



Iter: 540    Loss per rating: 35650095.1465 



Iter: 530    Loss per rating: 68610086.3238 



Iter: 520    Loss per rating: 251561033.25 



Iter: 510    Loss per rating: 161668603.517 



Iter: 500    Loss per rating: 5244059.26549 



Iter: 490    Loss per rating: 15347722.8811 



Iter: 480    Loss per rating: 655234.519871 



Iter: 470    Loss per rating: 10968798.7422 



Iter: 460    Loss per rating: 1907470.13679 



Iter: 450    Loss per rating: 30325375.4426 



Iter: 440    Loss per rating: 70950973.7787 



Iter: 430    Loss per rating: 75499190.3838 



Iter: 420    Loss per rating: 1468.08539868 



Iter: 410    Loss per rating: 15247646.4259 



Iter: 400    Loss per rating: 78275071.775 



Iter: 390    Loss per rating: 34354860.772 



Iter: 380    Loss per rating: 133527676.758 



Iter: 370    Loss per rating: 290383913.234 



Iter: 360    Loss per rating: 668425.851915 



Iter: 350    Loss per rating: 39806713.1915 



Iter: 340    Loss per rating: 2617862.4368 



Iter: 330    Loss per rating: 15073496.1808 



Iter: 320    Loss per rating: 83387364.5187 



Iter: 310    Loss per rating: 7770466.73535 



Iter: 300    Loss per rating: 3110.45289594 



Iter: 290    Loss per rating: 35650064.9479 



Iter: 280    Loss per rating: 2894855.24424 



Iter: 270    Loss per rating: 1224134.30693 



Iter: 260    Loss per rating: 643921.057891 



Iter: 250    Loss per rating: 5908566.69285 



Iter: 240    Loss per rating: 41369627.8827 



Iter: 230    Loss per rating: 2237122.99676 



Iter: 220    Loss per rating: 6282920.17932 



Iter: 210    Loss per rating: 63006.0520336 



Iter: 200    Loss per rating: 86786826.6959 



Iter: 190    Loss per rating: 3828247.36689 



Iter: 180    Loss per rating: 9381428.22482 



Iter: 170    Loss per rating: 150907597.09 



Iter: 160    Loss per rating: 114948241.521 



Iter: 150    Loss per rating: 76718636.3593 



Iter: 140    Loss per rating: 7582205.35937 



Iter: 130    Loss per rating: 23310464.0497 



Iter: 120    Loss per rating: 64372351.5062 



Iter: 110    Loss per rating: 1395539.7308 



Iter: 100    Loss per rating: 32242.3246342 



Iter: 90    Loss per rating: 3372113.15815 



Iter: 80    Loss per rating: 51208580.7098 



Iter: 70    Loss per rating: 381995756.826 



Iter: 60    Loss per rating: 101763.133008 



Iter: 50    Loss per rating: 71390249.2042 



Iter: 40    Loss per rating: 136806854.589 



Iter: 30    Loss per rating: 192943526.7 



Iter: 20    Loss per rating: 3543520.784 



Iter: 10    Loss per rating: 8081945.91508 



I paused at 1k iterations to check results. 

#  Model Criticism

Just as a first sanity check, let's see which movies the model thinks are best. 

In [262]:
fit_movie_means = q_movie_betas.mean().eval()

In [263]:
movie_categories_df['fit_beta'] = fit_movie_means

In [264]:
movie_betas_df = movie_categories_df.merge(movies, on = 'movieId', how='left')

In [265]:
movie_betas_df.sort_values(['fit_beta']).head()

       movieId  movie_cat_code  fit_beta  \
19801    98065           19801 -3.994807   
1520      1571            1520 -3.844569   
20902   102403           20902 -3.637028   
14593    73101           14593 -3.593342   
23582   112882           23582 -3.551799   

                                                   title  \
19801                             Brooklyn Castle (2012)   
1520   When the Cat's Away (Chacun cherche son chat) ...   
20902                               Intruder, The (1999)   
14593                            Looking for Eric (2009)   
23582                    Butcher Boys (Bone Boys) (2012)   

                               genres  
19801                     Documentary  
1520                   Comedy|Romance  
20902  Drama|Mystery|Romance|Thriller  
14593            Comedy|Drama|Fantasy  
23582   Action|Comedy|Horror|Thriller  

In [266]:
movie_betas_df.sort_values(['fit_beta'], ascending=False).head(5)

       movieId  movie_cat_code  fit_beta  \
22171   107100           22171  3.856842   
24497   116943           24497  3.644660   
14141    71033           14141  3.598436   
19740    97876           19740  3.565846   
5785      5884            5785  3.528264   

                                                   title  \
22171                                  Sambizanga (1973)   
24497                        Thesis on a Homicide (2013)   
14141  Secret in Their Eyes, The (El secreto de sus o...   
19740                Making the Earth Stand Still (1995)   
5785                 Chopper Chicks in Zombietown (1989)   

                                     genres  
22171                                 Drama  
24497                Crime|Mystery|Thriller  
14141  Crime|Drama|Mystery|Romance|Thriller  
19740                           Documentary  
5785                          Comedy|Horror  

It pretty clearly looks to me like the model is not yet finding ratings that make any sense at all.  Let's see if it's putting anything into the latent dimensions.  None of the 'best' movies fit. 

In [None]:
Let's also check the other learned parameters to see if they make sense.

In [267]:
print('Mu is at {}'.format(q_mu.mean().eval()))

print('Sigma for movies is at {}'.format(np.sqrt(np.exp(q_lnvar_users.mean().eval()))))
print('Sigma for users is at {}'.format(np.sqrt(np.exp(q_lnvar_movies.mean().eval()))))

Sigma for users is at [ 1.26464629]


Sigma for movies is at [ 1.72941291]


Mu is at [-0.03395778]


We can also check out the distribution of fit offsets for movies. Nothing too concerning.

In [133]:
_, _, _ = plt.hist(fit_movie_means, 30)

<matplotlib.figure.Figure at 0x16ab68190>

OK, so it does not look like I am grabbing the best movies.  

These are pretty much random, so I must be doing the IDs incorrectly.  I could also try displaying the most reviewed movies in two dimensions.  But let's first make sure I have the IDs right. 

#  Visualizing the Latent Dimensions

Now let's see if the factorization has learned meaningfull embeddings.

First let's pick a small subset of the movies to plot.

In [None]:
movie_counts = ratings['movieId'].value_counts()