#  Probabilistic Matrix Factorization of the MovieLens Ratings

Something something . Matrix factorization is a simple and powerful technique for predicting user's ratings of items using embedding in a latent space.

Modern methods have many useful adaptations, but the general system predicts user $u$'s rating of item $i$, $r_{u,i}$  as 
  *  $r_{u,i} = \mu + \beta_u + \beta_i + \vec{v}_u^T \cdot \vec{v}_i^T
  *  $\mu$, $\beta_u$, and $\beta_i$ are overall mean ratings and offsets for users and movies
  *  $\vec{v}_i, \vec{v}_u \in \mathbb{R}^K$ are the user's and movie's embeddings in a $K$ dimensional space.
  *  We regularize user and item vectors by modeling them as coming from a multivariate gaussian, which can be learned or predetermined $\vec{v}_i \sim N(0, \Lambda_{\text{Movies}})$ and $\vec{v}_i \sim N(0, \Lambda_{\text{Movies}})$.

The most famous application of PMF for recomendation systems is probably Netflix's.  Their system is described in [this paper](https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf) and you can also find a readable overview [here](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf).  Netflix took down the original dataset due to privacy concerns, but you can still run the same model using the [MovieLens data](https://grouplens.org/datasets/movielens/20m/).

Need to cite `MovieLens` with ```F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872```.

The MovieLens data comes contains 20,000,263 ratings of 27,278 movies by 138,493 users. It also contains free text tags, but we will not use them here. 

##  Read in the Data

Download [the data](https://grouplens.org/datasets/movielens/) and unzip.

We will only use `ml-20/movies.csv` and `ml-20/ratings.csv`.

In [29]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd

%pylab inline

movies = pd.read_csv('ml-20m/movies.csv')
ratings = pd.read_csv('ml-20m/ratings.csv')

In [187]:
print(movies.shape)
print(movies.head())

(27278, 3)
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [188]:
print(ratings.shape)
print(ratings.head())

(20000263, 4)
   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580


Let's cherry pick a few movies whose parameters we will track.  We'll pick the ones with the most ratings - hopefully our latent dimensions will be interpretable. 

We'll also bring in the titles now.  We'll need to remap the user and movie IDs, and bringing in the titles helps make sure we don't make mistakes mapping them back later on.

In [190]:
ratings = ratings.merge(movies, on = 'movieId', how = 'left')

In [192]:
print(ratings.head())

   userId  movieId  rating   timestamp  \
0       1        2     3.5  1112486027   
1       1       29     3.5  1112484676   
2       1       32     3.5  1112484819   
3       1       47     3.5  1112484727   
4       1       50     3.5  1112484580   

                                               title  \
0                                     Jumanji (1995)   
1  City of Lost Children, The (Cité des enfants p...   
2          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   
3                        Seven (a.k.a. Se7en) (1995)   
4                         Usual Suspects, The (1995)   

                                   genres  
0              Adventure|Children|Fantasy  
1  Adventure|Drama|Fantasy|Mystery|Sci-Fi  
2                 Mystery|Sci-Fi|Thriller  
3                        Mystery|Thriller  
4                  Crime|Mystery|Thriller  


In [194]:
ratings['title'].value_counts()[0:5]

Pulp Fiction (1994)                 67310
Forrest Gump (1994)                 66172
Shawshank Redemption, The (1994)    63366
Silence of the Lambs, The (1991)    63299
Jurassic Park (1993)                59715
Name: title, dtype: int64

In [197]:
ratings.loc[ratings['rating'] >= 4.0]['title'].value_counts()[0:5]

Shawshank Redemption, The (1994)             55807
Pulp Fiction (1994)                          52353
Silence of the Lambs, The (1991)             50114
Forrest Gump (1994)                          47331
Star Wars: Episode IV - A New Hope (1977)    42612
Name: title, dtype: int64

In [198]:
ratings.loc[ratings['rating'] <= 1.0]['title'].value_counts()[0:5]

Dumb & Dumber (Dumb and Dumber) (1994)    4578
Ace Ventura: Pet Detective (1994)         4323
Ace Ventura: When Nature Calls (1995)     3976
Waterworld (1995)                         3013
Blair Witch Project, The (1999)           2992
Name: title, dtype: int64

In [200]:
_, _, _ = plt.hist(np.log(ratings['title'].value_counts()), 30)

<matplotlib.figure.Figure at 0x189e1c250>

In [206]:
print(ratings['rating'].value_counts())

4.0    5561926
3.0    4291193
5.0    2898660
3.5    2200156
4.5    1534824
2.0    1430997
2.5     883398
1.0     680732
1.5     279252
0.5     239125
Name: rating, dtype: int64


In [None]:
easier_ratings = ratings.loc[ratings

##  Model Definition

Let's go ahead and define our model before we reformat our data.

We will use a vanilla PMF - exactly the one defined above, with predetermined $\Lambda_{\text{movies}} = \Lambda_{\text{users}} = I_K$ and $K=3$.

In [217]:
from edward.models import Normal
#  This will need to change if you train/test split
N_users = len(set(ratings.userId))
N_movies = len(set(ratings.movieId))
N_ratings = ratings.shape[0]
K = 2

lnvar_users = Normal(loc=tf.zeros([1]), scale=tf.ones([1]))
lnvar_movies = Normal(loc=tf.zeros([1]), scale=tf.ones([1]))
sigma_users = tf.sqrt(tf.exp(lnvar_users))
sigma_movies = tf.sqrt(tf.exp(lnvar_users))

user_vecs = Normal(loc = tf.zeros([N_users, K]), 
                   scale = sigma_users * tf.ones([N_users, K]))
movie_vecs = Normal(loc = tf.zeros([N_movies, K]), 
                    scale = sigma_movies * tf.ones([N_movies, K]))

#  Somewhat hacky prior on mu
mu = Normal(loc = 2.5*tf.ones([1]), 
            scale = tf.ones([1]))

user_betas = Normal(loc = tf.zeros([N_users]), 
                    scale = sigma_users * tf.ones([N_users]))
movie_betas = Normal(loc = tf.zeros([N_movies]), 
                     scale = sigma_movies * tf.ones([N_movies]))

#  Placeholders for data inputs
user_ids = tf.placeholder(tf.int32, [N_ratings])
movie_ids = tf.placeholder(tf.int32, [N_ratings])

predicted_ratings = tf.reduce_sum(tf.multiply(
    tf.gather(user_vecs, user_ids),
    tf.gather(movie_vecs, movie_ids)
)) + \
    tf.gather(user_betas, user_ids) + \
    tf.gather(movie_betas, movie_ids) + \
    mu

obs_ratings = Normal(loc=predicted_ratings, scale = tf.ones([N_ratings]))

##  Inference Definition

We have now sret up a probablistic graph for generating ratings from our learnable paramters (mu, offsets, and vectors).  We now explicitly define our inference.  Edward makes it easy to swap out sampling, ML, and variational methods.  We'll use simple MFVI.  

In [218]:
user_ids_train = ratings['userId']
movie_ids_train = ratings['movieId'].astype('category').cat.codes

user_ids_train = user_ids_train.values.astype(int) - 1
movie_ids_train = movie_ids_train.values.astype(int)
ratings_train = ratings['rating'].values.astype(float)

In [219]:
# INFERENCE
q_user_vecs = Normal(loc=tf.Variable(tf.random_normal([N_users, K])),
                     scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_users, K]))))
q_movie_vecs = Normal(loc=tf.Variable(tf.random_normal([N_movies, K])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_movies, K]))))
q_user_betas = Normal(loc=tf.Variable(tf.random_normal([N_users])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_users]))))
q_movie_betas = Normal(loc=tf.Variable(tf.random_normal([N_movies])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_movies]))))

q_mu = Normal(loc=tf.Variable(tf.random_normal([1])),
              scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
q_lnvar_users = Normal(loc=tf.Variable(tf.random_normal([1])),
                       scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
q_lnvar_movies = Normal(loc=tf.Variable(tf.random_normal([1])),
                        scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))

                                    
parameter_inferences = {
    user_vecs: q_user_vecs,
    movie_vecs: q_movie_vecs,
    user_betas: q_user_betas,
    movie_betas: q_movie_betas,
    mu: q_mu,
    lnvar_users: q_lnvar_users,
    lnvar_movies: q_lnvar_movies
}
train_data = {
    user_ids: user_ids_train,
    movie_ids: movie_ids_train,
    obs_ratings: ratings_train
}

inference = ed.KLqp(parameter_inferences,
                    train_data)


This needs to go further up.

Now let's plug in our data into the data placeholders.

We also want to check that the IDs make don't skip any indices, which can cause problems.  We need our actual IDs to match up with the shapes of our parameters.

In [None]:
print(max(ratings['userId']))
print(len(set(ratings['userId'])))
print(max(ratings['movieId']))
print(len(set(ratings['movieId'])))


In [73]:
print(min(set(ratings['movieId'])))
print(min(set(ratings['userId'])))

1
1


#  Run the Inference

We now have everything set up and can let SGD run. 

In [220]:
optimizer = tf.train.AdamOptimizer()

inference.initialize(optimizer = optimizer,
                     n_print=5, n_iter=100)
tf.global_variables_initializer().run()

In [221]:
for _ in range(inference.n_iter):
  info_dict = inference.update()
  inference.print_progress(info_dict)




500/500 [100%] ██████████████████████████████ Elapsed: 608s | Loss: 91254071230464.000

495/500 [ 99%] █████████████████████████████  ETA: 6s | Loss: 1589547800461312.000

490/500 [ 98%] █████████████████████████████  ETA: 12s | Loss: 49143032578048.000 

485/500 [ 97%] █████████████████████████████  ETA: 18s | Loss: 628197150425088.000

480/500 [ 96%] ████████████████████████████   ETA: 24s | Loss: 716284043984896.000

475/500 [ 95%] ████████████████████████████   ETA: 30s | Loss: 1192165507072.000 

470/500 [ 94%] ████████████████████████████   ETA: 36s | Loss: 52606571380736.000

465/500 [ 93%] ███████████████████████████    ETA: 42s | Loss: 1250746016202752.000

460/500 [ 92%] ███████████████████████████    ETA: 48s | Loss: 506891503403008.000

455/500 [ 91%] ███████████████████████████    ETA: 54s | Loss: 333477132107776.000 

450/500 [ 90%] ███████████████████████████    ETA: 60s | Loss: 1204362785325056.000

445/500 [ 89%] ██████████████████████████     ETA: 66s | Loss: 4563371613487104.000

440/500 [ 88%] ██████████████████████████     ETA: 73s | Loss: 565220783161344.000 

435/500 [ 87%] ██████████████████████████     ETA: 79s | Loss: 1350717419814912.000

430/500 [ 86%] █████████████████████████      ETA: 85s | Loss: 145699953967104.000

425/500 [ 85%] █████████████████████████      ETA: 91s | Loss: 76281504858112.000

420/500 [ 84%] █████████████████████████      ETA: 97s | Loss: 9883304853504.000  

415/500 [ 83%] ████████████████████████       ETA: 104s | Loss: 1166632168718336.000

410/500 [ 82%] ████████████████████████       ETA: 110s | Loss: 110261046345728.000

405/500 [ 81%] ████████████████████████       ETA: 116s | Loss: 238170633404416.000 

400/500 [ 80%] ████████████████████████       ETA: 122s | Loss: 1045766286082048.000

395/500 [ 79%] ███████████████████████        ETA: 129s | Loss: 1374396379824128.000

390/500 [ 78%] ███████████████████████        ETA: 135s | Loss: 1306978009743360.000

385/500 [ 77%] ███████████████████████        ETA: 141s | Loss: 2314427957248.000

380/500 [ 76%] ██████████████████████         ETA: 147s | Loss: 72698227064832.000

375/500 [ 75%] ██████████████████████         ETA: 154s | Loss: 12008015527936.000

370/500 [ 74%] ██████████████████████         ETA: 160s | Loss: 87507752452096.000

365/500 [ 73%] █████████████████████          ETA: 167s | Loss: 179799494492160.000 

360/500 [ 72%] █████████████████████          ETA: 173s | Loss: 1015657088942080.000

355/500 [ 71%] █████████████████████          ETA: 179s | Loss: 1450552726650880.000

350/500 [ 70%] █████████████████████          ETA: 186s | Loss: 73251665477632.000

345/500 [ 69%] ████████████████████           ETA: 192s | Loss: 3399475204718592.000

340/500 [ 68%] ████████████████████           ETA: 199s | Loss: 965729100759040.000 

335/500 [ 67%] ████████████████████           ETA: 205s | Loss: 3638324510064640.000

330/500 [ 66%] ███████████████████            ETA: 212s | Loss: 34248782249984.000 

325/500 [ 65%] ███████████████████            ETA: 219s | Loss: 178592558350336.000

320/500 [ 64%] ███████████████████            ETA: 225s | Loss: 3889669079040.000

315/500 [ 63%] ██████████████████             ETA: 232s | Loss: 46653079814144.000  

310/500 [ 62%] ██████████████████             ETA: 238s | Loss: 2328659058753536.000

305/500 [ 61%] ██████████████████             ETA: 245s | Loss: 1211355663171584.000

300/500 [ 60%] ██████████████████             ETA: 252s | Loss: 22660983554048.000

295/500 [ 59%] █████████████████              ETA: 259s | Loss: 5484337692672.000  

290/500 [ 57%] █████████████████              ETA: 265s | Loss: 758590948245504.000

285/500 [ 56%] █████████████████              ETA: 272s | Loss: 606422538649600.000

280/500 [ 56%] ████████████████               ETA: 279s | Loss: 85997089980416.000

275/500 [ 55%] ████████████████               ETA: 286s | Loss: 52477491675136.000 

270/500 [ 54%] ████████████████               ETA: 293s | Loss: 105946315489280.000

265/500 [ 53%] ███████████████                ETA: 300s | Loss: 72402293751808.000

260/500 [ 52%] ███████████████                ETA: 307s | Loss: 73481127460864.000 

255/500 [ 51%] ███████████████                ETA: 314s | Loss: 880570905133056.000

250/500 [ 50%] ███████████████                ETA: 321s | Loss: 188570102923264.000

245/500 [ 49%] ██████████████                 ETA: 329s | Loss: 1037380161110016.000

240/500 [ 48%] ██████████████                 ETA: 336s | Loss: 2280892311535616.000

235/500 [ 47%] ██████████████                 ETA: 344s | Loss: 1677481081831424.000

230/500 [ 46%] █████████████                  ETA: 351s | Loss: 726351782871040.000 

225/500 [ 45%] █████████████                  ETA: 359s | Loss: 1099636383154176.000

220/500 [ 44%] █████████████                  ETA: 367s | Loss: 657401652969472.000

215/500 [ 43%] ████████████                   ETA: 374s | Loss: 306349279805440.000

210/500 [ 42%] ████████████                   ETA: 382s | Loss: 709169699094528.000 

205/500 [ 41%] ████████████                   ETA: 390s | Loss: 9041991711064064.000

200/500 [ 40%] ████████████                   ETA: 398s | Loss: 354453383282688.000

195/500 [ 39%] ███████████                    ETA: 406s | Loss: 1089555692257280.000

190/500 [ 38%] ███████████                    ETA: 415s | Loss: 1451894769713152.000

185/500 [ 37%] ███████████                    ETA: 423s | Loss: 317491163168768.000

180/500 [ 36%] ██████████                     ETA: 432s | Loss: 12799073845248.000  

175/500 [ 35%] ██████████                     ETA: 441s | Loss: 1166277028610048.000

170/500 [ 34%] ██████████                     ETA: 450s | Loss: 872550859014144.000

165/500 [ 33%] █████████                      ETA: 459s | Loss: 3521609203712.000  

160/500 [ 32%] █████████                      ETA: 469s | Loss: 280966123749376.000

155/500 [ 31%] █████████                      ETA: 478s | Loss: 201636114857984.000

150/500 [ 30%] █████████                      ETA: 488s | Loss: 125469164830720.000

145/500 [ 28%] ████████                       ETA: 498s | Loss: 880119262478336.000

140/500 [ 28%] ████████                       ETA: 508s | Loss: 98036856389632.000 

135/500 [ 27%] ████████                       ETA: 519s | Loss: 629143318298624.000

130/500 [ 26%] ███████                        ETA: 529s | Loss: 301431877795840.000

125/500 [ 25%] ███████                        ETA: 536s | Loss: 168808371191808.000

120/500 [ 24%] ███████                        ETA: 543s | Loss: 68628112211968.000

115/500 [ 23%] ██████                         ETA: 552s | Loss: 83362748301312.000  

110/500 [ 22%] ██████                         ETA: 562s | Loss: 2333369295699968.000

105/500 [ 21%] ██████                         ETA: 575s | Loss: 1265721224986624.000

100/500 [ 20%] ██████                         ETA: 587s | Loss: 31462342524928.000

 95/500 [ 19%] █████                          ETA: 599s | Loss: 89837789184.000   

 90/500 [ 18%] █████                          ETA: 613s | Loss: 89179509751808.000 

 85/500 [ 17%] █████                          ETA: 628s | Loss: 440539895824384.000

 80/500 [ 16%] ████                           ETA: 644s | Loss: 758773752791040.000 

 75/500 [ 15%] ████                           ETA: 660s | Loss: 4372137389326336.000

 70/500 [ 14%] ████                           ETA: 674s | Loss: 420778348642304.000

 65/500 [ 13%] ███                            ETA: 692s | Loss: 541015354114048.000 

 60/500 [ 12%] ███                            ETA: 708s | Loss: 2532381974069248.000

 55/500 [ 11%] ███                            ETA: 729s | Loss: 306808908414976.000

 50/500 [ 10%] ███                            ETA: 755s | Loss: 22625931755520.000

 45/500 [  9%] ██                             ETA: 786s | Loss: 2328289154695168.000

 40/500 [  8%] ██                             ETA: 823s | Loss: 161779959201792.000

 35/500 [  7%] ██                             ETA: 856s | Loss: 314784830455808.000

 30/500 [  6%] █                              ETA: 913s | Loss: 2255782936576.000   

 25/500 [  5%] █                              ETA: 972s | Loss: 1617512332525568.000

 20/500 [  4%] █                              ETA: 1022s | Loss: 2793779494912.000

 15/500 [  3%]                                ETA: 1179s | Loss: 314555016151040.000

 10/500 [  2%]                                ETA: 1478s | Loss: 262545176788992.000

  5/500 [  1%]                                ETA: 2384s | Loss: 2037256290304.000  

  1/500 [  0%]                                ETA: 9607s | Loss: 418365516546048.000

#  Model Criticism

Just as a first sanity check, let's see which movies the model thinks are best. 

In [222]:
fit_movie_means = q_movie_betas.mean().eval()

In [223]:
movie_categories_df['fit_beta'] = fit_movie_means

In [224]:
movie_betas_df = movie_categories_df.merge(movies, on = 'movieId', how='left')

In [225]:
movie_betas_df.sort_values(['fit_beta']).head()

       movieId  movie_cat_code  fit_beta                                title  \
14952    75397           14952 -3.919719                   Torrid Zone (1940)   
17525    88349           17525 -3.876275                 Almighty Thor (2011)   
14079    70740           14079 -3.676230          I Really Hate My Job (2007)   
1452      1495            1452 -3.641302  Turbo: A Power Rangers Movie (1997)   
6947      7059            6947 -3.638973               National Velvet (1944)   

                          genres  
14952    Action|Adventure|Comedy  
17525          Adventure|Fantasy  
14079               Comedy|Drama  
1452   Action|Adventure|Children  
6947              Children|Drama  

In [226]:
movie_betas_df.sort_values(['fit_beta'], ascending=False).head(5)

       movieId  movie_cat_code  fit_beta                            title  \
26294   128908           26294  4.028948                Cloudburst (2011)   
17467    88127           17467  3.924351  Conan O'Brien Can't Stop (2011)   
22494   108255           22494  3.912767  Venus Wars (Venus Senki) (1989)   
24846   118776           24846  3.887387                 Lap Dance (2014)   
10084    33330           10084  3.774137         Edges of the Lord (2001)   

                            genres  
26294       Adventure|Comedy|Drama  
17467                  Documentary  
22494  Action|Animation|Sci-Fi|War  
24846                        Drama  
10084                    Drama|War  

It pretty clearly looks to me like the model is not yet finding ratings that make any sense at all.  Let's see if it's putting anything into the latent dimensions.  None of the 'best' movies fit. 

In [None]:
Let's also check the other learned parameters to see if they make sense.

In [234]:
print('Mu is at {}'.format(q_mu.mean().eval()))

print('Sigma for movies is at {}'.format(np.sqrt(np.exp(q_lnvar_users.mean().eval()))))
print('Sigma for users is at {}'.format(np.sqrt(np.exp(q_lnvar_movies.mean().eval()))))

Sigma for users is at [ 1.14207709]


Sigma for movies is at [ 1.50328493]


Mu is at [-0.06361528]


We can also check out the distribution of fit offsets for movies. Nothing too concerning.

In [133]:
_, _, _ = plt.hist(fit_movie_means, 30)

<matplotlib.figure.Figure at 0x16ab68190>

OK, so it does not look like I am grabbing the best movies.  

These are pretty much random, so I must be doing the IDs incorrectly.  I could also try displaying the most reviewed movies in two dimensions.  But let's first make sure I have the IDs right. 

#  Visualizing the Latent Dimensions

Now let's see if the factorization has learned meaningfull embeddings.

First let's pick a small subset of the movies to plot.

In [None]:
movie_counts = ratings['movieId'].value_counts()