#  Probabilistic Matrix Factorization of the MovieLens Ratings

Something something . Matrix factorization is a simple and powerful technique for predicting user's ratings of items using embedding in a latent space.

Modern methods have many useful adaptations, but the general system predicts user $u$'s rating of item $i$, $r_{u,i}$  as 
  *  $r_{u,i} = \mu + \beta_u + \beta_i + \vec{v}_u^T \cdot \vec{v}_i^T$
  *  $\mu$, $\beta_u$, and $\beta_i$ are overall mean ratings and offsets for users and movies
  *  $\vec{v}_i, \vec{v}_u \in \mathbb{R}^K$ are the user's and movie's embeddings in a $K$ dimensional space.
  *  We regularize user and item vectors by modeling them as coming from a multivariate gaussian, which can be learned or predetermined $\vec{v}_u \sim N(0, \Lambda_{\text{Users}})$ and $\vec{v}_i \sim N(0, \Lambda_{\text{Movies}})$.

The most famous application of PMF for recomendation systems is probably Netflix's.  Their system is described in [this paper](https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf) and you can also find a readable overview [here](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf).  Netflix took down the original dataset due to privacy concerns, but you can still run the same model using the [MovieLens data](https://grouplens.org/datasets/movielens/20m/).

Need to cite `MovieLens` with ```F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872```.

The MovieLens data comes contains 20,000,263 ratings of 27,278 movies by 138,493 users. It also contains free text tags, but we will not use them here. 

##  Read in the Data

Download [the data](https://grouplens.org/datasets/movielens/) and unzip.

We will only use `ml-20/movies.csv` and `ml-20/ratings.csv`.

In [4]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd

%pylab inline

movies = pd.read_csv('~/ml-20m/movies.csv')
ratings = pd.read_csv('~/ml-20m/ratings.csv')

Populating the interactive namespace from numpy and matplotlib


The minimum supported version is 2.4.6



In [5]:
print(movies.shape)
movies.head()

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

(27278, 3)


In [6]:
print(ratings.shape)
ratings.head()

   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580

(20000263, 4)


Let's cherry pick a few movies whose parameters we will track.  We'll pick the ones with the most ratings - hopefully our latent dimensions will be interpretable. 

We'll also bring in the titles now.  We'll need to remap the user and movie IDs, and bringing in the titles helps make sure we don't make mistakes mapping them back later on.

In [7]:
ratings = ratings.merge(movies, on = 'movieId', how = 'left')

In [8]:
ratings.head()

   userId  movieId  rating   timestamp  \
0       1        2     3.5  1112486027   
1       1       29     3.5  1112484676   
2       1       32     3.5  1112484819   
3       1       47     3.5  1112484727   
4       1       50     3.5  1112484580   

                                               title  \
0                                     Jumanji (1995)   
1  City of Lost Children, The (Cité des enfants p...   
2          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   
3                        Seven (a.k.a. Se7en) (1995)   
4                         Usual Suspects, The (1995)   

                                   genres  
0              Adventure|Children|Fantasy  
1  Adventure|Drama|Fantasy|Mystery|Sci-Fi  
2                 Mystery|Sci-Fi|Thriller  
3                        Mystery|Thriller  
4                  Crime|Mystery|Thriller  

In [9]:
ratings['title'].value_counts()[0:5]

Pulp Fiction (1994)                 67310
Forrest Gump (1994)                 66172
Shawshank Redemption, The (1994)    63366
Silence of the Lambs, The (1991)    63299
Jurassic Park (1993)                59715
Name: title, dtype: int64

In [10]:
ratings.loc[ratings['rating'] >= 4.0]['title'].value_counts()[0:5]

Shawshank Redemption, The (1994)             55807
Pulp Fiction (1994)                          52353
Silence of the Lambs, The (1991)             50114
Forrest Gump (1994)                          47331
Star Wars: Episode IV - A New Hope (1977)    42612
Name: title, dtype: int64

In [11]:
ratings.loc[ratings['rating'] <= 1.0]['title'].value_counts()[0:5]

Dumb & Dumber (Dumb and Dumber) (1994)    4578
Ace Ventura: Pet Detective (1994)         4323
Ace Ventura: When Nature Calls (1995)     3976
Waterworld (1995)                         3013
Blair Witch Project, The (1999)           2992
Name: title, dtype: int64

##  Model Definition

Let's go ahead and define our model before we reformat our data.

We will use a vanilla PMF - exactly the one defined above, with predetermined $\Lambda_{\text{movies}} = \Lambda_{\text{users}} = I_K$ and $K=3$.

In [12]:
from edward.models import Normal
#  This will need to change if you train/test split
N_users = len(set(ratings.userId))
N_movies = len(set(ratings.movieId))
N_ratings = ratings.shape[0]
K = 2

lnvar_users = Normal(loc=tf.zeros([1]), scale=tf.ones([1]))
lnvar_movies = Normal(loc=tf.zeros([1]), scale=tf.ones([1]))
sigma_users = tf.sqrt(tf.exp(lnvar_users))
sigma_movies = tf.sqrt(tf.exp(lnvar_users))

user_vecs = Normal(loc = tf.zeros([N_users, K]), 
                   scale = sigma_users * tf.ones([N_users, K]))
movie_vecs = Normal(loc = tf.zeros([N_movies, K]), 
                    scale = sigma_movies * tf.ones([N_movies, K]))

#  Somewhat hacky prior on mu
mu = Normal(loc = 2.5*tf.ones([1]), 
            scale = tf.ones([1]))

user_betas = Normal(loc = tf.zeros([N_users]), 
                    scale = sigma_users * tf.ones([N_users]))
movie_betas = Normal(loc = tf.zeros([N_movies]), 
                     scale = sigma_movies * tf.ones([N_movies]))

#  Placeholders for data inputs
user_ids = tf.placeholder(tf.int32, [N_ratings])
movie_ids = tf.placeholder(tf.int32, [N_ratings])

predicted_ratings = tf.reduce_sum(tf.multiply(
    tf.gather(user_vecs, user_ids),
    tf.gather(movie_vecs, movie_ids)
)) + \
    tf.gather(user_betas, user_ids) + \
    tf.gather(movie_betas, movie_ids) + \
    mu

obs_ratings = Normal(loc=predicted_ratings, scale = tf.ones([N_ratings]))

##  Inference Definition

We have now sret up a probablistic graph for generating ratings from our learnable paramters (mu, offsets, and vectors).  We now explicitly define our inference.  Edward makes it easy to swap out sampling, ML, and variational methods.  We'll use simple MFVI.  

In [13]:
user_ids_train = ratings['userId']
movie_ids_train = ratings['movieId'].astype('category').cat.codes

user_ids_train = user_ids_train.values.astype(int) - 1
movie_ids_train = movie_ids_train.values.astype(int)
ratings_train = ratings['rating'].values.astype(float)

In [14]:
# INFERENCE
q_user_vecs = Normal(loc=tf.Variable(tf.random_normal([N_users, K])),
                     scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_users, K]))))
q_movie_vecs = Normal(loc=tf.Variable(tf.random_normal([N_movies, K])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_movies, K]))))
q_user_betas = Normal(loc=tf.Variable(tf.random_normal([N_users])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_users]))))
q_movie_betas = Normal(loc=tf.Variable(tf.random_normal([N_movies])),
                      scale=tf.nn.softplus(tf.Variable(tf.random_normal([N_movies]))))

q_mu = Normal(loc=tf.Variable(tf.random_normal([1])),
              scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
q_lnvar_users = Normal(loc=tf.Variable(tf.random_normal([1])),
                       scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
q_lnvar_movies = Normal(loc=tf.Variable(tf.random_normal([1])),
                        scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))

                                    
parameter_inferences = {
    user_vecs: q_user_vecs,
    movie_vecs: q_movie_vecs,
    user_betas: q_user_betas,
    movie_betas: q_movie_betas,
    mu: q_mu,
    lnvar_users: q_lnvar_users,
    lnvar_movies: q_lnvar_movies
}
train_data = {
    user_ids: user_ids_train,
    movie_ids: movie_ids_train,
    obs_ratings: ratings_train
}

inference = ed.KLqp(parameter_inferences,
                    train_data)

#  Run the Inference

We now have everything set up and can let SGD run. 

In [15]:
optimizer = tf.train.AdamOptimizer(learning_rate=1e-9)

inference.initialize(optimizer = optimizer,
                     n_print=10, n_iter=500)
tf.global_variables_initializer().run()

In [16]:
for _ in range(inference.n_iter):
    info_dict = inference.update()
    #inference.print_progress(info_dict)
    if info_dict['t'] % inference.n_print == 0:
        l_per_rating = info_dict['loss'] / (1.0 * N_ratings)
        print('Iter: {}    Loss per rating: {} \n'.format(info_dict['t'], l_per_rating))
    

Iter: 500    Loss per rating: 104967204.321 



Iter: 490    Loss per rating: 67204482.8075 



Iter: 480    Loss per rating: 109307426.448 



Iter: 470    Loss per rating: 30528340.1774 



Iter: 460    Loss per rating: 502415070.493 



Iter: 450    Loss per rating: 12991414.8509 



Iter: 440    Loss per rating: 56615836.5202 



Iter: 430    Loss per rating: 20602596.5978 



Iter: 420    Loss per rating: 80806109.6955 



Iter: 410    Loss per rating: 39704011.1366 



Iter: 400    Loss per rating: 60994947.7225 



Iter: 390    Loss per rating: 119919090.773 



Iter: 380    Loss per rating: 70704412.3438 



Iter: 370    Loss per rating: 72816266.9686 



Iter: 360    Loss per rating: 3677134.87235 



Iter: 350    Loss per rating: 24391006.5697 



Iter: 340    Loss per rating: 116629296.459 



Iter: 330    Loss per rating: 167832565.572 



Iter: 320    Loss per rating: 73966605.1448 



Iter: 310    Loss per rating: 10493912.4663 



Iter: 300    Loss per rating: 6554782.58157 



Iter: 290    Loss per rating: 29326489.9145 



Iter: 280    Loss per rating: 154010254.087 



Iter: 270    Loss per rating: 67518534.7394 



Iter: 260    Loss per rating: 503227882.365 



Iter: 250    Loss per rating: 100438348.368 



Iter: 240    Loss per rating: 185146330.855 



Iter: 230    Loss per rating: 18134529.4517 



Iter: 220    Loss per rating: 47617703.0815 



Iter: 210    Loss per rating: 91218792.1445 



Iter: 200    Loss per rating: 148432856.055 



Iter: 190    Loss per rating: 435390249.173 



Iter: 180    Loss per rating: 88064663.3262 



Iter: 170    Loss per rating: 3031102.90144 



Iter: 160    Loss per rating: 7079895.40682 



Iter: 150    Loss per rating: 101282519.8 



Iter: 140    Loss per rating: 210167743.093 



Iter: 130    Loss per rating: 170484109.295 



Iter: 120    Loss per rating: 51480284.15 



Iter: 110    Loss per rating: 201161060.097 



Iter: 100    Loss per rating: 227039830.485 



Iter: 90    Loss per rating: 11566097.5261 



Iter: 80    Loss per rating: 8589374.52892 



Iter: 70    Loss per rating: 96384549.8171 



Iter: 60    Loss per rating: 25166180.9306 



Iter: 50    Loss per rating: 37562837.0347 



Iter: 40    Loss per rating: 6513808.12575 



Iter: 30    Loss per rating: 39516595.3211 



Iter: 20    Loss per rating: 676.476685132 



Iter: 10    Loss per rating: 193482605.115 



In [None]:
#  Model Criticism
Just as a first sanity check, let's see which movies the model thinks are best. 

In [25]:
movie_categories_df = pd.DataFrame(
    {'movie_cat_code': range(N_movies),
     'movieId': ratings['movieId'].astype('category').cat.categories}
)

fit_movie_means = q_movie_betas.mean().eval()
movie_categories_df['fit_beta'] = fit_movie_means
movie_betas_df = movie_categories_df.merge(movies, on = 'movieId', how='left')
movie_betas_df.sort_values(['fit_beta']).head()

       movieId  movie_cat_code  fit_beta  \
9609     30958            9609 -3.971159   
23971   114492           23971 -3.960724   
24005   114652           24005 -3.836536   
25525   124050           25525 -3.810666   
8605     26096            8605 -3.712745   

                                                   title              genres  
9609                           Who's the Caboose? (1997)  Comedy|Documentary  
23971                                    Not Cool (2014)              Comedy  
24005  Case of the Grinning Cat, The (Chats perchés) ...         Documentary  
25525                   Pleasure at Her Majesty's (1976)  Comedy|Documentary  
8605                                Cardinal, The (1963)               Drama  

And just to check our Ids are matched up correctly.

In [27]:
movie_betas_df.sort_values('movie_cat_code')[0:3]

   movieId  movie_cat_code  fit_beta                    title  \
0        1               0 -1.579556         Toy Story (1995)   
1        2               1 -1.182469           Jumanji (1995)   
2        3               2  0.065379  Grumpier Old Men (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  

In [29]:
ratings['movie_cat_code'] = ratings['movieId'].astype('category').cat.codes

In [30]:
ratings.head()

   userId  movieId  rating   timestamp  \
0       1        2     3.5  1112486027   
1       1       29     3.5  1112484676   
2       1       32     3.5  1112484819   
3       1       47     3.5  1112484727   
4       1       50     3.5  1112484580   

                                               title  \
0                                     Jumanji (1995)   
1  City of Lost Children, The (Cité des enfants p...   
2          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   
3                        Seven (a.k.a. Se7en) (1995)   
4                         Usual Suspects, The (1995)   

                                   genres  movie_cat_code  
0              Adventure|Children|Fantasy               1  
1  Adventure|Drama|Fantasy|Mystery|Sci-Fi              28  
2                 Mystery|Sci-Fi|Thriller              31  
3                        Mystery|Thriller              46  
4                  Crime|Mystery|Thriller              49  

Should check a bit more, but it looks good so far. 

In [31]:
movie_betas_df.sort_values(['fit_beta'], ascending=False).head(5)

       movieId  movie_cat_code  fit_beta  \
15900    80742           15900  4.280601   
20026    98989           20026  3.785681   
18343    91784           18343  3.719136   
19513    96923           19513  3.636514   
21100   103210           21100  3.588994   

                                                   title  \
15900       Last Letter, The (La dernière lettre) (2002)   
20026                               Ghost Machine (2010)   
18343                       Girl Walks Into a Bar (2011)   
19513                       2-Headed Shark Attack (2012)   
21100  Fullmetal Alchemist: The Sacred Star of Milos ...   

                           genres  
15900                       Drama  
20026      Action|Sci-Fi|Thriller  
18343        Comedy|Drama|Fantasy  
19513               Comedy|Horror  
21100  Action|Adventure|Animation  

It pretty clearly looks to me like the model is not yet finding ratings that make any sense at all.  Let's see if it's putting anything into the latent dimensions.  We should be reliably recovering some well known highly rated movies. 

In [None]:
Let's also check the other learned parameters to see if they make sense.

In [267]:
print('Mu is at {}'.format(q_mu.mean().eval()))

print('Sigma for movies is at {}'.format(np.sqrt(np.exp(q_lnvar_users.mean().eval()))))
print('Sigma for users is at {}'.format(np.sqrt(np.exp(q_lnvar_movies.mean().eval()))))

Mu is at [-0.03395778]


Sigma for movies is at [ 1.72941291]


Sigma for users is at [ 1.26464629]


We can also check out the distribution of fit offsets for movies. Nothing too concerning.

In [133]:
_, _, _ = plt.hist(fit_movie_means, 30)

<matplotlib.figure.Figure at 0x16ab68190>

OK, so it does not look like I am grabbing the best movies.  

These are pretty much random, so I must be doing the IDs incorrectly.  I could also try displaying the most reviewed movies in two dimensions.  But let's first make sure I have the IDs right. 

#  Visualizing the Latent Dimensions

Now let's see if the factorization has learned meaningfull embeddings.

First let's pick a small subset of the movies to plot.

In [None]:
movie_counts = ratings['movieId'].value_counts()