# Goal

This is a practice notebook I wrote after working through Jeremy Howard's MOOC and the accompanying [notebook](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/lesson4.ipynb).

This notebook shows how to use the Keras functional API for collaborative filtering. It should give you some sense of how a computational graph is put together, and maybe some motivation to learn TensorFlow.

In [1]:
%matplotlib inline
import os
import time
import numpy as np
import pandas as pd
data_path = "data/"



In [2]:
ratings = pd.read_csv(os.path.join(data_path, "ratings.csv"), usecols=["userId", "movieId", "rating"])
ratings.head()

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 16 | 4.0 |
| 1 | 1 | 24 | 1.5 |
| 2 | 1 | 32 | 4.0 |
| 3 | 1 | 47 | 4.0 |
| 4 | 1 | 50 | 4.0 |


In [3]:
userIdidx = {uid: i for i, uid in enumerate(ratings.userId.unique())}
movieIdidx = {mid: i for i, mid in enumerate(ratings.movieId.unique())}
ratings["userId"] = ratings.userId.apply(lambda x: userIdidx[x])
ratings["movieId"] = ratings.movieId.apply(lambda x: movieIdidx[x])
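
The remapping above matters because an `Embedding` layer looks rows up by position, so the ids it receives need to be contiguous integers starting at 0. Here is a minimal sketch of the same idea in plain Python. The `raw_ids` values are made up for illustration and are not taken from the dataset:

```python
# Hypothetical raw ids with gaps, standing in for the original userId/movieId values.
raw_ids = [16, 24, 32, 16, 47, 24]

# Give each distinct id the next free index, in order of first appearance
# (the same order that pandas' unique() returns).
id_to_idx = {}
for rid in raw_ids:
    if rid not in id_to_idx:
        id_to_idx[rid] = len(id_to_idx)

remapped = [id_to_idx[rid] for rid in raw_ids]
print(remapped)  # [0, 1, 2, 0, 3, 1]
```

After remapping, the largest index equals the number of distinct ids minus one, which is why `n_users` and `n_movies` can later be used directly as the embedding sizes.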

In [4]:
len(ratings)

105339

In [5]:
ratings.describe()

| | userId | movieId | rating |
|---|---|---|---|
| count | 105339.0 | 105339.0 | 105339.0 |
| mean | 363.924539 | 1810.855989 | 3.51685 |
| std | 197.486905 | 2083.124762 | 1.044872 |
| min | 0.0 | 0.0 | 0.5 |
| 25% | 191.0 | 370.0 | 3.0 |
| 50% | 382.0 | 1049.0 | 3.5 |
| 75% | 556.0 | 2435.0 | 4.0 |
| max | 667.0 | 10324.0 | 5.0 |


In [6]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
n_factors = 50
print("%d %d %d" % (n_users, n_movies, n_factors))  # same output under Python 2 and 3

668 10325 50


Split the ratings into training and validation sets, with roughly 80% of the rows going to training.

In [7]:
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]
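
Because the mask is drawn at random, the split is only approximately 80/20, which is why the training runs in this notebook report 84413 training rows rather than exactly 80% of 105339. The sketch below shows the same masking trick on a stand-in array. The fixed seed and the row count are assumptions added here for reproducibility, since the notebook itself does not set a seed:

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed: an assumption, not in the notebook
n_rows = 1000                   # hypothetical row count
msk = rng.rand(n_rows) < 0.8    # True for roughly 80% of the rows

rows = np.arange(n_rows)        # stand-in for the ratings DataFrame
trn_rows, val_rows = rows[msk], rows[~msk]

# Every row lands in exactly one of the two sets.
print(len(trn_rows) + len(val_rows) == n_rows)
```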

In [8]:
from keras.layers import Input, Dense, Dropout, Flatten, Embedding, merge
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.models import Model

Using Theano backend.


# Dot product

In [9]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype="int64", name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

In [10]:
user_in, u = embedding_input("user_in", n_users, n_factors, 1e-4)
movie_in, m = embedding_input("movie_in", n_movies, n_factors, 1e-4)

In [11]:
x = merge([u, m], mode="dot")
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.01), loss="mse")

According to the Keras documentation, the `Merge` layer supports a number of predefined modes:

* `sum` (default): element-wise sum
* `concat`: tensor concatenation. You can specify the concatenation axis via the argument `concat_axis`.
* `mul`: element-wise multiplication
* `ave`: tensor average
* `dot`: dot product. You can specify which axes to reduce along via the argument `dot_axes`.
* `cos`: cosine proximity between vectors in 2D tensors.

You can also pass a function as the `mode` argument.
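
To get a feel for what these modes do to tensor shapes, here is a rough NumPy analogue using random arrays in place of the two `(None, 1, 50)` embedding outputs. This is only a sketch of the arithmetic, not Keras code, and the batch size of 4 is an arbitrary choice:

```python
import numpy as np

rng = np.random.RandomState(0)
u = rng.rand(4, 1, 50)  # stand-in for a batch of 4 user embeddings
m = rng.rand(4, 1, 50)  # stand-in for a batch of 4 movie embeddings

summed = u + m                             # like mode="sum"
product = u * m                            # like mode="mul"
average = (u + m) / 2.0                    # like mode="ave"
concat = np.concatenate([u, m], axis=-1)   # like mode="concat" on the last axis
dot = np.einsum("bif,bjf->bij", u, m)      # like mode="dot" over the factor axis

print(summed.shape)  # (4, 1, 50)
print(concat.shape)  # (4, 1, 100)
print(dot.shape)     # (4, 1, 1)
```

The `(4, 1, 1)` result of the dot product is why the model applies `Flatten()` afterwards: it leaves one predicted rating per row.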

In [12]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
user_in (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
movie_in (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1, 50)         33400       user_in[0][0]                    
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 1, 50)         516250      movie_in[0][0]                   
___________________________________________________________________________________________
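
The parameter counts in the summary follow directly from the embedding shapes: each embedding stores one row of `n_factors` weights per user or per movie. A quick arithmetic check using the values printed earlier:

```python
n_users, n_movies, n_factors = 668, 10325, 50

user_params = n_users * n_factors    # one 50-dimensional row per user
movie_params = n_movies * n_factors  # one 50-dimensional row per movie

print(user_params)   # 33400
print(movie_params)  # 516250
```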

In [13]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=4, 
          validation_data=([val.userId, val.movieId], val.rating))
time.sleep(0.1)

Train on 84413 samples, validate on 20926 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [14]:
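from keras import backend as K
# Update the learning-rate variable in place. Rebinding `model.optimizer.lr` to a
# float after the first `fit` does not reach the already-built training function.
K.set_value(model.optimizer.lr, 1e-3)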
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=4, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 84413 samples, validate on 20926 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x10e083950>

# Bias

In [15]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [16]:
user_in, u = embedding_input("user_in", n_users, n_factors, 1e-4)
movie_in, m = embedding_input("movie_in", n_movies, n_factors, 1e-4)

ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)

In [17]:
x = merge([u, m], mode="dot")
x = Flatten()(x)
x = merge([x, ub], mode="sum")
x = merge([x, mb], mode="sum")
model = Model([user_in, movie_in], x)
model.compile(Adam(0.01), loss="mse")
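
Put in plain terms, the model above predicts a rating as the dot product of the user and movie factors plus one learned bias per user and one per movie. The NumPy sketch below spells out that formula. The tiny sizes, the random weights, and the `predict` helper are all hypothetical and only illustrate the arithmetic:

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, n_factors = 5, 7, 3  # tiny hypothetical sizes

U = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
M = rng.normal(scale=0.1, size=(n_movies, n_factors))  # movie factors
user_bias = rng.normal(scale=0.1, size=n_users)        # one scalar per user
movie_bias = rng.normal(scale=0.1, size=n_movies)      # one scalar per movie

def predict(user_idx, movie_idx):
    """Dot product of the factor rows plus the two per-id biases."""
    dot = np.sum(U[user_idx] * M[movie_idx], axis=-1)
    return dot + user_bias[user_idx] + movie_bias[movie_idx]

users = np.array([0, 1, 4])
movies = np.array([2, 2, 6])
preds = predict(users, movies)
print(preds.shape)  # (3,)
```

The biases let the model capture that some users rate generously and some movies are widely liked, independent of how well their factors match.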

In [18]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
user_in (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
movie_in (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_3 (Embedding)          (None, 1, 50)         33400       user_in[0][0]                    
____________________________________________________________________________________________________
embedding_4 (Embedding)          (None, 1, 50)         516250      movie_in[0][0]                   
___________________________________________________________________________________________

In [19]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=4, 
          validation_data=([val.userId, val.movieId], val.rating))
time.sleep(0.1)

Train on 84413 samples, validate on 20926 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [20]:
from keras import backend as K
K.set_value(model.optimizer.lr, 1e-3)  # change the learning rate in place
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=4, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 84413 samples, validate on 20926 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x10f4b3590>

# Neural net

In [25]:
user_in, u = embedding_input("user_in", n_users, n_factors, 1e-4)
movie_in, m = embedding_input("movie_in", n_movies, n_factors, 1e-4)

In [26]:
x = merge([u, m], mode="concat")
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation="relu")(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)
nn = Model([user_in, movie_in], x)
nn.compile(Adam(0.001), loss="mse")
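
Instead of a fixed dot product, this network concatenates the two embeddings and lets dense layers learn how to combine them. Below is a NumPy sketch of the forward pass at prediction time, when dropout is inactive. The weight names `W1`, `b1`, `W2`, `b2` and the random values are hypothetical stand-ins for what Keras would learn:

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_factors, n_hidden = 4, 50, 70

u = rng.rand(batch, n_factors)  # flattened user embeddings
m = rng.rand(batch, n_factors)  # flattened movie embeddings

W1 = rng.normal(scale=0.05, size=(2 * n_factors, n_hidden))  # Dense(70) weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.05, size=(n_hidden, 1))              # Dense(1) weights
b2 = np.zeros(1)

x = np.concatenate([u, m], axis=1)   # concat: (4, 100)
h = np.maximum(0.0, x.dot(W1) + b1)  # ReLU hidden layer: (4, 70)
y = h.dot(W2) + b2                   # linear output: (4, 1)

print(x.shape)  # (4, 100)
print(y.shape)  # (4, 1)
```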

In [27]:
nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=4,
       validation_data=([val.userId, val.movieId], val.rating))

Train on 84413 samples, validate on 20926 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1122aae90>

In [28]:
from keras import backend as K
K.set_value(nn.optimizer.lr, 1e-4)  # change the learning rate in place
nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=4,
       validation_data=([val.userId, val.movieId], val.rating))

Train on 84413 samples, validate on 20926 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1122aaed0>