# Approach

This notebook expands on the code fragment in [Alkahest](http://www.fenris.org/)'s blog post [Collaborative Filtering in Keras](http://www.fenris.org/2016/03/07/collaborative-filtering-in-keras) and turns it into a complete implementation, using the [MovieLens 1M Dataset](http://grouplens.org/datasets/movielens/1m/) as training data.

## Import packages

In [1]:
import os
import numpy as np
from keras.layers import Embedding, Reshape, Merge
from keras.models import Sequential
from keras.optimizers import Adamax
from keras.callbacks import EarlyStopping, ModelCheckpoint

Using Theano backend.


## Define constants
The MovieLens 1M Dataset can be downloaded from http://files.grouplens.org/datasets/movielens/ml-1m.zip.

In [2]:
# Local directory that the MovieLens 1M Dataset has been unzipped into; modify if needed.
BASE_DIR = '.'
MOVIELENS_DIR = BASE_DIR + '/ml-1m/'
FACTORS = 20
N_USERS_IN_RATINGS = 6040
N_MOVIES_IN_RATINGS = 3952

## Load MovieLens 1M data

In [3]:
# User records are of the form: UserID::Gender::Age::Occupation::Zip-code
user_data = {}
with open(os.path.join(MOVIELENS_DIR, 'users.dat')) as user_records:
    for user in user_records:
        values = user.split('::')
        (userid, gender, age, occupation, zipcode) = values
        # Shift the 1-based UserID to a 0-based index.
        user_data[int(userid)-1] = {
            'gender': gender,
            'age': age,
            'occupation': occupation,
            'zipcode': zipcode.strip()}
print 'Data about', len(user_data), 'of', N_USERS_IN_RATINGS, 'users loaded.'

# Movie records are of the form: MovieID::Title::Genres
movie_data = {}
with open(os.path.join(MOVIELENS_DIR, 'movies.dat')) as movie_records:
    for movie in movie_records:
        values = movie.split('::')
        (movieid, title, genres) = values
        # Shift the 1-based MovieID to a 0-based index.
        movie_data[int(movieid)-1] = {'title': title, 'genres': genres.strip().split('|')}
print 'Data about', len(movie_data), 'of', N_MOVIES_IN_RATINGS, 'movies loaded.'

# Rating records are of the form UserID::MovieID::Rating::Timestamp
rating_data = []
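with open(os.path.join(MOVIELENS_DIR, 'ratings.dat')) as rating_records:
    # The loop variable is named `record` so that it is not shadowed by the unpacked `rating` field.
    for record in rating_records:
        values = record.split('::')
        (userid, movieid, rating, timestamp) = values
        # Store 0-based user and movie indices alongside the integer rating.
        rating_data.append([int(userid)-1, int(movieid)-1, int(rating)])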
print len(rating_data), 'ratings loaded.'

Data about 6040 of 6040 users loaded.
Data about 3883 of 3952 movies loaded.
1000209 ratings loaded.
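
The loader reports 3883 movies against `N_MOVIES_IN_RATINGS = 3952`, which suggests that MovieIDs run up to 3952 with gaps. An embedding table has to be sized by the largest index plus one, not by the number of distinct ids. A minimal sketch of that distinction, using made-up toy triples in the same `[user_index, movie_index, rating]` layout as `rating_data` (the names `toy_ratings`, `n_user_rows`, `n_movie_rows` and `n_distinct_movies` are illustrative and not part of the notebook):

```python
# Toy rating triples: [user_index, movie_index, rating]. Values are invented.
toy_ratings = [[0, 0, 5], [0, 4, 3], [2, 1, 4]]

# An embedding needs one row for every index that can be looked up,
# so its size is the largest 0-based index plus one.
n_user_rows = max(r[0] for r in toy_ratings) + 1
n_movie_rows = max(r[1] for r in toy_ratings) + 1

# The number of distinct movies can be smaller when ids have gaps.
n_distinct_movies = len(set(r[1] for r in toy_ratings))
```

Here the movie indices 0, 1 and 4 need a table with five rows even though only three movies appear, which mirrors the 3952 versus 3883 situation above.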


## Create training set

In [4]:
data = np.array(rating_data)
np.random.shuffle(data)
data = data.T
Users = data[0]
print 'Users:', Users, ', shape =', Users.shape
Movies = data[1]
print 'Movies:', Movies, ', shape =', Movies.shape
Ratings = data[2]
print 'Ratings:', Ratings, ', shape =', Ratings.shape

Users: [5852 5463 5681 ..., 1610  876 2037] , shape = (1000209,)
Movies: [3500 2727 3860 ..., 3783 2715  234] , shape = (1000209,)
Ratings: [1 4 3 ..., 4 4 4] , shape = (1000209,)
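
The shuffle matters because Keras's `validation_split` holds out the last fraction of the samples as given, so an unshuffled, user-ordered ratings file would put only the final users into the validation set. `np.random.shuffle` permutes a 2-D array along its first axis only, so each `[user, movie, rating]` row stays intact. A small sketch on invented data (the array `toy` and the fixed seed are assumptions for illustration):

```python
import numpy as np

# Four invented [user, movie, rating] rows; movie is always user + 10
# so that a broken pairing would be easy to spot.
toy = np.array([[0, 10, 5], [1, 11, 4], [2, 12, 3], [3, 13, 2]])
original_rows = set(map(tuple, toy.tolist()))

np.random.seed(0)        # fixed seed only to make the illustration repeatable
np.random.shuffle(toy)   # in place; reorders rows, never mixes columns

shuffled_rows = set(map(tuple, toy.tolist()))

# Transposing turns the columns into three parallel 1-D arrays.
users, movies, ratings = toy.T
```

After the shuffle, `users[i]`, `movies[i]` and `ratings[i]` still describe the same original rating for every `i`.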


## Define model

In [5]:
left = Sequential()
left.add(Embedding(N_USERS_IN_RATINGS, FACTORS, input_length=1))
left.add(Reshape((FACTORS,)))

right = Sequential()
right.add(Embedding(N_MOVIES_IN_RATINGS, FACTORS, input_length=1))
right.add(Reshape((FACTORS,)))

model = Sequential()
model.add(Merge([left, right], mode='dot', dot_axes=1))
model.compile(loss='mse', optimizer='adamax')
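# Stop once val_loss has stopped improving (patience of 2 epochs), and write a
# checkpoint to movie_weights.h5 whenever val_loss reaches a new best.
callbacks = [EarlyStopping('val_loss', patience=2),
             ModelCheckpoint('movie_weights.h5', save_best_only=True)]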
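
What the merged model computes can be sketched in plain NumPy: each branch looks up a `FACTORS`-long vector for its index, and the dot merge along axis 1 produces one scalar per (user, movie) pair, which is compared with the rating under mean squared error. The sketch below uses random matrices as stand-ins for the learned embedding weights and toy sizes rather than the notebook's; all names in it are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, factors = 4, 6, 3   # toy sizes, not the notebook's constants

# Stand-ins for the weights of the two Embedding layers.
user_factors = rng.normal(size=(n_users, factors))
movie_factors = rng.normal(size=(n_movies, factors))

# A batch of three (user, movie) pairs with invented ratings.
users = np.array([0, 2, 3])
movies = np.array([5, 1, 1])
ratings = np.array([4.0, 3.0, 5.0])

# Embedding lookup followed by a row-wise dot product: one prediction per pair.
predicted = np.sum(user_factors[users] * movie_factors[movies], axis=1)

# The 'mse' loss averages the squared prediction errors over the batch.
mse = np.mean((predicted - ratings) ** 2)
```

Training adjusts the two factor matrices so that these dot products move towards the observed ratings.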

## Train model

In [6]:
model.fit([Users, Movies], Ratings, nb_epoch=15, validation_split=.1, verbose=2, callbacks=callbacks)

Train on 900188 samples, validate on 100021 samples
Epoch 1/15
565s - loss: 11.1471 - val_loss: 5.4217
Epoch 2/15
609s - loss: 3.2400 - val_loss: 2.0896
Epoch 3/15
595s - loss: 1.6470 - val_loss: 1.3981
Epoch 4/15
593s - loss: 1.2226 - val_loss: 1.1470
Epoch 5/15
601s - loss: 1.0499 - val_loss: 1.0289
Epoch 6/15
597s - loss: 0.9649 - val_loss: 0.9651
Epoch 7/15
597s - loss: 0.9186 - val_loss: 0.9289
Epoch 8/15
599s - loss: 0.8916 - val_loss: 0.9070
Epoch 9/15
601s - loss: 0.8745 - val_loss: 0.8916
Epoch 10/15
609s - loss: 0.8629 - val_loss: 0.8818
Epoch 11/15
612s - loss: 0.8543 - val_loss: 0.8736
Epoch 12/15
613s - loss: 0.8473 - val_loss: 0.8672
Epoch 13/15
603s - loss: 0.8406 - val_loss: 0.8615
Epoch 14/15
614s - loss: 0.8344 - val_loss: 0.8558
Epoch 15/15
612s - loss: 0.8280 - val_loss: 0.8497


<keras.callbacks.History at 0x113173290>
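
Two things can be read off the log above. Because the loss is mean squared error on a 1 to 5 star scale, its square root is the error in stars; and because `val_loss` fell in every epoch, a patience of 2 never had a chance to stop training early. The sketch below checks both with the logged values. The helper `first_stop_epoch` is a simplified, hypothetical reading of patience-based stopping, not Keras's implementation, whose exact counting may differ between versions.

```python
import math

# val_loss values copied from the training log above, epochs 1 to 15.
logged_val_loss = [5.4217, 2.0896, 1.3981, 1.1470, 1.0289, 0.9651, 0.9289,
                   0.9070, 0.8916, 0.8818, 0.8736, 0.8672, 0.8615, 0.8558,
                   0.8497]

def first_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop after `patience` consecutive epochs in which
    val_loss failed to beat the best value seen so far.
    """
    best = float('inf')
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

stop_epoch = first_stop_epoch(logged_val_loss, patience=2)

# Square root of the final validation MSE gives the RMSE in rating stars.
final_val_rmse = math.sqrt(logged_val_loss[-1])
```

With these values `stop_epoch` is `None` and `final_val_rmse` comes out at roughly 0.92 stars. Since the validation loss was still decreasing at epoch 15, a larger `nb_epoch` might lower it further.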