# Neural Collaborative Filtering

Paper: https://arxiv.org/pdf/1708.05031.pdf

Paper's Code: https://github.com/hexiangnan/neural_collaborative_filtering

In [2]:
import pandas as pd
import scipy.sparse as sp
import numpy as np
import keras
import multiprocessing
import math
import heapq
import json
import sqlite3
from time import time
from keras import backend as K
from keras import initializers
from keras.models import Model, load_model
from keras.layers.core import Dense
from keras.layers import Embedding, Input, Dense, Multiply, Flatten, Concatenate
from keras.optimizers import Adam
from keras.regularizers import l2

Using TensorFlow backend.


## The neural network architecture

<img src="../media/figure3.png">

The model takes two one-hot encoded vectors as input, one for the user and one for the subreddit. Each training example carries a binary label: 1 if the user has commented in the subreddit, 0 if the user has never commented in it.

Each vector passes through two embedding layers, one for General Matrix Factorization (GMF) and one for the Multi-Layer Perceptron (MLP). The resulting embeddings are fed to the GMF and MLP components, respectively.

The two output vectors are then concatenated and passed through a final sigmoid-activation layer that predicts a number between 0 and 1.

### General Matrix Factorization (GMF) Layer

This component simply transforms the user and item vectors into their corresponding embedding vectors. The two are then multiplied elementwise. It takes as a hyperparameter a latent_dim, which specifies the dimension of the embeddings.
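
To make the element-wise product concrete, here is a minimal NumPy sketch of what the GMF component computes for a single user-item pair. The table sizes, the random embedding values, and the names `gmf_interaction`, `user_emb`, and `item_emb` are illustrative assumptions; in the notebook the embeddings are learned by Keras.

```python
import numpy as np

def gmf_interaction(user_emb, item_emb, user_id, item_id):
    """Look up both embeddings and multiply them element-wise."""
    p_u = user_emb[user_id]   # shape: (latent_dim,)
    q_i = item_emb[item_id]   # shape: (latent_dim,)
    return p_u * q_i          # shape: (latent_dim,)

# Hypothetical toy sizes: 5 users, 7 items, latent_dim = 4.
rng = np.random.default_rng(0)
user_emb = rng.uniform(-0.01, 0.01, size=(5, 4))
item_emb = rng.uniform(-0.01, 0.01, size=(7, 4))

interaction = gmf_interaction(user_emb, item_emb, user_id=2, item_id=3)
print(interaction.shape)  # (4,)
```

Summing this vector would give the dot product used in classical matrix factorization; here the vector is instead passed on to a learned output layer.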

In [2]:
def GMF(X_user, X_item, num_users, num_items, latent_dim):
    
    initializer = initializers.RandomUniform(minval=-0.01, maxval=0.01, seed=1)
    
    X_user = Embedding(input_dim=num_users, output_dim=latent_dim, embeddings_initializer=initializer, \
                                 input_length=1, name='GMF_user_embedding')(X_user)
    X_item = Embedding(input_dim=num_items, output_dim=latent_dim, embeddings_initializer=initializer, \
                                 input_length=1, name='GMF_item_embedding')(X_item)
    
    X_user = Flatten()(X_user)
    X_item = Flatten()(X_item)
    
    X = Multiply()([X_user, X_item])

    return X


def get_GMF_model(num_users, num_items, latent_dim):
    X_user = Input(shape=(1,), dtype='int32', name = 'user_input')
    X_item = Input(shape=(1,), dtype='int32', name = 'item_input')
    
    # GMF() builds its own embedding initializer, so none is passed here
    X = GMF(X_user, X_item, num_users, num_items, latent_dim)
    
    X = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name='prediction')(X)
    
    model = Model(inputs=[X_user, X_item], outputs=X)
    
    return model

### Multi-Layer Perceptron (MLP) Layer

The MLP component also transforms the user and item vectors into embeddings. They are then concatenated and fed to a standard multi-layer perceptron. It takes as a hyperparameter a list specifying the layer dimensions. The first entry specifies the size of the concatenated embedding (the user and item embeddings each get half of it), and the remaining entries specify the sizes of the dense layers of the multi-layer perceptron.
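
As a rough illustration of the forward pass described above, the following NumPy sketch concatenates two embeddings and applies ReLU dense layers in sequence. The function name `mlp_forward`, the random weights, and the toy sizes (chosen to mirror `layers=[20, 10]`) are assumptions for illustration, not the trained Keras model.

```python
import numpy as np

def mlp_forward(p_u, q_i, dense_params):
    """Concatenate the two embeddings and apply ReLU dense layers in order."""
    x = np.concatenate([p_u, q_i])
    for W, b in dense_params:
        x = np.maximum(0.0, x @ W + b)   # ReLU activation
    return x

# Hypothetical toy setup: two 10-dim embeddings and one 20 -> 10 dense layer.
rng = np.random.default_rng(0)
p_u, q_i = rng.normal(size=10), rng.normal(size=10)
dense_params = [(rng.normal(size=(20, 10)), np.zeros(10))]

out = mlp_forward(p_u, q_i, dense_params)
print(out.shape)  # (10,)
```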

In [3]:
def MLP(X_user, X_item, num_users, num_items, layers = [20, 10]):

    initializer = initializers.RandomUniform(minval=-0.01, maxval=0.01, seed=1)
    
    X_user = Embedding(input_dim=num_users, output_dim=layers[0]//2, embeddings_initializer=initializer, \
                                 input_length=1, name='MLP_user_embedding')(X_user)
    X_item = Embedding(input_dim=num_items, output_dim=layers[0]//2, embeddings_initializer=initializer, \
                                 input_length=1, name='MLP_item_embedding')(X_item)
    
    X_user = Flatten()(X_user)
    X_item = Flatten()(X_item)
    
    X = Concatenate()([X_user, X_item])

    for i in range(1, len(layers)):
        X = Dense(layers[i], activation='relu', name='layer'+str(i))(X)
    
    return X

def get_MLP_model(num_users, num_items, layers=[20,10]):
    X_user = Input(shape=(1,), dtype='int32', name = 'user_input')
    X_item = Input(shape=(1,), dtype='int32', name = 'item_input')
    
    X = MLP(X_user, X_item, num_users, num_items, layers)
    
    X = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name='prediction')(X)
    
    model = Model(inputs=[X_user, X_item], outputs=X)
    
    return model

### Fusion of GMF and MLP

The outputs of the GMF and MLP are concatenated together and passed through a final layer that predicts the label using the sigmoid activation function.
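
The fusion step can be sketched in plain NumPy as a concatenation followed by a single sigmoid unit. The names `neumf_predict`, `weights`, and `bias`, and all numeric values, are hypothetical stand-ins for the parameters that the final `Dense(1, activation='sigmoid')` layer learns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neumf_predict(gmf_vector, mlp_vector, weights, bias):
    """Concatenate both component outputs and apply one sigmoid unit."""
    fused = np.concatenate([gmf_vector, mlp_vector])
    return sigmoid(fused @ weights + bias)

# Hypothetical values: an 8-dim GMF output and an 8-dim MLP output.
gmf_vector = np.full(8, 0.1)
mlp_vector = np.full(8, 0.2)
weights = np.zeros(16)   # all-zero weights give a neutral prediction
bias = 0.0

score = neumf_predict(gmf_vector, mlp_vector, weights, bias)
print(score)  # 0.5
```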

In [4]:
def NeuMF(X_user, X_item, num_users, num_items, gmf_latent_dim=10, layers=[20, 10]):
    X_GMF = GMF(X_user, X_item, num_users, num_items, gmf_latent_dim)
    X_MLP = MLP(X_user, X_item, num_users, num_items, layers)
    
    X = Concatenate()([X_GMF, X_MLP])
    
    return X

def get_NeuMF_model(num_users, num_items, gmf_latent_dim=10, layers=[20, 10]):
    X_user = Input(shape=(1,), dtype='int32', name = 'user_input')
    X_item = Input(shape=(1,), dtype='int32', name = 'item_input')
    
    X = NeuMF(X_user, X_item, num_users, num_items, gmf_latent_dim, layers)
    
    X = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name='prediction')(X)
    
    model = Model(inputs=[X_user, X_item], outputs=X)
    
    return model

## Training

For training the models, I used the hyperparameters suggested by the paper: latent_dim=8 and layers=[64,32,16,8]. This means that the GMF will learn an 8-dimensional embedding, while the MLP will learn a 32-dimensional embedding.
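
The sizes implied by these two hyperparameters can be checked with a small helper. This is only a sketch that mirrors the arithmetic in the `GMF` and `MLP` functions above; the helper name `neumf_dimensions` and its dictionary keys are made up for illustration.

```python
def neumf_dimensions(gmf_latent_dim, layers):
    """Derive embedding and layer sizes from the two hyperparameters."""
    mlp_embedding_dim = layers[0] // 2      # user and item each get half of layers[0]
    dense_shapes = []
    in_dim = 2 * mlp_embedding_dim          # size after concatenating both MLP embeddings
    for out_dim in layers[1:]:
        dense_shapes.append((in_dim, out_dim))
        in_dim = out_dim
    return {
        "gmf_embedding_dim": gmf_latent_dim,
        "mlp_embedding_dim": mlp_embedding_dim,
        "dense_shapes": dense_shapes,
        "fused_dim": gmf_latent_dim + layers[-1],  # input size of the prediction layer
    }

dims = neumf_dimensions(8, [64, 32, 16, 8])
print(dims["mlp_embedding_dim"])  # 32
print(dims["dense_shapes"])       # [(64, 32), (32, 16), (16, 8)]
print(dims["fused_dim"])          # 16
```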

In [7]:
gmf_latent_dim=8
layers = [64,32,16,8]

### Data Preprocessing

After collecting the data through Reddit's API, I had to split it into a training set and a test set and format it accordingly. The code for this process can be found [here](https://github.com/ejyap/neural_subreddit_recommender/blob/master/notebooks/perpare_datasets.ipynb).

First, I deleted all duplicated user-subreddit interactions. I kept only users that had commented in at least 10 unique subreddits.

The paper uses the leave-one-out evaluation method. For each user, the last interaction was held out as the test set, and the remaining interactions were used for the training set.
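
A leave-one-out split of this kind could look roughly like the sketch below. It assumes each interaction carries a timestamp that defines which one is "last"; that field, the tuple layout, and the function name `leave_one_out` are assumptions, and the actual preprocessing lives in the linked notebook.

```python
def leave_one_out(interactions):
    """Split (user, item, timestamp) tuples into train and test lists.

    The most recent interaction of each user goes to the test set.
    """
    latest = {}
    for user, item, ts in interactions:
        if user not in latest or ts > latest[user][1]:
            latest[user] = (item, ts)

    train, test = [], []
    for user, item, ts in interactions:
        if latest[user] == (item, ts):
            test.append((user, item))
        else:
            train.append((user, item))
    return train, test

# Hypothetical toy data: (user, subreddit_id, timestamp).
interactions = [(0, 10, 1), (0, 11, 2), (0, 12, 3), (1, 10, 5), (1, 13, 4)]
train_pairs, test_pairs = leave_one_out(interactions)
print(train_pairs)  # [(0, 10), (0, 11), (1, 13)]
print(test_pairs)   # [(0, 12), (1, 10)]
```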

In [3]:
df_train = pd.read_csv('../data/reddit_train_10.csv', header=None)
df_test_positive = pd.read_csv('../data/reddit_test_positive_10.csv', header=None, usecols=[0,1])
df_test_negative = pd.read_csv('../data/reddit_test_negative_10.csv', header=None)

In [7]:
num_users = max(df_train.iloc[:, 0])
num_items = max(df_train.iloc[:, 1])

train = sp.dok_matrix((num_users+1, num_items+1), dtype=np.float32)
for i, row in df_train.iterrows():
    user = row[0]
    item = row[1]
    train[user, item] = 1.0
    
test_positive = [(row[0], row[1]) for _, row in df_test_positive.iterrows()]
test_negative = [row[1:100].values.flatten().tolist() for _, row in df_test_negative.iterrows()]

### Evaluation Metrics

The performance of the model was evaluated using the Hit Ratio (HR). The hit ratio takes in a list of recommended items and a specific item. It returns 1 if the item is in the recommended list and 0 otherwise.

For each user, I sampled 99 non-interacted subreddits and combined them with the held-out interacted subreddit. The model generates a ranked recommendation list for the user from these 100 subreddits. If the interacted subreddit is in the top 10 (topK=10), the hit ratio returns 1. If it's in any of the other 90 ranks, the hit ratio returns 0.
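
As a small worked example of the metric, the sketch below ranks a handful of candidates by score and checks whether the held-out item lands in the top K. The scores, item IDs, and the helper name `hit_ratio_at_k` are invented for illustration.

```python
import heapq

def hit_ratio_at_k(scores, pos_item, k):
    """Return 1 if pos_item is among the k highest-scoring items, else 0."""
    top_k = heapq.nlargest(k, scores, key=scores.get)
    return int(pos_item in top_k)

# Hypothetical scores for 5 candidates; item 42 is the held-out positive.
scores = {7: 0.10, 13: 0.80, 42: 0.65, 99: 0.30, 5: 0.90}

print(hit_ratio_at_k(scores, pos_item=42, k=3))  # 1
print(hit_ratio_at_k(scores, pos_item=42, k=2))  # 0
```

Averaging this 0/1 value over all test users gives the reported HR.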

In [5]:
topK = 10

def getHitRatio(ranklist, pos_item):
    for item in ranklist:
        if item == pos_item:
            return 1
    return 0

def evaluate_model(model, test_positive, test_negative, K):
    hits = []
    for i in range(len(test_positive)):
        
        rating = test_positive[i]
        items = test_negative[i]
        user, pos_item = rating
        items.append(pos_item)
        
        map_item_score = {}
        
        users = np.full(len(items), user, dtype='int32')
        predictions = model.predict([users, np.array(items)], batch_size=100, verbose=0)
        # Use a separate index so the outer loop variable is not shadowed
        for j in range(len(items)):
            map_item_score[items[j]] = predictions[j][0]
        items.pop()
        
        ranklist = heapq.nlargest(K, map_item_score, key=map_item_score.get)
        hr = getHitRatio(ranklist, pos_item)
        hits.append(hr)
        
    return hits

### Compiling

The model is compiled with the Adam optimizer and the binary cross-entropy loss. The loss only drives training, though; the metric I actually care about is the hit ratio.
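
For reference, binary cross-entropy over a batch can be written out by hand as below. This is a plain-Python sketch of the standard formula, not the Keras implementation; the clipping constant `eps` and the sample values are assumptions.

```python
import math

def binary_cross_entropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical labels and predicted probabilities.
labels = [1, 0, 0, 0, 0]
predictions = [0.9, 0.2, 0.1, 0.3, 0.05]
print(round(binary_cross_entropy(labels, predictions), 4))
```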

In [9]:
model = get_NeuMF_model(num_users+1, num_items+1, layers=layers, gmf_latent_dim=gmf_latent_dim)
model.compile(optimizer='adam', loss='binary_crossentropy')

### Training

In addition to the user-subreddit interactions in the training set, we also have to train the model on non-interacted subreddits. So, for each interaction, we sample 4 "non-interactions" (num_negatives=4).

In [6]:
num_negatives = 4

def get_train_instances(train, num_negatives):
    user_input, item_input, labels = [],[],[]
    num_items = train.shape[1]  # sample negatives over every item column of the matrix
    for (u, i) in train.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        for t in range(num_negatives):
            j = np.random.randint(num_items)
            while (u, j) in train:
                j = np.random.randint(num_items)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

The model is trained for 20 epochs. Since the goal is to optimize the hit ratio, the weights that yield the best average hit ratio are saved. I also save the weights from the last iteration, in case I want to train further.

In [13]:
epochs = 20
batch_size = 256

model_out_file = '../pretrain/reddit_NeuMF_%d_%s_%d.h5' %(gmf_latent_dim, layers, time())

t1 = time()
hits = evaluate_model(model, test_positive, test_negative, topK)
hr = np.array(hits).mean()
print('Init: HR = %.4f\t [%.1f s]' % (hr, time()-t1))

best_hr, best_iter = hr, -1
for epoch in range(epochs):
    t1=time()
    user_input, item_input, labels = get_train_instances(train, num_negatives)
    hist = model.fit([np.array(user_input), np.array(item_input)], np.array(labels), batch_size=batch_size, \
                     epochs=1, shuffle=True)
    
    t2 = time()
    hits = evaluate_model(model, test_positive, test_negative, topK)
    hr, loss = np.array(hits).mean(), hist.history['loss'][0]
    print('Iteration %d [%.1f s]: HR = %.4f, loss = %.4f [%.1f s]' 
        % (epoch,  t2-t1, hr, loss, time()-t2))
    if hr > best_hr:
        best_hr, best_iter = hr, epoch
        model.save_weights(model_out_file, overwrite=True)
        
print("End. Best Iteration %d:  HR = %.4f. " %(best_iter, best_hr))

model_out_file = '../pretrain/reddit_NeuMF_%d_%s_%d.h5' %(gmf_latent_dim, layers, time())
model.save_weights(model_out_file, overwrite=True)

Init: HR = 0.0957	 [18.2 s]
Epoch 1/1
Iteration 0 [1101.9 s]: HR = 0.8952, loss = 0.2132 [17.8 s]
Epoch 1/1
Iteration 1 [1118.3 s]: HR = 0.9114, loss = 0.1873 [20.0 s]
Epoch 1/1
Iteration 2 [1161.2 s]: HR = 0.9205, loss = 0.1763 [22.6 s]
Epoch 1/1
Iteration 3 [1150.8 s]: HR = 0.9205, loss = 0.1706 [18.2 s]
Epoch 1/1
Iteration 4 [1110.7 s]: HR = 0.9223, loss = 0.1662 [18.7 s]
Epoch 1/1
Iteration 5 [1124.3 s]: HR = 0.9224, loss = 0.1632 [19.8 s]
Epoch 1/1
Iteration 6 [1128.0 s]: HR = 0.9237, loss = 0.1601 [18.9 s]
Epoch 1/1
Iteration 7 [1145.0 s]: HR = 0.9233, loss = 0.1576 [17.5 s]
Epoch 1/1
Iteration 8 [1122.3 s]: HR = 0.9227, loss = 0.1552 [18.2 s]
Epoch 1/1
Iteration 9 [1108.2 s]: HR = 0.9231, loss = 0.1530 [19.5 s]
Epoch 1/1
Iteration 10 [1182.7 s]: HR = 0.9239, loss = 0.1511 [18.6 s]
Epoch 1/1
Iteration 11 [1133.8 s]: HR = 0.9262, loss = 0.1492 [19.4 s]
Epoch 1/1
Iteration 12 [1113.8 s]: HR = 0.9210, loss = 0.1475 [18.4 s]
Epoch 1/1
Iteration 13 [1078.6 s]: HR = 0.9240, loss = 0.14

The best hit ratio was reached at iteration 15, where the model achieved a hit ratio of 0.9263. This means that for roughly 93% of test users, the held-out subreddit ranked in the top 10 of its 100 candidates.

Finally, since we're mainly interested in the subreddit embeddings in order to calculate their similarity, we save those. We concatenate the 8-dimensional embedding from the GMF component and the 32-dimensional embedding from the MLP component to obtain a final 40-dimensional embedding vector. We normalize each vector to unit length and save the result.
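
Once the rows have unit length, cosine similarity reduces to a dot product. The sketch below shows how such normalized embeddings might be queried for the nearest subreddits; the function name `most_similar` and the tiny 2-dimensional table are hypothetical.

```python
import numpy as np

def most_similar(normalized, item_id, top_n=3):
    """Rank items by cosine similarity to item_id (rows must be unit-length)."""
    similarities = normalized @ normalized[item_id]  # dot product = cosine similarity
    similarities[item_id] = -np.inf                  # exclude the query item itself
    return np.argsort(-similarities)[:top_n]

# Hypothetical 4-item, 2-dimensional embedding table.
weights = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
normalized = weights / np.linalg.norm(weights, axis=1, keepdims=True)

print(most_similar(normalized, item_id=0, top_n=2))  # [1 2]
```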

In [95]:
model.load_weights('../pretrain/reddit_NeuMF_8_[64, 32, 16, 8]_1563000005.h5')

In [96]:
GMF_item_embedding = model.get_layer('GMF_item_embedding')
MLP_item_embedding = model.get_layer('MLP_item_embedding')

GMF_item_weights = GMF_item_embedding.get_weights()[0]
MLP_item_weights = MLP_item_embedding.get_weights()[0]

NeuMF_item_weights = np.concatenate((GMF_item_weights, MLP_item_weights),axis=1)
NeuMF_item_weights_norma = NeuMF_item_weights/np.linalg.norm(NeuMF_item_weights, axis = 1).reshape((-1, 1))

In [114]:
import h5py

# Save the normalized item embeddings to an HDF5 file
with h5py.File('../data/neumf_weights.h5', 'w') as h5f:
    h5f.create_dataset('neumf_weights', data=NeuMF_item_weights_norma)