# Building a Subreddit Recommender System Using Neural Collaborative Filtering

Sources: https://arxiv.org/pdf/1708.05031.pdf, https://github.com/hexiangnan/neural_collaborative_filtering

## Introduction

Not too long ago, I came up with the idea of building a subreddit recommender system as my next project. Reddit has over 130,000 active subreddits, and there is no recommendation system in place that points you to subreddits you might be interested in. At the time, I was also finishing Andrew Ng's deep learning specialization, and I thought: what better way to practice what I've learned than by implementing my own recommender system with a neural network?

I found that there are two main approaches to building a recommender system: content-based filtering and collaborative filtering. Content-based filtering recommends items based on the descriptions of the items. This approach would not work well in our case, since subreddits don't come with specific categories or metadata, and it would be difficult to figure out which subreddits have similar content. Collaborative filtering, on the other hand, recommends items based on other users' preferences. This is much more feasible for my application: I can collect data on a large number of users and their subreddit activity. Then, if I want to make recommendations for, say, the /r/python subreddit, I can look at which users have been active in /r/python, find the other subreddits they comment in, and recommend those subreddits to users visiting /r/python. Of course, the actual system is more complicated than that, since we'll be using a neural network to learn which subreddits to recommend based on user preferences.
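
To make the /r/python example concrete, here is a minimal sketch of that naive co-occurrence idea. It is not the neural model built in this notebook; the toy `interactions` list and the function name `co_occurrence_recommendations` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one (user, subreddit) pair per observed comment.
interactions = [
    ("alice", "python"), ("alice", "learnprogramming"), ("alice", "django"),
    ("bob", "python"), ("bob", "learnprogramming"),
    ("carol", "python"), ("carol", "django"), ("carol", "learnprogramming"),
    ("dave", "cooking"), ("dave", "learnprogramming"),
]

def co_occurrence_recommendations(interactions, subreddit, top_n=3):
    """Rank other subreddits by how many users of `subreddit` are also active in them."""
    subs_by_user = defaultdict(set)
    for user, sub in interactions:
        subs_by_user[user].add(sub)

    counts = Counter()
    for subs in subs_by_user.values():
        if subreddit in subs:
            # Count every other subreddit this user is active in
            counts.update(subs - {subreddit})
    return counts.most_common(top_n)

recs = co_occurrence_recommendations(interactions, "python")
print(recs)
```

Counting shared users like this ignores how strongly each user prefers each subreddit, which is the part the neural network is meant to learn.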

I started researching how deep learning is applied to collaborative recommender systems and found a 2017 paper from the National University of Singapore titled "Neural Collaborative Filtering". The paper proposes one of the first neural network architectures designed for collaborative filtering and claims to outperform many of the popular collaborative filtering methods of the time. In this project, I create my own simpler implementation of the model and apply it to a user-subreddit interaction dataset that I built myself. Then, I deploy the model to a web app that, given a subreddit, recommends similar subreddits.

In [103]:
import pandas as pd
import scipy.sparse as sp
import numpy as np
import keras
import multiprocessing
import math
import heapq
import json
import sqlite3
from time import time
from keras import backend as K
from keras import initializers
from keras.models import Model, load_model
from keras.layers import Embedding, Input, Dense, Multiply, Flatten, Concatenate
from keras.optimizers import Adam
from keras.regularizers import l2

## The neural network architecture

<img src="../media/figure3.png">

In [2]:
def GMF(X_user, X_item, num_users, num_items, latent_dim):
    
    """
    Implementation of the Generalized Matrix Factorization
    
    Arguments:
    X_user -- input tensor of shape (1,), specifying the user id
    X_item -- input tensor of shape (1,), specifying the item id
    num_users -- integer, number of total users 
    num_items -- integer, number of total items
    latent_dim -- dimension of the embedding latent vector
    
    Returns:
    X -- output of the GMF, tensor of shape (latent_dim,), the element-wise product of the user and item embeddings
    """
    
    initializer = initializers.RandomUniform(minval=-0.01, maxval=0.01, seed=1)
    
    X_user = Embedding(input_dim=num_users, output_dim=latent_dim, embeddings_initializer=initializer, \
                                 input_length=1, name='GMF_user_embedding')(X_user)
    X_item = Embedding(input_dim=num_items, output_dim=latent_dim, embeddings_initializer=initializer, \
                                 input_length=1, name='GMF_item_embedding')(X_item)
    
    X_user = Flatten()(X_user)
    X_item = Flatten()(X_item)
    
    X = Multiply()([X_user, X_item])

    return X


def get_GMF_model(num_users, num_items, latent_dim):
    X_user = Input(shape=(1,), dtype='int32', name = 'user_input')
    X_item = Input(shape=(1,), dtype='int32', name = 'item_input')
    
    # GMF creates its own embedding initializer, so none is passed in here
    X = GMF(X_user, X_item, num_users, num_items, latent_dim)
    
    X = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name='prediction')(X)
    
    model = Model(inputs=[X_user, X_item], outputs=X)
    
    return model

### 3.3 Multi-Layer Perceptron (MLP)

The GMF only uses a fixed element-wise product between the two latent vectors to model their interactions. More flexibility and non-linearity can be obtained by also concatenating the two latent vectors and feeding the concatenation to a standard Multi-Layer Perceptron that can learn the interaction between the user and the item.

The MLP model is defined as:

$$\textbf{z}_1=\phi_1(\textbf{p}_u,\textbf{q}_i)=[\textbf{p}_u\;\textbf{q}_i]^T$$

$$\phi_2(\textbf{z}_1)=a_2(\textbf{W}_2^T\textbf{z}_1+\textbf{b}_2)$$

$$...$$

$$\phi_L(\textbf{z}_{L-1})=a_L(\textbf{W}_L^T\textbf{z}_{L-1}+\textbf{b}_L)$$

$$\hat{y}_{ui}=\sigma(\textbf{h}^T\phi_L(\textbf{z}_{L-1}))$$

where $\textbf{W}_x$, $\textbf{b}_x$, and $a_x$ denote the weight matrix, bias vector, and activation function for the x-th layer's perceptron, respectively. Various activation functions can be chosen for the MLP layers, but the paper opts to use the ReLU, because it is more biologically plausible, proven to be non-saturated, and encourages sparse activations.
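
As a sanity check on the notation, the equations above can be written out as a plain NumPy forward pass. This is only a sketch with randomly drawn parameters; the embedding size, layer widths, and the helper name `mlp_forward` are illustrative assumptions and not values from the paper or from this notebook.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(p_u, q_i, weights, biases, h):
    """Compute z_1 = [p_u, q_i], apply the ReLU layers, then sigmoid(h^T z)."""
    z = np.concatenate([p_u, q_i])        # z_1
    for W, b in zip(weights, biases):     # phi_2 ... phi_L
        z = relu(W.T @ z + b)
    return sigmoid(h @ z)                 # y_hat_ui

rng = np.random.default_rng(0)
p_u, q_i = rng.normal(size=4), rng.normal(size=4)  # hypothetical 4-d embeddings
sizes = [8, 4, 2]                                   # illustrative layer widths
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
h = rng.normal(size=sizes[-1])

y_hat = mlp_forward(p_u, q_i, weights, biases, h)
print(y_hat)
```

Each weight matrix has shape (inputs, outputs), so `W.T @ z` matches the $\textbf{W}_x^T\textbf{z}$ term in the formulas.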

In [3]:
def MLP(X_user, X_item, num_users, num_items, layers = [20, 10]):

    """
    Implementation of the Multi-Layer Perceptron
    
    Arguments:
    X_user -- input tensor of shape (1,), specifying the user id
    X_item -- input tensor of shape (1,), specifying the item id
    num_users -- integer, number of total users 
    num_items -- integer, number of total items
    layers -- integer list, specifying the units for each layer; layers[0] is the size of the concatenated embeddings
    
    Returns:
    X -- output of the MLP, tensor of shape (layers[-1],), the activations of the last hidden layer
    """
    
    initializer = initializers.RandomUniform(minval=-0.01, maxval=0.01, seed=1)
    
    X_user = Embedding(input_dim=num_users, output_dim=layers[0]//2, embeddings_initializer=initializer, \
                                 input_length=1, name='MLP_user_embedding')(X_user)
    X_item = Embedding(input_dim=num_items, output_dim=layers[0]//2, embeddings_initializer=initializer, \
                                 input_length=1, name='MLP_item_embedding')(X_item)
    
    X_user = Flatten()(X_user)
    X_item = Flatten()(X_item)
    
    X = Concatenate()([X_user, X_item])

    for i in range(1, len(layers)):
        X = Dense(layers[i], activation='relu', name='layer'+str(i))(X)
    
    return X

def get_MLP_model(num_users, num_items, layers=[20,10]):
    X_user = Input(shape=(1,), dtype='int32', name = 'user_input')
    X_item = Input(shape=(1,), dtype='int32', name = 'item_input')
    
    X = MLP(X_user, X_item, num_users, num_items, layers)
    
    X = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name='prediction')(X)
    
    model = Model(inputs=[X_user, X_item], outputs=X)
    
    return model

### 3.4 Fusion of GMF and MLP

The one-hot encoded user and item vectors can be fed to two different sets of embeddings, one for the GMF and one for the MLP. The two models are then combined by concatenating their last hidden layers. The fused model is pictured below.

<img src="../media/figure3.png">

The final fused model can be formulated as such:

$$\phi^{GMF}=\textbf{p}_u^G\odot\textbf{q}_i^G$$

$$\phi^{MLP}=a_L(\textbf{W}_L^T(a_{L-1}(...a_2(\textbf{W}_2^T[\textbf{p}_u^M\;\textbf{q}_i^M]^T+\textbf{b}_2)...))+\textbf{b}_L)$$

$$\hat{y}_{ui}=\sigma(\textbf{h}^T[\phi^{GMF}\;\phi^{MLP}]^T)$$

where $\textbf{p}_u^G$ and $\textbf{p}_u^M$ denote the user embeddings for the GMF and MLP parts, and $\textbf{q}_i^G$ and $\textbf{q}_i^M$ denote the corresponding item embeddings. This model is dubbed "NeuMF", short for Neural Matrix Factorization.

In [4]:
def NeuMF(X_user, X_item, num_users, num_items, gmf_latent_dim=10, layers=[20, 10]):
    X_GMF = GMF(X_user, X_item, num_users, num_items, gmf_latent_dim)
    X_MLP = MLP(X_user, X_item, num_users, num_items, layers)
    
    X = Concatenate()([X_GMF, X_MLP])
    
    return X

def get_NeuMF_model(num_users, num_items, gmf_latent_dim=10, layers=[20, 10]):
    X_user = Input(shape=(1,), dtype='int32', name = 'user_input')
    X_item = Input(shape=(1,), dtype='int32', name = 'item_input')
    
    X = NeuMF(X_user, X_item, num_users, num_items, gmf_latent_dim, layers)
    
    X = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name='prediction')(X)
    
    model = Model(inputs=[X_user, X_item], outputs=X)
    
    return model

#### 3.4.1 Pre-training

For training, one seeks to minimize the objective function of NeuMF. However, because the function is non-convex, gradient-based optimization can only find locally optimal solutions, so initialization plays an important role in the convergence and performance of the model. The paper proposes to first train GMF and MLP separately and then use their parameters to initialize the corresponding parts of NeuMF. In the output layer, the weights of the two models are concatenated:

$$\textbf{h}=[\alpha\textbf{h}^{GMF}\;(1-\alpha)\textbf{h}^{MLP}]^T$$

where $\textbf{h}^{GMF}$ and $\textbf{h}^{MLP}$ denote the $\textbf{h}$ vector of the pretrained GMF and MLP model, respectively, and $\alpha$ is a hyper-parameter determining the trade-off between the two pre-trained models.

For training GMF and MLP from scratch, Adaptive Moment Estimation (Adam) is used. After feeding the pre-trained parameters into NeuMF, the ensemble model is optimized with Vanilla SGD.
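
The output-layer formula above can be illustrated with a few lines of NumPy. This is a sketch of the concatenation only: the weight and feature values are invented, the function name `fuse_output_weights` is hypothetical, and biases are left out because the formula does not mention them.

```python
import numpy as np

def fuse_output_weights(h_gmf, h_mlp, alpha=0.5):
    """Build h = [alpha * h_GMF, (1 - alpha) * h_MLP] for the NeuMF output layer."""
    return np.concatenate([alpha * h_gmf, (1.0 - alpha) * h_mlp])

# Hypothetical pre-trained output weights of the two models
h_gmf = np.array([0.2, -0.1, 0.4])
h_mlp = np.array([0.5, 0.3])

# Hypothetical last-layer features for one (user, item) pair
phi_gmf = np.array([1.0, 2.0, 3.0])
phi_mlp = np.array([0.5, 1.5])

h = fuse_output_weights(h_gmf, h_mlp, alpha=0.5)
fused_logit = h @ np.concatenate([phi_gmf, phi_mlp])
print(h, fused_logit)
```

With $\alpha=0.5$ the fused pre-activation is the average of the two pre-trained models' pre-activations, so NeuMF starts from an even blend of GMF and MLP before fine-tuning.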

In [5]:
epochs = 20
num_negatives = 4
batch_size = 256
layers = [64,32,16,8]
topK = 10
gmf_latent_dim=8

In [6]:
df_train = pd.read_csv('../data/reddit_train_10.csv', header=None)
df_test_positive = pd.read_csv('../data/reddit_test_positive_10.csv', header=None, usecols=[0,1])
df_test_negative = pd.read_csv('../data/reddit_test_negative_10.csv', header=None)

In [7]:
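# These are the largest user and item ids, not counts; the matrix and the
# embeddings below are sized with +1 so that id 0 is covered as well
num_users = max(df_train.iloc[:, 0])
num_items = max(df_train.iloc[:, 1])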

train = sp.dok_matrix((num_users+1, num_items+1), dtype=np.float32)
for i, row in df_train.iterrows():
    user = row[0]
    item = row[1]
    train[user, item] = 1.0
    
test_positive = [(row[0], row[1]) for _, row in df_test_positive.iterrows()]
test_negative = [row[1:100].values.flatten().tolist() for _, row in df_test_negative.iterrows()]

In [9]:
model = get_NeuMF_model(num_users+1, num_items+1, layers=layers, gmf_latent_dim=gmf_latent_dim)
model.compile(optimizer='adam', loss='binary_crossentropy')

## 4. Building Dataset

### Evaluation Protocols

The paper uses the leave-one-out method to evaluate the performance of item recommendation algorithms. For each user, the last interaction is held out; all held-out interactions form the test set, and the rest are used for training. Because ranking every item for every user is too time-consuming, 100 items that the user hasn't interacted with are randomly sampled for each user, and the held-out item is ranked among them. This method separates our dataset into three files: train.rating, test.rating, and test.negative.
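
The split described above could be produced along the following lines. This is a sketch under assumptions, not the script that generated the files used in this notebook: the column names `user`, `item`, and `timestamp`, the function name `leave_one_out_split`, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def leave_one_out_split(df, num_items, num_negatives=100, seed=0):
    """Hold out each user's latest interaction and sample unseen items for ranking."""
    rng = np.random.default_rng(seed)
    df = df.sort_values(["user", "timestamp"])
    test = df.groupby("user").tail(1)   # last interaction per user
    train = df.drop(test.index)         # everything else

    negatives = {}
    for user, items in df.groupby("user")["item"].apply(set).items():
        # Candidates are items the user has never interacted with
        candidates = np.setdiff1d(np.arange(num_items), list(items))
        size = min(num_negatives, len(candidates))
        negatives[int(user)] = rng.choice(candidates, size=size, replace=False).tolist()
    return train, test, negatives

# Hypothetical toy interactions
toy = pd.DataFrame({
    "user": [0, 0, 0, 1, 1],
    "item": [0, 1, 2, 1, 3],
    "timestamp": [10, 20, 30, 5, 15],
})
train_df, test_df, negatives = leave_one_out_split(toy, num_items=6, num_negatives=2)
print(test_df)
print(negatives)
```

Sampling excludes both the training items and the held-out item, so the positive test item is the only interacted item in each user's candidate list.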

In [10]:
def getHitRatio(ranklist, pos_item):
    for item in ranklist:
        if item == pos_item:
            return 1
    return 0

def evaluate_model(model, test_positive, test_negative, top_k):
    hits = []
    for idx in range(len(test_positive)):
        user, pos_item = test_positive[idx]
        # Rank the held-out positive item against the sampled negatives
        items = test_negative[idx] + [pos_item]
        
        users = np.full(len(items), user, dtype='int32')
        predictions = model.predict([users, np.array(items)], batch_size=100, verbose=0)
        
        # predictions has shape (len(items), 1); keep one scalar score per item
        map_item_score = {item: predictions[j][0] for j, item in enumerate(items)}
        
        ranklist = heapq.nlargest(top_k, map_item_score, key=map_item_score.get)
        hits.append(getHitRatio(ranklist, pos_item))
        
    return hits

In [11]:
def get_train_instances(train, num_negatives):
    user_input, item_input, labels = [],[],[]
    # Sample negatives over every item id covered by the interaction matrix
    num_items = train.shape[1]
    for (u, i) in train.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        for t in range(num_negatives):
            j = np.random.randint(num_items)
            while (u, j) in train:
                j = np.random.randint(num_items)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

In [13]:
model_out_file = '../pretrain/reddit_NeuMF_%d_%s_%d.h5' %(gmf_latent_dim, layers, time())

t1 = time()
hits = evaluate_model(model, test_positive, test_negative, topK)
hr = np.array(hits).mean()
print('Init: HR = %.4f\t [%.1f s]' % (hr, time()-t1))

best_hr, best_iter = hr, -1
for epoch in range(epochs):
    t1=time()
    user_input, item_input, labels = get_train_instances(train, num_negatives)
    hist = model.fit([np.array(user_input), np.array(item_input)], np.array(labels), batch_size=batch_size, \
                     epochs=1, shuffle=True)
    
    t2 = time()
    hits = evaluate_model(model, test_positive, test_negative, topK)
    hr, loss = np.array(hits).mean(), hist.history['loss'][0]
    print('Iteration %d [%.1f s]: HR = %.4f, loss = %.4f [%.1f s]' 
        % (epoch,  t2-t1, hr, loss, time()-t2))
    if hr > best_hr:
        best_hr, best_iter = hr, epoch
        model.save_weights(model_out_file, overwrite=True)
        
print("End. Best Iteration %d:  HR = %.4f. " %(best_iter, best_hr))

Init: HR = 0.0957	 [18.2 s]
Epoch 1/1
Iteration 0 [1101.9 s]: HR = 0.8952, loss = 0.2132 [17.8 s]
Epoch 1/1
Iteration 1 [1118.3 s]: HR = 0.9114, loss = 0.1873 [20.0 s]
Epoch 1/1
Iteration 2 [1161.2 s]: HR = 0.9205, loss = 0.1763 [22.6 s]
Epoch 1/1
Iteration 3 [1150.8 s]: HR = 0.9205, loss = 0.1706 [18.2 s]
Epoch 1/1
Iteration 4 [1110.7 s]: HR = 0.9223, loss = 0.1662 [18.7 s]
Epoch 1/1
Iteration 5 [1124.3 s]: HR = 0.9224, loss = 0.1632 [19.8 s]
Epoch 1/1
Iteration 6 [1128.0 s]: HR = 0.9237, loss = 0.1601 [18.9 s]
Epoch 1/1
Iteration 7 [1145.0 s]: HR = 0.9233, loss = 0.1576 [17.5 s]
Epoch 1/1
Iteration 8 [1122.3 s]: HR = 0.9227, loss = 0.1552 [18.2 s]
Epoch 1/1
Iteration 9 [1108.2 s]: HR = 0.9231, loss = 0.1530 [19.5 s]
Epoch 1/1
Iteration 10 [1182.7 s]: HR = 0.9239, loss = 0.1511 [18.6 s]
Epoch 1/1
Iteration 11 [1133.8 s]: HR = 0.9262, loss = 0.1492 [19.4 s]
Epoch 1/1
Iteration 12 [1113.8 s]: HR = 0.9210, loss = 0.1475 [18.4 s]
Epoch 1/1
Iteration 13 [1078.6 s]: HR = 0.9240, loss = 0.14

In [14]:
model_out_file = '../pretrain/reddit_NeuMF_%d_%s_%d.h5' %(gmf_latent_dim, layers, time())
model.save_weights(model_out_file, overwrite=True)

In [95]:
model.load_weights('../pretrain/reddit_NeuMF_8_[64, 32, 16, 8]_1563000005.h5')

In [96]:
GMF_item_embedding = model.get_layer('GMF_item_embedding')
MLP_item_embedding = model.get_layer('MLP_item_embedding')

GMF_item_weights = GMF_item_embedding.get_weights()[0]
MLP_item_weights = MLP_item_embedding.get_weights()[0]

NeuMF_item_weights = np.concatenate((GMF_item_weights, MLP_item_weights),axis=1)
NeuMF_item_weights_norma = NeuMF_item_weights/np.linalg.norm(NeuMF_item_weights, axis = 1).reshape((-1, 1))

In [111]:
conn = sqlite3.connect('../app/ncf.db')
c = conn.cursor()

gmf_weights_columns = ','.join([ f'gmf_weights_{i+1}' for i in range(gmf_latent_dim)])
mlp_weights_columns = ','.join([ f'mlp_weights_{i+1}' for i in range(layers[0]//2)])

c.execute(f'CREATE TABLE weights (item_id, {gmf_weights_columns}, {mlp_weights_columns})')
#c.execute("")

#conn.commit()

#conn.close()


<sqlite3.Cursor at 0x265000328f0>

In [112]:
NeuMF_item_weights_norma

array([[ 0.32897615,  0.10641152, -0.42197388, ..., -0.04275219,
         0.12431271,  0.02298403],
       [ 0.3290243 ,  0.10623664, -0.37675765, ..., -0.12078571,
         0.19027624, -0.06052374],
       [ 0.37128088,  0.11445459, -0.39325753, ...,  0.01715298,
         0.11689451,  0.02376154],
       ...,
       [ 0.2675498 ,  0.08609024, -0.3365847 , ..., -0.01009374,
         0.19148614, -0.05352863],
       [ 0.14706582,  0.07890623, -0.31575355, ...,  0.06139897,
         0.22040662, -0.02438055],
       [-0.2887616 , -0.04176541,  0.25244355, ..., -0.22265546,
         0.08488164, -0.15461084]], dtype=float32)

In [100]:
import json

with open('../data/subreddit10.json') as f:
    [d, inv_d] = json.load(f)

In [102]:
def get_recommendations(subreddit, num_recommendations):
    index = inv_d[subreddit.lower()]
    # Rows are L2-normalized, so the dot product is the cosine similarity
    sims = np.dot(NeuMF_item_weights_norma, NeuMF_item_weights_norma[int(index)])
    # argsort is ascending, so the most similar subreddits are at the end
    closest = np.argsort(sims)[-num_recommendations:]
    max_width = max(len(d[str(c)]) for c in closest)
    for c in reversed(closest):
        print(f'Subreddit: {d[str(c)]:{max_width + 2}} Similarity: {sims[c]:.3}')
        
get_recommendations('twice', 20)

Subreddit: twice           Similarity: 1.0
Subreddit: blackpink       Similarity: 0.944
Subreddit: twicemedia      Similarity: 0.922
Subreddit: digimonlinkz    Similarity: 0.905
Subreddit: japanesemusic   Similarity: 0.894
Subreddit: supersentai     Similarity: 0.89
Subreddit: falcom          Similarity: 0.884
Subreddit: pocophones      Similarity: 0.883
Subreddit: animepiracy     Similarity: 0.881
Subreddit: magiarecord     Similarity: 0.88
Subreddit: osuskins        Similarity: 0.878
Subreddit: symmetramains   Similarity: 0.878
Subreddit: bangtan         Similarity: 0.875
Subreddit: hajimenoippo    Similarity: 0.875
Subreddit: dissidia        Similarity: 0.874
Subreddit: dissidiaffoo    Similarity: 0.872
Subreddit: sololeveling    Similarity: 0.871
Subreddit: citrusmanga     Similarity: 0.87
Subreddit: izone           Similarity: 0.869
Subreddit: koreanvariety   Similarity: 0.868


In [79]:
import h5py

f = h5py.File('../pretrain/reddit_NeuMF_8_[64, 32, 16, 8]_1563000005.h5', 'r')
# Use a separate name so the id-to-subreddit dict `d` is not overwritten
gmf_group = f['GMF_item_embedding']

In [92]:
f['GMF_item_embedding']['GMF_item_embedding_1']['embeddings:0'].shape

(54926, 8)

In [93]:
f['MLP_item_embedding']['MLP_item_embedding_1']['embeddings:0'].shape

(54926, 32)