# Neural Collaborative Filtering Paper Summary

Sources: https://arxiv.org/pdf/1708.05031.pdf, https://github.com/hexiangnan/neural_collaborative_filtering

## 1. Introduction

Collaborative filtering is a type of personalized recommender system that models a user's preference for items based on their past interactions. Matrix factorization (MF) is the most popular method for collaborative filtering. MF models the interaction between a user and an item as the inner product of their latent vectors. This paper focuses on using neural networks to learn the interaction function from data, rather than handcrafting it.

It specifically focuses on the implicit feedback case, in which user interactions are reflected through behaviors like watching videos and purchasing products. In contrast, explicit feedback reflects user interaction through reviews and ratings of items. The implicit case is more difficult because user satisfaction is not observed and negative feedback is absent.

## 2. Preliminaries

### 2.1 Learning from Implicit Data

Let M denote the number of users and N denote the number of items. The user-item interaction matrix $\textbf{Y} \in \mathbb{R}^{M \times N}$ built from users' implicit feedback is defined as:

$$y_{ui}=\begin{cases} 
          1 & \text{if interaction (user u, item i) is observed} \\
          0 & \text{otherwise.} 
       \end{cases}$$

The problem is then to estimate the scores of all unobserved entries in $\textbf{Y}$, which are used to rank the items, through an interaction function. The interaction function can be formalized as $\hat{y}_{ui} = f(u,i|\Theta)$, where $\hat{y}_{ui}$ denotes the predicted score of interaction $y_{ui}$, $\Theta$ denotes model parameters, and f denotes the function that maps model parameters to the predicted score.
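
The definition of $\textbf{Y}$ above can be sketched in a few lines of NumPy. This is only an illustration: the interaction pairs, the sizes, and the variable names are made up and do not come from the paper.

```python
import numpy as np

# Hypothetical interaction log of (user index, item index) pairs.
interactions = [(0, 1), (0, 3), (1, 0), (2, 1), (2, 2)]
num_users, num_items = 3, 4  # M and N

# y_ui = 1 if the interaction (u, i) is observed, 0 otherwise.
Y = np.zeros((num_users, num_items), dtype=np.int8)
for u, i in interactions:
    Y[u, i] = 1

# Unobserved entries of user 0: the candidates a model would score and rank.
candidates = np.flatnonzero(Y[0] == 0)
print(Y)
print(candidates)
```

Note that a 0 here only means "not observed"; it is not a negative rating.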

The parameters $\Theta$ can be estimated by optimizing an objective function. Two commonly used objectives are the pointwise loss and the pairwise loss. The pointwise approach looks at a single item at a time, typically minimizing the squared error between $\hat{y}_{ui}$ and its target $y_{ui}$. The pairwise approach looks at a pair of items at a time: it maximizes the margin between the score of an observed entry, $\hat{y}_{ui}$, and that of an unobserved entry, $\hat{y}_{uj}$.
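
To make the difference between the two objectives concrete, here is a minimal sketch. The function names are hypothetical, and the pairwise loss shown is a BPR-style log-sigmoid loss, which is one common choice and an assumption here rather than something this summary specifies.

```python
import math

def pointwise_squared_loss(y, y_hat):
    """Pointwise: compare one predicted score against its own target."""
    return (y - y_hat) ** 2

def pairwise_bpr_loss(y_hat_ui, y_hat_uj):
    """Pairwise (BPR-style, assumed): penalize an unobserved item j
    scoring close to or above an observed item i for the same user."""
    diff = y_hat_ui - y_hat_uj
    return math.log(1.0 + math.exp(-diff))  # equals -log(sigmoid(diff))

# Pointwise: only the distance to the target matters.
print(pointwise_squared_loss(1.0, 0.8))

# Pairwise: only the ordering and gap between the two scores matter.
print(pairwise_bpr_loss(2.0, 0.0))  # observed item ranked higher: small loss
print(pairwise_bpr_loss(0.0, 2.0))  # observed item ranked lower: large loss
```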

The presented NCF framework parametrizes the function f using neural networks to estimate $\hat{y}_{ui}$. It naturally supports both pointwise and pairwise learning.

### 2.2 Matrix Factorization

MF associates each user and item with a real-valued vector of latent features. Let $\textbf{p}_u$ and $\textbf{q}_i$ denote the latent vectors for user u and item i, respectively. MF estimates an interaction $y_{ui}$ as the inner product of $\textbf{p}_u$ and $\textbf{q}_i$:

$$\hat{y}_{ui}=f(u,i| \textbf{p}_u,\textbf{q}_i)=\textbf{p}_u^T\textbf{q}_i=\sum_{k=1}^{K}p_{uk}q_{ik}$$

where K denotes the dimension of the latent space. MF, however, has its limitations. Consider the following figure.

<img src="../media/figure1.png">

Suppose we use the Jaccard coefficient to measure the similarity between two users. For the first three users, we have $s_{23}>s_{12}>s_{13}$. Now consider a new user $u_4$ with $s_{41}>s_{43}>s_{42}$, meaning that $u_4$ is most similar to $u_1$, then $u_3$, then $u_2$. If the MF model places $\textbf{p}_4$ closest to $\textbf{p}_1$, then $\textbf{p}_4$ ends up closer to $\textbf{p}_2$ than to $\textbf{p}_3$, which contradicts $s_{43}>s_{42}$ and leads to a large ranking loss. A simple solution would be to use a larger number of latent factors K, but this may cause the model to overfit the data. The paper instead addresses this limitation by learning the interaction function from data using DNNs.
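
The similarity orderings in this argument can be checked with a short script. The interaction sets below are an assumption: they were chosen so that the Jaccard coefficients reproduce the two orderings stated above, and they may not match the figure exactly.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two item sets."""
    return len(a & b) / len(a | b)

# Assumed item sets per user (illustrative; not read from the figure).
users = {
    1: {1, 2, 3, 5},
    2: {2, 3},
    3: {2, 3, 4},
    4: {1, 3, 4, 5},
}

# s[(a, b)] is the similarity between user a and user b.
s = {(a, b): jaccard(users[a], users[b])
     for a in users for b in users if a != b}

print(s[2, 3], s[1, 2], s[1, 3])  # ordering among the first three users
print(s[4, 1], s[4, 3], s[4, 2])  # ordering for the new user u4
```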

## 3. Neural Collaborative Filtering

### 3.1 General Framework

Here, we frame the collaborative filtering process as a multi-layer perceptron, where we input two feature vectors, $\textbf{v}_u$ and $\textbf{v}_i$, that describe user u and item i. 

<img src="../media/figure2.png">

Since we are focused on the pure collaborative filtering case, the feature vectors are just the one-hot encodings of the user and the item identities. The sparse user vector is passed through a fully connected embedding layer that outputs a dense latent vector, and the same is done for the item vector.
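
Because the input is one-hot, the fully connected layer reduces to a row lookup in its weight matrix. The sketch below illustrates this under assumed sizes; `P` stands for a user embedding matrix with one row per user, and the values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, K = 5, 3
P = rng.normal(size=(num_users, K))  # one K-dimensional row per user

u = 2
v_u = np.zeros(num_users)
v_u[u] = 1.0  # one-hot encoding of user u

# Multiplying by a one-hot vector selects a single row of P.
p_u = P.T @ v_u
print(np.allclose(p_u, P[u]))
```

This is why implementations typically use an embedding lookup instead of materializing the one-hot vectors.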

The user and item latent vectors are then fed to a multi-layer neural architecture (neural collaborative filtering layers) that maps them to a prediction score.  The network is then trained through pointwise or pairwise learning. The paper focuses only on pointwise training.

The predictive model can be formulated as such:

$$\hat{y}_{ui}=f(\textbf{P}^T\textbf{v}_u,\textbf{Q}^T\textbf{v}_i|\textbf{P},\textbf{Q},\Theta)$$

$$f(\textbf{P}^T\textbf{v}_u,\textbf{Q}^T\textbf{v}_i)=\phi_{out}(\phi_{X}(...\phi_2(\phi_1(\textbf{P}^T\textbf{v}_u,\textbf{Q}^T\textbf{v}_i))))$$

Here, $\phi_{out}$ denotes the mapping function of the output layer and $\phi_x$ denotes the mapping function of the x-th neural collaborative filtering (CF) layer; there are X neural CF layers in total.

#### 3.1.1 Learning NCF

We can then treat $y_{ui}$ as a label, where 1 means item i is relevant to user u, and 0 otherwise. The output $\hat{y}_{ui}$ must therefore be constrained to the range [0, 1], which can be done by using the logistic (sigmoid) function as the activation function of the output layer $\phi_{out}$.

With the above settings, we define the loss function as 

$$L=-\sum_{(u,i)\in\gamma\cup\gamma^-}\left[y_{ui}\log\hat{y}_{ui}+(1-y_{ui})\log(1-\hat{y}_{ui})\right]$$

where $\gamma$ denotes the set of observed interactions in $\textbf{Y}$, and $\gamma^-$ denotes the set of negative instances. This is the binary cross-entropy (log) loss; it is minimized, and the optimization can be done with stochastic gradient descent. For the negative instances $\gamma^-$, we uniformly sample from the unobserved interactions in each iteration and control the sampling ratio with respect to the number of observed interactions.
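
A minimal sketch of the log loss and of uniform negative sampling follows. The helper names, the `num_neg` parameter, and the toy matrix are hypothetical; sampling a fixed number of negatives per observed interaction is one simple way to control the ratio and is an assumption here.

```python
import numpy as np

def sample_negatives(Y, num_neg, rng):
    """For each observed (u, i), draw num_neg unobserved items of user u
    uniformly at random (with replacement)."""
    negatives = []
    for u, _ in zip(*np.nonzero(Y)):
        unobserved = np.flatnonzero(Y[u] == 0)
        for j in rng.choice(unobserved, size=num_neg, replace=True):
            negatives.append((int(u), int(j)))
    return negatives

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy summed over all (label, prediction) pairs."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

rng = np.random.default_rng(0)
Y = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0]])
negs = sample_negatives(Y, num_neg=2, rng=rng)  # 3 positives -> 6 negatives
print(negs)

labels = np.array([1.0, 0.0])
preds = np.array([0.9, 0.1])
print(log_loss(labels, preds))
```

Resampling the negatives in each iteration means the model sees different unobserved entries over time instead of a fixed subset.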

### 3.2 Generalized Matrix Factorization (GMF)

Next, we show how MF is just a special case of the NCF architecture. We can think of the output of the embedding layer as the latent vectors for user and item. Let the user latent vector $\textbf{p}_u$ be $\textbf{P}^T\textbf{v}_u$ and the item latent vector $\textbf{q}_i$ be $\textbf{Q}^T\textbf{v}_i$. We define the mapping function of the first neural CF layer as:

$$\phi_1(\textbf{p}_u,\textbf{q}_i)=(\textbf{p}_u\odot\textbf{q}_i)$$

where $\odot$ denotes the element-wise product of vectors. We then project the vector to the output layer:

$$\hat{y}_{ui}=a_{out}(\textbf{h}^T(\textbf{p}_u\odot\textbf{q}_i))$$

If $a_{out}$ is an identity function and $\textbf{h}$ is just a vector of 1's, the MF model can be recovered.

The paper proposes a Generalized Matrix Factorization that uses the sigmoid function as $a_{out}$ and learns $\textbf{h}$ from data with the log loss.
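
The relationship between GMF and plain MF can be checked numerically. The sketch below uses made-up vectors and a hypothetical `gmf_score` helper; it is not the reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmf_score(p_u, q_i, h, a_out=sigmoid):
    """a_out(h^T (p_u * q_i)) with an element-wise product inside."""
    return a_out(h @ (p_u * q_i))

p_u = np.array([0.5, -1.0, 2.0])
q_i = np.array([1.0, 0.5, 0.25])

# Identity activation and h = all ones recovers the MF inner product.
mf = gmf_score(p_u, q_i, h=np.ones(3), a_out=lambda x: x)
print(np.isclose(mf, p_u @ q_i))

# GMF: sigmoid output and a non-uniform h (here fixed by hand, normally
# learned) that weights each latent dimension differently.
h = np.array([0.3, -0.2, 0.8])
print(gmf_score(p_u, q_i, h))
```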



### 3.3 Multi-Layer Perceptron (MLP)

The GMF only uses a fixed element-wise product between the two latent vectors to model their interactions. More flexibility and non-linearity can be obtained by also concatenating the two latent vectors and feeding the concatenation to a standard Multi-Layer Perceptron that can learn the interaction between the user and the item.

The MLP model is defined as:

$$\textbf{z}_1=\phi_1(\textbf{p}_u,\textbf{q}_i)=[\textbf{p}_u\;\textbf{q}_i]^T$$

$$\phi_2(\textbf{z}_1)=a_2(\textbf{W}_2^T\textbf{z}_1+\textbf{b}_2)$$

$$...$$

$$\phi_L(\textbf{z}_{L-1})=a_L(\textbf{W}_L^T\textbf{z}_{L-1}+\textbf{b}_L)$$

$$\hat{y}_{ui}=\sigma(\textbf{h}^T\phi_L(\textbf{z}_{L-1}))$$

where $\textbf{W}_x$, $\textbf{b}_x$, and $a_x$ denote the weight matrix, bias vector, and activation function for the x-th layer's perceptron, respectively. Various activation functions can be chosen for the MLP layers, but the paper opts to use the ReLU, because it is more biologically plausible, proven to be non-saturated, and encourages sparse activations.
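
The MLP equations translate into a short forward pass. This is a sketch under assumptions: the layer sizes, the random initialization, and the `mlp_score` helper are illustrative, and no training is shown.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_score(p_u, q_i, weights, biases, h):
    """Concatenate the latent vectors, apply ReLU layers, then a sigmoid output."""
    z = np.concatenate([p_u, q_i])        # z_1 = [p_u; q_i]
    for W, b in zip(weights, biases):     # layers 2..L
        z = relu(W.T @ z + b)
    return sigmoid(h @ z)

rng = np.random.default_rng(0)
K = 4
sizes = [2 * K, 8, 4]  # assumed layer widths: input 8 -> hidden 8 -> hidden 4
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
h = rng.normal(scale=0.1, size=sizes[-1])

score = mlp_score(rng.normal(size=K), rng.normal(size=K), weights, biases, h)
print(score)
```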

### 3.4 Fusion of GMF and MLP

The one-hot encoded user and item vectors can be fed to two different embedding layers, one for GMF and one for MLP. The two models are then combined by concatenating their last hidden layers. The fused model is pictured below.

<img src="../media/figure3.png">

The final fused model can be formulated as such:

$$\phi^{GMF}=\textbf{p}_u^G\odot\textbf{q}_i^G$$

$$\phi^{MLP}=a_L(\textbf{W}_L^T(a_{L-1}(...a_2(\textbf{W}_2^T[\textbf{p}_u^M\;\textbf{q}_i^M]^T+\textbf{b}_2)...))+\textbf{b}_L)$$

$$\hat{y}_{ui}=\sigma(\textbf{h}^T[\phi^{GMF}\;\phi^{MLP}]^T)$$

where $\textbf{p}_u^G$ and $\textbf{p}_u^M$ denote the user embeddings for the GMF and MLP parts, and $\textbf{q}_i^G$ and $\textbf{q}_i^M$ likewise denote the item embeddings. This model is dubbed "NeuMF", short for Neural Matrix Factorization.
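
The three NeuMF equations can be combined into one forward function. As before, this is a hedged sketch: the `neumf_score` helper, the embedding sizes, and the random weights are assumptions for illustration only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neumf_score(p_g, q_g, p_m, q_m, weights, biases, h):
    """Fuse the GMF and MLP branches by concatenating their last hidden layers."""
    phi_gmf = p_g * q_g                    # GMF branch: element-wise product
    z = np.concatenate([p_m, q_m])         # MLP branch: concatenation
    for W, b in zip(weights, biases):
        z = relu(W.T @ z + b)
    return sigmoid(h @ np.concatenate([phi_gmf, z]))

rng = np.random.default_rng(0)
K_gmf, K_mlp, hidden = 3, 4, 6  # the two branches may use different sizes
p_g, q_g = rng.normal(size=K_gmf), rng.normal(size=K_gmf)
p_m, q_m = rng.normal(size=K_mlp), rng.normal(size=K_mlp)
weights = [rng.normal(scale=0.1, size=(2 * K_mlp, hidden))]
biases = [np.zeros(hidden)]
h = rng.normal(scale=0.1, size=K_gmf + hidden)

print(neumf_score(p_g, q_g, p_m, q_m, weights, biases, h))
```

Using separate embeddings for the two branches lets each one pick the embedding size that suits it best.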

#### 3.4.1 Pre-training

For training, one seeks to minimize the objective function of NeuMF. However, because the function is non-convex, gradient-based optimization can only find locally optimal solutions. Initialization therefore plays an important role in the convergence and performance of deep learning models. The paper proposes to first train GMF and MLP separately with random initializations until convergence. Their model parameters are then used to initialize the corresponding parts of NeuMF. In the output layer, the weights of the two models are concatenated:

$$\textbf{h}=[\alpha\textbf{h}^{GMF}\;(1-\alpha)\textbf{h}^{MLP}]^T$$

where $\textbf{h}^{GMF}$ and $\textbf{h}^{MLP}$ denote the $\textbf{h}$ vector of the pretrained GMF and MLP model, respectively, and $\alpha$ is a hyper-parameter determining the trade-off between the two pre-trained models.
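
The output-layer initialization is a weighted concatenation, as the following sketch shows. The function name and the example vectors are hypothetical; the default `alpha=0.5` simply weights both pre-trained models equally.

```python
import numpy as np

def init_neumf_output_weights(h_gmf, h_mlp, alpha=0.5):
    """Build NeuMF's initial h as [alpha * h_gmf, (1 - alpha) * h_mlp]."""
    return np.concatenate([alpha * h_gmf, (1.0 - alpha) * h_mlp])

# Placeholder values standing in for the output weights of pre-trained models.
h_gmf = np.array([1.0, 2.0])
h_mlp = np.array([4.0, 6.0, 8.0])

h = init_neumf_output_weights(h_gmf, h_mlp, alpha=0.25)
print(h)
```

The remaining NeuMF parameters (embeddings and hidden layers) would be copied from the corresponding pre-trained model without scaling.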

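For training GMF and MLP from scratch, Adaptive Moment Estimation (Adam) is used. After feeding the pre-trained parameters into NeuMF, the fused model is optimized with vanilla SGD rather than Adam, because only the pre-trained parameters are carried over and the momentum information that Adam relies on is not saved.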