In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir( os.path.join('..','..', 'notebook_format') )
from formats import load_style
load_style(css_style = 'custom2.css')

In [2]:
os.chdir(path)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
%matplotlib inline
%load_ext watermark
%load_ext autoreload 
%autoreload 2

from tqdm import tqdm
from scipy.sparse import csr_matrix
from sklearn.metrics import mean_squared_error

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib,sklearn,tqdm,scipy

Ethen 2017-01-25 22:22:54 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.3
pandas 0.18.1
matplotlib 1.5.1
sklearn 0.18
tqdm 4.11.0-5982db6
scipy 0.18.1


# Collaborative Filtering for Implicit Feedback Datasets

One common scenario in real-world recommendation system is we only have **implicit** instead of **explicit** user-item interaction data. To elaborate on this a little bit more, a user may be searching for an item on the web, or listening to songs. Unlike a rating data, where we have direct access to the user's preference towards an item, these type of actions do not **explicitly** state or quantify any preference of the user for the item, but instead gives us **implicit confidence** about the user’s opinion.

Even when we do have explicit data, it might still be a good idea to incorporate implicit data into the model. Consider, for example, listening to songs. When users listen to music on a streaming service, they might rarely ever rate a song that he/she like or dislike. But more often they skip a song, or listen only halfway through it if they dislike it. If the user really liked a song, they will often come back and listen to it. So, to infer a user's musical taste profile, their listens, repeat listens, skips and fraction of tracks listened to, etc. might be far more valuable signals than explicit ratings.

## Formulation

Recall from the previous notebook that the loss function for training the recommendation model on explicit feedback data was:

\begin{align}
L_{explicit} &= \sum\limits_{u,i \in S}( r_{ui} - x_{u} y_{i}^{T} )^{2} + \lambda \big( \sum\limits_{u} \left\Vert x_{u} \right\Vert^{2} + \sum\limits_{i} \left\Vert y_{i} \right\Vert^{2} \big)
\end{align}

Where:

- $r_{ui}$ is the true rating given by user $u$ to the item $i$
- $x_u$ and $y_i$ are user u's and item i's latent factors, both are $1×d$ dimensional, where $d$ the number of latent factors that the user can specify
- $S$ was the set of all user-item ratings
- $\lambda$ controls the regularization strength that prevents overfitting the user and item vectors

To keep it concrete, let's assume we're working music data and the value of our $r_{ui}$ will consists of implicit ratings that counts the number of times a user has listened to a song (song listen count). Then new formulation becomes:

\begin{align}
L_{implicit} &= \sum\limits_{u,i} c_{ui}( p_{ui} - x_{u} y_{i}^{T} )^2 + \lambda \big( \sum\limits_{u} \left\Vert x_{u} \right\Vert^{2} + \sum\limits_{i} \left\Vert y_{i} \right\Vert^{2} \big)
\end{align}

Recall that with implicit feedback, we do not have ratings anymore; rather, we have users' preferences for items. Therefore, in the new loss function, the ratings $r_{ui}$ has been replaced with a preference $p_{ui}$ indicating the preference of user $u$ to item $i$. $p_{ui}$ is a set of binary variables and is computed by binarizing $r_{ui}$.

\begin{align}
p_{ui} &= \begin{cases} 1 &\mbox{if } r_{ui} > 0 \\ 0 & \mbox{otherwise} \end{cases}
\end{align}

We make the assumption that if a user has interacted at all with an item ($r_{ui} > 0$), then we set $p_{ui} = 1$ to indicate that user $u$ has a liking/preference for item $i$. Otherwise, we set $p_{ui} = 0$. However, these assumptions comes with varying degrees of confidence. First of all, when $p_{ui} = 0$, we assume that it should be associated with a lower confidence, as there are many reasons beyond disliking the item as to why the user has not interacted with it. e.g. Unaware of it's existence. On the other hand, as the number of implicit feedback, $r_{ui}$,   grows, we have a stronger indication that the user does indeed like the item (regardless of whether he/she if buying a gift for someone else). So to measure the level of confidence mentioned above, we introduce another set of variables $c_{ui}$ that measures our confidence in observing $p_{ui}$:

\begin{align}
c_{ui} = 1 + \alpha r_{ui}
\end{align}

Where the 1 ensures we have some minimal confidence for every user-item pair, and as we observe more and more implicit feedback (as $r_{ui}$ gets larger and larger), our confidence in $p_{ui} = 1$ increases accordingly. And the term $\alpha$ is a parameter that we have to specify to control the rate of the increase. This formulation takes intuitive sense when we look back at the $c_{ui}( p_{ui} - x_{u} y_{i}^{T} )^2$ term in the loss function. A larger $c_{ui}$ means that the prediction $x_{u} y_{i}^{T}$ has to be that much closer to $p_{ui}$ so that term will not to contribute too much to the total loss.

The implementation in the later section will be based on the formula above, but note that there are many ways in which we can tune the formulation above. For example, we can derive $p_{ui}$ from $r_{ui}$ differently. So instead of setting the binary cutoff at 0, we can set it at a threshold that we chose. Similarly, there are many ways to transform $r_{ui}$ into the confidence level $c_{ui}$. e.g. we can use:

\begin{align}
c_{ui} = 1 + \alpha log \left( 1 + r_{ui} / \epsilon \right)
\end{align}

Regardless of the scheme, it's important to realize that we are transforming the raw observation $r_{ui}$ into two distinct representation, the preference $p_{ui}$ and the confidence levels of the preference $c_{ui}$.

## Alternating Least Squares

Let's assume that we have $m$ useres and $n$ items. Now, to solve for the loss function above, we start by treating $y_i$ as constant and solve the loss function with respect to $x_u$.

\begin{align}
\frac{\partial L_{implicit}}{\partial x_u} 
&\implies -2 \sum_i c_{ui}(p_{ui} - x_{u} y_{i}^{T})y_i + 2 \lambda x_u = 0 \\
&\implies -2 Y^T C^u p_u + 2 Y^T C^u Y x_u + 2 \lambda x_u = 0 \\
&\implies (Y^T C^u Y + \lambda I)x_u = Y^T C^u p_{u} \\
&\implies x_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p_u
\end{align}

Where: 

- $Y \in \mathbb{R}^{n \times d}$ represents all item row vectors vertically stacked on each other
- $p_u \in \mathbb{R^{n \times 1}}$ contains element all of the preferences of the user
- The diagonal matrix $C^u \in \mathbb{R^{n \times n}}$ consists of $c_{ui}$ in row/column $i$, which is the of confidence in items for this user. e.g. if $u = 0$ then the matrix for user $u_0$ will look like:

\begin{align}
{C}^{u_0} = \begin{bmatrix} c_{u_{01}} & 0 & 0 & 0 & ... & 0 \\ 0 & c_{u_{02}} & 0 & 0 & ... &0\\ ... \\ ... \\ 0 & 0 & 0 & 0 & ... & c_{u_{0n}}\end{bmatrix}
\end{align}

The main computational bottleneck in the expression above is the need to compute $Y^T C^u Y$ for every user. Speedup can be obtained by re-writing the expression as:

\begin{align}
{Y}^T {C}^{u} {Y} &= Y^T Y + {Y}^T \left( C^u - I \right) Y
\end{align}

Notice now the term $Y^T Y$ becomes independent of each user and can be computed independently, next notice $\left( C^u - I \right)$ now has only $n_u$ non-zero elements, where $n_u$ is the number of items for which $r_{ui} > 0$. Similarly, $C^u p_u$ contains only $n_u$ non-zero elements since $p_u$ is a binary transformation of $r_{ui}$. Thus the final formulation becomes:

\begin{align}
\frac{\partial L_{implicit}}{\partial x_u} 
&\implies x_u = (Y^T Y + Y^T \left( C^u - I \right) Y + \lambda I)^{-1} Y^T C^u p_u
\end{align}

After solving for $x_u$ the same procedure can be carried out for to solve for $y_i$ giving a similar expression:

\begin{align}
\frac{\partial L_{implicit}}{\partial y_i} 
&\implies y_i = (X^T X + X^T \left( C^i - I \right) X + \lambda I)^{-1} X^T C^i p_i
\end{align}

## Implementation

We'll use the same movielens dataset like the previous notebook. The movielens data is not an implicit feedback dataset as the user did provide explicit ratings, but we will use it for now to test out our implementation. The preprocessing procedure of loading the data and doing the train/test split is the same as the previous notebook, thus a helper function is created and saved under the same folder as the notebook. All we need to do is tell it the file directory.

In [4]:
from movielens import create_train_test
file_dir = 'ml-100k'
train, test = create_train_test(file_dir)

# convert to sparse matrix
Rui_train = csr_matrix(train)
Rui_test = csr_matrix(test)
Rui_train

<943x1682 sparse matrix of type '<class 'numpy.float64'>'
	with 90570 stored elements in Compressed Sparse Row format>

The following implementation uses some tricks to speed up the procedure. First of all, when we need to solve $Ax = b$ where $A$ is an $n \times n$ matrix, a lot of books might write the solution as $x = A^{-1} b$, however, in practice there is hardly ever a good reason to calculate that it that way as solving the equation $Ax = b$ is faster than finding $A^{-1}$.

The next one is the idea of computing matrix product $X^T X$ using a [outer product](https://docs.scipy.org/doc/numpy/reference/generated/numpy.outer.html) of each row.

In [4]:
# example matrix
X = np.array([[9, 3, 5], [4, 1, 2]]).T
X

array([[9, 4],
       [3, 1],
       [5, 2]])

In [5]:
# normal matrix product
X.T.dot(X)

array([[115,  49],
       [ 49,  21]])

In [6]:
# intialize an empty array
end_result = np.zeros((2, 2))

# loop through each row add up the outer product
for i in range(X.shape[0]):
    out = np.outer(X[i], X[i])
    end_result += out
    print('row:\n', X[i])
    print('outer product of row:\n', out)

end_result

row:
 [9 4]
outer product of row:
 [[81 36]
 [36 16]]
row:
 [3 1]
outer product of row:
 [[9 3]
 [3 1]]
row:
 [5 2]
outer product of row:
 [[25 10]
 [10  4]]


array([[ 115.,   49.],
       [  49.,   21.]])

The way that this speed things is that the matrix product is now the sum of the outer products of the rows, where each row's computation is independent of another can be computed in the parallelized fashion then added back together!

In [5]:
class ImplicitMF:
    """
    Alternating Least Squares for implicit feedback

    Parameters
    ----------
    n_iters : int
        number of iterations to train the algorithm

    n_factors : int
        number/dimension of user and item latent factors

    alpha : int
        scaling factor that indicates the level of confidence in preference

    reg : int
        regularization term for the user and item latent factors

    seed : int
        seed for the randomly initialized user, item latent factors

    Reference
    ---------
    Collaborative Filtering for Implicit Feedback Datasets
    http://yifanhu.net/PUB/cf.pdf
    """
    def __init__(self, n_iters, n_factors, alpha, reg, seed):
        self.reg = reg
        self.seed = seed
        self.alpha = alpha
        self.n_iters = n_iters
        self.n_factors = n_factors
    
    def fit(self, train):
        """train: csr_matrix that holds the training data"""
        
        # the original confidence vectors should include a + 1,
        # but this direct addition is not allowed when using sparse matrix,
        # thus we'll have to deal with this later in the computation
        Cui = self.alpha * train
        Ciu = Cui.T.tocsr()
        self.n_users, self.n_items = Cui.shape
        
        # initialize user latent vectors X and item latent vectors Y
        # randomly with a specified set seed
        rstate = np.random.RandomState(self.seed)
        self.X = rstate.normal( size = (self.n_users, self.n_factors) )
        self.Y = rstate.normal( size = (self.n_items, self.n_factors) )      
        for _ in tqdm(range(self.n_iters), desc = 'training progress'):
            self._als_step(Cui, self.X, self.Y)
            self._als_step(Ciu, self.Y, self.X)
            
            print(X.shape)
            print(Y.shape)
            print()
        
        return self
    
    def _als_step(self, Cui, X, Y):
        """
        when solving the user latent vectors,
        the item vectors will be fixed and vice versa
        """
        # the variable name follows the notation when holding
        # the item vector Y constant and solving for user vector X
        
        # YtY is a d * d matrix that is computed
        # independently of each user
        YtY = Y.T.dot(Y)
        data = Cui.data
        indptr, indices = Cui.indptr, Cui.indices

        # for every user build up A and b then solve for Ax = b,
        # this for loop is the bottleneck and can be easily parallized
        # as each users' computation is independent of one another
        for u in range(self.n_users):
            # initialize a new A and b for every user
            b = np.zeros(self.n_factors)
            A = YtY + self.reg * np.eye(self.n_factors)
            
            for index in range( indptr[u], indptr[u + 1] ):         
                # indices[index] stores non-zero positions for a given row
                # data[index] stores corresponding confidence,
                # we also add 1 to the confidence, since we did not 
                # do it in the beginning, when we were to give every 
                # user-item pair and minimal confidence
                i = indices[index]
                confidence = data[index] + 1

                # for b, Y^T C^u p_u
                # there should be a times 1 for the preference 
                # Pui = 1
                # b += confidence * Y[i] * Pui
                # but the times 1 can be dropped
                b += confidence * Y[i]
                
                # for A, Y^T (C^u - I) Y
                A += (confidence - 1) * np.outer(Y[i], Y[i])

            X[u] = np.linalg.solve(A, b)
        
        return self

    def predict(self):
        """predict ratings for every user and item"""
        pred = self.X.dot(self.Y.T)
        return pred

In [6]:
als = ImplicitMF(n_iters = 1, n_factors = 20, alpha = 15, reg = 0.01, seed = 1234)
als.fit(Rui_train)

training progress: 100%|██████████| 1/1 [00:01<00:00,  1.72s/it]

1





<__main__.ImplicitMF at 0x11c7b7e10>

In [9]:
def compute_mse(model, ratings):
    """ignore zero terms prior to comparing the mse"""
    mask = ratings.nonzero()
    y_true = ratings.data
    y_pred = model.predict()[mask]
    mse = mean_squared_error(y_true, y_pred)
    return mse

In [10]:
train_mse = compute_mse(als, Rui_train)
test_mse = compute_mse(als, Rui_test)
print( 'training mse {:.1f}, testing mse {:.1f}'.format(train_mse, test_mse) )

training mse 8.2, testing mse 9.2


## Reference

- [Blog: Don’t invert that matrix](https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/)
- [Blog: Implicit Feedback and Collaborative Filtering](http://datamusing.info/blog/2015/01/07/implicit-feedback-and-collaborative-filtering/)
- [Paper: Collaborative Filtering for Implicit Feedback Datasets](http://yifanhu.net/PUB/cf.pdf)
- [StackExchange: Analytic solution for matrix factorization using alternating least squares](http://math.stackexchange.com/questions/1072451/analytic-solution-for-matrix-factorization-using-alternating-least-squares)