<h1>CS4619: Artificial Intelligence II</h1>
<h1>Recommender Systems III</h1>
<h2>
    Derek Bridge<br />
    School of Computer Science and Information Technology<br />
    University College Cork
</h2>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
import os
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
  base_dir = "./drive/My Drive/Colab Notebooks/" # You may need to change this, depending on where your notebooks are on Google Drive
else:
  base_dir = "." 

<h1>Warning</h1>
<ul>
    <li>Same warning: The code in this lecture is for educational purposes only &mdash; written for clarity (I hope). There is no attempt to achieve efficiency or robustness.</li>
</ul>

In [4]:
movies = pd.read_csv("../datasets/ml_movies.txt", delimiter="|", encoding="ISO-8859-1",
            names = ["item_id", "title", "release date", "video release date",
                "IMDb URL", "unknown", "Action", "Adventure", "Animation",
                "Children\'s", "Comedy", "Crime", "Documentary", "Drama",
                "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western"]).drop([
                "release date", "video release date",
                "IMDb URL", "unknown", "Action", "Adventure", "Animation",
                "Children\'s", "Comedy", "Crime", "Documentary", "Drama",
                "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western"], axis=1)
movies["item_id"] -= 1

In [5]:
ratings = pd.read_csv("../datasets/ml_ratings.txt", delimiter="\t", encoding="ISO-8859-1",
                names=["user_id", "item_id", "rating", "timestamp"]).drop("timestamp", axis=1)
ratings["user_id"] -= 1
ratings["item_id"] -= 1

In [6]:
ratings_matrix = ratings.pivot(index="user_id", columns="item_id", values="rating").fillna(0)

In [7]:
user_ids = ratings["user_id"].unique()
item_ids = ratings["item_id"].unique()
num_users = len(user_ids)
num_items = len(item_ids)
num_ratings = len(ratings)
mean = ratings["rating"].mean()

<h1>Matrix Factorization</h1>
<ul>
    <li>We continue to look at collaborative filtering.</li>
    <li>In the previous lecture, we saw an instance-based approach to collaborative filtering:
        user-based nearest-neighbours.
    </li>
    <li>In this lecture, we will look at a model-based approach to collaborative filtering: matrix factorization.</li>
</ul>

<h2>Embeddings</h2>
<ul>
    <li>Consider a ratings matrix, $\v{R}$:
        <table style="border: 1px solid; border-collapse: collapse;">
            <tr>
                <th style="border: 1px solid black; text-align: left;"></th>
                <th style="border: 1px solid black; text-align: left;">$i_1$</th>
                <th style="border: 1px solid black; text-align: left;">$i_2$</th>
                <th style="border: 1px solid black; text-align: left;">$i_3$</th>
                <th style="border: 1px solid black; text-align: left;">$i_4$</th>
                <th style="border: 1px solid black; text-align: left;">$i_5$</th>
                <th style="border: 1px solid black; text-align: left;">$i_6$</th>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_1$</th>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;">2</td>
                <td style="border: 1px solid black; text-align: left;">5</td>
                <td style="border: 1px solid black; text-align: left;">3</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">2</td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_2$</th>
                <td style="border: 1px solid black; text-align: left;">5</td>
                <td style="border: 1px solid black; text-align: left;">5</td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;">3</td>
                <td style="border: 1px solid black; text-align: left;">4</td>
                <td style="border: 1px solid black; text-align: left;"></td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_3$</th>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;">3</td>
                <td style="border: 1px solid black; text-align: left;"></td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_4$</th>
                <td style="border: 1px solid black; text-align: left;">5</td>
                <td style="border: 1px solid black; text-align: left;">4</td>
                <td style="border: 1px solid black; text-align: left;">2</td>
                <td style="border: 1px solid black; text-align: left;">4</td>
                <td style="border: 1px solid black; text-align: left;">3</td>
                <td style="border: 1px solid black; text-align: left;">3</td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_5$</th>
                <td style="border: 1px solid black; text-align: left;">2</td>
                <td style="border: 1px solid black; text-align: left;">5</td>
                <td style="border: 1px solid black; text-align: left;">4</td>
                <td style="border: 1px solid black; text-align: left;">4</td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
            </tr>
        </table>
    </li>
    <li>In $\v{R}$,
        <ul>
            <li>each user is represented by a row vector of ratings with dimension $|I|$; and</li>
            <li>each item is represented by a (column) vector of ratings with dimension $|U|$.</li>
        </ul>
        These vectors have a high dimension and are sparse.
    </li>
    <li>So why not come up with <b>embeddings</b>:
        <ul>
            <li>These would map the high dimensional sparse vectors to low-dimensional dense vectors. 
                (This should sound familiar!)
            </li>
        </ul>
        But we will map users and items to the same space.
        <ul>
            <li>In other words, they will map to vectors that have the same dimension, call it $d$.</li>
            <li>Each element represents a feature.</li>
            <li>In the case of user embeddings, the values indicate how much the user likes that feature.</li>
            <li>In the case of item embeddings, the values indicate how much the item possesses that feature.</li>
        </ul>
        So, let the embedding for user $u$ be a row vector of dimension $d$ and refer to it as $\v{P}^{(u)}$.
        And let the embedding for item $i$ be a (column) vector of dimension $d$ and refer to it as $\v{Q}^{(i)}$.
    </li>
    <li>Predicting $\hat{r}_{ui}$.
        <ul>
            <li>Ley $\mu$ be the mean of all the ratings in $\v{R}$.</li>
            <li>We can predict $\hat{r}_{ui}$ by computing the product of the user embedding and item embedding and adding this product to the mean:
                $$\hat{r}_{ui} = \mu + \v{P}^{(u)}\v{Q}^{(i)}$$
            </li>
        </ul>
    </li>
    <li>Example for $d=3$ and $\mu=3.5$:
        <ul>
            <li>Consider $u_2$.
                <ul>
                    <li>Her ratings are $\rv{5,5,\bot,3,4,\bot}$.</li>
                    <li>Suppose the corresponding embedding is $\rv{0.8,0.4,-0.75}$ (never mind where this comes
                        from for the moment).
                    </li>
                </ul>
            </li>
            <li>And consider $i_3$.
                <ul>
                    <li>Its ratings are $\cv{5\\\bot\\\bot\\3\\4}$</li>
                    <li>Suppose the corresponding embedding is $\cv{0.0\\0.5\\0.3}$</li>
                </ul>
            </li>
            <li>Then the predicted rating of $u_2$ for $i_3$ is $\mu$ plus the product of the two embeddings, i.e.
                $$3.5 + 0.8\times0.0 + 0.4\times0.5 + -0.75\times0.3 = 3.475$$
            </li>
        </ul>
    </li>
</ul>    

<h2>Latent features</h2>
<ul>
    <li>The explanation should so far be very reminiscent of the simple content-based recommender from two lectures ago.</li>
    <li>Things that feel the same:
        <ul>
            <li>Users and items are represented by vectors of dimension $d$ in the same space.</li>
            <li>In the case of users, we can think of the values as being how much a user likes a feature.</li>
            <li>In the case of items, we can think of the values as being how much an item possesses that feature.</li>
            <li>Predictions involve computing the product of two vectors.</li>
        </ul>
    </li>
    <li>Things that feel different:
        <ul>
            <li>Previously, we <em>designed</em> the features, e.g. movie genres.
                But here, the features and the values are <em>learned</em> from the ratings data. We refer to these
                features as <b>latent features</b>, reflecting the idea that they are somehow hidden (latent) in
                the ratings data and that we are revealing them through a learning algorithm.
            </li>
            <li>Previously, the product was to be thought of as measuring the similarity of the user and item.
                Here, it is a predicted rating, and this is how we learn the features.
            </li>
        </ul>
    </li>
</ul>

<h2>Learning the embeddings</h2>
<ul>
    <li>We need to learn all the user embedings $\v{P}$ and all the item embeddings $\v{Q}$. How?</li>
    <li>We need a loss function. We can use, e.g., MSE.
        $$J(\v{P}, \v{Q}) = \frac{1}{|\Omega|}\sum_{\langle u,i,r\rangle \in \Omega} (\mu + \v{P}^{(u)}\v{Q}^{(i)} - r)^2$$
        Here $\v{P}^{(u)}$ is the user embedding for user $u$ (which is one of the rows in $\v{P}$) and $\v{Q}^{(i)}$ is the item embedding for item $i$ (which is one of the columns in $\v{Q}$). And recall that $\Omega$ is the set of ratings in $\v{R}$ that are not $\bot$.
    </li>
    <li>(By the way, the loss function in this case is not convex.)
    </li>
    <li>Then, we can use, for example, Gradient Descent.</li>
    <li>This, for example, is Stochastic Gradient Descent:
        <ul style="background: lightgrey;">
            <li>initialise $\v{P}$ and $\v{Q}$ randomly</li>
            <li>repeat until convergence
                <ul>
                    <li>repeat $|\Omega|$ times
                        <ul>
                            <li>select $\langle u,i,r\rangle$ from $\Omega$ at random</li>
                            <li>$\mathit{prediction} = \mu + \v{P}^{(u)}\v{Q}^{(i)}$</li>
                            <li>$\v{P}^{(u)} \gets \v{P}^{(u)} - \alpha \times (\mathit{prediction} - r) \times \v{Q}^{(i)}$</li>
                            <li>$\v{Q}^{(i)} \gets \v{Q}^{(i)} - \alpha \times (\mathit{prediction} - r) \times \v{P}^{(u)}$</li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

In [8]:
def matrix_factorization(d, alpha, num_epochs):
    
    P = np.random.rand(num_users, d)
    Q = np.random.rand(num_items, d)
    
    for epoch in range(num_epochs):
        for i in range(num_users):
            for j in range(num_items):
                r = ratings_matrix.loc[i][j]
                if r != 0.0:
                    eij = mean + np.dot(P[i,:],Q[j,:]) - r
                    P[i] = P[i] - alpha * eij * Q[j]
                    Q[j] = Q[j] - alpha * eij * P[i]
                    
    return P, Q

In [9]:
def predict_rating_by_mf(user_id, item_id):
    return mean + np.dot(P[user_id,:],Q[item_id,:])

In [10]:
P, Q = matrix_factorization(d=10, alpha=0.001, num_epochs=15)

In [11]:
predict_rating_by_mf(user_id=13, item_id=3)

3.716322535104865

<h2>Matrix completion</h2>
<ul>
    <li>In the presentation of this material above, we concentrated on individual predictions, $\mu + \v{P}^{(u)}\v{Q}^{(i)}$.</li>
    <li>But, if we multiply matrices $\v{P}$ and $\v{Q}$, $\v{P}\v{Q}$, and add $\mu$ element-wise, $\mu + \v{P}\v{Q}$, then we get all predictions at once!</li>
    <li>This is why some people refer to this as <b>matrix completion</b>: we're getting predictions for all the
        entries that are $\bot$ (and all those that are not $\bot$).
    </li>
    <li>It is also why this is <b>matrix factorization</b> (i.e. factorization is writing one thing as a product of other things).
        What we are doing is finding $\v{P}$ and $\v{Q}$, two lower-rank matrices, such that
        $$\mu + \v{P}\v{Q} \approx \v{R}$$
    </li>
</ul>

<h2>Discussion of matrix factorization</h2>
<ul>
    <li>Advantages of matrix factorization for collaborative filtering include:
        <ul>
            <li>It does not require any item or user descriptions, just user-item interactions (e.g. ratings) &mdash;
                and this is data we will collect during the normal operation of the system.
            </li>
            <li>It may recommend items that are pleasantly surprising (certainly more so than content-based
                approaches), since it recommends using <em>other peoples'</em> tastes.
            </li>
            <li>It is fast at prediction time.</li>
        </ul>
    </li>
    <li>Its disadvantages include:
        <ul>
            <li>Learning the embeddings (the SGD above) takes time. So new ratings generally cannot take 
                immediate effect. They will have to be buffered until the next time the model gets updated
                (e.g. every night). (There are, however, some incremental versions of matrix factorization,
                which do allow new ratings to take immediate effect.)
            </li>
            <li>It has problems recommending to cold-start users and recommending cold-start items.</li>
            <li>It can exhibit popularity bias: over-recommending popular items (although this may depend to
                some extent on details of the implementation).
            </li>
        </ul>
    </li>
    <li>Can it explain its recommendations?
        <ul>
            <li>The basic answer is, No. The latent features do not mean anything to human users. (Of course, there
                is research that tries to constrain the latent features in various ways to try to make them more
                human-interpretable.)
            </li>
        </ul>
    </li>
    <li>In concluding, let's mention some variants.
        <ul>
            <li>It is normal to use slightly more complicated formulae to learn values referred to as
                user biases and item biases&mdash; these help deal with the problems with ratings scales that we
                mentioned in the previous lecture.
            </li>
            <li>It is normal to use regularization.</li>
            <li>There are variants that combine with nearest-neighbours in various ways.
                <!-- E.g. learn embeddings then run user-based knn on these instead of the ratings vectors.
                     E.g. learn the similarities of a usr-based knn model in a way similar to MF.
                     E.g. combine the previous one with MF.
                  -->
            </li>
            <li>There are variants that allow item descriptions, user descriptions and contextual information
                to be added to $\v{R}$ in various ways, thus giving a system that handles all the kinds of
                data that we may have.
                <!-- E.g. block 00 is R, block 01 adds user descriptions as extra columns, block 10 adds item
                     descriptions as extra rows, block 11 is not used, then factorize.
                     E.g. factorization machines.
                     And so on.
                 -->
            </li>
            <li>Standard matrix factorization is actually a linear model! We can use the idea of user and item embeddings in, for example, neural networks, to give nonlinear models.</li>
        </ul>
    </li>
</ul>

<h1>Top-N Recommendation</h1>
<ul>
    <li>Recall that recommender systems typically proceed through (at least) three steps:
        <figure>
            <img src="images/rs_arch.png" />
        </figure>
    </li>
    <li>Let's select the top-$N$ candidates, this time using matrix factorization.</li>
</ul>

In [12]:
user_id = 11
N = 5

# Get the item_ids sorted by predicted rating
sorted_item_ids = np.argsort(
        [predict_rating_by_mf(user_id=user_id, item_id=i_id) 
         for i_id in item_ids]).tolist()
# To be a recommendation, the user must not have watched this movie 
taboo_item_ids = ratings_matrix.loc[user_id][ratings_matrix.loc[user_id] == 0.0].tolist()
# Get the item_ids for the top-N recommendations
recommended_item_ids = [i_id for i_id in sorted_item_ids if i_id not in taboo_item_ids][:N]
# Get the titles of these movies
movies.loc[recommended_item_ids]

Unnamed: 0,item_id,title
331,331,Kiss the Girls (1997)
404,404,Mission: Impossible (1996)
668,668,Body Parts (1991)
98,98,Snow White and the Seven Dwarfs (1937)
360,360,Incognito (1997)


<ul>
    <li>Selecting the $N$ candidates whose scores are highest and recommending these to the user is just the obvious thing.</li>
    <li>However, there may be some additional criteria to take into account at this stage. Some examples include:
        <ul>
            <li>There may be some business rules to take into account. For example, there may be some items the business
                is trying to push. So there may be a rule that requires that one or more slots in the top-$N$ are occupied by these items, displacing the 'organic' recommendations. (Think about sponsored content, for example.)
            </li>
            <li>We might have more than one recommender model whose scores we want to combine.</li>
            <li>We often carry out some re-ranking at this stage in order to ensure that the top-$N$ has a 
                degree of diversity or some notion of fairness.
            </li>
        </ul>
    </li>
    <li>Let's look at just one example: diversity.</li>
</ul>

<h2>Top-$N$ Diversity</h2>
<ul>
    <li>Suppose the user likes fantasy and thrillers and comedies. And suppose we recommend the $N$ candidate 
        items that obtained the highest scores (highest predicted rating
        in our case).
        <ul>
            <li>Maybe this is what we end up recommending:
                <figure>
                    <img src="images/top-N.png" />
                </figure>
            </li>
        </ul>
    </li>
    <li>Each recommendation is <em>relevant</em> to the user. She likes fantasy! But this top-$N$ lacks
        <b>diversity</b>.
        <ul>
            <li>A more diverse top-$N$ (e.g. containing at least one thriller, at least one comedy)
                would be more likely to include at least one recommendation that would
                satisfy the user.
            </li>
            <li>It would give her a meaningful choice.</li>
            <li>It would be one way of handling the recommender's uncertainty about the user's preferences.</li>
        </ul>
    </li>
</ul>

<h2>Defining diversity</h2>
<ul>
    <li>Diversity is a property of a set of items, not a property of an individual item.</li>
    <li>Suppose we have a set of recommendations $S$. We can measure the <em>marginal</em>
        increase in diversity obtained by adding item $i$ into set $S$ as the maximum
        distance (smallest similarity) between $i$ and the members of $S$:
        $$div(i, S) = max_{j \in S}(1 - sim(i, j))$$
    </li>
    <li>E.g. if we add another Star Wars movie to the top-$N$ shown previously, it is very similar
        to the ones we have already, so its marginal diversity is very low.
    </li>
    <li>But if we add a comedy to the top-$N$, then it is not very similar to the movies we have
        already, so its marginal diversity is very high.
    </li>
    <li>Similarity can be measured in any of the ways we have discussed already, e.g. cosine or
        Pearson of vectors whose features are genres or ratings or latent features.
    </li>
</ul>

<h2>Greedy re-ranking</h2>
<ul>
    <li>So, instead of recommending the $N$ with the highest predicted ratings, we select the
        $N$ that achieve the best balance between relevance and diversity, controlled by a hyperparameter
        $\lambda \in [0, 1]$:
        <ul style="background: lightgrey;">
            <li>$S \gets [\,\,]$</li>
            <li>while $|S| < N$
                <ul>
                    <li>$i^* \gets \arg\max_{i \in candidates} \hat{r}_{ui} + \lambda div(i,S)$</li>
                    <li>delete $i^*$ from candidates</li>
                    <li>append $i^*$ to the end of $S$</li>
                </ul>
            </li>
            <li>return $S$</li>
        </ul>
    </li>
    <li>As usual, there are lots of variations on this, especially lots of different ways of defining
        diversity.
    </li>
</ul>