# Recommender systems

<hr>

**Goal: Movie recommendation as an example**<br>

Given an $n \times m$ matrix $Y$ of scores from $n$ users for $m$ movies, recommend to each user movies that they have not yet seen.

<hr>

**K-Nearest Neighbours (KNN)**

*How many similar (nearest) neighbours do we want to use as information to recommend a movie to a given user?*

$\hat Y_{ij} = \frac{\sum_{b \in KNN(i)} Y_{bj}}{K}$

where

- $b$ represents each neighbour/user in $KNN(i)$, the $K$ nearest neighbours of user $i$
- $K$ represents the number of nearest neighbours selected
- $Y_{bj}$ represents the score given to movie $j$ by each of the nearest neighbours
- $\hat Y_{ij}$ represents the estimated score of user $i$ for movie $j$, averaged over the $K$ nearest neighbours

*How do we measure the similarity between two users (vectors)?* i.e. $sim(a,b)$ as a similarity measure between users $a$ and $b \in KNN(a)$

- Cosine similarity $\cos \theta = \frac{x_a \cdot x_b}{\Vert x_a \Vert \Vert x_b \Vert}$ (larger means more similar)
- Euclidean distance $\Vert x_a - x_b \Vert$ (smaller means more similar), etc.
- Using algorithms, like *collaborative filtering*, which free us from the need to define a good similarity measure
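
As a rough illustration of the KNN estimate with cosine similarity, here is a minimal sketch. The function names `cosine_similarity` and `knn_predict`, the toy matrix `Y`, and the convention that `0` marks an unrated movie are all assumptions made for this example. Note that unrated entries are treated as zeros when computing similarity, which is a simplification.

```python
import numpy as np

def cosine_similarity(x_a, x_b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(x_a) * np.linalg.norm(x_b)
    return float(x_a @ x_b / denom) if denom > 0 else 0.0

def knn_predict(Y, i, j, K):
    """Estimate user i's score for movie j as the mean score given to
    movie j by the K users most similar to user i."""
    # Candidates: every other user who has rated movie j (assumption: 0 = unrated).
    candidates = [b for b in range(Y.shape[0]) if b != i and Y[b, j] != 0]
    # Rank candidates by cosine similarity to user i, most similar first.
    candidates.sort(key=lambda b: cosine_similarity(Y[i], Y[b]), reverse=True)
    neighbours = candidates[:K]
    if not neighbours:
        return float("nan")  # nobody to borrow a score from
    return float(np.mean([Y[b, j] for b in neighbours]))

# Hypothetical toy data: 4 users x 3 movies, 0 = not yet rated.
Y = np.array([
    [5, 4, 0],
    [5, 4, 2],
    [1, 1, 5],
    [4, 5, 1],
], dtype=float)

# Users 1 and 3 are the most similar to user 0; their scores for movie 2 are 2 and 1.
print(knn_predict(Y, i=0, j=2, K=2))  # → 1.5
```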

<hr>

**Collaborative Filtering**

Given an $n \times m$ matrix $Y$, which is generally sparse, output a matrix $X$ of the same dimensions in which every entry, including those missing from $Y$, holds an estimated score.

An empirical risk function can be defined as follows:

$J(X) = \sum_{(i, j) \in D} \frac{(Y_{ij} - X_{ij})^2}{2} + \frac{\lambda}{2} \cdot \sum_{(i, j)} X_{ij}^2$

where

- $D$ is the set of all index pairs $(i, j)$ whose entries are observed (non-zero) in the original matrix, $Y$
- The regularization term sums over all pairs $(i, j)$ in the matrix, not only those in $D$


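Setting the derivative of this risk function to zero gives an unsatisfactory, trivial solution: each entry can be minimized independently, so $X_{ij} = \frac{Y_{ij}}{1 + \lambda}$ for $(i, j) \in D$ and $X_{ij} = 0$ everywhere else, which predicts nothing for the missing entries.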

Collaborative filtering instead makes the strong assumption that $X$ has low rank, i.e. its rows (and columns) are linear combinations of a small number of basis vectors. $X$ can then be factorized as $X = U \cdot V^T$, where $U$ is an $n \times d$ matrix and $V^T$ is a $d \times m$ matrix, so that the rank of $X$ is at most $d$.

Each row of $U$ represents a user's rating tendency and each column of $V^T$ represents the information on a movie.

$\therefore X_{ij} = U_i \cdot V_j$, where for a rank 1 matrix $X$ ($d = 1$) both $U_i$ and $V_j$ are scalars

$J(X) = J(U, V) = \sum_{(i, j) \in D} \frac{(Y_{ij} - U_i \cdot V_j)^2}{2} + \frac{\lambda}{2} \sum_{i=1}^{n} U_i^2 + \frac{\lambda}{2} \sum_{j=1}^{m} V_j^2$
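
The rank 1 objective can be evaluated directly, which is handy for checking that an update actually lowers it. The sketch below is one possible way to write it; the function name `objective`, the use of `np.nan` to mark missing entries, and the particular values of `U`, `V` and `lam` are assumptions chosen for illustration.

```python
import numpy as np

def objective(Y, mask, U, V, lam):
    """J(U, V) for rank 1 factors: half the squared error over the observed
    entries plus the L2 penalties on U and V."""
    X = np.outer(U, V)            # X_ij = U_i * V_j
    residual = (Y - X)[mask]      # keep only the observed pairs (i, j) in D
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

# A small 2 x 3 matrix with two missing scores (assumption: np.nan = missing).
Y = np.array([[5.0, np.nan, 7.0],
              [1.0, 2.0, np.nan]])
mask = ~np.isnan(Y)               # True where a score is observed

U = np.array([1.0, 0.0])          # hypothetical user factors
V = np.array([2.0, 7.0, 8.0])     # hypothetical movie factors

# Squared-error part: (9 + 1 + 1 + 4) / 2 = 7.5; penalty part: (1 + 117) / 2 = 59.
print(objective(Y, mask, U, V, lam=1.0))  # → 66.5
```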


*Example*, given $Y = \begin{bmatrix} 5 & ? & 7 \\ 1 & 2 & ? \end{bmatrix}$, where $?$ marks a missing score, estimate $U = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}$ and $V = \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix}$, assuming that the estimate $X = U \cdot V^T$ is a rank 1 matrix.

1. Initialize $V = \begin{bmatrix} 2 \\ 7 \\ 8 \end{bmatrix}$


2. Compute $U \cdot V^T$

    $U \cdot V^T = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \begin{bmatrix} 2 & 7 & 8 \end{bmatrix} = \begin{bmatrix} 2U_1 & 7U_1 & 8U_1 \\ 2U_2 & 7U_2 & 8U_2 \end{bmatrix}$
    

3. Solve for $U_1, U_2$ using known entries in $Y$, by rewriting the loss function for each user, $i$

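    For user 1, the known entries are $Y_{11} = 5$ and $Y_{13} = 7$, so set the derivative with respect to $U_1$ to zero:

    $\frac{\partial}{\partial U_1} \left[ \frac{(5-2U_1)^2}{2} + \frac{(7-8U_1)^2}{2} + \frac{\lambda}{2} U_1^2 \right] = -2(5 - 2U_1) - 8(7 - 8U_1) + \lambda U_1 = 0$
    
    $U_1 = \frac{66}{\lambda + 68}$
    
    Likewise for user 2, whose known entries are $Y_{21} = 1$ and $Y_{22} = 2$:

    $U_2 = \frac{16}{\lambda + 53}$
    
    Evaluate both for a fixed $\lambda$, say $\lambda = 1$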
    

4. With $U_1, U_2$ fixed at these values (here for $\lambda = 1$), solve for $V$ in the same way, one movie $j$ at a time

    $U \cdot V^T = \begin{bmatrix} \frac{66}{69} \\ \frac{16}{54} \end{bmatrix} \begin{bmatrix} V_1 & V_2 & V_3 \end{bmatrix}$

    
5. Repeat the process until the values converge. This only guarantees convergence to a local minimum, which depends on the initialization values, so repeat the process with different initializations and check the results against validation sets
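
The steps above can be sketched in a few lines of NumPy. This is a minimal sketch for the rank 1 case only, not a definitive implementation: the names `update_factor` and `alternating_minimisation`, the use of `np.nan` for missing scores, and the fixed iteration count in place of a convergence check are all assumptions made for this example.

```python
import numpy as np

def update_factor(Y, mask, other, lam):
    """Closed-form minimiser of J over one factor while the other is held fixed.

    For row i: factor_i = sum_j Y_ij * other_j / (lam + sum_j other_j^2),
    where both sums run over the observed entries of row i only.
    """
    Y_filled = np.where(mask, Y, 0.0)                 # missing entries contribute nothing
    numer = Y_filled @ other
    denom = lam + mask.astype(float) @ (other ** 2)
    return numer / denom

def alternating_minimisation(Y, mask, V0, lam=1.0, n_iters=50):
    """Alternate between solving for U (V fixed) and for V (U fixed)."""
    V = np.asarray(V0, dtype=float)
    for _ in range(n_iters):                          # assumption: fixed number of sweeps
        U = update_factor(Y, mask, V, lam)            # step 3: fix V, solve for U
        V = update_factor(Y.T, mask.T, U, lam)        # step 4: fix U, solve for V
    return U, V

# The matrix from the example, with np.nan for the unknown scores.
Y = np.array([[5.0, np.nan, 7.0],
              [1.0, 2.0, np.nan]])
mask = ~np.isnan(Y)
V_init = np.array([2.0, 7.0, 8.0])

# The first U update reproduces the hand-derived values 66/69 and 16/54 for lambda = 1.
U_first = update_factor(Y, mask, V_init, lam=1.0)
print(U_first, np.array([66 / 69, 16 / 54]))

U, V = alternating_minimisation(Y, mask, V_init, lam=1.0)
print(np.outer(U, V))  # estimated matrix X, including the two missing entries
```

Rerunning `alternating_minimisation` with different values of `V_init` is one simple way to probe the dependence on initialization mentioned in step 5.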

<hr>

# Basic code
A `minimal, reproducible example`