# Matrix Factorization Techniques for Recommender Systems

## Representation

Assume a store has $N$ items and $M$ users. Map each user $u$ and each item $i$ to vectors $p_u$ and $q_i$, respectively, in a **joint latent factor space** of dimensionality $f$. Mathematically,

$$\begin{align}\text{item }q_i &\in \mathbb{R}^f \\
\text{user }p_u &\in \mathbb{R}^f\end{align}$$

Each element of $q_i$ measures the extent to which the item possesses the corresponding factor, and each element of $p_u$ measures how interested the user is in items that score high on that factor.

### Example:

With $f=4$, the factors could be represented as:

$$q_i = \begin{bmatrix} item\_is\_electronics\\item\_is\_home\_living\\item\_is\_fashion\\item\_is\_fmcg \end{bmatrix}$$

$$p_u = \begin{bmatrix} user\_likes\_electronics\\user\_likes\_home\_living\\user\_likes\_fashion\\user\_likes\_fmcg \end{bmatrix}$$

## Ratings

The resulting dot product $q_i^Tp_u$ captures the interaction between user $u$ and item $i$, that is, the user's overall interest in the item's characteristics. This approximates user $u$'s **observable** rating of item $i$, denoted by $r_{ui}$. This leads to the estimate

$$\hat{r}_{ui} = q_i^Tp_u \cdots(1)$$
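
As a minimal sketch of equation $(1)$, the snippet below computes the estimate for one user and one item. The function name `predict_rating` and the factor values are made up for illustration; in practice the factors are learned from data rather than written by hand.

```python
def predict_rating(q_i, p_u):
    """Estimate r_ui as the dot product q_i^T p_u from equation (1)."""
    if len(q_i) != len(p_u):
        raise ValueError("q_i and p_u must share the same dimensionality f")
    return sum(q * p for q, p in zip(q_i, p_u))


# Hypothetical factor values for f = 4, in the order
# (electronics, home living, fashion, FMCG).
q_item = [0.9, 0.1, 0.0, 0.0]  # the item is mostly electronics
p_user = [2.0, 0.5, 1.0, 0.0]  # the user cares most about electronics

estimate = predict_rating(q_item, p_user)
print(round(estimate, 2))
```

The estimate is large here because the user's strongest interest lines up with the item's strongest factor.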

The challenge is computing the **factor vectors** given the **rating matrix**.

Let $S$ be the set of $(u,i)$ pairs for which the rating $r_{ui}$ is known. To learn the factor vectors, minimize the regularized squared error over $S$, where $\lambda$ controls the strength of the regularization:

$$\begin{align}\min_{q*,p*} &\sum_{(u,i) \in S} (r_{ui} - q_i^Tp_u)^2 + \lambda (||q_i||^2 + ||p_u||^2)\\
=\min_{q*,p*} & \sum_{(u,i) \in S} r_{ui}^2 - 2r_{ui}q_i^Tp_u + (q_i^Tp_u)^2 + \lambda ||q_i||^2 + \lambda ||p_u||^2  \cdots (2)\end{align}$$
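
To make $(2)$ concrete, here is a small sketch that evaluates the objective for given factor vectors. It assumes the regularization term sits inside the sum, so it is charged once per known rating, which is the reading that matches the stochastic gradient descent updates later in these notes. The function name and the dictionary layout are illustrative choices, not part of the source.

```python
def regularized_squared_error(ratings, P, Q, lam):
    """Evaluate objective (2) over the known ratings.

    ratings: dict mapping (u, i) -> r_ui (the set S)
    P, Q:    dicts mapping user ids / item ids to factor lists
    lam:     regularization strength lambda
    """
    total = 0.0
    for (u, i), r_ui in ratings.items():
        p_u, q_i = P[u], Q[i]
        prediction = sum(q * p for q, p in zip(q_i, p_u))
        # Assumption: the penalty is added once per known rating.
        penalty = sum(q * q for q in q_i) + sum(p * p for p in p_u)
        total += (r_ui - prediction) ** 2 + lam * penalty
    return total


# One known rating, f = 2, hypothetical factor values.
loss = regularized_squared_error(
    {(0, 0): 3.0}, {0: [1.0, 1.0]}, {0: [1.0, 1.0]}, lam=0.5
)
```

Tracking this value across training iterations is a simple way to check that an optimizer is making progress.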

## Stochastic Gradient Descent

For each known rating in the training set, the algorithm predicts $r_{ui}$ and computes the associated prediction error:

$$e_{ui} = r_{ui} - q_i^Tp_u$$

The parameters are then modified by a magnitude proportional to the learning rate $\gamma$, in the opposite direction of the gradient:

$$\begin{align}
q_i \leftarrow q_i + \gamma \cdot (e_{ui} \cdot p_u - \lambda \cdot q_i)\\
p_u \leftarrow p_u + \gamma \cdot (e_{ui} \cdot q_i - \lambda \cdot p_u)
\end{align}$$
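
The loop below is a minimal pure-Python sketch of these updates, not a tuned implementation. The function name `sgd_factorize`, the toy ratings, the random initialization in $[0, 1)$, and the default values of $\gamma$, $\lambda$ and the epoch count are all assumptions chosen for illustration.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgd_factorize(ratings, n_users, n_items, f=2, gamma=0.02, lam=0.02,
                  epochs=500, seed=0):
    """Learn user factors P and item factors Q with the SGD updates above.

    ratings: list of (u, i, r_ui) triples with 0-based integer ids.
    Returns P, Q and the summed squared error observed in each epoch.
    """
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(f)] for _ in range(n_items)]
    samples = list(ratings)
    history = []
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the known ratings in random order
        epoch_error = 0.0
        for u, i, r_ui in samples:
            p_u, q_i = P[u], Q[i]
            e_ui = r_ui - dot(q_i, p_u)
            epoch_error += e_ui ** 2
            for k in range(f):
                q_k, p_k = q_i[k], p_u[k]  # keep old values for both updates
                q_i[k] = q_k + gamma * (e_ui * p_k - lam * q_k)
                p_u[k] = p_k + gamma * (e_ui * q_k - lam * p_k)
        history.append(epoch_error)
    return P, Q, history


# Toy data: 3 users, 3 items, 6 known ratings (hypothetical values).
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q, history = sgd_factorize(toy_ratings, n_users=3, n_items=3)
```

Comparing `history[0]` with `history[-1]` shows how much the squared error on the known ratings dropped during training.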

## Alternating Least Squares (ALS)

Because both $q_i$ and $p_u$ are unknown, $(2)$ is not convex. If we fix one of the unknowns, however, $(2)$ becomes a quadratic function of the other and can be solved optimally. ALS alternates between fixing the $q_i$s and fixing the $p_u$s: when all $p_u$s are fixed, it recomputes the $q_i$s by solving a least-squares problem, and vice versa. Repeat until $(2)$ converges.
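
As a sketch of what each least-squares step looks like, set the gradient of $(2)$ with respect to one vector to zero while the others stay fixed. This assumes the regularization term in $(2)$ is counted once per known rating. The symbols $\mathcal{I}_u$ (items rated by user $u$), $\mathcal{U}_i$ (users who rated item $i$), $n_u = |\mathcal{I}_u|$, $n_i = |\mathcal{U}_i|$ and the $f \times f$ identity matrix $\mathbf{I}$ are introduced here for illustration and do not appear in the source:

$$\begin{align}
p_u &\leftarrow \left(\sum_{i \in \mathcal{I}_u} q_i q_i^T + \lambda n_u \mathbf{I}\right)^{-1} \sum_{i \in \mathcal{I}_u} r_{ui} q_i\\
q_i &\leftarrow \left(\sum_{u \in \mathcal{U}_i} p_u p_u^T + \lambda n_i \mathbf{I}\right)^{-1} \sum_{u \in \mathcal{U}_i} r_{ui} p_u
\end{align}$$

Each update only involves an $f \times f$ linear system, and the updates for different users (or different items) are independent of one another, so they can be computed in parallel.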

## Adding Biases

Some users generally rate products better (or worse) than others, and some items are generally rated better (or worse) than others. Hence, we cannot explain the full rating by the interaction between $q_i$ and $p_u$ alone. A first-order approximation of the bias involved in rating $r_{ui}$ combines the overall average rating $\mu$ with the observed deviations of item $i$ and user $u$ from that average, $b_i$ and $b_u$:

$$b_{ui} = \mu + b_i + b_u \cdots (3)$$

### Example

Suppose the average rating over *all movies* is $3.7$ stars. Civil War is better than average, so it tends to be rated $0.4$ stars above the global average. Jeremy is a critical user who tends to rate $1$ star below the average. The bias estimate for Jeremy's rating of Civil War is therefore $3.7 + 0.4 - 1.0 = 3.1$ stars.

Biases extend $(1)$:

$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^Tp_u \cdots(4)$$
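
A short sketch of $(4)$, reusing the numbers from the example above. The function name `predict_with_biases` is illustrative, and the factor vectors are set to zero so that only the bias part contributes.

```python
def predict_with_biases(mu, b_i, b_u, q_i, p_u):
    """Estimate r_ui as mu + b_i + b_u + q_i^T p_u, following equation (4)."""
    interaction = sum(q * p for q, p in zip(q_i, p_u))
    return mu + b_i + b_u + interaction


# Global average 3.7, item bias +0.4, user bias -1.0.
# Zero factor vectors (an assumption) isolate the bias-only estimate.
baseline = predict_with_biases(3.7, 0.4, -1.0, [0.0, 0.0], [0.0, 0.0])
print(round(baseline, 1))
```

With non-zero factors, the interaction term shifts this baseline up or down depending on how well the user's interests match the item.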

and hence the error function is now:

$$\begin{align}\min_{q*,p*,b*} &\sum_{(u,i) \in S} (r_{ui} - \mu - b_u - b_i - q_i^Tp_u)^2 + \lambda (||q_i||^2 + ||p_u||^2 + b_u^2 + b_i^2) \cdots (5)\end{align}$$
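
The stochastic gradient descent updates carry over to $(5)$. The bias updates below are not stated in these notes; they are a sketch obtained by applying the same gradient step to $b_u$ and $b_i$, with the prediction error now measured against $(4)$ and $\gamma$ again denoting the learning rate:

$$\begin{align}
e_{ui} &= r_{ui} - \mu - b_u - b_i - q_i^Tp_u\\
b_u &\leftarrow b_u + \gamma \cdot (e_{ui} - \lambda \cdot b_u)\\
b_i &\leftarrow b_i + \gamma \cdot (e_{ui} - \lambda \cdot b_i)\\
q_i &\leftarrow q_i + \gamma \cdot (e_{ui} \cdot p_u - \lambda \cdot q_i)\\
p_u &\leftarrow p_u + \gamma \cdot (e_{ui} \cdot q_i - \lambda \cdot p_u)
\end{align}$$

Here $\mu$ is treated as a constant, computed once as the mean of the known ratings.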

# Additional Readings

### Additional Input Sources

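Extend $(4)$ with implicit feedback and user attributes. Let $N(u)$ be the set of items for which user $u$ expressed an implicit preference, with a factor vector $x_j \in \mathbb{R}^f$ for each item $j$, and let $A(u)$ be the set of attributes describing user $u$, with a factor vector $y_a \in \mathbb{R}^f$ for each attribute $a$:

$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T\left[p_u + |N(u)|^{-0.5} \sum_{j \in N(u)} x_j + \sum_{a\in A(u)} y_a\right] \cdots(6)$$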
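
The following sketch evaluates $(6)$ for a single user and item. It assumes $N(u)$ is given as a list of factor vectors (one per item with implicit feedback from the user) and $A(u)$ as a list of factor vectors (one per user attribute). The function and argument names are hypothetical.

```python
def predict_with_extra_inputs(mu, b_i, b_u, q_i, p_u,
                              implicit_factors, attribute_factors):
    """Evaluate equation (6).

    implicit_factors:  one factor vector per item in N(u)
    attribute_factors: one factor vector y_a per attribute in A(u)
    """
    f = len(q_i)
    user_vec = list(p_u)  # start from p_u, then add the extra signals
    if implicit_factors:
        scale = len(implicit_factors) ** -0.5  # |N(u)|^(-0.5)
        for x in implicit_factors:
            for k in range(f):
                user_vec[k] += scale * x[k]
    for y_a in attribute_factors:
        for k in range(f):
            user_vec[k] += y_a[k]
    return mu + b_i + b_u + sum(q * v for q, v in zip(q_i, user_vec))


# Hypothetical values: four implicit-feedback items, one attribute, f = 2.
rating = predict_with_extra_inputs(
    mu=3.0, b_i=0.0, b_u=0.0,
    q_i=[1.0, 0.0], p_u=[0.5, 0.0],
    implicit_factors=[[1.0, 0.0]] * 4,
    attribute_factors=[[0.25, 0.0]],
)
```

When both lists are empty, the function reduces to the biased estimate in $(4)$.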

### Temporal Dynamics

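Extend $(4)$ by letting the terms that drift over time become functions of time $t$: the item bias $b_i(t)$, the user bias $b_u(t)$, and the user preferences $p_u(t)$. The item factors $q_i$ stay static: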
$$\hat{r}_{ui}(t) = \mu + b_i(t) + b_u(t) + q_i^Tp_u(t) \cdots(7)$$

### Varying Confidence Levels

Extend $(5)$ by weighting each squared error with a confidence level $c_{ui}$:
$$\begin{align}\min_{q*,p*,b*} &\sum_{(u,i) \in S} c_{ui}(r_{ui} - \mu - b_u - b_i - q_i^Tp_u)^2 + \lambda (||q_i||^2 + ||p_u||^2 + b_u^2 + b_i^2) \cdots (8)\end{align}$$

**References**

- https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

**Additional Readings**
- http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/