# 1) Introduction

![](https://katbailey.github.io/images/matrix_factorization.png)

- Rating matrix (N x M)
- N = num users
- M = num items

## Matrix Factorization 
- In supervised ML, we want accuracy (predictions close to the target)
- In recommenders, what we want is a score to sort recommendations by


## Section Outline
- Basic form of Matrix Factorization model
- Define a loss, minimize it
- 2 implementations
    - NumPy - direct from theory
    - Keras
- Extend the Keras model


# 2) Matrix Factorization 

## Factors
- 10 = 5 x 2
- 15 = 3 x 5
- 30 = 3 x 10 = 15 x 2

## Matrix Factorization

- Split the matrix into the product of 2 other matrices
- $\hat{R}$ approximates R - it is our model of R
![](https://www.kukuxiaai.com/images/blog/recommended_system/udemy/mf_1.png)

- W (N x K) - user matrix, U (M x K) - movie matrix
- K is somewhere from 10 to 50

## Think about R
- W and U should be much smaller than R
- R is N x M
- represent it using a special data structure
    - Dict{(u,m) -> r}
- If N = 130k, M = 26k
    - N x M = 3.38 billion
    - ratings = 20 million
    - fraction of entries filled: 20 million / 3.38 billion ≈ 0.006
- This is called a sparse representation
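A minimal sketch of the `Dict{(u,m) -> r}` idea, assuming a tiny made-up set of ratings (the values, `N`, `M`, and the helper `get_rating` are hypothetical). It also repeats the density arithmetic from the list above.

```python
# Toy ratings stored sparsely: only observed (user, movie) pairs use memory.
ratings = {
    (0, 0): 5.0,
    (0, 2): 3.0,
    (1, 1): 4.0,
    (2, 0): 1.0,
}
N, M = 3, 4  # hypothetical numbers of users and movies

def get_rating(u, m):
    """Return the rating, or None if user u never rated movie m."""
    return ratings.get((u, m))

# Fraction of the N x M grid that is actually filled in.
density = len(ratings) / (N * M)

# The same ratio with the sizes quoted above (20 million ratings, 130k x 26k).
big_density = 20_000_000 / (130_000 * 26_000)
```

Missing pairs cost nothing here, which is the whole point: memory grows with the number of ratings, not with N x M.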

![](https://4.bp.blogspot.com/-95QD5t9Lha4/Wd7uWnBZBeI/AAAAAAAADg4/xB4VnnxM0UgUp15lNmB3aHCXYGejpm4OACLcBGAs/s1600/matrix_factorization.png)

## Some calculations
- If K = 10, N = 130k, M = 26k, what is the total size of W and U?
- NK + MK = 1.56 million
- how much savings?
- 1.56 million / 3.38 billion ≈ 0.0005
- this is good: we like # parameters < # data points (1.56 million parameters vs. 20 million ratings)


- What happens if you try to calculate $WU^T$ in code?
- Don't do it: the result is N x M, which is exactly the size we are trying to avoid
    - Unless you've selected a small subset of your data

## One rating
- this is easy, just a dot product between 2 vectors of size K

\begin{equation*}
\hat{r}_{ij} = w_{i}^{T}u_{j}, \hat{r}_{ij} = \hat{R}[i,j] , w_{i} = W[i], u_{j} = U[j]
\end{equation*}
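A small sketch of predicting one rating without forming the full product, assuming random stand-in matrices (the sizes and the helper name `predict` are made up for illustration). The full $WU^T$ is computed only because the toy sizes are tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 6, 4, 3                    # tiny hypothetical sizes
W = rng.standard_normal((N, K))      # one row per user
U = rng.standard_normal((M, K))      # one row per movie

def predict(i, j):
    """Predicted rating of user i for movie j: a dot product of two K-vectors."""
    return W[i].dot(U[j])

r_hat = predict(2, 1)

# Only feasible here because N and M are small: the full N x M matrix.
R_hat_full = W @ U.T
```

`predict(i, j)` costs O(K) per rating, while `W @ U.T` needs O(NM) memory.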

## Why does it make sense?
- From a mathematical standpoint, we know from SVD (singular value decomposition) that a matrix X can be decomposed into 3 separate matrices multiplied together

- X (N x M), U (N x K), S (K x K), V (M x K); here U is the SVD factor, not the movie matrix from earlier
- If I multiply U by S, I just get another N x K matrix
    - Then X is a product of 2 matrices, just like matrix factorization
    - or equivalently, I could combine S with $V^{T}$
- R(rating matrix) is sparse
    - If U,S and V can properly approximate a full X matrix, then surely it can approximate a mostly empty R matrix 
    
\begin{equation*}
X = USV^{T}
\end{equation*}
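The SVD argument can be checked numerically. This is only a sketch on a random dense matrix (the names `U_svd`, `A`, `B` are illustrative, not part of the model above): folding S into the left factor leaves a product of two matrices, and truncating to K singular values gives a rank-K approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

# Thin SVD: U_svd is 5 x 4, s holds the singular values, Vt is 4 x 4.
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)

# Fold S into the left factor, leaving a product of two matrices.
A = U_svd * s          # same as U_svd @ np.diag(s)
B = Vt.T
X_rebuilt = A @ B.T

# Keep only the top K singular values for a rank-K approximation.
K = 2
X_rank_k = (U_svd[:, :K] * s[:K]) @ Vt[:K, :]
```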

![](http://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/Image/229f77d2cb173c1cef4d6cfbab2e905e/svd_matrices.jpg)

## Interpretation
- Each of the K elements in $w_{i}$ and $u_{j}$ is a feature
- Let's suppose K=5, and they are
    - Action/adventure
    - Comedy
    - Romance
    - Horror
    - Animation
- $w_{i}(1)$ is how much user i likes action
- $w_{i}(2)$ is how much user i likes comedy 
- $u_{j}(1)$ is how much movie j contains action
- $u_{j}(2)$ is how much movie j contains comedy



- What happens when we dot $w_{i}^{T}u_{j}$ ?
- How well do user i's preferences correlate with movie j's attributes?

\begin{equation*}
w_{i}^{T}u_{j} = \|w_{i}\| \, \|u_{j}\| \cos\theta \propto \mathrm{sim}(i,j)
\end{equation*}

## Example
- Action/adventure
- Comedy
- Romance
- Horror
- Animation

- $w_{i}$ = (1, 0.8, -1, 0.1, 1)
- $u_{j}$ = (1,1.5,-1.3,0, 1.2)
- result = 1 * 1 + 0.8 * 1.5 + (-1) * (-1.3) + 0.1 * 0 + 1 * 1.2 = 4.7 (too high)
- Why?
    - +ve x +ve -> +ve
    - -ve x -ve -> +ve


- $w_{i}$ = (1, 0.8, -1, 0.1, 1)
- $u_{j}$ = (-1,-1,1,0, -1)
- result = 1 * (-1) + 0.8 * (-1) + (-1) * 1 + 0.1 * 0 + 1 * (-1) = -3.8 (too low)
- Why?
    - +ve x -ve -> -ve
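The two worked examples can be reproduced in a few lines. The `cosine` helper is an addition for illustration, tying the scores back to the $\cos\theta$ formula.

```python
import numpy as np

w_i = np.array([1, 0.8, -1, 0.1, 1])            # user preferences
u_match = np.array([1, 1.5, -1.3, 0, 1.2])      # movie aligned with the user
u_mismatch = np.array([-1, -1, 1, 0, -1])       # movie opposed to the user

score_match = w_i @ u_match
score_mismatch = w_i @ u_mismatch

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Matching signs push the score up and opposing signs pull it down, so the sign of the cosine agrees with the sign of the score.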



## Features

- You can't choose feature 1 to be action, feature 2 to be comedy
- Each feature is latent, and K is the latent dimensionality
- Hidden causes
- Why does user i like Power Rangers?
    - hidden cause is that user i likes action, and Power Rangers has action
- We don't know the meaning of any feature without inspecting it
- Ex. check top 10 movies that have the largest value for feature 1 
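One way to inspect a latent feature, sketched with a random stand-in for a trained movie matrix (`U`, `titles`, and the sizes are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 50, 5
U = rng.standard_normal((M, K))               # stand-in for a trained movie matrix
titles = [f"movie_{j}" for j in range(M)]     # hypothetical titles

feature = 0                                    # "feature 1" in 1-based counting
top10 = np.argsort(-U[:, feature])[:10]        # indices of the 10 largest values
top_titles = [titles[j] for j in top10]
```

With a real trained model, reading `top_titles` is how you would guess what the feature means, if it means anything nameable at all.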


## Supervised machine learning

- recall our previous discussion: we could predict how much a user likes an item by extracting features from both and feeding them into a model like a random forest or neural network
- the difference is that Matrix Factorization extracts the features automatically using only ratings


## Dimensionality Reduction

![](https://www.kukuxiaai.com/images/blog/recommended_system/udemy/mf_2.png)

# 3) Training

- How can we ensure our approximation is good?

$R \approx \hat{R} = WU^{T}$

## Squared error loss

\begin{equation*}
J = \sum_{i,j\in\Omega}^{}(r_{ij}-\hat{r}_{ij})^2 = \sum_{i,j\in\Omega}^{}(r_{ij}-w_{i}^{T}u_{j})^2
\end{equation*}

$\Omega$ = set of pairs(i,j) where user i rated movie j
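A sketch of the loss, assuming the ratings live in a dict keyed by (i, j) as in the sparse representation earlier. The numbers are made up so that the result is easy to verify by hand.

```python
import numpy as np

def squared_error_loss(ratings, W, U):
    """Sum of squared errors over the observed (i, j) pairs only."""
    return sum((r - W[i].dot(U[j])) ** 2 for (i, j), r in ratings.items())

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 4.0}
W = np.array([[1.0, 2.0], [0.5, 1.0]])   # user vectors
U = np.array([[1.0, 2.0], [2.0, 0.0]])   # movie vectors

# Predictions are 5, 2 and 1, so the errors are 0, 1 and 3.
J = squared_error_loss(ratings, W, U)
```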

## Minimize the loss
- How? Find the gradient, set it to 0, solve for the parameters

## Solving for W

- Be careful about which sets are being summed over
- For J, we want to sum over all ratings
- For a particular user vector $w_{i}$, we only care about movies that user rated; call this set $\Psi_{i}$
- Try to isolate $w_{i}$
- It's stuck inside a dot product

\begin{equation*}
\frac{\partial J}{\partial w_{i}} = 2 \sum_{j\in \Psi_{i}}^{}(r_{ij} - w_{i}^{T}u_{j})(-u_{j}) = 0  \\
\sum_{j\in\Psi_{i}}^{}(w_{i}^{T}u_{j})u_{j} = \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j} \\
\sum_{j\in\Psi_{i}}^{}(u_{j}^{T}w_{i})u_{j} = \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j}
\end{equation*}

- scalar x vector = vector x scalar

\begin{equation*}
\sum_{j\in\Psi_{i}}^{}u_{j}(u_{j}^{T}w_{i}) = \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j}
\end{equation*}

- drop the brackets

\begin{equation*}
\sum_{j\in\Psi_{i}}^{}u_{j}u_{j}^{T}w_{i} = \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j}
\end{equation*}

\begin{equation*}
(\sum_{j\in\Psi_{i}}^{}u_{j}u_{j}^{T})w_{i} = \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j}
\end{equation*}

- Now it's just Ax = b, which we know how to solve
- x = np.linalg.solve(A,b)

\begin{equation*}
w_{i} = (\sum_{j\in\Psi_{i}}^{}u_{j}u_{j}^{T})^{-1} \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j}
\end{equation*}
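A sketch of the update for a single user, assuming noise-free ratings generated from a known vector `true_w` (a made-up setup so the answer can be checked). Note that A is only invertible if the user has rated at least K movies with linearly independent vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
U = rng.standard_normal((6, K))          # movie vectors (held fixed)
true_w = np.array([1.0, -2.0, 0.5])      # hypothetical "true" user vector

# Psi_i: movies rated by user i, with noise-free ratings.
psi_i = [0, 2, 3, 5]
r_i = {j: true_w.dot(U[j]) for j in psi_i}

A = sum(np.outer(U[j], U[j]) for j in psi_i)   # K x K matrix
b = sum(r_i[j] * U[j] for j in psi_i)          # length-K vector
w_i = np.linalg.solve(A, b)                    # solves A w = b without inverting A
```

`np.linalg.solve` is preferred over computing the inverse explicitly: it is faster and numerically more stable.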


## Solving for U

- J is symmetric in W and U, so the steps are the same, with $\Omega_{j}$ = set of users who rated movie j
\begin{equation*}
\frac{\partial J}{\partial u_{j}} = 2 \sum_{i\in \Omega_{j}}^{}(r_{ij} - w_{i}^{T}u_{j})(-w_{i}) = 0  \\
\sum_{i\in\Omega_{j}}^{}(w_{i}^{T}u_{j})w_{i} = \sum_{i\in\Omega_{j}}^{}r_{ij}w_{i}\\
\sum_{i\in\Omega_{j}}^{}w_{i}w_{i}^{T}u_{j} = \sum_{i\in\Omega_{j}}^{}r_{ij}w_{i}\\
(\sum_{i\in\Omega_{j}}^{}w_{i}w_{i}^{T})u_{j} = \sum_{i\in\Omega_{j}}^{}r_{ij}w_{i}\\
u_{j} = (\sum_{i\in\Omega_{j}}^{}w_{i}w_{i}^{T})^{-1} \sum_{i\in\Omega_{j}}^{}r_{ij}w_{i}
\end{equation*}


## 2-way dependency
- solution for W depends on U
- solution for U depends on W

## Training Algorithm

- W = randn(N, K), U = randn(M, K)
- for t in range(T): update every $w_{i}$, then every $u_{j}$

\begin{equation*}
w_{i} = (\sum_{j\in\Psi_{i}}^{}u_{j}u_{j}^{T})^{-1} \sum_{j\in\Psi_{i}}^{}r_{ij}u_{j} \\
u_{j} = (\sum_{i\in\Omega_{j}}^{}w_{i}w_{i}^{T})^{-1} \sum_{i\in\Omega_{j}}^{}r_{ij}w_{i}
\end{equation*}
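The loop above can be sketched end to end as follows. This is a minimal version under stated assumptions: the toy data is a made-up rank-2 matrix with a few entries hidden, every user and movie has at least K ratings (otherwise the unregularized system is singular), and the names `als`, `Psi`, `Omega` are illustrative.

```python
import numpy as np

def loss(ratings, W, U):
    """Squared error over observed ratings only."""
    return sum((r - W[i] @ U[j]) ** 2 for (i, j), r in ratings.items())

def als(ratings, N, M, K, T, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, K))
    U = rng.standard_normal((M, K))

    # Psi[i]: movies rated by user i; Omega[j]: users who rated movie j.
    Psi = {i: [] for i in range(N)}
    Omega = {j: [] for j in range(M)}
    for (i, j) in ratings:
        Psi[i].append(j)
        Omega[j].append(i)

    history = [loss(ratings, W, U)]
    for t in range(T):
        for i in range(N):
            A = sum(np.outer(U[j], U[j]) for j in Psi[i])
            b = sum(ratings[i, j] * U[j] for j in Psi[i])
            W[i] = np.linalg.solve(A, b)
        for j in range(M):
            A = sum(np.outer(W[i], W[i]) for i in Omega[j])
            b = sum(ratings[i, j] * W[i] for i in Omega[j])
            U[j] = np.linalg.solve(A, b)   # uses the freshly updated W
        history.append(loss(ratings, W, U))
    return W, U, history

# Toy data: a rank-2 rating matrix with some entries hidden.
rng = np.random.default_rng(1)
N, M, K = 6, 5, 2
R = rng.standard_normal((N, K)) @ rng.standard_normal((M, K)).T
ratings = {(i, j): R[i, j] for i in range(N) for j in range(M) if (i + j) % 4 != 0}

W, U, history = als(ratings, N, M, K, T=10)
```

Each update solves its subproblem exactly, so the training loss should not increase from one sweep to the next.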


## FAQ
- Does it matter which order you update in? No, it doesn't matter
- Should you use the old values of W when updating U?
    - It tends to go faster if you use the new values
    - computationally, if you wanted to use the old values, you'd have to make a copy (very slow)

# 4) Matrix Factorization, Expanding our model

## Bias Terms
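- Some users rate everything high (or low), and some movies are rated higher than others on average
- It thus makes sense to add bias terms to the MF model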

\begin{equation*}
\hat{r}_{ij} = w_{i}^{T}u_{j} + b_{i} + c_{j} + \mu
\end{equation*}
$b_{i}$ = user bias  
$c_{j}$ = movie bias  
$\mu$ = global average  
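A sketch of the prediction with bias terms, using hand-picked toy values (all numbers are made up) so the arithmetic is easy to follow:

```python
import numpy as np

def predict(i, j, W, U, b, c, mu):
    """Dot product plus user bias, movie bias and global average."""
    return W[i].dot(U[j]) + b[i] + c[j] + mu

W = np.array([[1.0, 0.0], [0.0, 1.0]])   # user vectors
U = np.array([[2.0, 0.0], [0.0, 3.0]])   # movie vectors
b = np.array([0.5, -0.5])                # user biases
c = np.array([0.1, 0.2])                 # movie biases
mu = 3.0                                 # global average

r_00 = predict(0, 0, W, U, b, c, mu)     # 2.0 + 0.5 + 0.1 + 3.0
r_10 = predict(1, 0, W, U, b, c, mu)     # 0.0 - 0.5 + 0.1 + 3.0
```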

## Training

\begin{equation*}
J = \sum_{i,j\in\Omega}^{}(r_{ij}-\hat{r}_{ij})^2 \\
\hat{r}_{ij} = w_{i}^{T}u_{j}+ b_{i}+c_{j}+ \mu
\end{equation*}




## Solving for W

\begin{equation*}
\frac{\partial J}{\partial w_{i}} = 2 \sum_{j\in\Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j}) = 0 \\
\sum_{j\in\Psi_{i}}^{}(w_{i}^{T}u_{j})u_{j} = \sum_{j\in\Psi_{i}}^{}(r_{ij}-b_{i}-c_{j}-\mu)u_{j} \\
w_{i} = (\sum_{j\in\Psi_{i}}^{}u_{j}u_{j}^{T})^{-1}  \sum_{j\in\Psi_{i}}^{}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
\end{equation*}

## Solving for U

\begin{equation*}
u_{j} = (\sum_{i\in\Omega_{j}}^{}w_{i}w_{i}^{T})^{-1}  \sum_{i\in\Omega_{j}}^{}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}
\end{equation*}

## Solving for b

\begin{equation*}
\frac{\partial J}{\partial b_{i}} = 2 \sum_{j\in\Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1) = 0 \\
b_{i} = \frac{1}{|\Psi_{i}|}\sum_{j\in\Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
\end{equation*}

## Solving for c

\begin{equation*}
\frac{\partial J}{\partial c_{j}} = 2 \sum_{i\in\Omega_{j}}^{}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1) = 0 \\
c_{j} = \frac{1}{|\Omega_{j}|}\sum_{i\in\Omega_{j}}^{}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
\end{equation*}

- Don't need to update the global average (just calculate it directly from the train data)
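A sketch of the bias updates on a tiny made-up dataset. W and U are set to zero here purely to isolate the bias terms; the helper names `update_b` and `update_c` are illustrative.

```python
import numpy as np

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 2.0}
N, M, K = 2, 2, 2
W = np.zeros((N, K))
U = np.zeros((M, K))
b = np.zeros(N)
c = np.zeros(M)
mu = np.mean(list(ratings.values()))   # global average, computed once

def update_b(i):
    """Average residual over the movies user i rated (Psi_i)."""
    movies = [j for (u, j) in ratings if u == i]
    return np.mean([ratings[i, j] - W[i].dot(U[j]) - c[j] - mu for j in movies])

def update_c(j):
    """Average residual over the users who rated movie j (Omega_j)."""
    users = [i for (i, m) in ratings if m == j]
    return np.mean([ratings[i, j] - W[i].dot(U[j]) - b[i] - mu for i in users])

for i in range(N):
    b[i] = update_b(i)
for j in range(M):
    c[j] = update_c(j)     # uses the freshly updated b
```

In this toy case the biases alone reproduce every rating, since the data was chosen to be exactly "user effect + movie effect + average".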

# 5) Matrix Factorization, Regularization

## Regularization
- A technique to prevent overfitting and help generalization
- In linear regression
- Model $\hat{y} = w^{T}x$
- Objective $J = \sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^2 +\lambda||w||_{2}^{2}$
- Solution $w = (\lambda I + X^{T}X)^{-1}X^{T}y$
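The closed-form ridge solution can be sketched like this, on synthetic data (the data, `true_w`, and the helper name `ridge` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(50)

def ridge(X, y, lam):
    """Solve (lam * I + X^T X) w = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

w_small = ridge(X, y, 0.0)      # no regularization: ordinary least squares
w_large = ridge(X, y, 100.0)    # heavy regularization shrinks the weights
```

Larger $\lambda$ pulls the weights toward zero, trading some fit on the training data for smaller parameters.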

## Regularization in Matrix Factorization
- Same approach, add squared magnitude of each parameter multiplied by regularization constant
- $\|\cdot\|_{F}$ is called the Frobenius norm

$J = \sum_{i,j\in \Omega}^{}(r_{ij}-\hat{r}_{ij})^2 +\lambda(||W||_{F}^{2}+||U||_{F}^{2}+||b||_{2}^{2}+||c||_{2}^{2})$

## Solve for W
- Derivatives are additive, we just need to differentiate the 2nd term and add it to the existing derivative

\begin{equation*}
\frac{\partial J}{\partial w_{i}} = 2\sum_{j\in \Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j}- b_{i}-c_{j}-\mu)(-u_{j})+2\lambda w_{i} = 0
\end{equation*}

- If you can't see how I differentiated the Frobenius norm wrt $w_{i}$, expand it
- Now it's just a sum of dot products, which we know how to differentiate

\begin{equation*}
||W||_{F}^{2} = \sum_{i=1}^{N}\sum_{k=1}^{K}|w_{ik}|^{2} = \sum_{i=1}^{N}||w_{i}||_{2}^{2}= \sum_{i=1}^{N}w_{i}^{T}w_{i}
\end{equation*}

\begin{equation*}
\sum_{j\in \Psi_{i}}^{}u_{j}u_{j}^{T}w_{i}+ \lambda w_{i} =\sum_{j\in \Psi_{i}}^{} (r_{ij}- b_{i}-c_{j}-\mu)(u_{j}) \\
(\sum_{j\in \Psi_{i}}^{}u_{j}u_{j}^{T}+ \lambda I) w_{i} =\sum_{j\in \Psi_{i}}^{} (r_{ij}- b_{i}-c_{j}-\mu)u_{j} \\
w_{i} = (\sum_{j\in \Psi_{i}}^{}u_{j}u_{j}^{T}+ \lambda I)^{-1}\sum_{j\in \Psi_{i}}^{} (r_{ij}- b_{i}-c_{j}-\mu)u_{j}
\end{equation*}
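A sketch of the regularized update for one user. The toy numbers are made up, and the example deliberately uses a user with fewer ratings than K: without the $\lambda I$ term the matrix would be singular, with it the system is solvable.

```python
import numpy as np

def update_w(i, movies, ratings, U, b, c, mu, lam):
    """Regularized least-squares update for user vector w_i."""
    K = U.shape[1]
    A = lam * np.eye(K)
    v = np.zeros(K)
    for j in movies:
        A += np.outer(U[j], U[j])
        v += (ratings[i, j] - b[i] - c[j] - mu) * U[j]
    return np.linalg.solve(A, v)

# K = 3, but the user has rated only 2 movies.
U = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ratings = {(0, 0): 4.0, (0, 1): 2.0}
b = np.zeros(1)
c = np.zeros(2)
mu = 3.0

# A = diag(2, 2, 1) and v = (1, -1, 0)
w0 = update_w(0, [0, 1], ratings, U, b, c, mu, lam=1.0)
```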

## Solve for U 
\begin{equation*}
u_{j} = (\sum_{i\in \Omega_{j}}^{}w_{i}w_{i}^{T}+ \lambda I)^{-1}\sum_{i\in \Omega_{j}}^{} (r_{ij}- b_{i}-c_{j}-\mu)w_{i}
\end{equation*}

## Solve for b

\begin{equation*}
\frac{\partial J}{\partial b_{i}} = 2\sum_{j\in \Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j}-b_{i}- c_{j}-\mu)(-1)+2\lambda b_{i} = 0 \\
\sum_{j\in \Psi_{i}}^{}b_{i}+\lambda b_{i} = \sum_{j\in \Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j} -c_{j}-\mu) \\
b_{i}((\sum_{j\in \Psi_{i}}^{}1) + \lambda) = \sum_{j\in \Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j} -c_{j}-\mu) \\
b_{i} = \frac{1}{|\Psi_{i}|+\lambda}\sum_{j\in \Psi_{i}}^{}(r_{ij}-w_{i}^{T}u_{j} -c_{j}-\mu)
\end{equation*}

## Solve for c

\begin{equation*}
c_{j} = \frac{1}{|\Omega_{j}|+\lambda}\sum_{i\in \Omega_{j}}^{}(r_{ij}-w_{i}^{T}u_{j} -b_{i}-\mu)
\end{equation*}
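Putting the four regularized updates together gives a training loop along these lines. This is a sketch under assumptions, not a tuned implementation: the ratings are made up, and `fit`, `by_user`, `by_movie` are illustrative names.

```python
import numpy as np

def fit(ratings, N, M, K=2, lam=1.0, T=10, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, K))
    U = rng.standard_normal((M, K))
    b = np.zeros(N)
    c = np.zeros(M)
    mu = np.mean(list(ratings.values()))

    by_user, by_movie = {}, {}          # Psi_i and Omega_j
    for (i, j) in ratings:
        by_user.setdefault(i, []).append(j)
        by_movie.setdefault(j, []).append(i)

    def objective():
        sse = sum((r - W[i] @ U[j] - b[i] - c[j] - mu) ** 2
                  for (i, j), r in ratings.items())
        reg = lam * ((W ** 2).sum() + (U ** 2).sum() + (b ** 2).sum() + (c ** 2).sum())
        return sse + reg

    history = [objective()]
    for _ in range(T):
        for i, movies in by_user.items():
            A = lam * np.eye(K) + sum(np.outer(U[j], U[j]) for j in movies)
            v = sum((ratings[i, j] - b[i] - c[j] - mu) * U[j] for j in movies)
            W[i] = np.linalg.solve(A, v)
            b[i] = sum(ratings[i, j] - W[i] @ U[j] - c[j] - mu
                       for j in movies) / (len(movies) + lam)
        for j, users in by_movie.items():
            A = lam * np.eye(K) + sum(np.outer(W[i], W[i]) for i in users)
            v = sum((ratings[i, j] - b[i] - c[j] - mu) * W[i] for i in users)
            U[j] = np.linalg.solve(A, v)
            c[j] = sum(ratings[i, j] - W[i] @ U[j] - b[i] - mu
                       for i in users) / (len(users) + lam)
        history.append(objective())
    return W, U, b, c, mu, history

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 1.0,
           (2, 1): 2.0, (2, 2): 5.0, (3, 0): 3.0, (3, 2): 4.0}
W, U, b, c, mu, history = fit(ratings, N=4, M=3)
```

`history` tracks the regularized objective J after each sweep. Since every update is the exact minimizer for its own parameter with the others held fixed, the values should be non-increasing.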