# Decomposing Factorization

### Example Dataset

Let's look at users' ratings of different movies. Ratings range from 1 to 5. A rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

In [109]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [116]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Make interpretable
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

## PCA for Dimensionality Reduction

In PCA, we have the following:

\begin{align*}
M^TMV &= V\Lambda \\
\text{where } \Lambda &\text{ is the diagonal matrix of eigenvalues.}
\end{align*}

According to SVD, we have:

\begin{align*}
M &= U \Sigma V^T
\end{align*}

Substituting this into $M^TM$, we get the following:

\begin{align*}
M^TM &= (U \Sigma V^T)^T U \Sigma V^T \\
&= V \Sigma^T U^T U \Sigma V^T \\
&= V \Sigma^2 V^T
\end{align*}

Right-multiplying by $V$ (and using $V^TV = I$) gives $M^TMV = V\Sigma^2$. This is the same equation as with PCA, where:

$\Lambda = \Sigma^2$.
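
The identity can be checked numerically. The following is a minimal sketch (Python 3 with NumPy) on the raw ratings matrix from the example dataset. It compares the eigenvalues of $M^TM$ with the squared singular values of $M$, and uses only `np.linalg.svd` and `np.linalg.eigvalsh`.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

# Singular values of M, returned in descending order
_, sigma, VT = np.linalg.svd(M, full_matrices=False)

# Eigenvalues of the symmetric matrix M^T M (eigvalsh returns ascending order)
eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]

# Lambda == Sigma^2
eigs_match = np.allclose(eigvals, sigma ** 2)
# M^T M == V Sigma^2 V^T
gram_match = np.allclose(M.T @ M, VT.T @ np.diag(sigma ** 2) @ VT)
print(eigs_match, gram_match)
```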

In [128]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

M_scaled = scaler.fit_transform(M)

pca = PCA()
V_pca = pca.fit_transform(M_scaled)
pd.DataFrame(np.around(pca.components_, 2))

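|       |     0 |     1 |     2 |     3 |     4 |
| ----- | ----- | ----- | ----- | ----- | ----- |
| **0** |  0.44 |  0.45 |  0.46 | -0.44 | -0.44 |
| **1** | -0.44 | -0.36 | -0.28 | -0.55 | -0.55 |
| **2** |  0.69 | -0.72 | -0.02 | -0.04 | -0.04 |
| **3** |  0.37 |  0.38 | -0.84 | -0.06 | -0.06 |
| **4** |  -0.0 |   0.0 |   0.0 |  0.71 | -0.71 |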


In [129]:
print(np.around(V_pca, 2))
print()
print(np.around(np.dot(M_scaled, pca.components_.T), 2))

[[ 0.27  1.23  0.04 -0.1   0.  ]
 [ 2.09 -0.18 -0.48 -0.26  0.  ]
 [ 1.86 -0.09  0.28  0.11  0.  ]
 [ 2.54 -0.64  0.23  0.12  0.  ]
 [-2.16 -0.49 -0.43  0.23  0.  ]
 [-3.09 -0.65  0.32 -0.25  0.  ]
 [-1.51  0.81  0.04  0.15  0.  ]]

[[ 0.27  1.23  0.04 -0.1   0.  ]
 [ 2.09 -0.18 -0.48 -0.26  0.  ]
 [ 1.86 -0.09  0.28  0.11  0.  ]
 [ 2.54 -0.64  0.23  0.12  0.  ]
 [-2.16 -0.49 -0.43  0.23  0.  ]
 [-3.09 -0.65  0.32 -0.25  0.  ]
 [-1.51  0.81  0.04  0.15  0.  ]]


## SVD for Topic Analysis

We can use SVD to determine what we call ***latent features***. This will be best demonstrated with an example.

Note that the first three movies (Matrix, Alien, StarWars) are Sci-fi movies and the last two (Casablanca, Titanic) are Romance. We will be able to mathematically pull out these topics!

Let's do the computation with Python.

In [59]:
# Compute SVD
from numpy.linalg import svd

U, sigma, VT = svd(M_scaled)

In [60]:
U, sigma, VT = (np.around(x, 2) for x in (U, sigma, VT))

U = pd.DataFrame(U, index=users)
VT = pd.DataFrame(VT, columns=movies)

sigma_df = pd.DataFrame(np.diag(sigma))
sigma_df

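|       |    0 |    1 |    2 |    3 |   4 |
| ----- | ---- | ---- | ---- | ---- | --- |
| **0** | 5.55 |  0.0 |  0.0 |  0.0 |   0 |
| **1** |  0.0 | 1.81 |  0.0 |  0.0 |   0 |
| **2** |  0.0 |  0.0 | 0.81 |  0.0 |   0 |
| **3** |  0.0 |  0.0 |  0.0 | 0.49 |   0 |
| **4** |  0.0 |  0.0 |  0.0 |  0.0 |   0 |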


In [61]:
U

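|           |     0 |     1 |     2 |     3 |     4 |     5 |     6 |
| --------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| **Alice** |  0.05 | -0.68 | -0.05 |  0.21 | -0.29 |   0.3 | -0.56 |
|   **Bob** |  0.38 |   0.1 |  0.59 |  0.52 | -0.34 |  0.14 |   0.3 |
| **Cindy** |  0.33 |  0.05 | -0.35 | -0.22 | -0.71 | -0.46 |  0.04 |
|   **Dan** |  0.46 |  0.35 | -0.28 | -0.25 | -0.01 |  0.72 | -0.06 |
| **Emily** | -0.39 |  0.27 |  0.54 | -0.47 | -0.38 |  0.13 | -0.33 |
| **Frank** | -0.56 |  0.36 |  -0.4 |  0.51 | -0.32 |  0.19 |  0.01 |
|  **Greg** | -0.27 | -0.45 | -0.04 |  -0.3 | -0.21 |  0.33 |  0.69 |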


In [62]:
VT

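|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------- |
| **0** |   0.44 |  0.45 |     0.46 |      -0.44 |   -0.44 |
| **1** |   0.44 |  0.36 |     0.28 |       0.55 |    0.55 |
| **2** |  -0.69 |  0.72 |     0.02 |       0.04 |    0.04 |
| **3** |  -0.37 | -0.38 |     0.84 |       0.06 |    0.06 |
| **4** |    0.0 |   0.0 |      0.0 |      -0.71 |    0.71 |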


### Explained Variance Ratio

In [95]:
# Power
# singular values are square roots of eigenvalues
total_power = np.sum(sigma**2)
total_power

fraction_power = np.cumsum(sigma**2) / total_power
fraction_power

array([ 0.88070554,  0.97437584,  0.99313506,  1.        ,  1.        ])
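
The cumulative fractions above suggest a simple way to choose how many concepts to keep. Here is a minimal sketch (Python 3 with NumPy). The helper name `components_for_energy` and the 95% threshold are assumptions made for illustration, and the singular values are the rounded ones displayed for `M_scaled` earlier.

```python
import numpy as np

def components_for_energy(sigma, threshold=0.95):
    """Smallest k whose top-k singular values retain `threshold` of the total power."""
    power = np.asarray(sigma, dtype=float) ** 2
    fraction = np.cumsum(power) / power.sum()
    # Index of the first cumulative fraction that reaches the threshold, as a count
    return int(np.searchsorted(fraction, threshold) + 1)

# Rounded singular values of M_scaled, copied from the sigma table above
sigma = np.array([5.55, 1.81, 0.81, 0.49, 0.0])
k = components_for_energy(sigma, threshold=0.95)
print(k)
```

With this threshold the helper returns two components, which matches the choice made in the next cell.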

In [45]:
# Keep only top two concepts
# (the output shown below comes from an SVD of the unscaled M, hence the larger singular values)
U = U.iloc[:,:2]
sigma = sigma[:2]
VT = VT.iloc[:2,:]

print(U)
print(sigma)
print(VT)

          0     1
Alice -0.21  0.02
Bob   -0.55  0.06
Cindy -0.50  0.07
Dan   -0.62  0.08
Emily -0.12 -0.60
Frank -0.04 -0.73
Greg  -0.06 -0.30
[ 13.84   9.52]
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70


In [49]:
# Check the reconstruction

np.around(U.dot(np.diag(sigma)).dot(VT))

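|           | Matrix | Alien | StarWars | Casablanca | Titanic |
| --------- | ------ | ----- | -------- | ---------- | ------- |
| **Alice** |      1 |     2 |        2 |          0 |       0 |
|   **Bob** |      4 |     5 |        5 |          0 |       0 |
| **Cindy** |      4 |     4 |        4 |          0 |       0 |
|   **Dan** |      4 |     5 |        5 |          0 |       0 |
| **Emily** |      0 |     1 |        0 |          4 |       4 |
| **Frank** |      0 |     1 |        0 |          5 |       5 |
|  **Greg** |      0 |     1 |        0 |          2 |       2 |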
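
The quality of a truncated reconstruction can also be measured directly. The sketch below (Python 3 with NumPy, run on the raw ratings matrix) builds the rank-2 approximation. It then checks the known property that the Frobenius error of the truncation equals the norm of the discarded singular values. The choice `k = 2` mirrors the cell above.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2
# Rank-k approximation: keep only the k largest singular triplets
M_k = U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]

# Frobenius error of the truncation vs. norm of the discarded singular values
error = np.linalg.norm(M - M_k, 'fro')
discarded = np.sqrt(np.sum(sigma[k:] ** 2))
print(np.isclose(error, discarded))
print(np.linalg.matrix_rank(M_k))
```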


### PCA vs. SVD

#### Applications of SVD

1. Determining range, null space and rank (also numerical rank).
2. Matrix approximation.
3. Inverse and Pseudo-inverse.
4. Least squares.
5. Denoising: small singular values typically correspond to noise. Take the matrix whose columns are the signals, compute SVD, zero small singular values, and reconstruct.
6. Compression.
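
As an illustration of items 3 and 4, here is a minimal sketch (Python 3 with NumPy) of a pseudo-inverse built from the SVD and used to solve a least-squares problem. The helper `pinv_via_svd`, its `rcond` cutoff, and the small system `A`, `b` are hypothetical and chosen only for the example. The result is compared against `np.linalg.lstsq`.

```python
import numpy as np

def pinv_via_svd(A, rcond=1e-10):
    """Moore-Penrose pseudo-inverse from the SVD, zeroing tiny singular values."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    keep = s > rcond * s.max()          # singular values treated as non-zero
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return VT.T @ np.diag(s_inv) @ U.T

# Hypothetical overdetermined system A x ~= b (fit a line through four points)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

x_svd = pinv_via_svd(A) @ b                     # least-squares solution via SVD
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]    # NumPy's reference solution
print(np.allclose(x_svd, x_ref))
```

The number of singular values kept by the cutoff is also the numerical rank mentioned in item 1.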

#### Why not PCA?

* The covariance matrix can be huge when the number of features is large (which is probably why we are doing dimensionality reduction in the first place)
    * Computing its eigenvectors/eigenvalues becomes computationally intractable
* Not every matrix has an eigendecomposition, but every matrix has an SVD
* Computing the SVD is more numerically stable
* "Holistic" and can be difficult to interpret
* Non-iterative

#### Why not PCA, SVD, or NMF?

* Usually computationally intractable on large matrices: $O(m^3)$

## Stochastic Gradient Descent
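Minimize $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$

For matrix factorization, the objective is:

$\min_{q,p}\hspace{1mm} \sum_{u,i} (r_{ui} - p_u^T q_i)^2$

Initialize $x^{(0)}$  
<br>
For j = 0, 1, 2, ...  
$x^{(j + 1)} = x^{(j)} - \gamma\nabla f_i(x^{(j)})$, where $i$ is sampled at random and $\gamma$ is the learning rate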

* Not practical when dealing with implicit datasets
  * The number of user-item pairs (users times items) is on the order of billions
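
The update rule above can be sketched for the movie ratings as follows (Python 3 with NumPy). This is an illustrative sketch under stated assumptions, not a tuned implementation. The rank `k`, learning rate `gamma`, epoch count, random seed, and initialization scale are all assumed values, and zeros are treated as "not rated" and skipped.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

rng = np.random.default_rng(0)
k, gamma, n_epochs = 2, 0.01, 200       # assumed hyperparameters

mask = M > 0                            # 0 means "not rated"
observed = np.argwhere(mask)            # (user, item) index pairs with a rating

P = 0.1 * rng.standard_normal((M.shape[0], k))   # user factors p_u
Q = 0.1 * rng.standard_normal((M.shape[1], k))   # item factors q_i

def rmse(P, Q):
    """Root mean squared error over the observed ratings only."""
    return np.sqrt(np.mean((M - P @ Q.T)[mask] ** 2))

rmse_start = rmse(P, Q)
for _ in range(n_epochs):
    rng.shuffle(observed)               # visit the ratings in random order
    for u, i in observed:
        err = M[u, i] - P[u] @ Q[i]
        p_old = P[u].copy()
        # Gradient of (r_ui - p_u^T q_i)^2 is -2 * err * q_i w.r.t. p_u (and -2 * err * p_u w.r.t. q_i)
        P[u] += gamma * 2 * err * Q[i]
        Q[i] += gamma * 2 * err * p_old
rmse_end = rmse(P, Q)
print(rmse_start, rmse_end)
```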

## Alternating Least Squares
$M \approx U_k V_k^T = Q$
* U and V are "skinny" (low-rank)
* Flexible
  * customize exactly what is considered in the loss function
    * ignore missing values (crucial)
    * weight different entries $M_{i,j}$ differently
* Faster than SVD
  * Trivially parallel, iterative
* Dumber than SVD
  * No singular values or orthonormal basis  
<br>
* Choose k << m, n
* Best Approximation Q
  * Minimize L2 norm $||M-Q||_2$  
<br>
* Recommendations are **largest** values in Q
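
A minimal ALS sketch on the movie ratings (Python 3 with NumPy) might look like the following. The rank `k`, the ridge penalty `lam` (added here so that every per-row solve is well-posed), the iteration count, and the helper `solve_rows` are assumptions for illustration. Missing values (zeros) are ignored in the loss, as described in the list above.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

mask = M > 0                 # 0 means "not rated"
k, lam = 2, 0.1              # assumed rank and ridge penalty
rng = np.random.default_rng(0)
U = rng.standard_normal((M.shape[0], k))
V = rng.standard_normal((M.shape[1], k))

def objective(U, V):
    """Squared error on observed entries plus the ridge penalty."""
    resid = (M - U @ V.T)[mask]
    return np.sum(resid ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

def solve_rows(X, Y, R, W):
    """Update each row of X by ridge regression on its observed entries, with Y fixed."""
    for r in range(X.shape[0]):
        Y_obs = Y[W[r]]                              # factors of the entries row r observed
        A = Y_obs.T @ Y_obs + lam * np.eye(k)
        X[r] = np.linalg.solve(A, Y_obs.T @ R[r, W[r]])

start = objective(U, V)
for _ in range(20):
    solve_rows(U, V, M, mask)        # fix V, solve for every user
    solve_rows(V, U, M.T, mask.T)    # fix U, solve for every movie
end = objective(U, V)

Q = U @ V.T                                  # dense matrix of predicted ratings
scores = np.where(mask, -np.inf, Q)          # hide movies the user already rated
best = scores.argmax(axis=1)                 # top unrated movie per user
print(start, end, best)
```

Each row update is independent of the others, which is what makes the method easy to parallelize.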

## Non-negative Matrix Factorization
* factors are non-negative
* good for matrices with non-negative values (TF, TF-IDF, etc.)
* factors are non-unique
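
One common way to compute such a factorization is the multiplicative update rule of Lee and Seung. The sketch below (Python 3 with NumPy) is an assumption about the algorithm, since the notes do not name one. The rank `k`, iteration count, seed, and `eps` guard are also assumed. Unlike the ALS setting, zeros are treated here as actual values rather than missing ratings.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

k = 2          # assumed number of latent topics
eps = 1e-9     # guards against division by zero
rng = np.random.default_rng(0)
W = rng.random((M.shape[0], k)) + eps   # non-negative starting point
H = rng.random((k, M.shape[1])) + eps

err_start = np.linalg.norm(M - W @ H)
for _ in range(200):
    # Multiplicative updates: each factor is rescaled by a non-negative ratio,
    # so no entry can ever become negative
    H *= (W.T @ M) / (W.T @ W @ H + eps)
    W *= (M @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(M - W @ H)
print(err_start, err_end)
```

A different seed generally yields different `W` and `H`, which illustrates the non-uniqueness noted above.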

## Recommender Systems
[Graphlab User Guide](https://dato.com/learn/userguide/)
<img src="decisiontree.gif">


## Graphlab Recommender
* Factorization Recommender
* Item Similarity Recommender
* Popularity Recommender
* Ranking Factorization Recommender
<br>
* [Recommender Functionalities](https://dato.com/learn/gallery/notebooks/basic_recommender_functionalities.html)
  * [Example](https://dato.com/learn/gallery/notebooks/recsys_rank_10K_song.html)

### Normalization / Bias
* Problem
  * Some items are significantly higher rated
  * Some users rate substantially lower
  * Ratings change over time
* Bias correction is crucial
  * Offset per user
  * Offset per movie
  * Time effects
  * Global bias  

<br> 
$b_{ui} = \mu^{global} + b_u^{user} + b_i^{item}$
<br>

**New cost function**  

$\text{minimize}\hspace{1mm}\sum_{u,i}(r_{ui} - \mu - b_u - b_i - \langle p_u,q_i \rangle)^2 + \lambda(||p||^2 + ||q||^2 + ||b_u||^2 + ||b_i||^2)$
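
To make the baseline $b_{ui}$ concrete, here is a minimal sketch (Python 3 with NumPy) that estimates the offsets for the movie ratings by simple sequential averaging. This is an assumed heuristic for illustration. It does not minimize the regularized cost above, and it ignores time effects.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

mask = M > 0                     # 0 means "not rated"
mu = M[mask].mean()              # global mean over observed ratings

# User offset: average deviation of each user's ratings from the global mean
b_user = np.array([(M[u, mask[u]] - mu).mean() for u in range(M.shape[0])])

# Item offset: average remaining deviation after removing the user offset
resid = np.where(mask, M - mu - b_user[:, None], 0.0)
b_item = resid.sum(axis=0) / mask.sum(axis=0)

# Baseline prediction b_ui for every (user, movie) pair
baseline = mu + b_user[:, None] + b_item[None, :]
print(mu)
print(np.around(b_user, 2))
print(np.around(b_item, 2))
```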