# MovieLens Recommender System <a class='tocSkip'> 

## Sang-Yun Oh <a class='tocSkip'> 

In [1]:
!pip install altair
import altair as alt
import numpy as np
import pandas as pd

long = lambda x: x.stack().reset_index()

R = pd.read_pickle('R.pkl')
R_all = pd.read_pickle('R_all.pkl')



## Back to Recommender System: Best $U$ and $V$

- In a recommender system, we want to find $U$ and $V$ that minimize the residual:
$$
\begin{aligned}
\min_{U,V} f(U,V) &= \min_{U,V} \|R - V U^T\|_F^2
\end{aligned}
$$
- Due to missing values, we minimize over just the observed ratings: i.e.,
$$
\begin{aligned}
\min_{U,V} f(U,V) &= \min_{U,V} \|R - V U^T\|_F^2\\
&=\min_{U,V} \left\{ \sum_{m=1}^M\sum_{i=1}^I I_{mi}(r_{mi} - v_m u_i^T)^2 \right\}\\
&= \min_{U,V} \left\{ \sum_{m=1}^M\sum_{i=1}^I I_{mi}\cdot f_{mi}(v_m, u_i) \right\}
\end{aligned}
$$
where 
$$
\begin{aligned}
I_{mi} = \begin{cases}
1, & \text{if $r_{mi}$ is observed}\\
0, & \text{otherwise}
\end{cases}
\end{aligned}
$$

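- Each term $f_{mi}(v_m,u_i)$ is quadratic, and hence convex, in $u_i$ when $v_m$ is held fixed (and vice versa). We can therefore cycle through the observed ratings, decreasing each $f_{mi}(v_m,u_i)$ with respect to the vectors $u_i$ and $v_m$ in turn; with a small enough step size, this eventually decreases the sum $f(U,V)$.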

- In order to apply the update equations as in the toy examples, we compute the gradients:
$$
\begin{aligned}
\frac{\partial}{\partial u_i} f_{mi}(v_m,u_i) &= -2(r_{mi} - v_mu_i^T)\cdot v_m\\
\frac{\partial}{\partial v_m} f_{mi}(v_m,u_i) &= -2(r_{mi} - v_mu_i^T)\cdot u_i.
\end{aligned}
$$

- Now, the update formulas are
$$
\begin{aligned}
u_i^{\text{new}} &= u_i + 2\alpha(r_{mi} -  v_m u_i^T)\cdot v_m\\
v_m^{\text{new}} &= v_m + 2\alpha(r_{mi} -  v_m u_i^T)\cdot u_i,
\end{aligned}
$$
where $\alpha$ is the step size, and the updates are applied once for each observed pair $(m, i)$.
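
The gradient and update formulas above can be checked numerically on a single rating. The following is a minimal sketch rather than the notebook's own code: the vectors `v_m` and `u_i`, the rating `r_mi`, and the helper names `f_mi` and `grad_f_mi` are made up for illustration.

```python
import numpy as np

def f_mi(v_m, u_i, r_mi):
    """Squared residual for one observed rating."""
    return (r_mi - np.inner(v_m, u_i)) ** 2

def grad_f_mi(v_m, u_i, r_mi):
    """Analytic gradients of f_mi with respect to u_i and v_m."""
    resid = r_mi - np.inner(v_m, u_i)
    return -2 * resid * v_m, -2 * resid * u_i

# hypothetical values for one (movie, user) pair with K = 3
v_m = np.array([0.5, 0.1, 0.9])
u_i = np.array([0.2, 0.7, 0.4])
r_mi = 4.0
alpha = 0.1  # step size

grad_u, grad_v = grad_f_mi(v_m, u_i, r_mi)

# central finite-difference approximation of the gradient in u_i
eps = 1e-6
num_grad_u = np.array([
    (f_mi(v_m, u_i + eps * e, r_mi) - f_mi(v_m, u_i - eps * e, r_mi)) / (2 * eps)
    for e in np.eye(len(u_i))
])

# one gradient step on each vector, both evaluated at the old values
u_new = u_i - alpha * grad_u
v_new = v_m - alpha * grad_v

loss_before = f_mi(v_m, u_i, r_mi)
loss_after = f_mi(v_new, u_new, r_mi)
```

If the formulas are right, `grad_u` should agree with `num_grad_u`, and `loss_after` should be smaller than `loss_before` for this step size.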

## Preparing to Optimize

* Decide number of latent factors $K$
* Initialize matrices $U$ and $V$ with random values

In [2]:
I = 16 # number of users
M = 15 # number of movies
K = 5  # number of latent factors

# initialize U and V with random values
np.random.seed(42)

U = np.random.uniform(0, 1, size=K*I).reshape((I, K))
V = np.random.uniform(0, 1, size=K*M).reshape((M, K))

Uold = np.zeros_like(U)
Vold = np.zeros_like(V)

In [3]:
U.shape

(16, 5)

* Decide on a metric for improvement (root mean square error):
    $$ \text{RMSE}(x, y) = \left[\frac{1}{n}\sum_{i=1}^{n} \|x_i - y_i\|_2^2 \right]^{1/2},$$
    where matrices $x$ and $y$ are first vectorized and missing entries are skipped

In [4]:
# calculate RMSE
def rmse(X, Y):
    from numpy import sqrt, nanmean
    return sqrt(nanmean((X - Y)**2))

error = [(0, rmse(R, np.inner(V,U)))]

* Keep track of updates to $U$ and $V$:
    $$ \text{MaxUpdate}(U^{\text{(old)}}, U^{\text{(new)}}) = \left\|\frac{U^{\text{(old)}}-U^{\text{(new)}}}{U^{\text{(new)}}}\right\|_\infty,$$
    where the difference, ratio, and matrix norm are computed element-wise.

In [5]:
# calculate maximum magnitude of relative updates
def max_update(X, Y):
    from numpy import inf
    from numpy.linalg import norm
    return norm(((X - Y)/Y).ravel(), inf)

update = [(0, max(max_update(Uold, U), max_update(Vold, V)))]

## Compute Solutions: $U$ and $V$

In [6]:
rate = 0.1            # learning rate (step size) 
max_iterations = 300  # maximum number of iterations
threshold = 0.001     # max_update threshold for termination

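# note: range(1, max_iterations) runs max_iterations - 1 sweeps; threshold is not used here
for t in range(1, max_iterations):
     
    # one sweep of gradient updates over the observed (movie, user) pairs
    for m, i in zip(*np.where(~np.isnan(R))):
        
        # the factor 2 from the gradient is absorbed into rate
        U[i] = U[i] + rate*V[m]*(R.iloc[m,i] - np.inner(V[m], U[i]))
        # the V[m] update uses the freshly updated U[i]
        V[m] = V[m] + rate*U[i]*(R.iloc[m,i] - np.inner(V[m], U[i]))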
        
    # compute error after one sweep of updates
    error += [(t, rmse(R, np.inner(V,U)))]
    
    # keep track of how much U and V changes
    update += [(t, max(max_update(Uold, U), max_update(Vold, V)))]
    Uold = U.copy()
    Vold = V.copy()
    
error = pd.DataFrame(error, columns=['iteration', 'rmse'])
update = pd.DataFrame(update , columns=['iteration', 'maximum update'])
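
The loop above always runs the full number of sweeps. As a hedged sketch of how the `threshold` setting could be used for early termination, the example below applies the same sequential updates to a small made-up ratings matrix and stops once the maximum relative update falls below the threshold. The names `R_toy`, `U_toy`, `V_toy`, and the values chosen for `rate`, `max_iterations`, and `threshold` are assumptions for illustration only.

```python
import numpy as np

def rmse(X, Y):
    """Root mean square error over the entries where X is observed."""
    return np.sqrt(np.nanmean((X - Y) ** 2))

def max_update(X_old, X_new):
    """Largest element-wise relative change between two iterates."""
    return np.max(np.abs((X_old - X_new) / X_new))

# hypothetical 3 movies x 3 users ratings matrix; NaN marks a missing rating
R_toy = np.array([[5.0, 3.0, np.nan],
                  [4.0, np.nan, 1.0],
                  [np.nan, 2.0, 5.0]])

rng = np.random.default_rng(0)
M_toy, I_toy = R_toy.shape
K_toy = 2
U_toy = rng.uniform(0, 1, size=(I_toy, K_toy))
V_toy = rng.uniform(0, 1, size=(M_toy, K_toy))

rate, max_iterations, threshold = 0.05, 500, 1e-3
rmse_hist = [rmse(R_toy, V_toy @ U_toy.T)]
sweeps = 0

for t in range(1, max_iterations + 1):
    U_old, V_old = U_toy.copy(), V_toy.copy()
    # one sweep over the observed (movie, user) pairs
    for m, i in zip(*np.where(~np.isnan(R_toy))):
        resid = R_toy[m, i] - V_toy[m] @ U_toy[i]
        U_toy[i] = U_toy[i] + rate * resid * V_toy[m]
        # as in the loop above, the V update uses the updated U[i]
        resid = R_toy[m, i] - V_toy[m] @ U_toy[i]
        V_toy[m] = V_toy[m] + rate * resid * U_toy[i]
    sweeps = t
    rmse_hist.append(rmse(R_toy, V_toy @ U_toy.T))
    # stop once neither factor changed by more than the threshold
    if max(max_update(U_old, U_toy), max_update(V_old, V_toy)) < threshold:
        break
```

Because the relative update divides by the new value, entries of $U$ or $V$ that are close to zero can keep this criterion above the threshold, so the cap on the number of sweeps is still needed.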

## Monitoring Optimization Progress

* As gradient descent progresses, $U$ and $V$ are updated. How large are the updates?
* Are the updates getting better? Does RMSE decrease?

In [7]:
f_rmse = alt.Chart(error).encode(x='iteration:Q', y=alt.Y('rmse:Q', scale=alt.Scale(type='log', base=10, domain=[0.1, 3])))
f_update = alt.Chart(update).encode(x='iteration:Q', y=alt.Y('maximum update:Q', scale=alt.Scale(type='log', base=10)))

alt.hconcat(
    alt.layer(f_rmse.mark_line(), f_rmse.mark_point(filled=True), title='Root Mean Square Error'),
    alt.layer(f_update.mark_line(), f_update.mark_point(filled=True), title='Maximum Relative Update')
)

# Visualize Results

## Ratings

Comparison Data Frame:
* `observed`: observed ratings, $R$
* `fit`: calculated ratings $\hat r_{mi}$  if $r_{mi}$ is observed
* `fit/prediction`: $\hat R = VU^T$
* `deviation`: $(\hat r_{mi} - r_{mi})\cdot I_{mi}$, where $I_{mi}$ indicates if user $i$ rated movie $m$

In [8]:
Rone = R.copy()
Rone.loc[:,:] = 1 # easiest way to copy over row/column names
Rhat = np.inner(V, U) * Rone
Rhat_if_obs = Rhat.where(~np.isnan(R), np.nan)

R_compare = \
    R.rename(columns={'rating':'observed'})\
    .join(Rhat_if_obs.rename(columns={'rating':'fit'}))\
    .join(Rhat.rename(columns={'rating':'fit/prediction'}))\
    .join((Rhat_if_obs-R).rename(columns={'rating':'deviation'}))

long(R_compare)

| | movie id | movie title | user id | deviation | fit | fit/prediction | observed |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 132 | Wizard of Oz, The (1939) | 1 | 0.108267 | 4.108267 | 4.108267 | 4.0 |
| 1 | 132 | Wizard of Oz, The (1939) | 85 | 0.507715 | 5.507715 | 5.507715 | 5.0 |
| 2 | 132 | Wizard of Oz, The (1939) | 178 | NaN | NaN | 1.228071 | NaN |
| 3 | 132 | Wizard of Oz, The (1939) | 269 | 0.443834 | 5.443834 | 5.443834 | 5.0 |
| 4 | 132 | Wizard of Oz, The (1939) | 271 | -0.382464 | 4.617536 | 4.617536 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 111 | Truth About Cats & Dogs, The (1996) | 389 | -0.000053 | 2.999947 | 2.999947 | 3.0 |
| 236 | 111 | Truth About Cats & Dogs, The (1996) | 650 | NaN | NaN | 5.022816 | NaN |
| 237 | 111 | Truth About Cats & Dogs, The (1996) | 716 | -0.000063 | 3.999937 | 3.999937 | 4.0 |
| 238 | 111 | Truth About Cats & Dogs, The (1996) | 727 | -0.000152 | 2.999848 | 2.999848 | 3.0 |


# Recommend Movies

For user $i$, recommend the unwatched movie $m$ with the highest predicted rating $\hat r_{mi}$.
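
As a rough sketch of this rule, the example below masks out the movies a user has already rated and picks the highest remaining prediction. The small tables `R_toy` and `Rhat_toy`, including the movie names and all values, are made up for illustration; in the notebook the same idea would apply to `R` and `Rhat`, assuming their rows are movies and their columns are users.

```python
import numpy as np
import pandas as pd

movies = ['Movie A', 'Movie B', 'Movie C']
users = [1, 2]

# hypothetical observed ratings (NaN = not yet watched) and predictions
R_toy = pd.DataFrame([[5.0, np.nan],
                      [np.nan, 2.0],
                      [np.nan, np.nan]], index=movies, columns=users)
Rhat_toy = pd.DataFrame([[4.8, 4.9],
                         [2.5, 2.2],
                         [4.1, 4.6]], index=movies, columns=users)

# keep predictions only where no rating was observed
unwatched = Rhat_toy.where(R_toy.isna())

# for each user (column), the unwatched movie with the highest prediction
recommendation = unwatched.idxmax(axis=0)
```

Here user 1 has already rated Movie A, so the recommendation falls to the best remaining prediction, while user 2 is recommended Movie A.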

In [9]:
base = alt.Chart(long(R)).mark_rect().encode(
    x='user id:O',
    y='movie title:O',
    color='rating:O',
    tooltip=['user id', 'movie title', 'rating']
)

In [10]:
base

In [11]:
base = alt.Chart(long(R_compare)).mark_rect().encode(
    x='user id:O',
    y='movie title:O',
    tooltip=['user id', 'movie title', 'fit/prediction', 'observed', 'deviation']
)

f_all = base\
    .properties(title='Ratings Fit and Predictions')\
    .encode(color=alt.Color('fit/prediction:Q', scale=alt.Scale(scheme='yellowgreenblue', domain=[1, 5])))
f_raw = base\
    .properties(title='Ratings Data')\
    .encode(color=alt.Color('observed:O', scale=alt.Scale(scheme='yellowgreenblue', domain=[1,2,3,4,5])))
f_err = base\
    .properties(title='Deviation: Fit - Data')\
    .encode(color=alt.Color('deviation:Q', scale=alt.Scale(scheme='redblue', domain=[-2, 2])))

In [12]:
nearest = alt.selection(type='single', nearest=True, on='mouseover', empty='none') 

selectors = base.mark_square(filled=False, size=350).encode(
    x='user id:O',
    y='movie title:O',
    color=alt.value('black'),
    opacity=alt.condition(nearest, alt.value(1), alt.value(0))
).add_selection(
    nearest
)

alt.hconcat(
    alt.layer(f_all.encode(color=alt.Color('fit/prediction:Q', legend=None, scale=alt.Scale(scheme='yellowgreenblue', domain=[1, 5]))),
              selectors),
    alt.layer(f_raw.encode(y=alt.Y('movie title:O', axis=alt.Axis(labels=False))),
              selectors),
).resolve_scale(color='independent')

## Recommendation: User id 85

| **Watched** | **Recommend** | **Recommend** |
| :-: | :-: | :-: | 
| **5 (observed) / 5.5 (fit/prediction)** | **5.8 (fit/prediction)** | **4.3 (fit/prediction)** |
| ![https://www.imdb.com/title/tt0032138/](https://m.media-amazon.com/images/M/MV5BNjUyMTc4MDExMV5BMl5BanBnXkFtZTgwNDg0NDIwMjE@._V1_SY1000_CR0,0,670,1000_AL_.jpg) | ![https://www.imdb.com/title/tt0117979/](https://m.media-amazon.com/images/M/MV5BOWM0MTA4NjItMzM3ZS00NDJmLTg3NWItNGE5ODIyOGJhNzQ0L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SY1000_CR0,0,666,1000_AL_.jpg) | ![https://www.imdb.com/title/tt0117951/](https://m.media-amazon.com/images/M/MV5BMzA5Zjc3ZTMtMmU5YS00YTMwLWI4MWUtYTU0YTVmNjVmODZhXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_SY1000_SX675_AL_.jpg) |
| *The Wizard of Oz (1939)* | *Truth About Cats & Dogs (1996)* | *Trainspotting (1996)* |

## Recommendation: User id 727

| **Watched** | **Recommend** | **Recommend** |
| :-: | :-: |  :-: | 
| **5 (observed) / 5.2 (fit/prediction)** | **5.8 (fit/prediction)** | **4.6 (fit/prediction)** |
| ![https://www.imdb.com/title/tt0080455/](https://m.media-amazon.com/images/M/MV5BYTdlMDExOGUtN2I3MS00MjY5LWE1NTAtYzc3MzIxN2M3OWY1XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_.jpg) | ![https://www.imdb.com/title/tt0099348/](https://m.media-amazon.com/images/M/MV5BMTY3OTI5NDczN15BMl5BanBnXkFtZTcwNDA0NDY3Mw@@._V1_SY1000_CR0,0,666,1000_AL_.jpg) | ![https://www.imdb.com/title/tt0038650/](https://m.media-amazon.com/images/M/MV5BZjc4NDZhZWMtNGEzYS00ZWU2LThlM2ItNTA0YzQ0OTExMTE2XkEyXkFqcGdeQXVyNjUwMzI2NzU@._V1_SY1000_CR0,0,687,1000_AL_.jpg) |
| *The Blues Brothers (1980)* | *Dances with Wolves (1990)* | *It's a Wonderful Life (1946)* |


In [13]:
nearest = alt.selection(type='single', nearest=True, on='mouseover', empty='none') 

selectors = base.mark_square(filled=False, size=350).encode(
    x='user id:O',
    y='movie title:O',
    color=alt.value('black'),
    opacity=alt.condition(nearest, alt.value(1), alt.value(0))
).add_selection(
    nearest
)

alt.hconcat(
    alt.layer(f_all.encode(color=alt.Color('fit/prediction:Q', legend=None, scale=alt.Scale(scheme='yellowgreenblue', domain=[1, 5]))),
              selectors),
    alt.layer(f_err.encode(y=alt.Y('movie title:O', axis=alt.Axis(labels=False))),
              selectors),
).resolve_scale(color='independent')

# Comparing Users or Comparing Movies

* $K$ latent factors, i.e., unobserved characteristics
* $v_{mk}$: degree to which movie $m$ has characteristic $k$
* $u_{ik}$: user $i$'s affinity to characteristic $k$

$$ \hat r_{mi} = \sum_{k=1}^K v_{mk} u_{ik} = v_{m} u_{i}^T $$

## Matrix Factors: $V$ and $U$

* Each row of $U$ represents a user
* Compare users $i$ and $j$:
    $$\|u_i - u_j\|^2_2$$
* Each row of $V$ represents a movie
* Compare movies $m$ and $n$:
    $$\|v_m - v_n\|^2_2$$
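
The squared distances above can be computed for all pairs at once. The sketch below does this for a made-up user-factor matrix `U_toy` (three users, two latent factors); the same pattern would apply to the rows of $V$ for movies. The names `U_toy`, `D`, and `nearest_to_0` are illustrative only.

```python
import numpy as np

# hypothetical user factors: 3 users, K = 2 latent factors
U_toy = np.array([[1.0, 0.0],
                  [1.0, 0.5],
                  [-1.0, 2.0]])

# pairwise squared Euclidean distances via broadcasting:
# diff[i, j] = U_toy[i] - U_toy[j]
diff = U_toy[:, None, :] - U_toy[None, :, :]
D = (diff ** 2).sum(axis=-1)

# nearest other user to user 0: mask the diagonal before taking argmin
D_masked = D + np.diag(np.full(len(U_toy), np.inf))
nearest_to_0 = int(np.argmin(D_masked[0]))
```

A small distance means two users weight the latent characteristics similarly; in this toy example user 1 is the closest to user 0.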

In [14]:
V = pd.DataFrame(V, index=R.index, 
                 columns=pd.MultiIndex.from_product([['affinity'], range(0, K)], names=[None, 'k']))
U = pd.DataFrame(U, index=R.columns.get_level_values(level='user id'),
                 columns=pd.MultiIndex.from_product([['affinity'], range(0, K)], names=[None, 'k']))

In [15]:
long(U).head()

| | user id | k | affinity |
| :-: | :-: | :-: | :-: |
| 0 | 883 | 0 | -1.573376 |
| 1 | 883 | 1 | 1.738146 |
| 2 | 883 | 2 | 0.292821 |
| 3 | 883 | 3 | 0.608656 |
| 4 | 883 | 4 | 1.214837 |


In [16]:
alt.hconcat(
    alt.Chart(long(V)).mark_rect().encode(x='k:O', y='movie title:O', color='affinity:Q'),
    alt.Chart(long(U)).mark_rect().encode(x='k:O', y='user id:O', color='affinity:Q')
)

# What can be improved? 

* $U$ and $V$ together have $K(I+M)$ elements
* As $K$ increases, training $\text{RMSE}\rightarrow 0$
* Magnitude of elements of $\hat R$, $U$, and $V$ may get larger 
* Elements of $\hat R$ can be outside the allowed range: i.e., $\hat r_{mi}\not\in [1, 5]$  
* $\text{RMSE}$ can increase back up as the iteration count $t$ increases for large $K$  
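
Two of these points are easy to quantify. The sketch below counts the free parameters for the sizes used earlier ($I=16$, $M=15$, $K=5$) and measures how many predictions fall outside the rating scale. The array `Rhat_toy` is made up for illustration; in the notebook the same check could be applied to `Rhat`.

```python
import numpy as np

I, M, K = 16, 15, 5          # sizes used earlier in this notebook
n_params = K * (I + M)       # entries in U and V combined
n_cells = I * M              # upper bound on the number of observed ratings

# hypothetical predictions, some outside the 1-5 rating scale
Rhat_toy = np.array([[4.1, 5.8, 0.6],
                     [2.9, 5.0, 1.2]])

# flag predictions below 1 or above 5 and compute their share
out_of_range = (Rhat_toy < 1) | (Rhat_toy > 5)
share_out = out_of_range.mean()
```

With these sizes the model has 155 parameters for at most 240 ratings, which gives a sense of how quickly larger $K$ lets the factors fit the observed ratings closely.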