## SVD for topic analysis

We can use singular value decomposition (SVD) to determine what we call **latent features**. This is best demonstrated with an example.
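
For reference, the factorization used throughout can be written as follows. This is the standard thin SVD; the symbols $m$, $n$ and $k$ are notation introduced here and do not appear elsewhere in the notebook.

$$
M = U \, \Sigma \, V^{T}, \qquad
M \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times k},\;
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V^{T} \in \mathbb{R}^{k \times n},\;
k = \min(m, n)
$$

The singular values are ordered $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$, so the first few columns of $U$ and rows of $V^{T}$ carry the strongest latent features.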

### Users and movie ratings

Let's look at users' ratings of different movies. Ratings range from 1 to 5; a rating of 0 means the user hasn't watched the movie.

<table style="width:80%">
  <tr>
    <th></th>
    <th>Matrix</th> 
    <th>Alien</th>
    <th>Star Wars</th>
    <th>Casablanca</th>
    <th>Titanic</th>  
  </tr>
  <tr>
    <td>María</td>
    <td align="center">1</td>
    <td align="center">2</td>
    <td align="center">2</td>
    <td align="center">0</td>
    <td align="center">0</td>  
  </tr>
  <tr>
    <td>Tomás</td>
    <td align="center">3</td>
    <td align="center">5</td>
    <td align="center">5</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
  <tr>
    <td>Fernando</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
   <tr>
    <td>Eduardo</td>
    <td align="center">5</td>
    <td align="center">5</td>
    <td align="center">5</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
  <tr>
    <td>Isabela</td>
    <td align="center">0</td>
    <td align="center">2</td>
    <td align="center">0</td>
    <td align="center">4</td>
    <td align="center">4</td>
  </tr>
  <tr>
    <td>Miguel</td>
    <td align="center">0</td>
    <td align="center">0</td>
    <td align="center">0</td>
    <td align="center">5</td>
    <td align="center">5</td>
  </tr>
  <tr>
    <td>Gabriela</td>
    <td align="center">0</td>
    <td align="center">1</td>
    <td align="center">0</td>
    <td align="center">2</td>
    <td align="center">2</td>
  </tr>
</table>
<br>

Note that the first three movies (Matrix, Alien, Star Wars) are sci-fi and the last two (Casablanca, Titanic) are romance. SVD will let us pull these topics out mathematically!

Let's do the computation with Python.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

In [3]:
# Compute the thin SVD, so that M = U @ np.diag(sigma) @ VT
from numpy.linalg import svd
U, sigma, VT = svd(M, full_matrices=False)
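
As a quick sanity check, the three factors can be multiplied back together to recover the ratings. The sketch below repeats the setup so that it runs on its own; `M_rebuilt` is a name introduced here for illustration.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# sigma comes back as a 1-D array, so place it on a diagonal before multiplying
M_rebuilt = U @ np.diag(sigma) @ VT

# Compare with a tolerance because the factors are floating-point values
print(np.allclose(M, M_rebuilt))
```

If the factorization is correct up to floating-point error, the check should report that the two matrices agree.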

## Part 1

Describe in your own words what the matrices contain and how they might be used.

In [1]:
## U matrix
## print the shape and add a one-sentence description of the matrix

In [None]:
## sigma (NumPy returns the singular values as a 1-D array, not a matrix)
## print the shape and add a one-sentence description

In [2]:
## VT matrix
## print the shape and add a one-sentence description of the matrix
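
One possible way to start on Part 1 is to print the shapes and check that the columns of `U` and the rows of `VT` are orthonormal. This is only a sketch; the setup is repeated so the cell runs on its own, and the one-sentence descriptions are still yours to write.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

print("U:    ", U.shape)      # one row per user, one column per latent factor
print("sigma:", sigma.shape)  # one singular value per latent factor
print("VT:   ", VT.shape)     # one row per latent factor, one column per movie

# Columns of U and rows of VT are orthonormal, so both products give the identity
print(np.allclose(U.T @ U, np.eye(U.shape[1])))
print(np.allclose(VT @ VT.T, np.eye(VT.shape[0])))

# Singular values are sorted from largest to smallest
print(np.all(np.diff(sigma) <= 0))
```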

## Part 2

Now we make use of the factorized version of our ratings. The following code rounds the elements of the matrices and prints them for inspection.

In [12]:
# Label rows and columns so the factors are easier to read
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['María','Tomás','Fernando','Eduardo','Isabela','Miguel','Gabriela']

# Round copies for display only; keep U, sigma and VT at full precision for later parts
U_r, sigma_r, VT_r = (np.around(x, 2) for x in (U, sigma, VT))
df_U = pd.DataFrame(U_r, index=users)
df_VT = pd.DataFrame(VT_r, columns=movies)

print(df_U)
print("--------------------------------------")
print(np.diag(sigma_r))
print("--------------------------------------")
print(df_VT)

             0     1     2     3     4
María    -0.21  0.02  0.31  0.26  0.66
Tomás    -0.55  0.06  0.53  0.46 -0.33
Fernando -0.50  0.07 -0.31 -0.20 -0.37
Eduardo  -0.62  0.08 -0.39 -0.24  0.36
Isabela  -0.12 -0.60  0.40 -0.52  0.20
Miguel   -0.04 -0.73 -0.42  0.53 -0.00
Gabriela -0.06 -0.30  0.20 -0.26 -0.40
--------------------------------------
[[13.84  0.    0.    0.    0.  ]
 [ 0.    9.52  0.    0.    0.  ]
 [ 0.    0.    1.69  0.    0.  ]
 [ 0.    0.    0.    1.02  0.  ]
 [ 0.    0.    0.    0.    0.  ]]
--------------------------------------
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70
2   -0.78   0.62      0.03       -0.07    -0.07
3   -0.36  -0.48      0.79        0.05     0.05
4    0.00   0.00     -0.00       -0.71     0.71


QUESTION: In the cell below, describe in your own words how the matrices relate to each other.

## Part 3: Work with only the most representative topics

The goal of this section is to see if we can reasonably reconstruct the original matrix from truncated versions of the three matrices.

In [None]:
## Truncate all three matrices using slicing such that only the top two factors are represented


In [13]:
## print the matrix product of the truncated matrices


In [None]:
## print it again, this time with the DataFrame versions. HINT: you can take a dot product directly on df_U with df_U.dot()
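
A minimal sketch of the truncation with plain NumPy arrays, assuming the top two factors are kept. The names `k`, `U_k`, `sigma_k`, `VT_k`, `M_k` and `err` are introduced here for illustration. The last lines compare the reconstruction error with the singular values that were dropped.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2                 # number of factors (topics) to keep
U_k = U[:, :k]        # users x topics
sigma_k = sigma[:k]   # the k largest singular values
VT_k = VT[:k, :]      # topics x movies

# Rank-k approximation of the ratings matrix
M_k = U_k @ np.diag(sigma_k) @ VT_k
print(np.around(M_k, 1))

# The Frobenius-norm error equals the root of the sum of squares
# of the singular values that were dropped
err = np.linalg.norm(M - M_k)
print(err, np.sqrt(np.sum(sigma[k:] ** 2)))
```

Because the third and fourth singular values are small compared with the first two, the rank-2 product should stay close to the original ratings.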


## Part 4: Make some recommendations

Use cosine similarity to compare all other users to Miguel (using movie profiles).  Which user is closest to Miguel? (use `argsort` for this)

Use cosine similarity to compare all other movies to Star Wars (using user profiles). Which movie is closest to Star Wars? (use `argsort` for this)
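
The two comparisons above could be sketched as follows, working directly on the rows and columns of `M`. The helper `cosine_similarity_to` is a hypothetical name introduced here, not a library function. Cosine similarity ignores vector length, so rows that are scalar multiples of one another score the same; Isabela's row in `M` is exactly twice Gabriela's, so expect a tie between them.

```python
import numpy as np

movies = ["Matrix", "Alien", "StarWars", "Casablanca", "Titanic"]
users = ["María", "Tomás", "Fernando", "Eduardo", "Isabela", "Miguel", "Gabriela"]

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])


def cosine_similarity_to(X, i):
    """Cosine similarity between row i of X and every row of X (hypothetical helper)."""
    norms = np.linalg.norm(X, axis=1)
    return (X @ X[i]) / (norms * norms[i])


# Users compared by their rating rows
user_sims = cosine_similarity_to(M, users.index("Miguel"))
user_order = np.argsort(user_sims)[::-1]  # most similar first
closest_user = [users[j] for j in user_order if users[j] != "Miguel"][0]
print(closest_user, np.around(user_sims, 3))

# Movies compared by their rating columns (rows of the transpose)
movie_sims = cosine_similarity_to(M.T, movies.index("StarWars"))
movie_order = np.argsort(movie_sims)[::-1]
closest_movie = [movies[j] for j in movie_order if movies[j] != "StarWars"][0]
print(closest_movie, np.around(movie_sims, 3))
```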

Rate two of the movies and let's find your recommendations.

In [None]:
## 1. Create a new vector holding your own ratings

## 2. Append your vector to the ratings matrix

## 3. Using cosine similarity, determine which movie should be recommended next

## 4. Find the user whose ratings are most similar to yours and recommend their top-rated movie that you have not seen
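
One way the four steps might look is sketched below. The ratings in `me` (Matrix 5, Alien 4) are made up for illustration, and `cosine_rows`, `item_pick` and `user_pick` are names introduced here. Step 3 is read as an item-based score: each unseen movie is scored by its similarity to the movies you rated, weighted by your ratings.

```python
import numpy as np

movies = ["Matrix", "Alien", "StarWars", "Casablanca", "Titanic"]
users = ["María", "Tomás", "Fernando", "Eduardo", "Isabela", "Miguel", "Gabriela"]

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])


def cosine_rows(X, v):
    """Cosine similarity between vector v and every row of X (hypothetical helper)."""
    return (X @ v) / (np.linalg.norm(X, axis=1) * np.linalg.norm(v))


# 1. Hypothetical new user who has rated only Matrix and Alien
me = np.array([5, 4, 0, 0, 0])

# 2. Append the new ratings as an extra row
M_new = np.vstack([M, me])

# 3. Item-based: movie-to-movie cosine similarities, weighted by my ratings
cols = M_new.T.astype(float)
unit = cols / np.linalg.norm(cols, axis=1, keepdims=True)
movie_sims = unit @ unit.T
scores = movie_sims @ me
scores[me > 0] = -np.inf  # never recommend a movie I already rated
item_pick = movies[int(np.argmax(scores))]

# 4. User-based: nearest existing user, then their best movie that I have not seen
user_sims = cosine_rows(M, me)
neighbour = int(np.argsort(user_sims)[-1])
their_ratings = M[neighbour].astype(float)
their_ratings[me > 0] = -np.inf
user_pick = movies[int(np.argmax(their_ratings))]

print("item-based pick:", item_pick)
print("nearest user:", users[neighbour], "-> recommends", user_pick)
```

With these particular ratings, Fernando and Eduardo have proportional rows and therefore tie as the nearest user; either one leads to the same recommendation.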


### Extra Credit: In the real world

It turns out that comparing a new ratings vector to every other vector can be very slow in practice. Let's make some recommendations again, pretending our matrix has millions of users and thousands of movies. Use the non-truncated versions for this example; in the real world we would use the truncated versions.

In [14]:
## 1. Create a new vector holding your own ratings

## 2. Use the V matrix to determine the 2 most representative loadings

## 3. Use the U matrix to find the user that best represents each loading

## 4. Create a matrix from the ratings matrix with only those users as rows and all columns

## 5. Sum the ratings for each movie into another vector

## 6. Using argsort print the movies you would recommend (omitting the ones already rated)
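
The six steps can be read in more than one way; the sketch below is one interpretation rather than a reference solution. It assumes that "most representative loadings" means the factors with the largest absolute projection of your ratings onto the rows of `VT`, and that the user who "best represents" a factor is the one with the largest absolute entry in the matching column of `U`. The ratings in `me` are made up for illustration.

```python
import numpy as np

movies = ["Matrix", "Alien", "StarWars", "Casablanca", "Titanic"]
users = ["María", "Tomás", "Fernando", "Eduardo", "Isabela", "Miguel", "Gabriela"]

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# 1. Hypothetical new ratings: Matrix 5, Alien 4, nothing else seen
me = np.array([5, 4, 0, 0, 0])

# 2. Project onto the right singular vectors; keep the 2 strongest factors
#    (absolute values are used because the sign of a singular vector is arbitrary)
loadings = VT @ me
top_factors = np.argsort(np.abs(loadings))[::-1][:2]

# 3. For each factor, the user with the largest absolute weight in U
rep_users = [int(np.argmax(np.abs(U[:, f]))) for f in top_factors]

# 4. Keep only those users' rows, with all movie columns
neighbours = M[rep_users, :]

# 5. Sum the ratings for each movie
totals = neighbours.sum(axis=0)

# 6. Rank movies by summed rating, omitting the ones already rated
ranked = [movies[j] for j in np.argsort(totals)[::-1] if me[j] == 0]

print("factors:", top_factors, "users:", [users[i] for i in rep_users])
print("recommend:", ranked)
```

The point of the exercise is the cost: the new vector is compared with a handful of factors instead of every user, and only the few representative users' rows are read from the ratings matrix.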