## SVD for topic analysis

We can use SVD to determine what we call ***latent features***. This will be best demonstrated with an example.

### Example

Let's look at users' ratings of different movies. Ratings range from 1 to 5; a rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, StarWars) are Sci-fi movies and the last two (Casablanca, Titanic) are Romance. We will be able to mathematically pull out these topics!

Let's do the computation with Python.

In [2]:
# !pip install pandas



In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

In [7]:
# Compute SVD
from numpy.linalg import svd
U, sigma, VT = svd(M,full_matrices=False)
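
As a quick sanity check on the factorization, the three factors should multiply back to `M`. The sketch below repeats the matrix so that it runs on its own; the name `M_rebuilt` is only illustrative and is not used elsewhere in this notebook.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Reduced SVD: U is 7x5, sigma has length 5, VT is 5x5
U, sigma, VT = np.linalg.svd(M, full_matrices=False)
print(U.shape, sigma.shape, VT.shape)

# Multiplying the factors back together should recover M up to floating-point error
M_rebuilt = U @ np.diag(sigma) @ VT
print(np.allclose(M, M_rebuilt))
```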

In [78]:
print(U)

[[-0.21  0.02  0.31  0.26 -0.28]
 [-0.55  0.06  0.53  0.46  0.14]
 [-0.5   0.07 -0.31 -0.2   0.73]
 [-0.62  0.08 -0.39 -0.24 -0.61]
 [-0.12 -0.6   0.4  -0.52 -0.  ]
 [-0.04 -0.73 -0.42  0.53 -0.  ]
 [-0.06 -0.3   0.2  -0.26 -0.  ]]


In [79]:
print(sigma)

[13.84  9.52  1.69  1.02  0.  ]


In [80]:
print(VT)

[[-0.5  -0.62 -0.6  -0.06 -0.06]
 [ 0.09 -0.05  0.11 -0.7  -0.7 ]
 [-0.78  0.62  0.03 -0.07 -0.07]
 [-0.36 -0.48  0.79  0.05  0.05]
 [ 0.   -0.    0.   -0.71  0.71]]


## Part 1

Describe in your own words what each matrix contains and how it might be used.

In [17]:
## U matrix
## print the shape and add your description
print('`U` is a {nrows}x{ncols} matrix. Its {nrows} rows correspond to the {nrows} rows (users) of `M`, and its {ncols} columns are the left singular vectors, one per discovered "topic" in `sigma`. Each value is the weight (loading) of a user on a topic, not a covariance.'.format(
    nrows=U.shape[0],
    ncols=U.shape[1]
))


`U` is a 7x5 matrix. Its 7 rows correspond to the 7 rows (users) of `M`, and its 5 columns are the left singular vectors, one per discovered "topic" in `sigma`. Each value is the weight (loading) of a user on a topic, not a covariance.


In [19]:
## sigma matrix
## print the shape and add your description
print('`sigma` is a vector of length {size} holding the singular values, i.e. the diagonal of the {size}x{size} topic matrix, sorted in descending order. Each value is the strength or weight of one topic linking the rows of `M` (users) to its columns (movies).'.format(
    size=sigma.shape[0]
))


`sigma` is a vector of length 5 holding the singular values, i.e. the diagonal of the 5x5 topic matrix, sorted in descending order. Each value is the strength or weight of one topic linking the rows of `M` (users) to its columns (movies).


In [20]:
## VT matrix
## print the shape and add your description
print('`VT` is a {nrows}x{ncols} matrix. Its {nrows} rows are the right singular vectors, one per discovered "topic" in `sigma`, and its {ncols} columns correspond to the {ncols} columns (movies) of `M`. Each value is the weight (loading) of a movie on a topic, not a covariance.'.format(
    nrows=VT.shape[0],
    ncols=VT.shape[1]
))

`VT` is a 5x5 matrix. Its 5 rows are the right singular vectors, one per discovered "topic" in `sigma`, and its 5 columns correspond to the 5 columns (movies) of `M`. Each value is the weight (loading) of a movie on a topic, not a covariance.
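
The entries of `U` and `VT` are components of orthonormal singular vectors rather than covariances or correlations in the statistical sense. A minimal sketch that checks this property directly, repeating the matrix so the block runs on its own:

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# Columns of U are orthonormal: U^T U is the 5x5 identity
u_orthonormal = np.allclose(U.T @ U, np.eye(5))

# Rows of VT are orthonormal: VT VT^T is the 5x5 identity
vt_orthonormal = np.allclose(VT @ VT.T, np.eye(5))

# Singular values are non-negative and returned in descending order
sigma_sorted = bool(np.all(np.diff(sigma) <= 0))

print(u_orthonormal, vt_orthonormal, sigma_sorted)
```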


## Part 2

Making use of the factorized version of our ratings

In [21]:
# Make interpretable
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

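# Round for display. Note that this overwrites U, sigma and VT,
# so every later cell works with the rounded values.
U, sigma, VT = (np.around(x, 2) for x in (U, sigma, VT))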
df_U = pd.DataFrame(U, index=users)
df_VT = pd.DataFrame(VT, columns=movies)

print(df_U)
print("--------------------------------------")
print(np.diag(sigma))
print("--------------------------------------")
print(df_VT)

          0     1     2     3     4
Alice -0.21  0.02  0.31  0.26 -0.28
Bob   -0.55  0.06  0.53  0.46  0.14
Cindy -0.50  0.07 -0.31 -0.20  0.73
Dan   -0.62  0.08 -0.39 -0.24 -0.61
Emily -0.12 -0.60  0.40 -0.52 -0.00
Frank -0.04 -0.73 -0.42  0.53 -0.00
Greg  -0.06 -0.30  0.20 -0.26 -0.00
--------------------------------------
[[13.84  0.    0.    0.    0.  ]
 [ 0.    9.52  0.    0.    0.  ]
 [ 0.    0.    1.69  0.    0.  ]
 [ 0.    0.    0.    1.02  0.  ]
 [ 0.    0.    0.    0.    0.  ]]
--------------------------------------
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70
2   -0.78   0.62      0.03       -0.07    -0.07
3   -0.36  -0.48      0.79        0.05     0.05
4    0.00  -0.00      0.00       -0.71     0.71


The central matrix holds the `sigma` values on its diagonal: the strength or weight of each discovered "topic".

The top matrix (`U`) holds one row per user and one column per topic. Each entry is that user's loading on the topic, i.e. how strongly the user's ratings align with it.

The bottom matrix (`VT`) holds one row per topic and one column per movie. Each entry is that movie's loading on the topic, i.e. how strongly the movie's ratings align with it.

## Trim the matrices to represent a factorization from only the top two factors

In [52]:
# Indices of the two largest singular values, largest first
top2indices = np.argsort(sigma)[::-1][:2]
sigma, top2indices, sigma[top2indices]

(array([13.84,  9.52,  1.69,  1.02,  0.  ]),
 array([0, 1]),
 array([13.84,  9.52]))

In [53]:
U_trimmed = U[:, top2indices]
U_trimmed

array([[-0.21,  0.02],
       [-0.55,  0.06],
       [-0.5 ,  0.07],
       [-0.62,  0.08],
       [-0.12, -0.6 ],
       [-0.04, -0.73],
       [-0.06, -0.3 ]])

In [54]:
VT_trimmed = VT[top2indices, :]
VT_trimmed

array([[-0.5 , -0.62, -0.6 , -0.06, -0.06],
       [ 0.09, -0.05,  0.11, -0.7 , -0.7 ]])

In [58]:
sigma_trimmed = sigma[top2indices]
sigma_trimmed

array([13.84,  9.52])
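
Because `numpy.linalg.svd` returns the singular values in descending order, the same trimming can also be done with plain slicing. The helper below is a hypothetical convenience function (`truncate_svd` is not part of NumPy or of this notebook), sketched under that assumption:

```python
import numpy as np

def truncate_svd(U, sigma, VT, k):
    """Keep only the k strongest factors (hypothetical helper).

    Assumes sigma is already sorted in descending order, which is
    how numpy.linalg.svd returns it.
    """
    return U[:, :k], sigma[:k], VT[:k, :]

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
U, sigma, VT = np.linalg.svd(M, full_matrices=False)

U_k, sigma_k, VT_k = truncate_svd(U, sigma, VT, k=2)
print(U_k.shape, sigma_k.shape, VT_k.shape)
```

Slicing avoids the index bookkeeping entirely, at the cost of relying on the ordering guarantee.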

## Part 3: Does your approximate version of the matrix still reasonably reconstruct the original?

In [81]:
# Use this code but swap in your matrices
np.around(U_trimmed.dot(np.diag(sigma_trimmed)).dot(VT_trimmed), 2)

array([[ 1.47,  1.79,  1.76,  0.04,  0.04],
       [ 3.86,  4.69,  4.63,  0.06,  0.06],
       [ 3.52,  4.26,  4.23, -0.05, -0.05],
       [ 4.36,  5.28,  5.23, -0.02, -0.02],
       [ 0.32,  1.32,  0.37,  4.1 ,  4.1 ],
       [-0.35,  0.69, -0.43,  4.9 ,  4.9 ],
       [ 0.16,  0.66,  0.18,  2.05,  2.05]])
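
One way to put a number on "reasonably reconstructs" is the Frobenius norm of the difference between `M` and its rank-2 approximation. For a truncated SVD this error equals the square root of the sum of the squared singular values that were dropped. The sketch below uses the unrounded factors and illustrative variable names (`M_k`, `abs_error`, `predicted_error`, `rel_error`):

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]

# np.linalg.norm on a 2-D array defaults to the Frobenius norm
abs_error = np.linalg.norm(M - M_k)

# The error should match the energy in the discarded singular values
predicted_error = np.sqrt(np.sum(sigma[k:] ** 2))

# Error relative to the size of the original matrix
rel_error = abs_error / np.linalg.norm(M)

print(abs_error, predicted_error, rel_error)
```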

## Part 4: Make some recommendations

Use cosine similarity to compare all other users to Alice (using movie profiles)

In [60]:
def cosine_similarity(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

In [88]:
alice = U_trimmed[0]
[cosine_similarity(user, alice) for user in U_trimmed]

# alice = U[0]
# [cosine_similarity(user, alice) for user in U]

[1.0000000000000002,
 0.999906026146549,
 0.9990258014702169,
 0.9994432224561701,
 0.10226476052188307,
 -0.0402010900614791,
 0.10226476052188307]

Use cosine similarity to compare all other movies to StarWars (using user profiles)

In [76]:
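# Trimmed (rank-2) movie profiles. This result is not shown, because a
# notebook cell only displays the value of its last expression.
starwars = VT_trimmed.T[2]
[cosine_similarity(movie, starwars) for movie in VT_trimmed.T]

# Full VT. Its columns are orthonormal (up to rounding), so every movie
# other than StarWars scores close to 0 in the output below.
starwars = VT.T[2]
[cosine_similarity(movie, starwars) for movie in VT.T]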

[0.002107164624015895,
 0.005903557758960923,
 1.0,
 -0.003596073056726072,
 -0.003596073056726072]
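
The near-zero values above come from comparing columns of the full `VT`, which are mutually orthogonal. A sketch of the same comparison restricted to the top two topics, where the genre structure is visible, is given below. It recomputes the SVD from scratch so it runs on its own; the names `movie_profiles` and `sims` are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

movies = ['Matrix', 'Alien', 'StarWars', 'Casablanca', 'Titanic']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# One 2-D profile per movie: its loadings on the two strongest topics
movie_profiles = VT[:2, :].T
starwars = movie_profiles[movies.index('StarWars')]

sims = {name: cosine_similarity(profile, starwars)
        for name, profile in zip(movies, movie_profiles)}
for name, sim in sims.items():
    print(f"{name:>10}: {sim: .3f}")
```

In this reduced space the other Sci-fi movies should score close to 1 against StarWars, while the two Romance movies score much lower.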

Provide a new vector of ratings and determine which existing user is closest

In [74]:
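# Caution: the rows of `U` are topic loadings, while `collin` holds raw
# movie ratings. The two only line up here because both have length 5,
# so these similarities do not compare like with like.
collin = [5, 2, 4, 2, 3]
[cosine_similarity(user, collin) for user in U]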

[-0.022074476836669934,
 0.12048538174604846,
 -0.24725899206164817,
 -0.9067049412787263,
 -0.1814945985386083,
 -0.30061372024418437,
 -0.1814945985386083]
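
The cell above compares raw movie ratings against rows of `U`, which live in topic space rather than movie space. One common alternative, sketched here as an assumption rather than as the notebook's intended answer, is to first "fold in" the new ratings vector: since `M = U Σ VT`, a row of ratings `r` maps to topic space as `r · V · Σ⁻¹`. The names `collin_latent`, `user_profiles`, and `similarities` are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

users = ['Alice', 'Bob', 'Cindy', 'Dan', 'Emily', 'Frank', 'Greg']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2
user_profiles = U[:, :k]

# Fold the new ratings into the k-dimensional topic space
collin = np.array([5, 2, 4, 2, 3])
collin_latent = collin @ VT[:k, :].T / sigma[:k]

# Sanity check: folding in the existing users recovers their rows of U
existing_ok = np.allclose(M @ VT[:k, :].T / sigma[:k], user_profiles)

similarities = {name: cosine_similarity(profile, collin_latent)
                for name, profile in zip(users, user_profiles)}
closest = max(similarities, key=similarities.get)
print(existing_ok, closest)
for name, sim in similarities.items():
    print(f"{name:>6}: {sim: .3f}")
```

Comparing in topic space keeps the new user and the existing users in the same coordinate system, so the similarities are meaningful regardless of how many movies there are.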