### NMF

---

### Dimensionality does get reduced!!!

In [None]:
small_R = 3*4  # number of entries in a small 3x4 ratings matrix R
small_R

In [None]:
size_submatrices = 6 + 8  # entries in P (3x2) plus Q (2x4), assuming 2 genres
size_submatrices

In [127]:
size_submatrices / small_R

1.1666666666666667

In [13]:
#imagine R but each axis is 100x greater
big_R = 300*400

In [14]:
big_R 

120000

In [128]:
#size of P and Q combined assuming 2 genres
big_P = 300*2
big_Q = 400*2
size_submatrices = big_P + big_Q

In [129]:
size_submatrices / big_R

0.011666666666666667
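The arithmetic above can be wrapped in a small helper. This is only a sketch: `compression_ratio` and its parameter names are made up for illustration and are not part of any library.

```python
def compression_ratio(n_users, n_films, n_components):
    """Entries stored in P and Q, divided by entries stored in the full matrix R."""
    full_size = n_users * n_films                                    # size of R
    factor_size = n_users * n_components + n_components * n_films    # size of P + size of Q
    return factor_size / full_size


small = compression_ratio(3, 4, 2)      # tiny matrix: the factors are bigger than R itself
big = compression_ratio(300, 400, 2)    # large matrix: the factors are a tiny fraction of R
print(small, big)
```

A ratio above 1 means no saving at all, which is why the reduction only shows up once R gets large relative to the number of genres.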

### NMF 

---

#### Imports

In [18]:
import numpy as np
from sklearn.decomposition import NMF
import pandas as pd

In [19]:
films = ['Titanic', 'Tiffany', 'Terminator', 'Star Trek', 'Star Wars']
users = ['ada', 'bob','steve','margaret']

In [20]:
ada = [5,4,1,1,np.nan]
bob = [3,2,1,np.nan,1]
steve = [np.nan,np.nan,np.nan,np.nan,5]
margaret = [1,1,5,4,4]
data = np.concatenate((ada,bob,steve,margaret), axis=0).reshape(-1,5)

In [21]:
R = pd.DataFrame(data, columns=films, index=users)

In [22]:
R

Unnamed: 0,Titanic,Tiffany,Terminator,Star Trek,Star Wars
ada,5.0,4.0,1.0,1.0,
bob,3.0,2.0,1.0,,1.0
steve,,,,,5.0
margaret,1.0,1.0,5.0,4.0,4.0


---

## New steps

### Handle missing data
* ZEROS? - NOT GOOD TO USE (a zero reads as a terrible rating)!!
* Average / Median - EASY, GOOD, QUICK
* Use an imputer (e.g. KNN)
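A rough sketch of two of the options listed above, run on the same toy ratings. Assumptions: a per-film median is used here (the notebook itself fills with a single overall value), and `KNNImputer` from `sklearn.impute` with `n_neighbors=2` is an arbitrary pick for such a tiny matrix, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

films = ['Titanic', 'Tiffany', 'Terminator', 'Star Trek', 'Star Wars']
users = ['ada', 'bob', 'steve', 'margaret']
raw = pd.DataFrame(
    [[5, 4, 1, 1, np.nan],
     [3, 2, 1, np.nan, 1],
     [np.nan, np.nan, np.nan, np.nan, 5],
     [1, 1, 5, 4, 4]],
    columns=films, index=users, dtype=float)

# Option 1: fill each film (column) with its own median rating
by_film = raw.fillna(raw.median())

# Option 2: KNN imputation - fill a gap with the average rating given by
# the most similar users who did rate that film
knn = KNNImputer(n_neighbors=2)
by_knn = pd.DataFrame(knn.fit_transform(raw), columns=films, index=users)

print(by_film)
print(by_knn)
```

Either way the filled-in values are guesses, so it is worth checking how much the final recommendations change between strategies.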

In [27]:
med_values = R.median().median()  # median of the per-film medians (2.5 here)

In [28]:
R.fillna(med_values,inplace=True)

In [29]:
R

Unnamed: 0,Titanic,Tiffany,Terminator,Star Trek,Star Wars
ada,5.0,4.0,1.0,1.0,2.5
bob,3.0,2.0,1.0,2.5,1.0
steve,2.5,2.5,2.5,2.5,5.0
margaret,1.0,1.0,5.0,4.0,4.0


---

### Train NMF
* Small n_components = trains fast, might underfit, lots of dimensionality reduction
* High n_components = trains slow, might overfit, not much dimensionality reduction
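One way to get a feel for this trade-off is to fit several values of `n_components` and compare the reconstruction error. A minimal sketch on the median-filled toy matrix; `init="random"`, `random_state=0` and `max_iter=2000` are assumptions chosen for repeatability, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# the toy ratings matrix after filling gaps with 2.5
R = np.array([[5.0, 4.0, 1.0, 1.0, 2.5],
              [3.0, 2.0, 1.0, 2.5, 1.0],
              [2.5, 2.5, 2.5, 2.5, 5.0],
              [1.0, 1.0, 5.0, 4.0, 4.0]])

errors = {}
for k in (1, 2, 3):
    model = NMF(n_components=k, init="random", random_state=0, max_iter=2000)
    model.fit(R)
    errors[k] = model.reconstruction_err_   # error on the training matrix only

print(errors)
```

A lower training error with more components does not by itself mean better recommendations; that would need held-out ratings to check.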

In [45]:
m = NMF(n_components=2)

In [46]:
m.fit(R)

NMF(n_components=2)

### Check out the sub-matrices, and the reconstruction error

In [47]:
Q = m.components_
P = m.transform(R)
error = m.reconstruction_err_ #this is an absolute score, so no intuition from looking at it in isolation!
P.shape, Q.shape, error

((4, 2), (2, 5), 2.271418780809492)
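To make the error easier to read, it can be put on a relative scale. A sketch assuming the default Frobenius loss, where `reconstruction_err_` is the Frobenius norm of `R - P·Q`; dividing by the norm of `R` is one possible normalisation, not something NMF reports itself. `init="nndsvda"` and `random_state=0` are assumptions for repeatability.

```python
import numpy as np
from sklearn.decomposition import NMF

R = np.array([[5.0, 4.0, 1.0, 1.0, 2.5],
              [3.0, 2.0, 1.0, 2.5, 1.0],
              [2.5, 2.5, 2.5, 2.5, 5.0],
              [1.0, 1.0, 5.0, 4.0, 4.0]])

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
P = model.fit_transform(R)   # user-genre matrix from the fit itself
Q = model.components_        # genre-film matrix

abs_err = np.linalg.norm(R - P @ Q)      # Frobenius norm of the residual
rel_err = abs_err / np.linalg.norm(R)    # fraction of the "size" of R left unexplained

print(abs_err, model.reconstruction_err_, rel_err)
```

Here `P` comes from `fit_transform` rather than a separate `transform` call, so that it is the same matrix the model used when it stored its error.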

### Reconstruct the original matrix - not necessary for movie predictions!! (predicting for a new user requires that user's own input)

In [52]:
R

Unnamed: 0,Titanic,Tiffany,Terminator,Star Trek,Star Wars
ada,5.0,4.0,1.0,1.0,2.5
bob,3.0,2.0,1.0,2.5,1.0
steve,2.5,2.5,2.5,2.5,5.0
margaret,1.0,1.0,5.0,4.0,4.0


In [51]:
new_R = np.dot(P,Q)
pd.DataFrame(new_R.round(1), columns=films, index=users)

Unnamed: 0,Titanic,Tiffany,Terminator,Star Trek,Star Wars
ada,5.0,3.9,0.7,1.4,2.5
bob,2.7,2.2,1.1,1.3,2.0
steve,2.9,2.5,3.1,3.0,3.9
margaret,0.8,1.0,4.6,3.9,4.5


---

### Make a prediction based on new user input

In [53]:
m

NMF(n_components=2)

In [54]:
films

['Titanic', 'Tiffany', 'Terminator', 'Star Trek', 'Star Wars']

In [130]:
new_user_input = pd.Series([5,4,np.nan, np.nan, np.nan])
new_user_input

0    5.0
1    4.0
2    NaN
3    NaN
4    NaN
dtype: float64

In [116]:
#Fill missing data
new_user_input = new_user_input.fillna(med_values)

In [122]:
# make sure the new input has >1 dimension & has as many columns as there are films!
new_user_input = np.array(new_user_input).reshape(1,5)

In [123]:
#Prediction step 1 - generate a user_P for the new user
user_P = m.transform(new_user_input)

In [124]:
#new user R - reconstruct R but for this new user only
user_R = np.dot(user_P,Q)

In [120]:
user_R #impute with median - order of recommendations is the same as below

array([[4.78138307, 3.85620501, 1.82056731, 2.26895243, 3.46890806]])

In [125]:
user_R #impute with zero - order of recommendations is the same as above

array([[4.56393556, 3.54896857, 0.        , 0.75729632, 1.73705746]])

In [99]:
# I have a list of predicted films!! BUT my user has already seen 2 films
#remove films that are already seen, and return a zip of film title and rating, sorted by highest rating
user_R = user_R[0][2:]

In [100]:
recommendations = list(zip(user_R,films[2:]))

In [101]:
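no_of_films = 1
# sort ascending by predicted rating; [-1] picks the best (rating, title) pair, then [-1] takes its title
# note: with no_of_films > 1 this returns the n-th best title, not the top n
top_pick = sorted(recommendations, key = lambda x: x[0])[-no_of_films][-1]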
top_pick

'Star Wars'
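The slicing above only works because the two seen films happen to be the first two columns. A slightly more general sketch is below; `recommend` and `seen_mask` are hypothetical names introduced here, and the predicted ratings are copied from the median-imputed output earlier in this section.

```python
import numpy as np


def recommend(predicted, titles, seen_mask, n=1):
    """Return the n highest-rated titles the user has not seen yet."""
    predicted = np.asarray(predicted, dtype=float).ravel()   # accept shape (1, n_films) too
    candidates = [(score, title)
                  for score, title, seen in zip(predicted, titles, seen_mask)
                  if not seen]
    candidates.sort(key=lambda pair: pair[0], reverse=True)  # best rating first
    return [title for _, title in candidates[:n]]


films = ['Titanic', 'Tiffany', 'Terminator', 'Star Trek', 'Star Wars']
user_R = np.array([[4.78138307, 3.85620501, 1.82056731, 2.26895243, 3.46890806]])
seen = [True, True, False, False, False]   # the new user rated the first two films

top_two = recommend(user_R, films, seen, n=2)
print(top_two)
```

Using a mask instead of a slice keeps this working when the seen films are scattered across the columns.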

---

# Next steps 
* download the MovieLens 100k dataset - all the data you'll need is in `ratings.csv`
* work out a way to create a matrix with rows=users, columns=movies, values in the matrix=the rating that user gave that movie (e.g. the first column holds every user's rating for movie_id=1, etc)
* FOLLOW the steps in this notebook to create a trained NMF able to make predictions
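A possible starting point for the reshaping step, shown on a tiny made-up frame rather than the real file. The column names `userId`, `movieId` and `rating` are an assumption about `ratings.csv`; check the actual header before relying on them.

```python
import numpy as np
import pandas as pd

# stand-in for pd.read_csv('ratings.csv'); the values are invented for illustration
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# rows = users, columns = movies, values = ratings; unrated pairs become NaN
R = ratings.pivot(index="userId", columns="movieId", values="rating")
print(R)
```

The NaNs then need handling exactly as in the "Handle missing data" step before the matrix can be passed to NMF.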