# Part II: Limitation(s) of sklearn's Non-Negative Matrix Factorization Library

### 1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]



We can now look at sklearn's non-negative matrix factorization (NMF) library in a different context: movie ratings. We will use the data from a previous assignment.

In [23]:
#General libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import NMF
from sklearn.metrics import root_mean_squared_error



In [2]:
user_data = pd.read_csv('users.csv')
movie_data = pd.read_csv('movies.csv')
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [4]:
display(user_data.head())
display(movie_data.head())
display(train_data.head())
display(test_data.head())

Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


Unnamed: 0,mID,title,year,Doc,Com,Hor,Adv,Wes,Dra,Ani,...,Chi,Cri,Thr,Sci,Mys,Rom,Fil,Fan,Act,Mus
0,1,Toy Story,1995,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,3,Grumpier Old Men,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


Unnamed: 0,uID,mID,rating
0,2233,440,4
1,4274,587,5
2,2498,454,3
3,2868,2336,5
4,1636,2686,5


The goal for this task is to use NMF to predict the missing ratings from the test data. 

In [14]:
#Number of users and movies (unique)
unique_users = train_data['uID'].unique()
unique_movies = train_data['mID'].unique()

#Map from original user and movie IDs - this will connect to the test data.
user_mapping = {uid: x for x, uid in enumerate(unique_users)}
movie_mapping = {mid: y for y, mid in enumerate(unique_movies)}

#apply the mapping to the train and test datasets (test IDs not seen in training become None/NaN)
train_data['uID'] = train_data['uID'].apply(lambda uID: user_mapping[uID])
train_data['mID'] = train_data['mID'].apply(lambda mID: movie_mapping[mID])
test_data['uID'] = test_data['uID'].apply(lambda uID: user_mapping.get(uID))
test_data['mID'] = test_data['mID'].apply(lambda mID: movie_mapping.get(mID))

#create a matrix of size #users x #movies
rating_matrix = np.zeros((len(unique_users), len(unique_movies)))

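#fill in the known ratings with one vectorized assignment instead of a row-by-row loop
rating_matrix[train_data['uID'].to_numpy(), train_data['mID'].to_numpy()] = train_data['rating'].to_numpy()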



In [15]:
pd.DataFrame(rating_matrix).iloc[:5, :5]

Unnamed: 0,0,1,2,3,4
0,5.0,0.0,0.0,0.0,0.0
1,0.0,4.0,0.0,0.0,0.0
2,4.0,0.0,5.0,4.0,0.0
3,0.0,0.0,0.0,2.0,0.0
4,3.0,3.0,0.0,3.0,5.0
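
Every user-movie pair without a training rating is stored as 0 in this matrix, so it helps to check how much of it is placeholder zeros. The sketch below does this on the 5x5 excerpt shown above only; `zero_fraction` is a hypothetical helper name, and the same call could be applied to the full `rating_matrix`.

```python
import numpy as np

# The 5x5 excerpt of rating_matrix displayed above
excerpt = np.array([
    [5.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0, 0.0],
    [4.0, 0.0, 5.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
    [3.0, 3.0, 0.0, 3.0, 5.0],
])

def zero_fraction(matrix):
    """Return the share of entries equal to 0 (hypothetical helper)."""
    return float(np.mean(matrix == 0))

print(zero_fraction(excerpt))  # 0.6 (15 of the 25 entries are zero)
```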


In [20]:
#train NMF model
features = movie_data.shape[1] - 1 #number of latent components: column count of movie_data minus one (a heuristic choice, not a tuned value)
movie_nmf = NMF(
    n_components = features,
    random_state = 42,
    max_iter = 500
)

W = movie_nmf.fit_transform(rating_matrix)
H = movie_nmf.components_

#get predictions by taking the dot product of W and H
pred_rating_matrix = np.dot(W, H)
pd.DataFrame(pred_rating_matrix).iloc[:5, :5]

Unnamed: 0,0,1,2,3,4
0,4.544563,2.145763,0.157586,0.231989,0.184986
1,0.779795,0.665149,0.03416,0.098775,0.0
2,3.320221,2.68587,2.939453,1.956289,1.294313
3,3.04355,0.979057,0.17008,1.601921,0.0
4,2.457001,2.103316,2.272339,2.325669,0.68506


In [24]:
#get true rating values for the test data
y_true = test_data['rating'].values

#get predicted ratings
def predicted_ratings(uID, mID, predicted_ratings_matrix):
    #if both uID and mID were seen in training (not NaN), look up the predicted rating
    if not np.isnan(uID) and not np.isnan(mID):
        return predicted_ratings_matrix[int(uID), int(mID)]
    else:
        #user or movie not seen in training: fall back to a prediction of 0
        return 0

y_pred = test_data.apply(lambda row: predicted_ratings(row['uID'], row['mID'], pred_rating_matrix), axis=1)

# Calculate the RMSE
rmse = root_mean_squared_error(y_true, y_pred)
round(rmse,3)

np.float64(2.861)

### 2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to the simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]



Now that we have predicted ratings on the test set, we can see that the model achieved an RMSE of about 2.861. RMSE is the square root of the mean squared difference between the true and predicted ratings, so lower values mean more accurate predictions. In this case the error is high: keeping in mind that ratings only go up to 5, an RMSE of 2.861 could be the difference between rating a movie as "good" and rating it as "bad".
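
To make the metric concrete, here is a minimal worked example of RMSE next to the plain mean absolute difference. The true ratings are the five shown in the test data preview; the predicted values are hypothetical numbers chosen for illustration, not output from the model.

```python
import numpy as np

# True ratings from the first five rows of the test data preview
y_true_example = np.array([4, 5, 3, 5, 5], dtype=float)
# Hypothetical predictions, chosen only to illustrate the arithmetic
y_pred_example = np.array([1, 2, 1, 2, 2], dtype=float)

errors = y_true_example - y_pred_example    # [3, 3, 2, 3, 3]
rmse_example = np.sqrt(np.mean(errors ** 2))  # sqrt((9+9+4+9+9)/5) = sqrt(8)
mae_example = np.mean(np.abs(errors))         # (3+3+2+3+3)/5 = 2.8

print(round(float(rmse_example), 3))  # 2.828
print(round(float(mae_example), 3))   # 2.8
```

Because the errors are squared before averaging, RMSE is never smaller than the mean absolute difference and weights large misses more heavily.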

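Based on this result, it is evident that the NMF library did not match the simple baseline or similarity-based methods from Module 3. The most likely reason is how missing ratings are handled. The rating matrix is initialized with zeros, so every user-movie pair without a training rating is stored as 0, and sklearn's NMF has no way to mark an entry as missing. It fits those zeros as if they were real ratings, which pulls the reconstructed values toward 0; the predicted matrix above shows many entries well below 1. The test pairs should not appear in the training data, so they are among these zero-filled entries and their predictions come out systematically too low. Baseline and similarity-based methods likely avoid this problem because they are computed from the observed ratings only.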

NMF, as used here, is not ideal for filling in missing values. It takes a matrix and creates two non-negative matrices whose product approximates the initial matrix, so every entry counts toward the fit, including the placeholder zeros. This allows a large amount of error to enter the model. Decreasing the RMSE may be possible in some ways, although tuning alone would not be expected to reduce it by much. Hyperparameter tuning, such as changing the number of components, could reduce the RMSE somewhat and improve model performance. In some cases (perhaps not this one), dimensionality reduction techniques might be of interest.
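
Because unrated pairs are stored as 0 and NMF fits them like real ratings, one commonly suggested fix is to replace the zeros with a neutral value, such as the mean observed rating, before factorizing. The sketch below compares the two fills on small synthetic data. It is an illustration under stated assumptions: the data are randomly generated rather than the homework ratings, `heldout_rmse` and the other names are hypothetical, and the result says nothing about the exact RMSE the real data would reach.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic low-rank "true" ratings clipped to the 1..5 range (not the homework data)
n_users, n_movies, rank = 60, 40, 3
true_ratings = np.clip(
    rng.uniform(0.5, 1.5, (n_users, rank)) @ rng.uniform(0.5, 1.5, (rank, n_movies)),
    1, 5,
)

# Observe roughly 20% of the entries for training; the rest are held out
observed = rng.random((n_users, n_movies)) < 0.2

def heldout_rmse(filled):
    """Fit NMF on a fully filled matrix and score only the held-out entries."""
    model = NMF(n_components=rank, init="nndsvda", random_state=42, max_iter=500)
    W = model.fit_transform(filled)
    pred = W @ model.components_
    err = pred[~observed] - true_ratings[~observed]
    return float(np.sqrt(np.mean(err ** 2)))

# Fill unobserved entries with 0 (as in the notebook) or with the mean observed rating
zero_filled = np.where(observed, true_ratings, 0.0)
mean_filled = np.where(observed, true_ratings, true_ratings[observed].mean())

rmse_zero = heldout_rmse(zero_filled)
rmse_mean = heldout_rmse(mean_filled)
print(f"zero-filled: {rmse_zero:.3f}, mean-filled: {rmse_mean:.3f}")
```

On the real data the same idea would mean building `rating_matrix` from a mean rating instead of `np.zeros`; per-user or per-movie means are other reasonable choices, and a factorization that only scores observed entries would avoid the imputation step altogether.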

