# Evaluating Non-negative Matrix Factorization
The goal of this project is to leverage sklearn's NMF method for a movie recommender and evaluate it's performance againt the methods observed during class. We'll build the following recommender models:
1. Raw NMF - No treatment for missing ratings
2. Replace missing ratings with avg. score 3
3. Replace missing ratings with user's avg. score

author: Bruno Velleca, repository link: [CSCA5632](https://github.com/brucamail/MSCS-Machine-Learning/tree/main/CSCA5632)

In [10]:
#importing relevant libraries

import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import pdist, squareform
from collections import namedtuple

In [7]:
MV_users = pd.read_csv('users.csv')
MV_movies = pd.read_csv('movies.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

The table below presents the Root Mean Squared Error (RMSE) for the recommendation systems built in a previous assignment. We wil compare NMF's RMSE when used as a recommender system to rank it against these systems.

In [4]:
method_scores = {
'Baseline Yp=3': 1.259 ,
'Baseline Yp=mean': 1.035,
'Content based, item-item': 1.013 ,
'Collaborative, cosine': 1.026,
'Collaborative, jaccard, Mr>=3': 0.982 ,
'Collaborative, jaccard, Mr>=1': 0.991,
'Collaborative, jaccard, Mr': 0.959
}

df = pd.DataFrame(list(method_scores.items()), columns=['Method', 'RMSE'])
df

Unnamed: 0,Method,RMSE
0,Baseline Yp=3,1.259
1,Baseline Yp=mean,1.035
2,"Content based, item-item",1.013
3,"Collaborative, cosine",1.026
4,"Collaborative, jaccard, Mr>=3",0.982
5,"Collaborative, jaccard, Mr>=1",0.991
6,"Collaborative, jaccard, Mr",0.959


## Building the Non-Negative Matrix Function
We are leveraging the `class RecSys` from the assignment of recommender system and changed it to have a new method called `nmf_predict` with a parameter called `method` with 3 options:


*   `raw` - no treatment to movies that have not been rated by a user
*   `three` - transforms all unrated movies to 3
*   `user_avg` - calculates the user average and replaces all unrated movies by the user's average.

We'll predict ratings and calculate the RMSE for the 3 different methods.
Before calculating the scores for all the data, I used samples to validate the approaches. This step has been removed from the notebook to keep it shorter.



In [29]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))

    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID]
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)

        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())

    #method types: raw, three, usr_avg
    def nmf_predict(self, method='raw'):
      ratings_matrix = self.Mr.copy()
      nmf_model = NMF(n_components=100, init='random', random_state=0, max_iter=500)

      if method=='three':
        ratings_matrix[ratings_matrix == 0] = 3

      elif method == 'usr_avg':
        user_avg_ratings = {} #dict for the user ratings
        for uid in self.allusers:
            user_ratings = self.Mr[self.uid2idx[uid], :]  # getting the ratings
            rated_movies = user_ratings[user_ratings > 0]  # filtering zeros
            user_avg_ratings[uid] = np.mean(rated_movies) if rated_movies.size else 3
            ratings_matrix[ratings_matrix == 0] = user_avg_ratings[uid]

      else: pass

      W = nmf_model.fit_transform(ratings_matrix)
      H = nmf_model.components_
      yp = np.dot(W,H)

      return yp

    def rmse(self,yp):
      # yp[np.isnan(yp)]=3 #In case there is nan values in prediction, it will impute to 3.
      # yt=np.array(self.data.test.rating)
      # return np.sqrt(((yt-yp)**2).mean())

      predicted_ratings = []

      # Iterate over the test set.
      for index, row in self.data.test.iterrows():
          uid = row['uID']
          mid = row['mID']
          # Get the indices of the user and movie in the prediction matrix.
          u_idx = self.uid2idx.get(uid)
          m_idx = self.mid2idx.get(mid)
          # If both user and movie are found in the prediction matrix,
          # append the predicted rating to the list.
          if u_idx is not None and m_idx is not None:
            predicted_ratings.append(yp[u_idx, m_idx])
          # If either user or movie is not found, append 3 as the default rating.
          else:
            predicted_ratings.append(3)

      # Convert the predicted ratings list to a NumPy array.
      predicted_ratings = np.array(predicted_ratings)
      # Now predicted_ratings has rating for each rating in test set

      # Replace NaN values in predictions with 3.
      predicted_ratings[np.isnan(predicted_ratings)] = 3
      # Get the actual ratings from the test set.
      actual_ratings = np.array(self.data.test.rating)
      # Calculate the RMSE.
      rmse_value = np.sqrt(((actual_ratings - predicted_ratings)**2).mean())

      return rmse_value


### NMF - Raw Method

In [43]:
rs_all = RecSys(data)
raw_pred = rs_all.nmf_predict()
raw_rmse = rs_all.rmse(raw_pred)
df = pd.concat([df, pd.DataFrame([{'Method': 'NMF - Raw', 'RMSE': raw_rmse}])], ignore_index=True)
print('RMSE: ',raw_rmse)

RMSE:  3.2000587858088867


### NMF - Three Method

In [42]:
three_pred = rs_all.nmf_predict(method='three')
three_rmse = rs_all.rmse(three_pred)
df = pd.concat([df, pd.DataFrame([{'Method': 'NMF - Three', 'RMSE': three_rmse}])], ignore_index=True)
print('RMSE: ',three_rmse)

RMSE:  1.1627417567180343


### NMF - User average Method

In [41]:
user_avg_pred = rs_all.nmf_predict(method='usr_avg')
user_avg_rmse = rs_all.rmse(user_avg_pred)
df = pd.concat([df, pd.DataFrame([{'Method': 'NMF - User Avg', 'RMSE': user_avg_rmse}])], ignore_index=True)
print('RMSE: ',user_avg_rmse)

RMSE:  1.125163171177205


## Conclusion
It is clear that NMF struggles with databases with numerous unreviewed items.
The dependency of the model on dense matrices is made very clear when evaluating it's performance without any treatment to the database. This is because it relies heavily on the observed data points, which might not be representative of the overall user preferences, creating bias and innacuracies in its predictions. Raw method's RMSE was 3.20 v. 1.26 for the baseline method when suggesting 3s.

Probably a better way to use NMF without having to treat the data would be groupping the ratings by genre for example, generating a more dense matrix and reducing the impact of the missing observations.

Alternatively, both methods to try and overcome the sparse matrix used in this notebook (`three`, `user_avg`) have dramatically reduced the RMSE in both cases, but still very minimal improvement compared to the baseline and performing poorly when compared to the other methods (see table below).

In [45]:
#final comparison table
print ('Comparison Table for all Methods')
#formatting for .2f
pd.options.display.float_format = '{:.2f}'.format
df

Comparison Table for all Methods


Unnamed: 0,Method,RMSE
0,Baseline Yp=3,1.26
1,Baseline Yp=mean,1.03
2,"Content based, item-item",1.01
3,"Collaborative, cosine",1.03
4,"Collaborative, jaccard, Mr>=3",0.98
5,"Collaborative, jaccard, Mr>=1",0.99
6,"Collaborative, jaccard, Mr",0.96
7,NMF - Raw,3.2
8,NMF - Three,1.16
9,NMF - User Avg,1.13


ConvergenceWarning: Maximum number of iterations 500 reached. Increase it to improve convergence.