# Week 4 Unsupervised Learning: Sklearn’s Non-Negative Matrix Factorization

In [34]:
# Libraries
import warnings

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

# Silence library warnings to keep the notebook output short
warnings.filterwarnings('ignore')

In [11]:
# Import the data

users = pd.read_csv('https://raw.githubusercontent.com/Vorlon41/Master-of-Data-Science-CU-Boulder-Colorado/main/Machine%20Learning/DTSA%205510%20Unsupervised%20Algorithms%20in%20Machine%20Learning/Week4DataFiles/users.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/Vorlon41/Master-of-Data-Science-CU-Boulder-Colorado/main/Machine%20Learning/DTSA%205510%20Unsupervised%20Algorithms%20in%20Machine%20Learning/Week4DataFiles/movies.csv')
train = pd.read_csv('https://raw.githubusercontent.com/Vorlon41/Master-of-Data-Science-CU-Boulder-Colorado/main/Machine%20Learning/DTSA%205510%20Unsupervised%20Algorithms%20in%20Machine%20Learning/Week4DataFiles/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Vorlon41/Master-of-Data-Science-CU-Boulder-Colorado/main/Machine%20Learning/DTSA%205510%20Unsupervised%20Algorithms%20in%20Machine%20Learning/Week4DataFiles/test.csv')

In [13]:
print(train.head())     # See the first 5 rows
print(train.info())     # See basic structure


    uID   mID  rating
0   744  1210       5
1  3040  1584       4
2  1451  1293       5
3  5455  3176       2
4  2507  3074       5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700146 entries, 0 to 700145
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   uID     700146 non-null  int64
 1   mID     700146 non-null  int64
 2   rating  700146 non-null  int64
dtypes: int64(3)
memory usage: 16.0 MB
None


## Load the movie ratings data and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE.

In [16]:
allusers = list(users['uID'])
allmovies = list(movies['mID'])
# Map raw movie and user IDs to zero-based column and row indices
mid2idx = dict(zip(movies.mID, range(len(movies))))
uid2idx = dict(zip(users.uID, range(len(users))))
ind_movie = [mid2idx[x] for x in train.mID]
ind_user = [uid2idx[x] for x in train.uID]
rating_train = list(train.rating)
# Dense users x movies matrix; a 0 marks a (user, movie) pair with no training rating
Mr = coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(allusers), len(allmovies))).toarray()
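
The ID-to-index mapping and the `coo_matrix` construction above can be checked on a tiny example. This is a minimal sketch on hypothetical toy data (the `toy_*` names and values are made up for illustration, not taken from the course files):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: raw IDs need not be contiguous or start at 0
toy_user_ids = [10, 20, 30]
toy_movie_ids = [100, 200, 300, 400]
toy_ratings = [(10, 100, 5), (20, 300, 4), (30, 400, 2)]  # (uID, mID, rating)

# Map each raw ID to a row or column position
u2i = {u: i for i, u in enumerate(toy_user_ids)}
m2i = {m: j for j, m in enumerate(toy_movie_ids)}

rows = [u2i[u] for u, _, _ in toy_ratings]
cols = [m2i[m] for _, m, _ in toy_ratings]
vals = [r for _, _, r in toy_ratings]

# Unlisted (user, movie) pairs become 0 in the dense array
toy_matrix = coo_matrix((vals, (rows, cols)),
                        shape=(len(toy_user_ids), len(toy_movie_ids))).toarray()
print(toy_matrix)
```

Every cell without a rating is stored as 0, so the dense matrix cannot distinguish "not rated" from a rating of zero. That matters for how the factorization is fit later on.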

     

In [18]:
Mr

array([[5, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [3, 0, 0, ..., 0, 0, 0]])

In [20]:
# Density of the training matrix: fraction of (user, movie) cells that hold a rating (sparsity = 1 - density)
len(Mr.nonzero()[0]) / float(Mr.shape[0] * Mr.shape[1])

0.029852745794625237
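
The value above is the fraction of cells that hold a rating (the density), so roughly 97% of the training matrix is empty. A minimal sketch of the two quantities on a hypothetical toy matrix:

```python
import numpy as np

# Hypothetical toy rating matrix (0 = no rating)
toy = np.array([[5, 0, 0, 0],
                [0, 0, 4, 0],
                [0, 0, 0, 2]])

density = np.count_nonzero(toy) / toy.size   # fraction of filled cells
sparsity = 1.0 - density                     # fraction of empty cells
print(density, sparsity)
```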

In [22]:
# Build the test matrix the same way; its nonzero cells are the ratings to predict
ind_movie_test = [mid2idx[x] for x in test.mID]
ind_user_test = [uid2idx[x] for x in test.uID]
rating_test = list(test.rating)
Mr_test = coo_matrix((rating_test, (ind_user_test, ind_movie_test)), shape=(len(allusers), len(allmovies))).toarray()
     

In [24]:
len(Mr_test.nonzero()[0]) / float(Mr_test.shape[0] * Mr_test.shape[1])

0.012794052185362243

In [28]:
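# Factorize Mr ≈ W H with 20 latent components; unrated cells enter the fit as zeros
model = NMF(n_components=20)
W = model.fit_transform(Mr)   # user factors, shape (n_users, 20)
H = model.components_         # movie factors, shape (20, n_movies)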

In [29]:
Mr_pred = W.dot(H)  # reconstructed users x movies matrix; equivalent to H.T.dot(W.T).T
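
The reconstruction is the product of the user factors and the movie factors. Since the transpose of H.T W.T equals W H, the transposed form and the direct product agree. A small check on hypothetical random factors (`W_toy` and `H_toy` are illustrative names only):

```python
import numpy as np

rng = np.random.default_rng(0)
W_toy = rng.random((4, 2))   # 4 users x 2 latent components
H_toy = rng.random((2, 3))   # 2 latent components x 3 movies

# Both expressions give the same 4 x 3 reconstruction
same = np.allclose(H_toy.T.dot(W_toy.T).T, W_toy @ H_toy)
print(same, (W_toy @ H_toy).shape)
```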

In [32]:

Mr_pred

array([[1.79464002e+00, 5.34311639e-01, 1.15749301e-02, ...,
        1.29737624e-02, 6.40729530e-03, 9.17142916e-02],
       [1.22744873e+00, 3.72580112e-01, 1.39461017e-01, ...,
        1.65134832e-02, 0.00000000e+00, 3.70602063e-02],
       [6.99488318e-01, 1.46098069e-01, 1.02131271e-03, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [6.25181940e-01, 1.38935621e-02, 1.27455412e-03, ...,
        1.38678171e-03, 0.00000000e+00, 0.00000000e+00],
       [1.25286703e+00, 2.86990278e-01, 9.39784923e-02, ...,
        4.85286743e-02, 0.00000000e+00, 0.00000000e+00],
       [1.25535953e+00, 9.74039020e-02, 6.40501225e-03, ...,
        9.33149194e-02, 9.31066927e-02, 4.21293214e-01]])

In [36]:
# RMSE over the observed test ratings only (nonzero cells of Mr_test); y_true goes first
rmse = np.sqrt(mean_squared_error(Mr_test[Mr_test.nonzero()], Mr_pred[Mr_test.nonzero()]))
     

In [38]:
rmse

2.8622637677758154
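
For context, the same masked RMSE can score any predictor, including a constant one. The sketch below uses hypothetical toy matrices and a global-mean predictor; it is not the Module 3 baseline, only an illustration of evaluating on observed test cells:

```python
import numpy as np

def masked_rmse(y_true_matrix, y_pred_matrix):
    """RMSE over cells where y_true_matrix is nonzero (observed ratings)."""
    idx = y_true_matrix.nonzero()
    diff = y_true_matrix[idx] - y_pred_matrix[idx]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy matrices (0 = no rating)
toy_train = np.array([[5, 0, 3], [0, 4, 0], [1, 0, 0]], dtype=float)
toy_test = np.array([[0, 4, 0], [2, 0, 0], [0, 0, 5]], dtype=float)

# Constant predictor: mean of the observed training ratings
global_mean = toy_train[toy_train.nonzero()].mean()
constant_pred = np.full_like(toy_test, global_mean)

baseline_rmse = masked_rmse(toy_test, constant_pred)
print(global_mean, baseline_rmse)
```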

## Discuss the results and why they did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?

In comparison, a simple baseline model achieved an RMSE of approximately 1.26, less than half the NMF model's RMSE of about 2.86.
This gap suggests that on a very sparse rating matrix, simpler models that use only the observed ratings and local similarity structure can outperform matrix factorization applied to a zero-filled matrix.

The poor performance of NMF in this setting can be attributed to several factors.
First, NMF operates purely on the user-movie rating matrix, with no side information about users or movies, so it has little to generalize from when about 97% of the cells are empty.
Second, sklearn's NMF by default minimizes a squared Frobenius (Euclidean, L2) loss over every cell of the matrix, including the zeros that stand in for missing ratings.
As a result, the model mostly fits the zeros rather than the known ratings, which pulls its predictions toward 0: many of the Mr_pred entries shown above are below 1.
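
To illustrate the effect of fitting zeros, the sketch below compares two fits on hypothetical synthetic data: one whose loss covers only the observed cells (weighted multiplicative updates, a technique not used in this notebook and not part of sklearn's `NMF`) and one that treats the zero-filled cells as real ratings. The function `masked_nmf` and all data here are assumptions made for illustration.

```python
import numpy as np

def masked_nmf(X, mask, n_components, n_iter=500, eps=1e-9, seed=0):
    """Weighted multiplicative updates; only cells with mask == 1 enter the loss."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, n_components)) + 0.1
    H = rng.random((n_components, n_cols)) + 0.1
    MX = mask * X
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical rank-2 "true" ratings with about half of the cells hidden
rng = np.random.default_rng(42)
true_ratings = rng.uniform(1, 2, (30, 2)) @ rng.uniform(1, 1.5, (2, 20))
mask = (rng.random(true_ratings.shape) < 0.5).astype(float)
hidden = mask == 0

# Loss restricted to observed cells
W_m, H_m = masked_nmf(true_ratings, mask, n_components=2)
masked_err = rmse(true_ratings[hidden], (W_m @ H_m)[hidden])

# Same updates, but hidden cells are zero-filled and counted in the loss
W_z, H_z = masked_nmf(mask * true_ratings, np.ones_like(mask), n_components=2)
zero_err = rmse(true_ratings[hidden], (W_z @ H_z)[hidden])

print("hidden-cell RMSE, masked loss:     ", masked_err)
print("hidden-cell RMSE, zero-filled loss:", zero_err)
```

On this toy setup the zero-filled fit is dragged toward zero at the hidden cells, which mirrors the behaviour described above. Whether a masked loss helps as much on the real ratings would need to be tested.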

Several improvements could potentially enhance NMF performance.
Tuning the number of components (n_components) against held-out ratings could help find a better latent dimension; GridSearchCV would need a custom scoring function for this, since NMF has no score method of its own.
In addition, switching the loss from the default Frobenius loss to Kullback-Leibler (KL) divergence (beta_loss='kullback-leibler', which requires solver='mu') may be worth testing on sparse data, although KL loss still treats the zeros as observed values.
Finally, applying regularization (e.g., the alpha_W and alpha_H penalties in recent scikit-learn versions) could help prevent overfitting and improve generalization to unseen ratings.