# Limitation(s) of sklearn's non-negative matrix factorization library

## Experiment

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from math import sqrt

In [7]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [13]:
# sklearn's NMF rejects NaN input, so fill the missing ratings with 0
train_matrix_df = train.pivot(index='uID', columns='mID', values='rating').fillna(0)
R = train_matrix_df.values

# map user and movie IDs to row and column indices of R
user_id_to_index = {uid: idx for idx, uid in enumerate(train_matrix_df.index)}
movie_id_to_index = {mid: idx for idx, mid in enumerate(train_matrix_df.columns)}

print(f"User-Item Matrix Shape: {R.shape}")

# train and fit
n_components = 20
model = NMF(n_components=n_components, init='random', random_state=42, max_iter=200)
W = model.fit_transform(R)
H = model.components_  # feature-item matrix
R_predicted = np.dot(W, H)  # reconstructed rating for every user-movie pair

# predict and get rmse
y_true = []
y_pred = []
global_mean = train['rating'].mean()

for index, row in test.iterrows():
    user = row['uID']
    movie = row['mID']
    actual_rating = row['rating']
    
    y_true.append(actual_rating)
    
    # check if user and movie exist
    if user in user_id_to_index and movie in movie_id_to_index:
        u_idx = user_id_to_index[user]
        m_idx = movie_id_to_index[movie]
        
        # get the predicted rating
        predicted_rating = R_predicted[u_idx, m_idx]
        y_pred.append(predicted_rating)
    else:
        # fallback: user or movie not seen in training, predict the global mean
        y_pred.append(global_mean)

# get rmse
rmse = sqrt(mean_squared_error(y_true, y_pred))

print("-" * 55)
print(f"Final RMSE using Sklearn NMF: {rmse:.4f}")
print("-" * 55)

User-Item Matrix Shape: (6040, 3664)
-------------------------------------------------------
Final RMSE using Sklearn NMF: 2.8535
-------------------------------------------------------


## Failure Analysis

**The Result**

The RMSE was 2.8535.

Since movie ratings are on a scale of 1 to 5, an error of nearly 3 points covers most of the rating range. For example, for a movie that a user actually rated 5 stars, an error of this size puts the prediction at roughly 2.1. This is much worse than a simple baseline such as always predicting the average rating, which typically yields an RMSE of around 1.0.
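
To make the comparison with a baseline concrete, here is a minimal sketch of how RMSE behaves for a global-mean predictor versus predictions that collapse toward 0. The ratings and the names `toy_train_ratings`, `toy_test_ratings` and `near_zero_pred` are made up for illustration and are not taken from the dataset above.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings on a 1-5 scale (hypothetical, for illustration only).
toy_train_ratings = [5, 4, 3, 4, 2, 5, 3, 4]
toy_test_ratings = [4, 5, 2, 3]

# Baseline: predict the training mean for every test row.
toy_global_mean = sum(toy_train_ratings) / len(toy_train_ratings)
baseline_pred = [toy_global_mean] * len(toy_test_ratings)

# Stand-in for a zero-biased model: every prediction sits close to 0.
near_zero_pred = [0.2] * len(toy_test_ratings)

baseline_rmse = rmse(toy_test_ratings, baseline_pred)
near_zero_rmse = rmse(toy_test_ratings, near_zero_pred)
print(f"global-mean baseline RMSE: {baseline_rmse:.3f}")
print(f"near-zero prediction RMSE: {near_zero_rmse:.3f}")
```

On real data the same idea applies with `train['rating'].mean()` as the constant prediction.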

**Why didn't it work?**

Simply put, the problem is the zero filling.
The poor performance is not necessarily due to the NMF algorithm itself, but rather to how the data had to be preprocessed for scikit-learn.

1.  **Input Constraints:** Sklearn's NMF implementation does not support empty cells or missing values. It requires a completely filled matrix to run.
2.  **Imputation Issue:** To satisfy this requirement, I filled all missing ratings with 0.
3.  **Model Confusion:** The model does not understand that 0 is a placeholder. It interprets 0 as an actual, very low rating, essentially assuming the user hated the movie.
4.  **Data Imbalance:** In this dataset, over 95% of user-movie combinations are missing, so the matrix consists mostly of zeros. Because the reconstruction error is summed over every cell, the zeros dominate the loss and the model learns that the correct prediction is usually close to 0.
5.  **Prediction Error:** When predicting for the test set, where every actual rating lies between 1 and 5, the zero-biased model predicts values close to 0. The large gap between the actual ratings and the predictions produces the high RMSE.
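
The imbalance described in the list above can be checked numerically. The sketch below uses a small synthetic matrix (an assumption, not the real data) with about 5% of cells observed, and compares the mean of the observed ratings with the mean of the zero-filled matrix that NMF is actually asked to reconstruct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the pivoted rating matrix (not the real data):
# 200 users x 100 movies, with roughly 5% of the cells observed.
n_users, n_movies, density = 200, 100, 0.05
observed_mask = rng.random((n_users, n_movies)) < density
ratings = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)

# NaN marks a missing rating, as in the output of DataFrame.pivot.
R_sparse = np.where(observed_mask, ratings, np.nan)
R_zero_filled = np.nan_to_num(R_sparse, nan=0.0)

missing_fraction = np.isnan(R_sparse).mean()
mean_observed = np.nanmean(R_sparse)
mean_zero_filled = R_zero_filled.mean()

print(f"missing cells:              {missing_fraction:.1%}")
print(f"mean of observed ratings:   {mean_observed:.2f}")
print(f"mean of zero-filled matrix: {mean_zero_filled:.2f}")
```

The zero-filled mean equals the observed mean scaled by the fraction of observed cells, which is why a model fitted to every cell is pulled so far below the 1 to 5 range.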

**Suggested Improvements**

1.  **Better Imputation:** A quick fix is to fill the missing values with something more plausible than 0, for example the user's average rating. If a user typically rates movies 3.5 stars, filling empty cells with 3.5 tells the model that unseen movies are expected to be about average for that user, rather than disliked.
2.  **Specialized Libraries:** The most robust solution is to use a library designed specifically for recommender systems, such as Surprise or LightFM. Unlike scikit-learn's NMF, these libraries work from the observed interactions instead of a dense, filled-in matrix. Surprise, for example, computes its training error only on observed ratings, which avoids the zero-bias problem entirely.
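
To illustrate what "computing error only on observed ratings" means, here is a minimal NumPy sketch of a masked NMF using multiplicative updates. It is not how Surprise or LightFM are implemented; the function `masked_nmf`, its parameters, and the toy matrix are hypothetical and chosen only to show the idea.

```python
import numpy as np

def masked_nmf(R, mask, n_components=2, n_iter=200, eps=1e-9, seed=0):
    """Factor R ~ W @ H with W, H >= 0, fitting only cells where mask is True.

    Uses multiplicative updates for the masked squared error
    sum(mask * (R - W @ H) ** 2), so unobserved cells contribute nothing.
    """
    rng = np.random.default_rng(seed)
    M = mask.astype(float)
    # Placeholder zeros in unobserved cells are always multiplied by M = 0.
    R_obs = np.where(mask, R, 0.0)
    W = rng.random((R.shape[0], n_components)) + 0.1
    H = rng.random((n_components, R.shape[1])) + 0.1
    losses = []
    for _ in range(n_iter):
        W *= (R_obs @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ R_obs) / (W.T @ (M * (W @ H)) + eps)
        losses.append(float(np.sum(M * (R_obs - W @ H) ** 2)))
    return W, H, losses

# Toy 4x4 rating matrix; NaN marks a missing rating (made-up values).
toy_R = np.array([
    [5, 4, np.nan, 1],
    [4, np.nan, 1, 1],
    [1, 1, np.nan, 5],
    [np.nan, 1, 5, 4],
], dtype=float)
toy_mask = ~np.isnan(toy_R)

W_toy, H_toy, losses = masked_nmf(toy_R, toy_mask)
toy_pred = W_toy @ H_toy
masked_rmse = np.sqrt(np.mean((toy_R[toy_mask] - toy_pred[toy_mask]) ** 2))
print(f"masked loss: first iteration {losses[0]:.3f}, last {losses[-1]:.3f}")
print(f"RMSE on observed cells: {masked_rmse:.3f}")
```

Because the mask zeroes out every unobserved cell in both the numerator and the denominator of each update, the missing entries never act as ratings of 0, and `toy_pred` still provides an estimate for them.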

## Improved Result

In [14]:
train_matrix_df = train.pivot(index='uID', columns='mID', values='rating')

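# fill each user's missing ratings with that user's mean rating instead of 0
user_means = train_matrix_df.mean(axis=1)  # NaN cells are skipped by default
# fillna with a Series fills by column label, so transpose to put users in
# the columns, fill, and transpose back
train_matrix_filled = train_matrix_df.T.fillna(user_means).T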

R = train_matrix_filled.values
model = NMF(n_components=20, init='random', random_state=42, max_iter=200)
W = model.fit_transform(R)
H = model.components_
R_predicted = np.dot(W, H)

y_true = []
y_pred = []

user_id_to_index = {uid: idx for idx, uid in enumerate(train_matrix_df.index)}
movie_id_to_index = {mid: idx for idx, mid in enumerate(train_matrix_df.columns)}

for index, row in test.iterrows():
    user = row['uID']
    movie = row['mID']
    
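    # Unlike the first experiment, test rows with an unseen user or movie are
    # skipped here rather than scored with the global mean, so the two RMSEs
    # may be computed over different sets of test rows.
    if user in user_id_to_index and movie in movie_id_to_index: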
        u_idx = user_id_to_index[user]
        m_idx = movie_id_to_index[movie]
        
        y_true.append(row['rating'])
        y_pred.append(R_predicted[u_idx, m_idx])

rmse_improved = sqrt(mean_squared_error(y_true, y_pred))

print("-" * 55)
print(f"RMSE with User-Mean Imputation: {rmse_improved:.4f}")
print("-" * 55)

-------------------------------------------------------
RMSE with User-Mean Imputation: 0.9745
-------------------------------------------------------
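
The two experiments handle unseen users and movies differently: the first falls back to the global mean, the second skips those rows. A minimal sketch of a shared scoring helper that makes this choice explicit is shown below. The function `rmse_with_fallback` and the toy inputs are hypothetical and only illustrate how the two protocols can give different numbers for the same predictions.

```python
import numpy as np

def rmse_with_fallback(pred_matrix, row_index, col_index, test_rows, fallback=None):
    """RMSE over (user, movie, rating) triples.

    Rows with an unknown user or movie are scored with `fallback` when it
    is given, and skipped otherwise.
    """
    errors = []
    for user, movie, rating in test_rows:
        if user in row_index and movie in col_index:
            pred = pred_matrix[row_index[user], col_index[movie]]
        elif fallback is not None:
            pred = fallback
        else:
            continue
        errors.append(rating - pred)
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy predictions for 2 users x 2 movies (made-up values).
toy_pred_matrix = np.array([[4.0, 2.0], [3.0, 5.0]])
toy_users = {10: 0, 11: 1}
toy_movies = {100: 0, 101: 1}
toy_test = [(10, 100, 5), (11, 101, 4), (12, 100, 3)]  # user 12 is unseen

rmse_skip = rmse_with_fallback(toy_pred_matrix, toy_users, toy_movies, toy_test)
rmse_fallback = rmse_with_fallback(
    toy_pred_matrix, toy_users, toy_movies, toy_test, fallback=3.5
)
print(f"skipping unseen rows: {rmse_skip:.4f}")
print(f"with fallback value:  {rmse_fallback:.4f}")
```

Scoring both imputation strategies through one helper with the same `fallback` setting would put the two RMSE values on the same set of test rows.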
