# **Using NMF on sparse data**


## Step 1:
I will load the movie ratings sample from the MovieLens dataset, use sklearn's matrix factorization (NMF) to predict the missing ratings in the test data, and measure the RMSE.

To account for missing data, I impute every movie a user has not rated with a 3. For the number of latent factors, I use the number of genre columns in the movies table (18).

In [44]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from math import sqrt
from scipy.sparse import csr_matrix

In [45]:
MV_users = pd.read_csv('movie-data/users.csv')
MV_movies = pd.read_csv('movie-data/movies.csv')
train = pd.read_csv('movie-data/train.csv')
test = pd.read_csv('movie-data/test.csv')

In [46]:
num_users = MV_users['uID'].max()
num_movies = MV_movies['mID'].max()

# Create a full user-movie matrix, imputing every unrated pair with 3.0
user_movie_matrix_train = np.full((num_users, num_movies), 3.0)

# Populate the matrix with the known training ratings in one vectorized step.
# IDs start at 1, so shift to 0-based indices; cast to int so they are valid array indices.
train_user_idx = train['uID'].to_numpy().astype(int) - 1
train_movie_idx = train['mID'].to_numpy().astype(int) - 1
in_bounds = (train_user_idx < num_users) & (train_movie_idx < num_movies)
ratings_arr = train['rating'].to_numpy()
user_movie_matrix_train[train_user_idx[in_bounds], train_movie_idx[in_bounds]] = ratings_arr[in_bounds]

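# Convert to CSR format. Note: after imputation the matrix contains no zeros,
# so it is effectively dense and CSR storage brings no efficiency gain here.
user_movie_sparse_train = csr_matrix(user_movie_matrix_train)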

# For the number of latent factors, I am using the number of genre columns in the movies table.
n_components = 18

# Initialize and fit the NMF model on the training data matrix
nmf_model = NMF(n_components=n_components, init='nndsvda', max_iter=1000, random_state=42)
user_factors = nmf_model.fit_transform(user_movie_sparse_train)
movie_factors = nmf_model.components_

# Reconstruct the matrix to predict missing ratings for the entire matrix
predicted_ratings_full_matrix = np.dot(user_factors, movie_factors)

# Predict ratings for the test data (shift IDs to 0-based integer indices)
test_user_idx = test['uID'].to_numpy().astype(int) - 1
test_movie_idx = test['mID'].to_numpy().astype(int) - 1
test_predictions = predicted_ratings_full_matrix[test_user_idx, test_movie_idx]

# Calculate RMSE on the test data
test_true_ratings = test['rating'].values
rmse = sqrt(mean_squared_error(test_true_ratings, test_predictions))

print(f"NMF RMSE with imputed threes on test set: {rmse:.4f}")

NMF RMSE with imputed threes on test set: 1.1062


This model took an excruciatingly long time to run and had an RMSE worse than all of my collaborative filtering models for this dataset.

## Step 2:
Now I will discuss the results: why sklearn's non-negative matrix factorization did not work well compared to the simple baseline and similarity-based methods from my earlier models, and how it could be fixed.

### Why sklearn.decomposition.NMF performs poorly with this data

**Naive Handling of Missing Data:** As explained previously, sklearn's NMF requires a complete matrix. By filling missing values with a constant 3, we introduce a strong bias: the model learns that an unrated movie is equivalent to one rated 3, which is inaccurate. Because the data is sparse, most of the matrix consists of these imputed values, so the factorization is fit largely to the filler rather than to real ratings. This fundamentally misrepresents the data, leading to poor predictions for the truly missing ratings.

**No Bias Terms:** The standard NMF implementation lacks user and item bias terms. In collaborative filtering, these biases are crucial for accurately modeling user behavior. For example, a user who rates stingily across the board (low user bias) or a movie that is universally loved (high item bias) shifts ratings in ways that the plain NMF model has no dedicated term for.

**Non-negativity Constraint:** NMF constrains both factor matrices to be non-negative, which can be too restrictive for user preferences that have both positive and negative components (for example, actively disliking a genre).
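
The first two points can be made concrete by writing the two objectives side by side. The notation here ($X$, $W$, $H$, $\mathcal{K}$, $\mu$, $b_u$, $b_i$, $p_u$, $q_i$, $\lambda$) is introduced only for illustration and does not appear in the notebook. With its default Frobenius loss and no regularization, sklearn's NMF fits every cell of the imputed matrix $X$, while a typical collaborative-filtering factorization sums only over the set $\mathcal{K}$ of observed ratings and includes bias terms:

```latex
% sklearn NMF (default settings): every cell counts, including imputed 3s
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2} \lVert X - WH \rVert_F^2
  = \tfrac{1}{2} \sum_{u,i} \bigl( x_{ui} - (WH)_{ui} \bigr)^2

% Biased factorization: only observed ratings (u,i) in K count
\hat{r}_{ui} = \mu + b_u + b_i + p_u^{\top} q_i,
\qquad
\min_{b,\,p,\,q} \sum_{(u,i) \in \mathcal{K}} \bigl( r_{ui} - \hat{r}_{ui} \bigr)^2
  + \lambda \bigl( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \bigr)
```

In the first objective, the sum over all $(u,i)$ is dominated by imputed cells when the data is sparse; in the second, imputed cells never enter the loss.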

### Why other methods might work better

**Simple Baseline:** A baseline approach could involve predicting the average rating for a user, or a movie, or the global average. These simple methods often perform better because they capture important biases that the naive NMF model ignores.

**Similarity-Based Methods:** Collaborative filtering models that use similarity (e.g., user-user or item-item) directly handle the sparsity of the data and are not susceptible to the imputation problem.

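
As a rough illustration of the simple baselines, the sketch below scores the global-mean, user-mean, and movie-mean predictors. The column names `uID`, `mID`, and `rating` follow the notebook; the helper name `baseline_rmse` and the toy data are made up for this example, and falling back to the global mean for unseen users or movies is an assumption, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

def baseline_rmse(train, test):
    """Return the test RMSE of global-, user-, and movie-mean predictors."""
    global_mean = train['rating'].mean()
    user_mean = train.groupby('uID')['rating'].mean()
    movie_mean = train.groupby('mID')['rating'].mean()

    # Users or movies unseen in training fall back to the global mean (assumption)
    preds = {
        'global': np.full(len(test), global_mean),
        'user': test['uID'].map(user_mean).fillna(global_mean).to_numpy(),
        'movie': test['mID'].map(movie_mean).fillna(global_mean).to_numpy(),
    }
    truth = test['rating'].to_numpy()
    return {name: float(np.sqrt(np.mean((truth - p) ** 2)))
            for name, p in preds.items()}

# Tiny made-up example (not MovieLens data)
toy_train = pd.DataFrame({
    'uID': [1, 1, 2, 2, 3],
    'mID': [1, 2, 1, 3, 2],
    'rating': [5, 4, 2, 1, 3],
})
toy_test = pd.DataFrame({'uID': [1, 2, 4], 'mID': [3, 2, 1], 'rating': [4, 2, 3]})
toy_scores = baseline_rmse(toy_train, toy_test)
print(toy_scores)
```

With the notebook's data, the same function could be called as `baseline_rmse(train, test)` to get reference numbers to compare against the NMF RMSE.
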
### Suggested ways to fix and improve

**Use a Dedicated Library (surprise):** The surprise library is built for recommender systems and includes an NMF implementation that trains only on the observed ratings, so no imputation is needed. It can also include user and item bias terms (via `biased=True`; they are off by default), which should improve performance.

**Custom Implementation with Stochastic Gradient Descent (SGD):** If restricted to the sklearn/NumPy stack (no dedicated recommender library), a custom SGD approach could be implemented. This would involve:

- Iterating only over the known ratings in the training data.
- Adding and updating user and item bias terms during each iteration.
- Updating the user and item latent factor matrices based on the gradients computed from the known ratings.

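
The three steps above could be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the notebook's method: the function names `fit_biased_mf` and `predict`, the hyperparameters (`lr`, `reg`, `n_epochs`), and the toy data are all made up for illustration. Note also that the model is an unconstrained biased factorization rather than true NMF, since it does not enforce non-negativity on the factors.

```python
import numpy as np

def fit_biased_mf(users, items, ratings, n_users, n_items,
                  n_factors=18, n_epochs=30, lr=0.01, reg=0.05, seed=42):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD over observed ratings only."""
    rng = np.random.default_rng(seed)
    mu = ratings.mean()
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))  # user latent factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # item latent factors

    for _ in range(n_epochs):
        # Visit only the known ratings, in a fresh random order each epoch
        for k in rng.permutation(len(ratings)):
            u, i = users[k], items[k]
            err = ratings[k] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            # Gradient steps with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict(model, users, items):
    """Predict ratings for aligned arrays of 0-based user and item indices."""
    mu, b_u, b_i, P, Q = model
    raw = mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)
    return np.clip(raw, 1.0, 5.0)  # assumes a 1-5 rating scale

# Tiny made-up example with 0-based indices (not MovieLens data)
toy_users = np.array([0, 0, 1, 1, 2, 2])
toy_items = np.array([0, 1, 0, 2, 1, 2])
toy_ratings = np.array([5.0, 4.0, 2.0, 1.0, 4.0, 3.0])

toy_model = fit_biased_mf(toy_users, toy_items, toy_ratings,
                          n_users=3, n_items=3, n_factors=2, n_epochs=200)
toy_pred = predict(toy_model, toy_users, toy_items)
toy_rmse = float(np.sqrt(np.mean((toy_ratings - toy_pred) ** 2)))
print(f"Training RMSE on toy data: {toy_rmse:.4f}")
```

The pure-Python inner loop is slow on large datasets; it is written this way to keep each update step readable.
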
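**Use a sklearn Alternative that handles sparsity:** sklearn.decomposition.TruncatedSVD can operate on sparse matrices directly. While not NMF, it is another form of matrix factorization that could perform better on sparse data. Note that it still treats unstored entries as zeros, so it helps to center the ratings (for example, by subtracting the global mean) before fitting.
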
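
One possible way to try this is sketched below. It assumes the ratings are centered on the global mean so that an unstored cell means "predict the mean" rather than "rating of 0"; that centering step, the helper name `fit_centered_svd`, and the toy data are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

def fit_centered_svd(users, items, ratings, n_users, n_items,
                     n_components=2, seed=42):
    """Factorize a mean-centered sparse rating matrix with TruncatedSVD."""
    mu = ratings.mean()
    # Only known ratings are stored; unstored cells are implicit zeros, which
    # after centering correspond to the global mean. Assumes each (user, item)
    # pair appears once, since csr_matrix sums duplicate entries.
    centered = csr_matrix((ratings - mu, (users, items)),
                          shape=(n_users, n_items))
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    user_factors = svd.fit_transform(centered)  # shape (n_users, n_components)
    item_factors = svd.components_              # shape (n_components, n_items)
    return mu, user_factors, item_factors

# Tiny made-up example with 0-based indices (not MovieLens data)
toy_users = np.array([0, 0, 1, 1, 2, 3, 3])
toy_items = np.array([0, 1, 0, 2, 1, 2, 3])
toy_ratings = np.array([5.0, 4.0, 2.0, 1.0, 4.0, 3.0, 5.0])

toy_mu, toy_U, toy_V = fit_centered_svd(toy_users, toy_items, toy_ratings,
                                        n_users=4, n_items=4)
# Add the mean back and clip to the assumed 1-5 rating scale
toy_pred = np.clip(toy_mu + toy_U @ toy_V, 1.0, 5.0)
print(toy_pred.round(2))
```

Unlike the imputed-threes matrix, this input stays genuinely sparse, so the fit should scale with the number of known ratings rather than with the full user-by-movie grid.
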
**Advanced Imputation:** A better imputation strategy could be to fill missing ratings with the user's average rating or the movie's average rating, rather than a constant 3. While still not ideal, it is a step up from a constant fill.
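
A user-mean fill could look roughly like the sketch below. The helper name `impute_with_user_means` and the toy data are hypothetical, and using the global mean for users with no training ratings is an assumption made for this example.

```python
import numpy as np

def impute_with_user_means(users, items, ratings, n_users, n_items):
    """Build a dense rating matrix where unrated cells hold the user's mean rating."""
    global_mean = ratings.mean()
    # Per-user rating sums and counts from 0-based integer user indices
    sums = np.bincount(users, weights=ratings, minlength=n_users)
    counts = np.bincount(users, minlength=n_users)
    # Users with no ratings fall back to the global mean (assumption)
    user_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    # Start every row at that user's mean, then overwrite the known ratings
    matrix = np.repeat(user_means[:, None], n_items, axis=1)
    matrix[users, items] = ratings
    return matrix

# Tiny made-up example: user 2 has no ratings at all
toy_users = np.array([0, 0, 1])
toy_items = np.array([0, 1, 2])
toy_ratings = np.array([5.0, 3.0, 2.0])
toy_matrix = impute_with_user_means(toy_users, toy_items, toy_ratings,
                                    n_users=3, n_items=3)
print(toy_matrix)
```

The resulting matrix is non-negative, so it could be passed to `NMF.fit_transform` in place of the matrix of threes; the imputed cells still enter the loss, which is why this remains a partial fix.
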