
<h1><center>Limitations of sklearn's Non-negative Matrix Factorization (NMF)</center></h1>
<h2><center>Unsupervised Algorithms in Machine Learning</center></h2>
<h3><center>
DTSA-5510

University of Colorado Boulder

D. Stephen Haynes
</center></h3>



In [6]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

In [7]:
movies = pd.read_csv('https://raw.githubusercontent.com/dstephenhaynes/DTSA5510Movies/main/movies.csv')
users = pd.read_csv('https://raw.githubusercontent.com/dstephenhaynes/DTSA5510Movies/main/users.csv')
train = pd.read_csv('https://raw.githubusercontent.com/dstephenhaynes/DTSA5510Movies/main/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/dstephenhaynes/DTSA5510Movies/main/test.csv')

In [8]:
movies.info()
print()
users.info()
print()
train.info()
print()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   mID     3883 non-null   int64 
 1   title   3883 non-null   object
 2   year    3883 non-null   int64 
 3   Doc     3883 non-null   int64 
 4   Com     3883 non-null   int64 
 5   Hor     3883 non-null   int64 
 6   Adv     3883 non-null   int64 
 7   Wes     3883 non-null   int64 
 8   Dra     3883 non-null   int64 
 9   Ani     3883 non-null   int64 
 10  War     3883 non-null   int64 
 11  Chi     3883 non-null   int64 
 12  Cri     3883 non-null   int64 
 13  Thr     3883 non-null   int64 
 14  Sci     3883 non-null   int64 
 15  Mys     3883 non-null   int64 
 16  Rom     3883 non-null   int64 
 17  Fil     3883 non-null   int64 
 18  Fan     3883 non-null   int64 
 19  Act     3883 non-null   int64 
 20  Mus     3883 non-null   int64 
dtypes: int64(20), object(1)
memory usage: 637.2+ KB

<class 'pan

In [9]:
# Prepare training matrix
train_matrix = train.pivot(index='uID', columns='mID', values='rating')
train_matrix.fillna(0, inplace=True)

# Apply Non-Negative Matrix Factorization
n_components = 20
nmf = NMF(n_components=n_components, init='random', random_state=42, max_iter=1000)
W = nmf.fit_transform(train_matrix)
H = nmf.components_

# Predict ratings
train_matrix_pred = np.dot(W, H)

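# Prepare test data (note: test_matrix is built and aligned here, but the RMSE
# below is computed from test_ratings, not from test_matrix)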
test_matrix = test.pivot(index='uID', columns='mID', values='rating')

# Ensure train and test matrices have the same columns
missing_cols = list(set(train_matrix.columns) - set(test_matrix.columns))
missing_df = pd.DataFrame(0, index=test_matrix.index, columns=missing_cols)
test_matrix = pd.concat([test_matrix, missing_df], axis=1)

# Align test matrix to have the same order of columns as train matrix
test_matrix = test_matrix[train_matrix.columns]

# Extract test ratings
test_ratings = test.pivot(index='uID', columns='mID', values='rating').stack()

# Predict test ratings
test_pred = []
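for uid, mid in test_ratings.index:
    # Positional lookup: treats uID - 1 and mID - 1 as row and column positions.
    # This matches the pivoted matrix only if every ID from 1 up to the maximum
    # appears in train; otherwise positions and ID labels no longer line up.
    if uid - 1 < train_matrix_pred.shape[0] and mid - 1 < train_matrix_pred.shape[1]:
        test_pred.append(train_matrix_pred[uid - 1, mid - 1])
    else:
        # Position out of bounds: fall back to the mean of the full reconstructed
        # matrix (which reflects the zero-filled entries, not the mean observed rating)
        test_pred.append(train_matrix_pred.mean())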

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test_ratings, test_pred))
print(f"RMSE: {rmse}")

RMSE: 3.5045622216788614
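
The lookup in the cell above indexes the predicted matrix by `uid - 1` and `mid - 1`. As a minimal sketch of a label-based alternative, the snippet below maps IDs to row and column positions through the pivoted matrix's own index and columns. The toy data, the `toy_*` names, and the fallback value are assumptions made for illustration; the array standing in for `np.dot(W, H)` is arbitrary, and no claim is made about how this would change the RMSE reported above.

```python
import numpy as np
import pandas as pd

# Toy ratings with non-contiguous IDs (hypothetical data, not the notebook's CSVs)
toy_train = pd.DataFrame({
    "uID": [1, 1, 5, 5, 9],
    "mID": [10, 30, 10, 20, 30],
    "rating": [4, 5, 3, 2, 1],
})
toy_test = pd.DataFrame({
    "uID": [1, 5, 9, 7],
    "mID": [20, 30, 10, 10],
    "rating": [3, 4, 2, 5],
})

toy_matrix = toy_train.pivot(index="uID", columns="mID", values="rating").fillna(0)

# Stand-in for np.dot(W, H): any array with the same shape as the pivoted matrix
toy_pred = pd.DataFrame(
    np.arange(toy_matrix.size, dtype=float).reshape(toy_matrix.shape),
    index=toy_matrix.index,
    columns=toy_matrix.columns,
)

# Map ID labels to positions; -1 marks users or movies absent from training
row_pos = toy_pred.index.get_indexer(toy_test["uID"])
col_pos = toy_pred.columns.get_indexer(toy_test["mID"])
known = (row_pos >= 0) & (col_pos >= 0)

# Assumed fallback for unseen users or movies: the mean observed training rating
fallback = toy_train["rating"].mean()
preds = np.full(len(toy_test), fallback)
preds[known] = toy_pred.to_numpy()[row_pos[known], col_pos[known]]
```

Here user 7 never appears in the toy training data, so that row of the test set receives the fallback instead of a value read from an unrelated position.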


# Result
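The RMSE of 3.50 indicates that the NMF model's predictions are highly inaccurate: the typical (root-mean-square) gap between a predicted and an actual rating is about 3.5 rating points. An error this large suggests that the model, as configured here, is not capturing the structure of the ratings data.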
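
To give a sense of scale for an RMSE figure, the sketch below compares two trivial predictors on a handful of made-up ratings: predicting 0 for everything and predicting the mean rating for everything. The ratings and the helper name `rmse` are assumptions for illustration only and are not taken from the notebook's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings, chosen only to illustrate the arithmetic
toy_ratings = np.array([5, 4, 3, 4, 5, 2])

# Predicting 0 everywhere: the error is as large as the ratings themselves
rmse_zero = rmse(toy_ratings, np.zeros_like(toy_ratings))

# Predicting the mean everywhere: the error equals the standard deviation
rmse_mean = rmse(toy_ratings, np.full(toy_ratings.shape, toy_ratings.mean()))

print(f"all-zero predictor RMSE: {rmse_zero:.3f}")
print(f"mean predictor RMSE:     {rmse_mean:.3f}")
```

On these toy values the all-zero predictor is several times worse than the constant-mean predictor, which is the kind of gap to keep in mind when reading a model's RMSE.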

# Discussion

The RMSE obtained with NMF is much higher than the values obtained with the baseline and similarity-based methods in Module 3. This is likely a consequence of limitations in sklearn's non-negative matrix factorization (NMF) that make it a poor fit for this task as used here. These limitations include:
1. Scalability: NMF is fit iteratively, so it can be computationally expensive on large ratings matrices.
2. Sparse Data: NMF may not handle highly sparse data well, which is common in recommendation systems. sklearn's NMF has no notion of a missing entry, so in this notebook unrated entries are filled with 0 before fitting and the model treats them as actual ratings of 0, which pulls the reconstruction toward 0.
3. Overfitting: NMF can easily overfit the training data, especially if the number of latent features is not well-tuned.
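
The sparse-data point can be made concrete with a small sketch. It builds a toy ratings table, pivots it the same way as the training cell, and compares the mean of the observed ratings with the mean of the zero-filled matrix that a factorization would be asked to reproduce. The data and the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: 4 users, 4 movies, only 6 of 16 pairs rated
toy = pd.DataFrame({
    "uID": [1, 1, 2, 3, 3, 4],
    "mID": [1, 2, 2, 3, 4, 1],
    "rating": [5, 4, 3, 4, 5, 2],
})
toy_pivot = toy.pivot(index="uID", columns="mID", values="rating")

# Fraction of user-movie pairs that actually carry a rating
density = toy_pivot.notna().to_numpy().mean()

# Mean over observed ratings only (NaN entries ignored)
mean_observed = float(np.nanmean(toy_pivot.to_numpy()))

# Mean of the matrix after filling unrated entries with 0
mean_zero_filled = float(toy_pivot.fillna(0).to_numpy().mean())

print(f"density: {density:.3f}")
print(f"mean of observed ratings: {mean_observed:.3f}")
print(f"mean of zero-filled matrix: {mean_zero_filled:.3f}")
```

The sparser the matrix, the further the zero-filled mean falls below the mean observed rating, and a model fit to the zero-filled matrix is rewarded for predicting values near that lower figure.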

Two suggestions for improving the results:
1. Hybrid Models: Combine NMF with other collaborative or content-based filtering methods to address the cold-start problem and improve accuracy.
2. Regularization: Add regularization terms to the NMF objective to reduce overfitting.
