# Limitations of NMF
This notebook follows a similar structure to `week4`: 
- Load the movie ratings data (as in the HW3-recommender-system)
- Use matrix factorization technique(s) and predict the missing ratings from the test data
- Measure the RMSE using `sklearn` 

# Setup

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error as mse
from sklearn.decomposition import NMF

import warnings
warnings.filterwarnings('ignore')

## Load Data

In [2]:
MV_users = pd.read_csv('data/w3/users.csv')
MV_movies = pd.read_csv('data/w3/movies.csv')
train = pd.read_csv('data/w3/train.csv')
test = pd.read_csv('data/w3/test.csv')

## EDA
Here, we will take a look at the movie data and how it is constructed:
- Shape
- Data construction

In [3]:
MV_users.shape, MV_movies.shape, train.shape, test.shape

((6040, 5), (3883, 21), (700146, 3), (300063, 3))

In [4]:
train.head()

,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


## Model Construction
- Reshape the training data into a user-by-movie rating matrix to fit the model structure
- Replace `NaN` values
- Create the `impute()`, `pred()`, and `rmse()` functions

In [5]:
ratings = train.pivot(index='uID', columns='mID', values='rating')
print(ratings.shape)
ratings.head()
# (6040, 3664)

(6040, 3664)


mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
uID
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,2.0,,,,,...,,,,,,,,,,


In [6]:
missing_locs = np.isnan(ratings)
ratings[missing_locs] = 0
ratings.head()

mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
uID
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
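def impute(ratings, val):
    """Replace every missing rating with a constant value.

    Args:
        ratings (pd.DataFrame): user-by-movie rating matrix with NaN for missing ratings
        val (float): value written into every missing cell

    Returns:
        pd.DataFrame: the same DataFrame, modified in place, with no NaN values
    """
    ratings[np.isnan(ratings)] = val
    return ratings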

In [8]:
def pred(train, test, impute_val = 0, impute = impute, k = 5):
    """Fit NMF on the imputed training matrix and predict the test ratings.

    Args:
        train: training data with columns ['uID', 'mID', 'rating']
        test: testing data with columns ['uID', 'mID', 'rating']
        impute_val: value to replace NaN values
        impute: impute function, called as impute(ratings, impute_val)
        k: number of latent features

    Returns:
        DataFrame with columns ['uID', 'mID', 'rating', 'pred_rating']
    """
    ratings = train.pivot(index='uID', columns='mID', values='rating')
    ratings = impute(ratings, impute_val) # replace NaN values
    
    # k sets the number of latent features (NMF components)
    nmf = NMF(n_components=k, init='nndsvda', solver='mu', beta_loss='kullback-leibler', alpha_H=0.1, alpha_W=0.1, max_iter=200, random_state=0)
    W1 = nmf.fit_transform(ratings)
    H1 = nmf.components_
    print(f'reconstruction_err_: {nmf.reconstruction_err_}')
    
    # reconstruct the full user-by-movie matrix from the two factors
    pred_matrix = W1 @ H1
    df = pd.DataFrame(data = pred_matrix,  
                      index = ratings.index.values, 
                      columns = ratings.columns.values) 
    df['uID'] = df.index.values

    # long format: one row per (uID, mID) pair with its predicted rating
    df = pd.melt(df, id_vars=['uID'], var_name='mID', value_name='pred_rating')
    # the inner merge keeps only test pairs whose user and movie both appear in train
    result = df.merge(test, on=['uID', 'mID'], how="inner", validate="many_to_many")
    return result[['uID', 'mID', 'rating', 'pred_rating']]
 

In [5]:
def rmse(df):
    """Compute the root mean squared error between true and predicted ratings.

    Args:
        df (pd.DataFrame): predictions with columns ['uID', 'mID', 'rating', 'pred_rating']

    Returns:
        float: root mean squared error
    """
    return np.sqrt(mse(df['rating'].values, df['pred_rating'].values))

# Predict
Impute the missing values in the training dataset with zeros, train NMF, and predict the test ratings.

In [10]:
predictions = pred(train, test)
predictions.head()

reconstruction_err_: 2692.048463233844


,uID,mID,rating,pred_rating
0,6,1,4,0.717425
1,8,1,4,0.704151
2,21,1,3,0.253164
3,23,1,4,1.319983
4,26,1,3,1.377998


RMSE w/ NMF model

In [11]:
print(f'RMSE: {rmse(predictions)}')
# RMSE: 2.911772946856598

RMSE: 2.911772946856598
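One plausible reading of this RMSE is that zero imputation tells NMF the missing ratings really are zeros, so the reconstruction at those positions is pulled toward zero. The sketch below illustrates this on synthetic data only; the matrix size, the 20% observation rate, and all variable names are assumptions made for illustration and are not taken from the movie dataset.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic "true" ratings between 1 and 5 for 50 users and 40 items (assumed sizes)
true_ratings = rng.integers(1, 6, size=(50, 40)).astype(float)

# Only about 20% of the entries are observed; the rest are filled with zeros
observed = rng.random(true_ratings.shape) < 0.2
zero_filled = np.where(observed, true_ratings, 0.0)

# Fit NMF on the zero-filled matrix, as if the zeros were real ratings
model = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(zero_filled)
H = model.components_
reconstruction = W @ H

# Compare the average prediction and the average true rating at the hidden positions
mean_pred_missing = reconstruction[~observed].mean()
mean_true_missing = true_ratings[~observed].mean()
print(f'mean prediction at missing entries: {mean_pred_missing:.3f}')
print(f'mean true rating at missing entries: {mean_true_missing:.3f}')
```

On this toy matrix the predictions at the hidden positions sit well below the true ratings, which matches the low `pred_rating` values in the table above.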


RMSE w/ Baseline (`predict_everything_to_3`)

In [12]:
predictions['pred_rating'] = 3
print(f'RMSE: {rmse(predictions)}')
# RMSE: 1.2585673019351262

RMSE: 1.2585673019351262
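For reference, the RMSE of a constant prediction depends only on how the true ratings spread around that constant, and the constant that minimizes it is the mean rating. A minimal sketch on made-up ratings (the array and the helper `constant_rmse` are hypothetical, not part of the assignment code):

```python
import numpy as np

# Made-up ratings for illustration only
toy_ratings = np.array([1, 2, 3, 4, 4, 4, 5, 5], dtype=float)

def constant_rmse(ratings, c):
    """RMSE when every rating is predicted as the constant c."""
    return np.sqrt(np.mean((ratings - c) ** 2))

rmse_three = constant_rmse(toy_ratings, 3.0)
rmse_mean = constant_rmse(toy_ratings, toy_ratings.mean())
print(f'predict 3 everywhere: {rmse_three:.4f}')
print(f'predict the mean everywhere: {rmse_mean:.4f}')
```

Any model that scores worse than such a constant is extracting no useful signal from the rating matrix.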


# RMSE w/ NMF After Imputing Missing Ratings with Each Item's Average Rating

In [14]:
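def impute_missing_avg_item_rating(ratings, val=None):
    """Replace each missing rating with the average observed rating of that item.

    Args:
        ratings (pd.DataFrame): user-by-movie rating matrix with NaN for missing ratings
        val (optional): unused; kept so the signature matches impute(). Defaults to None.

    Returns:
        pd.DataFrame: the same DataFrame, modified in place, with no NaN values
    """
    # column-wise mean over the observed ratings only (NaN values are skipped)
    item_means = ratings.mean(axis=0)
    ratings.fillna(item_means, inplace=True)
    return ratings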
 
pred_df = pred(train, test, impute = impute_missing_avg_item_rating, impute_val = None)
print(f'RMSE: {rmse(pred_df)}')
# NMF reconstruction error: 807.7526430141833
# RMSE: 0.9651849775012515

reconstruction_err_: 807.7526430141227
RMSE: 0.9651849775012514
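To make the item-mean imputation concrete, here is a small sketch on a hand-made table. The toy users, movies, and ratings are invented for illustration; only the column names follow the training data.

```python
import numpy as np
import pandas as pd

# Hand-made ratings in the same long format as train (values are invented)
toy = pd.DataFrame({
    'uID':    [1, 1, 2, 3, 3],
    'mID':    [10, 20, 10, 20, 30],
    'rating': [5, 3, 4, 1, 2],
})

# Users as rows, movies as columns; unrated pairs become NaN
toy_ratings = toy.pivot(index='uID', columns='mID', values='rating')

# Mean of the observed ratings per movie, then fill each column's NaN with it
item_means = toy_ratings.mean(axis=0)
filled = toy_ratings.fillna(item_means)
print(filled)
```

Movie 10 was rated 5 and 4, so user 3's missing rating for it becomes 4.5 rather than 0, which keeps the imputed values inside the real rating range.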


# Results
Discuss the results and explain why sklearn's non-negative matrix factorization did not work well compared to the simple baseline or the similarity-based methods we've done in Module 3. Can you suggest one or more ways to fix it? [10 pts]
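One possible direction, offered as a sketch and not as the expected answer: fit the factors on the observed ratings only, so that missing entries contribute nothing to the loss. sklearn's `NMF` has no mask argument, so the example below uses a hand-written weighted variant of the multiplicative updates in NumPy. The function `masked_nmf`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def masked_nmf(R, mask, k=2, n_iter=200, eps=1e-9, seed=0):
    """Non-negative factorization R ~ W @ H fitted on observed entries only.

    R: ratings matrix, mask: boolean matrix that is True where a rating is observed.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.random((n_users, k)) + 0.1
    H = rng.random((k, n_items)) + 0.1
    R_obs = np.where(mask, R, 0.0)  # unobserved entries are zeroed and masked out
    for _ in range(n_iter):
        # multiplicative updates for the squared error restricted to the mask
        WH = (W @ H) * mask
        W *= (R_obs @ H.T) / (WH @ H.T + eps)
        WH = (W @ H) * mask
        H *= (W.T @ R_obs) / (W.T @ WH + eps)
    return W, H

# Synthetic low-rank, non-negative "ratings" with half of the entries hidden
rng = np.random.default_rng(1)
R_full = rng.uniform(0.5, 1.5, size=(30, 2)) @ rng.uniform(0.5, 1.5, size=(2, 20))
mask = rng.random(R_full.shape) < 0.5

W, H = masked_nmf(R_full, mask)
pred_full = W @ H

# Evaluate on the hidden entries only, against the "predict zero" reference
hidden = ~mask
rmse_masked = np.sqrt(np.mean((pred_full[hidden] - R_full[hidden]) ** 2))
rmse_zero = np.sqrt(np.mean(R_full[hidden] ** 2))
print(f'masked NMF RMSE on hidden entries: {rmse_masked:.3f}')
print(f'all-zero prediction RMSE on hidden entries: {rmse_zero:.3f}')
```

Other options in the same spirit include better imputation before calling `NMF` (as with the item averages above) or subtracting user and item biases before factorizing; which one works best on the movie data would need to be checked empirically.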