# Limitations of NMF
This notebook follows a similar structure to `week4`: 
- Load the movie ratings data (as in the HW3-recommender-system)
- Use matrix factorization technique(s) and predict the missing ratings from the test data
- Measure the RMSE using `sklearn` 

# Setup

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error as mse
from sklearn.decomposition import NMF

import warnings
warnings.filterwarnings('ignore')

## Load Data

In [2]:
MV_users = pd.read_csv('data/w3/users.csv')
MV_movies = pd.read_csv('data/w3/movies.csv')
train = pd.read_csv('data/w3/train.csv')
test = pd.read_csv('data/w3/test.csv')

## EDA
Here, we will take a look at the movie data and how it is constructed:
- Shape
- Data construction

In [3]:
MV_users.shape, MV_movies.shape, train.shape, test.shape

((6040, 5), (3883, 21), (700146, 3), (300063, 3))

In [4]:
train.head()

,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


## Model Construction
- Feature engineer data to better fit the model structure
- Replace `NaN` values
- Create the `impute()`, `pred()`, and `rmse()` functions

In [5]:
ratings = train.pivot(index='uID', columns='mID', values='rating')
print(ratings.shape)
ratings.head()
# (6040, 3664)

(6040, 3664)


uID / mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,2.0,,,,,...,,,,,,,,,,
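Most cells in this pivot table are empty, which is why the imputation step that follows has such a large effect. A minimal sketch of how the fill rate could be measured, using a small made-up `toy` frame in place of `train` (the helper name `density` is an assumption and is not part of the notebook):

```python
import pandas as pd

def density(long_df):
    """Fraction of cells in the user x movie table that hold an observed rating."""
    wide = long_df.pivot(index='uID', columns='mID', values='rating')
    return float(wide.notna().to_numpy().mean())

# Made-up stand-in for `train`: 3 users x 3 movies with 4 observed ratings.
toy = pd.DataFrame({'uID': [1, 1, 2, 3],
                    'mID': [10, 20, 20, 30],
                    'rating': [5, 3, 4, 2]})
print(f'density: {density(toy):.3f}')  # 4 of 9 cells are filled
```

For the shapes printed above, 700,146 training ratings in a 6040 x 3664 table works out to roughly 3% of cells observed, so about 97% of the matrix is determined by whichever imputation rule is chosen.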


In [6]:
missing_locs = np.isnan(ratings)
ratings[missing_locs] = 0
ratings.head()

uID / mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
def impute(ratings, val):
    """Fill in missing values in ratings with val.

    Args:
        ratings (df): DataFrame of user ratings.
        val (int): Value to fill in missing ratings.

    Returns:
        ratings: DataFrame with missing values filled in.
    """
    ratings[np.isnan(ratings)] = val
    return ratings

In [19]:
def pred(train, test, impute_val=0, impute=impute, k=5):
    """Predict ratings using NMF.

    Args:
        train: training data with columns ['uID', 'mID', 'rating']
        test: testing data with columns ['uID', 'mID', 'rating']
        impute_val: value to replace NaN values
        impute: impute function, called as impute(ratings, impute_val)
        k: number of latent features

    Returns:
        result: DataFrame with columns ['uID', 'mID', 'rating', 'pred_rating']
    """
    # Build the user x movie matrix and fill the unobserved cells.
    ratings = train.pivot(index='uID', columns='mID', values='rating')
    ratings = impute(ratings, impute_val)

    nmf = NMF(n_components=k, init='nndsvda', solver='mu', beta_loss='kullback-leibler',
              alpha_H=0.1, alpha_W=0.1, max_iter=200, random_state=0)
    W1 = nmf.fit_transform(ratings)
    H1 = nmf.components_
    print(f'reconstruction_err_: {nmf.reconstruction_err_}')

    # Reconstruct the full matrix, then reshape to one row per (uID, mID) pair.
    pred_matrix = W1 @ H1
    df = pd.DataFrame(data=pred_matrix,
                      index=ratings.index.values,
                      columns=ratings.columns.values)
    df['uID'] = df.index.values
    df = pd.melt(df, id_vars=['uID'], var_name='mID', value_name='pred_rating')

    # Inner join keeps only test pairs whose user and movie both appear in train.
    result = df.merge(test, on=['uID', 'mID'], how='inner')
    return result[['uID', 'mID', 'rating', 'pred_rating']]
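The reshape-and-join step at the end of `pred()` is easier to follow on a tiny example. The sketch below uses a made-up 2 x 2 matrix in place of `W1 @ H1` and a made-up `toy_test` frame; the explicit `astype` cast is a defensive assumption added here and is not in the notebook. It also shows a side effect of the inner join: a test row for a movie that never appeared in training is dropped and therefore never scored.

```python
import numpy as np
import pandas as pd

# Made-up 2 x 2 prediction matrix standing in for W1 @ H1.
wide = pd.DataFrame(np.array([[4.0, 2.5], [3.0, 1.0]]),
                    index=[1, 2], columns=[10, 20])
wide['uID'] = wide.index.values

# Wide -> long: one row per (uID, mID) pair.
long = pd.melt(wide, id_vars=['uID'], var_name='mID', value_name='pred_rating')
long['mID'] = long['mID'].astype('int64')  # make the join key dtype explicit

# Toy test set; movie 30 has no column in the prediction matrix.
toy_test = pd.DataFrame({'uID': [1, 2, 2],
                         'mID': [20, 10, 30],
                         'rating': [3, 4, 5]})
merged = long.merge(toy_test, on=['uID', 'mID'], how='inner')
print(merged)
print(f'test rows scored: {len(merged)} of {len(toy_test)}')
```

Whether the real `test` set contains such unseen users or movies is not shown in this notebook; comparing `len(result)` with `len(test)` would reveal it.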
 

In [20]:
def rmse(df):
    """Calculate root mean squared error.

    Args:
        df (result[['uID', 'mID', 'rating', 'pred_rating']]): predictions

    Returns:
        rmse: root mean squared error
    """
    return np.sqrt(mse(df['rating'].values, df['pred_rating'].values))
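As a sanity check on what `rmse()` computes, the same quantity can be written out directly with NumPy. The three ratings and predictions below are made up for illustration:

```python
import numpy as np

# Made-up ratings and predictions, not taken from the notebook's data.
y_true = np.array([4, 4, 3])
y_pred = np.array([3.0, 3.0, 3.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
toy_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f'{toy_rmse:.4f}')  # sqrt(2/3)
```

Because the errors are squared before averaging, a few large misses (such as predicting near 0 for a rating of 4) dominate the score.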

# Predict
Impute the missing values in the training data with zeros, train NMF, and predict the test ratings.

In [21]:
predictions = pred(train, test)
predictions.head()

reconstruction_err_: 3207.2273812257104


,uID,mID,rating,pred_rating
0,6,1,4,0.28822
1,8,1,4,0.713572
2,21,1,3,0.084568
3,23,1,4,1.186557
4,26,1,3,0.852577


In [22]:
print(f'RMSE: {rmse(predictions)}')

RMSE: 3.2642937971607076


Set every prediction to 3 (the midpoint of the 1-5 rating scale) as a constant baseline.

In [12]:
predictions['pred_rating'] = 3
print(f'RMSE: {rmse(predictions)}')

RMSE: 1.2585673019351262
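A constant of 3 ignores what the training data says about typical ratings. As a hedged sketch of two slightly more informed baselines, the code below predicts the global training mean and the per-movie training mean on made-up `toy_train` and `toy_test` frames (these frames and the helper `rmse_of` are assumptions for illustration, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Made-up data with the same column layout as train/test.
toy_train = pd.DataFrame({'uID': [1, 1, 2, 3],
                          'mID': [10, 20, 10, 20],
                          'rating': [5, 3, 4, 2]})
toy_test = pd.DataFrame({'uID': [2, 3],
                         'mID': [20, 10],
                         'rating': [3, 5]})

def rmse_of(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

global_mean = toy_train['rating'].mean()                # one number for every pair
movie_mean = toy_train.groupby('mID')['rating'].mean()  # one number per movie

baseline = toy_test.copy()
baseline['pred_global'] = global_mean
# Movies unseen in training fall back to the global mean.
baseline['pred_movie'] = baseline['mID'].map(movie_mean).fillna(global_mean)

rmse_global = rmse_of(baseline['rating'], baseline['pred_global'])
rmse_movie = rmse_of(baseline['rating'], baseline['pred_movie'])
print(f'global-mean RMSE: {rmse_global:.3f}')
print(f'movie-mean RMSE:  {rmse_movie:.3f}')
```

On the real data these numbers would have to be measured; the toy values only show the mechanics.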


Fill in the missing values with the mean of each column (each movie's average observed rating).

In [16]:
def impute_missing(ratings, impute_val=None):
    """Impute missing values with the mean of the column.

    Args:
        ratings (df): pivot table of ratings built from train
        impute_val (None): unused; kept so the signature matches impute()

    Returns:
        ratings: df with missing values replaced by each column's mean
    """
    mean = ratings.mean(axis=0)  # per-movie mean; NaN entries are skipped
    ratings.fillna(mean, inplace=True)
    return ratings

In [18]:
pred_df = pred(train, test, impute = impute_missing, impute_val = None)
print(f'RMSE: {rmse(pred_df)}')

reconstruction_err_: 940.3654037938592
RMSE: 1.0373524255726359
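The gap between the two runs above (RMSE 3.26 with zeros versus 1.04 with column means) can be reproduced in miniature. The sketch below is not the notebook's `pred()`: it uses made-up synthetic ratings, a smaller matrix, and `NMF` with its default Frobenius loss, so only the direction of the comparison carries over. NMF treats every filled cell as real data, so zero-filled cells pull the reconstruction toward 0 on exactly the cells that are later scored.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_users, n_movies, k = 60, 40, 3

# Made-up "true" ratings, scaled into the 1-5 range, from rank-k non-negative factors.
true = rng.random((n_users, k)) @ rng.random((k, n_movies))
true = 1 + 4 * true / true.max()

# Observe roughly 10% of cells, plus one guaranteed rating per movie so
# that every column mean is defined. The remaining cells are held out.
observed = rng.random(true.shape) < 0.10
observed[rng.integers(n_users, size=n_movies), np.arange(n_movies)] = True

def fit_predict(filled):
    """Factorize a fully filled matrix and return its reconstruction W @ H."""
    model = NMF(n_components=k, init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(filled)
    return W @ model.components_

def holdout_rmse(pred):
    """RMSE over the cells that were hidden from the model."""
    return float(np.sqrt(np.mean((true[~observed] - pred[~observed]) ** 2)))

zero_filled = np.where(observed, true, 0.0)
col_means = np.nanmean(np.where(observed, true, np.nan), axis=0)
mean_filled = np.where(observed, true, col_means)  # broadcasts one mean per column

rmse_zero = holdout_rmse(fit_predict(zero_filled))
rmse_mean = holdout_rmse(fit_predict(mean_filled))
print(f'zero-imputed RMSE: {rmse_zero:.3f}')
print(f'mean-imputed RMSE: {rmse_mean:.3f}')
```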


# Results

The best RMSE from sklearn's NMF model is 1.037, obtained with column-mean imputation. That beats the constant-3 baseline (1.259) and is far better than zero imputation (3.264), but it is still high. I think that the custom functions, which captured the relationships better using the cosine or Jaccard similarity functions, were instrumental in creating a more accurate model.

Ways to improve the RMSE include a better imputation method and better preprocessing when constructing the matrix. Using a constant 3 is a workable baseline, since 3 is the midpoint of the 1 to 5 scale, but it doesn't capture the observed averages of the columns. Imputing with the column averages is a much better method, as demonstrated above. I also believe that predictions that depend on multiple columns require a more complex and/or custom model to capture the relationships, which is why sklearn's NMF model did not score as well as our custom Week 3 model.