
# Compare NMF movie rating prediction with similarity-matrix prediction approach

Author: Anthony Lee

## Overview
This short notebook compares the performance (measured by RMSE) of predicting movie ratings with non-negative matrix factorization (NMF) against that of predicting them with a similarity matrix.

Spoiler: NMF does worse than the similarity-matrix approach in terms of RMSE.

## Outline

- [Helper functions](#Helper-functions)
- [Download and process data](#Download-and-process-data)
- [NMF to predict ratings](#Use-NMF-to-predict-ratings)
- [Conclusion](#Conclusion)
- [Future improvements](#Future-improvements)
- [References](#References)

## Helper functions

These functions are used throughout the notebook. Instead of clogging up the cells, they are gathered here to make the notebook easier to read.

In [1]:
import requests
import zipfile
import pandas as pd
import scipy as sp
from sklearn.decomposition import NMF
import numpy as np
from sklearn.model_selection import train_test_split

## Helper function
def parse_genre_to_long(dataframe) -> pd.DataFrame:
    """Splits off a string of genres separated by vertical pipe.
    
    This function is a convenience function that splits off the string of genres
    and returns a dataframe where each cell in the 'genre' column is atomic (with
    only one value.)
    """
    holder_movieId = []
    holder_movieName = []
    holder_movieGenre = []
    
    for idx in range(dataframe.shape[0]):
        row = dataframe.iloc[idx, :]
        genres = row['movieGenre'].split('|')
        for genre in genres: 
            holder_movieId.append(row.movieId)
            holder_movieName.append(row.movieName)
            holder_movieGenre.append(genre)
        
    return pd.DataFrame({
        'movieId': holder_movieId, 
        'movieName': holder_movieName, 
        'movieGenre': holder_movieGenre,
    })

def download_and_unzip_files() -> None: 
    """Downloads the data from the GroupLens source and then unzips it.
    
    This function groups all the steps together. Since the url is not expected to change,
    there is no need to adjust to different urls.
    """
    groupLensUrl = 'https://files.grouplens.org/datasets/movielens/ml-1m.zip'
    zipPath = "/kaggle/working/1M_subset.zip"
    extractToPath = "/kaggle/working/"

    ## Download first, and fail on an HTTP error instead of saving an error page as the zip file
    response = requests.get(groupLensUrl)
    response.raise_for_status()
    with open(zipPath, "wb") as file: 
        file.write(response.content)

    with zipfile.ZipFile(zipPath, "r") as file: 
        file.extractall(extractToPath)
    return

def process_and_clean_the_data() -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """The data processing and cleaning code.
    
    This function groups all the processing into one step. Since the data is not expected to 
    change, the processing and cleaning steps are not expected to change either.
    Having all the code here makes things easier to read.
    """

    path_movies = '/kaggle/working/ml-1m/movies.dat'
    path_ratings = "/kaggle/working/ml-1m/ratings.dat"
    path_users = "/kaggle/working/ml-1m/users.dat"
    
    ## Conversion table from the README file
    ageGroupLabel = {
        '1': "Under 18",
        '18': "18-24",
        '25': "25-34",
        '35': "35-44",
        '45': "45-49",
        '50': "50-55",
        '56': "56+",
    }
    occupationLabel = {
        '0':  "other",
        '1':  "academic/educator",
        '2':  "artist",
        '3':  "clerical/admin",
        '4':  "college/grad student",
        '5':  "customer service",
        '6':  "doctor/health care",
        '7':  "executive/managerial",
        '8':  "farmer",
        '9':  "homemaker",
        '10':  "K-12 student",
        '11':  "lawyer",
        '12':  "programmer",
        '13':  "retired",
        '14':  "sales/marketing",
        '15':  "scientist",
        '16':  "self-employed",
        '17':  "technician/engineer",
        '18':  "tradesman/craftsman",
        '19':  "unemployed",
        '20':  "writer",
    }
    
    ## Understand what the file formats are like
    ## NOTE: The files do not have a header and are delimited by `::`
    with open(path_movies, 'rb') as file: 
        lines = file.readlines()
        print(" Preview the dat files ".center(80, "#"))
        print(lines[:3])  # Shows the lack of a header and the `::` delimiter
        print("".center(80, "#"))

    ## Read the movies data
    ## NOTE: header=None keeps the first row as data; names= supplies the column labels
    df_movies = pd.read_table(path_movies, sep='::', header=None, names=['movieId', 'movieName', 'movieGenre'], encoding='windows-1250', engine='python')
    df_movies = df_movies.astype({"movieId": int, "movieName": str})
    df_movies = parse_genre_to_long(df_movies)  # Convert to long format
    df_movies = (
        df_movies
        .pivot(index=['movieId', 'movieName'], columns='movieGenre', values='movieGenre')
        .notna()       # True where the movie has the genre
        .astype(int)   # 1/0 indicator columns
        .reset_index(drop=False)
        .set_index("movieId")
    )

    ## Read the ratings data
    df_ratings = pd.read_table(path_ratings, sep='::', header=None, names=['userId', 'movieId', 'rating', 'timestamp'], encoding='windows-1250', engine='python')
    df_ratings = df_ratings.astype({"userId": int, "movieId": int, "rating": int})
    df_ratings['timestamp'] = pd.to_datetime(df_ratings['timestamp'], unit='s', utc=True)  # Unix seconds to timezone-aware datetimes
    
    
    ## Read the users data
    df_users = pd.read_table(path_users, sep='::', header=None, names=['userId', 'gender', 'ageGroup', 'occupationLabel', 'zipCode'], encoding='windows-1250', engine='python')
    df_users['zipCode'] = df_users['zipCode'].astype(str).str[:5]  # Exclude the additional digits in zip codes
    df_users['occupationLabel'] = df_users['occupationLabel'].apply(lambda x: occupationLabel[str(x)])
    df_users['ageGroup'] = df_users['ageGroup'].apply(lambda x: ageGroupLabel[str(x)])
    df_users = df_users.astype({"userId": int, "zipCode": int})
    df_users = df_users.set_index('userId')
    
    returnTuple = (df_movies, df_ratings, df_users)
    for df in returnTuple: 
        display(df.head())

    return returnTuple


## Download and process data

The data is downloaded from the GroupLens website, and we are using the 1-million-entry subset.
There are additional datasets, including a synthetic dataset, that can be used for further testing.
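
The `.dat` files have no header row and use `::` as the delimiter. As a minimal sketch of reading that layout, the snippet below parses an in-memory sample instead of the real `movies.dat`; the `sample`, `movies`, and `genre_flags` names are illustrative assumptions, not part of the notebook's helper functions.

```python
import io

import pandas as pd

# Three rows in the same layout as movies.dat (no header, '::' delimiter).
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
    "3::Grumpier Old Men (1995)::Comedy|Romance\n"
)

# header=None keeps the first row as data; names= supplies the column labels.
# A multi-character separator requires the Python parsing engine.
movies = pd.read_csv(
    sample,
    sep="::",
    header=None,
    names=["movieId", "movieName", "movieGenre"],
    engine="python",
)

# One 0/1 indicator column per genre, split on the vertical pipe.
genre_flags = movies["movieGenre"].str.get_dummies(sep="|")
print(movies[["movieId", "movieName"]])
print(genre_flags)
```

Passing `header=None` avoids treating the first record as column names, so no row has to be recovered afterwards.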

In [2]:
## Download and clean the data
download_and_unzip_files()
df_movies, df_ratings, df_users = process_and_clean_the_data()

############################ Preview the dat files #############################
[b"1::Toy Story (1995)::Animation|Children's|Comedy\n", b"2::Jumanji (1995)::Adventure|Children's|Fantasy\n", b'3::Grumpier Old Men (1995)::Comedy|Romance\n']
################################################################################


movieId,movieName,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Jumanji (1995),0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,Grumpier Old Men (1995),0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Waiting to Exhale (1995),0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,Father of the Bride Part II (1995),0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


,userId,movieId,rating,timestamp
0,1,1193,5,2000-12-31 22:12:40+00:00
1,1,661,3,2000-12-31 22:35:09+00:00
2,1,914,3,2000-12-31 22:32:48+00:00
3,1,3408,4,2000-12-31 22:04:35+00:00
4,1,2355,5,2001-01-06 23:38:11+00:00


userId,gender,ageGroup,occupationLabel,zipCode
1,F,Under 18,K-12 student,48067
2,M,56+,self-employed,70072
3,M,25-34,scientist,55117
4,M,45-49,executive/managerial,2460
5,M,25-34,writer,55455


In [3]:
# ## Create a mapping of movieId to movieName - For later work
# mapping_movieId_to_name = {namedtuple[0]:namedtuple[1] for namedtuple in df_movies.itertuples()}
# ## Create a mapping of index to movie genre - For later work
# mapping_index_to_genre = {idx:colname for (idx, colname) in enumerate(df_movies.columns)}

## Use NMF to predict ratings

Here I factorize the training rating matrix using sklearn's NMF, and then use the transformed matrix along with the factorization matrix to predict the ratings that users in the holdout test set gave to various movies. 

The result is then compared to a baseline model (similarity matrix) built in a separate notebook. RMSE is the metric used for the comparison, and I consider why NMF performs better or worse than the baseline model.

Spoiler: NMF performed worse than the baseline model.
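
The factorize, reconstruct, and score steps can be sketched on a toy matrix. This is only an illustration under stated assumptions: the `train` and `test` values are made up, and `n_components=2`, `max_iter=1000`, and `random_state=0` are arbitrary choices rather than the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user-by-movie rating matrix; 0 marks "not rated", as in this notebook.
train = np.array([
    [5, 4, 0, 1],
    [4, 0, 1, 1],
    [1, 1, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Held-out ratings for cells that are 0 in `train` (hypothetical values).
test = np.array([
    [0, 0, 1, 0],
    [0, 5, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Factorize train ~ W @ H with non-negative W (users) and H (movies).
model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(train)
H = model.components_

# Reconstruct the full matrix and score only the held-out cells.
prediction = W @ H
held_out = test > 0
rmse = np.sqrt(np.mean((prediction[held_out] - test[held_out]) ** 2))
print(f"RMSE on held-out cells: {rmse:.3f}")
```

Scoring only the cells where `test > 0` plays the same role as the masked array used later in the notebook.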

In [4]:
## Train test split
trainSet, testSet = train_test_split(df_ratings, test_size=0.3, shuffle=True, random_state=550)

## Create rating matrix - (Each row is a user, each column is a movie)
## NOTE: Both the userId and movieId start at 1, so they are offset by one relative to the row-idx and col-idx.
trainSet_rating_matrix_sparse = sp.sparse.csc_array((trainSet.rating, (trainSet.userId, trainSet.movieId)))
trainSet_rating_matrix = trainSet_rating_matrix_sparse.toarray()
trainSet_rating_matrix = trainSet_rating_matrix[1:, 1:]  # Drop the first row and first column because the userId and movieId both start at 1 instead of 0

## Preview the rating matrix
print( f"Shape of the trainSet rating matrix: {trainSet_rating_matrix.shape}", end='\n\n' )
print( f"trainSet_rating_matrix preview: \n{trainSet_rating_matrix}", end='\n\n' )


Shape of the trainSet rating matrix: (6040, 3952)

trainSet_rating_matrix preview: 
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]



In [5]:
## Getting the transformed data and factorization matrix (sometimes called 'dictionary')
## NOTE: NMF approximates A ≈ W H, where W is the 'transformed matrix' and H is the 'factorization matrix'

nmf = NMF(n_components=5)
matrix_transformed = nmf.fit_transform(trainSet_rating_matrix)
matrix_factorization = nmf.components_

In [6]:
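## Create the testing rating matrix
## NOTE: Reusing the training shape keeps both matrices aligned even if the test set lacks the highest userId or movieId
testSet_rating_matrix_sparse = sp.sparse.csc_array((testSet.rating, (testSet.userId, testSet.movieId)), shape=trainSet_rating_matrix_sparse.shape)
testSet_rating_matrix = testSet_rating_matrix_sparse.toarray()
testSet_rating_matrix = np.ma.masked_where(testSet_rating_matrix == 0, testSet_rating_matrix)  # Mask the unrated entries so they are excluded from the RMSE
testSet_rating_matrix = testSet_rating_matrix[1:, 1:]  # Remove the first row and first column because the userId and movieId start at 1 instead of 0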

## Predict the ratings
prediction = np.matmul(matrix_transformed, matrix_factorization)
rmse = np.sqrt(np.mean(np.power(prediction - testSet_rating_matrix, 2)))

## Printing the results
print( f"The baseline RMSE by using a similarity matrix: 1.2642784503423288" )
print( f"The NMF predicted RMSE: {rmse}" )  # Approximately 2.989; the trailing digits vary slightly between runs

## NOTE:
## Here we see that the RMSE from using NMF's transformed matrix and factorization matrix to predict
## the holdout test set is greater than that from using a similarity matrix to predict each user's
## rating on each movie.
## One possible reason for the difference is that NMF, being non-negative, is an additive model. Being
## an additive model, it learns by parts as opposed to taking a holistic approach. A similarity matrix
## could be considered a holistic approach because it compares EVERY pairwise similarity, and thus is a
## much denser coefficient matrix than the factorization matrix from NMF.


The baseline RMSE by using a similarity matrix: 1.2642784503423288
The NMF predicted RMSE: 2.9889505812139547


## Conclusion

With all the hype generated by the Netflix Prize winning team's publication (Koren et al., 2009), I thought matrix factorization would perform much better. However, compared with the similarity-matrix baseline (RMSE = 1.2642784503423288), the NMF result (RMSE of about 2.99) was not what I would expect from the technique the Netflix Prize winning team used as the basis of their approach. In the winning team's defense, they refined their model using a lot more than just basic matrix factorization.

Additionally, scikit-learn's Non-negative Matrix Factorization (NMF) (Sklearn.decomposition.NMF, n.d.) is more restrictive than the method used by the Netflix Prize winning team. Most important is the non-negativity constraint of NMF, which forces it to "learn the parts of the [data]" (Lee & Seung, 1999). In contrast, similarity matrices, as used in the RMSE baseline and by the Netflix Prize winning team, create a much denser representation instead of just "learning the parts." The way this notebook builds the rating matrix likely contributes as well: unrated movies are stored as zeros, and NMF fits those zeros as if they were actual ratings, which pulls the predictions toward zero.
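
To make the contrast concrete, here is a minimal sketch of one common similarity-matrix predictor: user-based cosine similarity with a similarity-weighted average. It is not the baseline implementation referenced in this notebook, which is not included here; the toy `ratings` matrix and every variable name are assumptions for illustration.

```python
import numpy as np

# Toy user-by-movie ratings; 0 marks "not rated" (illustrative values).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between every pair of users (a dense 4 x 4 matrix).
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
unit = ratings / norms
similarity = unit @ unit.T
np.fill_diagonal(similarity, 0.0)  # a user does not vote on their own rating

# Predicted rating = similarity-weighted average over the users who rated the movie.
rated = (ratings > 0).astype(float)
weighted_sum = similarity @ ratings
weight_total = similarity @ rated
prediction = np.divide(
    weighted_sum,
    weight_total,
    out=np.zeros_like(weighted_sum),
    where=weight_total > 0,
)
print(np.round(prediction, 2))
```

Every user contributes to every other user's prediction through `similarity`, which is the dense, pairwise structure the paragraph above contrasts with NMF's low-rank factors.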

The "parts based representation" (Lee & Seung, 1999) has a lot of benefits, specifically the handling of sparse data and better interpretability. In contrast, similar methods such as Principal Components Analysis (PCA) and Vector Quantization (VQ) have encoding matrices that are much harder to interpret or are non-intuitive (Lee & Seung, 1999).

## Future improvements

This analysis could be improved upon in the following ways: 

- Include the implementation of the similarity-matrix approach (which currently resides offline [somewhere on my machine...])
- Test different NMF hyperparameters, including the number of components (the size of the latent dimension). Currently, the notebook just uses the convenient n_components=5.
- Test the different initialization algorithms of the NMF. There could be one that is better suited to a sparse rating matrix and that, instead of putting most of the zeros into the factorization matrix, balances the zeros between the transformed matrix and the factorization matrix.
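
The hyperparameter and initialization tests in the list above could start from a small sweep like the sketch below. It is only an outline under stated assumptions: the data is synthetic rather than MovieLens, and the grid values, `max_iter=500`, and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic ratings from 1 to 5, with about 30% of cells hidden (set to 0)
# to stand in for unrated movies.
rng = np.random.default_rng(0)
dense = rng.integers(1, 6, size=(30, 20)).astype(float)
observed = rng.random(dense.shape) < 0.7
train = np.where(observed, dense, 0.0)

results = {}
for n_components in (2, 5, 10):
    for init in ("random", "nndsvd", "nndsvda"):
        model = NMF(n_components=n_components, init=init,
                    max_iter=500, random_state=0)
        W = model.fit_transform(train)
        H = model.components_
        # Score on the cells that were hidden from training.
        error = (W @ H - dense)[~observed]
        results[(n_components, init)] = np.sqrt(np.mean(error ** 2))

for (n_components, init), rmse in sorted(results.items()):
    print(f"n_components={n_components:>2}  init={init:<8}  RMSE={rmse:.3f}")
```

On the real data, the same loop would use the training and test rating matrices in place of `train` and `dense`.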

## References

- Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30–37. https://doi.org/10.1109/MC.2009.263
- Sklearn.decomposition.NMF. (n.d.). Scikit-Learn. Retrieved May 16, 2024, from https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
- Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791. https://doi.org/10.1038/44565


