# Movie Ratings

Chris Murphy 06/17/2024

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies 
made by 6,040 MovieLens users who joined MovieLens in 2000.

The link to the dataset can be found here: https://www.kaggle.com/datasets/odedgolden/movielens-1m-dataset/data

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

## Data import

In [None]:
MV_users = pd.read_csv('data/movies/users.csv')
MV_movies = pd.read_csv('data/movies/movies.csv')
train = pd.read_csv('data/movies/train.csv')
test = pd.read_csv('data/movies/test.csv')

In [None]:
train.head()

In [None]:
train.info()

### Building NMF model on our training dataset

In [None]:
# Fit NMF on the training table; fit_transform returns the transformed matrix W
nmf = NMF(random_state=42)
nmf_train = nmf.fit_transform(train)
pd.DataFrame(nmf_train).head()

In [None]:
# Use the index of the largest component in each row as the predicted rating
pred = np.argmax(nmf_train, axis=1)
pred

In [None]:
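# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
rms = np.sqrt(mean_squared_error(train['rating'], pred))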
print(f'The RMSE of the NMF model on the training data is: {rms}')

### Building NMF model on our testing dataset

In [None]:
# Note: fit_transform refits the model on the test data rather than reusing the training fit
nmf_test = nmf.fit_transform(test)
pd.DataFrame(nmf_test).head()

In [None]:
test_pred = np.argmax(nmf_test, axis=1)

In [None]:
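# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
rms = np.sqrt(mean_squared_error(test['rating'], test_pred))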
print(f'The RMSE of the NMF model on the testing data is: {rms}')

## Discussion

#### Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. 

Fitting and evaluating the NMF model on the training and testing datasets returned very high RMSEs of 2.713 and 2.674, respectively. These results are poor because sklearn's non-negative matrix factorization is not well suited to this type of data. To start, sklearn's NMF struggles with sparse datasets: it has no notion of a missing rating, so every unrated user-movie pair has to be filled with some value (typically zero) that the model then tries to reproduce. The training dataset has 700,148 rows, and although each user has at least 20 recorded ratings, many users have rated only a small share of the roughly 3,900 movies. With 6,040 users, a full user-by-movie matrix would have about 23.6 million cells, so the training ratings fill only around 3% of it. It is hard for the model to decompose such a large, mostly empty matrix, and this is most likely the main issue the model encountered.
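The sparsity argument above can be checked directly. The snippet below is a minimal sketch on toy data, not the notebook's actual files: the column names `user_id`, `movie_id`, and `rating` are assumptions and would need to be swapped for whatever `train.csv` really uses.

```python
import pandas as pd

# Toy long-format ratings; the column names are hypothetical stand-ins
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3],
    "movie_id": [10, 20, 10, 20, 30, 40],
    "rating":   [5, 3, 4, 2, 5, 1],
})

n_users = ratings["user_id"].nunique()
n_movies = ratings["movie_id"].nunique()

# Fraction of user-movie cells that actually hold a rating
density = len(ratings) / (n_users * n_movies)

# Pivot to a user x movie matrix; unrated cells show up as NaN
user_item = ratings.pivot(index="user_id", columns="movie_id", values="rating")

print(f"{n_users} users x {n_movies} movies, density = {density:.2%}")
print(user_item)
```

Running the same calculation on the real training table would show how much of the matrix NMF is being asked to fill in.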

A related issue is the cold start problem: the model has little or nothing to learn from for users or movies with very few ratings. NMF assumes that past ratings are indicative of future ratings, which is not necessarily the case. For example, if a user rates one comedy highly, it is not guaranteed that they will rate every other comedy as highly. And because the data is sparse to begin with, a user may have rated only a single comedy, in which case the model has too little information to learn a meaningful factor for that user.
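To gauge how much cold start matters, one could count ratings per user and flag test rows whose user or movie never appears in training. This is an illustrative sketch on toy frames; `train_toy`, `test_toy`, and the column names are hypothetical and not taken from the notebook's data.

```python
import pandas as pd

# Toy train/test splits; names and columns are assumptions for illustration
train_toy = pd.DataFrame({"user_id": [1, 1, 2],
                          "movie_id": [10, 20, 10],
                          "rating": [5, 3, 4]})
test_toy = pd.DataFrame({"user_id": [2, 3],
                         "movie_id": [20, 30],
                         "rating": [4, 2]})

# How many training ratings each user contributes
ratings_per_user = train_toy.groupby("user_id").size()

# Test rows whose user or movie was never seen during training
cold_users = ~test_toy["user_id"].isin(train_toy["user_id"])
cold_movies = ~test_toy["movie_id"].isin(train_toy["movie_id"])
cold_rows = test_toy[cold_users | cold_movies]

print(ratings_per_user)
print(f"{len(cold_rows)} of {len(test_toy)} test rows involve an unseen user or movie")
```

A factorization model has no learned factors for the flagged rows, so a fallback such as a global or per-movie mean is usually needed for them.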

#### Can you suggest a way(s) to fix it?

There are a couple of ways to improve the RMSE of our NMF model. One is to preprocess the data to better handle its sparsity, for example with singular value decomposition (SVD), which was covered in week 4. SVD factors the ratings matrix and keeps only the components with the largest singular values, which capture most of the variance in the data. This reduces the dimensionality of the data, which should allow the NMF model to perform better.
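One way the SVD idea might look in practice is sketched below. It is an assumption-laden illustration rather than the notebook's pipeline: it uses scikit-learn's `TruncatedSVD` directly on a mean-centered sparse user-by-movie matrix and predicts from the low-rank reconstruction, instead of feeding the result into NMF. The data is synthetic and random, so the resulting RMSE says nothing about MovieLens; `user_idx` and `movie_idx` are hypothetical zero-based index columns.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Synthetic ratings with hypothetical zero-based user/movie indices
n_users, n_movies = 50, 40
ratings = pd.DataFrame({
    "user_idx": rng.integers(0, n_users, size=600),
    "movie_idx": rng.integers(0, n_movies, size=600),
    "rating": rng.integers(1, 6, size=600),
}).drop_duplicates(["user_idx", "movie_idx"])

# Simple random 80/20 split
in_train = rng.random(len(ratings)) < 0.8
train_toy, test_toy = ratings[in_train], ratings[~in_train]

# Center observed ratings so that an empty cell (0) means "equal to the mean"
global_mean = train_toy["rating"].mean()
centered = csr_matrix(
    (train_toy["rating"].to_numpy() - global_mean,
     (train_toy["user_idx"].to_numpy(), train_toy["movie_idx"].to_numpy())),
    shape=(n_users, n_movies),
)

# Keep only the top singular components
svd = TruncatedSVD(n_components=5, random_state=42)
user_factors = svd.fit_transform(centered)   # shape (n_users, 5)
movie_factors = svd.components_              # shape (5, n_movies)

# Low-rank reconstruction, shifted back to the rating scale and clipped to 1..5
reconstructed = np.clip(user_factors @ movie_factors + global_mean, 1, 5)

# Look up the predicted rating for each held-out (user, movie) pair
test_pred = reconstructed[test_toy["user_idx"].to_numpy(),
                          test_toy["movie_idx"].to_numpy()]
rmse = np.sqrt(np.mean((test_toy["rating"].to_numpy() - test_pred) ** 2))
print(f"RMSE on the synthetic hold-out: {rmse:.3f}")
```

The key difference from the cells above is that predictions come from the reconstructed user-by-movie matrix, looked up per test pair, and that the model is fitted on training ratings only.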