## Movie Ratings & Matrix Factorization

### Section 1: Matrix Factorization Techniques

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine
from sklearn.decomposition import NMF

In [3]:
# ratings.dat is laid out as UserID::MovieID::Rating::Timestamp
ratings = pd.read_csv('data/ratings.dat', sep='::', header=None, names=['uID', 'mID', 'Rating', 'timestamp'], engine='python')
users = pd.read_csv('data/users.dat', sep='::', header=None, names=['uID', 'gender', 'age', 'occupation', 'zip'], engine='python')
movies = pd.read_csv('data/movies.dat', sep='::', header=None, names=['mID','title','Genres'], engine='python')

In [4]:
print(ratings.head())
print(users.head())
print(movies.head())

   uID   mID  Rating  timestamp
0    1  1193       5  978300760
1    1   661       3  978302109
2    1   914       3  978301968
3    1  3408       4  978300275
4    1  2355       5  978824291
   uID gender  age  occupation    zip
0    1      F    1          10  48067
1    2      M   56          16  70072
2    3      M   25          15  55117
3    4      M   45           7  02460
4    5      M   25          20  55455
   mID                               title                        Genres
0    1                    Toy Story (1995)   Animation|Children's|Comedy
1    2                      Jumanji (1995)  Adventure|Children's|Fantasy
2    3             Grumpier Old Men (1995)                Comedy|Romance
3    4            Waiting to Exhale (1995)                  Comedy|Drama
4    5  Father of the Bride Part II (1995)                        Comedy


In [5]:
# Data preparation: collect the distinct genres in first-seen order
genres = []
for genre_str in movies['Genres']:
    for g in genre_str.split('|'):
        if g not in genres:
            genres.append(g)

# Manually build a multi-hot encoding: one slot per genre, 1 where the movie has that genre
encodings = []
for genre_str in movies['Genres']:
    encoding = np.zeros(len(genres))
    for g in genre_str.split('|'):
        encoding[genres.index(g)] = 1
    encodings.append(encoding)

movie_data = movies[["mID"]].copy()
movie_data['genre_encoding'] = encodings
movie_data.head()

   mID                                     genre_encoding
0    1  [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1    2  [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...
2    3  [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...
3    4  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...
4    5  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
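The manual encoding loop above can also be expressed with pandas. The following is a minimal sketch on a toy table, not the notebook's own method; `movies_demo` and `genre_dummies` are illustrative names, and the three rows are copied from the `movies.head()` output. Note that `Series.str.get_dummies` returns columns in sorted order, whereas the loop above keeps first-seen order, so the column positions will differ.

```python
import pandas as pd

# Toy stand-in for the movies table (first three rows from movies.head())
movies_demo = pd.DataFrame({
    "mID": [1, 2, 3],
    "Genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Split each pipe-delimited string and build one 0/1 indicator column per genre
genre_dummies = movies_demo["Genres"].str.get_dummies(sep="|")

print(genre_dummies.columns.tolist())
print(genre_dummies.to_numpy())
```

Calling `genre_dummies.to_numpy()` gives a dense array that could be passed to `csr_matrix` in the same way as the list of encodings.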


In [8]:
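# Hold out a third of the genre encodings; only the training rows are used to fit NMF
x_train, x_test = train_test_split(encodings, test_size=0.33, random_state=42)
train_mat = csr_matrix(x_train)
# l1_ratio only takes effect when a regularization strength
# (alpha_W / alpha_H in recent scikit-learn) is non-zero
nmf_mod = NMF(n_components=18, l1_ratio=0.5, random_state=57)
nmf_mod.fit(train_mat)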



NMF(l1_ratio=0.5, n_components=18, random_state=57)
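To make the roles of the fitted pieces concrete, here is a small sketch on synthetic data. It assumes a random binary matrix in place of the genre encodings, and the names `X_demo`, `W`, `H`, and `X_hat` are illustrative only. NMF approximates a non-negative matrix `X` as the product `W @ H`, where `W` holds per-row component weights and `H` (stored in `components_`) holds the components themselves.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic binary matrix standing in for genre encodings (40 items, 6 "genres")
rng = np.random.default_rng(0)
X_demo = (rng.random((40, 6)) < 0.3).astype(float)

model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X_demo)  # per-item weights, shape (40, 3)
H = model.components_            # component-by-genre loadings, shape (3, 6)

# The product of the two factors approximates the original matrix
X_hat = W @ H

print(W.shape, H.shape, X_hat.shape)
```

With fewer components than columns, `X_hat` is a low-rank approximation of `X_demo`. With as many components as columns, as in the cell above, the factorization is not forced to compress the data.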

In [10]:
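from sklearn.metrics import mean_squared_error

# Project every movie (training and held-out rows alike) onto the 18 learned components
W_all = nmf_mod.transform(encodings)

# Note: this compares the factor matrix directly with the genre encodings, which is
# only shape-compatible because n_components (18) matches the length of each encoding.
# np.sqrt(MSE) gives the same value as squared=False, which newer scikit-learn deprecates.
rmse = np.sqrt(mean_squared_error(encodings, W_all))

print("RMSE:", rmse)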

RMSE: 0.2700338629407287


This RMSE is low compared with the results from the Recommender System module, which are closer to 1. The two are not directly comparable, however: the value here measures how well the NMF factors match binary genre encodings (entries of 0 or 1), not how well a model predicts ratings on a 1-to-5 scale.
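A reconstruction error on held-out rows is another way to judge the factorization. The sketch below is an assumption-laden illustration rather than the notebook's computation: it uses a synthetic binary matrix, fewer components than columns, and the hypothetical names `X_demo`, `W_test`, `X_test_hat`, and `rmse_recon`. It fits on the training split, projects the test split, and compares the test rows with their reconstruction `W_test @ H`.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split

# Synthetic binary matrix standing in for genre encodings (200 items, 8 "genres")
rng = np.random.default_rng(42)
X_demo = (rng.random((200, 8)) < 0.25).astype(float)
X_train, X_test = train_test_split(X_demo, test_size=0.33, random_state=42)

# Fit the components on the training rows only
model = NMF(n_components=4, init="nndsvda", random_state=57, max_iter=1000)
model.fit(X_train)

# Project the held-out rows, then map them back to the original space
W_test = model.transform(X_test)
X_test_hat = W_test @ model.components_

# Root mean squared error between held-out rows and their reconstruction
rmse_recon = np.sqrt(np.mean((X_test - X_test_hat) ** 2))
print("Held-out reconstruction RMSE:", rmse_recon)
```

Because the inputs are 0/1 indicators, this error is still on a different scale from the rating RMSE values listed below.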

Previous RMSE values from other methods for reference:

```
Baseline: Predict everything to 3
RMSE 1.259

Baseline: Predict to user average
RMSE 1.035

Content-based, item-item
RMSE 1.38

Collaborative cosine
RMSE 1.02
```