# Info 

Build a matrix factorization-based recommendation model using singular value decomposition (SVD) and the `surprise` package.

Mathematically, SVD decomposes a matrix into three other matrices. For film ratings, the SVD model uses the training ratings (arranged as an N by M matrix for N users and M movies) to learn latent representations of individual users and of individual films. These make up two of the three resulting matrices; the third is a diagonal matrix that weights the importance of each latent factor.
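
In matrix notation, the classical decomposition described above can be written as follows. The symbols $R$, $U$, $\Sigma$, and $V$ are notation introduced here for illustration, and $k$ is the number of latent factors retained:

$$
R_{N \times M} \approx U_{N \times k} \, \Sigma_{k \times k} \, V_{M \times k}^{\top}
$$

Each row of $U$ is a latent representation of one user, each row of $V$ is a latent representation of one film, and the diagonal entries of $\Sigma$ weight the latent factors.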

The fitted model is then used to predict unseen ratings for a user based on these factors to serve as a recommendation system.
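
As a rough illustration of how a prediction is formed from the learned factors, the sketch below applies the prediction rule documented for `surprise`'s `SVD`: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors, clipped to the rating scale. The function name `predict_rating` and all numeric values are made up for illustration; they are not taken from the fitted model.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(0.5, 5.0)):
    """Estimate a rating as mu + b_u + b_i + dot(q_i, p_u), clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, estimate))

# Hypothetical values for one user and one film with 3 latent factors
example = predict_rating(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.5, -0.2, 0.1],
    item_factors=[0.4, 0.3, -0.5],
)
print(round(example, 2))
```

A user with a positive bias tends to rate above average, a film with a negative bias tends to be rated below average, and the dot product captures how well this particular user's tastes match this particular film.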

Steps:

1. Load and prepare ratings dataset (previously created via `prep_data_eda.ipynb`)
2. Train the SVD Model
3. Evaluate the Model
4. Make Predictions

Since the full dataset contains over 32 million ratings and this notebook is only a sandbox for working out syntax, training here uses a random 25% sample of the ratings, which should still provide a generous training set.

In [1]:
import os
import pandas as pd
import numpy as np

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy

data_dir = '../data/'
df_ratings = pd.read_parquet(os.path.join(data_dir, 'ratings_combined.parquet'))

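seed = 77
sample_size = 0.25  # fraction of ratings to sample
df_ratings_sample = df_ratings.sample(frac=sample_size, random_state=seed)

# load_from_df expects exactly three columns, in this order: user, item, rating
rating_cols = ['user_id', 'movie_id', 'rating']
reader = Reader(rating_scale=(0.5, 5.0))
surprise_data = Dataset.load_from_df(df_ratings_sample[rating_cols], reader)
# surprise_data = Dataset.load_from_df(df_ratings[rating_cols], reader)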

# Train and Evaluate Model

Using default parameters, establish syntax for fitting, evaluating, and saving a model.

Hyperparameter tuning will be done via a sweep job in AzureML over the following hyperparameters:

* `n_epochs` 
* `lr_all` 
* `reg_all` 
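
To make the role of these three hyperparameters concrete, here is a minimal pure-Python sketch of the stochastic gradient descent loop used by this family of models: `n_epochs` is the number of passes over the ratings, `lr_all` is the learning rate applied to every parameter, and `reg_all` is the regularization strength applied to every parameter. It is a simplified stand-in, not the `surprise` implementation; the names `fit_sgd`, `rmse`, and `toy_ratings` and the toy data are hypothetical.

```python
import random

def fit_sgd(ratings, n_factors=2, n_epochs=20, lr_all=0.005, reg_all=0.02, seed=77):
    """Fit biases and latent factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random initial factors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):  # n_epochs: passes over the training ratings
        for u, i, r in ratings:
            err = r - predict(u, i)
            # lr_all scales each step; reg_all shrinks parameters toward zero
            b_u[u] += lr_all * (err - reg_all * b_u[u])
            b_i[i] += lr_all * (err - reg_all * b_i[i])
            for f in range(n_factors):
                p_uf, q_if = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * q_if - reg_all * p_uf)
                q[i][f] += lr_all * (err * p_uf - reg_all * q_if)
    return predict

def rmse(predict, ratings):
    """Root mean squared error of a predict(u, i) function on rating triples."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical toy data: (user, item, rating)
toy_ratings = [("u1", "m1", 5.0), ("u1", "m2", 3.0), ("u2", "m1", 4.0),
               ("u2", "m3", 1.0), ("u3", "m2", 2.0), ("u3", "m3", 1.5)]

few = fit_sgd(toy_ratings, n_epochs=1)
many = fit_sgd(toy_ratings, n_epochs=200)
print(rmse(few, toy_ratings), rmse(many, toy_ratings))
```

On this toy data, more epochs lower the training error. On real data that is not automatically a good thing, which is why these values are tuned against held-out ratings.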

In [2]:
train_set, test_set = train_test_split(surprise_data, test_size=0.2, random_state=seed)

svd_model = SVD(
  n_factors=100, # default
  n_epochs=20, # default
  random_state=seed
)

svd_model.fit(train_set)

predictions = svd_model.test(test_set)
# RMSE is the primary metric to minimize; accuracy.rmse() prints the value and also returns it
rmse = accuracy.rmse(predictions)
print(rmse) 
mae = accuracy.mae(predictions)
print(mae)

RMSE: 0.8576
0.8575780773223838
MAE:  0.6507
0.6506684841319251
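
For reference, the two metrics reported above reduce to simple formulas over the prediction errors. The sketch below computes them by hand on a few made-up (actual, predicted) pairs. The helper names `rmse` and `mae` and the sample values are illustrative only and are unrelated to the results printed above.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)

# Made-up ratings: (actual, predicted)
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE does, so it is never smaller than MAE on the same predictions.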


5-fold cross-validation

In [3]:
cross_validate(svd_model, surprise_data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8574  0.8574  0.8571  0.8553  0.8567  0.8568  0.0008  
MAE (testset)     0.6507  0.6507  0.6502  0.6494  0.6502  0.6502  0.0005  
Fit time          59.40   68.70   72.85   75.08   72.43   69.69   5.54    
Test time         10.90   10.29   12.56   9.58    12.55   11.18   1.20    


{'test_rmse': array([0.85740219, 0.85740031, 0.85714132, 0.85530971, 0.85671686]),
 'test_mae': array([0.65073034, 0.65067826, 0.65024987, 0.64937952, 0.65017996]),
 'fit_time': (59.395994663238525,
  68.70016145706177,
  72.85249352455139,
  75.08339762687683,
  72.432457447052),
 'test_time': (10.900620937347412,
  10.287422895431519,
  12.560326099395752,
  9.580122232437134,
  12.553075075149536)}
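
The `Mean` and `Std` columns in the table above can be reproduced from the per-fold arrays in the returned dictionary. The sketch below does so for `test_rmse` using the values printed above and Python's `statistics` module. The variable names are illustrative, and the population standard deviation is used because it matches the printed `Std` value.

```python
from statistics import fmean, pstdev

# Per-fold RMSE values copied from the cross_validate output above
test_rmse = [0.85740219, 0.85740031, 0.85714132, 0.85530971, 0.85671686]

mean_rmse = fmean(test_rmse)
std_rmse = pstdev(test_rmse)  # population standard deviation
print(f"{mean_rmse:.4f} {std_rmse:.4f}")
```

The small spread across folds suggests the single train/test split used earlier was not an unusually lucky or unlucky one.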

In [4]:
# save model artifact
# import pickle

# with open('svd_model.pkl', 'wb') as file:
#   pickle.dump(svd_model, file)

# Generate Recommendations via SVD

The model can be used to output predicted ratings for a given user on films they have not yet rated.

In [5]:
def recommend_for_user_svd(df_ratings, df_films, user_id, top_n=10):
  """
  Recommend movies for a given user using the trained SVD model.

  Args:
    df_ratings: A dataframe of film ratings with columns: user_id, movie_id, rating
    df_films: A dimension table dataframe of films that provides the title and genre(s) for a given movie_id
    user_id: A specific user_id present in df_ratings
    top_n: # of recommendations to return

  Return: A dataframe with top_n rows of recommended titles, genres, and predicted ratings
  """
  all_movie_ids = df_ratings['movie_id'].unique()

  if user_id not in df_ratings['user_id'].unique():
    print("This user is not present in the ratings dataset provided")
    return None

  # Movies the user has already rated (a set gives fast membership checks)
  rated = set(df_ratings.loc[df_ratings['user_id'] == user_id, 'movie_id'])
  # Candidate movies = all movies in df_ratings that the user has NOT rated
  candidate_movie_ids = [mid for mid in all_movie_ids if mid not in rated]

  # Use the model to predict ratings for not-yet-rated movie_ids
  # (relies on the global svd_model fitted above)
  preds = []
  for mid in candidate_movie_ids:
    pred = svd_model.predict(user_id, mid)
    preds.append((mid, pred.est))

  # Sort by predicted rating (high to low)
  preds_sorted = sorted(preds, key=lambda x: x[1], reverse=True)[:top_n]
  top_movie_ids = [mid for (mid, _) in preds_sorted]
  top_scores = [score for (_, score) in preds_sorted]

  # Map movie_id to title + genre
  recs = df_films[df_films['movie_id'].isin(top_movie_ids)][['movie_id', 'title', 'genres']]
  # Preserve order of top_movie_ids
  recs = recs.set_index('movie_id').loc[top_movie_ids].reset_index()
  recs['predicted_rating'] = top_scores

  return recs


In [6]:
df_films = pd.read_csv(os.path.join(data_dir, 'movies.csv')).rename(columns={'movieId':'movie_id'})

recommend_for_user_svd(df_ratings, df_films, user_id=999999, top_n=10)

   movie_id  title                                              genres                    predicted_rating
0  171011    Planet Earth II (2016)                             Documentary               4.186504
1  3307      City Lights (1931)                                 Comedy|Drama|Romance      4.111758
2  3629      Gold Rush, The (1925)                              Adventure|Comedy|Romance  4.093407
3  105250    Century of the Self, The (2002)                    Documentary               4.060982
4  1193      One Flew Over the Cuckoo's Nest (1975)             Drama                     4.048949
5  1281      Great Dictator, The (1940)                         Comedy|Drama|War          4.045887
6  6123      Sunless (Sans Soleil) (1983)                       Documentary               4.037953
7  170705    Band of Brothers (2001)                            Action|Drama|War          4.007385
8  1256      Duck Soup (1933)                                   Comedy|Musical|War        4.002715
9  26048     Human Condition II, The (Ningen no joken II) (...  Drama|War                 4.001241
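
The ranking step in `recommend_for_user_svd` sorts every candidate before slicing off the top `top_n`. With tens of thousands of candidate films, a possible alternative is `heapq.nlargest` from the standard library, which keeps only the best `top_n` entries while scanning. The sketch below uses made-up `(movie_id, predicted_rating)` pairs rather than real model output.

```python
import heapq

# Made-up (movie_id, predicted_rating) pairs standing in for model predictions
preds = [(101, 3.2), (102, 4.6), (103, 2.9), (104, 4.1), (105, 4.8)]

top_n = 3
# Returns the top_n pairs ordered from highest to lowest predicted rating
top_preds = heapq.nlargest(top_n, preds, key=lambda pair: pair[1])
print(top_preds)
```

The result has the same shape as `preds_sorted` in the function, so the rest of the lookup logic would not need to change.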
