# Info 

Build a matrix factorization-based recommendation model via singular value decomposition (SVD) and the `surprise` package.

Mathematically, SVD "decomposes" a matrix into three other matrices. In practice for film ratings, the SVD model uses the training ratings (arranged as an N by M matrix for N users and M movies) to learn latent representations of individual users and of individual films. These make up 2 of the 3 resulting matrices; the third defines the weighted importance of each latent factor.

The fitted model is then used to predict unseen ratings for a user based on these factors to serve as a recommendation system.
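
As a rough illustration of how learned factors turn into a predicted rating, the sketch below follows the prediction rule documented for Surprise's `SVD` algorithm: the global mean rating, plus a user bias and an item bias, plus the dot product of the user and item factor vectors. The function name `predict_rating` and every number in it are hypothetical and are not taken from the fitted model.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(0.5, 5.0)):
    """Estimate a rating as mu + b_u + b_i + dot(p_u, q_i), clipped to the scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    return min(high, max(low, estimate))


# Hypothetical values for one user/film pair with 3 latent factors.
example = predict_rating(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.5, -0.2, 0.1],
    item_factors=[0.4, 0.3, -0.5],
)
print(round(example, 2))
```

A user whose factor vector points in the same direction as a film's factor vector gets a positive interaction term, which pushes the estimate above the bias-only baseline.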

Steps:

1. Load and prepare ratings dataset (previously created via `prep_data_eda.ipynb`)
2. Train the SVD Model
3. Evaluate the Model
4. Make Predictions

The full dataset contains over 32 million ratings. Because this notebook is a sandbox for working out syntax, training uses a random 25% sample of the ratings, which should still provide a generous training set.

In [1]:
import os
import pandas as pd
import numpy as np

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy

data_dir = '../data/'
df_ratings = pd.read_parquet(os.path.join(data_dir, 'ratings_combined.parquet'))

seed=77
sample_size=0.25
df_ratings_sample = df_ratings.sample(frac=sample_size, random_state=seed)

reader = Reader(rating_scale=(0.5, 5.0))
surprise_data = Dataset.load_from_df(df_ratings_sample, reader)
# surprise_data = Dataset.load_from_df(df_ratings, reader)

# Train and Evaluate Model

Using a fixed set of starting parameters (not the `surprise` defaults), establish the syntax for fitting, evaluating, and saving a model.

Hyperparameter tuning will be done via a sweep job in AzureML over the following:

* `n_epochs` 
* `lr_all` 
* `reg_all` 
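
To make the search space concrete, here is a minimal sketch of how a grid over these three hyperparameters could be enumerated locally. The candidate values and the `expand_grid` helper are placeholders chosen for illustration; they are not the values used in the AzureML sweep.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not tuned choices.
param_grid = {
    "n_epochs": [20, 50],
    "lr_all": [0.005, 0.01],
    "reg_all": [0.02, 0.1],
}


def expand_grid(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combinations = expand_grid(param_grid)
print(len(combinations))  # 2 * 2 * 2 combinations
for params in combinations[:3]:
    print(params)
```

Each dictionary could then be unpacked into the model constructor, for example `SVD(**params, random_state=seed)`. A dictionary of lists in this shape is also what `surprise.model_selection.GridSearchCV` accepts as its parameter grid, should a local search be preferred over a sweep job.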

In [2]:
train_set, test_set = train_test_split(surprise_data, test_size=0.2, random_state=seed)

svd_model = SVD(
  n_epochs=50,
  lr_all=0.01,
  reg_all=0.1,
  random_state=seed
)

svd_model.fit(train_set)

predictions = svd_model.test(test_set)
# RMSE as the primary metric to evaluate (minimize)
accuracy.rmse(predictions)
accuracy.mae(predictions)

RMSE: 0.8394
MAE:  0.6360


0.6359714069463517
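
For reference, both metrics can be computed by hand from pairs of true and estimated ratings. The sketch below uses made-up pairs rather than the model's predictions. RMSE squares each error before averaging, so large misses weigh more heavily and RMSE is never smaller than MAE.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical (true, estimated) ratings, not output from the fitted model.
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(rmse(toy_pairs), mae(toy_pairs))
```

With the real predictions, the pairs could be built as `[(p.r_ui, p.est) for p in predictions]`, since each Surprise prediction carries the true rating `r_ui` and the estimate `est`.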

5-fold cross-validation

In [3]:
cross_validate(svd_model, surprise_data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8394  0.8392  0.8395  0.8395  0.8390  0.8393  0.0002  
MAE (testset)     0.6364  0.6360  0.6363  0.6366  0.6360  0.6363  0.0002  
Fit time          144.73  143.43  148.06  149.15  147.99  146.67  2.20    
Test time         11.05   8.87    10.10   9.65    11.34   10.20   0.91    


{'test_rmse': array([0.83940477, 0.83915392, 0.83948548, 0.83951545, 0.83897032]),
 'test_mae': array([0.63638111, 0.63602027, 0.63627463, 0.63664386, 0.63604673]),
 'fit_time': (144.72850680351257,
  143.4264039993286,
  148.05582904815674,
  149.152184009552,
  147.99309873580933),
 'test_time': (11.054686784744263,
  8.86668848991394,
  10.099096775054932,
  9.647846460342407,
  11.342945575714111)}
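
Saving the fitted model is not shown above. One possible approach, sketched here with the standard-library `pickle` module, is to serialize the model object to disk and load it back later. The helper names and the file name are hypothetical, and a plain dictionary stands in for `svd_model` so the sketch runs on its own. Surprise also provides a `surprise.dump` module for the same purpose.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a fitted model (or any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled model. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object so the sketch runs without a fitted model.
stand_in_model = {"algo": "SVD", "n_epochs": 50, "lr_all": 0.01, "reg_all": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svd_model.pkl")  # hypothetical file name
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```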

# Generate Recommendations via SVD

The model can be used to output predicted ratings for a given user on films they have not yet rated.

In [4]:
def recommend_for_user_svd(df_ratings, df_films, user_id, top_n=10):
  """
  Recommend movies for a given user using the trained SVD model.

  Args:
    df_ratings: A dataframe of film ratings with columns: user_id, movie_id, rating
    df_films: A dimension table dataframe of films that provides the title and genre(s) for a given movie_id
    user_id: A specific user_id present in df_ratings
    top_n: # of recommendations to return

  Return: A dataframe with top_n rows of recommended titles, genres, and predicted ratings
  """
  all_movie_ids = df_ratings['movie_id'].unique()

  if user_id not in df_ratings['user_id'].unique():
    print("This user is not present in the ratings dataset provided")
    return None

  # Movies the user has already rated (a set gives O(1) membership checks)
  rated = set(df_ratings.loc[df_ratings['user_id'] == user_id, 'movie_id'])
  # Candidate movies = all movies in df_ratings that the user has NOT rated
  candidate_movie_ids = [mid for mid in all_movie_ids if mid not in rated]

  # Use the globally defined svd_model to predict ratings for not-yet-rated movie_ids
  preds = []
  for mid in candidate_movie_ids:
    pred = svd_model.predict(user_id, mid)
    preds.append((mid, pred.est))

  # Sort by predicted rating (high to low)
  preds_sorted = sorted(preds, key=lambda x: x[1], reverse=True)[:top_n]
  top_movie_ids = [mid for (mid, _) in preds_sorted]
  top_scores = [score for (_, score) in preds_sorted]

  # Map movie_id to title + genre
  recs = df_films[df_films['movie_id'].isin(top_movie_ids)][['movie_id', 'title', 'genres']]
  # Preserve order of top_movie_ids
  recs = recs.set_index('movie_id').loc[top_movie_ids].reset_index()
  recs['predicted_rating'] = top_scores

  return recs
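
The filtering and ranking steps can also be written without sorting every candidate. The sketch below is one alternative under stated assumptions: `top_n_unrated` is a hypothetical helper, `predict` is any callable that maps a movie id to a score, and the scores are invented for the example.

```python
import heapq


def top_n_unrated(predict, all_item_ids, rated_item_ids, top_n=10):
    """Return the top_n (item_id, score) pairs among items not yet rated."""
    rated = set(rated_item_ids)  # set membership is O(1) per lookup
    candidates = (item_id for item_id in all_item_ids if item_id not in rated)
    scored = ((item_id, predict(item_id)) for item_id in candidates)
    # nlargest tracks only the best top_n pairs instead of sorting everything
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[1])


# Hypothetical scores standing in for model predictions.
fake_scores = {1: 3.2, 2: 4.8, 3: 4.1, 4: 2.5, 5: 4.6}
top = top_n_unrated(fake_scores.get, fake_scores.keys(), rated_item_ids=[2], top_n=2)
print(top)
```

With the fitted model, `predict` could be `lambda mid: svd_model.predict(user_id, mid).est`.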


In [None]:
df_films = pd.read_csv(os.path.join(data_dir, 'movies.csv')).rename(columns={'movieId':'movie_id'})

recommend_for_user_svd(df_ratings, df_films, user_id=999999, top_n=25)

| | movie_id | title | genres | predicted_rating |
|---|---|---|---|---|
| 0 | 179063 | Everybody in Our Family (2012) | Drama | 4.433291 |
| 1 | 70186 | Heimat - A Chronicle of Germany (Heimat - Eine... | Drama | 4.338844 |
| 2 | 239316 | Can't Get You Out of My Head: An Emotional His... | Documentary | 4.283249 |
| 3 | 91762 | Last Lions, The (2011) | Documentary | 4.220569 |
| 4 | 101964 | Mugabe and the White African (2009) | Documentary | 4.188153 |
| 5 | 177903 | Near Death (1989) | Documentary | 4.186834 |
| 6 | 26411 | Adventures of Picasso, The (Picassos äventyr) ... | Comedy | 4.170682 |
| 7 | 169920 | Triumph’s Election Special 2016 (2016) | Comedy | 4.113056 |
| 8 | 181951 | All the Cats Join In (1946) | Animation | 4.100749 |
| 9 | 225435 | Wesley (2009) | (no genres listed) | 4.096104 |
