# Info 

Build a matrix factorization-based recommendation model using singular value decomposition (SVD) and the `surprise` package.

Mathematically, SVD decomposes a matrix into three other matrices. For film ratings, the SVD model takes the training ratings (arranged as an N by M matrix for N users and M movies) and learns latent representations of individual users and of individual films. These make up two of the three decomposition matrices; the third is a diagonal matrix that weights the importance of each latent factor.

The fitted model is then used to predict unseen ratings for a user based on these factors to serve as a recommendation system.
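
To make the three-matrix decomposition concrete, here is a minimal NumPy sketch on a tiny, fully observed ratings matrix. The matrix `R` and its values are made up for illustration. Note that real ratings matrices are mostly empty, so `surprise`'s `SVD` class does not call an exact decomposition like this one; it learns user and film factors by stochastic gradient descent on the observed ratings only.

```python
import numpy as np

# Hypothetical, fully observed ratings: 4 users (rows) x 3 films (columns).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.5],
    [1.0, 1.5, 5.0],
    [1.5, 1.0, 4.5],
])

# U holds user factors, s the singular values (factor strengths),
# and Vt the film factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Multiplying the three pieces back together recovers R.
R_full = U @ np.diag(s) @ Vt

# Keeping only the k strongest factors gives a low-rank approximation.
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("max abs error, rank-2 approximation:", np.round(np.abs(R - R_approx).max(), 3))
```

The singular values come back sorted from largest to smallest, which is why truncating to the first `k` keeps the most informative factors.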

Steps:

1. Load and prepare ratings dataset (previously created via `prep_data_eda.ipynb`)
2. Train the SVD Model
3. Evaluate the Model
4. Make Predictions

The full dataset contains over 32 million ratings. Because this notebook is only a sandbox for establishing syntax, training uses a random 25% sample of the ratings, which should still provide a generous training set.

In [5]:
import os
import pandas as pd
import numpy as np

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

data_dir = '../data/'
df_ratings = pd.read_csv(os.path.join(data_dir, 'ratings_combined.csv'))

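seed = 77
sample_size = 0.25  # fraction of rows to keep
df_ratings_sample = df_ratings.sample(frac=sample_size, random_state=seed)

reader = Reader(rating_scale=(0.5, 5.0))
# load_from_df expects exactly three columns, in the order: user, item, rating
rating_cols = ['user_id', 'movie_id', 'rating']
surprise_data = Dataset.load_from_df(df_ratings_sample[rating_cols], reader)
# surprise_data = Dataset.load_from_df(df_ratings[rating_cols], reader)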

# Train and Evaluate Model

Using near-default parameters, establish the syntax for fitting, evaluating, and saving a model.

Hyperparameter tuning will be done via a sweep job in AzureML over the following:

* `n_factors` 
* `n_epochs` 
* `lr_all` 
* `reg_all` 
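
To show what each of these four hyperparameters controls, here is a toy matrix factorization trained by stochastic gradient descent in NumPy. It is a simplified sketch of the kind of update rule an SVD-style recommender uses, not `surprise`'s actual implementation; `fit_toy_svd`, `train_rmse`, and the toy ratings are hypothetical names and data introduced for illustration.

```python
import numpy as np

def fit_toy_svd(ratings, n_users, n_items, n_factors=2, n_epochs=20,
                lr_all=0.005, reg_all=0.02, seed=0):
    """Toy SGD matrix factorization on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])        # global mean rating
    bu = np.zeros(n_users)                          # user biases
    bi = np.zeros(n_items)                          # item biases
    # n_factors: length of each latent user/item vector
    P = rng.normal(0.0, 0.1, (n_users, n_factors))
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))

    for _ in range(n_epochs):                       # n_epochs: passes over the data
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            # lr_all scales every step; reg_all shrinks parameters toward zero
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            pu_old = P[u].copy()
            P[u] += lr_all * (err * Q[i] - reg_all * P[u])
            Q[i] += lr_all * (err * pu_old - reg_all * Q[i])
    return mu, bu, bi, P, Q

def train_rmse(ratings, params):
    """RMSE of the fitted parameters on the same triples they were trained on."""
    mu, bu, bi, P, Q = params
    errs = [r - (mu + bu[u] + bi[i] + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Hypothetical (user, item, rating) triples: 4 users, 3 items.
toy_ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.5), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.5),
]

rmse_start = train_rmse(toy_ratings, fit_toy_svd(toy_ratings, 4, 3, n_epochs=0))
rmse_end = train_rmse(toy_ratings, fit_toy_svd(toy_ratings, 4, 3, n_epochs=200, lr_all=0.01))
print(f"training RMSE before: {rmse_start:.3f}, after: {rmse_end:.3f}")
```

This only reports training error on a handful of points, so it says nothing about generalization; the sweep is what compares settings on held-out data.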

In [None]:
train_set, test_set = train_test_split(surprise_data, test_size=0.2, random_state=seed)

svd_model = SVD(
  n_factors=20,  # library default is 100
  n_epochs=20,  # library default
  random_state=seed
)

svd_model.fit(train_set)

predictions = svd_model.test(test_set)
# RMSE as the primary metric to evaluate (minimize)
rmse = accuracy.rmse(predictions)
print(rmse) 

RMSE: 0.8522
0.8521971857568799
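
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so a value near 0.85 means errors on the order of 0.85 stars on the 0.5 to 5.0 scale, with larger misses weighted more heavily. A minimal sketch of the calculation, using a hypothetical helper `rmse_by_hand` and made-up rating pairs:

```python
import math

def rmse_by_hand(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up (true, predicted) pairs, not taken from the model above.
example_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0)]
print(round(rmse_by_hand(example_pairs), 4))  # 0.6455
```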


In [None]:
# save model artifact
import pickle

with open('svd_model.pkl', 'wb') as file:
  pickle.dump(svd_model, file)
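
Loading the artifact back is the mirror image of saving it. The sketch below round-trips a stand-in object through a temporary file; `save_model`, `load_model`, and `placeholder_model` are hypothetical names, and in the notebook the object would be the fitted `svd_model`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a model object to disk."""
    with open(path, 'wb') as file:
        pickle.dump(model, file)

def load_model(path):
    """Deserialize a model object from disk."""
    with open(path, 'rb') as file:
        return pickle.load(file)

# Stand-in object; in the notebook this would be the fitted svd_model.
placeholder_model = {'n_factors': 20, 'n_epochs': 20}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'svd_model.pkl')
    save_model(placeholder_model, model_path)
    restored_model = load_model(model_path)

print(restored_model == placeholder_model)  # True
```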

# Generate Recommendations via SVD

The model can be used to output predicted ratings for a given user on films they have not yet rated.

In [None]:
def recommend_for_user_svd(df_ratings, df_films, user_id, top_n=10):
  """
  Recommend movies for a given user using the trained SVD model.

  Relies on `svd_model` being defined in the notebook's global scope.

  Args:
    df_ratings: A dataframe of film ratings with columns: user_id, movie_id, rating
    df_films: A dimension table dataframe of films that provides the title and genre(s) for a given movie_id
    user_id: A specific user_id present in df_ratings
    top_n: Number of recommendations to return

  Returns: A dataframe with top_n rows of recommended titles, genres, and predicted ratings,
    or None if user_id is not present in df_ratings
  """
  all_movie_ids = df_ratings['movie_id'].unique()

  if user_id not in df_ratings['user_id'].unique():
    print("This user is not present in the ratings dataset provided")
    return None

  # Movies the user has already rated (a set keeps the membership test below fast)
  rated = set(df_ratings.loc[df_ratings['user_id'] == user_id, 'movie_id'])
  # Candidate movies = all movies in df_ratings that the user has NOT rated
  candidate_movie_ids = [mid for mid in all_movie_ids if mid not in rated]

  # Use the model to predict ratings for the not-yet-rated movie_ids
  preds = []
  for mid in candidate_movie_ids:
    pred = svd_model.predict(user_id, mid)
    preds.append((mid, pred.est))

  # Sort by predicted rating (high to low)
  preds_sorted = sorted(preds, key=lambda x: x[1], reverse=True)[:top_n]
  top_movie_ids = [mid for (mid, _) in preds_sorted]
  top_scores = [score for (_, score) in preds_sorted]

  # Map movie_id to title + genre
  recs = df_films[df_films['movie_id'].isin(top_movie_ids)][['movie_id', 'title', 'genres']]
  # Preserve order of top_movie_ids
  recs = recs.set_index('movie_id').loc[top_movie_ids].reset_index()
  recs['predicted_rating'] = top_scores

  return recs
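
The function above sorts every candidate prediction and then slices off the first `top_n`. When there are many candidates and only a handful of results are needed, `heapq.nlargest` is one alternative that returns the same ordering without a full sort. A minimal sketch, where `top_n_predictions` is a hypothetical helper and the scores are made up:

```python
import heapq

def top_n_predictions(preds, top_n=10):
    """Return the top_n (movie_id, score) pairs, highest score first."""
    return heapq.nlargest(top_n, preds, key=lambda pair: pair[1])

# Made-up (movie_id, predicted_rating) pairs for illustration.
example_preds = [(1260, 4.10), (778, 4.01), (926, 4.03), (50, 3.20)]
print(top_n_predictions(example_preds, top_n=2))  # [(1260, 4.1), (926, 4.03)]
```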


In [17]:
df_films = pd.read_csv(os.path.join(data_dir, 'ml-32m/movies.csv')).rename(columns={'movieId':'movie_id'})

recommend_for_user_svd(df_ratings, df_films, user_id=0, top_n=10)

,movie_id,title,genres,predicted_rating
0,1260,M (1931),Crime|Film-Noir|Thriller,4.097911
1,163809,Over the Garden Wall (2013),Adventure|Animation|Drama,4.080932
2,159817,Planet Earth (2006),Documentary,4.066612
3,169252,Everything Will Be OK (2006),Animation|Drama,4.039003
4,926,All About Eve (1950),Drama,4.027503
5,171011,Planet Earth II (2016),Documentary,4.023441
6,778,Trainspotting (1996),Comedy|Crime|Drama,4.013003
7,3462,Modern Times (1936),Comedy|Drama|Romance,4.00682
8,26587,"Decalogue, The (Dekalog) (1989)",Crime|Drama|Romance,4.004739
9,2351,"Nights of Cabiria (Notti di Cabiria, Le) (1957)",Drama,4.000537
