## Matrix Factorization Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)
3. [FunkSVD Item-Based Recommender System with Scikit Surprise](#funksvd)
4. [Hyperparameter Tuning FunkSVD Item-Based Recommender System](#hyper)
5. [Making Predictions for User 4056](#4056)
6. [Final Item-Based Recommender System](#final)

### Introduction <a class="anchor" id="intro"></a>

In this initial modeling stage, we build the first version of the restaurant recommendation system: a matrix factorization (FunkSVD) model trained on Vancouver Yelp ratings. It serves as the baseline for later improvements.

#### Importing Python Libraries 

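The notebook uses NumPy and pandas for data handling and scikit-surprise for the recommender model, the train/test split, the grid search, and the accuracy metrics.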

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

# Import SVD algorithm from Surprise library
from surprise import SVD

# Import Reader and Dataset from Surprise library
from surprise.reader import Reader
from surprise import Dataset

# Import Surprise's SVD class (an implementation of Funk's SVD) under the alias FunkSVD
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

# Import train_test_split and GridSearchCV from Surprise library
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

# Import accuracy module from Surprise library
from surprise import accuracy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="model"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique business id
* `rating`: star rating

In [2]:
# Read data from a pickle file into a Pandas DataFrame
vancouver_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/vancouver_data.pkl')

In [3]:
# Display concise information about the 'vancouver_data' DataFrame
vancouver_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64660 entries, 1101 to 5561981
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          64660 non-null  int64  
 1   business_id      64660 non-null  int64  
 2   rating           64660 non-null  float64
 3   restaurant_name  64660 non-null  object 
 4   categories       64660 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 3.0+ MB


In [4]:
# Display the first few rows of the 'vancouver_data' DataFrame
vancouver_data.head()

| | user_id | business_id | rating | restaurant_name | categories |
|---|---|---|---|---|---|
| 1101 | 70315 | 1407 | 4.0 | Meat & Bread | [Fast Food, Bakeries, Sandwiches, Salad, Soup,... |
| 1105 | 70315 | 1356 | 3.0 | Edible Canada At the Market | [Seafood, Canadian (New), American (New), Spec... |
| 1109 | 70315 | 7370 | 4.0 | The Lamplighter Public House | [Nightlife, Gastropubs, Bars, Pubs] |
| 1144 | 70315 | 1143 | 5.0 | Miku | [Japanese, Sushi Bars] |
| 1151 | 70315 | 13469 | 4.0 | Lupo | [Italian] |


In [5]:
# Count the number of missing values in each column of the 'vancouver_data' DataFrame
vancouver_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [6]:
# Print the size of our model dataset
print(f"The size of our model dataset is {vancouver_data.shape[0]} entries.")

The size of our model dataset is 64660 entries.


In [7]:
# Extract columns 'user_id', 'restaurant_name', and 'rating' from 'vancouver_data',
# then sort the data by 'user_id' in ascending order
sorted_data = vancouver_data[['user_id', 'restaurant_name', 'rating']].sort_values(by='user_id')

# Display the sorted data
display(sorted_data)

| | user_id | restaurant_name | rating |
|---|---|---|---|
| 2328038 | 4 | Breakfast Table | 2.0 |
| 2328033 | 4 | Yolks | 5.0 |
| 2328050 | 4 | Fable | 3.0 |
| 2328052 | 4 | Minami | 5.0 |
| 2328053 | 4 | The Flying Pig - Gastown | 4.0 |
| ... | ... | ... | ... |
| 1342235 | 81124 | The Sandbar Seafood Restaurant | 5.0 |
| 1342237 | 81124 | The Flying Pig - Yaletown | 1.0 |
| 1342238 | 81124 | Black Rice Izakaya | 2.0 |
| 1999029 | 81139 | Marutama Ramen | 5.0 |


In [8]:
# Get unique user_id values and map them to new values starting from 0
user_id_mapping = {user_id: new_id for new_id, user_id in enumerate(sorted_data['user_id'].unique())}

# Replace the 'user_id' values in the DataFrame using the mapping
sorted_data['user_id'] = sorted_data['user_id'].map(user_id_mapping)

# Display the updated DataFrame
display(sorted_data)

| | user_id | restaurant_name | rating |
|---|---|---|---|
| 2328038 | 0 | Breakfast Table | 2.0 |
| 2328033 | 0 | Yolks | 5.0 |
| 2328050 | 0 | Fable | 3.0 |
| 2328052 | 0 | Minami | 5.0 |
| 2328053 | 0 | The Flying Pig - Gastown | 4.0 |
| ... | ... | ... | ... |
| 1342235 | 8976 | The Sandbar Seafood Restaurant | 5.0 |
| 1342237 | 8976 | The Flying Pig - Yaletown | 1.0 |
| 1342238 | 8976 | Black Rice Izakaya | 2.0 |
| 1999029 | 8977 | Marutama Ramen | 5.0 |


In [9]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique reviewers 
print("Number of unique reviewers:", sorted_data['user_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['rating'].min(), "to", sorted_data['rating'].max())

Number of restaurants: 766
Number of unique reviewers: 8978
Range of ratings: 1.0 to 5.0
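The counts above imply a very sparse user-restaurant matrix, which is the setting matrix factorization is meant for. The sketch below is a rough check using only the figures printed by the notebook. It assumes every one of the 64,660 rows is a distinct user-restaurant pair, so the result is an upper bound; repeat ratings of the same restaurant would make the true density slightly lower.

```python
# Figures reported by the notebook outputs above
n_ratings = 64660
n_users = 8978
n_restaurants = 766

# Total number of cells in the user x restaurant rating matrix
n_cells = n_users * n_restaurants

# Upper bound on density: assumes every row is a distinct (user, restaurant) pair
density = n_ratings / n_cells

print(f"Matrix cells: {n_cells:,}")
print(f"Density (upper bound): {density:.2%}")
```

Under that assumption, fewer than one cell in a hundred holds a rating.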


In [10]:
# Group by 'user_id' and count the number of non-NaN ratings for each user
user_ratings_count = sorted_data.groupby('user_id')['rating'].count()

# Find the user with the most ratings (index of the maximum count)
user_with_most_ratings = user_ratings_count.idxmax()

# Get the actual count of ratings for the user with the most ratings
most_ratings_count = user_ratings_count.max()

# Print the results
print(f"User with the most ratings: {user_with_most_ratings}")
print(f"Number of ratings for the user: {most_ratings_count}")

User with the most ratings: 4056
Number of ratings for the user: 543


### FunkSVD Item-Based Recommender System with Scikit Surprise <a class="anchor" id="funksvd"></a>

In [11]:
# Load the DataFrame into a scikit-surprise Dataset
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(sorted_data[['user_id', 'restaurant_name', 'rating']], reader)

In [12]:
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.4)

In [13]:
# Create the FunkSVD model
model = FunkSVD(n_factors=100, n_epochs=20, lr_all=0.05, biased=False, verbose=0)

# Train the model on the training set
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x152979510>
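With `biased=False`, Surprise's `SVD` predicts a rating as the dot product of a user factor vector and an item factor vector, and learns both by stochastic gradient descent. The sketch below illustrates that update rule on a toy dataset. The function names `funk_svd_sketch` and `rmse`, the toy ratings, and the hyperparameter values are assumptions chosen for illustration; this is not Surprise's actual implementation.

```python
import numpy as np

def funk_svd_sketch(ratings, n_users, n_items, n_factors=2,
                    n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn user factors P and item factors Q by SGD (unbiased variant)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error for this rating
            p_old = P[u].copy()            # keep the pre-update user vector
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of dot-product predictions on known ratings."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical toy data: (user index, item index, rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

# Baseline: all-zero factors predict 0 for every rating
rmse_before = rmse(toy_ratings, np.zeros((3, 2)), np.zeros((3, 2)))

P, Q = funk_svd_sketch(toy_ratings, n_users=3, n_items=3)
rmse_after = rmse(toy_ratings, P, Q)

print(f"RMSE before training: {rmse_before:.3f}")
print(f"RMSE after training:  {rmse_after:.3f}")
```

The `n_factors`, `n_epochs`, and learning-rate arguments play the same roles as `n_factors`, `n_epochs`, and `lr_all` in the model above.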

In [14]:
# Make predictions on the test set
predictions = model.test(testset)

In [15]:
# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.1260309447608687
Mean Squared Error (MSE): 1.2679456885590548
Mean Absolute Error (MAE): 0.8870043280718919
Fraction of Concordant Pairs (FCP): 0.6160065612835128
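As a reminder of what the first three metrics measure, the sketch below computes MSE, RMSE, and MAE by hand on a few hypothetical ratings, then checks that the MSE reported above equals the reported RMSE squared. The helper `error_metrics` and the toy values are assumptions for illustration.

```python
import math

def error_metrics(actual, predicted):
    """Return MSE, RMSE, and MAE for paired actual/predicted ratings."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

# Hypothetical ratings, not taken from the dataset
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
metrics = error_metrics(actual, predicted)
print(metrics)

# Consistency check on the figures reported above: MSE should equal RMSE squared
reported_rmse = 1.1260309447608687
reported_mse = 1.2679456885590548
gap = abs(reported_rmse ** 2 - reported_mse)
print(f"|RMSE^2 - MSE| = {gap:.2e}")
```

RMSE penalizes large misses more heavily than MAE, which is why it sits above MAE in the output above.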


### Hyperparameter Tuning FunkSVD Item-Based Recommender System <a class="anchor" id="hyper"></a>

In [16]:
# Set the parameter grid
param_grid = {
    'n_epochs': [10, 20, 50, 75, 100], 
    'n_factors': [50, 100, 150, 200, 250],
    'lr_all': [0.005, 0.01, 0.02], 
    'biased': [False] }

# Set up GridSearchCV with 3-fold cross-validation, scored on FCP
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(data)

In [17]:
# Print the best FCP scores
print('Best FCP:', GS.best_score['fcp'])

# Print the best parameters found during the grid search
print('Best parameters:', GS.best_params['fcp'])

Best FCP: 0.6403319464145792
Best parameters: {'n_epochs': 10, 'n_factors': 250, 'lr_all': 0.005, 'biased': False}
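To give a sense of the cost of this search, the sketch below enumerates the grid defined above and counts the model fits it implies. It uses only the standard library and assumes one fit per parameter combination per fold.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_epochs': [10, 20, 50, 75, 100],
    'n_factors': [50, 100, 150, 200, 250],
    'lr_all': [0.005, 0.01, 0.02],
    'biased': [False],
}

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))

n_folds = 3  # cv=3 in the grid search above
n_fits = len(combinations) * n_folds

print("Parameter combinations:", len(combinations))
print("Model fits across folds:", n_fits)
```

The best parameters sit at the edge of the grid for `n_epochs`, `n_factors`, and `lr_all`, so a follow-up search extending those ranges may be worth considering.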


In [18]:
# Split train test set
trainset, testset = train_test_split(data, test_size=0.40)

# Set the algorithm
my_svd = FunkSVD(n_factors=250, 
                 n_epochs=10, 
                 lr_all=0.005,
                 biased=False,
                 verbose=0)

# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [19]:
# Put 'my_pred' results in a DataFrame
df_prediction_rated = pd.DataFrame(my_pred, columns=['user_id',
                                               'restaurant_name',
                                               'actual',
                                               'prediction',
                                               'details'])

# Store the absolute difference between the actual and predicted ratings in the 'diff' column
df_prediction_rated['diff'] = abs(df_prediction_rated['prediction'] - df_prediction_rated['actual'])

In [20]:
# Preview df_prediction_rated
df_prediction_rated.head()

| | user_id | restaurant_name | actual | prediction | details | diff |
|---|---|---|---|---|---|---|
| 0 | 2466 | Western Lake Chinese Seafood Restaurant | 3.0 | 3.582694 | {'was_impossible': False} | 0.582694 |
| 1 | 4821 | Go Fish Ocean Emporium | 4.0 | 3.436397 | {'was_impossible': False} | 0.563603 |
| 2 | 2544 | Hawksworth Restaurant | 4.0 | 4.023631 | {'was_impossible': False} | 0.023631 |
| 3 | 7631 | Sushi Itoga | 5.0 | 3.83274 | {'was_impossible': True, 'reason': 'User and i... | 1.16726 |
| 4 | 1623 | East is East | 3.0 | 2.300915 | {'was_impossible': False} | 0.699085 |


In [21]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (df_prediction_rated['diff'] == 0).mean())

Proportion of correct predictions: 0.0066115063408598825
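Exact equality between a continuous prediction and a whole-star rating is rare, which likely explains the low proportion above. One alternative, sketched below as an assumption rather than the notebook's method, is to round each prediction to the nearest star before comparing. The five values are the rows shown in the `df_prediction_rated` preview.

```python
import numpy as np

# Actual and predicted ratings from the five preview rows of df_prediction_rated
actual = np.array([3.0, 4.0, 4.0, 5.0, 3.0])
predicted = np.array([3.582694, 3.436397, 4.023631, 3.83274, 2.300915])

# Exact match on raw floats, as in the cell above
exact_rate = float(np.mean(predicted == actual))

# Round to the nearest whole star and keep the result inside the 1-5 scale
rounded = np.clip(np.rint(predicted), 1, 5)
rounded_rate = float(np.mean(rounded == actual))

print("Exact-match rate on raw predictions:", exact_rate)
print("Match rate after rounding to whole stars:", rounded_rate)
```

On these five rows only the Hawksworth Restaurant prediction (4.023631) rounds to the actual rating.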


In [22]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (df_prediction_rated["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.4281240334055057


In [23]:
# Build full trainset
full_trainset = data.build_full_trainset()

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14661b810>

In [24]:
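# Build the anti-test set: every (user, restaurant) pair with no rating in the full trainset,
# with the unknown rating filled in as -1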
full_testset = full_trainset.build_anti_testset(fill=-1)
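Conceptually, `build_anti_testset` returns every user-item pair that has no rating in the trainset, with the rating slot set to the `fill` value. The pure-Python sketch below mimics that idea on a toy example. The helper `anti_testset_sketch` and the toy users and ratings are hypothetical, and this is not Surprise's implementation.

```python
def anti_testset_sketch(rated_pairs, users, items, fill=-1):
    """Return (user, item, fill) for every pair that has no known rating."""
    rated = set(rated_pairs)
    return [(u, i, fill) for u in users for i in items if (u, i) not in rated]

# Hypothetical toy data: three users, three restaurants, four known ratings
users = [0, 1, 2]
items = ["Miku", "Yolks", "Fable"]
rated_pairs = [(0, "Miku"), (0, "Yolks"), (1, "Fable"), (2, "Miku")]

anti = anti_testset_sketch(rated_pairs, users, items)

print("Unrated pairs:", len(anti))
for row in anti:
    print(row)
```

With 8,978 users and 766 restaurants, the real anti-test set holds several million pairs, so predicting on it takes noticeably longer than on the held-out test set.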

In [25]:
# Predict ratings for every pair in the anti-test set
my_prediction = my_svd.test(full_testset)

In [26]:
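# Put predictions into a DataFrame
# Note: this uses `my_pred` (held-out test-set predictions, which carry actual ratings),
# not `my_prediction` (anti-test-set predictions); the outputs below reflect `my_pred`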
df_prediction_unrated = pd.DataFrame(my_pred, columns=['user_id',
                                                     'restaurant_name',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [27]:
df_prediction_unrated.head()

| | user_id | restaurant_name | actual | prediction | details |
|---|---|---|---|---|---|
| 0 | 2466 | Western Lake Chinese Seafood Restaurant | 3.0 | 3.582694 | {'was_impossible': False} |
| 1 | 4821 | Go Fish Ocean Emporium | 4.0 | 3.436397 | {'was_impossible': False} |
| 2 | 2544 | Hawksworth Restaurant | 4.0 | 4.023631 | {'was_impossible': False} |
| 3 | 7631 | Sushi Itoga | 5.0 | 3.83274 | {'was_impossible': True, 'reason': 'User and i... |
| 4 | 1623 | East is East | 3.0 | 2.300915 | {'was_impossible': False} |


### Making Predictions for User 4056 <a class="anchor" id="4056"></a>

In [28]:
# Select predictions for user 4056 (the user with the most ratings),
# sorted from highest to lowest predicted rating
predict_4056 = df_prediction_unrated[df_prediction_unrated['user_id'] == 4056].sort_values(by=['prediction'], ascending=False)

predict_4056

| | user_id | restaurant_name | actual | prediction | details |
|---|---|---|---|---|---|
| 20122 | 4056 | Parallel 49 Brewing | 5.0 | 5.000000 | {'was_impossible': False} |
| 22236 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} |
| 8433 | 4056 | Kissa Tanto | 5.0 | 5.000000 | {'was_impossible': False} |
| 21649 | 4056 | Tuc Craft Kitchen | 3.0 | 5.000000 | {'was_impossible': False} |
| 23024 | 4056 | Trafiq Cafe & Bakery | 5.0 | 5.000000 | {'was_impossible': False} |
| ... | ... | ... | ... | ... | ... |
| 24161 | 4056 | Sushi Coen | 3.0 | 1.847063 | {'was_impossible': False} |
| 4837 | 4056 | CaliBurger Vancouver | 5.0 | 1.843023 | {'was_impossible': False} |
| 12448 | 4056 | Showcase Restaurant & Bar | 5.0 | 1.625779 | {'was_impossible': False} |
| 23538 | 4056 | Just Waffles | 5.0 | 1.000000 | {'was_impossible': False} |


In [29]:
original_4056 = sorted_data[sorted_data['user_id'] == 4056]

original_4056

| | user_id | restaurant_name | rating |
|---|---|---|---|
| 144010 | 4056 | Sushi Mura | 5.0 |
| 144009 | 4056 | Purebread | 5.0 |
| 144014 | 4056 | Canra Srilankan Cuisine | 5.0 |
| 144023 | 4056 | Sushi Hub | 4.0 |
| 143948 | 4056 | Sal y Limón | 5.0 |
| ... | ... | ... | ... |
| 141845 | 4056 | Pizzeria Farina | 4.0 |
| 141844 | 4056 | Joe Fortes Seafood & Chop House | 5.0 |
| 141746 | 4056 | Wang's Taiwan Beef Noodle House | 4.0 |
| 141736 | 4056 | Showcase Restaurant & Bar | 5.0 |


In [30]:
# Merge on 'user_id' and 'restaurant_name'
merged_4056 = predict_4056.merge(original_4056, how='left', on=['user_id', 'restaurant_name'])

# Calculate the absolute difference between 'prediction' and 'actual'
merged_4056['diff'] = abs(merged_4056['prediction'] - merged_4056['actual'])

# Drop the 'rating' column
merged_4056.drop(columns=['rating'], inplace=True)

# Display the updated DataFrame
merged_4056

| | user_id | restaurant_name | actual | prediction | details | diff |
|---|---|---|---|---|---|---|
| 0 | 4056 | Parallel 49 Brewing | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 1 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 2 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 3 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 4 | 4056 | Kissa Tanto | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 267 | 4056 | Sushi Coen | 3.0 | 1.847063 | {'was_impossible': False} | 1.152937 |
| 268 | 4056 | CaliBurger Vancouver | 5.0 | 1.843023 | {'was_impossible': False} | 3.156977 |
| 269 | 4056 | Showcase Restaurant & Bar | 5.0 | 1.625779 | {'was_impossible': False} | 3.374221 |
| 270 | 4056 | Just Waffles | 5.0 | 1.000000 | {'was_impossible': False} | 4.000000 |


In [31]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (merged_4056['diff'] == 0).mean())

Proportion of correct predictions: 0.025735294117647058


In [32]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (merged_4056["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.5477941176470589


### Final Item-Based Recommender System <a class="anchor" id="final"></a>

In [33]:
# Split the dataset into train and test sets
trainset, testset = train_test_split(data, test_size=0.4)

# Fit the algorithm on the training dataset
my_svd.fit(trainset)

# Generate predictions on the test dataset
predictions = my_svd.test(testset)

# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.7530979114555563
Mean Squared Error (MSE): 3.073352287149833
Mean Absolute Error (MAE): 1.3899871313953476
Fraction of Concordant Pairs (FCP): 0.6620823794661699
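Compared with the first model, RMSE here is higher (about 1.75 versus 1.13) while FCP is also higher (about 0.662 versus 0.616). The two metrics measure different things: RMSE tracks absolute error, whereas FCP only asks whether pairs of restaurants are ranked in the right order for each user. The sketch below is a simplified pair count written for illustration; the helper `fcp_sketch` and the toy triples are hypothetical, and Surprise's `accuracy.fcp` may differ in details such as tie handling and per-user averaging.

```python
from collections import defaultdict
from itertools import combinations

def fcp_sketch(predictions):
    """predictions: iterable of (user, actual, estimate) triples."""
    by_user = defaultdict(list)
    for user, actual, est in predictions:
        by_user[user].append((actual, est))

    concordant = discordant = 0
    for pairs in by_user.values():
        # Compare every pair of items rated by the same user
        for (a1, e1), (a2, e2) in combinations(pairs, 2):
            if a1 == a2 or e1 == e2:
                continue  # ties carry no ordering information in this sketch
            if (a1 - a2) * (e1 - e2) > 0:
                concordant += 1   # predicted order agrees with actual order
            else:
                discordant += 1   # predicted order is reversed

    total = concordant + discordant
    return concordant / total if total else float("nan")

# Hypothetical triples: user u1's estimates are far too low but correctly ordered
toy = [("u1", 5.0, 2.0), ("u1", 3.0, 1.0), ("u1", 1.0, 0.5),
       ("u2", 4.0, 3.0), ("u2", 2.0, 3.5)]

print("FCP on toy data:", fcp_sketch(toy))
```

In the toy data, user `u1` contributes three concordant pairs despite large absolute errors, which is how a model can score well on FCP while scoring poorly on RMSE.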
