## Matrix Factorization Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)
3. [FunkSVD Item-Based Recommender System with Scikit Surprise](#funksvd)
4. [Hyperparameter Tuning FunkSVD Item-Based Recommender System](#hyper)
5. [Making Predictions for User 7233](#7233)
6. [Final Item-Based Recommender System](#final)

### Introduction <a class="anchor" id="intro"></a>

In this notebook, we will construct a Matrix Factorization Model using FunkSVD. This method is particularly useful for collaborative filtering in recommendation systems. By leveraging FunkSVD, we can factorize the user-item interaction matrix to uncover latent features that capture user preferences and item characteristics. With this model, we aim to provide accurate and personalized recommendations to users based on their past interactions.
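
As a rough sketch of what a FunkSVD-style factorization optimizes, the unbiased form of the model predicts a rating as the dot product of a user vector and an item vector, and fits both by minimizing a regularized squared error over the observed ratings. The symbols $p_u$ (user factors), $q_i$ (item factors), $\lambda$ (regularization strength), and $K$ (the set of observed user-item pairs) are notation introduced here for illustration:

$$
\hat{r}_{ui} = q_i^\top p_u,
\qquad
\min_{p_*,\, q_*} \sum_{(u,i) \in K} \left( r_{ui} - q_i^\top p_u \right)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
$$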

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

# Import SVD algorithm from Surprise library
from surprise import SVD

# Import Reader and Dataset from Surprise library
from surprise.reader import Reader
from surprise import Dataset

# Import FunkSVD algorithm from Surprise library
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

# Import train_test_split and GridSearchCV from Surprise library
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

# Import accuracy module from Surprise library
from surprise import accuracy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="model"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique business id
* `rating`: star rating

In [2]:
# Read data from a pickle file into a Pandas DataFrame
model_data = pd.read_pickle('T:/GitHub/Brainstation_Capstone/Data/model_data.pkl')

In [3]:
# Display concise information about the `model_data` DataFrame
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1203530 entries, 40 to 5572793
Data columns (total 5 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   user_id          1203530 non-null  int64  
 1   business_id      1203530 non-null  int64  
 2   rating           1203530 non-null  float64
 3   restaurant_name  1203530 non-null  object 
 4   categories       1203530 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 55.1+ MB


In [4]:
# Display the first few rows of the `model_data` DataFrame
model_data.head()

Unnamed: 0,user_id,business_id,rating,restaurant_name,categories
40,53031,6620,4.0,Thaitation,[Thai]
41,53031,4147,2.0,Howling Wolf Taqueria,"[Bars, Arts & Entertainment, Nightlife, Music ..."
42,53031,12401,3.0,Santarpio's Pizza,"[Pizza, American (Traditional), Italian]"
43,53031,1357,2.0,The Gallows,"[Seafood, Bars, American (New), American (Trad..."
44,53031,3498,3.0,Antique Table,[Italian]


In [5]:
# Count the number of missing values in each column of the `model_data` DataFrame
model_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [6]:
# Print the size of our model dataset
print(f"The size of our model dataset is {model_data.shape[0]} entries.")

The size of our model dataset is 1203530 entries.


In [7]:
# Draw a reproducible 1% random sample of the ratings to keep training time manageable
sample_data = model_data.sample(frac=0.01, random_state=42)

In [8]:
# Extract columns 'user_id', 'restaurant_name', and 'rating' from `sample_data`,
# then sort the data by 'user_id' in ascending order
sorted_data = sample_data[['user_id', 'restaurant_name', 'rating']].sort_values(by='user_id')

# Display the sorted data
display(sorted_data)

Unnamed: 0,user_id,restaurant_name,rating
2328043,4,Blue Star Donuts,5.0
1488926,10,Barley's Brewing Company,3.0
1524809,18,The Hollywood Brown Derby,4.0
3048313,25,Don Chilitos Mexican Grill,4.0
273005,27,Sugar Mama's Bakeshop,3.0
...,...,...,...
1607482,81101,CHAU Veggie Express,4.0
1061957,81115,Chef Eddie's,4.0
1470049,81130,Alberta Street Pub,3.0
2071677,81132,Boxer Bento,1.0


In [9]:
# Get unique user_id values and map them to new values starting from 0
user_id_mapping = {user_id: new_id for new_id, user_id in enumerate(sorted_data['user_id'].unique())}

# Replace the 'user_id' values in the DataFrame using the mapping
sorted_data['user_id'] = sorted_data['user_id'].map(user_id_mapping)

# Display the updated DataFrame
display(sorted_data)

Unnamed: 0,user_id,restaurant_name,rating
2328043,0,Blue Star Donuts,5.0
1488926,1,Barley's Brewing Company,3.0
1524809,2,The Hollywood Brown Derby,4.0
3048313,3,Don Chilitos Mexican Grill,4.0
273005,4,Sugar Mama's Bakeshop,3.0
...,...,...,...
1607482,8976,CHAU Veggie Express,4.0
1061957,8977,Chef Eddie's,4.0
1470049,8978,Alberta Street Pub,3.0
2071677,8979,Boxer Bento,1.0


In [10]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique reviewers 
print("Number of unique reviewers:", sorted_data['user_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['rating'].min(), "to", sorted_data['rating'].max())

Number of restaurants: 5937
Number of unique reviewers: 8981
Range of ratings: 1.0 to 5.0


* The 1% sample contains 5,937 unique restaurant names.
* There are 8,981 unique reviewers in the sample.
* The ratings given by reviewers range from 1.0 to 5.0.

In [11]:
# Group by 'user_id' and count the number of non-NaN ratings for each user
user_ratings_count = sorted_data.groupby('user_id')['rating'].count()

# Find the user with the most ratings (index of the maximum count)
user_with_most_ratings = user_ratings_count.idxmax()

# Get the actual count of ratings for the user with the most ratings
most_ratings_count = user_ratings_count.max()

# Print the results
print(f"User with the most ratings: {user_with_most_ratings}")
print(f"Number of ratings for the user: {most_ratings_count}")

User with the most ratings: 7233
Number of ratings for the user: 13


* User 7233 (re-mapped ID) has provided the most ratings in the sample.
* This user has submitted a total of 13 ratings.
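
The counts above imply a very sparse user-item matrix. The following back-of-the-envelope sketch uses only figures printed earlier in this notebook; it assumes the 1% sample kept `round(0.01 * 1_203_530)` rows, and the variable names are illustrative.

```python
# Figures taken from the outputs above; n_ratings assumes the 1% sample
# kept round(0.01 * 1_203_530) rows.
n_ratings = round(0.01 * 1_203_530)
n_users = 8_981   # unique reviewers in the sample
n_items = 5_937   # unique restaurant names in the sample

# Share of the user-item matrix that actually holds a rating
density = n_ratings / (n_users * n_items)

# Average number of ratings available per user
ratings_per_user = n_ratings / n_users

print(f"Ratings in sample: {n_ratings}")
print(f"Matrix density: {density:.6f}")
print(f"Average ratings per user: {ratings_per_user:.2f}")
```

With so few ratings per user, many users and restaurants in a held-out test set are likely to be absent from the training split, which limits what any factorization model can learn from this sample.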

### FunkSVD Item-Based Recommender System with Scikit Surprise <a class="anchor" id="funksvd"></a>

Here, we prepare our data for the scikit-surprise library to build a recommendation model. First, we create a `Reader` object and specify the rating scale, which ranges from 1 to 5 in this case. Next, we load our DataFrame `sorted_data` into a scikit-surprise `Dataset` object using `Dataset.load_from_df`.

In [12]:
# Load the DataFrame into a scikit-surprise Dataset
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(sorted_data[['user_id', 'restaurant_name', 'rating']], reader)

In [13]:
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.4)

Our baseline FunkSVD model will be set up with the following parameters:

* Number of factors (latent features) considered: 100
* Number of epochs (iterations during training): 20
* Learning rate for the optimization process: 0.05
* Biased set to False (indicating no bias terms are used in the model)
* Verbosity level set to 0 (minimal output during training)

These parameters define the behavior and complexity of our FunkSVD model, and with these settings, we'll proceed to train and optimize our recommendation system effectively.
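
To make the role of these parameters concrete, here is a minimal pure-Python sketch of unbiased matrix factorization trained with stochastic gradient descent. It is not Surprise's implementation: the function names `funk_sgd` and `squared_error`, the toy ratings, and the regularization value `reg_all=0.02` are assumptions chosen for illustration.

```python
import random

def funk_sgd(ratings, n_factors=2, n_epochs=20, lr_all=0.05, reg_all=0.02, seed=0):
    """Toy SGD matrix factorization without bias terms.

    ratings: list of (user, item, rating) tuples.
    Returns (P, Q): dicts mapping user/item ids to latent factor lists.
    """
    rng = random.Random(seed)
    # Small random initial factors for every user and item seen in the data
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u, _, _ in ratings}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for _, i, _ in ratings}

    for _ in range(n_epochs):              # n_epochs: passes over the ratings
        for u, i, r in ratings:
            pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - pred
            for f in range(n_factors):     # n_factors: latent dimensions
                pu, qi = P[u][f], Q[i][f]
                # lr_all scales each gradient step; reg_all shrinks the factors
                P[u][f] += lr_all * (err * qi - reg_all * pu)
                Q[i][f] += lr_all * (err * pu - reg_all * qi)
    return P, Q

def squared_error(P, Q, ratings):
    """Sum of squared errors of the dot-product predictions."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return total

# Hypothetical toy ratings: (user, item, rating)
toy = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4),
       ("b", "z", 1), ("c", "y", 2), ("c", "z", 5)]

P0, Q0 = funk_sgd(toy, n_epochs=0)     # untrained factors
P1, Q1 = funk_sgd(toy, n_epochs=200)   # trained factors
print("Error before training:", squared_error(P0, Q0, toy))
print("Error after training:", squared_error(P1, Q1, toy))
```

A larger learning rate speeds up each step but can overshoot, while more factors and more epochs increase capacity and training time; on very sparse data they also increase the risk of overfitting.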

In [14]:
# Create the FunkSVD model
model = FunkSVD(n_factors=100, n_epochs=20, lr_all=0.05, biased=False, verbose=0)

# Train the model on the training set
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x105f335d0>

In [15]:
# Make predictions on the test set
predictions = model.test(testset)

In [16]:
# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.5800241520888678
Mean Squared Error (MSE): 2.496476321184146
Mean Absolute Error (MAE): 1.1744352177494088
Fraction of Concordant Pairs (FCP): 0.482469545474392


Let's understand the results of our baseline Matrix Factorization Model using FunkSVD:

* **Root Mean Squared Error (RMSE)**: In this case, the RMSE value is approximately 1.58. A lower RMSE indicates that the model's predictions are closer to the actual ratings, suggesting a better fit to the data.

* **Mean Squared Error (MSE)**: The MSE value is around 2.50. Like RMSE, a lower MSE indicates better model performance.

* **Mean Absolute Error (MAE)**: The MAE value is approximately 1.17, so predictions are off by more than one star on average. As with RMSE and MSE, a lower MAE indicates more accurate predictions.

* **Fraction of Concordant Pairs (FCP)**: The FCP value is around 0.48, which means that only about 48% of the item pairs are correctly ranked by our model. This is slightly worse than the roughly 0.5 that a random ordering would be expected to give.

Overall, these metrics give us insights into the performance of our baseline model. We aim to improve these values as we fine-tune the model and make more accurate and reliable recommendations.
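
The metrics above can be reproduced by hand on a small example. The sketch below is a simplified, pure-Python illustration with hypothetical data; in particular, `fcp` pools concordant and discordant pairs across all users, whereas Surprise's `accuracy.fcp` aggregates per-user counts in its own way, so the two may return different values.

```python
import math
from collections import defaultdict

def rmse_mse_mae(pairs):
    """pairs: list of (actual, predicted). Returns (rmse, mse, mae)."""
    errors = [actual - predicted for actual, predicted in pairs]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return math.sqrt(mse), mse, mae

def fcp(preds):
    """preds: list of (user, actual, predicted).

    Pooled fraction of concordant pairs: within each user, a pair of items
    is concordant when the predicted order matches the actual order.
    """
    by_user = defaultdict(list)
    for user, actual, est in preds:
        by_user[user].append((actual, est))

    concordant = discordant = 0
    for items in by_user.values():
        for a_i, e_i in items:
            for a_j, e_j in items:
                if e_i > e_j and a_i > a_j:
                    concordant += 1
                if e_i >= e_j and a_i < a_j:
                    discordant += 1
    if concordant + discordant == 0:
        raise ValueError("No comparable pairs; FCP is undefined.")
    return concordant / (concordant + discordant)

# Hypothetical predictions: (user, actual rating, predicted rating)
toy_preds = [("u1", 5, 4.5), ("u1", 3, 3.0), ("u1", 1, 3.5),
             ("u2", 4, 2.0), ("u2", 2, 3.0)]

rmse, mse, mae = rmse_mse_mae([(a, p) for _, a, p in toy_preds])
print("RMSE:", rmse, "MSE:", mse, "MAE:", mae)
print("Pooled FCP:", fcp(toy_preds))
```

Note that RMSE, MSE, and MAE measure how far predictions are from the true ratings, while FCP only measures whether items are ordered correctly for each user, so the two kinds of metric can move independently.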

### Hyperparameter Tuning FunkSVD Item-Based Recommender System <a class="anchor" id="hyper"></a>

We'll now tune the hyperparameters of our FunkSVD recommender system. We'll use a grid search with cross-validation to try each combination of hyperparameters and keep the one with the best FCP. By tuning the hyperparameters, we aim to improve the system's ranking accuracy so that users receive more relevant restaurant suggestions.
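
A grid search fits one model per parameter combination per cross-validation fold, so its cost grows multiplicatively with each list in the grid. The sketch below only enumerates the combinations for the grid and fold count used in this section; it does not train anything, and the names `combos` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid and fold count as used for the search in this section
param_grid = {
    "n_factors": [100, 150],
    "n_epochs": [10, 20],
    "lr_all": [0.005, 0.1],
    "biased": [False],
}
cv = 3

# Cartesian product of all candidate values, one dict per combination
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * cv

print(f"{len(combos)} combinations x {cv} folds = {n_fits} model fits")
for combo in combos:
    print(combo)
```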

In [17]:
# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] }

# Set up GridSearchCV with 3-fold cross-validation, scoring by FCP
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(data)

In [18]:
# Print the best FCP score
print('Best FCP:', GS.best_score['fcp'])

# Print the best parameters found during the grid search
print('Best parameters:', GS.best_params['fcp'])

Best FCP: 0.47676902947597544
Best parameters: {'n_factors': 150, 'n_epochs': 10, 'lr_all': 0.1, 'biased': False}


**Best FCP**: The highest Fraction of Concordant Pairs (FCP) achieved during the grid search is approximately 0.48. FCP is a measure of how well our model ranks the items, and a higher FCP means better ranking performance.

**Best parameters**: The hyperparameters that resulted in the best FCP are as follows:
* Number of factors (latent features) considered: 150
* Number of epochs (iterations during training): 10
* Learning rate for the optimization process: 0.1
* Biased set to False (indicating no bias terms are used in the model)

These are the hyperparameters selected by the grid search. Note that the best cross-validated FCP (about 0.48) is no higher than the baseline's and remains below 0.5, so tuning did not improve ranking performance on this 1% sample.

In [19]:
# Split train test set
trainset, testset = train_test_split(data, test_size=0.40)

# Set the algorithm (note: n_epochs=20 is used here, whereas the grid search selected n_epochs=10)
my_svd = FunkSVD(n_factors=150, 
                 n_epochs=20, 
                 lr_all=0.1,
                 biased=False,
                 verbose=0)

# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [20]:
# Put 'my_pred' results in a DataFrame
df_prediction_rated = pd.DataFrame(my_pred, columns=['user_id',
                                               'restaurant_name',
                                               'actual',
                                               'prediction',
                                               'details'])

# Calculate the difference of actual and prediction into the 'diff' column
df_prediction_rated['diff'] = abs(df_prediction_rated['prediction'] - df_prediction_rated['actual'])

In [21]:
# Check the first rows of df_prediction_rated
df_prediction_rated.head()

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
0,6598,"Snooze, an A.M. Eatery",5.0,1.0,{'was_impossible': False},4.0
1,2665,Texas Roadhouse,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i...",0.140147
2,333,Takemura Japanese Restaurant,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i...",0.140147
3,8445,Breka Bakery & Cafe,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i...",0.140147
4,4935,Darwin's,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i...",0.140147


In [22]:
# See the best 10 predictions
df_prediction_rated.sort_values(by='diff')[:10]

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
3382,5551,Austin Daily Press,1.0,1.0,{'was_impossible': False},0.0
681,6143,Güero's Taco Bar,1.0,1.0,{'was_impossible': False},0.0
1488,3675,Papa Haydn Northwest,1.0,1.0,{'was_impossible': False},0.0
3655,6045,Finale,1.0,1.0,{'was_impossible': False},0.0
403,674,Cambridge Brewing Company,1.0,1.0,{'was_impossible': False},0.0
4532,969,Gus's World Famous Fried Chicken,1.0,1.0,{'was_impossible': False},0.0
3846,115,Hong Kong Eatery,1.0,1.0,{'was_impossible': False},0.0
3242,287,Imperial Dynasty Chinese and Japanese Cuisine,1.0,1.0,{'was_impossible': False},0.0
2095,4199,East Coast Grill,1.0,1.0,{'was_impossible': False},0.0
1405,7765,B.GOOD,1.0,1.0,{'was_impossible': False},0.0


In [23]:
# See the worst 10 predictions
df_prediction_rated.sort_values(by='diff')[-10:]

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
821,1822,Dave's Fresh Pasta,5.0,1.0,{'was_impossible': False},4.0
3666,2172,Joy Cafe,5.0,1.0,{'was_impossible': False},4.0
841,4672,Jade Restaurant,5.0,1.0,{'was_impossible': False},4.0
3637,7213,Chez Zee American Bistro,5.0,1.0,{'was_impossible': False},4.0
3618,1773,SkyLounge,5.0,1.0,{'was_impossible': False},4.0
3606,3013,BYTES Restaurant,5.0,1.0,{'was_impossible': False},4.0
3603,8618,Hubert's Polish Kitchen,5.0,1.0,{'was_impossible': False},4.0
871,6045,India Quality Restaurant,5.0,1.0,{'was_impossible': False},4.0
3692,1332,Raku,5.0,1.0,{'was_impossible': False},4.0
0,6598,"Snooze, an A.M. Eatery",5.0,1.0,{'was_impossible': False},4.0


In [24]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (df_prediction_rated['diff'] == 0).mean())

Proportion of correct predictions: 0.00540091400083091


**Proportion of correct predictions**: Only about 0.54% of the predictions made by our model match the actual rating exactly. Exact matches are a strict criterion for a model that outputs continuous values, but this figure still shows that very few predictions land exactly on the true rating.

In [25]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (df_prediction_rated["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.4977149979227254


**Proportion of correct predictions within margin 1**: About 49.77% of the model's predictions fall within one star of the actual rating. In other words, roughly half of the predictions are off by more than one star, so there is considerable room to improve overall prediction accuracy.
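
Many rows in the prediction table carry `'was_impossible': True` together with the same estimate (3.859853). As far as we understand Surprise's behavior, this is a fallback value (the training-set global mean) returned when the user or restaurant was not seen during training, so those rows say little about the learned factors. The sketch below shows one way to score fallback and modelled predictions separately. The rows are hypothetical, laid out like the `actual`, `prediction`, and `details` columns of `df_prediction_rated`, and the helper name `within_margin` is illustrative.

```python
# Hypothetical rows: (actual, prediction, details)
rows = [
    (5.0, 1.0, {"was_impossible": False}),
    (4.0, 3.859853, {"was_impossible": True}),
    (4.0, 3.859853, {"was_impossible": True}),
    (1.0, 1.0, {"was_impossible": False}),
    (3.0, 3.859853, {"was_impossible": True}),
]

def within_margin(rows, margin=1.0):
    """Share of rows whose prediction is within `margin` of the actual rating."""
    if not rows:
        return float("nan")
    hits = sum(abs(pred - actual) <= margin for actual, pred, _ in rows)
    return hits / len(rows)

# Split by whether the model could actually produce an estimate
fallback = [r for r in rows if r[2]["was_impossible"]]
modelled = [r for r in rows if not r[2]["was_impossible"]]

print("Fallback rows:", len(fallback), "within margin 1:", within_margin(fallback))
print("Modelled rows:", len(modelled), "within margin 1:", within_margin(modelled))
```

On the real DataFrame, the same flag could be extracted with `df_prediction_rated['details'].apply(lambda d: d['was_impossible'])` and used as a boolean mask.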

In [26]:
# Build full trainset
full_trainset = data.build_full_trainset()

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x16a0dab50>

In [38]:
# Split the dataset into training and testing sets
trainset, testset = train_test_split(data, test_size=0.4)  # You can adjust the test_size as per your preference

# Create an instance of the SVD algorithm
my_svd = SVD()

# Fit the model with the training set
my_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x165527990>

In [39]:
# Build the anti-testset: every (user, item) pair with no rating in the trainset,
# with the placeholder rating -1
testset = trainset.build_anti_testset(fill=-1)

In [40]:
# Make predictions on the test set
predictions = my_svd.test(testset)

In [41]:
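# Put into a DataFrame
# NOTE: this passes `my_pred` (the test-set predictions from In [19]), not the
# anti-testset `predictions` computed in In [40], so the output below repeats the
# rated predictions. Pass `predictions` instead to inspect the unrated pairs.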
df_prediction_unrated = pd.DataFrame(my_pred, columns=['user_id',
                                                     'restaurant_name',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [42]:
df_prediction_unrated.head()

Unnamed: 0,user_id,restaurant_name,actual,prediction,details
0,6598,"Snooze, an A.M. Eatery",5.0,1.0,{'was_impossible': False}
1,2665,Texas Roadhouse,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i..."
2,333,Takemura Japanese Restaurant,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i..."
3,8445,Breka Bakery & Cafe,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i..."
4,4935,Darwin's,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i..."


### Making Predictions for User 7233 <a class="anchor" id="7233"></a>

In [43]:
# Check our favorite user id `7233` for the top predictions
predict_7233 = df_prediction_unrated[df_prediction_unrated['user_id'] == 7233].sort_values(by=['prediction'], ascending=False)

predict_7233

Unnamed: 0,user_id,restaurant_name,actual,prediction,details
1495,7233,Fuku Boston Seaport,3.0,3.859853,"{'was_impossible': True, 'reason': 'User and i..."
2404,7233,Spring Shabu-Shabu,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i..."
1337,7233,Elephant & Castle,3.0,1.0,{'was_impossible': False}
3023,7233,Pho Pasteur,3.0,1.0,{'was_impossible': False}


In [44]:
original_7233 = sorted_data[sorted_data['user_id'] == 7233]

original_7233

Unnamed: 0,user_id,restaurant_name,rating
213236,7233,Clio,4.0
213317,7233,Horseshoe Grille,3.0
213595,7233,Barcelona Wine Bar,4.0
213216,7233,B Cafe,3.0
213218,7233,Spring Shabu-Shabu,4.0
213235,7233,Strip-T's Restaurant,4.0
213642,7233,Elephant & Castle,3.0
212573,7233,Edamame,3.0
213508,7233,The Capital Grille,4.0
213700,7233,Fuku Boston Seaport,3.0


In [45]:
# Merge on 'user_id' and 'restaurant_name'
merged_7233 = predict_7233.merge(original_7233, how='left', on=['user_id', 'restaurant_name'])

# Calculate the absolute difference between 'prediction' and 'actual'
merged_7233['diff'] = abs(merged_7233['prediction'] - merged_7233['actual'])

# Drop the 'rating' column
merged_7233.drop(columns=['rating'], inplace=True)

# Display the updated DataFrame
merged_7233

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
0,7233,Fuku Boston Seaport,3.0,3.859853,"{'was_impossible': True, 'reason': 'User and i...",0.859853
1,7233,Spring Shabu-Shabu,4.0,3.859853,"{'was_impossible': True, 'reason': 'User and i...",0.140147
2,7233,Elephant & Castle,3.0,1.0,{'was_impossible': False},2.0
3,7233,Pho Pasteur,3.0,1.0,{'was_impossible': False},2.0


In [46]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (merged_7233['diff'] == 0).mean())

Proportion of correct predictions: 0.0


In [47]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (merged_7233["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.5


### Final Item-Based Recommender System <a class="anchor" id="final"></a>

In [48]:
# Split the dataset into train and test sets
trainset, testset = train_test_split(data, test_size=0.4)

# Fit the algorithm on the training dataset
my_svd.fit(trainset)

# Generate predictions on the test dataset
predictions = my_svd.test(testset)

# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.0179024911100565
Mean Squared Error (MSE): 1.0361254814080585
Mean Absolute Error (MAE): 0.7974786988033851
Fraction of Concordant Pairs (FCP): 0.4975859763797073
