## Matrix Factorization Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)
3. [FunkSVD Item-Based Recommender System with Scikit Surprise](#funksvd)
4. [Hyperparameter Tuning FunkSVD Item-Based Recommender System](#hyper)
5. [Making Predictions for User 4056](#4056)
6. [Final Item-Based Recommender System](#final)

### Introduction <a class="anchor" id="intro"></a>

In this notebook, we will construct a Matrix Factorization Model using FunkSVD. This method is particularly useful for collaborative filtering in recommendation systems. By leveraging FunkSVD, we can factorize the user-item interaction matrix to uncover latent features that capture user preferences and item characteristics. With this model, we aim to provide accurate and personalized recommendations to users based on their past interactions.
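
As a rough illustration of what "factorizing the user-item matrix" means, here is a minimal NumPy sketch of an unbiased FunkSVD-style model trained with stochastic gradient descent. The toy ratings, the function names `funk_svd_sketch` and `rmse`, and the hyperparameter values are assumptions made for this example only; this is not the Surprise implementation used later in the notebook.

```python
import numpy as np

def funk_svd_sketch(ratings, n_users, n_items, n_factors=2,
                    n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn user factors P and item factors Q so that P[u] @ Q[i] approximates r_ui."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error for this rating
            p_u = P[u].copy()              # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the given ratings."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical toy data: (user index, item index, rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = funk_svd_sketch(toy_ratings, 3, 3, n_epochs=0)   # untrained factors
P, Q = funk_svd_sketch(toy_ratings, 3, 3)                 # trained factors

print("RMSE before training:", rmse(toy_ratings, P0, Q0))
print("RMSE after training:", rmse(toy_ratings, P, Q))
```

Only the observed ratings drive the updates, which is why this approach copes with a very sparse user-item matrix.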

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

# Import SVD algorithm from Surprise library
from surprise import SVD

# Import Reader and Dataset from Surprise library
from surprise.reader import Reader
from surprise import Dataset

# Import FunkSVD algorithm from Surprise library
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

# Import train_test_split and GridSearchCV from Surprise library
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

# Import accuracy module from Surprise library
from surprise import accuracy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

ModuleNotFoundError: No module named 'surprise'

### Model Dataset <a class="anchor" id="model"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique business id
* `rating`: star rating

In [None]:
# Read data from a pickle file into a Pandas DataFrame
model_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/model_data.pkl')

In [None]:
# Display concise information about the `model_data` DataFrame
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1203530 entries, 40 to 5572793
Data columns (total 5 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   user_id          1203530 non-null  int64  
 1   business_id      1203530 non-null  int64  
 2   rating           1203530 non-null  float64
 3   restaurant_name  1203530 non-null  object 
 4   categories       1203530 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 55.1+ MB


In [None]:
# Display the first few rows of the `model_data` DataFrame
model_data.head()

Unnamed: 0,user_id,business_id,rating,restaurant_name,categories
40,53031,6620,4.0,Thaitation,[Thai]
41,53031,4147,2.0,Howling Wolf Taqueria,"[Bars, Arts & Entertainment, Nightlife, Music ..."
42,53031,12401,3.0,Santarpio's Pizza,"[Pizza, American (Traditional), Italian]"
43,53031,1357,2.0,The Gallows,"[Seafood, Bars, American (New), American (Trad..."
44,53031,3498,3.0,Antique Table,[Italian]


In [None]:
# Count the number of missing values in each column of the `model_data` DataFrame
model_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [None]:
# Print the size of our model dataset
print(f"The size of our model dataset is {model_data.shape[0]} entries.")

The size of our model dataset is 1203530 entries.


In [None]:
# Extract columns 'user_id', 'restaurant_name', and 'rating' from `model_data`,
# then sort the data by 'user_id' in ascending order
sorted_data = model_data[['user_id', 'restaurant_name', 'rating']].sort_values(by='user_id')

# Display the sorted data
display(sorted_data)

Unnamed: 0,user_id,restaurant_name,rating
5514017,0,Keke's Breakfast Cafe,5.0
5111087,1,Galleria Umberto,5.0
5111089,1,The Friendly Toast,4.0
5111090,1,Pavement Coffeehouse,4.0
5111088,1,Jm Curley,3.0
...,...,...,...
1999031,81139,Meat & Bread,5.0
1999030,81139,The Salt Lick BBQ - Austin Airport,5.0
1999029,81139,Marutama Ramen,5.0
4055785,81140,Tia's Waterfront,4.0


In [None]:
# Get unique user_id values and map them to new values starting from 0
user_id_mapping = {user_id: new_id for new_id, user_id in enumerate(sorted_data['user_id'].unique())}

# Replace the 'user_id' values in the DataFrame using the mapping
sorted_data['user_id'] = sorted_data['user_id'].map(user_id_mapping)

# Display the updated DataFrame
display(sorted_data)

Unnamed: 0,user_id,restaurant_name,rating
5514017,0,Keke's Breakfast Cafe,5.0
5111087,1,Galleria Umberto,5.0
5111089,1,The Friendly Toast,4.0
5111090,1,Pavement Coffeehouse,4.0
5111088,1,Jm Curley,3.0
...,...,...,...
1999031,81139,Meat & Bread,5.0
1999030,81139,The Salt Lick BBQ - Austin Airport,5.0
1999029,81139,Marutama Ramen,5.0
4055785,81140,Tia's Waterfront,4.0


In [None]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique reviewers 
print("Number of unique reviewers:", sorted_data['user_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['rating'].min(), "to", sorted_data['rating'].max())

Number of restaurants: 12192
Number of unique reviewers: 81142
Range of ratings: 1.0 to 5.0


* We have a total of 12,192 restaurants in our dataset.
* There are 81,142 unique reviewers who have left reviews for these restaurants.
* The ratings given by reviewers range from 1.0 to 5.0.
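
To put these counts in perspective, the short calculation below estimates how sparse the user-restaurant matrix is. It uses only the figures reported above; treating every row as a distinct user-restaurant pair is an assumption (a user may have rated the same restaurant name more than once), so the density is an upper bound.

```python
# Figures reported earlier in this notebook
n_ratings = 1_203_530
n_users = 81_142
n_restaurants = 12_192

# Every possible (user, restaurant) combination
n_possible_pairs = n_users * n_restaurants

# Share of combinations that actually carry a rating (upper bound, see note above)
density = n_ratings / n_possible_pairs

print(f"Possible user-restaurant pairs: {n_possible_pairs:,}")
print(f"Share of pairs with a rating: {density:.4%}")
```

The vast majority of pairs are unrated, which is the setting matrix factorization is designed for, and it also means the full set of unrated pairs is very large.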

In [None]:
# Group by 'user_id' and count the number of non-NaN ratings for each user
user_ratings_count = sorted_data.groupby('user_id')['rating'].count()

# Find the user with the most ratings (index of the maximum count)
user_with_most_ratings = user_ratings_count.idxmax()

# Get the actual count of ratings for the user with the most ratings
most_ratings_count = user_ratings_count.max()

# Print the results
print(f"User with the most ratings: {user_with_most_ratings}")
print(f"Number of ratings for the user: {most_ratings_count}")

User with the most ratings: 59815
Number of ratings for the user: 976


* User 59815 has provided the most ratings.
* This user has submitted a total of 976 ratings.

### FunkSVD Item-Based Recommender System with Scikit Surprise <a class="anchor" id="funksvd"></a>

Here, we prepare our data for the scikit-surprise library to build a recommendation model. First, we create a `Reader` object and specify the rating scale, which ranges from 1 to 5. Next, we load the `user_id`, `restaurant_name`, and `rating` columns of `sorted_data` into a scikit-surprise `Dataset` object using `Dataset.load_from_df`.

In [None]:
# Load the DataFrame into a scikit-surprise Dataset
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(sorted_data[['user_id', 'restaurant_name', 'rating']], reader)

In [None]:
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.4)

Our baseline FunkSVD model will be set up with the following parameters:

* Number of factors (latent features) considered: 100
* Number of epochs (iterations during training): 20
* Learning rate for the optimization process: 0.05
* Biased set to False (indicating no bias terms are used in the model)
* Verbosity level set to 0 (minimal output during training)

These settings control the model's capacity (`n_factors`) and how it is optimized (`n_epochs`, `lr_all`). They serve as the baseline against which the tuned model is compared.

In [None]:
# Create the FunkSVD model
model = FunkSVD(n_factors=100, n_epochs=20, lr_all=0.05, biased=False, verbose=0)

# Train the model on the training set
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2920bc990>

In [None]:
# Make predictions on the test set
predictions = model.test(testset)

In [None]:
# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.1372911600680533
Mean Squared Error (MSE): 1.2934311827689386
Mean Absolute Error (MAE): 0.8900969926148625
Fraction of Concordant Pairs (FCP): 0.6041028828729958


Let's understand the results of our baseline Matrix Factorization Model using FunkSVD:

* **Root Mean Squared Error (RMSE)**: The RMSE is approximately 1.14, so predictions are typically off by a little more than one star on the 1 to 5 scale. RMSE penalizes large errors more heavily than MAE; lower is better.

* **Mean Squared Error (MSE)**: The MSE is around 1.29. It is the square of the RMSE, so it carries the same information on a squared scale; lower is better.

* **Mean Absolute Error (MAE)**: The MAE is approximately 0.89, meaning the average prediction is about 0.89 stars away from the actual rating; lower is better.

* **Fraction of Concordant Pairs (FCP)**: The FCP is around 0.60, meaning that for about 60% of the pairs of restaurants rated by the same user, the model orders the pair the same way the user did. A value near 0.5 would correspond to random ordering; higher is better.

Overall, these metrics give us insights into the performance of our baseline model. We aim to improve these values as we fine-tune the model and make more accurate and reliable recommendations.
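
To make the four metrics concrete, here is a small standard-library sketch that computes them by hand on made-up predictions. The function names and toy values are assumptions for illustration. In particular, `pooled_fcp` is a simplified version of FCP that pools pair counts over all users and skips pairs with tied actual ratings, so it may not match `surprise.accuracy.fcp` exactly.

```python
import math
from collections import defaultdict
from itertools import combinations

def error_metrics(rows):
    """rows: iterable of (user, actual, estimate). Returns RMSE, MSE and MAE."""
    errors = [est - actual for _, actual, est in rows]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae}

def pooled_fcp(rows):
    """Fraction of within-user item pairs whose predicted order matches the actual order."""
    by_user = defaultdict(list)
    for user, actual, est in rows:
        by_user[user].append((actual, est))
    concordant = discordant = 0
    for user_rows in by_user.values():
        for (a1, e1), (a2, e2) in combinations(user_rows, 2):
            if a1 == a2:
                continue                      # tied actual ratings carry no ordering
            if (a1 - a2) * (e1 - e2) > 0:
                concordant += 1               # same order in actual and predicted
            else:
                discordant += 1               # reversed or tied prediction
    total = concordant + discordant
    return concordant / total if total else float("nan")

# Hypothetical predictions: (user, actual rating, estimated rating)
toy_rows = [("u1", 5.0, 4.5), ("u1", 3.0, 3.5), ("u1", 1.0, 2.0),
            ("u2", 4.0, 2.5), ("u2", 2.0, 3.0)]

metrics = error_metrics(toy_rows)
fcp = pooled_fcp(toy_rows)
print(metrics)
print("FCP:", fcp)
```

Note that the error metrics look at how far each estimate is from the rating, while FCP only looks at ordering, so a model can improve on one while getting worse on the other.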

### Hyperparameter Tuning FunkSVD Item-Based Recommender System <a class="anchor" id="hyper"></a>

We'll now tune the hyperparameters of our FunkSVD Item-Based Recommender System. We'll use a grid search with cross-validation to try each combination of hyperparameters in a small grid and keep the combination with the highest FCP. The aim is to improve how well the model ranks restaurants for each user, and therefore the relevance of its suggestions.
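
To get a feel for the cost of an exhaustive grid search, the sketch below enumerates the combinations in a grid and counts the model fits required. The grid values mirror the ones used in this section; the variable names are assumptions for this example, and no model is trained here.

```python
from itertools import product

# Same hyperparameter grid as used in this section
param_grid = {
    'n_factors': [100, 150],
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False],
}
cv_folds = 3

# Cartesian product of all value lists, one dict per candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold
n_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print("Candidates:", len(candidates), "| model fits:", n_fits)
```

The number of fits grows multiplicatively with every value added to the grid, which is why the grid is kept small on a dataset of this size.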

In [None]:
# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] }

# Set up GridSearchCV with 3-fold cross-validation, scored by FCP
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(data)

In [None]:
# Print the best FCP scores
print('Best FCP:', GS.best_score['fcp'])

# Print the best parameters found during the grid search
print('Best parameters:', GS.best_params['fcp'])

Best FCP: 0.6171634393403845
Best parameters: {'n_factors': 100, 'n_epochs': 10, 'lr_all': 0.005, 'biased': False}


**Best FCP**: The highest Fraction of Concordant Pairs (FCP) achieved by our model is approximately 0.62. FCP is a measure of how well our model ranks the items, and a higher FCP means better ranking performance.

**Best parameters**: The hyperparameters that resulted in the best FCP are as follows:
* Number of factors (latent features) considered: 100
* Number of epochs (iterations during training): 10
* Learning rate for the optimization process: 0.005
* Biased set to False (indicating no bias terms are used in the model)

These are the best settings within the small grid we searched, not necessarily a global optimum. The cross-validated FCP of about 0.62 is a modest gain over the baseline's 0.60, and the two figures are not strictly comparable because the baseline was scored on a single 60/40 hold-out split rather than with 3-fold cross-validation.

In [None]:
# Split train test set
trainset, testset = train_test_split(data, test_size=0.40)

# Set the algorithm
# Note: n_factors=150 is used here, although the grid search above selected n_factors=100
my_svd = FunkSVD(n_factors=150, 
                 n_epochs=10, 
                 lr_all=0.005,
                 biased=False,
                 verbose=0)

# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [None]:
# Put 'my_pred' results in a DataFrame
df_prediction_rated = pd.DataFrame(my_pred, columns=['user_id',
                                               'restaurant_name',
                                               'actual',
                                               'prediction',
                                               'details'])

# Calculate the difference of actual and prediction into the 'diff' column
df_prediction_rated['diff'] = abs(df_prediction_rated['prediction'] - df_prediction_rated['actual'])

In [None]:
# Check the first rows of df_prediction_rated
df_prediction_rated.head()

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
0,64333,Shojo,4.0,2.563028,{'was_impossible': False},1.436972
1,38555,Salvation Pizza Kitchen and Bar - Rock Rose,4.0,1.627459,{'was_impossible': False},2.372541
2,9587,Vendetta,4.0,2.537885,{'was_impossible': False},1.462115
3,12832,5 Seasons Brewing,5.0,3.619123,{'was_impossible': False},1.380877
4,47611,Simon's Restaurant,2.0,2.072313,{'was_impossible': False},0.072313


In [None]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (df_prediction_rated['diff'] == 0).mean())

Proportion of correct predictions: 0.012804001562071572


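**Proportion of correct predictions**: Only about 1.28% of the predictions equal the actual rating exactly. This is a very strict test rather than direct evidence of poor accuracy: the model outputs continuous estimates while the actual ratings are whole stars, so an exact match is essentially only possible when an estimate is clipped to an end of the rating scale (1 or 5).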

In [None]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (df_prediction_rated["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.5311147208627953


**Proportion of correct predictions within margin 1**: About 53.11% of the model's predictions fall within 1 star of the actual rating. Put differently, nearly half of the predictions are off by more than one star on a 5-point scale, so there is still considerable room for improvement.
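
The margin-based check can be generalized to any tolerance. The sketch below applies a hypothetical helper, `within_margin`, to the five sample rows shown in the `df_prediction_rated.head()` output above; it is only meant to show how the proportion changes with the chosen margin.

```python
def within_margin(actual, predicted, margin):
    """Share of predictions whose absolute error is at most `margin`."""
    hits = [abs(p - a) <= margin for a, p in zip(actual, predicted)]
    return sum(hits) / len(hits)

# The five rows displayed by df_prediction_rated.head()
actual = [4.0, 4.0, 4.0, 5.0, 2.0]
predicted = [2.563028, 1.627459, 2.537885, 3.619123, 2.072313]

for margin in (0.0, 0.5, 1.0, 1.5):
    share = within_margin(actual, predicted, margin)
    print(f"Within {margin} star(s): {share:.0%}")
```

Reporting several margins side by side gives a fuller picture than a single exact-match figure.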

In [None]:
# Build full trainset
full_trainset = data.build_full_trainset()

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2b48bd190>

In [None]:
# Define the batch size
batch_size = 1000

# Build the anti-testset once: every (user, item) pair without a rating in the trainset.
# Note: with this many users and items the list can be extremely large.
anti_testset = full_trainset.build_anti_testset(fill=-1)

# Predict ratings for the unrated pairs in batches and collect the results
unrated_predictions = []
for start_idx in range(0, len(anti_testset), batch_size):
    batch_anti_testset = anti_testset[start_idx:start_idx + batch_size]

    # Use the tuned model that was fitted on the full trainset
    unrated_predictions.extend(my_svd.test(batch_anti_testset))
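
Once predictions for unrated pairs are available, a common next step is to keep only the N highest-scoring restaurants per user. The sketch below follows the pattern of the `get_top_n` example in the Surprise FAQ, but it works on plain tuples shaped like Surprise predictions (`uid`, `iid`, `r_ui`, `est`, `details`) so it can run on its own. The sample users, restaurant names, and scores are made up.

```python
from collections import defaultdict

def get_top_n(predictions, n=2):
    """Return {user: [(item, estimate), ...]} with the n highest estimates per user."""
    per_user = defaultdict(list)
    for uid, iid, _r_ui, est, _details in predictions:
        per_user[uid].append((iid, est))
    # Sort each user's items by estimated rating, highest first, and keep the top n
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in per_user.items()
    }

# Hypothetical predictions for unrated pairs (r_ui is the fill value -1)
sample_predictions = [
    (1, "Restaurant A", -1, 3.2, {}),
    (1, "Restaurant B", -1, 4.7, {}),
    (1, "Restaurant C", -1, 4.1, {}),
    (2, "Restaurant A", -1, 2.5, {}),
    (2, "Restaurant C", -1, 3.9, {}),
]

top_n = get_top_n(sample_predictions, n=2)
for uid, items in top_n.items():
    print(uid, items)
```

Keeping only the top N per user as each batch is processed would also avoid holding every prediction in memory at once.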

In [None]:
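# Build the set of unrated user-item pairs and predict on it
# (assumption: `full_testset` is not defined earlier in the notebook, so the anti-testset is used here)
full_testset = full_trainset.build_anti_testset(fill=-1)
my_prediction = my_svd.test(full_testset)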

In [None]:
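# Put the predictions into a DataFrame
# Note: this uses `my_pred` (the held-out test-set predictions), so 'actual' holds real ratings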
df_prediction_unrated = pd.DataFrame(my_pred, columns=['user_id',
                                                     'restaurant_name',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [None]:
df_prediction_unrated.head()

Unnamed: 0,user_id,restaurant_name,actual,prediction,details
0,2466,Western Lake Chinese Seafood Restaurant,3.0,3.582694,{'was_impossible': False}
1,4821,Go Fish Ocean Emporium,4.0,3.436397,{'was_impossible': False}
2,2544,Hawksworth Restaurant,4.0,4.023631,{'was_impossible': False}
3,7631,Sushi Itoga,5.0,3.83274,"{'was_impossible': True, 'reason': 'User and i..."
4,1623,East is East,3.0,2.300915,{'was_impossible': False}


### Making Predictions for User 4056 <a class="anchor" id="4056"></a>

In [None]:
# Select the predictions for user 59815 (the user with the most ratings), highest predicted rating first
predict_59815 = df_prediction_unrated[df_prediction_unrated['user_id'] == 59815].sort_values(by=['prediction'], ascending=False)

predict_59815

Unnamed: 0,user_id,restaurant_name,actual,prediction,details
20122,4056,Parallel 49 Brewing,5.0,5.000000,{'was_impossible': False}
22236,4056,French Made Baking,5.0,5.000000,{'was_impossible': False}
8433,4056,Kissa Tanto,5.0,5.000000,{'was_impossible': False}
21649,4056,Tuc Craft Kitchen,3.0,5.000000,{'was_impossible': False}
23024,4056,Trafiq Cafe & Bakery,5.0,5.000000,{'was_impossible': False}
...,...,...,...,...,...
24161,4056,Sushi Coen,3.0,1.847063,{'was_impossible': False}
4837,4056,CaliBurger Vancouver,5.0,1.843023,{'was_impossible': False}
12448,4056,Showcase Restaurant & Bar,5.0,1.625779,{'was_impossible': False}
23538,4056,Just Waffles,5.0,1.000000,{'was_impossible': False}


In [None]:
original_59815 = sorted_data[sorted_data['user_id'] == 59815]

original_59815

Unnamed: 0,user_id,restaurant_name,rating
144010,4056,Sushi Mura,5.0
144009,4056,Purebread,5.0
144014,4056,Canra Srilankan Cuisine,5.0
144023,4056,Sushi Hub,4.0
143948,4056,Sal y Limón,5.0
...,...,...,...
141845,4056,Pizzeria Farina,4.0
141844,4056,Joe Fortes Seafood & Chop House,5.0
141746,4056,Wang's Taiwan Beef Noodle House,4.0
141736,4056,Showcase Restaurant & Bar,5.0


In [None]:
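# Left-merge on 'user_id' and 'restaurant_name'; a restaurant name that appears more than once
# for this user in `sorted_data` produces duplicate rows in the result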
merged_59815 = predict_59815.merge(original_59815, how='left', on=['user_id', 'restaurant_name'])

# Calculate the absolute difference between 'prediction' and 'actual'
merged_59815['diff'] = abs(merged_59815['prediction'] - merged_59815['actual'])

# Drop the 'rating' column
merged_59815.drop(columns=['rating'], inplace=True)

# Display the updated DataFrame
merged_59815

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
0,4056,Parallel 49 Brewing,5.0,5.000000,{'was_impossible': False},0.000000
1,4056,French Made Baking,5.0,5.000000,{'was_impossible': False},0.000000
2,4056,French Made Baking,5.0,5.000000,{'was_impossible': False},0.000000
3,4056,French Made Baking,5.0,5.000000,{'was_impossible': False},0.000000
4,4056,Kissa Tanto,5.0,5.000000,{'was_impossible': False},0.000000
...,...,...,...,...,...,...
267,4056,Sushi Coen,3.0,1.847063,{'was_impossible': False},1.152937
268,4056,CaliBurger Vancouver,5.0,1.843023,{'was_impossible': False},3.156977
269,4056,Showcase Restaurant & Bar,5.0,1.625779,{'was_impossible': False},3.374221
270,4056,Just Waffles,5.0,1.000000,{'was_impossible': False},4.000000


In [None]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (merged_59815['diff'] == 0).mean())

Proportion of correct predictions: 0.025735294117647058


In [None]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (merged_59815["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.5477941176470589


### Final Item-Based Recommender System <a class="anchor" id="final"></a>

In [None]:
# Split the dataset into train and test sets
trainset, testset = train_test_split(data, test_size=0.4)

# Fit the algorithm on the training dataset
my_svd.fit(trainset)

# Generate predictions on the test dataset
predictions = my_svd.test(testset)

# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.7530979114555563
Mean Squared Error (MSE): 3.073352287149833
Mean Absolute Error (MAE): 1.3899871313953476
Fraction of Concordant Pairs (FCP): 0.6620823794661699
