## Matrix Factorization Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)
3. [FunkSVD Item-Based Recommender System with Scikit Surprise](#funksvd)
4. [Hyperparameter Tuning FunkSVD Item-Based Recommender System](#hyper)
5. [Making Predictions for User 29701](#4056)
6. [Final Item-Based Recommender System](#final)

### Introduction <a class="anchor" id="intro"></a>

During the Initial Modeling stage, we create the first version of the restaurant recommendation system, which will serve as our starting point for future improvements and enhancements.

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

# Import Reader and Dataset from Surprise library
from surprise.reader import Reader
from surprise import Dataset

# Import Surprise's SVD algorithm under the alias FunkSVD
# (this is the same class that `from surprise import SVD` exposes)
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

# Import train_test_split and GridSearchCV from Surprise library
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

# Import accuracy module from Surprise library
from surprise import accuracy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="model"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique business id
* `rating`: star rating

In [2]:
# Read data from a pickle file into a Pandas DataFrame
model_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/model_data.pkl')

In [3]:
# Display concise information about the 'model_data' DataFrame
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1203530 entries, 40 to 5572793
Data columns (total 5 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   user_id          1203530 non-null  int64  
 1   business_id      1203530 non-null  int64  
 2   rating           1203530 non-null  float64
 3   restaurant_name  1203530 non-null  object 
 4   categories       1203530 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 55.1+ MB


In [4]:
# Display the first few rows of the 'model_data' DataFrame
model_data.head()

|  | user_id | business_id | rating | restaurant_name | categories |
|---|---|---|---|---|---|
| 40 | 53031 | 6620 | 4.0 | Thaitation | [Thai] |
| 41 | 53031 | 4147 | 2.0 | Howling Wolf Taqueria | [Bars, Arts & Entertainment, Nightlife, Music ... |
| 42 | 53031 | 12401 | 3.0 | Santarpio's Pizza | [Pizza, American (Traditional), Italian] |
| 43 | 53031 | 1357 | 2.0 | The Gallows | [Seafood, Bars, American (New), American (Trad... |
| 44 | 53031 | 3498 | 3.0 | Antique Table | [Italian] |


In [5]:
# Count the number of missing values in each column of the 'model_data' DataFrame
model_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [6]:
# Print the size of our model dataset
print(f"The size of our model dataset is {model_data.shape[0]} entries.")

The size of our model dataset is 1203530 entries.


Because the dataset is large (about 1.2 million ratings), we draw a 10% random sample of `model_data`. This keeps the computational cost of the analysis manageable.

In [7]:
# Create a random sample of 10% of 'model_data'
sample_data = model_data.sample(frac=0.1, random_state=42)

# Display the updated DataFrame
display(sample_data)

|  | user_id | business_id | rating | restaurant_name | categories |
|---|---|---|---|---|---|
| 405064 | 78865 | 9705 | 4.0 | Licha's Cantina | [Event Planning & Services, Mexican, Nightlife... |
| 31691 | 3264 | 12278 | 3.0 | Blackmoor Bar and Kitchen | [Asian Fusion, Pubs, American (Traditional), B... |
| 241944 | 29521 | 6143 | 4.0 | Inka Chicken | [Peruvian, Spanish, Latin American, Gluten-Fre... |
| 1379345 | 48464 | 812 | 4.0 | Ichiban | [Sushi Bars, Chinese, Hot Pot, Teppanyaki, Jap... |
| 2067827 | 16442 | 11547 | 5.0 | Shigezo Izakaya | [Tapas/Small Plates, Japanese, Ramen, Sushi Ba... |
| ... | ... | ... | ... | ... | ... |
| 322016 | 2875 | 7522 | 4.0 | Manuel's Tavern | [Bars, Sandwiches, Desserts, Salad, Dive Bars,... |
| 633635 | 74665 | 11752 | 4.0 | Olympia Provisions | [Modern European, American (New), Wine Bars, B... |
| 2252431 | 51374 | 522 | 4.0 | The Buff Restaurant | [Soup, American (Traditional), Breakfast & Bru... |
| 919607 | 20343 | 7126 | 5.0 | Petsi Pies | [Coffee & Tea, Bakeries, Breakfast & Brunch] |


In [8]:
# Extract columns 'user_id', 'restaurant_name', and 'rating' from 'sample_data', then sort the data by 'user_id' in ascending order
sorted_data = sample_data[['user_id', 'restaurant_name', 'rating']].sort_values(by='user_id')

# Display the sorted data
display(sorted_data)

|  | user_id | restaurant_name | rating |
|---|---|---|---|
| 5514017 | 0 | Keke's Breakfast Cafe | 5.0 |
| 5337664 | 2 | Harvard Square | 3.0 |
| 2328043 | 4 | Blue Star Donuts | 5.0 |
| 2328036 | 4 | Panera Bread | 2.0 |
| 2328057 | 4 | Hawksworth Restaurant | 5.0 |
| ... | ... | ... | ... |
| 771245 | 81135 | Lulu's Allston | 2.0 |
| 771248 | 81135 | Futago Udon | 5.0 |
| 771236 | 81135 | Kimchipapi Kitchen | 1.0 |
| 771242 | 81135 | Tasty n Alder | 5.0 |


In [9]:
# Get unique user_id values and map them to new values starting from 0
user_id_mapping = {user_id: new_id for new_id, user_id in enumerate(sorted_data['user_id'].unique())}

# Replace the 'user_id' values in the DataFrame using the mapping
sorted_data['user_id'] = sorted_data['user_id'].map(user_id_mapping)

# Display the updated DataFrame
display(sorted_data)

|  | user_id | restaurant_name | rating |
|---|---|---|---|
| 5514017 | 0 | Keke's Breakfast Cafe | 5.0 |
| 5337664 | 1 | Harvard Square | 3.0 |
| 2328043 | 2 | Blue Star Donuts | 5.0 |
| 2328036 | 2 | Panera Bread | 2.0 |
| 2328057 | 2 | Hawksworth Restaurant | 5.0 |
| ... | ... | ... | ... |
| 771245 | 36688 | Lulu's Allston | 2.0 |
| 771248 | 36688 | Futago Udon | 5.0 |
| 771236 | 36688 | Kimchipapi Kitchen | 1.0 |
| 771242 | 36688 | Tasty n Alder | 5.0 |


The sampled data has been sorted by `user_id` in ascending order and stored in a new DataFrame, `sorted_data`. We then remapped `user_id` to consecutive integers starting at 0, which gives a cleaner dataset to work with.
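The remapping step can be illustrated without pandas. The snippet below is a minimal sketch on a handful of hypothetical raw IDs (the list `raw_user_ids` and the `inverse_mapping` helper are illustrative and not part of the notebook). It assigns each distinct ID the next free integer and keeps the reverse lookup so the original IDs can be recovered later.

```python
# Hypothetical raw user IDs, already sorted as in `sorted_data`
raw_user_ids = [0, 2, 4, 4, 4, 81135]

# Assign each distinct ID the next free index, in order of first appearance
user_id_mapping = {}
for raw_id in raw_user_ids:
    if raw_id not in user_id_mapping:
        user_id_mapping[raw_id] = len(user_id_mapping)

# Apply the mapping to every row
remapped_ids = [user_id_mapping[raw_id] for raw_id in raw_user_ids]

# Keep the inverse mapping so the original IDs can be recovered later
inverse_mapping = {new_id: raw_id for raw_id, new_id in user_id_mapping.items()}

print(remapped_ids)
print(inverse_mapping[3])
```

Keeping the inverse mapping matters if recommendations ever need to be joined back to tables that still use the original `user_id` values.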

In [10]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique reviewers 
print("Number of unique reviewers:", sorted_data['user_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['rating'].min(), "to", sorted_data['rating'].max())

Number of restaurants: 11861
Number of unique reviewers: 36690
Range of ratings: 1.0 to 5.0


In [11]:
# Group by 'user_id' and count the number of non-NaN ratings for each user
user_ratings_count = sorted_data.groupby('user_id')['rating'].count()

# Find the user with the most ratings (index of the maximum count)
user_with_most_ratings = user_ratings_count.idxmax()

# Get the actual count of ratings for the user with the most ratings
most_ratings_count = user_ratings_count.max()

# Print the results
print(f"User with the most ratings: {user_with_most_ratings}")
print(f"Number of ratings for the user: {most_ratings_count}")

User with the most ratings: 29701
Number of ratings for the user: 105


### FunkSVD Item-Based Recommender System with Scikit Surprise <a class="anchor" id="funksvd"></a>

In [12]:
# Load the DataFrame into a scikit-surprise Dataset
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(sorted_data[['user_id', 'restaurant_name', 'rating']], reader)

In [13]:
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.4)

In [14]:
# Create the FunkSVD model
model = FunkSVD(n_factors=100, n_epochs=20, lr_all=0.05, biased=False, verbose=0)

# Train the model on the training set
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x17b926790>
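To make the fitted model less of a black box, here is a minimal pure-Python sketch of the idea behind an unbiased FunkSVD-style factorization: each rating is approximated by the dot product of a user vector and an item vector, and both vectors are nudged by stochastic gradient descent. This is an illustration under stated assumptions, not Surprise's implementation; the function names, the toy ratings, and the chosen learning rate and regularization are all hypothetical.

```python
import math
import random


def train_unbiased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit user and item factors to (user, item, rating) triples with plain SGD."""
    rng = random.Random(seed)
    # Sort the IDs so the random initialisation is reproducible across runs
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Start from small random factors centred on zero
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            # Error of the current dot-product prediction for this rating
            err = r - sum(pf * qf for pf, qf in zip(p[u], q[i]))
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                # Gradient step with L2 regularisation on both factor vectors
                p[u][f] = pf + lr * (err * qf - reg * pf)
                q[i][f] = qf + lr * (err * pf - reg * qf)
    return p, q


def rmse(ratings, p, q):
    """Root mean squared error of the dot-product predictions on `ratings`."""
    squared = [(r - sum(a * b for a, b in zip(p[u], q[i]))) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy ratings: (user, restaurant, stars)
toy_ratings = [
    ("u0", "A", 5.0), ("u0", "B", 3.0),
    ("u1", "A", 4.0), ("u1", "C", 1.0),
    ("u2", "B", 2.0), ("u2", "C", 5.0),
]

untrained = train_unbiased_mf(toy_ratings, n_epochs=0)
trained = train_unbiased_mf(toy_ratings, n_epochs=200)
rmse_before = rmse(toy_ratings, *untrained)
rmse_after = rmse(toy_ratings, *trained)
print(rmse_before, rmse_after)
```

Because the factors start near zero and there are no bias terms, early predictions sit close to 0 and only grow as training proceeds. With few ratings per user and few epochs, raw predictions can stay below the 1 to 5 scale, which is one plausible reason a model like this might emit many predictions at the lower bound.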

In [15]:
# Make predictions on the test set
predictions = model.test(testset)

In [16]:
# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 2.6275937671639618
Mean Squared Error (MSE): 6.904249005238899
Mean Absolute Error (MAE): 2.304389844740413
Fraction of Concordant Pairs (FCP): 0.4794000820640771
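For reference, the three error metrics above are simple functions of the per-rating errors. The following sketch computes them by hand on a few hypothetical (actual, predicted) pairs; it does not call Surprise, and FCP is left out because it is a per-user ranking measure rather than an error average.

```python
import math

# Hypothetical (actual, predicted) rating pairs
pairs = [(3.0, 1.0), (4.0, 3.5), (5.0, 5.0), (2.0, 3.0)]

# Signed error of each prediction
errors = [predicted - actual for actual, predicted in pairs]

# Mean squared error: average of the squared errors
mse = sum(e * e for e in errors) / len(errors)

# Root mean squared error: square root of the MSE, in rating units
rmse = math.sqrt(mse)

# Mean absolute error: average size of the errors, ignoring sign
mae = sum(abs(e) for e in errors) / len(errors)

print(mse, rmse, mae)
```

RMSE is always the square root of MSE, so the two reported values carry the same information; RMSE weights large misses more heavily than MAE does.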


### Hyperparameter Tuning FunkSVD Item-Based Recommender System <a class="anchor" id="hyper"></a>

In [17]:
# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] }

# Set up GridSearchCV with 3-fold cross-validation
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(data)

In [18]:
# Print the best mean FCP score found during the grid search
print('Best FCP:', GS.best_score['fcp'])

# Print the best parameters found during the grid search
print('Best parameters:', GS.best_params['fcp'])

Best FCP: 0.5481983376679745
Best parameters: {'n_factors': 100, 'n_epochs': 10, 'lr_all': 0.1, 'biased': False}
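It can help to see how much work this grid implies. The sketch below expands the same `param_grid` into every combination using only the standard library, assuming an exhaustive search that fits each combination once per fold; the variable names `combinations` and `total_fits` are illustrative.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_factors': [100, 150],
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False],
}
cv = 3  # number of cross-validation folds

# Cartesian product of all value lists, one dict per combination
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per fold
total_fits = len(combinations) * cv

print(len(combinations), total_fits)
```

With only two values per tuned parameter, the search compares 8 candidates. A finer grid around the winning values would be a natural follow-up, at the cost of more fits.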


In [20]:
# Split train test set
trainset, testset = train_test_split(data, test_size=0.40)

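# Set the algorithm
# Note: lr_all=0.005 is used here, while the grid search above reported lr_all=0.1 as the best value
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=10, 
                 lr_all=0.005,
                 biased=False,
                 verbose=0)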

# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [21]:
# Put 'my_pred' results in a DataFrame
df_prediction_rated = pd.DataFrame(my_pred, columns=['user_id',
                                               'restaurant_name',
                                               'actual',
                                               'prediction',
                                               'details'])

# Calculate the difference of actual and prediction into the 'diff' column
df_prediction_rated['diff'] = abs(df_prediction_rated['prediction'] - df_prediction_rated['actual'])

In [22]:
# Check the first rows of 'df_prediction_rated'
df_prediction_rated.head()

|  | user_id | restaurant_name | actual | prediction | details | diff |
|---|---|---|---|---|---|---|
| 0 | 35261 | Douzo | 3.0 | 1.0 | {'was_impossible': False} | 2.0 |
| 1 | 9820 | Island Creek Oyster Bar | 4.0 | 3.849289 | {'was_impossible': True, 'reason': 'User and i... | 0.150711 |
| 2 | 21062 | Pinocchios Pizza & Subs | 4.0 | 1.0 | {'was_impossible': False} | 3.0 |
| 3 | 34506 | Cafe Sushi | 3.0 | 1.0 | {'was_impossible': False} | 2.0 |
| 4 | 21314 | Pacific Rim Bistro | 4.0 | 3.849289 | {'was_impossible': True, 'reason': 'User and i... | 0.150711 |


In [23]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (df_prediction_rated['diff'] == 0).mean())

Proportion of correct predictions: 0.025902538324124466


In [24]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (df_prediction_rated["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.20294129865813634


In [25]:
# Build full trainset
full_trainset = data.build_full_trainset()

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2800ac910>

In [26]:
# Define the batch size
batch_size = 5

# Build the anti-testset once (every user-item pair without a known rating)
anti_testset = full_trainset.build_anti_testset(fill=-1)

# Collect the predictions from every batch
batch_predictions = []

# Process the anti-testset in batches
for start_idx in range(0, len(anti_testset), batch_size):
    batch_anti_testset = anti_testset[start_idx:start_idx + batch_size]

    # Use the tuned model (my_svd) to predict ratings for the batch
    batch_predictions.extend(my_svd.test(batch_anti_testset))
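With roughly 36,690 users and 11,861 restaurant names, the full set of unrated pairs runs into the hundreds of millions, so holding it in one list is the real bottleneck. One way around that, sketched below with hypothetical toy data and a hypothetical `iter_batches` helper, is to generate the unrated pairs lazily and pull them off in fixed-size chunks. The sketch uses only the standard library and does not call Surprise.

```python
from itertools import islice, product


def iter_batches(iterable, batch_size):
    """Yield successive lists of up to `batch_size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical toy data: three users, four restaurants, two known ratings
users = ['u0', 'u1', 'u2']
items = ['a', 'b', 'c', 'd']
rated = {('u0', 'a'), ('u1', 'c')}

# Lazily generate every (user, item) pair that has no rating yet
unrated_pairs = ((u, i) for u, i in product(users, items) if (u, i) not in rated)

# Only one batch is held in memory at a time while iterating;
# the list() call here is just to inspect the result of the toy example
batches = list(iter_batches(unrated_pairs, batch_size=4))

print([len(batch) for batch in batches])
```

In practice each batch would be scored and reduced immediately, for example by keeping only the top few predictions per user, rather than being stored in full.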

In [None]:
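# Build the anti-testset of unrated user-item pairs
# (assumption: this is what the otherwise undefined `full_testset` refers to)
full_testset = full_trainset.build_anti_testset(fill=-1)

# Set the prediction
my_prediction = my_svd.test(full_testset)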

In [None]:
# Put the test-set predictions stored in 'my_pred' into a DataFrame
df_prediction_unrated = pd.DataFrame(my_pred, columns=['user_id',
                                                     'restaurant_name',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [None]:
df_prediction_unrated.head()

|  | user_id | restaurant_name | actual | prediction | details |
|---|---|---|---|---|---|
| 0 | 2466 | Western Lake Chinese Seafood Restaurant | 3.0 | 3.582694 | {'was_impossible': False} |
| 1 | 4821 | Go Fish Ocean Emporium | 4.0 | 3.436397 | {'was_impossible': False} |
| 2 | 2544 | Hawksworth Restaurant | 4.0 | 4.023631 | {'was_impossible': False} |
| 3 | 7631 | Sushi Itoga | 5.0 | 3.83274 | {'was_impossible': True, 'reason': 'User and i... |
| 4 | 1623 | East is East | 3.0 | 2.300915 | {'was_impossible': False} |


### Making Predictions for User 29701 <a class="anchor" id="4056"></a>

In [None]:
# Get the predictions for user 29701 (the user with the most ratings), sorted from highest to lowest
predict_29701 = df_prediction_unrated[df_prediction_unrated['user_id'] == 29701].sort_values(by=['prediction'], ascending=False)

predict_29701

|  | user_id | restaurant_name | actual | prediction | details |
|---|---|---|---|---|---|
| 20122 | 4056 | Parallel 49 Brewing | 5.0 | 5.000000 | {'was_impossible': False} |
| 22236 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} |
| 8433 | 4056 | Kissa Tanto | 5.0 | 5.000000 | {'was_impossible': False} |
| 21649 | 4056 | Tuc Craft Kitchen | 3.0 | 5.000000 | {'was_impossible': False} |
| 23024 | 4056 | Trafiq Cafe & Bakery | 5.0 | 5.000000 | {'was_impossible': False} |
| ... | ... | ... | ... | ... | ... |
| 24161 | 4056 | Sushi Coen | 3.0 | 1.847063 | {'was_impossible': False} |
| 4837 | 4056 | CaliBurger Vancouver | 5.0 | 1.843023 | {'was_impossible': False} |
| 12448 | 4056 | Showcase Restaurant & Bar | 5.0 | 1.625779 | {'was_impossible': False} |
| 23538 | 4056 | Just Waffles | 5.0 | 1.000000 | {'was_impossible': False} |


In [None]:
original_29701 = sorted_data[sorted_data['user_id'] == 29701]

original_29701

|  | user_id | restaurant_name | rating |
|---|---|---|---|
| 144010 | 4056 | Sushi Mura | 5.0 |
| 144009 | 4056 | Purebread | 5.0 |
| 144014 | 4056 | Canra Srilankan Cuisine | 5.0 |
| 144023 | 4056 | Sushi Hub | 4.0 |
| 143948 | 4056 | Sal y Limón | 5.0 |
| ... | ... | ... | ... |
| 141845 | 4056 | Pizzeria Farina | 4.0 |
| 141844 | 4056 | Joe Fortes Seafood & Chop House | 5.0 |
| 141746 | 4056 | Wang's Taiwan Beef Noodle House | 4.0 |
| 141736 | 4056 | Showcase Restaurant & Bar | 5.0 |


In [None]:
# Merge on 'user_id' and 'restaurant_name'
merged_29701 = predict_29701.merge(original_29701, how='left', on=['user_id', 'restaurant_name'])

# Calculate the absolute difference between 'prediction' and 'actual'
merged_29701['diff'] = abs(merged_29701['prediction'] - merged_29701['actual'])

# Drop the 'rating' column
merged_29701.drop(columns=['rating'], inplace=True)

# Display the updated DataFrame
merged_29701

|  | user_id | restaurant_name | actual | prediction | details | diff |
|---|---|---|---|---|---|---|
| 0 | 4056 | Parallel 49 Brewing | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 1 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 2 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 3 | 4056 | French Made Baking | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| 4 | 4056 | Kissa Tanto | 5.0 | 5.000000 | {'was_impossible': False} | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 267 | 4056 | Sushi Coen | 3.0 | 1.847063 | {'was_impossible': False} | 1.152937 |
| 268 | 4056 | CaliBurger Vancouver | 5.0 | 1.843023 | {'was_impossible': False} | 3.156977 |
| 269 | 4056 | Showcase Restaurant & Bar | 5.0 | 1.625779 | {'was_impossible': False} | 3.374221 |
| 270 | 4056 | Just Waffles | 5.0 | 1.000000 | {'was_impossible': False} | 4.000000 |


In [None]:
# Calculate the proportion where the predicted rating matches exactly with the actual rating
print("Proportion of correct predictions:", (merged_29701['diff'] == 0).mean())

Proportion of correct predictions: 0.025735294117647058


In [None]:
# Calculate the proportion of correct predictions within a margin of 1 
print("Proportion of correct predictions within margin 1:", (merged_29701["diff"] <= 1).mean())

Proportion of correct predictions within margin 1: 0.5477941176470589


### Final Item-Based Recommender System <a class="anchor" id="final"></a>

In [None]:
# Split the dataset into train and test sets
trainset, testset = train_test_split(data, test_size=0.4)

# Fit the algorithm on the training dataset
my_svd.fit(trainset)

# Generate predictions on the test dataset
predictions = my_svd.test(testset)

# Calculate and print RMSE
print("Root Mean Squared Error (RMSE):", accuracy.rmse(predictions, verbose=False))

# Calculate and print MSE
print("Mean Squared Error (MSE):", accuracy.mse(predictions, verbose=False))

# Calculate and print MAE
print("Mean Absolute Error (MAE):", accuracy.mae(predictions, verbose=False))

# Calculate and print FCP
print("Fraction of Concordant Pairs (FCP):", accuracy.fcp(predictions, verbose=False))

Root Mean Squared Error (RMSE): 1.7530979114555563
Mean Squared Error (MSE): 3.073352287149833
Mean Absolute Error (MAE): 1.3899871313953476
Fraction of Concordant Pairs (FCP): 0.6620823794661699
