## Modeling 

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)
    * Data Dictionary
3. [Collaborative-Filtering Recommendation System without SVD](#nosvd)
4. [Collaborative-Filtering Recommendation System with SVD](#svd)
5. [Collaborative-Filtering Recommendation System with FunkSVD](#funksvd)

### Introduction <a class="anchor" id="intro"></a>

During the Initial Modeling stage, we create the first version of the restaurant recommendation system, which will serve as our starting point for future improvements and enhancements.

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# Import cosine_similarity function from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

# Import SVD algorithm from Surprise library
from surprise import SVD

# Import Reader and Dataset from Surprise library
from surprise.reader import Reader
from surprise import Dataset

# Import FunkSVD algorithm from Surprise library
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

# Import train_test_split and GridSearchCV from Surprise library
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

# Import accuracy module from Surprise library
from surprise import accuracy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="model"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique user id
* `rating`: star rating

In [2]:
# Read data from a pickle file into a Pandas DataFrame
vancouver_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/vancouver_data.pkl')


In [3]:
# Display concise information about the 'vancouver_data' DataFrame
vancouver_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64660 entries, 1101 to 5561981
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          64660 non-null  int64  
 1   business_id      64660 non-null  int64  
 2   rating           64660 non-null  float64
 3   restaurant_name  64660 non-null  object 
 4   categories       64660 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 3.0+ MB


In [4]:
# Display the first few rows of the 'vancouver_data' DataFrame
vancouver_data.head()

Unnamed: 0,user_id,business_id,rating,restaurant_name,categories
1101,70315,1407,4.0,Meat & Bread,"[Fast Food, Bakeries, Sandwiches, Salad, Soup,..."
1105,70315,1356,3.0,Edible Canada At the Market,"[Seafood, Canadian (New), American (New), Spec..."
1109,70315,7370,4.0,The Lamplighter Public House,"[Nightlife, Gastropubs, Bars, Pubs]"
1144,70315,1143,5.0,Miku,"[Japanese, Sushi Bars]"
1151,70315,13469,4.0,Lupo,[Italian]


In [5]:
# Count the number of missing values in each column of the 'vancouver_data' DataFrame
vancouver_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [6]:
# Print the size of our model dataset
print(f"The size of our model dataset is {vancouver_data.shape[0]} entries.")

The size of our model dataset is 64660 entries.


In [7]:
# Extract columns 'user_id', 'restaurant_name', and 'rating' from 'vancouver_data',
# then sort the data by 'user_id' in ascending order
sorted_data = vancouver_data[['user_id', 'restaurant_name', 'rating']].sort_values(by='user_id')

# Display the sorted data
display(sorted_data)

Unnamed: 0,user_id,restaurant_name,rating
2328038,4,Breakfast Table,2.0
2328033,4,Yolks,5.0
2328050,4,Fable,3.0
2328052,4,Minami,5.0
2328053,4,The Flying Pig - Gastown,4.0
...,...,...,...
1342235,81124,The Sandbar Seafood Restaurant,5.0
1342237,81124,The Flying Pig - Yaletown,1.0
1342238,81124,Black Rice Izakaya,2.0
1999029,81139,Marutama Ramen,5.0


In [8]:
# Get unique user_id values and map them to new values starting from 0
user_id_mapping = {user_id: new_id for new_id, user_id in enumerate(sorted_data['user_id'].unique())}

# Replace the 'user_id' values in the DataFrame using the mapping
sorted_data['user_id'] = sorted_data['user_id'].map(user_id_mapping)

# Display the updated DataFrame
display(sorted_data)

Unnamed: 0,user_id,restaurant_name,rating
2328038,0,Breakfast Table,2.0
2328033,0,Yolks,5.0
2328050,0,Fable,3.0
2328052,0,Minami,5.0
2328053,0,The Flying Pig - Gastown,4.0
...,...,...,...
1342235,8976,The Sandbar Seafood Restaurant,5.0
1342237,8976,The Flying Pig - Yaletown,1.0
1342238,8976,Black Rice Izakaya,2.0
1999029,8977,Marutama Ramen,5.0


In [9]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique reviewers 
print("Number of unique reviewers:", sorted_data['user_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['rating'].min(), "to", sorted_data['rating'].max())

Number of restaurants: 766
Number of unique reviewers: 8978
Range of ratings: 1.0 to 5.0


### Building the Matrixes for the Model

In [10]:
def create_user_item_matrix(data):
    # Create the User-Item Matrix
    user_item_matrix = data.pivot_table(index='user_id', columns='restaurant_name', values='rating')

    return user_item_matrix

In [11]:
# Create the user-item matrix using the create_user_item_matrix function
user_item_matrix = create_user_item_matrix(sorted_data)

# Display the user-item matrix
display(user_item_matrix)

restaurant_name,3G Vegetarian Restaurant,49th Parallel Coffee,6 Degrees Eatery,A La Mode,ARC Restaurant,Abigail's Party,Absinthe Bistro,Acacia Fillo Bar,Acme Cafe,Adesso Bistro,...,Yaletown Brewing Company,Yamato Sushi Restaurant,Yolk's Breakfast,Yolks,Yui Japanese Bistro,Zabu Chicken,Zakkushi Dining On Main,Zakkushi on Denman,Zefferelli's,Zeitoon Restaurant
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,5.0,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,5.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,4.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8973,,,,,,,,,,,...,,,,3.0,,,,,,
8974,,,,,,,,,,,...,,,,,,,,,,
8975,,,,,,,,,,,...,,,,,,,,,,
8976,,,,,,,,,,,...,,,,,,,,,,


### Collaborative-Filtering Recommendation System with FunkSVD <a class="anchor" id="funksvd"></a>

FunkSVD is a specific variant of SVD designed for collaborative filtering tasks in recommendation systems. It addresses the sparsity issue present in user-item interaction matrices by incorporating stochastic gradient descent to handle missing values efficiently. FunkSVD performs matrix factorization and decomposes the user-item interaction matrix into user and item latent feature matrices.

In [12]:
# Load your user-item interaction data into Surprise Dataset
reader = Reader(rating_scale=(1, 5))

sur_data = Dataset.load_from_df(sorted_data[['user_id', 'restaurant_name', 'rating']], reader)

In [13]:
# Split the data into training and testing sets
trainset, testset = train_test_split(sur_data, test_size=0.2, random_state=42)

In [14]:
# Build and train the FunkSVD-based collaborative filtering model
model = FunkSVD(n_factors=50, biased=True, random_state=42)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x16a5ea250>

In [15]:
# Make predictions on the test set
predictions = model.test(testset)

In [16]:
# Evaluate the model's performance and calculate Root Mean Squared Error (RMSE)
rmse = accuracy.rmse(predictions)
print("Root Mean Squared Error (RMSE):", rmse)

# Evaluate the model's performance and calculate Mean Absolute Error (MAE)
mae = accuracy.mae(predictions)
print("Mean Absolute Error (MAE):", mae)

RMSE: 0.9192
Root Mean Squared Error (RMSE): 0.9191623580796869
MAE:  0.7176
Mean Absolute Error (MAE): 0.7176286174020334


In [17]:
# Example Usage: Recommend restaurants similar to a specific restaurant
restaurant_name = "Miku" 

In [18]:
# Get the user-item matrix used for factorization
trainset_full = sur_data.build_full_trainset()
user_item_matrix = trainset_full.ur

In [19]:
# Find the index of the input restaurant name in the pivot table
restaurant_index = trainset_full.to_inner_iid(restaurant_name)

In [20]:
# Get the latent factors for the input restaurant
restaurant_factors = model.qi[restaurant_index]

In [21]:
# Calculate similarity scores with other restaurants based on latent factors
similarity_scores = np.dot(model.qi, restaurant_factors)

In [22]:
# Sort the restaurants based on similarity scores in descending order
similar_restaurant_indices = np.argsort(similarity_scores)[::-1]

In [23]:
# Get top N recommended restaurants (excluding the input restaurant itself)
top_n = 5
recommended_restaurants = []
for index in similar_restaurant_indices:
    name = trainset_full.to_raw_iid(index)
    if name != restaurant_name:
        recommended_restaurants.append(name)
        if len(recommended_restaurants) == top_n:
            break

print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants))

Recommended Restaurants for Miku: ['Memphis Blues Barbeque House', 'JOEY Bentall One', 'Adesso Bistro', 'So Hyang Korean Cuisine', 'Tom Sushi']


### Model Evaluation of Recommendation System with FunkSVD  <a class="anchor" id="funksvd"></a>

FunkSVD is a specific variant of SVD designed for collaborative filtering tasks in recommendation systems. It addresses the sparsity issue present in user-item interaction matrices by incorporating stochastic gradient descent to handle missing values efficiently. FunkSVD performs matrix factorization and decomposes the user-item interaction matrix into user and item latent feature matrices.

#### Hyperparameter Optimization through GridSearchCV



In [24]:
# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] } #The parameter indicates to the algorithm that all latent information must be stored. 

# Set GridSearchCV with 3 cross validation
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(sur_data)

In [25]:
# Get the best FCP accuracy score from GridSearchCV results
gsbs = GS.best_score['fcp']

# Get the best parameters for FCP from GridSearchCV results
gsbp = GS.best_params['fcp']

# Check the FCP accuracy score (1.0 is ideal and 0 is worst)
print("FCP accuracy score:", gsbs)

# Check the best parameters for FCP
print("Best parameters for FCP:", gsbp)

FCP accuracy score: 0.6469286492352521
Best parameters for FCP: {'n_factors': 150, 'n_epochs': 10, 'lr_all': 0.005, 'biased': False}


#### Building Recommendation System with FunkSVD

In [26]:
# Split train test set
trainset, testset = train_test_split(sur_data, test_size=0.25)

# Set the algorithm
my_svd = FunkSVD(n_factors=150, 
                 n_epochs=10, 
                 lr_all=0.005, 
                 biased=False,
                 verbose=0)

# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [27]:
# Put 'my_pred' results in a DataFrame
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                               'restaurant_name',
                                               'actual',
                                               'prediction',
                                               'details'])

# Calculate the difference of actual and prediction into the 'diff' column
df_prediction['diff'] = abs(df_prediction['prediction'] - df_prediction['actual'])


#### Evaluating Predictions

In [28]:
# Check the df_prediction
df_prediction.head()

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
0,3138,Splitz Grill,4.0,2.57775,{'was_impossible': False},1.42225
1,4842,AnnaLena,5.0,4.217511,{'was_impossible': False},0.782489
2,4572,Nordstrom,5.0,2.9964,{'was_impossible': False},2.0036
3,8787,Caffè Artigiano,4.0,3.119032,{'was_impossible': False},0.880968
4,2150,Raincity Grill,4.0,1.567071,{'was_impossible': False},2.432929


In [29]:
# See the best 10 predictions
df_prediction.sort_values(by='diff')[:10]

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
1326,3930,Gyu-Kaku Japanese BBQ,5.0,5.0,{'was_impossible': False},0.0
737,762,Shiro,5.0,5.0,{'was_impossible': False},0.0
10007,7826,Chatime,1.0,1.0,{'was_impossible': False},0.0
390,4854,Marutama Ramen,5.0,5.0,{'was_impossible': False},0.0
5165,955,Sushi Jin,1.0,1.0,{'was_impossible': False},0.0
12139,3386,The Diamond,1.0,1.0,{'was_impossible': False},0.0
14392,3930,L'Abattoir,5.0,5.0,{'was_impossible': False},0.0
2472,3963,Kintaro Ramen,5.0,5.0,{'was_impossible': False},0.0
12097,4545,Timbertrain Coffee Roasters,5.0,5.0,{'was_impossible': False},0.0
12094,1769,Joe Fortes Seafood & Chop House,5.0,5.0,{'was_impossible': False},0.0


In [30]:
# See the worst 10 predictions
df_prediction.sort_values(by='diff')[-10:]

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
523,4005,Society Dining Lounge,5.0,1.0,{'was_impossible': False},4.0
10089,3027,Go Fish Ocean Emporium,5.0,1.0,{'was_impossible': False},4.0
12779,3229,Japadog,5.0,1.0,{'was_impossible': False},4.0
4865,2038,Hurricane Grill,5.0,1.0,{'was_impossible': False},4.0
10087,8834,Gyu-Kaku Japanese BBQ,5.0,1.0,{'was_impossible': False},4.0
12776,5754,Kintaro Ramen,5.0,1.0,{'was_impossible': False},4.0
7352,851,Heirloom Vegetarian,5.0,1.0,{'was_impossible': False},4.0
12759,4131,Heirloom Vegetarian,5.0,1.0,{'was_impossible': False},4.0
628,884,A La Mode,5.0,1.0,{'was_impossible': False},4.0
11101,4997,Cactus Club Cafe,5.0,1.0,{'was_impossible': False},4.0


In [31]:
# Check total rows with same actual and prediction ratings
df_prediction[df_prediction['diff'] <= 0]

Unnamed: 0,user_id,restaurant_name,actual,prediction,details,diff
32,8427,Joe Fortes Seafood & Chop House,5.0,5.0,{'was_impossible': False},0.0
165,2376,Blue Water Cafe,5.0,5.0,{'was_impossible': False},0.0
295,6673,Twisted Fork,5.0,5.0,{'was_impossible': False},0.0
390,4854,Marutama Ramen,5.0,5.0,{'was_impossible': False},0.0
442,1445,Fritz European Fry House,5.0,5.0,{'was_impossible': False},0.0
...,...,...,...,...,...,...
15390,2046,Lupo,1.0,1.0,{'was_impossible': False},0.0
15769,4003,Finch's Tea & Coffee House,5.0,5.0,{'was_impossible': False},0.0
15829,3619,Toshi Sushi,5.0,5.0,{'was_impossible': False},0.0
15982,994,Showcase Restaurant & Bar,1.0,1.0,{'was_impossible': False},0.0


In [32]:
(df_prediction['diff'] == 0).mean()

0.009217445097432725

In [33]:
(df_prediction["diff"] <= 1).mean()

0.5233529229817507

In [34]:
# Build full trainset
full_trainset = sur_data.build_full_trainset()

# Build the SVD algorithm
my_svd = FunkSVD(n_factors=150, 
                 n_epochs=10, 
                 lr_all=0.005, 
                 biased=False,
                 verbose=0)

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x17771e050>

In [35]:
# Define the full test set
full_testset = full_trainset.build_anti_testset(fill=-1)

In [36]:
# Set the prediction
my_prediction = my_svd.test(full_testset)

In [37]:
# Put 'my_pred' results in a DataFrame
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                               'restaurant_name',
                                               'actual',
                                               'prediction',
                                               'details'])

In [38]:
# Check user id `1497` predictions
df = df_prediction[df_prediction['user_id'] == 7857]\
    .sort_values(by=['prediction'], ascending=False)\
    .head()

display(df)

Unnamed: 0,user_id,restaurant_name,actual,prediction,details
7630,7857,Miku,5.0,5.0,{'was_impossible': False}
15580,7857,Au Petit Café,5.0,4.106117,{'was_impossible': False}
61,7857,Minami,5.0,4.015528,{'was_impossible': False}
713,7857,Black & Blue,4.0,3.907933,{'was_impossible': False}
11438,7857,Black & Blue,2.0,3.907933,{'was_impossible': False}


#### Recommendation System Model Evaluation 

In [39]:
# The surprise package doesn't allow you to test on the trainset we built
my_train_dataset, my_test_dataset = train_test_split(sur_data, test_size=0.5)

predictions = my_svd.test(my_test_dataset)

In [40]:
# Calculate and print Root Mean Squared Error (RMSE)
RMSE = accuracy.rmse(predictions, verbose=False)
print("Root Mean Squared Error (RMSE):", RMSE)

# Calculate and print Mean Squared Error (MSE)
MSE = accuracy.mse(predictions, verbose=False)
print("Mean Squared Error (MSE):", MSE)

Root Mean Squared Error (RMSE): 0.9958286277836513
Mean Squared Error (MSE): 0.9916746559134699


In [41]:
# Calculate Fraction of Concordant Pairs (FCP)
FCP = accuracy.fcp(predictions, verbose=False)

# Print the Fraction of Concordant Pairs (FCP) score
print("Fraction of Concordant Pairs (FCP):", FCP)

Fraction of Concordant Pairs (FCP): 0.739445319325952
