## Modeling 

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)
    * Data Dictionary
3. [Collaborative-Filtering Recommendation System without SVD](#nosvd)
4. [Collaborative-Filtering Recommendation System with SVD](#svd)
5. [Collaborative-Filtering Recommendation System with FunkSVD](#funksvd)

### Introduction <a class="anchor" id="intro"></a>

During the Initial Modeling stage, we create the first version of the restaurant recommendation system, which will serve as our starting point for future improvements and enhancements.

#### Importing Python Libraries 

Importing necessary libraries for the EDA process.

In [1]:
import numpy as np 
import pandas as pd 

from sklearn.metrics.pairwise import cosine_similarity

from surprise import SVD
from surprise.reader import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import train_test_split
from surprise import accuracy

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="model"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique user id
* `rating`: star rating

In [2]:
vancouver_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/vancouver_data.pkl')

In [3]:
vancouver_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64660 entries, 1101 to 5561981
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          64660 non-null  int64  
 1   business_id      64660 non-null  int64  
 2   rating           64660 non-null  float64
 3   restaurant_name  64660 non-null  object 
 4   categories       64660 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 3.0+ MB


In [4]:
vancouver_data.head()

Unnamed: 0,user_id,business_id,rating,restaurant_name,categories
1101,70315,1407,4.0,Meat & Bread,"[Fast Food, Bakeries, Sandwiches, Salad, Soup,..."
1105,70315,1356,3.0,Edible Canada At the Market,"[Seafood, Canadian (New), American (New), Spec..."
1109,70315,7370,4.0,The Lamplighter Public House,"[Nightlife, Gastropubs, Bars, Pubs]"
1144,70315,1143,5.0,Miku,"[Japanese, Sushi Bars]"
1151,70315,13469,4.0,Lupo,[Italian]


In [5]:
vancouver_data.sample(10)

Unnamed: 0,user_id,business_id,rating,restaurant_name,categories
146274,70193,12407,5.0,Sura Korean Cuisine,[Korean]
172039,25824,4932,4.0,Thierry,"[Desserts, Chocolatiers & Shops, Food Delivery..."
148910,11201,1065,1.0,Viet Sub,"[Vietnamese, Sandwiches]"
2390296,75071,8292,4.0,Chambar,"[Cafes, Middle Eastern, Nightlife, Breakfast &..."
146942,68883,8558,5.0,Nuba Yaletown,"[Mediterranean, Vegetarian, Middle Eastern]"
149541,23276,592,4.0,Ciao Bella,[Italian]
169763,70403,8451,3.0,The Ouisi Bistro,"[Cajun/Creole, Breakfast & Brunch]"
467652,74049,9609,4.0,Kadoya Japanese Restaurant,"[Japanese, Sushi Bars]"
2331827,27935,1700,5.0,La Taqueria Pinche Taco Shop,"[Caterers, Mexican, Event Planning & Services]"
2869828,41381,5709,2.0,Romer's,"[Burgers, Seafood, Salad]"


In [6]:
vancouver_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [7]:
print(f"The size of our model dataset is {vancouver_data.shape[0]} entries.")

The size of our model dataset is 64660 entries.


In [8]:
# sampled_model_data = model_data.sample(frac=0.01, random_state=42)

# print(f"The size of our sampled model dataset is {sampled_model_data.shape[0]} entries.")

### Collaborative-Filtering Recommendation System without SVD <a class="anchor" id="nosvd"></a>

Collaborative filtering is a general technique used in recommendation systems to predict user preferences based on the preferences of similar users. It does not involve matrix factorization. Instead, it relies on computing similarities between users or items to generate recommendations. Collaborative filtering without SVD directly operates on the user-item interaction matrix and may use various similarity metrics to find similar users or items. 

In [9]:
# Creating the User-Item Matrix
def create_user_item_matrix(data):
    # Pivot the data to create a matrix where rows are restaurant names and columns are user IDs
    user_item_matrix = data.pivot_table(index='restaurant_name', columns='user_id', values='rating', fill_value=0)
    return user_item_matrix

# Calculate Similarity Scores
def calculate_similarity_scores(user_item_matrix):
    # Calculate the cosine similarity between restaurants based on their user-item matrix
    similarity_scores = cosine_similarity(user_item_matrix)
    return similarity_scores

# Recommend Restaurants
def recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores, num_recommendations=5):
    # Find the index of the input restaurant name in the pivot table
    index = user_item_matrix.index.get_loc(restaurant_name)

    # Retrieve the similarity scores of the input restaurant with other restaurants,
    # sort them in descending order, and select the top 'num_recommendations' similar items
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:num_recommendations + 1]

    # Fetch the relevant restaurant names from the 'user_item_matrix' dataset
    recommended_restaurant_names = [user_item_matrix.index[i[0]] for i in similar_items]

    return recommended_restaurant_names 

In [10]:
# Creating the User-Item Matrix
user_item_matrix = create_user_item_matrix(vancouver_data)

user_item_matrix.head()

user_id,4,7,12,27,28,33,42,45,77,82,...,81069,81081,81084,81094,81098,81101,81102,81111,81124,81139
restaurant_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3G Vegetarian Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49th Parallel Coffee,0,0,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6 Degrees Eatery,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A La Mode,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ARC Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# Calculate Similarity Scores
similarity_scores = calculate_similarity_scores(user_item_matrix)

similarity_scores.shape

(766, 766)

In [12]:
# Get recommendations for a specific restaurant
restaurant_name = "Miku"
recommended_restaurants = recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores)
print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants))

Recommended Restaurants for Miku: ['Hokkaido Ramen Santouka', 'Phnom Penh', 'Kingyo', 'Chambar', 'Jam Cafe on Beatty']


### Collaborative-Filtering Recommendation System with SVD <a class="anchor" id="svd"></a>

Traditional Singular Value Decomposition is a matrix factorization technique that decomposes a given matrix into three matrices: U (user features), Σ (singular values), and V^T (item features). While traditional SVD can be applied to recommendation systems, it assumes a complete user-item interaction matrix without any missing values. This assumption is often not applicable in real-world scenarios where user-item matrices are typically sparse.

In [13]:
# Creating the User-Item Matrix
def create_user_item_matrix(data):
    # Pivot the data to create a matrix where rows are restaurant names and columns are user IDs
    user_item_matrix = data.pivot_table(index='restaurant_name', columns='user_id', values='rating', fill_value=0)
    return user_item_matrix

# Perform Singular Value Decomposition (SVD)
def perform_svd(user_item_matrix, num_latent_features=50):
    # Perform SVD on the user-item matrix
    U, sigma, Vt = np.linalg.svd(user_item_matrix)

    # Reduce the dimensions based on the number of latent features
    U = U[:, :num_latent_features]
    sigma = np.diag(sigma[:num_latent_features])
    Vt = Vt[:num_latent_features, :]

    return U, sigma, Vt

# Calculate Similarity Scores
def calculate_similarity_scores(Vt):
    # Calculate the cosine similarity between restaurants based on the Vt matrix from SVD
    similarity_scores = cosine_similarity(Vt)
    return similarity_scores

# Recommend Restaurants
def recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores, num_recommendations=5):
    # Find the index of the input restaurant name in the pivot table
    index = user_item_matrix.index.get_loc(restaurant_name)

    # Retrieve the similarity scores of the input restaurant with other restaurants,
    # sort them in descending order, and select the top 'num_recommendations' similar items
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:num_recommendations + 1]

    # Fetch the relevant restaurant names from the 'user_item_matrix' dataset
    recommended_restaurant_names = [user_item_matrix.index[i[0]] for i in similar_items]

    return recommended_restaurant_names 

In [14]:
# Creating the User-Item Matrix
user_item_matrix = create_user_item_matrix(vancouver_data)

user_item_matrix.head()

user_id,4,7,12,27,28,33,42,45,77,82,...,81069,81081,81084,81094,81098,81101,81102,81111,81124,81139
restaurant_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3G Vegetarian Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49th Parallel Coffee,0,0,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6 Degrees Eatery,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A La Mode,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ARC Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Perform Singular Value Decomposition (SVD)
U, sigma, Vt = perform_svd(user_item_matrix)

print("Matrix U:")
print(U)

print("\nMatrix Sigma:")
print(sigma)

print("\nMatrix Vt:")
print(Vt)

Matrix U:
[[-0.01212714  0.0012404  -0.02619276 ... -0.00303898 -0.00441591
   0.00940577]
 [-0.08993277 -0.03366205  0.00969205 ...  0.03692048 -0.05690002
   0.0381378 ]
 [-0.0109174  -0.00536868 -0.00604911 ... -0.02327282  0.00542948
  -0.02597468]
 ...
 [-0.02076546 -0.00421513  0.02269508 ... -0.03520262 -0.01196116
  -0.02276765]
 [-0.03860882 -0.00749997  0.0246579  ... -0.01059694 -0.02531328
   0.03079898]
 [-0.01009916 -0.00273176  0.00236111 ... -0.00773699 -0.02498955
   0.01446831]]

Matrix Sigma:
[[283.89355785   0.           0.         ...   0.           0.
    0.        ]
 [  0.         129.99310514   0.         ...   0.           0.
    0.        ]
 [  0.           0.         120.34068013 ...   0.           0.
    0.        ]
 ...
 [  0.           0.           0.         ...  59.89844207   0.
    0.        ]
 [  0.           0.           0.         ...   0.          59.41380546
    0.        ]
 [  0.           0.           0.         ...   0.           0.
   59.087906

In [16]:
# Calculate Similarity Scores
similarity_scores = calculate_similarity_scores(user_item_matrix)

similarity_scores.shape

(766, 766)

In [17]:
# Get recommendations for a specific restaurant
restaurant_name = "Miku"
recommended_restaurants = recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores)
print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants)) 

Recommended Restaurants for Miku: ['Hokkaido Ramen Santouka', 'Phnom Penh', 'Kingyo', 'Chambar', 'Jam Cafe on Beatty']


### Collaborative-Filtering Recommendation System with FunkSVD <a class="anchor" id="funksvd"></a>

FunkSVD is a specific variant of SVD designed for collaborative filtering tasks in recommendation systems. It addresses the sparsity issue present in user-item interaction matrices by incorporating stochastic gradient descent to handle missing values efficiently. FunkSVD performs matrix factorization and decomposes the user-item interaction matrix into user and item latent feature matrices.

In [18]:
# Load your user-item interaction data into Surprise Dataset
reader = Reader(rating_scale=(0, 5))
vancouver_data = Dataset.load_from_df(vancouver_data[['user_id', 'restaurant_name', 'rating']], reader)

In [19]:
# Split the data into training and testing sets
trainset, testset = train_test_split(vancouver_data, test_size=0.2, random_state=42)

In [20]:
# Step 3: Build and train the FunkSVD-based collaborative filtering model
model = FunkSVD(n_factors=50, biased=True, random_state=42)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1479e3f50>

In [21]:
# Step 4: Make predictions on the test set
predictions = model.test(testset)

In [22]:
# Step 5: Evaluate the model's performance
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

RMSE: 0.9218
MAE:  0.7182


In [23]:
# Example Usage: Recommend restaurants similar to a specific restaurant
restaurant_name = "Miku" 

In [24]:
# Get the user-item matrix used for factorization
trainset_full = vancouver_data.build_full_trainset()
user_item_matrix = trainset_full.ur

In [25]:
# Find the index of the input restaurant name in the pivot table
restaurant_index = trainset_full.to_inner_iid(restaurant_name)

In [26]:
# Get the latent factors for the input restaurant
restaurant_factors = model.qi[restaurant_index]

In [27]:
# Calculate similarity scores with other restaurants based on latent factors
similarity_scores = np.dot(model.qi, restaurant_factors)

In [28]:
# Sort the restaurants based on similarity scores in descending order
similar_restaurant_indices = np.argsort(similarity_scores)[::-1]

In [29]:
# Get top N recommended restaurants (excluding the input restaurant itself)
top_n = 5
recommended_restaurants = []
for index in similar_restaurant_indices:
    name = trainset_full.to_raw_iid(index)
    if name != restaurant_name:
        recommended_restaurants.append(name)
        if len(recommended_restaurants) == top_n:
            break

print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants))

Recommended Restaurants for Miku: ['Joe Fortes Seafood & Chop House', 'Campagnolo', 'Sushi Jin', 'Sushi Mura', 'The Charlatan']
