## Modeling 

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/18/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#final)
    * Data Dictionary
3. [Collaborative-Filtering Recommendation System without SVD](#nosvd)
4. [Collaborative-Filtering Recommendation System with SVD](#svd)
5. [Collaborative-Filtering Recommendation System with FunkSVD](#funksvd)

### Introduction <a class="anchor" id="intro"></a>

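This notebook builds restaurant recommenders on the Vancouver Yelp review data using three collaborative-filtering approaches: item-item cosine similarity on the raw rating matrix, similarity after a truncated Singular Value Decomposition (SVD), and FunkSVD as implemented in the Surprise library.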

#### Importing Python Libraries 

Importing the libraries needed for the modeling process.

In [1]:
# Import the basic packages
import numpy as np 
import pandas as pd 

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from scipy.sparse.linalg import svds

# Import the surprise packages
from surprise import SVD
from surprise.reader import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
from surprise import accuracy

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="final"></a>

**Data Dictionary:**
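* `user_id`: unique user id
* `business_id`: unique business id
* `rating`: star rating
* `restaurant_name`: name of the restaurant
* `categories`: list of category labels for the restaurant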

In [2]:
vancouver_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/vancouver_data.pkl')

In [3]:
vancouver_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73022 entries, 186 to 5562411
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          73022 non-null  int64  
 1   business_id      73022 non-null  int64  
 2   rating           73022 non-null  float64
 3   restaurant_name  73022 non-null  object 
 4   categories       73022 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 3.3+ MB


In [4]:
vancouver_data.head()

| | user_id | business_id | rating | restaurant_name | categories |
|---|---|---|---|---|---|
| 186 | 75654 | 6390 | 5.0 | Beaches Restaurant & Bar | [Bars, Pizza, American (New), Nightlife, Seafood] |
| 1101 | 70315 | 1407 | 4.0 | Meat & Bread | [Fast Food, Bakeries, Sandwiches, Salad, Soup,... |
| 1105 | 70315 | 1356 | 3.0 | Edible Canada At the Market | [Seafood, Canadian (New), American (New), Spec... |
| 1109 | 70315 | 7370 | 4.0 | The Lamplighter Public House | [Nightlife, Gastropubs, Bars, Pubs] |
| 1144 | 70315 | 1143 | 5.0 | Miku | [Japanese, Sushi Bars] |


In [5]:
vancouver_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [6]:
print(f"The size of our model dataset is {vancouver_data.shape[0]} entries.")

The size of our model dataset is 73022 entries.


In [7]:
# sampled_model_data = model_data.sample(frac=0.01, random_state=42)

# print(f"The size of our sampled model dataset is {sampled_model_data.shape[0]} entries.")

### Collaborative-Filtering Recommendation System without SVD <a class="anchor" id="nosvd"></a>

Collaborative filtering predicts a user's preferences from the preferences of similar users, or finds items similar to a given item from the users who rated both. The memory-based variant used in this section does not involve matrix factorization. Instead, it operates directly on the user-item interaction matrix and computes similarities between users or items (cosine similarity here) to generate recommendations.
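The idea can be illustrated on a tiny example before applying it to the full dataset. The following is a minimal sketch, not the notebook's implementation: the rating values are made up, and `cosine_similarity_matrix` is a hypothetical helper written with NumPy only. Rows are items and columns are users, matching the orientation of the pivot table used in this notebook.

```python
import numpy as np

# Toy item-by-user matrix (hypothetical data): 0 means "no rating",
# mirroring fill_value=0 in the notebook's pivot table.
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # item A
    [4.0, 5.0, 0.0, 1.0],   # item B
    [0.0, 0.0, 5.0, 4.0],   # item C
])

def cosine_similarity_matrix(matrix):
    """Return the pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit_rows = matrix / norms
    return unit_rows @ unit_rows.T

item_similarity = cosine_similarity_matrix(ratings)

# Rank all items by similarity to item A (index 0), then drop item A itself
ranked = np.argsort(item_similarity[0])[::-1]
most_similar_to_a = [i for i in ranked if i != 0][0]
print(most_similar_to_a)
```

Items A and B were rated highly by the same two users, so B ranks as the closest item to A, while C shares no raters with A and gets a similarity of zero.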

In [8]:
# Creating the User-Item Matrix
def create_user_item_matrix(data):
    # Pivot the data to create a matrix where rows are restaurant names and columns are user IDs
    user_item_matrix = data.pivot_table(index='restaurant_name', columns='user_id', values='rating', fill_value=0)
    return user_item_matrix

# Calculate Similarity Scores
def calculate_similarity_scores(user_item_matrix):
    # Calculate the cosine similarity between restaurants based on their user-item matrix
    similarity_scores = cosine_similarity(user_item_matrix)
    return similarity_scores

# Recommend Restaurants
def recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores, num_recommendations=5):
    # Find the index of the input restaurant name in the pivot table
    index = user_item_matrix.index.get_loc(restaurant_name)

    # Retrieve the similarity scores of the input restaurant with other restaurants,
    # sort them in descending order, and select the top 'num_recommendations' similar items
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:num_recommendations + 1]

    # Fetch the relevant restaurant names from the 'user_item_matrix' dataset
    recommended_restaurant_names = [user_item_matrix.index[i[0]] for i in similar_items]

    return recommended_restaurant_names 

In [9]:
# Creating the User-Item Matrix
user_item_matrix = create_user_item_matrix(vancouver_data)

user_item_matrix.head()

| restaurant_name / user_id | 4 | 7 | 12 | 25 | 27 | 28 | 33 | 42 | 45 | 77 | ... | 81092 | 81094 | 81098 | 81101 | 81102 | 81104 | 81111 | 81124 | 81127 | 81139 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3G Vegetarian Restaurant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49th Parallel Coffee | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 Degrees Eatery | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A La Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ARC Restaurant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [11]:
# Calculate Similarity Scores
similarity_scores = calculate_similarity_scores(user_item_matrix)

similarity_scores.shape

(969, 969)

In [12]:
# Get recommendations for a specific restaurant
restaurant_name = "Miku"
recommended_restaurants = recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores)
print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants))

Recommended Restaurants for Miku: ['Hokkaido Ramen Santouka', 'Phnom Penh', 'Kingyo', 'Chambar', 'Jam Cafe on Beatty']


### Collaborative-Filtering Recommendation System with SVD <a class="anchor" id="svd"></a>

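Traditional Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into three matrices: U (left singular vectors), Σ (singular values), and V^T (right singular vectors). Because the matrix in this section has restaurants as rows and users as columns, the rows of U describe restaurants and the columns of V^T describe users. While traditional SVD can be applied to recommendation systems, it requires a complete matrix without missing values, so unrated entries must be filled in first (here with 0). That is a poor fit for real-world user-item matrices, which are typically sparse: the filled-in zeros are treated as if they were actual ratings.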
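To make the decomposition concrete, here is a minimal sketch on a made-up 4-item by 5-user matrix, with items as rows as in this notebook's pivot table. The data and the choice of `k = 2` latent features are assumptions for illustration only. It keeps the `k` largest singular values, forms item coordinates in latent space as `U_k * s_k`, and compares items there.

```python
import numpy as np

# Hypothetical 4-item x 5-user rating matrix (0 = unrated)
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 5.0, 4.0, 5.0],
    [0.0, 1.0, 4.0, 5.0, 4.0],
])

k = 2  # number of latent features to keep (assumed for illustration)

# Thin SVD: U is (items x r), s holds r singular values, Vt is (r x users)
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep the k largest singular values and their singular vectors
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each row of item_factors is one item's coordinates in latent space
item_factors = U_k * s_k  # equivalent to U_k @ np.diag(s_k)

# Rank-k approximation of the original matrix
approx = item_factors @ Vt_k

# Cosine similarity between items, computed in latent space
unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
latent_similarity = unit @ unit.T
print(latent_similarity.shape)
```

Note that the similarity is computed on the item factors (rows of `U_k * s_k`), not on `Vt_k`, whose rows correspond to latent features rather than items.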

In [18]:
# Creating the User-Item Matrix
def create_user_item_matrix(data):
    # Pivot the data to create a matrix where rows are restaurant names and columns are user IDs
    user_item_matrix = data.pivot_table(index='restaurant_name', columns='user_id', values='rating', fill_value=0)
    return user_item_matrix

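# Perform Singular Value Decomposition (SVD)
def perform_svd(user_item_matrix, num_latent_features=50):
    # Perform a thin SVD on the restaurant-by-user matrix; full_matrices=False avoids
    # building the full square U and Vt, which are not needed after truncation
    U, sigma, Vt = np.linalg.svd(user_item_matrix.to_numpy(), full_matrices=False)

    # Reduce the dimensions based on the number of latent features
    U = U[:, :num_latent_features]                # restaurants x latent features
    sigma = np.diag(sigma[:num_latent_features])  # latent features x latent features
    Vt = Vt[:num_latent_features, :]              # latent features x users

    return U, sigma, Vt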

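# Calculate Similarity Scores
def calculate_similarity_scores(item_factors):
    # Calculate the cosine similarity between restaurants. Each row of `item_factors`
    # must describe one restaurant, e.g. the rows of U @ sigma from SVD. Vt is
    # latent features x users, so its rows are not restaurants.
    similarity_scores = cosine_similarity(item_factors)
    return similarity_scores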

# Recommend Restaurants
def recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores, num_recommendations=5):
    # Find the index of the input restaurant name in the pivot table
    index = user_item_matrix.index.get_loc(restaurant_name)

    # Retrieve the similarity scores of the input restaurant with other restaurants,
    # sort them in descending order, and select the top 'num_recommendations' similar items
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:num_recommendations + 1]

    # Fetch the relevant restaurant names from the 'user_item_matrix' dataset
    recommended_restaurant_names = [user_item_matrix.index[i[0]] for i in similar_items]

    return recommended_restaurant_names 

In [19]:
# Creating the User-Item Matrix
user_item_matrix = create_user_item_matrix(vancouver_data)

user_item_matrix.head()

| restaurant_name / user_id | 4 | 7 | 12 | 25 | 27 | 28 | 33 | 42 | 45 | 77 | ... | 81092 | 81094 | 81098 | 81101 | 81102 | 81104 | 81111 | 81124 | 81127 | 81139 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3G Vegetarian Restaurant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49th Parallel Coffee | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 Degrees Eatery | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A La Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ARC Restaurant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [20]:
# Perform Singular Value Decomposition (SVD)
U, sigma, Vt = perform_svd(user_item_matrix)

print("Matrix U:")
print(U)

print("\nMatrix Sigma:")
print(sigma)

print("\nMatrix Vt:")
print(Vt)

Matrix U:
[[-0.01212647 -0.00122714 -0.02618524 ... -0.02007029  0.00253048
  -0.00391356]
 [-0.08992659  0.03368459  0.00968976 ...  0.22347215 -0.03472124
  -0.05725705]
 [-0.01091681  0.00536591 -0.00603746 ...  0.01565318  0.02379923
   0.00528421]
 ...
 [-0.02076434  0.00420317  0.02269547 ... -0.04966689  0.03356186
  -0.01084917]
 [-0.03860753  0.00747438  0.0246726  ...  0.0271781   0.0117732
  -0.02539337]
 [-0.01009873  0.00273174  0.00236246 ... -0.00278884  0.00797555
  -0.02488204]]

Matrix Sigma:
[[283.90158521   0.           0.         ...   0.           0.
    0.        ]
 [  0.         130.0091853    0.         ...   0.           0.
    0.        ]
 [  0.           0.         120.34570797 ...   0.           0.
    0.        ]
 ...
 [  0.           0.           0.         ...  60.23048218   0.
    0.        ]
 [  0.           0.           0.         ...   0.          59.90551695
    0.        ]
 [  0.           0.           0.         ...   0.           0.
   59.4255506

In [21]:
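# Calculate Similarity Scores
# Note: this call passes the raw user-item matrix, not the SVD factors, so the
# scores and the recommendations that follow are identical to the non-SVD section.
# To compare restaurants in latent space, pass the restaurant factors instead:
#     similarity_scores = calculate_similarity_scores(U @ sigma)
similarity_scores = calculate_similarity_scores(user_item_matrix)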

similarity_scores.shape

(969, 969)

In [22]:
# Get recommendations for a specific restaurant
restaurant_name = "Miku"
recommended_restaurants = recommend_restaurants(restaurant_name, user_item_matrix, similarity_scores)
print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants)) 

Recommended Restaurants for Miku: ['Hokkaido Ramen Santouka', 'Phnom Penh', 'Kingyo', 'Chambar', 'Jam Cafe on Beatty']


### Collaborative-Filtering Recommendation System with FunkSVD <a class="anchor" id="funksvd"></a>

FunkSVD is a matrix factorization variant designed for collaborative filtering. Unlike traditional SVD, it does not require a complete matrix: it learns user and item latent feature matrices by stochastic gradient descent over the observed ratings only, so missing values are skipped rather than filled in. The Surprise `SVD` class used here implements this approach and, with `biased=True`, also learns a bias term per user and per item.
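For reference, the biased model that Surprise documents for its `SVD` class can be written as follows. The symbols are not defined elsewhere in this notebook and are introduced here for illustration: $r_{ui}$ is the observed rating of user $u$ for item $i$, $\mathcal{K}$ the set of observed (user, item) pairs, $\mu$ the global mean rating, $b_u$ and $b_i$ the user and item biases, $p_u$ and $q_i$ the latent factor vectors, $\gamma$ the learning rate, and $\lambda$ the regularization strength.

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

$$
\min_{b,\,p,\,q} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( b_u^2 + b_i^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
$$

With the prediction error $e_{ui} = r_{ui} - \hat{r}_{ui}$, each observed rating triggers one stochastic gradient step:

$$
\begin{aligned}
b_u &\leftarrow b_u + \gamma \left( e_{ui} - \lambda b_u \right) \\
b_i &\leftarrow b_i + \gamma \left( e_{ui} - \lambda b_i \right) \\
p_u &\leftarrow p_u + \gamma \left( e_{ui}\, q_i - \lambda p_u \right) \\
q_i &\leftarrow q_i + \gamma \left( e_{ui}\, p_u - \lambda q_i \right)
\end{aligned}
$$

Because the sum runs only over $\mathcal{K}$, unrated entries never contribute to the loss, which is the key difference from applying SVD to a zero-filled matrix.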

In [23]:
# Step 1: Load the user-item interaction data into a Surprise Dataset
# (kept in its own variable so `vancouver_data` remains a DataFrame for later cells)
reader = Reader(rating_scale=(0, 5))
surprise_data = Dataset.load_from_df(vancouver_data[['user_id', 'restaurant_name', 'rating']], reader)

In [24]:
# Step 2: Split the data into training and testing sets
trainset, testset = train_test_split(surprise_data, test_size=0.2, random_state=42)

In [25]:
# Step 3: Build and train the FunkSVD-based collaborative filtering model
model = FunkSVD(n_factors=50, biased=True, random_state=42)
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14d316af0>

In [26]:
# Step 4: Make predictions on the test set
predictions = model.test(testset)

In [28]:
# Step 5: Evaluate the model's performance
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

RMSE: 0.9286
MAE:  0.7251


In [29]:
# Example Usage: Recommend restaurants similar to a specific restaurant
restaurant_name = "Miku" 

In [32]:
# Build a trainset from all ratings; `.ur` maps each user's inner id
# to a list of (item inner id, rating) pairs
trainset_full = surprise_data.build_full_trainset()
user_item_matrix = trainset_full.ur

In [33]:
# Find the inner id of the input restaurant.
# Caution: `model` was fit on `trainset`, and Surprise assigns inner ids per trainset,
# so an inner id taken from `trainset_full` is not guaranteed to address the same
# restaurant's row in `model.qi`; `model.trainset.to_inner_iid(restaurant_name)` is.
restaurant_index = trainset_full.to_inner_iid(restaurant_name)

In [34]:
# Get the latent factors for the input restaurant
restaurant_factors = model.qi[restaurant_index]

In [35]:
# Calculate similarity scores with other restaurants based on latent factors
similarity_scores = np.dot(model.qi, restaurant_factors)

In [36]:
# Sort the restaurants based on similarity scores in descending order
similar_restaurant_indices = np.argsort(similarity_scores)[::-1]

In [37]:
# Get top N recommended restaurants (excluding the input restaurant itself)
top_n = 5
recommended_restaurants = []
for index in similar_restaurant_indices:
    name = trainset_full.to_raw_iid(index)
    if name != restaurant_name:
        recommended_restaurants.append(name)
        if len(recommended_restaurants) == top_n:
            break

print("Recommended Restaurants for {}: {}".format(restaurant_name, recommended_restaurants))

Recommended Restaurants for Miku: ['Breakfast Table', 'Rib City- Vancouver', 'Eastland Sushi & Asian Cuisine', 'I Heart Gyro', 'Sushi Yama']


In [None]:
# Creating the User-Item Matrix
def create_user_item_matrix(data):
    # Pivot the data to create a matrix where rows are user IDs and columns are item IDs
    user_item_matrix = data.pivot_table(index='user_id', columns='item_id', values='rating', fill_value=0)
    return user_item_matrix

# FunkSVD Algorithm for Matrix Factorization
# (named funk_svd so it does not shadow the FunkSVD class imported from surprise)
def funk_svd(user_item_matrix, num_latent_features=50, learning_rate=0.005, regularization=0.02, epochs=50):
    # Work on a plain NumPy array so positional [row, column] indexing is valid
    ratings = np.asarray(user_item_matrix, dtype=float)

    # Initialize user and item matrices with random values
    num_users, num_items = ratings.shape
    user_matrix = np.random.rand(num_users, num_latent_features)
    item_matrix = np.random.rand(num_latent_features, num_items)

    # Only observed ratings (entries > 0) take part in training
    observed = np.argwhere(ratings > 0)

    # Update matrices iteratively using stochastic gradient descent
    for epoch in range(epochs):
        for user_id, item_id in observed:
            error = ratings[user_id, item_id] - np.dot(user_matrix[user_id, :], item_matrix[:, item_id])
            # Copy the user factors so both updates use the pre-update values
            user_factors = user_matrix[user_id, :].copy()
            user_matrix[user_id, :] += learning_rate * (2 * error * item_matrix[:, item_id] - regularization * user_factors)
            item_matrix[:, item_id] += learning_rate * (2 * error * user_factors - regularization * item_matrix[:, item_id])

    return user_matrix, item_matrix

# Reconstruct the Original User-Item Matrix from Factors
def reconstruct_user_item_matrix(user_matrix, item_matrix):
    # Reconstruct the user-item matrix from the learned factors
    user_item_matrix_reconstructed = np.dot(user_matrix, item_matrix)
    return user_item_matrix_reconstructed 

In [None]:
# Example Usage:
import numpy as np
import pandas as pd

# Assuming you have loaded the 'user_item_data' DataFrame with 'user_id', 'item_id', and 'rating' columns
user_item_matrix = create_user_item_matrix(user_item_data)
user_matrix, item_matrix = funk_svd(user_item_matrix)

# Reconstruct the user-item matrix from the learned factors
user_item_matrix_reconstructed = reconstruct_user_item_matrix(user_matrix, item_matrix)

# Compare the original and reconstructed user-item matrices (optional)
print("Original User-Item Matrix:")
print(user_item_matrix)

print("\nReconstructed User-Item Matrix:")
print(user_item_matrix_reconstructed)


In [None]:
# Set the reader with accurate rating scale
my_reader = Reader(rating_scale=(1, 10))

In [52]:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Assuming you have loaded the 'vancouver_data' DataFrame
# Create a reader object to specify the rating scale (from 0 to 5 in this case)
reader = Reader(rating_scale=(0, 5))

# Load the data into Surprise's Dataset format
data = Dataset.load_from_df(vancouver_data[['user_id', 'restaurant_name', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Initialize the SVD algorithm
# You can tune the number of latent factors (n_factors) here
svd = SVD(n_factors=10, random_state=42)

# Fit the model on the trainset
svd.fit(trainset)

# Make predictions on the testset
predictions = svd.test(testset)

# Evaluate the model using Root Mean Squared Error (RMSE)
rmse = accuracy.rmse(predictions)
print("Root Mean Squared Error (RMSE) of SVD: {:.4f}".format(rmse))


RMSE: 0.9251
Root Mean Squared Error (RMSE) of SVD: 0.9251
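
The RMSE and MAE figures reported in this notebook come from Surprise's `accuracy` module. As a minimal sketch of what the two metrics measure, the same formulas can be applied by hand with NumPy. The arrays here are made-up numbers, not values from the Vancouver data.

```python
import numpy as np

# Hypothetical true and predicted ratings for four (user, restaurant) pairs
true_ratings = np.array([5.0, 3.0, 4.0, 1.0])
predicted_ratings = np.array([4.5, 3.5, 4.0, 2.0])

errors = true_ratings - predicted_ratings

# RMSE squares the errors before averaging, so large misses weigh more
rmse = np.sqrt(np.mean(errors ** 2))

# MAE averages the absolute errors, so every miss weighs proportionally
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

Both metrics are in rating units (stars), and RMSE is always at least as large as MAE on the same predictions.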
