## Modeling 

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/18/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Final Dataset](#final)
    * Data Dictionary
3. [Collaborative-Filtering Recommendation System without SVD](#nosvd)
4. [Collaborative-Filtering Recommendation System with SVD](#svd)
5. [Collaborative-Filtering Recommendation System with FunkSVD](#funksvd)

### Introduction <a class="anchor" id="intro"></a>

asdf

#### Importing Python Libraries 

Importing necessary libraries for the EDA process.

In [1]:
# Import the basic packages
import numpy as np 
import pandas as pd 

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# Import the surprise packages
from surprise import SVD
from surprise.reader import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
from surprise import accuracy

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="final"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique user id
* `rating`: star rating

In [2]:
model_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/model_data.pkl')

In [3]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1203530 entries, 40 to 5572793
Data columns (total 5 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   user_id          1203530 non-null  int64  
 1   business_id      1203530 non-null  int64  
 2   rating           1203530 non-null  float64
 3   restaurant_name  1203530 non-null  object 
 4   categories       1203530 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 55.1+ MB


In [4]:
model_data.head()

Unnamed: 0,user_id,business_id,rating,restaurant_name,categories
40,53031,6620,4.0,Thaitation,[Thai]
41,53031,4147,2.0,Howling Wolf Taqueria,"[Bars, Arts & Entertainment, Nightlife, Music ..."
42,53031,12401,3.0,Santarpio's Pizza,"[Pizza, American (Traditional), Italian]"
43,53031,1357,2.0,The Gallows,"[Seafood, Bars, American (New), American (Trad..."
44,53031,3498,3.0,Antique Table,[Italian]


In [5]:
model_data.isnull().sum()

user_id            0
business_id        0
rating             0
restaurant_name    0
categories         0
dtype: int64

In [6]:
print(f"The size of our model dataset is {model_data.shape[0]} entries.")

The size of our model dataset is 1203530 entries.


In [7]:
sampled_model_data = model_data.sample(frac=0.01, random_state=42)

print(f"The size of our sampled model dataset is {sampled_model_data.shape[0]} entries.")

The size of our sampled model dataset is 12035 entries.


### Collaborative-Filtering Recommendation System without SVD <a class="anchor" id="nosvd"></a>

Collaborative filtering is a general technique used in recommendation systems to predict user preferences based on the preferences of similar users. It does not involve matrix factorization. Instead, it relies on computing similarities between users or items to generate recommendations. Collaborative filtering without SVD directly operates on the user-item interaction matrix and may use various similarity metrics to find similar users or items. 

In [8]:
# User-Item Interaction Matrix
user_item_matrix = sampled_model_data.pivot_table(index='user_id', columns='business_id', values='rating')
user_item_matrix.sample(5)

business_id,2,4,5,6,8,9,10,11,13,14,...,14301,14305,14306,14308,14310,14313,14315,14317,14318,14321
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
71152,,,,,,,,,,,...,,,,,,,,,,
15950,,,,,,,,,,,...,,,,,,,,,,
67142,,,,,,,,,,,...,,,,,,,,,,
77691,,,,,,,,,,,...,,,,,,,,,,
61003,,,,,,,,,,,...,,,,,,,,,,


In [9]:
print(f"The shape of the user item matrix is {user_item_matrix.shape[0]} by {user_item_matrix.shape[1]}.")

The shape of the user item matrix is 8981 by 6649.


In [10]:
# Normalize user-item matrix
user_item_matrix_norm = user_item_matrix.subtract(user_item_matrix.mean(axis=1), axis = 'rows')
user_item_matrix_norm.head()

business_id,2,4,5,6,8,9,10,11,13,14,...,14301,14305,14306,14308,14310,14313,14315,14317,14318,14321
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
18,,,,,,,,,,,...,,,,,,,,,,
25,,,,,,,,,,,...,,,,,,,,,,
27,,,,,,,,,,,...,,,,,,,,,,


In [16]:
# User similarity matrix using Pearson correlation
user_similarity = user_item_matrix_norm.T.corr()
user_similarity.head()

user_id,4,10,18,25,27,28,36,49,57,60,...,81055,81063,81091,81092,81094,81101,81115,81130,81132,81133
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
18,,,,,,,,,,,...,,,,,,,,,,
25,,,,,,,,,,,...,,,,,,,,,,
27,,,,,,,,,,,...,,,,,,,,,,


In [19]:
# Pick a user ID
picked_userid = 67142

# Remove picked user ID from the candidate list
user_similarity.drop(index=picked_userid, inplace=True)

# Take a look at the data
user_similarity.head()

user_id,4,10,18,25,27,28,36,49,57,60,...,81055,81063,81091,81092,81094,81101,81115,81130,81132,81133
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
18,,,,,,,,,,,...,,,,,,,,,,
25,,,,,,,,,,,...,,,,,,,,,,
27,,,,,,,,,,,...,,,,,,,,,,


In [20]:
# Number of similar users
n = 10

# User similarity threashold
user_similarity_threshold = 0.3

# Get top n similar users
similar_users = user_similarity[user_similarity[picked_userid]>user_similarity_threshold][picked_userid].sort_values(ascending=False)[:n]
Q
# Print out top n similar users
print(f'The similar users for user {picked_userid} are', similar_users)

The similar users for user 67142 are Series([], Name: 67142, dtype: float64)


In [9]:
# Calculate the similarity matrix 
similarity_matrix = user_item_matrix.corr()

In [None]:
# Choose a user for whom we want to make recommendations (e.g., user_id = 5)
target_user_id = 53031

# Get the ratings of the target user
target_user_ratings = user_item_matrix.loc[target_user_id]

# Find similar users to the target user based on the similarity matrix
similar_users = similarity_matrix[target_user_id].drop(target_user_id).sort_values(ascending=False)

# Filter out restaurants the target user has already rated
unrated_restaurants = target_user_ratings[target_user_ratings == 0].index

# Make restaurant recommendations based on the ratings of similar users
recommendations = pd.Series()
for restaurant_id in unrated_restaurants:
    similar_users_ratings = user_item_matrix.loc[similar_users.index, restaurant_id]
    weighted_rating = sum(similar_users_ratings * similar_users) / sum(similar_users)
    recommendations[restaurant_id] = weighted_rating

# Sort the recommendations in descending order
recommendations = recommendations.sort_values(ascending=False)

print("Restaurant recommendations for User {}: \n{}".format(target_user_id, recommendations))


In [10]:
# Function to get top N recommendations for a user
def get_top_N_recommendations(user_id, N=5):
    user_index = user_item_matrix.index.get_loc(user_id)

    similar_users = user_similarity[user_index]

    top_similar_users_indices = similar_users.argsort()[::-1][1:N+1]  # Exclude the user itself

    top_recommendations = user_item_matrix.iloc[top_similar_users_indices].mean(axis=0)
    top_recommendations = top_recommendations.sort_values(ascending=False)
    
    return top_recommendations.index.tolist()

In [11]:
# Example: Get top 5 recommendations for a user 
user_id = 53031
top_recommendations = get_top_N_recommendations(user_id, N=5)
print(top_recommendations)

[10817, 13179, 10569, 4256, 6333, 1675, 901, 6465, 2527, 9567, 9560, 9566, 9571, 9574, 9564, 9575, 9577, 2, 9558, 9580, 9547, 9546, 9545, 9540, 9539, 9536, 9534, 9533, 9529, 9526, 9523, 9549, 9589, 9583, 9584, 9626, 9625, 9624, 9619, 9618, 9617, 9614, 9613, 9611, 9608, 9607, 9604, 9600, 9597, 9596, 9595, 9593, 9592, 9591, 9521, 9587, 9522, 9516, 9520, 9432, 9456, 9455, 9452, 9451, 9450, 9444, 9442, 9441, 9437, 9433, 9427, 9458, 9426, 9425, 9422, 9414, 9411, 9409, 9405, 9403, 9402, 9400, 9457, 9459, 9519, 9493, 9628, 9515, 9512, 9508, 9506, 9504, 9503, 9500, 9495, 9494, 9483, 9461, 9480, 9477, 9473, 9470, 9469, 9468, 9466, 9465, 9463, 9462, 9627, 9643, 9634, 9780, 9806, 9804, 9799, 9798, 9796, 9789, 9788, 9786, 9783, 9782, 9779, 9752, 9774, 9772, 9771, 9769, 9766, 9762, 9760, 9759, 9758, 9754, 9807, 9808, 9810, 9811, 9863, 9862, 9861, 9860, 9857, 9854, 9844, 9842, 9840, 9837, 9836, 9834, 9829, 9828, 9827, 9825, 9824, 9822, 9818, 9815, 9813, 9753, 9750, 9635, 9659, 9686, 9680, 9679, 9675

In [None]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='ratingg')

In [None]:
# Fill missing values (NaNs) with zeros
user_item_matrix = user_item_matrix.fillna(0)

In [None]:
user_item_matrix.shape

(81142, 14323)

In [None]:
# Displaying the first few rows to get an initial glimpse of the data
user_item_matrix.head()

business_id,--164t1nclzzmca7eDiJMw,--Q3mAcX9t63f7Xcbn7LVA,--UNNdnHRhsyFUbDgumdtQ,-0A60UZl9nbdq2WWySJ_tQ,-0iqnv7MjKrgh7Q7bYRlUQ,-0sIQ96u8XevGUXZ--pvaA,-1ShItlulHnBsoOQWnblzw,-1h2qkElNfKjUPw6brMbIw,-1mmKpu7b_NlBit2pOOPnQ,-1sIJLX71taHD-BgbwY64Q,...,zvKfCAOBzVcxc1HLpoIY8A,zwKIQgthba1FUPWS7nOo0w,zwhSGiftT_yzKSEmMCol6Q,zwn53gHyn1NlX9h3jKFOUg,zyBC3BUkH9klhPhMyQmxAQ,zyHMtStYlKG67WRprp6GZQ,zyauuvAYdVweBK4L7wBRmw,zz4WGzntV59HqhefV5zigQ,zzin1d1oHi81GuI0ufo1VA,zzlkjDG9Rv8Jn-vSolMgyw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--0zxhZTSLZ7w1hUD2bEwA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--17Db1K-KujRuN7hY9Z0Q,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--2vR0DIsmQ6WfcSzKWigw,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--3WaS23LcIXtxyFULJHTA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--3l8wysfp49Z2TLnyT0vg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
similarity_scores = cosine_similarity(user_item_matrix)
similarity_scores.shape

: 

: 

In [None]:
def recommend(business_id):
    # Find the index of the input restaurant name in the pivot table
    index = np.where(user_item_matrix.index == business_id)[0][0]

    # Retrieve the similarity scores of the input restaurant with other restaurants,
    # sort them in descending order, and select the top 4 similar items
    similar_items = sorted(list(enumerate(similarity_scores[index])), key=lambda x: x[1], reverse=True)[1:5]

    # Initialize an empty list to store recommended restaurant names
    data = []

    # Iterate through each similar item
    for i in similar_items:
        # Fetch the relevant restaurant name from the 'business_data' dataset
        similar_business_id = user_item_matrix.index[i[0]]

        # Append the restaurant name to the 'data' list
        data.append(similar_business_id)

    # Return the 'data' list containing names of the recommended restaurants
    return data

### Collaborative-Filtering Recommendation System with SVD <a class="anchor" id="svd"></a>

Traditional Singular Value Decomposition is a matrix factorization technique that decomposes a given matrix into three matrices: U (user features), Σ (singular values), and V^T (item features). While traditional SVD can be applied to recommendation systems, it assumes a complete user-item interaction matrix without any missing values. This assumption is often not applicable in real-world scenarios where user-item matrices are typically sparse.

In [None]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='rating').fillna(0)
user_item_matrix.sample(5)

restaurant_name,Gruby's New York Deli,'Ohana,/pôr/ wine house,10 Barrel Brewing Portland,10 Degrees South,101 Beer Kitchen,101 By Teahaus,101 Steak,10th & Piedmont,110 Grill,...,laV,mmmpanadas,nati's southern seafood boil,sweetgreen,wagamama,wagamama - faneuil hall,wagamama - prudential,wagamama - seaport,zpizza,ñoños tacos
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
UZRYHUjRmNrPOTjmCa4_gg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wBT7zqYaMMfsuhHKB5XqgQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RkLluG0LGXiJgf2i9dGmDQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5AL4m5Nh1P91HuKxewdWPQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
IBOnLGJ4jEti15dw-nasPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Traditional SVD
svd = TruncatedSVD(n_components=50) 
user_features = svd.fit_transform(user_item_matrix)

In [None]:
# Function to get top N recommendations for a user using Traditional SVD
def get_top_N_recommendations_svd(user_id, N=5):
    user_index = user_item_matrix.index.get_loc(user_id)
    user_feature = user_features[user_index]

    predicted_ratings = pd.Series(user_features.dot(svd.components_)[user_index])

    top_recommendations = predicted_ratings.sort_values(ascending=False)

    return top_recommendations.index.tolist()[:N]

In [None]:
# Example: Get top 5 recommendations for a user with user_id = 123 using Traditional SVD
user_id = 'UZRYHUjRmNrPOTjmCa4_gg'
top_recommendations_svd = get_top_N_recommendations_svd(user_id, N=5)
print(top_recommendations_svd)

[8335, 10217, 12180, 11548, 3779]


### Collaborative-Filtering Recommendation System with FunkSVD <a class="anchor" id="funksvd"></a>

FunkSVD is a specific variant of SVD designed for collaborative filtering tasks in recommendation systems. It addresses the sparsity issue present in user-item interaction matrices by incorporating stochastic gradient descent to handle missing values efficiently. FunkSVD performs matrix factorization and decomposes the user-item interaction matrix into user and item latent feature matrices.

In [12]:
# User-Item Interaction Matrix
user_item_matrix = sampled_model_data.pivot_table(index='user_id', columns='business_id', values='rating')
user_item_matrix.sample(5)

business_id,2,4,5,6,8,9,10,11,13,14,...,14301,14305,14306,14308,14310,14313,14315,14317,14318,14321
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
53394,,,,,,,,,,,...,,,,,,,,,,
57999,,,,,,,,,,,...,,,,,,,,,,
47608,,,,,,,,,,,...,,,,,,,,,,
18551,,,,,,,,,,,...,,,,,,,,,,
19621,,,,,,,,,,,...,,,,,,,,,,


In [13]:
# FunkSVD
def FunkSVD(matrix, latent_features=50, learning_rate=0.0002, epochs=100):
    user_matrix = np.random.rand(matrix.shape[0], latent_features)
    item_matrix = np.random.rand(matrix.shape[1], latent_features)
    
    for _ in range(epochs):
        for i in range(matrix.shape[0]):
            for j in range(matrix.shape[1]):
                if matrix[i, j] > 0:
                    error = matrix[i, j] - np.dot(user_matrix[i, :], item_matrix[j, :].T)
                    for k in range(latent_features):
                        user_matrix[i, k] += learning_rate * (2 * error * item_matrix[j, k])
                        item_matrix[j, k] += learning_rate * (2 * error * user_matrix[i, k])
    
    return user_matrix, item_matrix

In [14]:
# Function to get top N recommendations for a user using FunkSVD
def get_top_N_recommendations_funksvd(user_id, N=5):
    user_index = user_item_matrix.index.get_loc(user_id)
    user_feature = user_matrix[user_index]

    predicted_ratings = pd.Series(user_matrix.dot(item_matrix.T)[user_index])

    top_recommendations = predicted_ratings.sort_values(ascending=False)
    
    return top_recommendations.index.tolist()[:N]

In [15]:
# Example: Get top 5 recommendations for a user 
user_id = 53031
top_recommendations_funksvd = get_top_N_recommendations_funksvd(user_id, N=5)
print(top_recommendations_funksvd)

NameError: name 'user_matrix' is not defined

In [16]:
# User-Item Interaction Matrix
user_item_matrix = sampled_model_data.pivot_table(index='user_id', columns='business_id', values='rating')

In [17]:
user_item_matrix.shape

(8981, 6649)

In [18]:
# Displaying the first few rows to get an initial glimpse of the data
user_item_matrix.head()

business_id,2,4,5,6,8,9,10,11,13,14,...,14301,14305,14306,14308,14310,14313,14315,14317,14318,14321
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
18,,,,,,,,,,,...,,,,,,,,,,
25,,,,,,,,,,,...,,,,,,,,,,
27,,,,,,,,,,,...,,,,,,,,,,


In [19]:
user_item_matrix.columns

Index([    2,     4,     5,     6,     8,     9,    10,    11,    13,    14,
       ...
       14301, 14305, 14306, 14308, 14310, 14313, 14315, 14317, 14318, 14321],
      dtype='int64', name='business_id', length=6649)

In [21]:
# Set the reader with accurate rating scale
my_reader = Reader(rating_scale=(1, 5))

# Create the dataset using the reader object and the rating DataFrame
my_dataset = Dataset.load_from_df(model_data[['user_id', 'business_id', 'rating']], my_reader)

In [22]:
my_dataset

<surprise.dataset.DatasetAutoFolds at 0x17ac46430>

In [23]:
# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] } #The parameter indicates to the algorithm that all latent information must be stored. 

# Set GridSearchCV with 3 cross-validation
GS = GridSearchCV(SVD, param_grid, measures=['fcp'], cv=3)

# Fit the model with the grid search on the training set
GS.fit(my_dataset)

# Get the best hyperparameters
best_params = GS.best_params['fcp']
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'n_factors': 150, 'n_epochs': 10, 'lr_all': 0.005, 'biased': False}


In [24]:
# Split train-test set 
trainset, testset = train_test_split(my_dataset, test_size=0.25)

In [25]:
# Set the algorithm
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=10, 
                 lr_all=0.005, 
                 biased=False,
                 verbose=0)
# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

TypeError: FunkSVD() got an unexpected keyword argument 'n_factors'

In [None]:
# Access the P and Q matrices from the fitted model
P = my_svd.pu  # User matrix (P)
P
Q = my_svd.qi  # Item matrix (Q)
Q

array([[ 0.06194114, -0.14112599, -0.31683039, ..., -0.16282319,
         0.17948535, -0.2134076 ],
       [-0.34110531,  0.29034201, -0.00711049, ...,  0.06596525,
         0.34702696, -0.17407634],
       [-0.13488896,  0.43430532, -0.50093401, ..., -0.14135051,
         0.17353653,  0.30471074],
       ...,
       [-0.02159553,  0.13798291,  0.01950472, ..., -0.07777211,
        -0.0902592 , -0.04440696],
       [-0.08594597,  0.1347763 , -0.08041756, ...,  0.09145995,
         0.18483861, -0.16068039],
       [-0.19825591, -0.11820167, -0.01589753, ...,  0.19685929,
         0.00525314, -0.26347986]])

In [None]:
# Put my_pred result in a dataframe
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                                'business_id',
                                                'actual',
                                                'prediction',
                                                'details'])

# Calculate the difference of actual and prediction into diff column
df_prediction['diff'] = abs(df_prediction['prediction'] - 
                            df_prediction['actual'])

In [None]:
# Check the df_prediction
df_prediction.head()

Unnamed: 0,user_id,business_id,actual,prediction,details,diff
0,hYlCMQ278BvKv9IP9v_m4w,Dinesty Dumpling House,1.0,3.298706,{'was_impossible': False},2.298706
1,XUQjZyApQXImNifP-2tAFQ,The Original Hoffbrau,5.0,3.201321,{'was_impossible': False},1.798679
2,Pf7FI0OukC_CEcCz0ZxoUw,KOi Fusion,5.0,4.448637,{'was_impossible': False},0.551363
3,g37Y_WmgPcJI9bf_kPV2Og,First Printer,4.0,2.085826,{'was_impossible': False},1.914174
4,ZveYZ3n1IOjP9H4HfFn3Yg,Fabian's,5.0,3.457422,{'was_impossible': False},1.542578


In [None]:
# See the best 10 predictions
df_prediction.sort_values(by='diff')[:10]

Unnamed: 0,user_id,business_id,actual,prediction,details,diff
242072,UZ8_xqhiguIYb9Lu2Wu8og,Museum Of Fine Arts,5.0,5.0,{'was_impossible': False},0.0
49595,9EB_WZ5Lw991mrnfkzkqvQ,Sushi Zanmai,5.0,5.0,{'was_impossible': False},0.0
102938,oSN3M4_WKdlTsnpgqPDiBg,Powell's City of Books,5.0,5.0,{'was_impossible': False},0.0
240896,lGxssT2UmyNZQZWwPDgX3A,Bar Mezzana,5.0,5.0,{'was_impossible': False},0.0
102990,0d89GUvxpJG4oFeL9rtUxQ,Tako Cheena,5.0,5.0,{'was_impossible': False},0.0
240911,nxI8n6lARJpMP5SI8U9S6w,Le Pigeon,5.0,5.0,{'was_impossible': False},0.0
6394,g3UbQdtWX1Luh9_FGIeCAw,Schmidt's Sausage Haus,5.0,5.0,{'was_impossible': False},0.0
102997,Je-c4Qu5od0DwPmYeHYOVg,Screen Door,5.0,5.0,{'was_impossible': False},0.0
280476,krWkC-U2U_YAtYdAvuRwAQ,Santarpio's Pizza,5.0,5.0,{'was_impossible': False},0.0
49526,7mL5GK8Qt3iIkNHfPsGnkg,Ball Square Cafe,5.0,5.0,{'was_impossible': False},0.0


In [None]:
(df_prediction["diff"] <= 1).mean()

0.6014563800547057

In [None]:
# Calculate RMSE
rmse = accuracy.rmse(my_pred)

# Calculate MAE
mae = accuracy.mae(my_pred)

RMSE: 1.3122
MAE:  1.0054


In [None]:
def recommend(business_id, user_item_matrix, P, Q, top_n=5):
    # Find the index of the input restaurant name in the pivot table
    index = user_item_matrix.index.get_loc(business_id)

    # Predict the ratings for the input restaurant using the FunkSVD model
    predicted_ratings = np.dot(P, Q.T)
    restaurant_ratings = predicted_ratings[index, :]

    # Get the indices of top recommended restaurants based on predicted ratings
    top_indices = np.argsort(restaurant_ratings)[::-1][:top_n]

    # Convert the indices to restaurant names
    recommended_restaurants = user_item_matrix.columns[top_indices]

    return recommended_restaurants

In [None]:
recommend('Miku', user_item_matrix, P, Q, top_n=5)

KeyError: 'Miku'

### Item-Item Collaborative-Filtering Recommendation System 


In [None]:
def item_similarity_matrix(df):
    pivot_df = df.pivot(index='user_id', columns='business_id', values='rating').fillna(0)
    item_sim = np.corrcoef(pivot_df.T)
    return item_sim

item_sim_matrix = item_similarity_matrix(model_data)

ValueError: Index contains duplicate entries, cannot reshape

In [None]:
def item_item_collaborative_filtering(user_id, item_sim_matrix, user_item_matrix, top_n=5):
    user_items = user_item_matrix.loc[user_id]
    non_rated_items = user_items[user_items.isnull()].index

    scores = item_sim_matrix[:, non_rated_items].T.dot(user_items)
    scores /= np.array(np.abs(item_sim_matrix[:, non_rated_items]).sum(axis=0)).reshape(-1, 1)

    top_items_idx = np.argsort(scores)[::-1][:top_n]
    top_items = non_rated_items[top_items_idx]
    return top_items

# Example usage: Recommend top 5 restaurants for user_id=1
user_id = 1
top_restaurants = item_item_collaborative_filtering(user_id, item_sim_matrix, df.pivot(index='user_id', columns='restaurant_id', values='rating'))
print(top_restaurants)

In [None]:
model_data['user_id']

40         djp57omz9cccV1wI0_sqqA
41         djp57omz9cccV1wI0_sqqA
42         djp57omz9cccV1wI0_sqqA
43         djp57omz9cccV1wI0_sqqA
44         djp57omz9cccV1wI0_sqqA
                    ...          
5572066    Mc4C7fVY0sEcD-U5eOA2Og
5572085    huXqrSaGyNO1aZKiM55EUg
5572508    KEF5A094wOUdBG7SsS7qKg
5572754    zt9FNJMJNVt65Dl1GMuJqA
5572793    jrfAvTdjH0ykHEtJsqTRRA
Name: user_id, Length: 1203530, dtype: object