## Modeling 

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/18/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Final Dataset](#final)
    * Data Dictionary
3. [Collaborative-Filtering Recommendation System without SVD](#nosvd)
4. [Collaborative-Filtering Recommendation System with SVD](#svd)
5. [Collaborative-Filtering Recommendation System with FunkSVD](#funksvd)

### Introduction <a class="anchor" id="intro"></a>

asdf

#### Importing Python Libraries 

Importing necessary libraries for the EDA process.

In [1]:
# Import the basic packages
import numpy as np 
import pandas as pd 

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# Import the surprise packages
from surprise import SVD
from surprise.reader import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
from surprise import accuracy

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

### Model Dataset <a class="anchor" id="final"></a>

**Data Dictionary:**
* `user_id`: unique user id
* `business_id`: unique user id
* `rating`: star rating

In [2]:
model_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/model_data.pkl')

In [3]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1203530 entries, 40 to 5572793
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   user_id      1203530 non-null  int64  
 1   business_id  1203530 non-null  int64  
 2   rating       1203530 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 36.7 MB


In [4]:
model_data.head()

Unnamed: 0,user_id,business_id,rating
40,53031,6620,4.0
41,53031,4147,2.0
42,53031,12401,3.0
43,53031,1357,2.0
44,53031,3498,3.0


In [6]:
model_data.isnull().sum()

user_id        0
business_id    0
rating         0
dtype: int64

### Collaborative-Filtering Recommendation System without SVD <a class="anchor" id="nosvd"></a>

Collaborative filtering is a general technique used in recommendation systems to predict user preferences based on the preferences of similar users. It does not involve matrix factorization. Instead, it relies on computing similarities between users or items to generate recommendations. Collaborative filtering without SVD directly operates on the user-item interaction matrix and may use various similarity metrics to find similar users or items. 

In [7]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='rating').fillna(0)
user_item_matrix.sample(5)

In [None]:
# Similarity Calculation (Cosine Similarity)
user_similarity = cosine_similarity(user_item_matrix)

In [None]:
# Function to get top N recommendations for a user
def get_top_N_recommendations(user_id, N=5):
    user_index = user_item_matrix.index.get_loc(user_id)

    similar_users = user_similarity[user_index]

    top_similar_users_indices = similar_users.argsort()[::-1][1:N+1]  # Exclude the user itself

    top_recommendations = user_item_matrix.iloc[top_similar_users_indices].mean(axis=0)
    top_recommendations = top_recommendations.sort_values(ascending=False)
    
    return top_recommendations.index.tolist()

In [None]:
# Example: Get top 5 recommendations for a user with user_id = 123
user_id = 123
top_recommendations = get_top_N_recommendations(user_id, N=5)
print(top_recommendations)

In [None]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='ratingg')

In [None]:
# Fill missing values (NaNs) with zeros
user_item_matrix = user_item_matrix.fillna(0)

In [None]:
user_item_matrix.shape

(81142, 14323)

In [None]:
# Displaying the first few rows to get an initial glimpse of the data
user_item_matrix.head()

business_id,--164t1nclzzmca7eDiJMw,--Q3mAcX9t63f7Xcbn7LVA,--UNNdnHRhsyFUbDgumdtQ,-0A60UZl9nbdq2WWySJ_tQ,-0iqnv7MjKrgh7Q7bYRlUQ,-0sIQ96u8XevGUXZ--pvaA,-1ShItlulHnBsoOQWnblzw,-1h2qkElNfKjUPw6brMbIw,-1mmKpu7b_NlBit2pOOPnQ,-1sIJLX71taHD-BgbwY64Q,...,zvKfCAOBzVcxc1HLpoIY8A,zwKIQgthba1FUPWS7nOo0w,zwhSGiftT_yzKSEmMCol6Q,zwn53gHyn1NlX9h3jKFOUg,zyBC3BUkH9klhPhMyQmxAQ,zyHMtStYlKG67WRprp6GZQ,zyauuvAYdVweBK4L7wBRmw,zz4WGzntV59HqhefV5zigQ,zzin1d1oHi81GuI0ufo1VA,zzlkjDG9Rv8Jn-vSolMgyw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--0zxhZTSLZ7w1hUD2bEwA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--17Db1K-KujRuN7hY9Z0Q,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--2vR0DIsmQ6WfcSzKWigw,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--3WaS23LcIXtxyFULJHTA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--3l8wysfp49Z2TLnyT0vg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
similarity_scores = cosine_similarity(user_item_matrix)
similarity_scores.shape

: 

: 

In [None]:
def recommend(business_id):
    # Find the index of the input restaurant name in the pivot table
    index = np.where(user_item_matrix.index == business_id)[0][0]

    # Retrieve the similarity scores of the input restaurant with other restaurants,
    # sort them in descending order, and select the top 4 similar items
    similar_items = sorted(list(enumerate(similarity_scores[index])), key=lambda x: x[1], reverse=True)[1:5]

    # Initialize an empty list to store recommended restaurant names
    data = []

    # Iterate through each similar item
    for i in similar_items:
        # Fetch the relevant restaurant name from the 'business_data' dataset
        similar_business_id = user_item_matrix.index[i[0]]

        # Append the restaurant name to the 'data' list
        data.append(similar_business_id)

    # Return the 'data' list containing names of the recommended restaurants
    return data

### Collaborative-Filtering Recommendation System with SVD <a class="anchor" id="svd"></a>

Traditional Singular Value Decomposition is a matrix factorization technique that decomposes a given matrix into three matrices: U (user features), Σ (singular values), and V^T (item features). While traditional SVD can be applied to recommendation systems, it assumes a complete user-item interaction matrix without any missing values. This assumption is often not applicable in real-world scenarios where user-item matrices are typically sparse.

In [None]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='ratingg').fillna(0)
user_item_matrix.sample(5)

restaurant_name,Gruby's New York Deli,'Ohana,/pôr/ wine house,10 Barrel Brewing Portland,10 Degrees South,101 Beer Kitchen,101 By Teahaus,101 Steak,10th & Piedmont,110 Grill,...,laV,mmmpanadas,nati's southern seafood boil,sweetgreen,wagamama,wagamama - faneuil hall,wagamama - prudential,wagamama - seaport,zpizza,ñoños tacos
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
UZRYHUjRmNrPOTjmCa4_gg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wBT7zqYaMMfsuhHKB5XqgQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RkLluG0LGXiJgf2i9dGmDQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5AL4m5Nh1P91HuKxewdWPQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
IBOnLGJ4jEti15dw-nasPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Traditional SVD
svd = TruncatedSVD(n_components=50) 
user_features = svd.fit_transform(user_item_matrix)

In [None]:
# Function to get top N recommendations for a user using Traditional SVD
def get_top_N_recommendations_svd(user_id, N=5):
    user_index = user_item_matrix.index.get_loc(user_id)
    user_feature = user_features[user_index]

    predicted_ratings = pd.Series(user_features.dot(svd.components_)[user_index])

    top_recommendations = predicted_ratings.sort_values(ascending=False)

    return top_recommendations.index.tolist()[:N]

In [None]:
# Example: Get top 5 recommendations for a user with user_id = 123 using Traditional SVD
user_id = 'UZRYHUjRmNrPOTjmCa4_gg'
top_recommendations_svd = get_top_N_recommendations_svd(user_id, N=5)
print(top_recommendations_svd)

[8335, 10217, 12180, 11548, 3779]


### Collaborative-Filtering Recommendation System with FunkSVD <a class="anchor" id="funksvd"></a>

FunkSVD is a specific variant of SVD designed for collaborative filtering tasks in recommendation systems. It addresses the sparsity issue present in user-item interaction matrices by incorporating stochastic gradient descent to handle missing values efficiently. FunkSVD performs matrix factorization and decomposes the user-item interaction matrix into user and item latent feature matrices.

In [None]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='ratingg')
user_item_matrix.sample(5)

In [None]:
# FunkSVD
def FunkSVD(matrix, latent_features=50, learning_rate=0.0002, epochs=100):
    user_matrix = np.random.rand(matrix.shape[0], latent_features)
    item_matrix = np.random.rand(matrix.shape[1], latent_features)
    
    for _ in range(epochs):
        for i in range(matrix.shape[0]):
            for j in range(matrix.shape[1]):
                if matrix[i, j] > 0:
                    error = matrix[i, j] - np.dot(user_matrix[i, :], item_matrix[j, :].T)
                    for k in range(latent_features):
                        user_matrix[i, k] += learning_rate * (2 * error * item_matrix[j, k])
                        item_matrix[j, k] += learning_rate * (2 * error * user_matrix[i, k])
    
    return user_matrix, item_matrix

In [None]:
# Function to get top N recommendations for a user using FunkSVD
def get_top_N_recommendations_funksvd(user_id, N=5):
    user_index = user_item_matrix.index.get_loc(user_id)
    user_feature = user_matrix[user_index]
    predicted_ratings = pd.Series(user_matrix.dot(item_matrix.T)[user_index])
    top_recommendations = predicted_ratings.sort_values(ascending=False)
    return top_recommendations.index.tolist()[:N]

In [None]:
# Example: Get top 5 recommendations for a user with user_id = 123 using FunkSVD
user_id = 123
top_recommendations_funksvd = get_top_N_recommendations_funksvd(user_id, N=5)
print(top_recommendations_funksvd)

In [None]:
# User-Item Interaction Matrix
user_item_matrix = model_data.pivot_table(index='user_id', columns='business_id', values='ratingg')

In [None]:
user_item_matrix.shape

(81142, 12192)

In [None]:
# Displaying the first few rows to get an initial glimpse of the data
user_item_matrix.head()

restaurant_name,Gruby's New York Deli,'Ohana,/pôr/ wine house,10 Barrel Brewing Portland,10 Degrees South,101 Beer Kitchen,101 By Teahaus,101 Steak,10th & Piedmont,110 Grill,...,laV,mmmpanadas,nati's southern seafood boil,sweetgreen,wagamama,wagamama - faneuil hall,wagamama - prudential,wagamama - seaport,zpizza,ñoños tacos
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--0zxhZTSLZ7w1hUD2bEwA,,,,,,,,,,,...,,,,,,,,,,
--17Db1K-KujRuN7hY9Z0Q,,,,,,,,,,,...,,,,,,,,,,
--2vR0DIsmQ6WfcSzKWigw,,,,,,,,,,,...,,,,,,,,,,
--3WaS23LcIXtxyFULJHTA,,,,,,,,,,,...,,,,,,,,,,
--3l8wysfp49Z2TLnyT0vg,,,,,,,,,,,...,,,,,,,,,,


In [None]:
user_item_matrix.columns

Index([' Gruby's New York Deli', ''Ohana', '/pôr/ wine house',
       '10 Barrel Brewing Portland', '10 Degrees South', '101 Beer Kitchen',
       '101 By Teahaus', '101 Steak', '10th & Piedmont', '110 Grill',
       ...
       'laV', 'mmmpanadas', 'nati's southern seafood boil', 'sweetgreen',
       'wagamama', 'wagamama - faneuil hall', 'wagamama - prudential',
       'wagamama - seaport', 'zpizza', 'ñoños tacos'],
      dtype='object', name='restaurant_name', length=12192)

In [None]:
# Set the reader with accurate rating scale
my_reader = Reader(rating_scale=(1, 5))

# Create the dataset using the reader object and the rating DataFrame
my_dataset = Dataset.load_from_df(model_data[['user_id', 'business_id', 'ratingg']], my_reader)

In [None]:
my_dataset

<surprise.dataset.DatasetAutoFolds at 0x3d8ed9bb0>

In [None]:
# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] } #The parameter indicates to the algorithm that all latent information must be stored. 

# Set GridSearchCV with 3 cross-validation
GS = GridSearchCV(SVD, param_grid, measures=['fcp'], cv=3)

# Fit the model with the grid search on the training set
GS.fit(my_dataset)

# Get the best hyperparameters
best_params = GS.best_params['fcp']
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'n_factors': 100, 'n_epochs': 10, 'lr_all': 0.005, 'biased': False}


In [None]:
# Split train-test set 
trainset, testset = train_test_split(my_dataset, test_size=0.25)

In [None]:
# Set the algorithm
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=10, 
                 lr_all=0.005, 
                 biased=False,
                 verbose=0)
# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [None]:
# Access the P and Q matrices from the fitted model
P = my_svd.pu  # User matrix (P)
P
Q = my_svd.qi  # Item matrix (Q)
Q

array([[ 0.06194114, -0.14112599, -0.31683039, ..., -0.16282319,
         0.17948535, -0.2134076 ],
       [-0.34110531,  0.29034201, -0.00711049, ...,  0.06596525,
         0.34702696, -0.17407634],
       [-0.13488896,  0.43430532, -0.50093401, ..., -0.14135051,
         0.17353653,  0.30471074],
       ...,
       [-0.02159553,  0.13798291,  0.01950472, ..., -0.07777211,
        -0.0902592 , -0.04440696],
       [-0.08594597,  0.1347763 , -0.08041756, ...,  0.09145995,
         0.18483861, -0.16068039],
       [-0.19825591, -0.11820167, -0.01589753, ...,  0.19685929,
         0.00525314, -0.26347986]])

In [None]:
# Put my_pred result in a dataframe
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                                'business_id',
                                                'actual',
                                                'prediction',
                                                'details'])

# Calculate the difference of actual and prediction into diff column
df_prediction['diff'] = abs(df_prediction['prediction'] - 
                            df_prediction['actual'])

In [None]:
# Check the df_prediction
df_prediction.head()

Unnamed: 0,user_id,business_id,actual,prediction,details,diff
0,hYlCMQ278BvKv9IP9v_m4w,Dinesty Dumpling House,1.0,3.298706,{'was_impossible': False},2.298706
1,XUQjZyApQXImNifP-2tAFQ,The Original Hoffbrau,5.0,3.201321,{'was_impossible': False},1.798679
2,Pf7FI0OukC_CEcCz0ZxoUw,KOi Fusion,5.0,4.448637,{'was_impossible': False},0.551363
3,g37Y_WmgPcJI9bf_kPV2Og,First Printer,4.0,2.085826,{'was_impossible': False},1.914174
4,ZveYZ3n1IOjP9H4HfFn3Yg,Fabian's,5.0,3.457422,{'was_impossible': False},1.542578


In [None]:
# See the best 10 predictions
df_prediction.sort_values(by='diff')[:10]

Unnamed: 0,user_id,business_id,actual,prediction,details,diff
242072,UZ8_xqhiguIYb9Lu2Wu8og,Museum Of Fine Arts,5.0,5.0,{'was_impossible': False},0.0
49595,9EB_WZ5Lw991mrnfkzkqvQ,Sushi Zanmai,5.0,5.0,{'was_impossible': False},0.0
102938,oSN3M4_WKdlTsnpgqPDiBg,Powell's City of Books,5.0,5.0,{'was_impossible': False},0.0
240896,lGxssT2UmyNZQZWwPDgX3A,Bar Mezzana,5.0,5.0,{'was_impossible': False},0.0
102990,0d89GUvxpJG4oFeL9rtUxQ,Tako Cheena,5.0,5.0,{'was_impossible': False},0.0
240911,nxI8n6lARJpMP5SI8U9S6w,Le Pigeon,5.0,5.0,{'was_impossible': False},0.0
6394,g3UbQdtWX1Luh9_FGIeCAw,Schmidt's Sausage Haus,5.0,5.0,{'was_impossible': False},0.0
102997,Je-c4Qu5od0DwPmYeHYOVg,Screen Door,5.0,5.0,{'was_impossible': False},0.0
280476,krWkC-U2U_YAtYdAvuRwAQ,Santarpio's Pizza,5.0,5.0,{'was_impossible': False},0.0
49526,7mL5GK8Qt3iIkNHfPsGnkg,Ball Square Cafe,5.0,5.0,{'was_impossible': False},0.0


In [None]:
(df_prediction["diff"] <= 1).mean()

0.6014563800547057

In [None]:
# Calculate RMSE
rmse = accuracy.rmse(my_pred)

# Calculate MAE
mae = accuracy.mae(my_pred)

RMSE: 1.3122
MAE:  1.0054


In [None]:
def recommend(business_id, user_item_matrix, P, Q, top_n=5):
    # Find the index of the input restaurant name in the pivot table
    index = user_item_matrix.index.get_loc(business_id)

    # Predict the ratings for the input restaurant using the FunkSVD model
    predicted_ratings = np.dot(P, Q.T)
    restaurant_ratings = predicted_ratings[index, :]

    # Get the indices of top recommended restaurants based on predicted ratings
    top_indices = np.argsort(restaurant_ratings)[::-1][:top_n]

    # Convert the indices to restaurant names
    recommended_restaurants = user_item_matrix.columns[top_indices]

    return recommended_restaurants

In [None]:
recommend('Miku', user_item_matrix, P, Q, top_n=5)

KeyError: 'Miku'

### Item-Item Collaborative-Filtering Recommendation System 


In [None]:
def item_similarity_matrix(df):
    pivot_df = df.pivot(index='user_id', columns='business_id', values='rating').fillna(0)
    item_sim = np.corrcoef(pivot_df.T)
    return item_sim

item_sim_matrix = item_similarity_matrix(model_data)

ValueError: Index contains duplicate entries, cannot reshape

In [None]:
def item_item_collaborative_filtering(user_id, item_sim_matrix, user_item_matrix, top_n=5):
    user_items = user_item_matrix.loc[user_id]
    non_rated_items = user_items[user_items.isnull()].index

    scores = item_sim_matrix[:, non_rated_items].T.dot(user_items)
    scores /= np.array(np.abs(item_sim_matrix[:, non_rated_items]).sum(axis=0)).reshape(-1, 1)

    top_items_idx = np.argsort(scores)[::-1][:top_n]
    top_items = non_rated_items[top_items_idx]
    return top_items

# Example usage: Recommend top 5 restaurants for user_id=1
user_id = 1
top_restaurants = item_item_collaborative_filtering(user_id, item_sim_matrix, df.pivot(index='user_id', columns='restaurant_id', values='rating'))
print(top_restaurants)

In [None]:
model_data['user_id']

40         djp57omz9cccV1wI0_sqqA
41         djp57omz9cccV1wI0_sqqA
42         djp57omz9cccV1wI0_sqqA
43         djp57omz9cccV1wI0_sqqA
44         djp57omz9cccV1wI0_sqqA
                    ...          
5572066    Mc4C7fVY0sEcD-U5eOA2Og
5572085    huXqrSaGyNO1aZKiM55EUg
5572508    KEF5A094wOUdBG7SsS7qKg
5572754    zt9FNJMJNVt65Dl1GMuJqA
5572793    jrfAvTdjH0ykHEtJsqTRRA
Name: user_id, Length: 1203530, dtype: object