## Content Based Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Business Dataset](#model)
3. [Baseline Content Based Filtering](#base)

### Introduction <a class="anchor" id="intro"></a>

In this notebook, we build a content-based recommender system for restaurants. Our goal is to suggest restaurants to users based on their preferences and past interactions with restaurants. By leveraging the content and features of the restaurants, we can provide personalized recommendations that align with each user's taste.

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import NLP (Natural Language Processing) packages
import string
import nltk

# Import TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import train_test_split function from scikit-learn to split the data into training and test sets
from sklearn.model_selection import train_test_split

# Import cosine_similarity function from scikit-learn to calculate similarity scores
from sklearn.metrics.pairwise import cosine_similarity

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Business Dataset <a class="anchor" id="model"></a>

We begin by loading the dataset, in which each row pairs a review with the attributes of the reviewed restaurant and of the reviewing user, and inspecting it. This dataset is the foundation for the rest of the notebook, so it is worth understanding its structure and contents before going further.

**Data Dictionary:**
| Column                   | Description                                           |
|--------------------------|-------------------------------------------------------|
| review_id                | Unique identifier for each review.                   |
| user_id                  | Unique identifier for each user.                     |
| business_id              | Unique identifier for each business (restaurant).    |
| stars                    | Rating given by the user for the restaurant (1 to 5 stars). |
| text                     | The review text provided by the user for the restaurant. |
| restaurant_name          | The name of the restaurant.                          |
| address                  | The address of the restaurant.                       |
| city                     | The city where the restaurant is located.            |
| state                    | The state where the restaurant is located.           |
| postal_code              | The postal code of the restaurant's location.        |
| latitude                 | The latitude coordinate of the restaurant's location.|
| longitude                | The longitude coordinate of the restaurant's location.|
| restaurant_rating        | The overall rating of the restaurant.                |
| restaurant_review_count  | The total number of reviews for the restaurant.      |
| is_open                  | Indicator whether the restaurant is open or closed (1 for open, 0 for closed). |
| categories               | The categories or types of the restaurant (comma-separated strings). |
| name                     | The name of the user who provided the review.        |
| user_review_count        | The total number of reviews provided by the user.    |
| average_stars            | The average rating given by the user to various restaurants. |


In [2]:
# Read data from a pickle file into a Pandas DataFrame
final_reviews = pd.read_pickle('T:/GitHub/Brainstation_Capstone/Data/final_reviews.pkl')

In [3]:
# Display concise information about the 'final_reviews' DataFrame
final_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5574784 entries, 0 to 5574783
Data columns (total 19 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   review_id                object 
 1   user_id                  object 
 2   business_id              object 
 3   stars                    float64
 4   text                     object 
 5   restaurant_name          object 
 6   address                  object 
 7   city                     object 
 8   state                    object 
 9   postal_code              object 
 10  latitude                 float64
 11  longitude                float64
 12  restaurant_rating        float64
 13  restaurant_review_count  int64  
 14  is_open                  int64  
 15  categories               object 
 16  name                     object 
 17  user_review_count        int64  
 18  average_stars            float64
dtypes: float64(5), int64(3), object(11)
memory usage: 850.6+ MB


In [4]:
# Display the first few rows of the 'final_reviews' DataFrame
final_reviews.head()

| | review_id | user_id | business_id | stars | text | restaurant_name | address | city | state | postal_code | latitude | longitude | restaurant_rating | restaurant_review_count | is_open | categories | name | user_review_count | average_stars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lWC-xP3rd6obsecCYsGZRg | ak0TdVmGKo4pwqdJSTLwWw | buF9druCkbuXLX526sGELQ | 4.0 | Apparently Prides Osteria had a rough summer a... | Prides Osteria | 240 Rantoul St | Beverly | MA | 1915 | 42.549609 | -70.884046 | 3.5 | 83 | 0 | [Wine Bars, Nightlife, Farmers Market, Bars, I... | Mel | 63 | 4.3 |
| 1 | fLlML7BjkR4_fJnND_hEJw | ak0TdVmGKo4pwqdJSTLwWw | bNZ3-0rse12NKdSVqQ30xw | 4.0 | Came with friends, split the funghi pizza and ... | Sulmona | 608 Main St | Cambridge | MA | 2139 | 42.362867 | -71.093846 | 4.0 | 220 | 1 | [Pizza, Italian, Nightlife, Bars] | Mel | 63 | 4.3 |
| 2 | pRtbswupEVIG1Ykj9xkL7Q | ak0TdVmGKo4pwqdJSTLwWw | BVsIaKL-8QXVjt0Z9WoFWw | 4.0 | Went for late lunch had the combination seafoo... | Village Roast Beef & Seafood | 10 Bessom St | Marblehead | MA | 1945 | 42.500243 | -70.859237 | 4.5 | 53 | 1 | [Seafood, American (Traditional)] | Mel | 63 | 4.3 |
| 3 | fUYl6bnZy4bSGnbPAizXug | ak0TdVmGKo4pwqdJSTLwWw | 4MClvr12OXBNvGu8h1yGpA | 5.0 | We were super excited to try Sarma, having bee... | Sarma | 249 Pearl St | Somerville | MA | 2145 | 42.38818 | -71.095545 | 4.5 | 883 | 1 | [Turkish, Middle Eastern, Moroccan, Tapas/Smal... | Mel | 63 | 4.3 |
| 4 | jHh2LIXNsnJCMUiyI9pt5w | ak0TdVmGKo4pwqdJSTLwWw | 2vH58mhkEl8GdcDug1OwWg | 5.0 | So glad we made the trip to Woburn for Gene's ... | Gene's Chinese Flatbread Cafe | 466 Main St | Woburn | MA | 1801 | 42.481598 | -71.150877 | 4.0 | 233 | 1 | [Cafes, Noodles, Chinese] | Mel | 63 | 4.3 |


In [5]:
# Select only the specified columns from the filtered data
final_reviews = final_reviews[['business_id', 'restaurant_name', 'city', 'state', 'restaurant_rating', 'categories']]

In [6]:
# Count the number of missing values in each column of the 'final_reviews' DataFrame
final_reviews.isnull().sum()

business_id          0
restaurant_name      0
city                 0
state                0
restaurant_rating    0
categories           0
dtype: int64

In [7]:
# Print the size of our model dataset
print(f"The size of our model dataset is {final_reviews.shape[0]} entries.")

The size of our model dataset is 5574784 entries.


In [8]:
# Keep only the first row for each restaurant name so that every name appears once
final_reviews = final_reviews[~final_reviews['restaurant_name'].duplicated(keep='first')]

display(final_reviews)

| | business_id | restaurant_name | city | state | restaurant_rating | categories |
|---|---|---|---|---|---|---|
| 0 | buF9druCkbuXLX526sGELQ | Prides Osteria | Beverly | MA | 3.5 | [Wine Bars, Nightlife, Farmers Market, Bars, I... |
| 1 | bNZ3-0rse12NKdSVqQ30xw | Sulmona | Cambridge | MA | 4.0 | [Pizza, Italian, Nightlife, Bars] |
| 2 | BVsIaKL-8QXVjt0Z9WoFWw | Village Roast Beef & Seafood | Marblehead | MA | 4.5 | [Seafood, American (Traditional)] |
| 3 | 4MClvr12OXBNvGu8h1yGpA | Sarma | Somerville | MA | 4.5 | [Turkish, Middle Eastern, Moroccan, Tapas/Smal... |
| 4 | 2vH58mhkEl8GdcDug1OwWg | Gene's Chinese Flatbread Cafe | Woburn | MA | 4.0 | [Cafes, Noodles, Chinese] |
| ... | ... | ... | ... | ... | ... | ... |
| 4915725 | Oo0XRiHvzsUZiONBQaZUnQ | Geisty's Dogg House | Boulder | CO | 2.5 | [American (New), American (Traditional)] |
| 4916524 | h18B4BtBXJlyIWCHTmLeng | Nations Cafe | Hapeville | GA | 2.5 | [African, American (Traditional)] |
| 5079935 | 5orS6fh8ZX6n7XDl7KZK3Q | Stem & Flats Creole Wingery | Atlanta | GA | 4.0 | [Chicken Wings, Comfort Food, Cajun/Creole, Ch... |
| 5237494 | Z_YP0Y7ZxUu-tIaCT0r1fQ | Pacer Inn and Suites | Delaware | OH | 4.5 | [Bed & Breakfast, Hotels & Travel, Event Plann... |
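
Note that the cell above deduplicates on `restaurant_name`, so restaurants that share a name but have different `business_id` values (for example, chain locations in different cities) collapse into a single row. The sketch below contrasts that choice with deduplicating on `business_id`. The toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: two different businesses share the name "Subway"
toy = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2', 'b3'],
    'restaurant_name': ['Subway', 'Subway', 'Subway', 'Sarma'],
    'city': ['Boston', 'Boston', 'Austin', 'Somerville'],
})

# Deduplicating on the name (as in the cell above) keeps a single "Subway"
by_name = toy[~toy['restaurant_name'].duplicated(keep='first')]

# Deduplicating on the ID keeps one row per business instead
by_id = toy.drop_duplicates(subset='business_id', keep='first')

print(len(by_name), "rows when deduplicating by name")
print(len(by_id), "rows when deduplicating by business_id")
```

Deduplicating by ID would keep every location but would also mean a name lookup can return several rows, so this is a trade-off rather than a drop-in replacement.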


To organize the dataset, we sort it by business ID in ascending order and then replace the original string IDs with consecutive integers starting at 0. Because the rows are sorted by ID before renumbering, each restaurant's new `business_id` equals its row position, which later lets us use it directly as a row index into the similarity matrix.

In [9]:
# Sort the restaurants by 'business_id' in ascending order
sorted_data = final_reviews.sort_values(by='business_id')

# Display the sorted data
display(sorted_data)

| | business_id | restaurant_name | city | state | restaurant_rating | categories |
|---|---|---|---|---|---|---|
| 82678 | --164t1nclzzmca7eDiJMw | Me So Hungry | Austin | TX | 4.0 | [Ethnic Food, Nightlife, Dive Bars, Bars, Viet... |
| 375730 | --6COJIAjkQwSUZci_4PJQ | Medley | Portland | OR | 4.0 | [Breakfast & Brunch, Bakeries, Tea Rooms, Coff... |
| 356879 | --Q3mAcX9t63f7Xcbn7LVA | The Royce | Columbus | OH | 4.5 | [American (Traditional), Gastropubs] |
| 210 | --UNNdnHRhsyFUbDgumdtQ | Le Pigeon | Portland | OR | 4.5 | [French, American (New)] |
| 1438245 | --_nBudPOb1lNRgKfjLtrw | Mezcal Cantina & Grill | Columbus | OH | 4.0 | [Mexican, Gastropubs] |
| ... | ... | ... | ... | ... | ... | ... |
| 2537 | zzin1d1oHi81GuI0ufo1VA | Cafe Bombay | Atlanta | GA | 3.5 | [Pakistani, Indian] |
| 1220222 | zzjgNzmsxCGlrMCNoPge5Q | Don Patron | Pickerington | OH | 2.5 | [Mexican] |
| 3204 | zzlkjDG9Rv8Jn-vSolMgyw | Glenn's Kitchen | Atlanta | GA | 3.5 | [Comfort Food, Southern, American (New)] |
| 397442 | zzpmoTVq4yn86U7ArHyFBQ | T4 | Portland | OR | 4.0 | [Juice Bars & Smoothies, Waffles, Bubble Tea, ... |


In [10]:
# Replace the string IDs with consecutive integers (0, 1, 2, ...) that follow the sorted order
sorted_data['business_id'] = sorted_data['business_id'].rank(method='dense').astype(int) - 1

# Display the updated DataFrame
display(sorted_data)

| | business_id | restaurant_name | city | state | restaurant_rating | categories |
|---|---|---|---|---|---|---|
| 82678 | 0 | Me So Hungry | Austin | TX | 4.0 | [Ethnic Food, Nightlife, Dive Bars, Bars, Viet... |
| 375730 | 1 | Medley | Portland | OR | 4.0 | [Breakfast & Brunch, Bakeries, Tea Rooms, Coff... |
| 356879 | 2 | The Royce | Columbus | OH | 4.5 | [American (Traditional), Gastropubs] |
| 210 | 3 | Le Pigeon | Portland | OR | 4.5 | [French, American (New)] |
| 1438245 | 4 | Mezcal Cantina & Grill | Columbus | OH | 4.0 | [Mexican, Gastropubs] |
| ... | ... | ... | ... | ... | ... | ... |
| 2537 | 37333 | Cafe Bombay | Atlanta | GA | 3.5 | [Pakistani, Indian] |
| 1220222 | 37334 | Don Patron | Pickerington | OH | 2.5 | [Mexican] |
| 3204 | 37335 | Glenn's Kitchen | Atlanta | GA | 3.5 | [Comfort Food, Southern, American (New)] |
| 397442 | 37336 | T4 | Portland | OR | 4.0 | [Juice Bars & Smoothies, Waffles, Bubble Tea, ... |
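
To see what the dense ranking does, here is a small sketch on made-up string IDs (the values are hypothetical). It also shows `pd.factorize(..., sort=True)` as one possible alternative that produces the same zero-based codes. This is an illustration under those assumptions, not the notebook's own method.

```python
import pandas as pd

# Made-up string IDs, with one repeated value
ids = pd.Series(['zz9', 'a-1', 'm_4', 'a-1'])

# Dense rank: equal IDs share a rank and ranks have no gaps; subtract 1 to start at 0
dense = ids.rank(method='dense').astype(int) - 1

# factorize with sort=True assigns codes in sorted order of the unique values
codes, uniques = pd.factorize(ids, sort=True)

print(dense.tolist())    # integer ID per row from dense ranking
print(codes.tolist())    # integer ID per row from factorize
print(list(uniques))     # the sorted unique string IDs
```

Both approaches map the smallest string to 0 and give repeated IDs the same integer, which is the property the renumbering above relies on.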


In [11]:
sorted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37338 entries, 82678 to 456333
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   business_id        37338 non-null  int32  
 1   restaurant_name    37338 non-null  object 
 2   city               37338 non-null  object 
 3   state              37338 non-null  object 
 4   restaurant_rating  37338 non-null  float64
 5   categories         37338 non-null  object 
dtypes: float64(1), int32(1), object(4)
memory usage: 1.9+ MB


In [12]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique business IDs
print("Number of unique business IDs:", sorted_data['business_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['restaurant_rating'].min(), "to", sorted_data['restaurant_rating'].max())

Number of restaurants: 37338
Number of unique business IDs: 37338
Range of ratings: 1.0 to 5.0


### Baseline Content Based Filtering <a class="anchor" id="base"></a>

Before applying the TF-IDF (term frequency-inverse document frequency) transformation, we convert the `categories` column from lists to comma-separated strings and append the city and state to form a single `features` text column. This preprocessing step puts the data in the format the TF-IDF vectorizer expects.

In [13]:
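# Join each list of categories into a single comma-separated string
sorted_data['categories'] = sorted_data['categories'].apply(lambda x: ', '.join(x))

# Combine categories, city, and state into one text feature per restaurant
sorted_data['features'] = sorted_data['categories'] + ', ' + sorted_data['city'] + ', ' + sorted_data['state']

# Drop the columns that are no longer needed
sorted_data.drop(['city', 'state', 'restaurant_rating', 'categories'], axis=1, inplace=True)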

display(sorted_data)

| | business_id | restaurant_name | features |
|---|---|---|---|
| 82678 | 0 | Me So Hungry | Ethnic Food, Nightlife, Dive Bars, Bars, Vietn... |
| 375730 | 1 | Medley | Breakfast & Brunch, Bakeries, Tea Rooms, Coffe... |
| 356879 | 2 | The Royce | American (Traditional), Gastropubs, Columbus, OH |
| 210 | 3 | Le Pigeon | French, American (New), Portland, OR |
| 1438245 | 4 | Mezcal Cantina & Grill | Mexican, Gastropubs, Columbus, OH |
| ... | ... | ... | ... |
| 2537 | 37333 | Cafe Bombay | Pakistani, Indian, Atlanta, GA |
| 1220222 | 37334 | Don Patron | Mexican, Pickerington, OH |
| 3204 | 37335 | Glenn's Kitchen | Comfort Food, Southern, American (New), Atlant... |
| 397442 | 37336 | T4 | Juice Bars & Smoothies, Waffles, Bubble Tea, T... |


Next, we define the `tokenizer` function. It takes a sentence as input, removes punctuation, converts the text to lowercase, splits the sentence into words, removes English stopwords, and applies stemming using the Porter Stemmer. This prepares the text for TF-IDF and similar natural language processing steps.

In [14]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string

# A set gives fast membership checks for stopwords
ENGLISH_STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def tokenizer(sentence):
    # Lowercase once, then remove every punctuation character
    sentence = sentence.lower()
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')

    # Split the sentence into words on single spaces
    listofwords = sentence.split(' ')
    listofstemmed_words = []

    # Drop stopwords and empty strings, then stem the remaining words
    for word in listofwords:
        if word not in ENGLISH_STOP_WORDS and word != '':
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words
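
To make the tokenizer's stages concrete, here is a dependency-free sketch of the cleaning steps applied to one feature string. It is not the function above: stemming is left out, `simple_tokens` is a hypothetical helper, and `STOP_WORDS` is a tiny stand-in for NLTK's English stopword list (which, like this stand-in, contains "or").

```python
import string

# Tiny stand-in for NLTK's English stopword list (assumption for illustration)
STOP_WORDS = {"the", "and", "of", "or"}

# Translation table that deletes every punctuation character in one pass
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def simple_tokens(sentence):
    # Lowercase and strip punctuation
    cleaned = sentence.lower().translate(PUNCT_TABLE)
    # Removing "&" leaves a double space, so split(' ') yields an empty string;
    # the `if word` check filters it out, like `word != ''` in the tokenizer above
    return [word for word in cleaned.split(' ') if word and word not in STOP_WORDS]

tokens = simple_tokens("Breakfast & Brunch, American (New), Portland, OR")
print(tokens)
# ['breakfast', 'brunch', 'american', 'new', 'portland']
```

The missing last token points at a side effect worth checking in the real pipeline: the state code "OR" lowercases to the stopword "or" and is dropped, so it appears that some state abbreviations do not reach the TF-IDF vocabulary.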

Here, we create the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, the key step in text feature extraction. Passing our function through the `tokenizer` argument makes the vectorizer use it to preprocess and split each `features` string. Setting `min_df=30` ignores terms that appear in fewer than 30 restaurants, and `max_features=1000` caps the vocabulary at the 1,000 most frequent terms.

In [15]:
# Create the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, min_df=30, max_features=1000)

# Fit and transform the corpus using the vectorizer
tfidf_matrix = tfidf_vectorizer.fit_transform(sorted_data['features'])

In [16]:
# Print the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (37338, 404)


In [17]:
# Create a DataFrame 'features' from the TF-IDF transformed data
features = pd.DataFrame(columns=tfidf_vectorizer.get_feature_names_out(), data=tfidf_matrix.toarray())

# Display the DataFrame
display(features)

Unnamed: 0,acai,activ,adult,afghan,african,allston,altamont,american,amus,apopka,...,windermer,wine,wineri,wing,winter,winthrop,woburn,worthington,wrap,yogurt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.284491,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.340804,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.249164,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The features we use for similarity are each restaurant's categories and its city and state. Computing the cosine similarity between the TF-IDF vectors identifies restaurants that share cuisine types and are located in the same city or state. The location signal comes from matching place-name tokens, not from geographic distance. This similarity-based approach lets us recommend restaurants that resemble ones a user has enjoyed in the past.

In [18]:
# Calculate the cosine similarity matrix 
cosine_similarity_matrix = cosine_similarity(features)

In [19]:
print("Shape of cosine_similarity_matrix:", cosine_similarity_matrix.shape)

Shape of cosine_similarity_matrix: (37338, 37338)
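
For reference, cosine similarity between two vectors $a$ and $b$ is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below computes it by hand with NumPy on a made-up 3 × 3 matrix and checks the result against scikit-learn's `cosine_similarity`. The values in `X` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature rows (one row per item)
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 2.0],
])

# Dot products between all pairs of rows, divided by the product of their norms
norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = (X @ X.T) / (norms @ norms.T)

# The manual result should match scikit-learn's implementation
assert np.allclose(manual, cosine_similarity(X))
print(manual.round(3))

# Rough memory needed for a dense float64 similarity matrix of the notebook's size
n = 37338
print(n * n * 8 / 1e9, "GB")
```

Rows 0 and 1 share one of their two non-zero features and get a similarity of 0.5, while rows 0 and 2 share none and get 0. At the notebook's scale the dense 37338 × 37338 matrix takes roughly 11 GB as float64, which is worth keeping in mind on machines with limited memory.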


The `restaurant_recommender` function suggests restaurants similar to a specified restaurant using the precomputed similarity scores. It looks up the restaurant by name, uses its `business_id` as the row index into the similarity matrix, and returns the 10 highest-scoring restaurants. Because every restaurant has a similarity of 1.0 with itself, the queried restaurant appears at the top of its own list.

In [20]:
def restaurant_recommender(name, restaurants, similarities):
    # Get the restaurant by name
    restaurant_data = restaurants[restaurants['restaurant_name'] == name]
    
    # Extract the 'business_id' of the restaurant from the filtered data
    business_id = restaurant_data['business_id'].values[0]

    # Create a dataframe with the restaurant names and similarities
    sim_df = pd.DataFrame(
        {'restaurant': restaurants['restaurant_name'], 
         'similarity': similarities[business_id]
        })
    
    # Get the top 10 similar restaurants
    top_restaurants = sim_df.sort_values(by='similarity', ascending=False).head(10)
    
    return top_restaurants

In [21]:
# Test the recommender
similar_restaurants = restaurant_recommender("Wang's Shanghai Cuisine", sorted_data, cosine_similarity_matrix)
similar_restaurants.head(10)

| | restaurant | similarity |
|---|---|---|
| 143432 | Wang's Shanghai Cuisine | 1.0 |
| 90001 | Happy Noodle House | 0.909666 |
| 143847 | NamNam Noodle | 0.845891 |
| 143015 | No. 1 Dumpling | 0.805372 |
| 145352 | King's Chinese Cuisine | 0.805372 |
| 158001 | Excellent Dim Sum King Restaurant | 0.805372 |
| 175176 | Golden Star Seafood Restaurant | 0.805372 |
| 153739 | Lucky Dragon Palace Restaurant | 0.805372 |
| 191402 | Dong Tai Xiang Shanghai Dim Sum | 0.805372 |
| 155221 | Star Anise Restaurant | 0.805372 |
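
The first row of the result is the queried restaurant itself, and a misspelled name would raise an `IndexError` inside the function. As a possible refinement, the sketch below shows a variant that excludes the query from its own results and fails with a clearer message. `recommend_similar`, its `top_n` parameter, and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def recommend_similar(name, restaurants, similarities, top_n=10):
    """Hypothetical variant of restaurant_recommender that skips the query itself."""
    matches = restaurants.index[restaurants['restaurant_name'] == name]
    if len(matches) == 0:
        raise ValueError(f"Restaurant not found: {name!r}")

    # business_id doubles as the row position in the similarity matrix
    row = restaurants.loc[matches[0], 'business_id']

    sim_df = pd.DataFrame({
        'restaurant': restaurants['restaurant_name'].to_numpy(),
        'similarity': similarities[row],
    })

    # Remove the queried restaurant, then keep the highest-scoring rows
    sim_df = sim_df[sim_df['restaurant'] != name]
    return sim_df.sort_values(by='similarity', ascending=False).head(top_n)

# Toy data: three restaurants and a made-up symmetric similarity matrix
toy = pd.DataFrame({'business_id': [0, 1, 2], 'restaurant_name': ['A', 'B', 'C']})
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

result = recommend_similar('A', toy, toy_sim, top_n=2)
print(result)
```

On the toy data, querying "A" returns "B" and then "C", without "A" itself in the list.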
