## Content Based Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Model Dataset](#model)

### Introduction <a class="anchor" id="intro"></a>

In this notebook, where we'll be building a content-based recommender system for restaurants. Our goal is to create a system that suggests restaurants to users based on their preferences and past interactions with restaurants. By leveraging the content and features of the restaurants, we can provide personalized recommendations that align with each user's taste. 

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import NLP (Natural Language Processing) packages
import string
import nltk

# Import TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import train_test_split function from scikit-learn to split the data into training and test sets
from sklearn.model_selection import train_test_split

# Import cosine_similarity function from scikit-learn to calculate similarity scores
from sklearn.metrics.pairwise import cosine_similarity

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Business Dataset <a class="anchor" id="model"></a>

We will begin by importing our Business Dataset and giving it a good inspection to see how it looks. This dataset will serve as a foundational element for our project, so it's crucial to understand its structure and contents before diving into further steps.

**Data Dictionary:**
* `business_id`: unique business id
* `restaurant_name`: the restaurant's name
* `address`: the full address of the restaurant
* `city`: the city
* `state`: 2 character state code
* `postal_code`: the postal code
* `latitude`: latitude of the restaurant
* `longitude`: longitude of the restaurant
* `restaurant_rating`: star rating
* `review_count`: number of reviews
* `restaurant_review_count`: number of reviews
* `is_open`: 0 or 1 for closed or open
* `categories`: business categories

In [2]:
# Read data from a pickle file into a Pandas DataFrame
business_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/business_data.pkl')

In [3]:
# Display concise information about the 'business_data' DataFrame
business_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50764 entries, 0 to 160584
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   business_id              50764 non-null  int64  
 1   restaurant_name          50764 non-null  object 
 2   address                  50764 non-null  object 
 3   city                     50764 non-null  object 
 4   state                    50764 non-null  object 
 5   postal_code              50764 non-null  object 
 6   latitude                 50764 non-null  float64
 7   longitude                50764 non-null  float64
 8   restaurant_rating        50764 non-null  float64
 9   restaurant_review_count  50764 non-null  int64  
 10  is_open                  50764 non-null  int64  
 11  categories               50764 non-null  object 
dtypes: float64(3), int64(3), object(6)
memory usage: 5.0+ MB


In [4]:
# Display the first few rows of the 'business_data' DataFrame
business_data.head()

Unnamed: 0,business_id,restaurant_name,address,city,state,postal_code,latitude,longitude,restaurant_rating,restaurant_review_count,is_open,categories
0,6002,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"[Gastropubs, Beer Gardens, Bars, American (Tra..."
1,45324,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"[Salad, Soup, Sandwiches, Delis, Cafes, Vegeta..."
5,11160,Bob Likes Thai Food,3755 Main St,Vancouver,BC,V5V,49.251342,-123.101333,3.5,169,1,[Thai]
7,37485,Boxwood Biscuit,740 S High St,Columbus,OH,43206,39.947007,-82.997471,4.5,11,1,[Breakfast & Brunch]
12,14657,Mr G's Pizza & Subs,474 Lowell St,Peabody,MA,01960,42.541155,-70.973438,4.0,39,1,[Pizza]


In [5]:
# Filter the data based on the condition 'restaurant_review_count >= 100'
business_data = business_data[business_data['restaurant_review_count'] >= 100]

# Select only the specified columns from the filtered data
business_data = business_data[['business_id', 'restaurant_name', 'city', 'state', 'restaurant_rating', 'categories']]

In [6]:
# Count the number of missing values in each column of the 'business_data' DataFrame
business_data.isnull().sum()

business_id          0
restaurant_name      0
city                 0
state                0
restaurant_rating    0
categories           0
dtype: int64

In [7]:
# Print the size of our model dataset
print(f"The size of our model dataset is {business_data.shape[0]} entries.")

The size of our model dataset is 14323 entries.


To organize the dataset, we'll sort it by the business ID in ascending order. After that, we'll reassign the IDs so that they start from 0 and continue in ascending order. This way, we'll have a neatly organized dataset with consecutive and easily interpretable business IDs.

In [8]:
# Extract columns 'business_id', 'restaurant_name', and 'rating' from 'business_data'
sorted_data = business_data.sort_values(by='business_id')

# Display the sorted data
display(sorted_data)

Unnamed: 0,business_id,restaurant_name,city,state,restaurant_rating,categories
14338,0,Me So Hungry,Austin,TX,4.0,"[Ethnic Food, Nightlife, Dive Bars, Bars, Viet..."
92232,2,The Royce,Columbus,OH,4.5,"[American (Traditional), Gastropubs]"
66796,3,Le Pigeon,Portland,OR,4.5,"[French, American (New)]"
20493,8,Market Street Cafe,Celebration,FL,3.0,"[Breakfast & Brunch, Diners]"
23827,16,OTTO,Cambridge,MA,4.0,"[Food Delivery Services, Vegan, Pizza, Gluten-..."
...,...,...,...,...,...,...
66673,50741,Cheddar's Scratch Kitchen,Austin,TX,3.5,"[American (Traditional), American (New), Comfo..."
9186,50747,Pietro's Pizza & Pirate Adventure,Beaverton,OR,3.5,"[Arts & Entertainment, Pizza, Arcades]"
68425,50754,Moltaqa Moroccan Restaurant,Vancouver,BC,4.0,"[Arabian, African, Mediterranean, Moroccan, Mi..."
78496,50758,Cafe Bombay,Atlanta,GA,3.5,"[Pakistani, Indian]"


In [9]:
sorted_data['business_id'] = sorted_data['business_id'].rank(method='dense').astype(int) - 1

# Display the updated DataFrame
display(sorted_data)

Unnamed: 0,business_id,restaurant_name,city,state,restaurant_rating,categories
14338,0,Me So Hungry,Austin,TX,4.0,"[Ethnic Food, Nightlife, Dive Bars, Bars, Viet..."
92232,1,The Royce,Columbus,OH,4.5,"[American (Traditional), Gastropubs]"
66796,2,Le Pigeon,Portland,OR,4.5,"[French, American (New)]"
20493,3,Market Street Cafe,Celebration,FL,3.0,"[Breakfast & Brunch, Diners]"
23827,4,OTTO,Cambridge,MA,4.0,"[Food Delivery Services, Vegan, Pizza, Gluten-..."
...,...,...,...,...,...,...
66673,14318,Cheddar's Scratch Kitchen,Austin,TX,3.5,"[American (Traditional), American (New), Comfo..."
9186,14319,Pietro's Pizza & Pirate Adventure,Beaverton,OR,3.5,"[Arts & Entertainment, Pizza, Arcades]"
68425,14320,Moltaqa Moroccan Restaurant,Vancouver,BC,4.0,"[Arabian, African, Mediterranean, Moroccan, Mi..."
78496,14321,Cafe Bombay,Atlanta,GA,3.5,"[Pakistani, Indian]"


In [10]:
# Number of restaurants 
print("Number of restaurants:", sorted_data['restaurant_name'].nunique())

# Number of unique reviewers 
print("Number of unique reviewers:", sorted_data['business_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['restaurant_rating'].min(), "to", sorted_data['restaurant_rating'].max())

Number of restaurants: 12192
Number of unique reviewers: 14323
Range of ratings: 1.0 to 5.0


### Baseline Content Based Filtering <a class="anchor" id="base"></a>

Before applying the TF-IDF (term frequency-inverse document frequency) transformation, we'll be converting the `categories` column from list values to strings. This preprocessing step is necessary to ensure that the data is in the appropriate format for the TF-IDF algorithm. 

In [11]:
sorted_data['categories'] = sorted_data['categories'].apply(lambda x: ', '.join(x))
sorted_data

Unnamed: 0,business_id,restaurant_name,city,state,restaurant_rating,categories
14338,0,Me So Hungry,Austin,TX,4.0,"Ethnic Food, Nightlife, Dive Bars, Bars, Vietn..."
92232,1,The Royce,Columbus,OH,4.5,"American (Traditional), Gastropubs"
66796,2,Le Pigeon,Portland,OR,4.5,"French, American (New)"
20493,3,Market Street Cafe,Celebration,FL,3.0,"Breakfast & Brunch, Diners"
23827,4,OTTO,Cambridge,MA,4.0,"Food Delivery Services, Vegan, Pizza, Gluten-F..."
...,...,...,...,...,...,...
66673,14318,Cheddar's Scratch Kitchen,Austin,TX,3.5,"American (Traditional), American (New), Comfor..."
9186,14319,Pietro's Pizza & Pirate Adventure,Beaverton,OR,3.5,"Arts & Entertainment, Pizza, Arcades"
68425,14320,Moltaqa Moroccan Restaurant,Vancouver,BC,4.0,"Arabian, African, Mediterranean, Moroccan, Mid..."
78496,14321,Cafe Bombay,Atlanta,GA,3.5,"Pakistani, Indian"


We will define the `tokenizer` function takes a sentence as input, removes punctuation, converts the text to lowercase, tokenizes the sentence into words, removes English stopwords, and applies stemming using the Porter Stemmer. It helps in preparing text data for further natural language processing tasks like TF-IDF.

In [12]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string

ENGLISH_STOP_WORDS = stopwords.words('english')
stemmer = PorterStemmer() 

def tokenizer(sentence):
    # remove punctuation and set to lower case
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark,'').lower()

    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []
    
    # remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if (not word in ENGLISH_STOP_WORDS) and (word!=''):
            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

Here, we're creating the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, a crucial step in text feature extraction. We use the tokenizer function defined earlier to tokenize, preprocess, and transform the text data into a format suitable for TF-IDF computation. The `tokenizer` argument ensures that the tokenizer function we defined is used during the vectorization process.

In [13]:
# Create the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, min_df=30, max_features=5000)

# Fit and transform the corpus using the vectorizer
tfidf_matrix = tfidf_vectorizer.fit_transform(sorted_data['categories'])

In [14]:
# Print the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (14323, 173)


In [15]:
# Create a DataFrame 'features' from the TF-IDF transformed data
features = pd.DataFrame(columns=tfidf_vectorizer.get_feature_names_out(), data=tfidf_matrix.toarray())

# Display the DataFrame
display(features)

Unnamed: 0,activ,african,american,arcad,art,asian,bagel,bakeri,bar,barbequ,...,vendor,venu,vietnames,waffl,whiskey,wine,wineri,wing,wrap,yogurt
0,0.0,0.000000,0.000000,0.000000,0.000000,0.194835,0.0,0.0,0.203753,0.0,...,0.310013,0.0,0.230521,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.359355,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.356686,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14318,0.0,0.000000,0.524450,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14319,0.0,0.000000,0.000000,0.687981,0.453116,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14320,0.0,0.525981,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14321,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# Set the batch size
batch_size = 1000  

# Get the number of items in the feature matrix
num_items = features.shape[0]

# Initialize an empty similarity matrix
cosine_similarity_matrix = np.zeros((num_items, num_items))

# Calculate cosine similarity in batches
for i in range(0, num_items, batch_size):
    start_idx = i
    end_idx = min(i + batch_size, num_items)

    # Get the batch of features
    batch_features = features[start_idx:end_idx]

    # Calculate cosine similarity for the batch
    batch_similarity = cosine_similarity(batch_features)

    # Update the similarity matrix with the batch results
    cosine_similarity_matrix[start_idx:end_idx, start_idx:end_idx] = batch_similarity 


In [17]:
print("Shape of cosine_similarity_matrix:", cosine_similarity_matrix.shape)

Shape of cosine_similarity_matrix: (14323, 14323)


In [18]:
def restaurant_recommender(name, restaurants, similarities):
    # Get the restaurant by name
    restaurant_data = restaurants[restaurants['restaurant_name'] == name]
    
    # Extract the 'business_id' of the restaurant from the filtered data
    business_id = restaurant_data['business_id'].values[0]

    # Create a dataframe with the restaurant names and similarities
    sim_df = pd.DataFrame(
        {'restaurant': restaurants['restaurant_name'], 
         'similarity': similarities[business_id]
        })
    
    # Get the top 10 similar restaurants
    top_restaurants = sim_df.sort_values(by='similarity', ascending=False).head(10)
    
    return top_restaurants

In [19]:
business_data.sample(10)

Unnamed: 0,business_id,restaurant_name,city,state,restaurant_rating,categories
5622,25995,The Lamplighter Public House,Vancouver,BC,3.0,"[Nightlife, Gastropubs, Bars, Pubs]"
28720,25625,Eddie V's Prime Seafood,Boston,MA,4.0,"[Bars, Nightlife, Lounges, Seafood, Steakhouses]"
79148,43950,B.GOOD,Hingham,MA,3.5,"[American (New), Burgers, Juice Bars & Smoothi..."
20972,45457,Iron Cactus Mexican Restaurant and Margarita Bar,Austin,TX,3.5,"[Mexican, Tex-Mex, Latin American, Nightlife, ..."
70283,33597,Portofino,Atlanta,GA,4.0,"[Italian, American (New), Nightlife, Bars]"
90988,3515,la Madeleine French Bakery & Cafe,Austin,TX,3.5,"[Breakfast & Brunch, Bakeries, Event Planning ..."
138625,39951,JuiceLand,Austin,TX,4.5,"[Vegetarian, Vegan, Juice Bars & Smoothies]"
28492,36001,Vivo Austin,Austin,TX,4.0,[Tex-Mex]
140666,2386,The Food Shoppe,Atlanta,GA,4.5,"[Seafood, Breakfast & Brunch, Cajun/Creole]"
82538,29255,Hudson Grille,Sandy Springs,GA,3.5,"[American (Traditional), Nightlife, Beer, Wine..."


In [20]:
# Test the recommender
similar_restaurants = restaurant_recommender("Suika", sorted_data, cosine_similarity_matrix)
similar_restaurants.head(10)

Unnamed: 0,restaurant,similarity
78485,Suika,1.0
39556,Izakaya Amu,0.781389
49712,Lola 42,0.760751
88189,Imperium Food & Wine,0.640402
29483,Savin Bar & Kitchen,0.622994
22038,Kyoto Sushi,0.620734
157113,Yui Japanese Bistro,0.620734
3344,Sushi Junai 2,0.620734
121566,Sushi Hurray,0.620734
38504,Sushi Town,0.620734
