## Content Based Model

**Name**: Diane Lu

**Contact**: dianengalu@gmail.com

**Date**: 07/31/2023

### Table of Contents 

1. [Introduction](#intro)
2. [Business Dataset](#model)
3. [Baseline Content Based Filtering](#base)

### Introduction <a class="anchor" id="intro"></a>

In this notebook, we build a content-based recommender system for restaurants. The goal is to suggest restaurants to users based on their preferences and past interactions with restaurants. By using the content and features of each restaurant, we can provide personalized recommendations that align with each user's taste.

#### Importing Python Libraries 

Importing necessary libraries.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Import data visualization libraries
import matplotlib.pyplot as plt

# Import NLP (Natural Language Processing) packages
import string
import nltk

# Import TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import train_test_split function from scikit-learn to split the data into training and test sets
from sklearn.model_selection import train_test_split

# Import cosine_similarity function from scikit-learn to calculate similarity scores
from sklearn.metrics.pairwise import cosine_similarity

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

### Business Dataset <a class="anchor" id="model"></a>

We will begin by importing our Business Dataset and giving it a good inspection to see how it looks. This dataset will serve as a foundational element for our project, so it's crucial to understand its structure and contents before diving into further steps.

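**Data Dictionary** (full business dataset; the subset loaded below contains only `business_id`, `restaurant_rating`, `restaurant_name`, and `categories`):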
* `business_id`: unique business id
* `restaurant_name`: the restaurant's name
* `address`: the full address of the restaurant
* `city`: the city
* `state`: 2 character state code
* `postal_code`: the postal code
* `latitude`: latitude of the restaurant
* `longitude`: longitude of the restaurant
* `restaurant_rating`: star rating
* `review_count`: number of reviews
* `restaurant_review_count`: number of reviews
* `is_open`: 0 or 1 for closed or open
* `categories`: business categories

In [2]:
# Read data from a pickle file into a Pandas DataFrame
van_bus_data = pd.read_pickle('/Users/diane/Desktop/BrainStation/Brainstation_Capstone/yelp_data/van_bus_data.pkl')

In [3]:
# Display concise information about the 'van_bus_data' DataFrame
van_bus_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 819 entries, 5 to 160318
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   business_id        819 non-null    int64  
 1   restaurant_rating  819 non-null    float64
 2   restaurant_name    819 non-null    object 
 3   categories         819 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 32.0+ KB


In [4]:
# Display the first few rows of the 'van_bus_data' DataFrame
van_bus_data.head()

Unnamed: 0,business_id,restaurant_rating,restaurant_name,categories
5,11160,3.5,Bob Likes Thai Food,[Thai]
713,40402,3.5,Sushi California,"[Japanese, Sushi Bars]"
727,38042,3.5,Romer's Burger Bar,[Burgers]
751,12676,4.0,Mr. Red Cafe,[Vietnamese]
825,37676,4.0,Bauhaus Restaurant,"[German, Bars, Modern European, Nightlife]"


In [5]:
# Count the number of missing values in each column of the 'van_bus_data' DataFrame
van_bus_data.isnull().sum()

business_id          0
restaurant_rating    0
restaurant_name      0
categories           0
dtype: int64

In [6]:
# Print the size of our model dataset
print(f"The size of our model dataset is {van_bus_data.shape[0]} entries.")

The size of our model dataset is 819 entries.


To organize the dataset, we sort it by business ID in ascending order and then reassign the IDs so that they run consecutively from 0. Because the rows are sorted first, each new `business_id` also equals the row's position, which lets the recommender further down use it as a row index into the similarity matrix.

In [7]:
# Sort 'van_bus_data' by 'business_id' in ascending order
sorted_data = van_bus_data.sort_values(by='business_id')

# Display the sorted data
display(sorted_data)

Unnamed: 0,business_id,restaurant_rating,restaurant_name,categories
44969,87,4.0,Boneta Restaurant,"[Wine Bars, Bars, American (New), French, Cana..."
35582,89,4.0,Poke Time,"[Hawaiian, Salad, Sushi Bars, Japanese]"
28871,98,3.0,Steel Toad Brewpub & Dining Hall,"[Breweries, Gastropubs, Canadian (New), Brewpu..."
30652,166,3.5,The Vancouver Fish Company,"[Seafood, Pizza, Nightlife, Bars]"
59048,210,4.0,Passion8 Dessert Cafe,"[Desserts, Coffee & Tea, Korean]"
...,...,...,...,...
65893,50669,4.0,Fat Mao Noodles,"[Thai, Noodles, Asian Fusion]"
7745,50685,4.0,Maenam,[Thai]
152457,50705,3.5,The Fish House In Stanley Park,"[Active Life, Venues & Event Spaces, Seafood, ..."
152031,50726,3.0,Delicious Pho,"[Asian Fusion, Vietnamese, Noodles]"


In [8]:
# Re-number the business IDs as consecutive integers starting at 0 (dense rank minus 1)
sorted_data['business_id'] = sorted_data['business_id'].rank(method='dense').astype(int) - 1

# Display the updated DataFrame
display(sorted_data)

Unnamed: 0,business_id,restaurant_rating,restaurant_name,categories
44969,0,4.0,Boneta Restaurant,"[Wine Bars, Bars, American (New), French, Cana..."
35582,1,4.0,Poke Time,"[Hawaiian, Salad, Sushi Bars, Japanese]"
28871,2,3.0,Steel Toad Brewpub & Dining Hall,"[Breweries, Gastropubs, Canadian (New), Brewpu..."
30652,3,3.5,The Vancouver Fish Company,"[Seafood, Pizza, Nightlife, Bars]"
59048,4,4.0,Passion8 Dessert Cafe,"[Desserts, Coffee & Tea, Korean]"
...,...,...,...,...
65893,814,4.0,Fat Mao Noodles,"[Thai, Noodles, Asian Fusion]"
7745,815,4.0,Maenam,[Thai]
152457,816,3.5,The Fish House In Stanley Park,"[Active Life, Venues & Event Spaces, Seafood, ..."
152031,817,3.0,Delicious Pho,"[Asian Fusion, Vietnamese, Noodles]"
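
To see what the dense-rank re-numbering does, here is a small sketch on made-up IDs (the toy frame and its values are hypothetical, not taken from the dataset). Dense ranking leaves no gaps between ranks, so subtracting 1 yields consecutive labels starting at 0.

```python
import pandas as pd

# Hypothetical toy data: sparse, unordered business IDs
toy = pd.DataFrame({'business_id': [210, 87, 98, 166]})

# Sort by ID, then replace each ID by its dense rank minus 1
toy = toy.sort_values(by='business_id')
toy['business_id'] = toy['business_id'].rank(method='dense').astype(int) - 1

print(toy['business_id'].tolist())  # [0, 1, 2, 3]
```

Because the frame is sorted before ranking, the new ID of each row matches its position in the frame.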


In [9]:
# Number of distinct restaurant names (several businesses can share a name)
print("Number of unique restaurant names:", sorted_data['restaurant_name'].nunique())

# Number of unique businesses
print("Number of unique businesses:", sorted_data['business_id'].nunique())

# Range of ratings
print("Range of ratings:", sorted_data['restaurant_rating'].min(), "to", sorted_data['restaurant_rating'].max())

Number of unique restaurant names: 766
Number of unique businesses: 819
Range of ratings: 2.5 to 5.0


### Baseline Content Based Filtering <a class="anchor" id="base"></a>

Before applying the TF-IDF (term frequency-inverse document frequency) transformation, we'll be converting the `categories` column from list values to strings. This preprocessing step is necessary to ensure that the data is in the appropriate format for the TF-IDF algorithm. 

In [10]:
# Join each list of categories into a single comma-separated string
sorted_data['categories'] = sorted_data['categories'].apply(lambda x: ', '.join(x))
sorted_data

Unnamed: 0,business_id,restaurant_rating,restaurant_name,categories
44969,0,4.0,Boneta Restaurant,"Wine Bars, Bars, American (New), French, Canad..."
35582,1,4.0,Poke Time,"Hawaiian, Salad, Sushi Bars, Japanese"
28871,2,3.0,Steel Toad Brewpub & Dining Hall,"Breweries, Gastropubs, Canadian (New), Brewpub..."
30652,3,3.5,The Vancouver Fish Company,"Seafood, Pizza, Nightlife, Bars"
59048,4,4.0,Passion8 Dessert Cafe,"Desserts, Coffee & Tea, Korean"
...,...,...,...,...
65893,814,4.0,Fat Mao Noodles,"Thai, Noodles, Asian Fusion"
7745,815,4.0,Maenam,Thai
152457,816,3.5,The Fish House In Stanley Park,"Active Life, Venues & Event Spaces, Seafood, E..."
152031,817,3.0,Delicious Pho,"Asian Fusion, Vietnamese, Noodles"


We define a `tokenizer` function that takes a sentence as input, removes punctuation, converts the text to lowercase, splits it into words on spaces, drops English stopwords and empty tokens, and stems the remaining words with the Porter stemmer. This prepares the text for TF-IDF.

In [11]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string

# Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing
ENGLISH_STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def tokenizer(sentence):
    # Lowercase once, then remove every punctuation mark
    sentence = sentence.lower()
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')

    # Split the sentence into words on single spaces
    listofwords = sentence.split(' ')
    listofstemmed_words = []

    # Remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if word not in ENGLISH_STOP_WORDS and word != '':
            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words
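
One side effect of deleting punctuation outright is that categories joined by a slash are fused into a single token. The sketch below uses only the standard library (no stemming or stopword removal) and a category string taken from the sample shown later in the notebook. The helper names `delete_punctuation` and `space_punctuation` are hypothetical. A fused token such as `tapassmal` in the feature table may arise the same way, though that is an assumption.

```python
import string

def delete_punctuation(text):
    # Drop every punctuation character, as the notebook's tokenizer does
    return text.translate(str.maketrans('', '', string.punctuation)).lower()

def space_punctuation(text):
    # Alternative: turn every punctuation character into a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.translate(table).lower()

sample = "Asian Fusion, Seafood, Cajun/Creole"

deleted = delete_punctuation(sample).split()
spaced = space_punctuation(sample).split()

print(deleted)  # ['asian', 'fusion', 'seafood', 'cajuncreole']
print(spaced)   # ['asian', 'fusion', 'seafood', 'cajun', 'creole']
```

Whether fused tokens are desirable depends on the goal: they keep compound categories distinct, but they cannot match the individual words elsewhere in the corpus.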

Next, we create the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer. Passing our function through the `tokenizer` argument makes scikit-learn use it in place of its default tokenization. Setting `min_df=30` keeps only terms that appear in at least 30 restaurants, and `max_features=5000` caps the vocabulary size.
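
For reference, this is the weighting that scikit-learn's `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`). The symbols are introduced here for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t),
\qquad
\mathrm{tfidf}(t,d) = \frac{w(t,d)}{\sqrt{\sum_{t'} w(t',d)^2}}
$$

Since every non-empty row ends up with unit length, the cosine similarity between two restaurants reduces to the dot product of their rows.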

In [12]:
# Create the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, min_df=30, max_features=5000)

# Fit and transform the corpus using the vectorizer
tfidf_matrix = tfidf_vectorizer.fit_transform(sorted_data['categories'])

In [13]:
# Print the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (819, 42)


In [14]:
# Create a DataFrame 'features' from the TF-IDF transformed data
features = pd.DataFrame(columns=tfidf_vectorizer.get_feature_names_out(), data=tfidf_matrix.toarray())

# Display the DataFrame
display(features)

Unnamed: 0,american,asian,bakeri,bar,breakfast,brunch,burger,cafe,canadian,chines,...,seafood,servic,specialti,sushi,tapassmal,tea,tradit,vegetarian,vietnames,wine
0,0.284797,0.000000,0.0,0.404893,0.0,0.0,0.0,0.0,0.286454,0.0,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.376096
1,0.000000,0.000000,0.0,0.438842,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.000000,0.0,0.683809,0.0,0.000000,0.0,0.0,0.000000,0.000000
2,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.522463,0.0,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.685961
3,0.000000,0.000000,0.0,0.353149,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.552258,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000
4,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.0,0.540208,0.0,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
814,0.000000,0.583716,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000
815,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000
816,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.620750,0.784008,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000
817,0.000000,0.502570,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.508633,0.000000


In [15]:
# Set the batch size
batch_size = 1000

# Get the number of items in the feature matrix
num_items = features.shape[0]

# Initialize an empty similarity matrix
cosine_similarity_matrix = np.zeros((num_items, num_items))

# Calculate cosine similarity in batches of rows
for i in range(0, num_items, batch_size):
    start_idx = i
    end_idx = min(i + batch_size, num_items)

    # Get the batch of features
    batch_features = features.iloc[start_idx:end_idx]

    # Compare the batch against ALL items, so that pairs falling in
    # different batches are filled in too (not just the diagonal blocks)
    batch_similarity = cosine_similarity(batch_features, features)

    # Update the rows of the similarity matrix with the batch results
    cosine_similarity_matrix[start_idx:end_idx, :] = batch_similarity


In [16]:
print("Shape of cosine_similarity_matrix:", cosine_similarity_matrix.shape)

Shape of cosine_similarity_matrix: (819, 819)
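
The batched computation can be sanity-checked on a small random matrix. Filling the similarity matrix one block of rows at a time, with each block compared against all items, should reproduce a single `cosine_similarity` call. Filling only the diagonal blocks would leave cross-batch pairs at zero. The matrix `X`, the batch size, and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((7, 4))  # 7 hypothetical items with 4 features each
n, batch_size = X.shape[0], 3

row_batched = np.zeros((n, n))  # each batch compared against all items
block_only = np.zeros((n, n))   # each batch compared only against itself

for start in range(0, n, batch_size):
    end = min(start + batch_size, n)
    row_batched[start:end, :] = cosine_similarity(X[start:end], X)
    block_only[start:end, start:end] = cosine_similarity(X[start:end])

full = cosine_similarity(X)

print(np.allclose(row_batched, full))  # True
print(np.allclose(block_only, full))   # False
```

With 819 restaurants and a batch size of 1000 there is only one batch, so the two strategies coincide in this notebook. The difference only shows once the data outgrows a single batch.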


In [17]:
def restaurant_recommender(name, restaurants, similarities):
    # Get the restaurant by name
    restaurant_data = restaurants[restaurants['restaurant_name'] == name]
    
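    # Extract the 'business_id' of the first matching restaurant.
    # It doubles as the row index into 'similarities', so 'restaurants' must be
    # numbered 0..n-1 in the same order as the matrix. An unknown name raises IndexError.
    business_id = restaurant_data['business_id'].values[0]

    # Create a dataframe with the restaurant names and similarities
    sim_df = pd.DataFrame(
        {'restaurant': restaurants['restaurant_name'], 
         'similarity': similarities[business_id]
        })
    
    # Get the 10 highest-scoring restaurants (the queried restaurant itself can appear among them)
    top_restaurants = sim_df.sort_values(by='similarity', ascending=False).head(10)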
    
    return top_restaurants
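
`restaurant_recommender` assumes that the name exists and that `business_id` equals the row position in the similarity matrix. A more defensive variant, sketched below under the hypothetical name `recommend_by_position`, looks up the row position directly, returns an empty frame for unknown names, and drops the queried restaurant from its own results. The toy names and similarity values are made up.

```python
import numpy as np
import pandas as pd

def recommend_by_position(name, restaurants, similarities, top_n=10):
    """Return the top_n rows most similar to `name`, excluding `name` itself."""
    names = restaurants['restaurant_name'].to_numpy()

    # Positions (not labels) of rows whose name matches
    matches = np.flatnonzero(names == name)
    if matches.size == 0:
        # Unknown restaurant: return an empty result instead of raising
        return pd.DataFrame(columns=['restaurant', 'similarity'])

    pos = matches[0]
    sim_df = pd.DataFrame({'restaurant': names, 'similarity': similarities[pos]})

    # Remove the queried restaurant, then rank the rest
    sim_df = sim_df.drop(index=pos)
    return sim_df.sort_values(by='similarity', ascending=False).head(top_n)

# Hypothetical toy inputs
toy_restaurants = pd.DataFrame({'restaurant_name': ['A', 'B', 'C']})
toy_similarities = np.array([[1.0, 0.2, 0.9],
                             [0.2, 1.0, 0.1],
                             [0.9, 0.1, 1.0]])

known = recommend_by_position('A', toy_restaurants, toy_similarities)
unknown = recommend_by_position('Z', toy_restaurants, toy_similarities)

print(known['restaurant'].tolist())  # ['C', 'B']
print(len(unknown))                  # 0
```

Looking up by position rather than by `business_id` also means the function works on any subset of the data, as long as the rows of `restaurants` are in the same order as the rows of the similarity matrix.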

In [18]:
# Draw 10 random restaurants to pick a name for testing the recommender
van_bus_data.sample(10)

Unnamed: 0,business_id,restaurant_rating,restaurant_name,categories
64281,47838,4.5,Lupo,[Italian]
29774,43959,3.5,Cora Breakfast and Lunch,"[Breakfast & Brunch, Canadian (New)]"
825,37676,4.0,Bauhaus Restaurant,"[German, Bars, Modern European, Nightlife]"
133012,19484,4.0,The Firewood Café,[Pizza]
149032,4129,3.5,Joyeaux Cafe & Restaurant,"[Vietnamese, Breakfast & Brunch, Sandwiches]"
42547,33884,3.5,Bon's Off Broadway,"[Breakfast & Brunch, American (Traditional)]"
131824,39041,3.5,Belgian Fries,"[Poutineries, Belgian, Fast Food]"
11591,26989,3.0,The Captain's Boil,"[Asian Fusion, Seafood, Cajun/Creole]"
58356,31060,3.5,Pho Goodness,"[Vietnamese, Noodles]"
65482,48828,4.0,Rodney's Oyster House,"[Seafood, Live/Raw Food]"


In [24]:
# Test the recommender
similar_restaurants = restaurant_recommender("Saku", sorted_data, cosine_similarity_matrix)
similar_restaurants.head(10)

Unnamed: 0,restaurant,similarity
111022,Yakinikuya Japanese BBQ Restaurant,1.0
39098,Sushi Coen,1.0
16854,Menya Japanese Noodle,1.0
28920,Pacific Poke,1.0
118064,Gyoza King,1.0
72692,Marutama Ra-men,1.0
38874,Oka-San Kitchen Cafe,1.0
132149,Tentatsu,1.0
115541,JINYA Ramen Bar,1.0
19377,GyuDonYa,1.0


In [20]:
# Hold out 20% of the restaurants as a test set (fixed seed for reproducibility)
train_data, test_data = train_test_split(sorted_data, test_size=0.2, random_state=42)

In [21]:
# Set the batch size
batch_size = 1000

# Get the number of items in the training set
num_items = train_data.shape[0]

# Fit the TF-IDF vectorizer once on the whole training set so that every batch
# shares the same vocabulary (this cell uses scikit-learn's built-in English
# stop words rather than the custom tokenizer defined earlier)
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(train_data['categories'].fillna(''))

# Initialize an empty similarity matrix
train_cosine_similarity_matrix = np.zeros((num_items, num_items))

# Calculate cosine similarity in batches of rows
for i in range(0, num_items, batch_size):
    start_idx = i
    end_idx = min(i + batch_size, num_items)

    # Compare the batch against all training items
    batch_similarity = cosine_similarity(tfidf_matrix[start_idx:end_idx], tfidf_matrix)

    # Update the rows of the similarity matrix with the batch results
    train_cosine_similarity_matrix[start_idx:end_idx, :] = batch_similarity

In [22]:
print("Shape of train_cosine_similarity_matrix:", train_cosine_similarity_matrix.shape)

Shape of train_cosine_similarity_matrix: (655, 655)


In [23]:
# Test the recommender on the held-out restaurants in test_data
for restaurant in test_data['restaurant_name']:
    similar_restaurants = restaurant_recommender(restaurant, train_data, train_cosine_similarity_matrix)
    print(f"Recommended restaurants for '{restaurant}':")
    print(similar_restaurants.head(10))
    print()  # Empty line for separation

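IndexError: index 0 is out of bounds for axis 0 with size 0

This error is expected with the current setup. `restaurant_recommender` looks up each test restaurant by name inside `train_data`, but a held-out restaurant is normally absent from the training set, so the filter returns no rows and `.values[0]` fails. Even when a name does match (for example, another business with the same name), `business_id` still numbers the rows of the full dataset (0 to 818) rather than positions in the 655-row training matrix, so the lookup can be out of range or point at the wrong restaurant. Evaluating on held-out restaurants requires comparing test-set feature vectors against training-set vectors directly.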
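
One way to recommend training restaurants for held-out ones is to fit the vectorizer on the training categories only, transform the test categories with that same vocabulary, and compute a test-by-train similarity matrix. The following is a minimal sketch on hypothetical toy frames, using scikit-learn's default tokenization. It is not the notebook's evaluated method.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for train_data and test_data
train = pd.DataFrame({
    'restaurant_name': ['Thai Spot', 'Pho Place', 'Burger Bar'],
    'categories': ['Thai, Noodles', 'Vietnamese, Noodles', 'Burgers, Bars'],
})
test = pd.DataFrame({
    'restaurant_name': ['New Thai', 'Pho Noodle House'],
    'categories': ['Thai', 'Vietnamese, Noodles'],
})

# Learn the vocabulary from the training set only, then reuse it for the test set
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train['categories'])
test_matrix = vectorizer.transform(test['categories'])

# Rows are test restaurants, columns are training restaurants
test_vs_train = cosine_similarity(test_matrix, train_matrix)

# For each test restaurant, pick the most similar training restaurant
best_match = {}
for row, name in enumerate(test['restaurant_name']):
    best_position = test_vs_train[row].argmax()
    best_match[name] = train['restaurant_name'].iloc[best_position]

print(test_vs_train.shape)  # (2, 3)
print(best_match)
```

A test restaurant whose categories contain no term from the training vocabulary gets an all-zero row, so it has no meaningful match and should be handled separately.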