# Exploring the Business Dataset

## Introduction

## Methodology

We begin by loading the Yelp business dataset. We then filter it to restaurants, restrict the result to Massachusetts (MA), and clean the `attributes` column.

## Data Loading


### Import Required Packages


In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from ast import literal_eval

from sklearn.preprocessing import MultiLabelBinarizer



### Read the dataset


In [2]:
%%time
business_df = pd.read_json('../data/business.json', lines=True)
business_df.head()

CPU times: user 2.66 s, sys: 623 ms, total: 3.28 s
Wall time: 3.28 s


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"{'RestaurantsTableService': 'True', 'WiFi': 'u...","Gastropubs, Food, Beer Gardens, Restaurants, B...","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'..."
1,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsAtt...","Salad, Soup, Sandwiches, Delis, Restaurants, C...","{'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ..."
2,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Antiques, Fashion, Used, Vintage & Consignment...","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0..."
3,oaepsyvc0J17qwi8cfrOWg,Great Clips,2566 Enterprise Rd,Orange City,FL,32763,28.914482,-81.295979,3.0,8,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","Beauty & Spas, Hair Salons",
4,PE9uqAjdw0E4-8mjGl3wVA,Crossfit Terminus,1046 Memorial Dr SE,Atlanta,GA,30316,33.747027,-84.353424,4.0,14,1,"{'GoodForKids': 'False', 'BusinessParking': '{...","Gyms, Active Life, Interval Training Gyms, Fit...","{'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0'..."


In [3]:
business_df.shape

(160585, 14)

This dataset has 160,585 rows and 14 columns. 

### Data Dictionary

We now examine the data types of each column and investigate what each column describes.

In [4]:
business_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160585 entries, 0 to 160584
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   160585 non-null  object 
 1   name          160585 non-null  object 
 2   address       160585 non-null  object 
 3   city          160585 non-null  object 
 4   state         160585 non-null  object 
 5   postal_code   160585 non-null  object 
 6   latitude      160585 non-null  float64
 7   longitude     160585 non-null  float64
 8   stars         160585 non-null  float64
 9   review_count  160585 non-null  int64  
 10  is_open       160585 non-null  int64  
 11  attributes    145593 non-null  object 
 12  categories    160470 non-null  object 
 13  hours         133244 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 17.2+ MB


We have a mix of numeric and non-numeric columns in the dataset.

As per the documentation of the Yelp dataset, the columns in the dataset and their column numbers are as follows:

<ol start="0">
    <li><strong>business_id</strong>: string, representing a unique business ID</li>
  <li><strong>name</strong>: string, representing the name of the business</li>
  <li><strong>address</strong>: string, representing the business address</li>
    <li><strong>city</strong>: string, representing the city the business is located in</li>
    <li><strong>state</strong>: string, representing the state the business is located in</li>
    <li><strong>postal_code</strong>: string, representing the postal code or ZIP code of the business</li>
    <li><strong>latitude</strong>: float, representing the latitude coordinates of the business</li>
    <li><strong>longitude</strong>: float, representing the longitude coordinates of the business</li>
    <li><strong>stars</strong>: float, representing the Yelp star rating of the business</li>
    <li><strong>review_count</strong>: integer, representing the number of reviews received by the business</li>
    <li><strong>is_open</strong>: integer, binary column indicating whether the business is open - 1 refers to open, 0 refers to closed</li>
    <li><strong>attributes</strong>: dictionary object, containing business attributes, with some values being objects themselves</li>
    <li><strong>categories</strong>: string object, containing business categories separated by a comma</li>
    <li><strong>hours</strong>: dictionary object, with each key representing a day of the week, and the value of each key representing the hours of operation for that day in a 24 hour clock format, stored as a string</li>
</ol>
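The `hours` values shown in the preview look like `'11:0-23:0'`, that is, unpadded `hour:minute` pairs separated by a hyphen. Below is a minimal sketch of how such a string could be turned into `datetime.time` objects. The helper name `parse_hours` and the sample dictionary are illustrative assumptions, not part of the Yelp documentation.

```python
from datetime import time


def parse_hours(span):
    """Parse a string such as '11:0-23:0' into (opening, closing) times.

    Hypothetical helper: assumes the 'H:M-H:M' layout seen in the preview.
    """
    def to_time(text):
        hour, minute = text.split(":")
        return time(int(hour), int(minute))

    opening, closing = span.split("-")
    return to_time(opening), to_time(closing)


# Made-up sample in the same shape as one `hours` dictionary
sample_hours = {"Monday": "11:0-23:0", "Tuesday": "5:0-17:30"}
parsed_hours = {day: parse_hours(span) for day, span in sample_hours.items()}
```

Working with `time` objects rather than raw strings makes later comparisons (for example, "is this restaurant open at 08:00?") straightforward.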



## Data Cleaning

In [5]:
business_df['state'].value_counts().head(10)

MA    36012
OR    25175
TX    24485
FL    21907
GA    18090
BC    17298
OH    11258
CO     3198
WA     3121
CA       13
Name: state, dtype: int64

In [6]:
#business_df = business_df[business_df['state']=='MA']
#business_df.info()

### Filter for food establishments

We now filter for establishments whose `categories` string contains "restaurant" (case-insensitive) and save those establishments to a new dataframe.

In [7]:
# Filter for businesses whose categories column contains "restaurant"
# (case-insensitive; rows with missing categories count as non-matches)
# and save to a new dataframe
restaurant_df = business_df[business_df['categories'].str.contains('restaurant', case=False, na=False)].reset_index(drop=True)
restaurant_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"{'RestaurantsTableService': 'True', 'WiFi': 'u...","Gastropubs, Food, Beer Gardens, Restaurants, B...","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'..."
1,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsAtt...","Salad, Soup, Sandwiches, Delis, Restaurants, C...","{'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ..."
2,D4JtQNTI4X3KcbzacDJsMw,Bob Likes Thai Food,3755 Main St,Vancouver,BC,V5V,49.251342,-123.101333,3.5,169,1,"{'GoodForKids': 'True', 'Alcohol': 'u'none'', ...","Restaurants, Thai","{'Monday': '17:0-21:0', 'Tuesday': '17:0-21:0'..."
3,jFYIsSb7r1QeESVUnXPHBw,Boxwood Biscuit,740 S High St,Columbus,OH,43206,39.947007,-82.997471,4.5,11,1,,"Breakfast & Brunch, Restaurants","{'Saturday': '8:0-14:0', 'Sunday': '8:0-14:0'}"
4,HPA_qyMEddpAEtFof02ixg,Mr G's Pizza & Subs,474 Lowell St,Peabody,MA,01960,42.541155,-70.973438,4.0,39,1,"{'RestaurantsGoodForGroups': 'True', 'HasTV': ...","Food, Pizza, Restaurants","{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'..."


In [8]:
restaurant_df.shape

(50793, 14)

Filtering for restaurants has produced 50,793 rows (31.6% of our original business dataframe).
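A small number of businesses have no `categories` value (160,470 of 160,585 rows are non-null), so the substring filter has to treat missing entries as non-matches. The following is a minimal sketch on made-up rows, showing how the `na=False` argument of `Series.str.contains` does this. The toy dataframe is an assumption for illustration only.

```python
import pandas as pd

# Made-up rows: one restaurant, one non-restaurant, one with no categories
toy = pd.DataFrame({
    "name": ["Cafe A", "Salon B", "Unknown C"],
    "categories": ["Food, Restaurants", "Hair Salons", None],
})

# na=False maps missing categories to False instead of propagating NaN
mask = toy["categories"].str.contains("restaurant", case=False, na=False)
toy_restaurants = toy[mask].reset_index(drop=True)
```

Note that this is a substring match, so any category containing the text "restaurant" qualifies, not only the exact category `Restaurants`.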

In [9]:
restaurant_df['state'].value_counts()

MA     10551
FL      7711
BC      7508
OR      7402
GA      6142
TX      5452
OH      4380
CO       866
WA       774
KS         1
MN         1
VA         1
WY         1
KY         1
NH         1
ABE        1
Name: state, dtype: int64

In [10]:
restaurant_df = restaurant_df[restaurant_df['state']=='MA'].reset_index(drop=True)
restaurant_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10551 entries, 0 to 10550
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   10551 non-null  object 
 1   name          10551 non-null  object 
 2   address       10551 non-null  object 
 3   city          10551 non-null  object 
 4   state         10551 non-null  object 
 5   postal_code   10551 non-null  object 
 6   latitude      10551 non-null  float64
 7   longitude     10551 non-null  float64
 8   stars         10551 non-null  float64
 9   review_count  10551 non-null  int64  
 10  is_open       10551 non-null  int64  
 11  attributes    10481 non-null  object 
 12  categories    10551 non-null  object 
 13  hours         8701 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 1.1+ MB


### Check for Duplicate Rows

In [11]:
attributes_type = type(restaurant_df['attributes'][0])
hours_type = type(restaurant_df['hours'][0])

print(f'Attributes column is a {attributes_type} object.')
print(f'Hours column is a {hours_type} object.')

Attributes column is a <class 'dict'> object.
Hours column is a <class 'dict'> object.


In [12]:
temp_df = restaurant_df.drop(['attributes', 'hours'], axis=1)
temp_df.duplicated().value_counts()/temp_df.shape[0] * 100

False    100.0
dtype: float64

We have no duplicate rows in this dataset.
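The check above leaves out `attributes` and `hours`, presumably because both hold dictionaries, which are unhashable and so cannot be passed to `duplicated()` directly. If those columns should take part in the comparison, one possible approach is to serialise each dictionary to a canonical JSON string first. This is a sketch on made-up data, not a step the notebook performs.

```python
import json

import pandas as pd

# Made-up rows: the first two are identical, including their dictionaries
toy = pd.DataFrame({
    "business_id": ["a1", "a1", "b2"],
    "attributes": [{"HasTV": "True"}, {"HasTV": "True"}, {"HasTV": "False"}],
})

# Serialise each dict with sorted keys so equal dicts give equal strings
hashable = toy.assign(
    attributes=toy["attributes"].map(lambda d: json.dumps(d, sort_keys=True))
)
duplicate_mask = hashable.duplicated()
```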

### Check for Missing Values

In [13]:
restaurant_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10551 entries, 0 to 10550
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   10551 non-null  object 
 1   name          10551 non-null  object 
 2   address       10551 non-null  object 
 3   city          10551 non-null  object 
 4   state         10551 non-null  object 
 5   postal_code   10551 non-null  object 
 6   latitude      10551 non-null  float64
 7   longitude     10551 non-null  float64
 8   stars         10551 non-null  float64
 9   review_count  10551 non-null  int64  
 10  is_open       10551 non-null  int64  
 11  attributes    10481 non-null  object 
 12  categories    10551 non-null  object 
 13  hours         8701 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 1.1+ MB


### Examine the attributes column

Let us examine the number of missing values from the `attributes` column.

In [14]:
missing_value_count = restaurant_df['attributes'].isna().sum()
pct_missing_values = missing_value_count/restaurant_df.shape[0] * 100

print(f'Number of missing values from attributes column: {missing_value_count} ({round(pct_missing_values,2)}%)')

Number of missing values from attributes column: 70 (0.66%)


We have very few missing values in this column (< 1%). However, recall that this column contains dictionaries of the attributes assigned to each restaurant, so we need to expand each attribute into a binarized column before we can explore these variables meaningfully. The documentation also warns that some attribute values may themselves be objects, so those attribute columns would need to be expanded as well before they can be binarized.

According to Yelp's website, some of these attributes are considered factual and are entered by the businesses themselves when they claim their Yelp page. Others are considered subjective: they are established by Yelp users who vote on them, and they cannot be set by the business.

With this information in mind, we will do the following steps:

1. Filter out any restaurants with missing attributes dictionaries
    - This excludes restaurants that are missing both factual and subjective attributes. Because a restaurant can have many attributes, these missing values cannot be imputed meaningfully. These restaurants represent less than 1% of our dataset.
    
    
    
2. Only include any restaurants that have a minimum number of reviews
    - Setting a minimum number of reviews gives us a better chance of analyzing restaurants that have had enough users vote on the subjective attributes, minimizing the number of missing values we may see when expanding the subjective attributes.


In [15]:
# Step 1. Drop rows with missing attribute values

restaurant_df = restaurant_df.dropna(axis=0, subset=['attributes']).reset_index(drop=True)
restaurant_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,HPA_qyMEddpAEtFof02ixg,Mr G's Pizza & Subs,474 Lowell St,Peabody,MA,1960,42.541155,-70.973438,4.0,39,1,"{'RestaurantsGoodForGroups': 'True', 'HasTV': ...","Food, Pizza, Restaurants","{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'..."
1,hcRxdDg7DYryCxCoI8ySQA,Longwood Galleria,340-350 Longwood Ave,Boston,MA,2215,42.338544,-71.106842,2.5,24,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","Restaurants, Shopping, Shopping Centers","{'Monday': '6:30-22:0', 'Tuesday': '6:30-22:0'..."
2,jGennaZUr2MsJyRhijNBfA,Legal Sea Foods,1 Harborside Dr,Boston,MA,2128,42.363442,-71.025781,3.5,856,1,"{'NoiseLevel': 'u'average'', 'BikeParking': 'F...","Sandwiches, Food, Restaurants, Breakfast & Bru...","{'Monday': '6:0-21:0', 'Tuesday': '6:0-21:0', ..."
3,iPD8BBvea6YldQZPHzVrSQ,Espresso Minute,334 Mass Ave,Boston,MA,2115,42.342673,-71.084239,4.5,7,0,"{'NoiseLevel': ''quiet'', 'GoodForKids': 'True...","Creperies, Restaurants, Food, Coffee & Tea, Br...","{'Tuesday': '8:0-20:0', 'Wednesday': '8:0-20:0..."
4,Z2JC3Yrz82kyS86zEVJG5A,Gigi's Roast Beef & Pizza,5 Center St,Burlington,MA,1803,42.506935,-71.195854,3.0,16,0,"{'RestaurantsTakeOut': 'True', 'NoiseLevel': '...","Restaurants, Sandwiches, Pizza","{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'..."


In [16]:
# Verify
missing_value_count = restaurant_df['attributes'].isna().sum()

print(f'Number of missing values in attributes column: {missing_value_count}')
print(f'New shape of the dataset: {restaurant_df.shape}')

Number of missing values in attributes column: 0
New shape of the dataset: (10481, 14)


We have successfully dropped the rows with missing values in the `attributes` column.

In [17]:
restaurant_df['review_count'].describe()

count    10481.000000
mean       126.043889
std        219.137554
min          5.000000
25%         21.000000
50%         57.000000
75%        143.000000
max       7298.000000
Name: review_count, dtype: float64

In [18]:
%%time
attributes_df = restaurant_df['attributes'].apply(pd.Series)
attributes_df.head()

CPU times: user 2.22 s, sys: 18.7 ms, total: 2.24 s
Wall time: 2.24 s


Unnamed: 0,RestaurantsGoodForGroups,HasTV,GoodForKids,RestaurantsTakeOut,RestaurantsPriceRange2,Ambience,BikeParking,RestaurantsReservations,BusinessParking,RestaurantsTableService,...,CoatCheck,Smoking,DriveThru,BYOB,Corkage,AgesAllowed,RestaurantsCounterService,DietaryRestrictions,Open24Hours,HairSpecializesIn
0,True,True,True,True,2,"{'romantic': False, 'intimate': False, 'classy...",True,False,"{'garage': False, 'street': False, 'validated'...",False,...,,,,,,,,,,
1,True,False,True,,1,"{'romantic': False, 'intimate': False, 'classy...",True,False,"{'garage': True, 'street': False, 'validated':...",,...,,,,,,,,,,
2,False,True,True,True,2,"{'touristy': None, 'hipster': False, 'romantic...",False,False,"{'garage': True, 'street': False, 'validated':...",True,...,,,,,,,,,,
3,True,False,True,True,1,"{'romantic': False, 'intimate': False, 'classy...",True,False,"{'garage': False, 'street': False, 'validated'...",False,...,,,,,,,,,,
4,True,True,True,True,1,"{'romantic': False, 'intimate': False, 'classy...",,False,"{'garage': False, 'street': False, 'validated'...",,...,,,,,,,,,,
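Expanding a column of dictionaries with `apply(pd.Series)` builds one Series per row, which can be slow on larger frames. A possible alternative, assuming every entry is a plain dictionary (as is the case after the missing rows were dropped), is to build the frame from a list of dictionaries in one call. The toy Series below is made up for illustration.

```python
import pandas as pd

# Made-up attribute dictionaries; the second one lacks the 'WiFi' key
toy_attributes = pd.Series([
    {"HasTV": "True", "WiFi": "u'free'"},
    {"HasTV": "False"},
])

# One constructor call; keys missing from a row become NaN in that row
expanded = pd.DataFrame(toy_attributes.tolist(), index=toy_attributes.index)
```

Passing `index=toy_attributes.index` keeps the rows aligned with the original frame, so the result can still be joined back with `pd.concat(..., axis=1)`.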


In [19]:
attributes_columns = attributes_df.columns
attributes_columns

Index(['RestaurantsGoodForGroups', 'HasTV', 'GoodForKids',
       'RestaurantsTakeOut', 'RestaurantsPriceRange2', 'Ambience',
       'BikeParking', 'RestaurantsReservations', 'BusinessParking',
       'RestaurantsTableService', 'RestaurantsAttire', 'RestaurantsDelivery',
       'OutdoorSeating', 'DogsAllowed', 'NoiseLevel', 'Alcohol',
       'BusinessAcceptsCreditCards', 'Caters', 'WheelchairAccessible', 'WiFi',
       'BusinessAcceptsBitcoin', 'GoodForMeal', 'Music', 'GoodForDancing',
       'BestNights', 'HappyHour', 'BYOBCorkage', 'ByAppointmentOnly',
       'CoatCheck', 'Smoking', 'DriveThru', 'BYOB', 'Corkage', 'AgesAllowed',
       'RestaurantsCounterService', 'DietaryRestrictions', 'Open24Hours',
       'HairSpecializesIn'],
      dtype='object')

In [20]:
restaurant_df2 = pd.concat([restaurant_df, attributes_df], axis=1).drop('attributes', axis=1)

In [21]:
restaurant_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 51 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  hours                       8676 non-null   object 
 13  RestaurantsGoodForGroups    896

In [22]:
type(restaurant_df2.columns.tolist())

list

In [23]:
temp_columns = attributes_columns.to_list()
temp_columns.insert(0, 'Minimum Review Count')
temp_columns.insert(1, 'Number of rows')
temp_columns

['Minimum Review Count',
 'Number of rows',
 'RestaurantsGoodForGroups',
 'HasTV',
 'GoodForKids',
 'RestaurantsTakeOut',
 'RestaurantsPriceRange2',
 'Ambience',
 'BikeParking',
 'RestaurantsReservations',
 'BusinessParking',
 'RestaurantsTableService',
 'RestaurantsAttire',
 'RestaurantsDelivery',
 'OutdoorSeating',
 'DogsAllowed',
 'NoiseLevel',
 'Alcohol',
 'BusinessAcceptsCreditCards',
 'Caters',
 'WheelchairAccessible',
 'WiFi',
 'BusinessAcceptsBitcoin',
 'GoodForMeal',
 'Music',
 'GoodForDancing',
 'BestNights',
 'HappyHour',
 'BYOBCorkage',
 'ByAppointmentOnly',
 'CoatCheck',
 'Smoking',
 'DriveThru',
 'BYOB',
 'Corkage',
 'AgesAllowed',
 'RestaurantsCounterService',
 'DietaryRestrictions',
 'Open24Hours',
 'HairSpecializesIn']

In [24]:
na_count_df = pd.DataFrame(columns=temp_columns).set_index('Minimum Review Count')

for n in range(5, 160, 10):
    df = restaurant_df2[restaurant_df2['review_count'] >= n]
    na_count_df.loc[n,:] = df.isna().sum()/df.shape[0]
    na_count_df.loc[n, 'Number of rows'] = df.shape[0]
        
na_count_df.style.background_gradient(axis=0)

Minimum Review Count,Number of rows,RestaurantsGoodForGroups,HasTV,GoodForKids,RestaurantsTakeOut,RestaurantsPriceRange2,Ambience,BikeParking,RestaurantsReservations,BusinessParking,RestaurantsTableService,RestaurantsAttire,RestaurantsDelivery,OutdoorSeating,DogsAllowed,NoiseLevel,Alcohol,BusinessAcceptsCreditCards,Caters,WheelchairAccessible,WiFi,BusinessAcceptsBitcoin,GoodForMeal,Music,GoodForDancing,BestNights,HappyHour,BYOBCorkage,ByAppointmentOnly,CoatCheck,Smoking,DriveThru,BYOB,Corkage,AgesAllowed,RestaurantsCounterService,DietaryRestrictions,Open24Hours,HairSpecializesIn
5,10481,0.144834,0.191871,0.153707,0.048659,0.088827,0.195306,0.314283,0.124225,0.067837,0.654231,0.174029,0.069459,0.109436,0.81996,0.272588,0.162007,0.048755,0.306364,0.758706,0.242057,0.869955,0.461406,0.886557,0.925007,0.910218,0.807175,0.845625,0.945043,0.926057,0.932831,0.938174,0.937792,0.891136,0.999618,0.997806,0.998855,0.999141,0.999905
15,8675,0.099942,0.116196,0.107896,0.02513,0.059712,0.115043,0.230202,0.079193,0.028127,0.611873,0.129452,0.037464,0.063401,0.793545,0.188357,0.103055,0.024438,0.227435,0.7283,0.163343,0.853372,0.382709,0.866167,0.910432,0.892911,0.774986,0.823631,0.935793,0.913314,0.921844,0.936138,0.930259,0.877349,0.999539,0.997349,0.998732,0.998963,0.999885
25,7526,0.079325,0.083444,0.086766,0.019134,0.043981,0.080654,0.182966,0.062184,0.014882,0.582647,0.108424,0.028036,0.04531,0.771858,0.143901,0.077066,0.017008,0.180973,0.70675,0.12025,0.839888,0.330986,0.85012,0.898618,0.878289,0.750199,0.807733,0.92878,0.903136,0.912968,0.939942,0.926654,0.867393,0.999469,0.997077,0.998538,0.998804,0.999867
35,6627,0.064886,0.068809,0.072884,0.015392,0.032292,0.06549,0.15286,0.051909,0.009507,0.557869,0.093557,0.02354,0.037121,0.752829,0.117248,0.066093,0.012072,0.149238,0.689905,0.095518,0.828731,0.288517,0.83522,0.887883,0.864192,0.726422,0.792214,0.92108,0.893617,0.904783,0.938585,0.92274,0.857854,0.999396,0.99668,0.99834,0.998642,0.999849
45,5925,0.056203,0.058059,0.064135,0.01384,0.026835,0.05384,0.130127,0.04557,0.007426,0.536709,0.082869,0.020759,0.032068,0.735359,0.098565,0.055021,0.008945,0.127595,0.675781,0.0773,0.818903,0.255865,0.822616,0.876793,0.852827,0.704135,0.777553,0.915274,0.884557,0.89654,0.93789,0.91865,0.849451,0.999325,0.996287,0.998312,0.998481,1.0
55,5352,0.047272,0.050635,0.056428,0.012706,0.021114,0.047272,0.116031,0.040359,0.005792,0.52074,0.072123,0.01719,0.027093,0.719544,0.08352,0.047459,0.006353,0.108558,0.66648,0.063341,0.811846,0.234305,0.810912,0.867152,0.840994,0.685912,0.765695,0.911809,0.875374,0.8892,0.937593,0.914985,0.841928,0.999253,0.995889,0.998318,0.998505,1.0
65,4845,0.040454,0.044582,0.049123,0.011352,0.015273,0.041692,0.105882,0.033643,0.004954,0.50774,0.062126,0.016099,0.023117,0.701135,0.073478,0.039628,0.004128,0.095975,0.657585,0.05387,0.80289,0.214035,0.799381,0.85676,0.830753,0.668111,0.74902,0.906708,0.867905,0.882353,0.936223,0.912281,0.833643,0.999381,0.995459,0.998349,0.998349,1.0
75,4443,0.036012,0.040513,0.044114,0.010353,0.012604,0.036912,0.099257,0.031285,0.004952,0.494936,0.055143,0.014405,0.019806,0.687824,0.063246,0.034661,0.003826,0.085303,0.649561,0.046365,0.797434,0.19964,0.789557,0.848076,0.820842,0.651362,0.73869,0.902543,0.860905,0.875309,0.934504,0.90817,0.827369,0.999325,0.995273,0.998199,0.998199,1.0
85,4093,0.032983,0.039335,0.039091,0.009773,0.008063,0.036159,0.090887,0.029318,0.004398,0.488151,0.049353,0.013438,0.019301,0.674566,0.058637,0.03225,0.002688,0.077205,0.644759,0.040801,0.793794,0.186416,0.781578,0.839971,0.814073,0.63572,0.726606,0.897386,0.855851,0.870511,0.934034,0.905937,0.820669,0.999267,0.995358,0.998045,0.99829,1.0
95,3770,0.028117,0.03687,0.033952,0.009284,0.00557,0.034218,0.084085,0.027321,0.004244,0.480637,0.043236,0.012997,0.018568,0.665782,0.049602,0.029973,0.001592,0.066578,0.641114,0.036074,0.790451,0.175066,0.774536,0.832626,0.806101,0.628117,0.717772,0.893899,0.851194,0.866313,0.933687,0.90504,0.810875,0.999204,0.99496,0.997878,0.998143,1.0


In [25]:
#restaurant_df2 = restaurant_df2[restaurant_df2['review_count'] >= 5]

In [26]:
restaurant_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 51 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  hours                       8676 non-null   object 
 13  RestaurantsGoodForGroups    896

### Drop Columns with Many Missing Values

In [27]:
restaurant_df3 = restaurant_df2.copy()

# Verify
restaurant_df3.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,CoatCheck,Smoking,DriveThru,BYOB,Corkage,AgesAllowed,RestaurantsCounterService,DietaryRestrictions,Open24Hours,HairSpecializesIn
0,HPA_qyMEddpAEtFof02ixg,Mr G's Pizza & Subs,474 Lowell St,Peabody,MA,1960,42.541155,-70.973438,4.0,39,...,,,,,,,,,,
1,hcRxdDg7DYryCxCoI8ySQA,Longwood Galleria,340-350 Longwood Ave,Boston,MA,2215,42.338544,-71.106842,2.5,24,...,,,,,,,,,,
2,jGennaZUr2MsJyRhijNBfA,Legal Sea Foods,1 Harborside Dr,Boston,MA,2128,42.363442,-71.025781,3.5,856,...,,,,,,,,,,
3,iPD8BBvea6YldQZPHzVrSQ,Espresso Minute,334 Mass Ave,Boston,MA,2115,42.342673,-71.084239,4.5,7,...,,,,,,,,,,
4,Z2JC3Yrz82kyS86zEVJG5A,Gigi's Roast Beef & Pizza,5 Center St,Burlington,MA,1803,42.506935,-71.195854,3.0,16,...,,,,,,,,,,


In [28]:
missing_value_counts = (restaurant_df3.isna().sum()/restaurant_df3.shape[0]*100).sort_values(ascending=False)

cols_to_drop = missing_value_counts[missing_value_counts>30]
print(f'Number of columns with >30% missing values: {cols_to_drop.shape[0]}')
cols_to_drop

Number of columns with >30% missing values: 23


HairSpecializesIn            99.990459
AgesAllowed                  99.961836
Open24Hours                  99.914130
DietaryRestrictions          99.885507
RestaurantsCounterService    99.780555
ByAppointmentOnly            94.504341
DriveThru                    93.817384
BYOB                         93.779220
Smoking                      93.283084
CoatCheck                    92.605667
GoodForDancing               92.500716
BestNights                   91.021849
Corkage                      89.113634
Music                        88.655663
BusinessAcceptsBitcoin       86.995516
BYOBCorkage                  84.562542
DogsAllowed                  81.995993
HappyHour                    80.717489
WheelchairAccessible         75.870623
RestaurantsTableService      65.423147
GoodForMeal                  46.140635
BikeParking                  31.428299
Caters                       30.636390
dtype: float64
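The same 30% cutoff can be written more compactly with `isna().mean()`, which gives the share of missing values per column directly. This is a minimal sketch on a made-up frame. The name `max_missing` and the toy data are assumptions, and the sketch returns a new frame instead of modifying one in place.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "stars": [4.0, 3.5, 5.0, 2.0, 4.5],
    "HasTV": ["True", np.nan, "False", "True", "True"],      # 20% missing
    "CoatCheck": [np.nan, np.nan, np.nan, "False", np.nan],  # 80% missing
})

max_missing = 0.30  # keep columns with at most 30% missing values

# The mean of a boolean mask is the fraction of True (missing) entries
missing_share = toy.isna().mean()
kept = toy.loc[:, missing_share <= max_missing]
```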

In [29]:
restaurant_df3.drop(cols_to_drop.index, axis=1, inplace=True)
restaurant_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  hours                       8676 non-null   object 
 13  RestaurantsGoodForGroups    896

### Inspection of Other Categorical Columns

In [30]:
attributes_list=[]

for attribute in attributes_columns:
    if attribute in restaurant_df3.columns:
        attributes_list.append(attribute)
        
len(attributes_list)

15

In [31]:
for attribute in attributes_columns:
    if attribute in restaurant_df3.columns:
        #attributes_list.append(attribute)
        #col = restaurant_df3.columns[i]
        print(f'Attribute: {attribute}; # of unique values: {len(restaurant_df3[attribute].unique())}')

        if len(restaurant_df3[attribute].unique()) <= 10:
            print(f'Unique values: {restaurant_df3[attribute].unique()}')

        else:
            print('More than 10 unique values')
        print('\n')

Attribute: RestaurantsGoodForGroups; # of unique values: 4
Unique values: ['True' 'False' nan 'None']


Attribute: HasTV; # of unique values: 4
Unique values: ['True' 'False' nan 'None']


Attribute: GoodForKids; # of unique values: 4
Unique values: ['True' nan 'False' 'None']


Attribute: RestaurantsTakeOut; # of unique values: 4
Unique values: ['True' 'None' nan 'False']


Attribute: RestaurantsPriceRange2; # of unique values: 5
Unique values: ['2' '1' nan '3' '4']


Attribute: Ambience; # of unique values: 512
More than 10 unique values


Attribute: RestaurantsReservations; # of unique values: 4
Unique values: ['False' nan 'True' 'None']


Attribute: BusinessParking; # of unique values: 74
More than 10 unique values


Attribute: RestaurantsAttire; # of unique values: 7
Unique values: ["u'casual'" "'casual'" nan "u'formal'" "u'dressy'" "'dressy'" 'None']


Attribute: RestaurantsDelivery; # of unique values: 4
Unique values: ['True' 'None' 'False' nan]


Attribute: OutdoorSeating; # o

In [32]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
restaurant_df3.replace('None', np.nan, inplace=True)
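Besides the literal string `'None'`, the unique values listed above also show the same label stored in two spellings, for example `"u'casual'"` and `"'casual'"`. The following is a minimal sketch of a normalising helper that could be mapped over such a column. The function name `clean_label` is hypothetical, and the notebook itself does not perform this step here.

```python
import math


def clean_label(value):
    """Return a normalised label, or None for missing values.

    Hypothetical helper: strips a leading u prefix and surrounding quotes.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    # Remove the Python 2 style unicode prefix, e.g. u'casual' -> 'casual'
    if text.startswith("u'") or text.startswith('u"'):
        text = text[1:]
    text = text.strip("'\"")
    return None if text == "None" else text


raw_labels = ["u'casual'", "'casual'", "u'dressy'", "None", float("nan")]
cleaned_labels = [clean_label(v) for v in raw_labels]
```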

In [33]:
for attribute in attributes_columns:
    if attribute in restaurant_df3.columns:
        print(f'Attribute: {attribute}; # of unique values: {len(restaurant_df3[attribute].unique())}')

        if len(restaurant_df3[attribute].unique()) <= 10:
            print(f'Unique values: {restaurant_df3[attribute].unique()}')

        else:
            print('More than 10 unique values')
        print('\n')

Attribute: RestaurantsGoodForGroups; # of unique values: 3
Unique values: ['True' 'False' nan]


Attribute: HasTV; # of unique values: 3
Unique values: ['True' 'False' nan]


Attribute: GoodForKids; # of unique values: 3
Unique values: ['True' nan 'False']


Attribute: RestaurantsTakeOut; # of unique values: 3
Unique values: ['True' nan 'False']


Attribute: RestaurantsPriceRange2; # of unique values: 5
Unique values: ['2' '1' nan '3' '4']


Attribute: Ambience; # of unique values: 511
More than 10 unique values


Attribute: RestaurantsReservations; # of unique values: 3
Unique values: ['False' nan 'True']


Attribute: BusinessParking; # of unique values: 73
More than 10 unique values


Attribute: RestaurantsAttire; # of unique values: 6
Unique values: ["u'casual'" "'casual'" nan "u'formal'" "u'dressy'" "'dressy'"]


Attribute: RestaurantsDelivery; # of unique values: 3
Unique values: ['True' nan 'False']


Attribute: OutdoorSeating; # of unique values: 3
Unique values: ['True' 'False'

Let us inspect the columns that have more than 10 unique values - `BusinessParking` and `Ambience`.

In [34]:
restaurant_df3[['BusinessParking']].head()

Unnamed: 0,BusinessParking
0,"{'garage': False, 'street': False, 'validated'..."
1,"{'garage': True, 'street': False, 'validated':..."
2,"{'garage': True, 'street': False, 'validated':..."
3,"{'garage': False, 'street': False, 'validated'..."
4,"{'garage': False, 'street': False, 'validated'..."


In [35]:
type(restaurant_df3['BusinessParking'][0])

str

In [36]:
restaurant_df3[['Ambience']].head()

Unnamed: 0,Ambience
0,"{'romantic': False, 'intimate': False, 'classy..."
1,"{'romantic': False, 'intimate': False, 'classy..."
2,"{'touristy': None, 'hipster': False, 'romantic..."
3,"{'romantic': False, 'intimate': False, 'classy..."
4,"{'romantic': False, 'intimate': False, 'classy..."


In [37]:
type(restaurant_df3['Ambience'][0])

str

These two columns contain nested dictionaries stored as strings, so it is no surprise that they have so many unique values. Expanding them would require parsing each string into a dictionary first. Before deciding how to handle them, we check how many values are missing in the `BusinessParking` and `Ambience` columns. In the cells that follow, both columns are dropped rather than expanded.
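For reference, here is a minimal sketch of how one of these string-encoded dictionaries could be parsed and flattened, using `ast.literal_eval` (already imported at the top of the notebook). The helper name `expand_nested`, the `prefix_key` naming scheme, and the sample string are illustrative assumptions.

```python
from ast import literal_eval


def expand_nested(raw, prefix):
    """Parse a dict stored as a string and prefix its keys.

    Hypothetical helper: returns an empty dict for missing or non-dict input.
    """
    if not isinstance(raw, str):
        return {}  # e.g. NaN for a missing value
    parsed = literal_eval(raw)
    if not isinstance(parsed, dict):
        return {}  # e.g. the string 'None'
    return {f"{prefix}_{key}": value for key, value in parsed.items()}


# Made-up value in the same shape as the BusinessParking preview
raw_parking = "{'garage': False, 'street': True, 'validated': None}"
flat_parking = expand_nested(raw_parking, "BusinessParking")
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings that come from an external dataset.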

In [38]:
dictionary_attributes = ['BusinessParking', 'Ambience']

In [39]:
for attribute in dictionary_attributes:

    missing_value_count = restaurant_df3[attribute].isna().sum()
    pct_missing_values = missing_value_count/restaurant_df3.shape[0] * 100

    print(f'Number of missing values from {attribute} column: {missing_value_count} ({round(pct_missing_values,2)}%)')

Number of missing values from BusinessParking column: 809 (7.72%)
Number of missing values from Ambience column: 2112 (20.15%)
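
The same counts and percentages can also be computed without an explicit loop, since the mean of a boolean mask is the fraction of `True` values. Below is a minimal sketch on made-up data; the `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for restaurant_df3
toy = pd.DataFrame({
    'BusinessParking': ["{'garage': False}", np.nan, "{'garage': True}", "{'lot': True}"],
    'Ambience': [np.nan, np.nan, "{'casual': True}", "{'classy': False}"],
})

# isna() gives a boolean mask; its column-wise mean is the share of missing values
missing_summary = pd.DataFrame({
    'missing_count': toy.isna().sum(),
    'missing_pct': (toy.isna().mean() * 100).round(2),
})
print(missing_summary)
```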


In [40]:
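# Disabled alternative: drop only the rows with missing values in these columns
# restaurant_df3 = restaurant_df3.\
#         drop(restaurant_df3[restaurant_df3[dictionary_attributes].isna().any(axis=1)].index).\
#         reset_index().drop('index', axis=1)

# Drop both dictionary columns entirely
restaurant_df3.drop(['BusinessParking', 'Ambience'], axis=1, inplace=True)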

restaurant_df3.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,RestaurantsTakeOut,RestaurantsPriceRange2,RestaurantsReservations,RestaurantsAttire,RestaurantsDelivery,OutdoorSeating,NoiseLevel,Alcohol,BusinessAcceptsCreditCards,WiFi
0,HPA_qyMEddpAEtFof02ixg,Mr G's Pizza & Subs,474 Lowell St,Peabody,MA,1960,42.541155,-70.973438,4.0,39,...,True,2,False,u'casual',True,True,'average',u'none',True,u'free'
1,hcRxdDg7DYryCxCoI8ySQA,Longwood Galleria,340-350 Longwood Ave,Boston,MA,2215,42.338544,-71.106842,2.5,24,...,,1,False,u'casual',,False,u'average','full_bar',True,'free'
2,jGennaZUr2MsJyRhijNBfA,Legal Sea Foods,1 Harborside Dr,Boston,MA,2128,42.363442,-71.025781,3.5,856,...,True,2,False,u'casual',False,False,u'average',u'full_bar',True,u'free'
3,iPD8BBvea6YldQZPHzVrSQ,Espresso Minute,334 Mass Ave,Boston,MA,2115,42.342673,-71.084239,4.5,7,...,True,1,False,'casual',False,True,'quiet','none',True,
4,Z2JC3Yrz82kyS86zEVJG5A,Gigi's Roast Beef & Pizza,5 Center St,Burlington,MA,1803,42.506935,-71.195854,3.0,16,...,True,1,False,u'casual',True,True,u'average',u'none',True,u'no'


In [41]:
restaurant_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  hours                       8676 non-null   object 
 13  RestaurantsGoodForGroups    896

We now turn our attention to the binary attributes. We saw earlier that in order to indicate whether a restaurant has a particular attribute, these columns use a string of either `True` if the attribute is present, or `False` if it is not present. For convenience, we will convert these columns to a numeric format by encoding `True` as 1 and `False` as 0.

We begin by defining a binarizing function that will perform this encoding for a given Pandas series.

In [42]:
def binarize(column):
    '''
    Binarizes a column by converting string or booleans of 'True' or 'False'
    to integers 1 or 0 respectively. Ignores any null values.
    
    Inputs -> Pandas series
    Outputs -> Pandas series
    '''    
    column = column.map({'True': 1, 'False': 0, True: 1, False: 0}, na_action='ignore')
    return column
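
To see the encoding in action, here is a small sketch that applies the same mapping to a made-up series mixing strings, booleans and a missing entry. The `toy` values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def binarize(column):
    # Same mapping as the function defined above
    return column.map({'True': 1, 'False': 0, True: 1, False: 0}, na_action='ignore')

# Made-up values: two strings, one missing entry, two booleans
toy = pd.Series(['True', 'False', np.nan, True, False])
encoded = binarize(toy)
print(encoded)
```

Because the missing entry is left untouched, the result is a float column rather than an integer one, which is why the encoded columns display values such as `1.0` and `0.0` until the missing values are imputed.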

In [43]:
binary_attributes = ['BusinessAcceptsCreditCards', 'RestaurantsReservations', 'OutdoorSeating',
                    'RestaurantsGoodForGroups', 'HasTV', 'RestaurantsTakeOut', 'RestaurantsDelivery',
                    'GoodForKids']

In [44]:
# Remaining attribute columns that are neither binary nor dictionary-valued
categorical_attributes = [attribute for attribute in attributes_columns
                          if attribute in restaurant_df3.columns
                          and attribute not in binary_attributes
                          and attribute not in dictionary_attributes]

categorical_attributes

['RestaurantsPriceRange2',
 'RestaurantsAttire',
 'NoiseLevel',
 'Alcohol',
 'WiFi']

In [45]:
restaurant_df3[binary_attributes] = restaurant_df3[binary_attributes].apply(binarize)

In [46]:
restaurant_df3[binary_attributes]

Unnamed: 0,BusinessAcceptsCreditCards,RestaurantsReservations,OutdoorSeating,RestaurantsGoodForGroups,HasTV,RestaurantsTakeOut,RestaurantsDelivery,GoodForKids
0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,0.0,0.0,1.0,0.0,,,1.0
2,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
3,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0
4,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
10476,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
10477,1.0,1.0,0.0,1.0,,1.0,,0.0
10478,1.0,1.0,,0.0,0.0,,,0.0
10479,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0


The binary columns have been converted to numeric columns.

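Next, we clean the categorical columns by removing the stray `u` prefix and the quotation marks wrapped around each value. These are leftovers of Python 2 unicode string literals, which is why `u'casual'` and `'casual'` currently count as different values.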

In [47]:
for col in categorical_attributes:
    
    print(f'Column: {col}; # of unique values BEFORE replacement: {len(restaurant_df3[col].unique())}')
    print(f'Unique values BEFORE replacement: {restaurant_df3[col].unique()}')
    print('\n')
    
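    # Strip the leading "u'" prefix first, then any remaining single quotes (literal matches)
    restaurant_df3[col] = restaurant_df3[col].str.replace("u'", "", regex=False)
    restaurant_df3[col] = restaurant_df3[col].str.replace("'", "", regex=False)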
    print(f'Column: {col}; # of unique values AFTER replacement: {len(restaurant_df3[col].unique())}')
    print(f'Unique values AFTER replacement: {restaurant_df3[col].unique()}')
    print('____')
    print('\n')

Column: RestaurantsPriceRange2; # of unique values BEFORE replacement: 5
Unique values BEFORE replacement: ['2' '1' nan '3' '4']


Column: RestaurantsPriceRange2; # of unique values AFTER replacement: 5
Unique values AFTER replacement: ['2' '1' nan '3' '4']
____


Column: RestaurantsAttire; # of unique values BEFORE replacement: 6
Unique values BEFORE replacement: ["u'casual'" "'casual'" nan "u'formal'" "u'dressy'" "'dressy'"]


Column: RestaurantsAttire; # of unique values AFTER replacement: 4
Unique values AFTER replacement: ['casual' nan 'formal' 'dressy']
____


Column: NoiseLevel; # of unique values BEFORE replacement: 9
Unique values BEFORE replacement: ["'average'" "u'average'" "'quiet'" "u'quiet'" nan "u'loud'" "'loud'"
 "u'very_loud'" "'very_loud'"]


Column: NoiseLevel; # of unique values AFTER replacement: 5
Unique values AFTER replacement: ['average' 'quiet' nan 'loud' 'very_loud']
____


Column: Alcohol; # of unique values BEFORE replacement: 7
Unique values BEFORE replace
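
A possible refinement, offered as a sketch and not as what the cell above does, is to anchor the pattern to the start and end of each string. This way a `u'` that happened to occur in the middle of a value would be left alone. The `toy` values below are made up in the style of the raw attribute strings.

```python
import numpy as np
import pandas as pd

# Made-up values in the same style as the raw attribute strings
toy = pd.Series(["u'casual'", "'casual'", np.nan, "u'very_loud'", "'quiet'"])

# ^u?'  matches an optional leading u followed by the opening quote
# '$    matches the closing quote at the end of the string
cleaned = toy.str.replace(r"^u?'|'$", "", regex=True)
print(cleaned.unique())
```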

Note that the `RestaurantsPriceRange2` column contains integers encoded as strings. We will convert this column to an integer data type after imputing the missing values.

We can now replace the missing values in each column with that column's mode (its most frequent value).
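
The idea can be sketched on a made-up frame before applying it column by column. The `toy` data and the vectorized `fillna` call are illustrative assumptions; the cells that follow use an explicit loop instead so that the value counts can be printed along the way.

```python
import numpy as np
import pandas as pd

# Made-up data: one binary column and one categorical column with gaps
toy = pd.DataFrame({
    'HasTV': [1.0, 1.0, np.nan, 0.0],
    'NoiseLevel': ['average', np.nan, 'average', 'quiet'],
})

# mode() returns a frame of modes; its first row holds the most frequent value per column
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value under the matching label
imputed = toy.fillna(modes)
imputed['HasTV'] = imputed['HasTV'].astype('uint8')
print(imputed)
```

Mode imputation keeps the set of categories unchanged but inflates the share of the most common value, which is worth keeping in mind when reading the distributions later on.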

In [50]:
for column in binary_attributes:

    mode = restaurant_df3[column].mode()[0]

    print('Value Counts BEFORE imputing:')
    print(restaurant_df3[column].value_counts(dropna=False))
    print(f'Mode: {mode}')

    # Assign the result back instead of filling in place on a column selection,
    # which newer pandas versions may not propagate to the DataFrame
    restaurant_df3[column] = restaurant_df3[column].fillna(mode).astype('uint8')

    print('Value Counts AFTER imputing:')
    print(restaurant_df3[column].value_counts(dropna=False))

    print('\n')


Value Counts BEFORE imputing:
1.0    9490
NaN     512
0.0     479
Name: BusinessAcceptsCreditCards, dtype: int64
Mode: 1.0
Value Counts AFTER imputing:
1    10002
0      479
Name: BusinessAcceptsCreditCards, dtype: int64


Value Counts BEFORE imputing:
0.0    5795
1.0    3337
NaN    1349
Name: RestaurantsReservations, dtype: int64
Mode: 0.0
Value Counts AFTER imputing:
0    7144
1    3337
Name: RestaurantsReservations, dtype: int64


Value Counts BEFORE imputing:
0.0    6081
1.0    2763
NaN    1637
Name: OutdoorSeating, dtype: int64
Mode: 0.0
Value Counts AFTER imputing:
0    7718
1    2763
Name: OutdoorSeating, dtype: int64


Value Counts BEFORE imputing:
1.0    6967
0.0    1993
NaN    1521
Name: RestaurantsGoodForGroups, dtype: int64
Mode: 1.0
Value Counts AFTER imputing:
1    8488
0    1993
Name: RestaurantsGoodForGroups, dtype: int64


Value Counts BEFORE imputing:
1.0    6156
0.0    2312
NaN    2013
Name: HasTV, dtype: int64
Mode: 1.0
Value Counts AFTER imputing:
1    8169
0    23

In [51]:
for column in categorical_attributes:
    mode = restaurant_df3[column].mode()[0]

    print('Value Counts BEFORE imputing:')
    print(restaurant_df3[column].value_counts(dropna=False))
    print(f'Mode: {mode}')

    # Assign the result back instead of filling in place on a column selection
    restaurant_df3[column] = restaurant_df3[column].fillna(mode)

    print('Value Counts AFTER imputing:')
    print(restaurant_df3[column].value_counts(dropna=False))

    print('\n')

Value Counts BEFORE imputing:
2      4894
1      4026
NaN     931
3       556
4        74
Name: RestaurantsPriceRange2, dtype: int64
Mode: 2
Value Counts AFTER imputing:
2    5825
1    4026
3     556
4      74
Name: RestaurantsPriceRange2, dtype: int64


Value Counts BEFORE imputing:
casual    8412
NaN       1828
dressy     229
formal      12
Name: RestaurantsAttire, dtype: int64
Mode: casual
Value Counts AFTER imputing:
casual    10240
dressy      229
formal       12
Name: RestaurantsAttire, dtype: int64


Value Counts BEFORE imputing:
average      5236
NaN          2864
quiet        1597
loud          552
very_loud     232
Name: NoiseLevel, dtype: int64
Mode: average
Value Counts AFTER imputing:
average      8100
quiet        1597
loud          552
very_loud     232
Name: NoiseLevel, dtype: int64


Value Counts BEFORE imputing:
none             4661
full_bar         3029
NaN              1702
beer_and_wine    1089
Name: Alcohol, dtype: int64
Mode: none
Value Counts AFTER imputing:
no

In [52]:
restaurant_df3['RestaurantsPriceRange2'] = restaurant_df3['RestaurantsPriceRange2'].astype('int64')

restaurant_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  hours                       8676 non-null   object 
 13  RestaurantsGoodForGroups    104

We now turn our attention to the dictionary columns.

### Expanding the Ambience Column

In [53]:
def make_dictionary(ambience_string):
    if isinstance(ambience_string, str):
        return literal_eval(ambience_string)
    else:
        return ambience_string
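
One way such a helper could be used, assuming the raw `Ambience` strings are still available, is to parse each string and spread the resulting keys across separate columns. The sketch below works on made-up rows; the `Ambience_` prefix and the choice to turn missing rows into empty dictionaries are assumptions made for illustration.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

def make_dictionary(ambience_string):
    # Parse only strings; leave NaN and existing dicts untouched
    if isinstance(ambience_string, str):
        return literal_eval(ambience_string)
    return ambience_string

# Made-up rows in the same format as the raw Ambience strings
toy = pd.Series([
    "{'romantic': False, 'casual': True}",
    np.nan,
    "{'romantic': True, 'casual': None}",
])

parsed = toy.apply(make_dictionary)

# Replace missing rows with empty dicts so that every row can be expanded
records = [d if isinstance(d, dict) else {} for d in parsed]
expanded = pd.DataFrame(records, index=toy.index).add_prefix('Ambience_')
print(expanded)
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings that come from a data file.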

### `hours` Column

The last column to be cleaned appropriately is the `hours` column. Let us take a look at this column.

In [54]:
restaurant_df3[['hours']].head()

Unnamed: 0,hours
0,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'..."
1,"{'Monday': '6:30-22:0', 'Tuesday': '6:30-22:0'..."
2,"{'Monday': '6:0-21:0', 'Tuesday': '6:0-21:0', ..."
3,"{'Tuesday': '8:0-20:0', 'Wednesday': '8:0-20:0..."
4,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'..."


In [55]:
type(restaurant_df3['hours'][0])

dict

In [56]:
missing_value_count = restaurant_df3['hours'].isna().sum()
pct_missing_values = missing_value_count/restaurant_df3.shape[0] * 100

print(f'Number of missing values in the "hours" column: {missing_value_count} ({round(pct_missing_values,2)}%)')

Number of missing values in the "hours" column: 1805 (17.22%)


This column contains dictionaries in which each key is a day of the week and each value is that day's hours of operation, stored as a string such as `'11:0-21:0'`. It has 1,805 missing values (17.2%). We could expand it into one column per day of the week, but to keep the number of independent variables manageable for the EDA section, and because the extra columns may not add much insight, we will drop this column instead.
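
For reference, if the column were kept, the expansion described above could look roughly like the sketch below. The helper `opening_duration` is hypothetical, the rows are made up, and the rule that a closing time at or before the opening time means closing after midnight is an assumption about how the strings should be read.

```python
import numpy as np
import pandas as pd

def opening_duration(span):
    """Return the hours open for a string like '11:0-21:0'; NaN if missing."""
    if not isinstance(span, str):
        return np.nan
    start, end = span.split('-')
    start_h, start_m = (int(part) for part in start.split(':'))
    end_h, end_m = (int(part) for part in end.split(':'))
    duration = (end_h + end_m / 60) - (start_h + start_m / 60)
    # Assumption: closing at or before the opening time means closing after midnight
    return duration + 24 if duration <= 0 else duration

# Made-up rows in the same format as the hours column
toy = pd.Series([
    {'Monday': '11:0-21:0', 'Tuesday': '6:30-22:0'},
    np.nan,
    {'Monday': '17:0-2:0'},
])

# One column per day, then convert each opening span to a number of hours
records = [h if isinstance(h, dict) else {} for h in toy]
per_day = pd.DataFrame(records, index=toy.index)
hours_open = per_day.apply(lambda col: col.map(opening_duration))
print(hours_open)
```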

In [57]:
restaurant_df4 = restaurant_df3.drop('hours', axis=1)
restaurant_df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  RestaurantsGoodForGroups    10481 non-null  uint8  
 13  HasTV                       104

### Categories

In [61]:
restaurant_df4['categories']

0                                 Food, Pizza, Restaurants
1                  Restaurants, Shopping, Shopping Centers
2        Sandwiches, Food, Restaurants, Breakfast & Bru...
3        Creperies, Restaurants, Food, Coffee & Tea, Br...
4                           Restaurants, Sandwiches, Pizza
                               ...                        
10476                            Pizza, Delis, Restaurants
10477                                 Restaurants, Italian
10478                    Japanese, Sushi Bars, Restaurants
10479                          Restaurants, Pizza, Italian
10480               Restaurants, American (New), Nightlife
Name: categories, Length: 10481, dtype: object

In [64]:
restaurant_df4['categories'].str.split(', ').apply(pd.Series)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,Food,Pizza,Restaurants,,,,,,,,,,,,,,,,,
1,Restaurants,Shopping,Shopping Centers,,,,,,,,,,,,,,,,,
2,Sandwiches,Food,Restaurants,Breakfast & Brunch,Seafood,Italian,Beer,Wine & Spirits,Cocktail Bars,Gluten-Free,Nightlife,Bars,Salad,,,,,,,
3,Creperies,Restaurants,Food,Coffee & Tea,Breakfast & Brunch,,,,,,,,,,,,,,,
4,Restaurants,Sandwiches,Pizza,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10476,Pizza,Delis,Restaurants,,,,,,,,,,,,,,,,,
10477,Restaurants,Italian,,,,,,,,,,,,,,,,,,
10478,Japanese,Sushi Bars,Restaurants,,,,,,,,,,,,,,,,,
10479,Restaurants,Pizza,Italian,,,,,,,,,,,,,,,,,


In [99]:
mlb = MultiLabelBinarizer()
category_data = mlb.fit_transform(restaurant_df4['categories'].str.split(', '))
category_classes = mlb.classes_

category_df = pd.DataFrame(data=category_data, columns=category_classes, index=restaurant_df4.index).astype('uint8')
category_df.head()

Unnamed: 0,Acai Bowls,Accessories,Active Life,Adult,Adult Education,Adult Entertainment,Afghan,African,Air Duct Cleaning,Airport Lounges,...,Wholesale Stores,Wholesalers,Wigs,Wine & Spirits,Wine Bars,Wineries,Women's Clothing,Wraps,Yelp Events,Zoos
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
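
The encoding above gives every category its own 0/1 column. The same idea can be seen on a made-up three-row series using pandas' `Series.str.get_dummies`, shown here as an alternative sketch and not as the method used in this notebook.

```python
import pandas as pd

# Made-up category strings in the same comma-separated format
toy = pd.Series(['Food, Pizza, Restaurants',
                 'Restaurants, Italian',
                 'Pizza, Italian, Restaurants'])

# One indicator column per distinct label, split on the ', ' separator
dummies = toy.str.get_dummies(sep=', ')
print(dummies)

# Column sums give the number of rows tagged with each label
print(dummies.sum().sort_values(ascending=False))
```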


In [100]:
category_df.sum().sort_values(ascending=False).head(30)

Restaurants                  10480
Food                          3217
Sandwiches                    1881
Nightlife                     1824
Pizza                         1792
Bars                          1736
American (Traditional)        1403
American (New)                1252
Italian                       1207
Breakfast & Brunch            1147
Coffee & Tea                  1043
Chinese                        845
Seafood                        762
Burgers                        695
Fast Food                      668
Salad                          653
Event Planning & Services      617
Cafes                          613
Mexican                        551
Bakeries                       542
Japanese                       520
Caterers                       435
Delis                          430
Specialty Food                 413
Sushi Bars                     406
Asian Fusion                   388
Desserts                       354
Cocktail Bars                  343
Mediterranean       

In [101]:
top_categories = category_df.sum().sort_values(ascending=False).drop(index=['Restaurants', 'Food',
                                                                           'Event Planning & Services',
                                                                           'Caterers']).head(20)
top_categories

Sandwiches                1881
Nightlife                 1824
Pizza                     1792
Bars                      1736
American (Traditional)    1403
American (New)            1252
Italian                   1207
Breakfast & Brunch        1147
Coffee & Tea              1043
Chinese                    845
Seafood                    762
Burgers                    695
Fast Food                  668
Salad                      653
Cafes                      613
Mexican                    551
Bakeries                   542
Japanese                   520
Delis                      430
Specialty Food             413
dtype: int64

In [102]:
category_df = category_df[top_categories.index]
category_df.head()

Unnamed: 0,Sandwiches,Nightlife,Pizza,Bars,American (Traditional),American (New),Italian,Breakfast & Brunch,Coffee & Tea,Chinese,Seafood,Burgers,Fast Food,Salad,Cafes,Mexican,Bakeries,Japanese,Delis,Specialty Food
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,0,1,0,0,1,1,0,0,1,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [103]:
category_df.shape

(10481, 20)

In [104]:
category_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Sandwiches              10481 non-null  uint8
 1   Nightlife               10481 non-null  uint8
 2   Pizza                   10481 non-null  uint8
 3   Bars                    10481 non-null  uint8
 4   American (Traditional)  10481 non-null  uint8
 5   American (New)          10481 non-null  uint8
 6   Italian                 10481 non-null  uint8
 7   Breakfast & Brunch      10481 non-null  uint8
 8   Coffee & Tea            10481 non-null  uint8
 9   Chinese                 10481 non-null  uint8
 10  Seafood                 10481 non-null  uint8
 11  Burgers                 10481 non-null  uint8
 12  Fast Food               10481 non-null  uint8
 13  Salad                   10481 non-null  uint8
 14  Cafes                   10481 non-null  uint8
 15  Mexican            

In [105]:
restaurant_df5 = pd.concat([restaurant_df4, category_df], axis=1)
restaurant_df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10481 entries, 0 to 10480
Data columns (total 45 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   business_id                 10481 non-null  object 
 1   name                        10481 non-null  object 
 2   address                     10481 non-null  object 
 3   city                        10481 non-null  object 
 4   state                       10481 non-null  object 
 5   postal_code                 10481 non-null  object 
 6   latitude                    10481 non-null  float64
 7   longitude                   10481 non-null  float64
 8   stars                       10481 non-null  float64
 9   review_count                10481 non-null  int64  
 10  is_open                     10481 non-null  int64  
 11  categories                  10481 non-null  object 
 12  RestaurantsGoodForGroups    10481 non-null  uint8  
 13  HasTV                       104

## Exploratory Data Analysis (EDA)

We will divide our data into categorical and numeric variables and then proceed with the EDA. For each variable, we first conduct a univariate analysis of its distribution in the dataset, followed by a bivariate analysis of its relationship with our dependent variable, the star rating of the establishment.
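
As a rough sketch of this two-step pattern for a categorical variable, the example below uses a made-up frame. The `WiFi` levels and star values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up data: one categorical attribute and the star rating
toy = pd.DataFrame({
    'WiFi': ['free', 'no', 'free', 'no', 'free', 'paid'],
    'stars': [4.0, 3.0, 4.5, 3.5, 5.0, 2.5],
})

# Univariate: how often each level occurs
level_counts = toy['WiFi'].value_counts()

# Bivariate: star rating summarized within each level
stars_by_level = toy.groupby('WiFi')['stars'].agg(['count', 'mean', 'median'])

print(level_counts)
print(stars_by_level)
```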

### City Distribution

In [152]:
# Count restaurants per city once and derive the percentage labels from it
city_counts = restaurant_df5['city'].value_counts()
city_pct = (city_counts / restaurant_df5.shape[0] * 100).round(1).astype('str') + '%'

# Keep the 20 cities with the most restaurants
df = pd.DataFrame({'City': city_counts.index,
                   'Counts': city_counts.values,
                   'Percentage': city_pct.values}).head(20)

In [153]:
fig = px.bar(df, y='City', x='Counts', text='Percentage', orientation='h',
             title='Distribution of Cities')
fig.update_traces(textposition='outside')
fig.update_layout(title_x=0.5)
fig.show()

The city of Boston accounts for the largest share of the dataset with 2,830 restaurants (27%), followed by the nearby city of Cambridge with 769 restaurants (7.3%) and Somerville with 388 restaurants (3.7%).

In [154]:
df = restaurant_df5[restaurant_df5['city'].isin(df['City'].unique())]

In [155]:
# Box plot of star ratings for the 20 cities with the most restaurants
fig = px.box(df, x="city", y="stars", title="Distribution of Star Rating by City",
             labels={
                     "city": "City",
                     "stars": "Star Rating"
                 })
fig.update_layout(title_x=0.5)
fig.show()

### Star Rating

In [172]:
fig = px.histogram(restaurant_df5, x='stars',
                   title='Distribution of Star Rating',
                   labels=dict(stars="Stars"))
fig.update_layout(yaxis_title="Number of Restaurants", title_x=0.5)
fig.show()

### Review Counts

In [173]:
fig = px.histogram(restaurant_df5, x='review_count',
                   title='Distribution of Review Count',
                   labels=dict(review_count="Review Count"))
fig.update_layout(yaxis_title="Number of Restaurants", title_x=0.5)
fig.show()

### WiFi Services

We b

In [98]:
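# Take labels and heights from the same value_counts() result so each bar matches its label
# (unique() returns values in order of appearance, not in order of frequency)
wifi_counts = restaurant_df7['WiFi'].value_counts()

fig = go.Figure()
bar = go.Bar(x=wifi_counts.index, y=wifi_counts.values, name='Bar')
fig.add_trace(bar)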

# Set up Layout
layout = go.Layout(title="Distribution of WiFi Availability",
                   xaxis_title='WiFi Access', yaxis_title='Count')

# Pass this into the update_layout method
fig.update_layout(layout)

fig.show()

### Bivariate Analysis

## Feature Engineering