# Restaurant Feature Engineering

Historically, *restaurant* referred only to places that provided tables and chairs where diners ate their meals. Today, *restaurant* refers to a business that provides a dining area where diners are served by wait staff. Many restaurants also offer take-out and food delivery services. Restaurants are categorized in many different ways; the most common factor is cuisine, such as Italian, Chinese, or Japanese.

After filling in most of the missing data, we found that the existing categories in the Yelp business dataset are disordered and confusing. We therefore decided to reclassify the restaurants in the dataset.

We create a comprehensive category list to classify restaurants and to generate meaningful insights from the dataset. The criteria for generating the desired categories are as follows:
- Each generated category must contain a sufficient share (5%) of the rows in the dataset
- All categories with more than 5% of the data should be preserved
- A generated category is allowed to have less than 5% of the data if it is built from multiple smaller categories
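As a rough illustration of the 5% criterion, the sketch below computes the share of rows carrying each label in a one-hot dataframe and flags the labels that fall below the threshold. The frame `toy`, the helper `category_shares`, and the constant `MIN_SHARE` are hypothetical names introduced for this example only; they are not part of the notebook.

```python
import pandas as pd

MIN_SHARE = 0.05  # assumed threshold, taken from the 5% criterion above


def category_shares(dummies: pd.DataFrame) -> pd.Series:
    """Fraction of rows flagged with each category (columns are 0/1 indicators)."""
    return dummies.mean(axis=0).sort_values(ascending=False)


# Toy one-hot data: 40 rows and three hypothetical categories
toy = pd.DataFrame({
    'Asian':    [1] * 10 + [0] * 30,
    'European': [1] * 4 + [0] * 36,
    'Uzbek':    [1] * 1 + [0] * 39,
})

shares = category_shares(toy)
too_small = shares[shares < MIN_SHARE].index.tolist()
print(shares)
print('Below threshold:', too_small)
```

Labels that land in `too_small` would be candidates for merging into a broader group rather than being kept as categories of their own.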

## Table of Contents
- Feature Engineering (Updated Categories)
- Yelp Business Dataset Final Version

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime as dt
from matplotlib import pyplot
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
# Load files
categories_dummy_df = pd.read_csv('data/categories_dummy_df.csv',index_col='Unnamed: 0')
missing_df = pd.read_csv('data/missing_df.csv',index_col='Unnamed: 0')
yelp_bus_df = pd.read_csv('data/yelp_bus_df.csv',index_col='Unnamed: 0')
check_in = pd.read_json("data/checkin.json", lines=True)

Generate a categories dataframe

We build a dataframe of restaurant categories from the category list published on [Yelp.com](https://blog.yelp.com/2018/01/yelp_category_list#section21).
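A lookup table of this kind could also be assembled from two plain lists of tags, one per type. The sketch below uses shortened, hypothetical lists (`business_tags`, `nationality_tags`) and a hypothetical frame name `lookup`. Building the columns directly should keep the two flag columns as integers, whereas transposing a frame that mixes strings and integers usually leaves them as `object` dtype.

```python
import pandas as pd

# Hypothetical, shortened tag lists for illustration
business_tags = ['Barbeque', 'Burgers', 'Pizza']
nationality_tags = ['Chinese', 'Italian', 'Thai']

lookup = pd.DataFrame({
    'Tags': business_tags + nationality_tags,
    'Business_Type': [1] * len(business_tags) + [0] * len(nationality_tags),
    'Nationality_Type': [0] * len(business_tags) + [1] * len(nationality_tags),
})

# Every tag should carry exactly one of the two type flags
assert (lookup['Business_Type'] + lookup['Nationality_Type'] == 1).all()
print(lookup)
```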

In [3]:
categories_dict = {
 0: {'Tags': 'Barbeque', 'Business_Type': 1, 'Nationality_Type': 0},
 1: {'Tags': 'Breakfast & Brunch', 'Business_Type': 1, 'Nationality_Type': 0},
 2: {'Tags': 'Pancakes', 'Business_Type': 1, 'Nationality_Type': 0},
 3: {'Tags': 'Buffets', 'Business_Type': 1, 'Nationality_Type': 0},
 4: {'Tags': 'Burgers', 'Business_Type': 1, 'Nationality_Type': 0},
 5: {'Tags': 'Cafes', 'Business_Type': 1, 'Nationality_Type': 0},
 6: {'Tags': 'Themed Cafes', 'Business_Type': 1, 'Nationality_Type': 0},
 7: {'Tags': 'Cafeteria', 'Business_Type': 1, 'Nationality_Type': 0},
 8: {'Tags': 'Cheesesteaks', 'Business_Type': 1, 'Nationality_Type': 0},
 9: {'Tags': 'Chicken Shop', 'Business_Type': 1, 'Nationality_Type': 0},
 10: {'Tags': 'Chicken Wings', 'Business_Type': 1, 'Nationality_Type': 0},
 11: {'Tags': 'Dim Sum', 'Business_Type': 1, 'Nationality_Type': 0},
 12: {'Tags': 'Creperies', 'Business_Type': 1, 'Nationality_Type': 0},
 13: {'Tags': 'Comfort Food', 'Business_Type': 1, 'Nationality_Type': 0},
 14: {'Tags': 'Diners', 'Business_Type': 1, 'Nationality_Type': 0},
 15: {'Tags': 'Dinner Theater', 'Business_Type': 1, 'Nationality_Type': 0},
 16: {'Tags': 'Fast Food', 'Business_Type': 1, 'Nationality_Type': 0},
 17: {'Tags': 'Fish & Chips', 'Business_Type': 1, 'Nationality_Type': 0},
 18: {'Tags': 'Fondue', 'Business_Type': 1, 'Nationality_Type': 0},
 19: {'Tags': 'Food Court', 'Business_Type': 1, 'Nationality_Type': 0},
 20: {'Tags': 'Food Stands', 'Business_Type': 1, 'Nationality_Type': 0},
 21: {'Tags': 'Falafel', 'Business_Type': 1, 'Nationality_Type': 0},
 22: {'Tags': 'New Mexican Cuisine',
  'Business_Type': 1,
  'Nationality_Type': 0},
 23: {'Tags': 'Hong Kong Style Cafe',
  'Business_Type': 1,
  'Nationality_Type': 0},
 24: {'Tags': 'Hot Dogs', 'Business_Type': 1, 'Nationality_Type': 0},
 25: {'Tags': 'Hot Pot', 'Business_Type': 1, 'Nationality_Type': 0},
 26: {'Tags': 'Japanese Curry', 'Business_Type': 1, 'Nationality_Type': 0},
 27: {'Tags': 'Ramen', 'Business_Type': 1, 'Nationality_Type': 0},
 28: {'Tags': 'Conveyor Belt Sushi',
  'Business_Type': 1,
  'Nationality_Type': 0},
 29: {'Tags': 'Teppanyaki', 'Business_Type': 1, 'Nationality_Type': 0},
 30: {'Tags': 'Kebab', 'Business_Type': 1, 'Nationality_Type': 0},
 31: {'Tags': 'Izakaya', 'Business_Type': 1, 'Nationality_Type': 0},
 32: {'Tags': 'Live/Raw Food', 'Business_Type': 1, 'Nationality_Type': 0},
 33: {'Tags': 'Tacos', 'Business_Type': 1, 'Nationality_Type': 0},
 34: {'Tags': 'Noodles', 'Business_Type': 1, 'Nationality_Type': 0},
 35: {'Tags': 'Pizza', 'Business_Type': 1, 'Nationality_Type': 0},
 36: {'Tags': 'Pop-Up Restaurants', 'Business_Type': 1, 'Nationality_Type': 0},
 37: {'Tags': 'Poutineries', 'Business_Type': 1, 'Nationality_Type': 0},
 38: {'Tags': 'Salad', 'Business_Type': 1, 'Nationality_Type': 0},
 39: {'Tags': 'Sandwiches', 'Business_Type': 1, 'Nationality_Type': 0},
 40: {'Tags': 'Seafood', 'Business_Type': 1, 'Nationality_Type': 0},
 41: {'Tags': 'Soul Food', 'Business_Type': 1, 'Nationality_Type': 0},
 42: {'Tags': 'Soup', 'Business_Type': 1, 'Nationality_Type': 0},
 43: {'Tags': 'Steakhouses', 'Business_Type': 1, 'Nationality_Type': 0},
 44: {'Tags': 'Supper Clubs', 'Business_Type': 1, 'Nationality_Type': 0},
 45: {'Tags': 'Sushi Bars', 'Business_Type': 1, 'Nationality_Type': 0},
 46: {'Tags': 'Tapas Bars', 'Business_Type': 1, 'Nationality_Type': 0},
 47: {'Tags': 'Tapas/Small Plates', 'Business_Type': 1, 'Nationality_Type': 0},
 48: {'Tags': 'Vegan', 'Business_Type': 1, 'Nationality_Type': 0},
 49: {'Tags': 'Vegetarian', 'Business_Type': 1, 'Nationality_Type': 0},
 50: {'Tags': 'Waffles', 'Business_Type': 1, 'Nationality_Type': 0},
 51: {'Tags': 'Wraps', 'Business_Type': 1, 'Nationality_Type': 0},
 52: {'Tags': 'Reunion', 'Business_Type': 1, 'Nationality_Type': 0},
 53: {'Tags': 'Game Meat', 'Business_Type': 1, 'Nationality_Type': 0},
 54: {'Tags': 'Gluten-Free', 'Business_Type': 1, 'Nationality_Type': 0},
 55: {'Tags': 'Afghan', 'Business_Type': 0, 'Nationality_Type': 1},
 56: {'Tags': 'African', 'Business_Type': 0, 'Nationality_Type': 1},
 57: {'Tags': 'Senegalese', 'Business_Type': 0, 'Nationality_Type': 1},
 58: {'Tags': 'South African', 'Business_Type': 0, 'Nationality_Type': 1},
 59: {'Tags': 'American (New)', 'Business_Type': 0, 'Nationality_Type': 1},
 60: {'Tags': 'American (Traditional)',
  'Business_Type': 0,
  'Nationality_Type': 1},
 61: {'Tags': 'Arabian', 'Business_Type': 0, 'Nationality_Type': 1},
 62: {'Tags': 'Argentine', 'Business_Type': 0, 'Nationality_Type': 1},
 63: {'Tags': 'Armenian', 'Business_Type': 0, 'Nationality_Type': 1},
 64: {'Tags': 'Asian Fusion', 'Business_Type': 0, 'Nationality_Type': 1},
 65: {'Tags': 'Australian', 'Business_Type': 0, 'Nationality_Type': 1},
 66: {'Tags': 'Austrian', 'Business_Type': 0, 'Nationality_Type': 1},
 67: {'Tags': 'Bangladeshi', 'Business_Type': 0, 'Nationality_Type': 1},
 68: {'Tags': 'Basque', 'Business_Type': 0, 'Nationality_Type': 1},
 69: {'Tags': 'Belgian', 'Business_Type': 0, 'Nationality_Type': 1},
 70: {'Tags': 'Brasseries', 'Business_Type': 0, 'Nationality_Type': 1},
 71: {'Tags': 'Brazilian', 'Business_Type': 0, 'Nationality_Type': 1},
 72: {'Tags': 'British', 'Business_Type': 0, 'Nationality_Type': 1},
 73: {'Tags': 'Bulgarian', 'Business_Type': 0, 'Nationality_Type': 1},
 74: {'Tags': 'Burmese', 'Business_Type': 0, 'Nationality_Type': 1},
 75: {'Tags': 'Cajun/Creole', 'Business_Type': 0, 'Nationality_Type': 1},
 76: {'Tags': 'Cambodian', 'Business_Type': 0, 'Nationality_Type': 1},
 77: {'Tags': 'Caribbean', 'Business_Type': 0, 'Nationality_Type': 1},
 78: {'Tags': 'Dominican', 'Business_Type': 0, 'Nationality_Type': 1},
 79: {'Tags': 'Haitian', 'Business_Type': 0, 'Nationality_Type': 1},
 80: {'Tags': 'Puerto Rican', 'Business_Type': 0, 'Nationality_Type': 1},
 81: {'Tags': 'Trinidadian', 'Business_Type': 0, 'Nationality_Type': 1},
 82: {'Tags': 'Catalan', 'Business_Type': 0, 'Nationality_Type': 1},
 83: {'Tags': 'Chinese', 'Business_Type': 0, 'Nationality_Type': 1},
 84: {'Tags': 'Cantonese', 'Business_Type': 0, 'Nationality_Type': 1},
 85: {'Tags': 'Hainan', 'Business_Type': 0, 'Nationality_Type': 1},
 86: {'Tags': 'Shanghainese', 'Business_Type': 0, 'Nationality_Type': 1},
 87: {'Tags': 'Szechuan', 'Business_Type': 0, 'Nationality_Type': 1},
 88: {'Tags': 'Cuban', 'Business_Type': 0, 'Nationality_Type': 1},
 89: {'Tags': 'Czech', 'Business_Type': 0, 'Nationality_Type': 1},
 90: {'Tags': 'Delis', 'Business_Type': 1, 'Nationality_Type': 0},
 91: {'Tags': 'Eritrean', 'Business_Type': 0, 'Nationality_Type': 1},
 92: {'Tags': 'Ethiopian', 'Business_Type': 0, 'Nationality_Type': 1},
 93: {'Tags': 'Filipino', 'Business_Type': 0, 'Nationality_Type': 1},
 94: {'Tags': 'French', 'Business_Type': 0, 'Nationality_Type': 1},
 95: {'Tags': 'Mauritius', 'Business_Type': 0, 'Nationality_Type': 1},
 96: {'Tags': 'Gastropubs', 'Business_Type': 0, 'Nationality_Type': 1},
 97: {'Tags': 'Georgian', 'Business_Type': 0, 'Nationality_Type': 1},
 98: {'Tags': 'German', 'Business_Type': 0, 'Nationality_Type': 1},
 99: {'Tags': 'Greek', 'Business_Type': 0, 'Nationality_Type': 1},
 100: {'Tags': 'Guamanian', 'Business_Type': 0, 'Nationality_Type': 1},
 101: {'Tags': 'Halal', 'Business_Type': 0, 'Nationality_Type': 1},
 102: {'Tags': 'Hawaiian', 'Business_Type': 0, 'Nationality_Type': 1},
 103: {'Tags': 'Himalayan/Nepalese',
  'Business_Type': 0,
  'Nationality_Type': 1},
 104: {'Tags': 'Honduran', 'Business_Type': 0, 'Nationality_Type': 1},
 105: {'Tags': 'Hungarian', 'Business_Type': 0, 'Nationality_Type': 1},
 106: {'Tags': 'Iberian', 'Business_Type': 0, 'Nationality_Type': 1},
 107: {'Tags': 'Indian', 'Business_Type': 0, 'Nationality_Type': 1},
 108: {'Tags': 'Indonesian', 'Business_Type': 0, 'Nationality_Type': 1},
 109: {'Tags': 'Irish', 'Business_Type': 0, 'Nationality_Type': 1},
 110: {'Tags': 'Italian', 'Business_Type': 0, 'Nationality_Type': 1},
 111: {'Tags': 'Calabrian', 'Business_Type': 0, 'Nationality_Type': 1},
 112: {'Tags': 'Sardinian', 'Business_Type': 0, 'Nationality_Type': 1},
 113: {'Tags': 'Sicilian', 'Business_Type': 0, 'Nationality_Type': 1},
 114: {'Tags': 'Tuscan', 'Business_Type': 0, 'Nationality_Type': 1},
 115: {'Tags': 'Japanese', 'Business_Type': 0, 'Nationality_Type': 1},
 116: {'Tags': 'Korean', 'Business_Type': 0, 'Nationality_Type': 1},
 117: {'Tags': 'Kosher', 'Business_Type': 0, 'Nationality_Type': 1},
 118: {'Tags': 'Laotian', 'Business_Type': 0, 'Nationality_Type': 1},
 119: {'Tags': 'Latin American', 'Business_Type': 0, 'Nationality_Type': 1},
 120: {'Tags': 'Colombian', 'Business_Type': 0, 'Nationality_Type': 1},
 121: {'Tags': 'Salvadoran', 'Business_Type': 0, 'Nationality_Type': 1},
 122: {'Tags': 'Venezuelan', 'Business_Type': 0, 'Nationality_Type': 1},
 123: {'Tags': 'Malaysian', 'Business_Type': 0, 'Nationality_Type': 1},
 124: {'Tags': 'Mediterranean', 'Business_Type': 0, 'Nationality_Type': 1},
 125: {'Tags': 'Mexican', 'Business_Type': 0, 'Nationality_Type': 1},
 126: {'Tags': 'Middle Eastern', 'Business_Type': 0, 'Nationality_Type': 1},
 127: {'Tags': 'Egyptian', 'Business_Type': 0, 'Nationality_Type': 1},
 128: {'Tags': 'Lebanese', 'Business_Type': 0, 'Nationality_Type': 1},
 129: {'Tags': 'Modern European', 'Business_Type': 0, 'Nationality_Type': 1},
 130: {'Tags': 'Mongolian', 'Business_Type': 0, 'Nationality_Type': 1},
 131: {'Tags': 'Moroccan', 'Business_Type': 0, 'Nationality_Type': 1},
 132: {'Tags': 'Nicaraguan', 'Business_Type': 0, 'Nationality_Type': 1},
 133: {'Tags': 'Pakistani', 'Business_Type': 0, 'Nationality_Type': 1},
 134: {'Tags': 'Persian/Iranian', 'Business_Type': 0, 'Nationality_Type': 1},
 135: {'Tags': 'Peruvian', 'Business_Type': 0, 'Nationality_Type': 1},
 136: {'Tags': 'Polish', 'Business_Type': 0, 'Nationality_Type': 1},
 137: {'Tags': 'Polynesian', 'Business_Type': 0, 'Nationality_Type': 1},
 138: {'Tags': 'Portuguese', 'Business_Type': 0, 'Nationality_Type': 1},
 139: {'Tags': 'Russian', 'Business_Type': 0, 'Nationality_Type': 1},
 140: {'Tags': 'Scandinavian', 'Business_Type': 0, 'Nationality_Type': 1},
 141: {'Tags': 'Scottish', 'Business_Type': 0, 'Nationality_Type': 1},
 142: {'Tags': 'Singaporean', 'Business_Type': 0, 'Nationality_Type': 1},
 143: {'Tags': 'Slovakian', 'Business_Type': 0, 'Nationality_Type': 1},
 144: {'Tags': 'Somali', 'Business_Type': 0, 'Nationality_Type': 1},
 145: {'Tags': 'Southern', 'Business_Type': 0, 'Nationality_Type': 1},
 146: {'Tags': 'Spanish', 'Business_Type': 0, 'Nationality_Type': 1},
 147: {'Tags': 'Sri Lankan', 'Business_Type': 0, 'Nationality_Type': 1},
 148: {'Tags': 'Syrian', 'Business_Type': 0, 'Nationality_Type': 1},
 149: {'Tags': 'Taiwanese', 'Business_Type': 0, 'Nationality_Type': 1},
 150: {'Tags': 'Thai', 'Business_Type': 0, 'Nationality_Type': 1},
 151: {'Tags': 'Turkish', 'Business_Type': 0, 'Nationality_Type': 1},
 152: {'Tags': 'Ukrainian', 'Business_Type': 0, 'Nationality_Type': 1},
 153: {'Tags': 'Uzbek', 'Business_Type': 0, 'Nationality_Type': 1},
 154: {'Tags': 'Vietnamese', 'Business_Type': 0, 'Nationality_Type': 1}}

In [4]:
categories_types = pd.DataFrame(categories_dict).T
categories_types

Unnamed: 0,Tags,Business_Type,Nationality_Type
0,Barbeque,1,0
1,Breakfast & Brunch,1,0
2,Pancakes,1,0
3,Buffets,1,0
4,Burgers,1,0
...,...,...,...
150,Thai,0,1
151,Turkish,0,1
152,Ukrainian,0,1
153,Uzbek,0,1


In [5]:
def remove_space(lst):
    """Return a copy of lst with leading and trailing spaces stripped from each entry."""
    return [entry.strip(' ') for entry in lst]

In [6]:
# We initially classify categories into two types: business and nationality
# The tags in the list below all belong to the business type
# Each entry is [tag, Business_Type, Nationality_Type], so these rows are flagged 1 and 0
tag_list = [['Desserts', 1, 0],
                    ['Bakeries', 1, 0],
                    ['Ice Cream & Frozen Yogurt', 1, 0],
                    ['Coffee & Tea', 1, 0],
                    ['Bagels', 1, 0],
                    ['Sandwiches', 1, 0],
                    ['Donuts', 1, 0],
                    ['Food', 1, 0]]

categories_types = pd.concat([categories_types, 
           pd.DataFrame(tag_list, columns = categories_types.columns)],
          ignore_index=True)

# Count how many of the listed tags (business and nationality types combined) appear in the dataset
existing_tag_df = categories_types[
                    categories_types['Tags'].isin(
                    list(categories_dummy_df.columns)
                    )]
print('Total number of category tags found in the dataset: {}'
      .format(len(existing_tag_df)))

Total number of category tags found in the dataset: 130


In [7]:
# Let's prepare a dataframe
yelp_df_columns = ['business_id', 'name', 'categories',
                   'address', 'postal_code', 'latitude', 'longitude',
                   'stars', 'review_count', 'is_open', 'attributes.RestaurantsPriceRange2']

yelp_bus_narrow = yelp_bus_df[yelp_df_columns]

In [8]:
# Create two category types, nationality and business
category_list_nationalTyp = existing_tag_df[existing_tag_df.Nationality_Type == 1]
category_list_businessTyp = existing_tag_df[existing_tag_df.Business_Type == 1]
category_list = existing_tag_df.copy()

categories_dummy_narrow = categories_dummy_df[category_list.Tags]
yelp_df = yelp_bus_narrow.join(categories_dummy_narrow, how = 'inner')

print('The shape of yelp_bus_narrow is: {}'
      .format(yelp_bus_narrow.shape))
print('The shape of categories_dummy_narrow is: {}'
      .format(categories_dummy_narrow.shape))
print('The shape of yelp_df is: {}'
      .format(yelp_df.shape))

The shape of yelp_bus_narrow is: (3632, 11)
The shape of categories_dummy_narrow is: (3748, 130)
The shape of yelp_df is: (3632, 141)


In [9]:
# Uncomment the code below to examine the distribution of categories within the dataset
# yelp_df[category_list_nationalTyp.Tags].sum(axis = 1).value_counts()

# yelp_df[category_list_businessTyp.Tags].sum(axis = 1).value_counts()

In [10]:
# Create a dictionary of categories and sub-categories
Restaurant_Types = {'Latin America':['Argentine', 'Latin American', 'Cajun/Creole',
                                      'Brazilian','Colombian', 'Venezuelan',
                                      'Peruvian', 'Caribbean', 'Dominican', 'Haitian',
                                      'Trinidadian', 'Puerto Rican', 'Cuban',
                                      'Salvadoran', 'Mexican', 'Nicaraguan', 
                                      'New Mexican Cuisine', 'Tacos', 'Tapas/Small Plates'], 
                    
                    'North American':['Fondue', 'Seafood', 'Steakhouses', 'Southern', 
                                      'Breakfast & Brunch'],
                    
                    'Ottoman Cuisine':['Arabian', 'Greek', 'Halal', 'Mediterranean',
                                       'Middle Eastern', 'Lebanese', 'Persian/Iranian', 
                                       'Turkish', 'Uzbek', 'Falafel', 'Kebab'], 
                    
                    'Quick and Greasy':['Barbeque', 'Burgers', 'Chicken Wings', 
                                        'Fast Food', 'Fish & Chips', 'Delis', 
                                        'Food Stands', 'Food Court', 'Hot Dogs',
                                        'Pop-Up Restaurants', 'Pizza', 'Chicken Shop',
                                        'Soul Food'], 
                    
                    'European':['Belgian', 'Brasseries', 'British', 'French',
                                'German', 'Irish', 'Italian', 'Calabrian',
                                'Sardinian', 'Modern European', 'Portuguese',
                                'Gastropubs', 'Armenian', 'Basque', 'Hungarian',
                                'Polish', 'Russian'], 
                    
                    'Asian':['Asian Fusion', 'Bangladeshi', 'Chinese', 
                             'Cantonese', 'Shanghainese', 'Szechuan', 
                             'Filipino', 'Indian', 'Indonesian',
                             'Japanese', 'Korean', 'Laotian', 'Malaysian', 
                             'Pakistani', 'Polynesian', 'Taiwanese', 'Thai',
                             'Vietnamese', 'Dim Sum', 'Conveyor Belt Sushi',
                             'Hot Pot', 'Hong Kong Style Cafe',
                             'Izakaya', 'Sushi Bars'], 
                    
                    'Cafes & Desserts':['Cafes', 'Themed Cafes', 'Creperies', 
                                        'Salad', 'Soup', 'Desserts', 'Bakeries', 
                                        'Ice Cream & Frozen Yogurt', 'Coffee & Tea', 
                                        'Bagels', 'Sandwiches', 'Donuts', 'Cafeteria'],

                    
                    'Other':['African',  'Ethiopian', 'Australian',
                             'Hawaiian', 'Kosher', 'Egyptian', 'Moroccan', 'Food']}

In [11]:
# Create a category list
feature_lst = list(Restaurant_Types.keys())

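# Initialize each new category column with NaN
for feature in feature_lst:
    yelp_df[feature] = np.nan

# Flag a restaurant with 1 if it carries at least one tag of the category; other rows stay NaN
for feature in feature_lst: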
    yelp_df[feature] = yelp_df[yelp_df[Restaurant_Types[feature]
                                      ].sum(axis = 1) >= 1][
        Restaurant_Types[feature]].max(axis = 1)

# Let's see how many restaurants don't belong to any of the categories we made
yelp_df_unmarked = yelp_df[yelp_df[feature_lst].sum(axis = 1) == 0]

yelp_df_unmarked[
    yelp_df_unmarked[['American (New)','American (Traditional)']
                    ].sum(axis = 1) == 0][['name','categories']].shape


(61, 2)

In [12]:
# Let's look at their categories to see whether they share any similarities or features.
yelp_df_unmarked[yelp_df_unmarked[['American (New)', 
                                   'American (Traditional)']
                                 ].sum(axis = 1) == 0][['name',
                                                        'categories']].head(15)

Unnamed: 0,name,categories
2659,SeaWorld Orlando,"Amusement Parks, Active Life, Performing Arts,..."
2938,Mel's Drive-In,"Active Life, Diners, Amusement Parks, Restaurants"
4953,Renaissance Orlando at SeaWorld,"Hotels, Hotels & Travel, Restaurants, Event Pl..."
5860,Nature's Table,"Vegetarian, Restaurants"
6738,ICON Park,"Amusement Parks, Local Flavor, Active Life, Ar..."
9971,Hyatt Regency Orlando,"Venues & Event Spaces, Hotels, Diners, Hotels ..."
12096,USTA National Campus,"Tennis, Sporting Goods, Sports Wear, Fashion, ..."
15550,Radisson Hotels and Resorts Orlando,"Hotels & Travel, Event Planning & Services, Re..."
19685,Ceviche Tapas Bar & Restaurant,"Dance Clubs, Spanish, Event Planning & Service..."
20590,Great Western Steaks & Buffet,"Restaurants, Buffets"


In [13]:
# If an unmarked restaurant is tagged neither American (New) nor American (Traditional), we assign it to the Other category
# If it is tagged American (New) or American (Traditional), we assign it to the North American category
other_index_id = yelp_df_unmarked[
                    yelp_df_unmarked[
                        ['American (New)', 'American (Traditional)']].sum(axis = 1) == 0
                        ].business_id

American_index_id = yelp_df_unmarked[
                        yelp_df_unmarked[
                            ['American (New)', 'American (Traditional)']].sum(axis = 1) >= 1
                            ].business_id

print('The number of restaurants that would be assigned to the Other category: {}'
      .format(len(other_index_id)))
print('The number of restaurants that would be assigned to the North American category: {}'
      .format(len(American_index_id)))

The number of restaurants that would be assigned to the Other category: 61
The number of restaurants that would be assigned to the North American category: 196


In [14]:
# Assign the remaining restaurants to the Other and North American categories
yelp_df.loc[yelp_df.business_id.isin(other_index_id), 'Other'] = 1
yelp_df.loc[yelp_df.business_id.isin(American_index_id), 'North American'] = 1
# Fill NaN values with 0, then check that every restaurant belongs to at least one category
yelp_df.fillna(0, inplace = True)
yelp_df[feature_lst].sum(axis=1).value_counts()

1.0    1718
2.0    1186
3.0     578
4.0     133
5.0      14
6.0       3
dtype: int64

The previous steps ensure that every business is labeled with at least one of our new category features.
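The labeling logic can be condensed into a small, self-contained sketch: take the row-wise maximum over each group's tag columns, then route rows that are still unlabeled to a fallback category. The frames `tags` and `labels` and the shortened `groups` mapping are hypothetical stand-ins for the notebook's data, and only one American tag is used here for brevity.

```python
import pandas as pd

# Hypothetical miniature of the tag indicator frame
tags = pd.DataFrame({
    'Chinese':        [1, 0, 0, 0],
    'Thai':           [1, 0, 0, 0],
    'Pizza':          [0, 1, 0, 0],
    'American (New)': [0, 0, 1, 0],
})

# Hypothetical, shortened grouping in the spirit of Restaurant_Types
groups = {
    'Asian': ['Chinese', 'Thai'],
    'Quick and Greasy': ['Pizza'],
    'North American': [],
    'Other': [],
}

# Start with all zeros, then set 1 where any tag of the group is present
labels = pd.DataFrame(0, index=tags.index, columns=list(groups))
for name, cols in groups.items():
    if cols:
        labels[name] = tags[cols].max(axis=1)

# Fallback: unlabeled rows go to North American if tagged American, else to Other
unlabeled = labels.sum(axis=1) == 0
is_american = tags[['American (New)']].sum(axis=1) >= 1
labels.loc[unlabeled & is_american, 'North American'] = 1
labels.loc[unlabeled & ~is_american, 'Other'] = 1

# Every row should now carry at least one label
assert (labels.sum(axis=1) >= 1).all()
print(labels)
```

The final assertion plays the same role as the `value_counts()` check in the notebook: no row should end up with a label count of zero.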

We finish the feature engineering by merging the dataframe with the check-in dataset.
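A minimal sketch of this step on toy data: parse each comma-separated check-in string, count the check-ins, measure the span in days, and inner-merge the result onto a business table. The frames `checkins` and `businesses` and the helper `checkin_span` are hypothetical. Unlike the notebook's function, this sketch uses the earliest and latest timestamps rather than the first and last entries, so it does not assume the string is sorted.

```python
import pandas as pd

# Hypothetical check-in records in a comma-separated timestamp format
checkins = pd.DataFrame({
    'business_id': ['a', 'b'],
    'date': ['2020-01-01 10:00:00, 2020-01-03 10:00:00, 2020-01-11 22:00:00',
             '2019-05-05 08:30:00'],
})


def checkin_span(date_string: str) -> pd.Series:
    """Count check-ins and measure the days between the earliest and latest one."""
    stamps = pd.to_datetime([s.strip() for s in date_string.split(',')],
                            format='%Y-%m-%d %H:%M:%S')
    span = stamps.max() - stamps.min()
    return pd.Series({
        'checkin_sum': len(stamps),
        'total_days': span.total_seconds() / 86400,
    })


summary = checkins.join(checkins['date'].apply(checkin_span))

# Hypothetical business table; business 'c' has no check-in record
businesses = pd.DataFrame({'business_id': ['a', 'c'],
                           'name': ['Cafe A', 'Diner C']})
merged = businesses.merge(summary.drop(columns='date'),
                          on='business_id', how='inner')
print(merged)
```

Because the merge is an inner join, businesses without any check-in record (here `'c'`) drop out of the result.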

In [15]:
# Extract datetimes from the check-in column
# Compute the number of days between each restaurant's first and last check-in
def timespan_extractor(df):
    checkin_sum = []
    fisrt_record = []
    last_record = []
    total_days = []
    for index, row in df.iterrows():
        checkin = row.date.split(',')
        date_start = dt.strptime(checkin[0].strip(), "%Y-%m-%d %H:%M:%S")
        date_end = dt.strptime(checkin[-1].strip(), "%Y-%m-%d %H:%M:%S")
        # all desired information extracted from checkin
        checkin_sum.append(len(checkin))
        fisrt_record.append(str(date_start))
        last_record.append(str(date_end))
        total_days.append((date_end - date_start).total_seconds()/(3600*24))
    df['checkin_sum'] = checkin_sum
    df['fisrt_record'] = fisrt_record
    df['last_record'] = last_record
    df['total_days'] = total_days
    return df

checkin_expanded = timespan_extractor(check_in)

In [16]:
category_df = yelp_df[list(yelp_bus_narrow.columns) + feature_lst]
category_df.columns = ['business_id', 'name', 'categories', 'address', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'price_range', 'Latin America', 'North American',
       'Ottoman Cuisine', 'Quick and Greasy', 'European', 'Asian',
       'Cafes & Desserts', 'Other']

In [17]:
# Let's merge category dataframe and checkin dataframe
yelp_df = pd.merge(category_df, checkin_expanded, how="inner", on=["business_id"])
yelp_df.drop(labels = 'date', axis = 1, inplace = True)
# Clear the Other flag for restaurants that also belong to another category
yelp_df.loc[yelp_df[feature_lst].sum(axis = 1) >1, 'Other'] = 0

print('The current shape of yelp_df is:{}'
      .format(yelp_df.shape))
yelp_df.head(5)

The current shape of yelp_df is:(3607, 23)


Unnamed: 0,business_id,name,categories,address,postal_code,latitude,longitude,stars,review_count,is_open,...,Ottoman Cuisine,Quick and Greasy,European,Asian,Cafes & Desserts,Other,checkin_sum,fisrt_record,last_record,total_days
0,ufCxltuh56FF4-ZFZ6cVhg,Sister Honey's,"Restaurants, American (New), Bakeries, Dessert...",247 E Michigan St,32806.0,28.513265,-81.374707,4.5,135,1,...,0.0,0.0,0.0,0.0,1.0,0.0,246,2012-08-29 22:10:36,2020-10-09 20:50:00,2962.944028
1,GfWJ19Js7wX9rwaHQ7KbGw,Everything POP Shopping & Dining,"Restaurants, American (New), Food Court, Flowe...",1050 Century Dr,32830.0,28.350498,-81.542819,3.0,7,1,...,0.0,1.0,0.0,0.0,0.0,0.0,63,2013-07-12 15:25:43,2020-09-16 03:28:27,2622.501898
2,ynTjh_FdhbG5hY69HsEoaA,Cascade Restaurant,"Hotels, American (Traditional), Restaurants, E...","Hyatt Regency Grand Cypress, 1 Grand Cypress Blvd",32836.0,28.381945,-81.510327,3.5,18,0,...,0.0,0.0,0.0,0.0,0.0,0.0,78,2011-04-23 14:34:30,2016-01-12 00:28:05,1724.412211
3,qbZJh9lR0gh4Wca96NQv9g,Chuck E. Cheese,"Pizza, Event Planning & Services, Arcades, Par...",7456 W Colonial Dr,32818.0,28.551335,-81.483167,2.0,15,1,...,0.0,1.0,0.0,0.0,0.0,0.0,21,2011-01-05 22:45:45,2018-07-28 21:18:03,2760.939097
4,EGZ0fhB9k0ZlI5sHda4vFw,Andy's Frozen Custard,"Food, Ice Cream & Frozen Yogurt, Restaurants, ...",5381 International Dr,32819.0,28.463278,-81.451181,4.5,36,0,...,0.0,0.0,0.0,0.0,1.0,0.0,51,2018-03-20 01:50:18,2018-09-23 02:28:03,187.026215


In [18]:
# yelp_df.to_csv('data/yelp_df_base.csv')