## 1. Data of Interests

In the proposal, most projects only focus on a subset of data. Please state the subset of data to explore again here.

Our questions we're trying to answer is "Which cuisines are most positively received in certain region (West, Midwest, Northeast, and South) of the continental United States?” The subset of data that we will explore is the yelp dataset in business.json for entries in the continental United States only. (All US states except Alaska and Hawaii, not including Washington D.C.). The states will be divided up by region National Geographic's guideline for United States regions (https://www.nationalgeographic.org/maps/united-states-regions/). 

## 2. Data Preprocessing

Describe what preprocessing is done. This includes details of cleaning and reorganization.

fdsajfdksa;jfsd

In [38]:
# load dataset
import json
import pandas as pd
import pprint

filepath = '../../yelp_dataset'

business = []
for l in open(filepath+"/business.json", encoding="utf8").readlines():
    business.append(json.loads(l))
df_business = pd.DataFrame.from_records(business)

In [37]:
# Find business categories
categories = {}
for _, row in df_business.iterrows():
    if row.categories:
        row_categories = row.categories.split(', ')
        for category in row_categories:
            if category not in categories:
                categories[category] = 1
            else:
                categories[category] += 1
categories = {k: v for k, v in sorted(categories.items(), key=lambda item: item[1])}
print(categories)

{'Rodeo': 1, 'Customs Brokers': 1, 'Rotisserie Chicken': 1, 'Street Art': 1, 'Toxicologists': 1, 'Dialysis Clinics': 1, 'Beach Volleyball': 1, 'Japanese Sweets': 1, 'Czech/Slovakian': 1, 'Ceremonial Clothing': 1, 'Experiences': 1, 'DUI Schools': 1, 'General Festivals': 1, 'Minho': 1, 'Backshop': 1, 'Udon': 1, 'Tonkatsu': 1, 'Tempura': 1, 'Court Reporters': 1, 'Oaxacan': 1, 'Island Pub': 1, 'Homeopathic': 1, 'Medical Foot Care': 1, 'Geneticists': 1, 'Calligraphy': 1, 'Eastern European': 1, 'Sauna Installation & Repair': 1, 'Studio Taping': 1, 'Bocce Ball': 1, 'Sport Equipment Hire': 1, 'Christmas Markets': 1, 'Beer Hall': 1, 'Soba': 1, 'Drive-Thru Bars': 1, 'Bulgarian': 1, 'Hang Gliding': 1, 'Milkshake Bars': 1, 'Hainan': 1, 'Calabrian': 1, 'Halfway Houses': 1, 'Otologists': 1, 'Entertainment Law': 1, 'Market Stalls': 1, 'Senegalese': 1, 'Osteopaths': 1, 'Game Meat': 1, 'Churros': 1, 'Nicaraguan': 2, 'Habilitative Services': 2, 'Flyboarding': 2, 'Waldorf Schools': 2, 'Fencing Clubs': 2,

In [54]:
# Top 15 categories
print(dict(list(categories.items())[-15:]) )

{'Coffee & Tea': 7321, 'Sandwiches': 7332, 'Fashion': 7798, 'Active Life': 9521, 'Event Planning & Services': 10371, 'Bars': 11341, 'Nightlife': 13095, 'Automotive': 13203, 'Local Services': 13932, 'Health & Medical': 17171, 'Beauty & Spas': 19370, 'Home Services': 19729, 'Food': 29989, 'Shopping': 31878, 'Restaurants': 59371}


In [73]:
# Filter businesses that are only in the 'Food' or 'Restaurant' category
def check_for_rest_or_food(row):
    category = row['categories']
    if category:
        tokens = category.split(', ')
        return 'Food' in tokens or 'Restaurant' in tokens
    return False

df_business['is_restaurant_or_food'] = df_business.apply(check_for_rest_or_food, axis=1)
food_businesses = df_business[df_business['is_restaurant_or_food'] == True]
print(food_businesses)

                   business_id                                 name  \
1       QXAEGFB4oINsVuTFxEYKFQ           Emerald Chinese Restaurant   
14      -K4gAv8_vjx8-2BxkVeRkA                           Baby Cakes   
25      tstimHoMcYbkSC4eBA1wEg  Maria's Mexican Restaurant & Bakery   
26      C9oCPomVP0mtKa8z99E3gg                        Bakery Gateau   
29      NDuUMJfrWk52RA-H-OtrpA                       Bolt Fresh Bar   
...                        ...                                  ...   
192572  sVEE_Mp3EbWW1UIhfActVA                   The King's Kitchen   
192573  6A6wbLDM1wIG--6psAOqLQ                 99 Cents Only Stores   
192582  Pc0C3Pzf-DwEfJZrkCR3QA              Mercator Euro Mini Mart   
192598  vIAEWbTJc657yN8I4z7whQ                            Starbucks   
192602  go-_xdHHSufchOeZ3kkC8w            Cedar Green Wine & Cheese   

                              address                city state postal_code  \
1                30 Eglinton Avenue W         Mississauga    ON     

## 3. EDA

Describe in detail what EDA and Statistical Testing are performed. You should perform at least three meaningful plots/testings. Please also summarize the insights from EDA.

For our data set, we will perform the following EDA test:

In [None]:
# load dataset
import json
import pandas as pd

filepath = '../../yelp_dataset'

business = []
for l in open(filepath+"/business.json", encoding="utf8").readlines():
    business.append(json.loads(l))
df_business = pd.DataFrame.from_records(business)
print(df_business['state'].value_counts())

# cleaning data for inconsistent state names

west = ['WA', 'OR', 'ID', 'MT', 'WY', 'CA', 'NV', 'UT', 'CO']
southwest = ['AZ', 'NM', 'OK', 'TX']
midwest = ['ND', 'SD', 'KS', 'MO', 'NE', 'IA', 'MN', 'WI', 'IL', 'MI', 'IN', 'OH']
northeast = ['ME', 'NH', 'VT', 'NY', 'RI', 'CT', 'MA', 'PA', 'NJ']
southeast = ['WV', 'MD', 'DE', 'VA', 'KY', 'NC', 'TN', 'SC', 'GA', 'AL', 'MS', 'AR', 'LA', 'FL']
states = west + southwest + midwest + northeast + southeast

states_mappings = {
    'WASHINGTON': 'WA', 
    'OREGON': 'OR', 
    'IDAHO': 'ID',
    'MONTANA': 'MT', 
    'WYOMING': 'WY', 
    'CALIFORNIA': 'CA', 
    'NEVADA': 'NV', 'UTAH': 'UT', 
    'COLORADO': 'CO',
    'ARIZONA': 'AZ', 
    'NEW MEXICO': 'NM', 
    'OKLAHOMA': 'OK', 
    'TEXAS': 'TX',
    'NORTH DAKOTA' : 'ND',
    'SOUTH DAKOTA' : 'SD',
    'KANSAS' : 'KS',
    'MONTANA' : 'MO',
    'NEBRASKA' : 'NE',
    'IOWA' : 'IA',
    'MINNESOTA' : 'MN',
    'WISCONSIN' : 'WI',
    'ILLINOIS' : 'IL',
    'MICHIGAN' : 'MI',
    'INDIANA' : 'IN',
    'OHIO' : 'OH',
    'MAINE' : 'ME',
    'NEW HAMPSHIRE' : 'NH',
    'VERMONT' : 'VT',
    'NEW YORK' : 'NY',
    'RHODE ISLAND' : 'RI',
    'CONNECTICUT' : 'CT',
    'MASSACHUSETTS' : 'MA',
    'PENNSYLVANIA' : 'PA',
    'NEW JERSEY' : 'NJ',
    'WEST VIRGINA' : 'WV',
    'MARYLAND' : 'MD',
    'DELAWARE' : 'DE',
    'VIRGINIA' : 'VA',
    'KENTUCKY' : 'KY',
    'NORTH CAROLINA' : 'NC',
    'TENNESSEE' : 'TN',
    'SOUTH CAROLINA' : 'SC',
    'GEORGIA' : 'GA',
    'ALABAMA' : 'AL',
    'MISSOURI' : 'MS', 
    'ARKANSAS' : 'AR', 
    'LOUISIANA' : 'LA', 
    'FLORIDA' : 'FL'
}

def convert_to_upper_abr(state):
    state = state.upper()
    if state in states_mappings:
        return states_mappings[state]
    return state

# convert all state entries to upper case and fix inconsistencies with states fully spelled our vs abb
for _, row in df_business.iterrows():
    row.state = convert_to_upper_abr(row.state)
    
# filter dataset by region
print(df_business['state'].value_counts())
west_businesses = df_business.loc[df_business['state'].isin(west)]
south_businesses = df_business.loc[df_business['state'].isin(southwest)]
midwest_businesses = df_business.loc[df_business['state'].isin(midwest)]
northeast_businesses = df_business.loc[df_business['state'].isin(northeast)]
southeast_businesses = df_business.loc[df_business['state'].isin(southeast)]
print(west_businesses['state'].value_counts())

import matplotlib.pyplot as plt 

plt.boxplot([west_businesses.stars, south_businesses.stars, midwest_businesses.stars, northeast_businesses.stars, southeast_businesses.stars])
plt.xticks([1, 2, 3, 4, 5], ['West', 'South', 'Midwest', 'Northeast', 'Southeast'])
plt.show()