# Exploratory Analysis -- Yelp

## I. Problem Statement & Background

Because Yelp can reach a large number of customers and drive foot traffic, restaurants are always trying to improve their Yelp ratings. But what drives those ratings? What separates a two-star restaurant from a four-star restaurant? Does ambiance contribute to the rating? What about distance from the geographic city center? A variety of features could potentially contribute to a business's Yelp rating, and we want to discover what those are and to what degree they matter. At the end we will look at multiple factors and see which correlate most strongly with a positive restaurant rating.

What we do know is: Put Sources Here

## II. Sources Intended for Use

In [2]:
import json
import pandas as pd
import sys
import os
import numpy as np
%matplotlib inline

In [9]:
#Load Biz Dataset into DF
biz_data = []
biz_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json'
with open(biz_fn) as data_file:
    for line in data_file:
        biz_data.append(json.loads(line))
biz_df = pd.DataFrame(biz_data)

In [61]:
#filtering businesses into restaurants
category_csv = 'restaurantcategories.csv'
with open(category_csv) as categories:
    # splitlines() handles '\r', '\n' and '\r\n' line endings alike
    all_categories = categories.read().splitlines()
print(all_categories)

biz_dict = biz_df.to_dict()
# copy the keys into a list so rows can be deleted while looping
for index in list(biz_dict['categories'].keys()):
    row_cats = biz_dict['categories'][index]
    # keep a business only if it has categories and all of them are restaurant categories
    allInCats = len(row_cats) > 0
    for elem in row_cats:
        if elem not in all_categories:
            print(elem)
            allInCats = False
            break
    if not allInCats:
        for col in biz_dict.keys():
            del biz_dict[col][index]

res_df = pd.DataFrame(biz_dict)

['Afghan', 'African', 'Senegalese', 'South African', 'American (New)', 'American (Traditional)', 'Arabian', 'Argentine', 'Armenian', 'Asian Fusion', 'Australian', 'Austrian', 'Bangladeshi', 'Barbeque', 'Basque', 'Belgian', 'Brasseries', 'Brazilian', 'Breakfast & Brunch', 'British', 'Buffets', 'Burgers', 'Burmese', 'Cafes', 'Cafeteria', 'Cajun/Creole', 'Cambodian', 'Caribbean', 'Dominican', 'Haitian', 'Puerto Rican', 'Trinidadian', 'Catalan', 'Cheesesteaks', 'Chicken Wings', 'Chinese', 'Cantonese', 'Dim Sum', 'Shanghainese', 'Szechuan', 'Comfort Food', 'Creperies', 'Cuban', 'Czech', 'Delis', 'Diners', 'Ethiopian', 'Fast Food', 'Filipino', 'Fish & Chips', 'Fondue', 'Food Court', 'Food Stands', 'French', 'Gastropubs', 'German', 'Gluten-Free', 'Greek', 'Halal', 'Hawaiian', 'Himalayan/Nepalese', 'Hot Dogs', 'Hot Pot', 'Hungarian', 'Iberian', 'Indian', 'Indonesian', 'Irish', 'Italian', 'Japanese', 'Korean', 'Kosher', 'Laotian', 'Latin American', 'Columbian', 'Salvadoran', 'Venezuelan', 'Live
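
The same filter can also be expressed directly on the DataFrame, without converting to a dict and deleting rows. The sketch below is one possible way to do it, not the approach used in the cell above; `toy_biz_df`, `toy_categories` and `is_restaurant` are made-up names for illustration, standing in for `biz_df` and the list read from `restaurantcategories.csv`.

```python
import pandas as pd

# Hypothetical stand-ins for biz_df and the restaurant category list
toy_biz_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'categories': [['Chinese', 'Dim Sum'], ['Chinese', 'Nightlife'], [], ['Burgers']],
})
toy_categories = ['Chinese', 'Dim Sum', 'Burgers']


def is_restaurant(row_cats, allowed):
    """True when the business has at least one category and all are allowed."""
    return len(row_cats) > 0 and set(row_cats).issubset(allowed)


allowed = set(toy_categories)  # set membership is faster than scanning a list
mask = toy_biz_df['categories'].apply(lambda c: is_restaurant(c, allowed))
toy_res_df = toy_biz_df[mask]
print(toy_res_df['name'].tolist())
```

Only businesses whose categories are non-empty and fully contained in the allowed set survive the mask, which matches the rule applied in the loop above.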

In [63]:
# graphing category counts
import matplotlib.pyplot as plt
import operator

cats = {}
for elem in biz_dict['categories']:
    for value in biz_dict['categories'][elem]:
        cats[value] = cats.get(value, 0) + 1

print(sorted(cats.items(), key=operator.itemgetter(1)))
# list() keeps this working where dict views are not sequences
plt.bar(range(len(cats)), list(cats.values()), align='center')
# rotate the category names so neighbouring labels do not overlap
plt.xticks(range(len(cats)), list(cats.keys()), rotation=90)

plt.show()

[(u'Austrian', 1), (u'Haitian', 1), (u'Czech', 1), (u'Iberian', 1), (u'Cafeteria', 1), (u'Singaporean', 2), (u'Hungarian', 2), (u'Dominican', 3), (u'Burmese', 3), (u'Bangladeshi', 3), (u'Hot Pot', 3), (u'Laotian', 3), (u'Egyptian', 3), (u'Arabian', 4), (u'Venezuelan', 4), (u'Scandinavian', 5), (u'Shanghainese', 5), (u'Indonesian', 6), (u'Argentine', 7), (u'Cambodian', 8), (u'Lebanese', 9), (u'Russian', 10), (u'Afghan', 10), (u'Malaysian', 10), (u'Live/Raw Food', 11), (u'Irish', 11), (u'Belgian', 12), (u'Food Court', 12), (u'Salvadoran', 14), (u'Polish', 14), (u'Himalayan/Nepalese', 14), (u'Basque', 15), (u'Fondue', 16), (u'Cantonese', 16), (u'Kosher', 16), (u'African', 18), (u'Scottish', 18), (u'Halal', 19), (u'Persian/Iranian', 21), (u'Ethiopian', 21), (u'Szechuan', 24), (u'Mongolian', 25), (u'Cuban', 28), (u'Brazilian', 28), (u'Taiwanese', 29), (u'Modern European', 30), (u'Turkish', 34), (u'Creperies', 36), (u'Brasseries', 39), (u'Filipino', 46), (u'Peruvian', 46), (u'Cheesesteaks', 

In [None]:
#graphing numreviews vs num restaurants with that num
numReviews = {}
for elem in biz_dict['review_count']:
    value = biz_dict['review_count'][elem]
    numReviews[value] = numReviews.get(value, 0) + 1

# x axis: review count; y axis: number of restaurants with that review count
review_counts = sorted(numReviews.keys())
plt.bar(review_counts, [numReviews[k] for k in review_counts])
plt.xlabel('Number of reviews')
plt.ylabel('Number of restaurants')
plt.show()

In [None]:
#box and whisker plots, summary statistics of num reviews
data = []
for key in numReviews.keys():
    # repeat each review count once per restaurant that has it
    data.extend([key] * numReviews[key])

plt.boxplot(data)
plt.show()
print(pd.Series(data).describe())

In [None]:
#Load Checkin Dataset into DF
checkin_data = []
checkin_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_checkin.json'
with open(checkin_fn) as data_file:
    for line in data_file:
        checkin_data.append(json.loads(line))
checkin_df = pd.DataFrame(checkin_data)

In [None]:
#Load review dataset into df
review_data = []
review_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json'
with open(review_fn) as data_file:
    for line in data_file:
        review_data.append(json.loads(line))
review_df = pd.DataFrame(review_data)

In [None]:
#Load tip dataset into df
tip_data = []
tip_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_tip.json'
with open(tip_fn) as data_file:
    for line in data_file:
        tip_data.append(json.loads(line))
tip_df = pd.DataFrame(tip_data)

In [None]:
#Load user dataset into df
user_data = []
user_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json'
with open(user_fn) as data_file:
    for line in data_file:
        user_data.append(json.loads(line))
user_df = pd.DataFrame(user_data)
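
The loading cells above all follow the same pattern: read one JSON object per line, then build a DataFrame. A small helper could factor that out. The following is a minimal sketch; `load_json_lines` is a hypothetical name, and the temporary file with two made-up records only exists so the example runs without the Yelp data.

```python
import json
import os
import tempfile

import pandas as pd


def load_json_lines(path):
    """Read a file with one JSON object per line into a DataFrame."""
    with open(path) as data_file:
        # skip blank lines so a trailing newline does not break json.loads
        records = [json.loads(line) for line in data_file if line.strip()]
    return pd.DataFrame(records)


# Tiny made-up file standing in for one of the Yelp dataset files
tmp = tempfile.NamedTemporaryFile('w', suffix='.json', delete=False)
tmp.write('{"business_id": "b1", "stars": 4.0}\n{"business_id": "b2", "stars": 2.5}\n')
tmp.close()
demo_df = load_json_lines(tmp.name)
os.remove(tmp.name)
print(demo_df.shape)
```

With the real files, a call such as `tip_df = load_json_lines(tip_fn)` would replace each of the loops above.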

## III. Review Summary Stats

We are interested in the summary statistics of Yelp reviews. To start, let us get an idea of what the distribution of review star ratings looks like.

In [4]:
review_df.hist(column='stars')

NameError: name 'review_df' is not defined

This distribution shows that, among people who actually write reviews, there is a fairly heavy skew towards 4-5 star ratings, which is interesting. Now let us see whether this corresponds to the star rating distribution of the businesses themselves.

In [5]:
biz_df.hist(column='stars')

NameError: name 'biz_df' is not defined

This is a much smoother distribution. Its mass sits toward the right-hand side of the plot, at the higher star values, which seems to give some evidence that people tend to give favorable reviews.

First, though, let us get some summary statistics for both the business distribution and the review distribution.

In [6]:
biz_df['stars'].describe()

NameError: name 'biz_df' is not defined

In [7]:
review_df['stars'].describe()

NameError: name 'review_df' is not defined
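
Beyond `describe()`, it can help to put the two sets of statistics side by side and add a skewness figure. The sketch below shows one way to do that; the two Series are made-up ratings standing in for `review_df['stars']` and `biz_df['stars']`, and `star_summary` is a hypothetical helper name.

```python
import pandas as pd

# Made-up ratings standing in for review_df['stars'] and biz_df['stars']
toy_review_stars = pd.Series([1, 2, 4, 4, 5, 5, 5, 5, 3, 5])
toy_biz_stars = pd.Series([3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 2.5])


def star_summary(stars):
    """describe() plus sample skewness for one column of star ratings."""
    out = stars.describe()
    # Series.skew() is negative when the long tail is on the low-star side
    out['skew'] = stars.skew()
    return out


summary = pd.DataFrame({
    'reviews': star_summary(toy_review_stars),
    'businesses': star_summary(toy_biz_stars),
})
print(summary)
```

Comparing the `std` row shows which distribution is more tightly clustered around its mean, and the sign of the `skew` row shows on which side the long tail lies.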

Looking at the summary statistics, the business star ratings are clustered more tightly around their mean than the individual review ratings are. That is what one would expect if a business's star rating is an average over its reviews. Now let us break this down by category.

In [9]:
elements = set()
for element_list in biz_df['categories']:
    elements = elements.union(set(element_list))

# map each category to the star ratings of the businesses tagged with it
category_dist = {key: [] for key in elements}
for star, row_cats in zip(biz_df['stars'], biz_df['categories']):
    for cat in row_cats:
        category_dist[cat].append(star)

import matplotlib.pyplot as plt
for category in category_dist.keys():
    distribution = category_dist[category]
    if len(distribution) > 1:
        plt.hist(distribution)
        plt.xlabel('{0} stars'.format(category))
        # the y axis counts businesses in each star bin
        plt.ylabel('Num Businesses')
        plt.show()
    else:
        print("Category {0} only has 1 data point. That point is: {1}".format(category,distribution[0]))

NameError: name 'biz_df' is not defined
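
Plotting one histogram per category produces a lot of figures. A per-category summary table may be a useful complement. The sketch below assumes a pandas version that provides `DataFrame.explode` (0.25 or later) and uses a made-up `toy_biz_df` in place of `biz_df`.

```python
import pandas as pd

# Made-up rows standing in for biz_df
toy_biz_df = pd.DataFrame({
    'stars': [4.0, 3.5, 2.0, 5.0],
    'categories': [['Chinese', 'Dim Sum'], ['Chinese'], ['Burgers'], ['Chinese', 'Burgers']],
})

# explode() gives one row per (business, category) pair
per_category = toy_biz_df.explode('categories')
# count and mean star rating of the businesses in each category
category_stats = per_category.groupby('categories')['stars'].agg(['count', 'mean'])
print(category_stats.sort_values('count', ascending=False))
```

Sorting by `count` puts the best-represented categories first, which makes it easy to decide which ones have enough data to be worth plotting.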