Data: 1000 restaurants for each city

1. Cuisines: most popular (bar chart)
2. Chains: Rating and # franchises (bokeh)
3. Distributions of features
    * 2D distributions of features
    
**4. Cuisines: price vs rating (bokeh)**

TODO
* Scatter plot: average rating and cost for each cuisine, restricted to cuisines with at least N samples
* Which cities have the greatest concentration of Mexican, Ethiopian, etc.
* Determine restaurants with single category and then look for relationships between category and price, rating, review count, etc.
* Explore data using a bokeh plot in http://localhost:8889/notebooks/examples/app/movies/Untitled.ipynb
    * This is a plot I could have on my webpage (not a dashboard)!
* What are the most popular and least popular?
* Which cities have the best restaurants? (There may be sampling bias; maybe I should sort alphabetically instead?)
* Which cities are cheapest?
* Plot poke restaurants on maps
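
The single-category item above could be approached roughly as follows. This is only a sketch on toy data: the `restaurants` and `categories` frames stand in for `df_restaurants` and `df_categories`, their values are made up, and it assumes the two frames share the same row index.

```python
import pandas as pd

# Toy stand-ins for df_restaurants and df_categories (hypothetical values)
restaurants = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'rating': [4.5, 4.0, 3.0, 5.0],
    'cost': [2, 1, 1, 3],
})
categories = pd.DataFrame({
    'bagels': [1, 0, 1, 0],
    'sushi': [0, 1, 0, 1],
    'japanese': [0, 1, 0, 0],
})

# Keep only restaurants tagged with exactly one category
is_single = categories.sum(axis=1) == 1
single = restaurants[is_single].copy()

# idxmax returns the name of the (only) column equal to 1 in each row
single['category'] = categories[is_single].idxmax(axis=1)

# Compare rating and price across categories
by_category = single.groupby('category')[['rating', 'cost']].mean()
print(by_category)
```

The same `single` frame could then be grouped on `review_count` or any other restaurant column.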

Notes
* Per capita analysis may not be valid because Yelp searches around a city, not just within the area where the population was counted
    * e.g. a Yelp search for South San Francisco likely returns restaurants outside the area covered by the population count
    * This could be mitigated by instead assigning each restaurant to the city given in its address
* Categories may overlap
    * I should look through the top 100 and manually collapse some (deli and sandwich; Japanese and sushi), since one is a subset of the other
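
One possible way to collapse overlapping categories, sketched on a toy indicator table. The column names (`delis`, `sandwiches`, `japanese`, `sushi`) and the `merge_map` dictionary are assumptions for illustration; the real mapping would come from the manual review of the top 100.

```python
import pandas as pd

# Toy indicator table; the column names are assumed examples
categories = pd.DataFrame({
    'delis': [1, 0, 0],
    'sandwiches': [1, 1, 0],
    'japanese': [0, 0, 1],
    'sushi': [0, 0, 1],
})

# Hypothetical mapping: merged category -> columns it absorbs
merge_map = {
    'sandwiches': ['delis', 'sandwiches'],
    'japanese': ['japanese', 'sushi'],
}

collapsed = categories.copy()
for target, members in merge_map.items():
    # A restaurant belongs to the merged category if it had any member tag
    merged_flag = categories[members].max(axis=1)
    collapsed = collapsed.drop(columns=members)
    collapsed[target] = merged_flag

print(collapsed)
```

Taking the row-wise maximum keeps the columns as 0/1 indicators, so a restaurant tagged with both `delis` and `sandwiches` is counted once.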

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import glob
import os
import scipy as sp
from scipy import stats

from tools.plt import color2d #from the 'srcole/tools' repo
from matplotlib import cm

### Load dataframes

In [2]:
# Load cities info
df_cities = pd.read_csv('/gh/data2/yelp/city_pop.csv', index_col=0)
df_cities.head()

Unnamed: 0,city,state,population,total_food,latitude,longitude,total_scraped
0,New York,New York,8537673,54191,40.705445,-73.994293,1000
1,Los Angeles,California,3976322,41685,34.06159,-118.321381,1000
2,Chicago,Illinois,2704958,19315,41.905159,-87.677765,1000
3,Houston,Texas,2303482,15197,29.784854,-95.359955,1000
4,Phoenix,Arizona,1615017,11034,33.465086,-112.07016,1000


In [3]:
# Load restaurants
df_restaurants = pd.read_csv('/gh/data2/yelp/food_by_city/df_restaurants.csv', index_col=0)
df_restaurants.head()

Unnamed: 0,id,name,city,state,rating,review_count,cost,latitude,longitude,has_delivery,has_pickup,url
0,poquito-picante-brooklyn-2,Poquito Picante,New York,New York,4.5,40,2,40.685742,-73.981262,True,True,https://www.yelp.com/biz/poquito-picante-brook...
1,nourish-brooklyn-4,Nourish,New York,New York,4.0,65,2,40.67796,-73.96855,True,True,https://www.yelp.com/biz/nourish-brooklyn-4?ad...
2,taste-of-heaven-brooklyn,Taste of Heaven,New York,New York,5.0,19,2,40.71715,-73.94054,False,True,https://www.yelp.com/biz/taste-of-heaven-brook...
3,milk-and-cream-cereal-bar-new-york,Milk & Cream Cereal Bar,New York,New York,4.5,307,2,40.71958,-73.99654,False,False,https://www.yelp.com/biz/milk-and-cream-cereal...
4,the-bao-shoppe-new-york-2,The Bao Shoppe,New York,New York,4.0,99,1,40.714345,-73.990518,False,False,https://www.yelp.com/biz/the-bao-shoppe-new-yo...


In [4]:
# Load categories by restaurant
df_categories = pd.read_csv('/gh/data2/yelp/food_by_city/df_categories.csv', index_col=0)
df_categories.head()

Unnamed: 0,acaibowls,accessories,active,acupuncture,adultedu,advertising,aerialfitness,afghani,african,airport_shuttles,...,wine_bars,wineries,winetasteclasses,winetastingroom,winetours,womenscloth,wraps,yelpevents,yoga,zoos
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
from bokeh.io import output_notebook
from bokeh.layouts import row, widgetbox
from bokeh.models import CustomJS, Slider, Legend, HoverTool
from bokeh.plotting import figure, output_file, show, ColumnDataSource

output_notebook()

# 4. Average price and ratings for different cuisines

### Make cuisine df

In [6]:
# New dataframe: For each cuisine, compute the average rating, average price, and # restaurants
all_cuisines = df_categories.keys()
cuisine_dict = {'cuisine': [],
                'avg_rating': [],
                'avg_cost': [],
                'N': []}
for k in all_cuisines:
    df_temp = df_restaurants[df_categories[k]==1]
    cuisine_dict['cuisine'].append(k)
    cuisine_dict['avg_rating'].append(df_temp['rating'].mean())
    cuisine_dict['avg_cost'].append(df_temp['cost'].mean())
    cuisine_dict['N'].append(len(df_temp))
df_cuisine = pd.DataFrame.from_dict(cuisine_dict)
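
The loop above can be slow when there are many category columns. A possible alternative, sketched here on toy data, reshapes the indicator table to long format and aggregates with a single `groupby`. The toy frames and their values are hypothetical, and the sketch assumes the two frames share the same row index. Unlike the loop, it omits cuisines with no restaurants rather than listing them with `N = 0`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_restaurants and df_categories (hypothetical values)
restaurants = pd.DataFrame({
    'rating': [4.5, 4.0, 3.0, 5.0],
    'cost': [2.0, 1.0, np.nan, 3.0],
})
categories = pd.DataFrame({
    'bagels': [1, 0, 1, 0],
    'sushi': [0, 1, 0, 1],
    'zoos': [0, 0, 0, 0],
})

# Long format: one row per (restaurant, cuisine) pair where the indicator is 1
long = categories.stack()
long = long[long == 1].reset_index()
long.columns = ['restaurant', 'cuisine', 'flag']

# Attach rating and cost, then aggregate per cuisine; mean() skips NaN
merged = long.join(restaurants, on='restaurant')
summary = merged.groupby('cuisine').agg(
    avg_rating=('rating', 'mean'),
    avg_cost=('cost', 'mean'),
    N=('rating', 'size'),
)
print(summary)
```

Named aggregation (`avg_rating=('rating', 'mean')`) needs pandas 0.25 or later.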

### Make bokeh plot

In [10]:
# Slider variables
min_N_franchises = 100

# Determine dataframe sources
# Use >= so the initial filter matches the slider callback below
df_cuisine_limit = df_cuisine[df_cuisine['N'] >= min_N_franchises].reset_index()

# Create data source for plotting and Slider callback
source1 = ColumnDataSource(df_cuisine_limit, id='source1')
source2 = ColumnDataSource(df_cuisine, id='source2')

hover = HoverTool(tooltips=[
    ("Cuisine", "@cuisine"),
    ("Avg Stars", "@avg_rating"),
    ("Avg $", "@avg_cost"),
    ("# locations", "@N")])

# Make initial figure of average rating vs. average cost
plot = figure(plot_width=400, plot_height=400,
              x_axis_label='Average cost ($)',
              y_axis_label='Average rating (*)',
              tools=[hover],
              y_range=(2.5,5), x_range=(1,3))

plot.scatter('avg_cost', 'avg_rating', source=source1, line_width=3, line_alpha=0.6, line_color='black')

# Declare how to update plot on slider change
callback = CustomJS(args=dict(s1=source1, s2=source2), code="""
    // s1 is the filtered source that is plotted; s2 holds every cuisine
    var d1 = s1.data;
    var d2 = s2.data;
    var min_N = N.value;
    d1["cuisine"] = [];
    d1["avg_rating"] = [];
    d1["avg_cost"] = [];
    d1["N"] = [];
    // Copy over only the cuisines with at least min_N restaurants
    for (var i = 0; i < d2["N"].length; i++) {
        if (d2["N"][i] >= min_N) {
            d1["cuisine"].push(d2["cuisine"][i]);
            d1["avg_rating"].push(d2["avg_rating"][i]);
            d1["avg_cost"].push(d2["avg_cost"][i]);
            d1["N"].push(d2["N"][i]);
        }
    }

    s1.change.emit();
""")

N_slider = Slider(start=100, end=10000, value=min_N_franchises, step=100,
                  title="minimum number of restaurants", callback=callback)
callback.args["N"] = N_slider

# Define layout of plot and sliders
layout = row(plot, widgetbox(N_slider))

# Output and show
output_file("/gh/srcole.github.io/assets/misc/cuisine_bokeh.html", title="Cuisine WIP")
show(layout)

In [25]:
df_cuisine_limit

Unnamed: 0,index,N,avg_cost,avg_rating,cuisine
0,0,669,1.678625,4.243647,acaibowls
1,7,436,1.782110,4.298165,afghani
2,8,486,1.870370,4.245885,african
3,18,174,1.655172,4.155172,arabian
4,19,620,1.827419,3.429839,arcades
5,21,452,1.732301,4.224558,argentine
6,22,164,1.682927,4.262195,armenian
7,30,12304,1.703836,3.865085,asianfusion
8,42,3117,1.244466,3.761630,bagels
9,43,20171,1.692231,4.092509,bakeries
