<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 5

## Help Yelp

---

In this project you will be investigating a small version of the [Yelp challenge dataset](https://www.yelp.com/dataset_challenge). You'll practice using classification algorithms, cross-validation, gridsearching – all that good stuff.



---

### The data

There are 5 individual .csv files that have the information, zipped into .7z format like with the SF data last project. The dataset is located in your datasets folder:

    DSI-SF-2/datasets/yelp_arizona_data.7z

The columns in each are:

    businesses_small_parsed.csv
        business_id: unique business identifier
        name: name of the business
        review_count: number of reviews per business
        city: city business resides in
        stars: average rating
        categories: categories the business falls into (can be one or multiple)
        latitude
        longitude
        neighborhoods: neighborhoods business belongs to
        variable: "property" of the business (a tag)
        value: True/False for the property
        
    reviews_small_nlp_parsed.csv
        user_id: unique user identifier
        review_id: unique review identifier
        votes.cool: how many thought the review was "cool"
        business_id: unique business id the review is for
        votes.funny: how many thought the review was funny
        stars: rating given
        date: date of review
        votes.useful: how many thought the review was useful
        ... 100 columns of counts of most common 2 word phrases that appear in reviews in this review
        
    users_small_parsed.csv
        yelping_since: signup date
        compliments.plain: # of compliments "plain"
        review_count: # of reviews
        compliments.cute: total # of compliments "cute"
        compliments.writer: # of compliments "writer"
        compliments.note: # of compliments "note" (not sure what this is)
        compliments.hot: # of compliments "hot" (?)
        compliments.cool: # of compliments "cool"
        compliments.profile: # of compliments "profile"
        average_stars: average rating
        compliments.more: # of compliments "more"
        elite: years considered "elite"
        name: user's name
        user_id: unique user id
        votes.cool: # of votes "cool"
        compliments.list: # of compliments "list"
        votes.funny: # of votes "funny"
        compliments.photos: # of compliments "photos"
        compliments.funny: # of compliments "funny"
        votes.useful: # of votes "useful"
       
    checkins_small_parsed.csv
        business_id: unique business identifier
        variable: day-time identifier of checkins (0-0 is Sunday 0:00 - 1:00am,  for example)
        value: # of checkins at that time
    
    tips_small_nlp_parsed.csv
        user_id: unique user identifier
        business_id: unique business identifier
        likes: likes that the tip has
        date: date of tip
        ... 100 columns of counts of most common 2 word phrases that appear in tips in this tip

The reviews and tips datasets in particular have parsed "NLP" columns with counts of 2-word phrases in that review or tip (a "tip", it seems, is some kind of smaller review).

The user dataset has a lot of columns of counts of different compliments and votes. I'm not sure whether the compliments or votes are _by_ the user or _for_ the user.

---

If you look at the website, or the full data, you'll see I have removed pieces of the data and cut it down quite a bit. This is to simplify it for this project. Specifically, businesses are limited to these cities:

    Phoenix
    Surprise
    Las Vegas
    Waterloo

Apparently there is a city called "Surprise" in Arizona. 

Businesses are also restricted to be in at least one of the following categories, because I thought the mix of them was funny:

    Airports
    Breakfast & Brunch
    Bubble Tea
    Burgers
    Bars
    Bakeries
    Breweries
    Cafes
    Candy Stores
    Comedy Clubs
    Courthouses
    Dance Clubs
    Fast Food
    Museums
    Tattoo
    Vape Shops
    Yoga
    
---

### Project requirements

**You will be performing 4 different sections of analysis, like in the last project.**

Remember that classification targets are categorical and regression targets are continuous variables.

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Constructing a "profile" for Las Vegas

---

Yelp is interested in building out what they are calling "profiles" for cities. They want you to start with just Las Vegas to see what a prototype of this would look like. Essentially, they want to know what makes Las Vegas distinct from the other three cities.

Use the data you have to predict Las Vegas from the other variables you have. You should not be predicting the city from any kind of location data or other data perfectly associated with that city (or another city).

You may use any classification algorithm you deem appropriate, or even multiple models. You should:

1. Build at least one model predicting Las Vegas vs. the other cities.
2. Validate your model(s).
3. Interpret and visualize, in some way, the results.
4. Write up a "profile" for Las Vegas. This should be a writeup converting your findings from the model(s) into a human-readable description of the city.

In [3]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd
import patsy

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.model_selection import GridSearchCV

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [4]:
yelp_bus = '/Users/Divya/desktop/DSI-SF-2/datasets/yelp_arizona_data/businesses_small_parsed.csv'
yelp_bus = pd.read_csv(yelp_bus)

yelp_rev = '/Users/Divya/desktop/DSI-SF-2/datasets/yelp_arizona_data/reviews_small_nlp_parsed.csv'
yelp_rev = pd.read_csv(yelp_rev)

In [5]:
yelp_check = '/Users/Divya/desktop/DSI-SF-2/datasets/yelp_arizona_data/checkins_small_parsed.csv'
yelp_check = pd.read_csv(yelp_check)

yelp_tip = '/Users/Divya/desktop/DSI-SF-2/datasets/yelp_arizona_data/tips_small_nlp_parsed.csv'
yelp_tip = pd.read_csv(yelp_tip)

In [83]:
yelp_bus.shape

(152832, 11)

In [84]:
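# Need to convert table to wide format: one row per business, one column per property

def same(x):
    # Take the first value in each (business, property) group;
    # an empty string is treated as missing.
    x = x.iloc[0]
    if len(x) == 0:
        return np.nan
    else:
        return x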

yelp_bus_wide = pd.pivot_table(yelp_bus,
                          index=['business_id', 'name', 'review_count', 'city', 'stars', 'categories'],
                          columns=['variable'],
                          values='value',
                          aggfunc=same)

In [6]:
yelp_bus_wide = yelp_bus_wide.reset_index()

In [85]:
yelp_bus_wide.shape[0]

4132
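
The long-to-wide reshape above can be hard to picture on 150,000 rows. The sketch below repeats the same `pivot_table` call on a four-row toy table. The values in `long_df` and the helper name `first_or_nan` are made up for illustration; only the column layout follows `businesses_small_parsed.csv`.

```python
import numpy as np
import pandas as pd

# Toy long-format table (hypothetical values): one row per business/property pair.
long_df = pd.DataFrame({
    'business_id': ['a', 'a', 'b', 'b'],
    'city': ['Las Vegas', 'Las Vegas', 'Phoenix', 'Phoenix'],
    'variable': ['attributes.Alcohol', 'attributes.Smoking',
                 'attributes.Alcohol', 'attributes.Smoking'],
    'value': ['full_bar', 'yes', 'none', ''],
})

def first_or_nan(values):
    """Return the first value in the group, or NaN if it is an empty string."""
    first = values.iloc[0]
    return np.nan if len(first) == 0 else first

# Each distinct 'variable' becomes its own column; the index columns identify a business.
wide_df = pd.pivot_table(long_df,
                         index=['business_id', 'city'],
                         columns='variable',
                         values='value',
                         aggfunc=first_or_nan).reset_index()
```

`wide_df` ends up with one row per business and one column per property, with NaN where a property was empty for that business.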

In [7]:
yelp_bus_wide.columns.values

array(['business_id', 'name', 'review_count', 'city', 'stars',
       'categories', 'attributes.Accepts Credit Cards',
       'attributes.Accepts Insurance', 'attributes.Ages Allowed',
       'attributes.Alcohol', 'attributes.Ambience.casual',
       'attributes.Ambience.classy', 'attributes.Ambience.divey',
       'attributes.Ambience.hipster', 'attributes.Ambience.intimate',
       'attributes.Ambience.romantic', 'attributes.Ambience.touristy',
       'attributes.Ambience.trendy', 'attributes.Ambience.upscale',
       'attributes.Attire', 'attributes.BYOB', 'attributes.BYOB/Corkage',
       'attributes.By Appointment Only', 'attributes.Caters',
       'attributes.Coat Check', 'attributes.Corkage',
       'attributes.Delivery', 'attributes.Dietary Restrictions.dairy-free',
       'attributes.Dietary Restrictions.gluten-free',
       'attributes.Dietary Restrictions.halal',
       'attributes.Dietary Restrictions.kosher',
       'attributes.Dietary Restrictions.soy-free',
       'attri

In [60]:
# Want to select only columns I think make sense to predict Vegas
# (.copy() so that later column assignments change this frame, not a slice of yelp_bus_wide)

select_bus = yelp_bus_wide[['business_id', 'name', 'review_count', 'city', 'stars', 'categories',
                            'attributes.Accepts Credit Cards', 'attributes.Alcohol',
                            'attributes.Ambience.touristy', 'attributes.Attire',
                            'attributes.Coat Check', 'attributes.Good For Dancing',
                            'attributes.Good For Groups', 'attributes.Good For.latenight',
                            'attributes.Happy Hour', 'attributes.Music.dj', 'attributes.Music.live',
                            'attributes.Noise Level', 'attributes.Open 24 Hours',
                            'attributes.Smoking', 'hours.Friday.open', 'hours.Saturday.open',
                            'hours.Sunday.open', 'hours.Thursday.open']].copy()

In [61]:
# Rename columns

select_bus.columns = ['business_id', 'name', 'review_count', 'city', 'stars','categories','credit_cards', 
                      'alcohol', 'touristy', 'attire', 'coat_check', 'dancing', 'groups', 'latenight', 
                  'happy_hour', 'music_dj', 'music_live', 'noise_level', '24_hours', 'smoking', 'friday_open', 
                  'saturday_open', 'sunday_open', 'thursday_open']

In [62]:
select_bus.head()

Unnamed: 0,business_id,name,review_count,city,stars,categories,credit_cards,alcohol,touristy,attire,...,happy_hour,music_dj,music_live,noise_level,24_hours,smoking,friday_open,saturday_open,sunday_open,thursday_open
0,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",True,none,,casual,...,,,,,,,,,,
1,-0HGqwlfw3I8nkJyMHxAsQ,McDonald's,9,Phoenix,3.0,"['Burgers', 'Fast Food', 'Restaurants']",True,none,,casual,...,,,,quiet,,,,,,
2,-0VK5Z1BfUHUYq4PoBYNLw,T Spot,5,Las Vegas,3.5,"['Bars', 'Nightlife', 'Lounges']",True,full_bar,False,,...,True,,False,loud,,yes,,,,
3,-0bUDim5OGuv8R0Qqq6J4A,IHOP,8,Phoenix,2.0,"['Bakeries', 'Food', 'Breakfast & Brunch', 'Re...",True,,,casual,...,,,,,,,,,,
4,-1bOb2izeJBZjHC7NWxiPA,First Watch,120,Phoenix,4.0,"['Breakfast & Brunch', 'Cafes', 'American (Tra...",True,none,False,casual,...,,,,average,,,06:30,06:30,06:30,06:30


In [63]:
select_bus.shape

(4132, 24)

In [64]:
select_bus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4132 entries, 0 to 4131
Data columns (total 24 columns):
business_id      4132 non-null object
name             4132 non-null object
review_count     4132 non-null int64
city             4132 non-null object
stars            4132 non-null float64
categories       4132 non-null object
credit_cards     3896 non-null object
alcohol          3050 non-null object
touristy         2873 non-null object
attire           2650 non-null object
coat_check       1260 non-null object
dancing          1277 non-null object
groups           3362 non-null object
latenight        2507 non-null object
happy_hour       1321 non-null object
music_dj         1052 non-null object
music_live       709 non-null object
noise_level      2805 non-null object
24_hours         54 non-null object
smoking          1187 non-null object
friday_open      2518 non-null object
saturday_open    2464 non-null object
sunday_open      2194 non-null object
thursday_open    2489 

In [65]:
# Want to check unique values in each of the selected columns

labelled_cols = [("Credit Cards", "credit_cards"), ("Alcohol", "alcohol"),
                 ("Touristy", "touristy"), ("Attire", "attire"),
                 ("Coat Check", "coat_check"), ("Dancing", "dancing"),
                 ("Groups", "groups"), ("Late night", "latenight"),
                 ("Happy hour", "happy_hour"), ("Music DJ", "music_dj"),
                 ("Music Live", "music_live"), ("Noise Level", "noise_level"),
                 ("Smoking", "smoking"), ("Friday Open", "friday_open"),
                 ("Sat Open", "saturday_open"), ("Sun Open", "sunday_open"),
                 ("Thur Open", "thursday_open")]

for label, col in labelled_cols:
    print("%s: %s" % (label, select_bus[col].unique()))

Credit Cards: ['True' 'False' None]
Alcohol: ['none' 'full_bar' None 'beer_and_wine']
Touristy: [None 'False' 'True']
Attire: ['casual' None 'dressy' 'formal']
Coat Check: [None 'False' 'True']
Dancing: [None 'True' 'False']
Groups: ['True' None 'False']
Late night: ['False' None 'True']
Happy hour: [None 'True' 'False']
Music DJ: [None 'False' 'True']
Music Live: [None 'False' 'True']
Noise Level: [None 'quiet' 'loud' 'average' 'very_loud']
Smoking: [None 'yes' 'no' 'outdoor']
Friday Open: [None '06:30' '09:00' '10:00' '17:30' '06:00' '00:00' '07:00' '17:00'
 '05:00' '11:00' '21:00' '16:00' '12:00' '08:00' '10:30' '22:30' '07:30'
 '20:00' '22:00' '23:30' '14:00' '18:00' '19:00' '11:30' '21:30' '16:30'
 '15:00' '05:30' '03:00' '09:30' '13:00' '20:30' '12:30' '00:30' '08:30'
 '04:30' '01:00' '05:45' '04:00' '19:30' '23:00' '14:30' '15:30']
Sat Open: [None '06:30' '08:00' '10:00' '09:00' '17:30' '06:00' '00:00' '17:00'
 '05:00' '11:00' '21:00' '07:00' '16:00' '12:00' '07:30' '10:30' '22:

In [59]:
select_bus.credit_cards.unique()

array([0])

In [66]:
# Want to convert the columns with True, False or None to binary.
# Note that missing values (None) end up as 0, the same as 'False'.

binary_cols = ['credit_cards', 'touristy', 'coat_check', 'dancing', 'groups',
               'latenight', 'happy_hour', 'music_dj', 'music_live']

for col in binary_cols:
    select_bus[col] = (select_bus[col] == 'True').astype(int)

In [67]:
select_bus.head()

Unnamed: 0,business_id,name,review_count,city,stars,categories,credit_cards,alcohol,touristy,attire,...,happy_hour,music_dj,music_live,noise_level,24_hours,smoking,friday_open,saturday_open,sunday_open,thursday_open
0,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",1,none,0,casual,...,0,0,0,,,,,,,
1,-0HGqwlfw3I8nkJyMHxAsQ,McDonald's,9,Phoenix,3.0,"['Burgers', 'Fast Food', 'Restaurants']",1,none,0,casual,...,0,0,0,quiet,,,,,,
2,-0VK5Z1BfUHUYq4PoBYNLw,T Spot,5,Las Vegas,3.5,"['Bars', 'Nightlife', 'Lounges']",1,full_bar,0,,...,1,0,0,loud,,yes,,,,
3,-0bUDim5OGuv8R0Qqq6J4A,IHOP,8,Phoenix,2.0,"['Bakeries', 'Food', 'Breakfast & Brunch', 'Re...",1,,0,casual,...,0,0,0,,,,,,,
4,-1bOb2izeJBZjHC7NWxiPA,First Watch,120,Phoenix,4.0,"['Breakfast & Brunch', 'Cafes', 'American (Tra...",1,none,0,casual,...,0,0,0,average,,,06:30,06:30,06:30,06:30
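
Coding missing attributes as 0 treats "not listed" the same as "False". With columns such as `coat_check`, where most values are missing, that may blur the signal. One possible alternative, shown here on a made-up Series rather than on `select_bus`, is to keep a separate indicator for missingness. The `_missing` column name is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute column with the three states seen in the notebook.
attr = pd.Series(['True', 'False', None, 'True'], name='coat_check')

# Map the strings to 1/0; anything not in the dictionary (here None) becomes NaN.
as_float = attr.map({'True': 1.0, 'False': 0.0})

# Record missingness in its own column before filling the gaps with 0.
encoded = pd.DataFrame({
    'coat_check': as_float.fillna(0).astype(int),
    'coat_check_missing': as_float.isnull().astype(int),
})
```

A model can then use both columns, so "unknown" and "no" are no longer forced to share a value.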


In [68]:
yelp_rev.head()

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
0,o_LCYay4uo5N4eq3U5pbrQ,biEOCicjWlibF26pNLvhcw,0,EmzaQR5hQlF0WIl24NxAZA,0,3,2007-09-14,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,sEWeeq41k4ohBz4jS_iGRw,tOhOHUAS7XJch7a_HW5Csw,3,EmzaQR5hQlF0WIl24NxAZA,12,2,2008-04-21,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1AqEqmmVHgYCuzcMrF4h2g,2aGafu-x7onydGoDgDfeQQ,0,EmzaQR5hQlF0WIl24NxAZA,2,2,2009-11-16,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,pv82zTlB5Txsu2Pusu__FA,CY4SWiYcUZTWS_T_cGaGPA,4,EmzaQR5hQlF0WIl24NxAZA,9,2,2010-08-16,6,0,0,...,0,0,0,0,0,0,0,0,0,0
4,jlr3OBS1_Y3Lqa-H3-FR1g,VCKytaG-_YkxmQosH4E0jw,0,EmzaQR5hQlF0WIl24NxAZA,1,4,2010-12-04,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
yelp_rev.shape

(322398, 108)

In [90]:
# We want to find the top businesses based on the number of reviews they receive
# (here: at least 50 reviews in yelp_rev)

review_counts = yelp_rev.groupby(['business_id'])[['user_id']].count().reset_index()
review_counts.columns = ['business_id', 'number_of_reviews']
top_bus = review_counts[review_counts['number_of_reviews'] >= 50]

top_bus.head()

Unnamed: 0,business_id,number_of_reviews
4,-1bOb2izeJBZjHC7NWxiPA,107
7,-3xbryp44xhpN4BohxXDdQ,202
9,-584fn2GxYe9sLsgN2WeQA,241
10,-5RN56jH78MV2oquLV_G8g,534
20,-E4VgEmeRZegu1BomYtyQQ,137


In [94]:
# Keep only the reviews that belong to the top businesses
top_bus_ids = list(top_bus.business_id)

top_bus_mask = yelp_rev['business_id'].isin(top_bus_ids)

top_bus2 = yelp_rev[top_bus_mask]
top_bus2.head(2)

Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
41,9H-Xsn5sHtYU-4cA8bj7uQ,SQcCX5h5mMZWB-pOJjTgZg,0,e5kc0CQ4R-PCCDgb274gSg,0,4,2006-05-08,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,N_o8Yhuw1afXs20jBcU9Sw,t6lAiiw3c9bQByChPHOy2Q,1,e5kc0CQ4R-PCCDgb274gSg,0,5,2007-01-17,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [95]:
top_bus2.columns.values

array(['user_id', 'review_id', 'votes.cool', 'business_id', 'votes.funny',
       'stars', 'date', 'votes.useful', '10 minutes', '15 minutes',
       '20 minutes', '30 minutes', 'bar food', 'beer selection', 'best ve',
       'bloody mary', 'bottle service', 'chicken waffles',
       'customer service', 'dance floor', 'decided try', 'definitely come',
       'definitely recommend', 'didn want', 'don know', 'don like',
       'don think', 'don want', 'eggs benedict', 'fast food', 'feel like',
       'felt like', 'fish chips', 'food amazing', 'food came',
       'food delicious', 'food good', 'food great', 'food just',
       'food service', 'french fries', 'french toast', 'friday night',
       'fried chicken', 'friendly staff', 'good food', 'good place',
       'good service', 'good thing', 'good time', 'great atmosphere',
       'great experience', 'great food', 'great place', 'great service',
       'great time', 'happy hour', 'hash browns', 'highly recommend',
       'hip hop', 'ice

In [96]:
top_bus2 = top_bus2[['user_id', 'review_id', 'votes.cool', 'business_id', 'votes.funny',
                      'stars', 'date','bar food', 'beer selection', 'best ve',
                      'bloody mary', 'bottle service','dance floor','friday night',
                      'happy hour','hip hop','late night','red velvet','saturday night']]

In [97]:
top_bus2.columns = ['user_id', 'review_id', 'votes_cool', 'business_id', 'votes_funny',
                      'rev_stars', 'date','bar_food', 'beer_selection', 'best_ve',
                      'bloody_mary', 'bottle_service','dance_floor','friday_night',
                      'happy_hour','hip_hop','late_night','red_velvet','saturday_night']

In [98]:
yelp1 = select_bus.merge(top_bus2, how='left', on='business_id')

In [99]:
yelp1.head()

Unnamed: 0,business_id,name,review_count,city,stars,categories,credit_cards,alcohol,touristy,attire,...,best_ve,bloody_mary,bottle_service,dance_floor,friday_night,happy_hour_y,hip_hop,late_night,red_velvet,saturday_night
0,--jFTZmywe7StuZ2hEjxyA,Subway,7,Las Vegas,3.5,"['Fast Food', 'Sandwiches', 'Restaurants']",1,none,0,casual,...,,,,,,,,,,
1,-0HGqwlfw3I8nkJyMHxAsQ,McDonald's,9,Phoenix,3.0,"['Burgers', 'Fast Food', 'Restaurants']",1,none,0,casual,...,,,,,,,,,,
2,-0VK5Z1BfUHUYq4PoBYNLw,T Spot,5,Las Vegas,3.5,"['Bars', 'Nightlife', 'Lounges']",1,full_bar,0,,...,,,,,,,,,,
3,-0bUDim5OGuv8R0Qqq6J4A,IHOP,8,Phoenix,2.0,"['Bakeries', 'Food', 'Breakfast & Brunch', 'Re...",1,,0,casual,...,,,,,,,,,,
4,-1bOb2izeJBZjHC7NWxiPA,First Watch,120,Phoenix,4.0,"['Breakfast & Brunch', 'Cafes', 'American (Tra...",1,none,0,casual,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
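
Because `top_bus2` has one row per review, the left merge above repeats each business once for every review it received. If one row per business is wanted instead, a possible approach is to average the phrase counts per business before merging. The sketch below uses small made-up frames (`businesses`, `reviews`) as stand-ins for `select_bus` and `top_bus2`.

```python
import pandas as pd

# Hypothetical stand-ins for select_bus and top_bus2.
businesses = pd.DataFrame({
    'business_id': ['a', 'b', 'c'],
    'city': ['Las Vegas', 'Phoenix', 'Las Vegas'],
})
reviews = pd.DataFrame({
    'business_id': ['a', 'a', 'a', 'b'],
    'dance_floor': [1, 0, 2, 0],
    'bottle_service': [0, 1, 1, 0],
})

# Average the phrase counts over each business's reviews ...
per_business = reviews.groupby('business_id').mean().reset_index()

# ... so the left merge keeps exactly one row per business.
merged = businesses.merge(per_business, how='left', on='business_id')
```

Businesses without any reviews in the review table (business `c` here) keep NaN in the phrase columns, as in the original merge.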


In [102]:
yelp1.columns.values

array(['business_id', 'name', 'review_count', 'city', 'stars',
       'categories', 'credit_cards', 'alcohol', 'touristy', 'attire',
       'coat_check', 'dancing', 'groups', 'latenight', 'happy_hour_x',
       'music_dj', 'music_live', 'noise_level', '24_hours', 'smoking',
       'friday_open', 'saturday_open', 'sunday_open', 'thursday_open',
       'user_id', 'review_id', 'votes_cool', 'votes_funny', 'rev_stars',
       'date', 'bar_food', 'beer_selection', 'best_ve', 'bloody_mary',
       'bottle_service', 'dance_floor', 'friday_night', 'happy_hour_y',
       'hip_hop', 'late_night', 'red_velvet', 'saturday_night', 'vegas'], dtype=object)

In [104]:
yelp1['vegas'] = yelp1.city.map(lambda x: 1 if x == 'Las Vegas' else 0)

In [103]:
yelp1.corr()[['vegas']].sort_values('vegas', ascending = False).head(10)

Unnamed: 0,vegas
vegas,1.0
review_count,0.230769
dancing,0.162582
coat_check,0.155237
music_dj,0.140232
touristy,0.0986
groups,0.066707
dance_floor,0.064292
bottle_service,0.051252
hip_hop,0.046524


In [101]:
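# Note: X and y in this cell come from an earlier run of the modelling cell,
# which used a different feature set (see the column names in the output).
Xcopy = X.copy()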
Xcopy['vegas'] = y
Xcopy.corr()

Unnamed: 0,credit_cards,touristy,coat_check,dancing,groups,latenight,happy_hour,music_dj,music_live,number_of_reviews,vegas
credit_cards,1.0,-0.014151,0.036946,0.011592,0.229118,0.054062,0.059978,0.03102,0.034241,-0.038921,-0.030159
touristy,-0.014151,1.0,0.086021,0.085708,0.021451,0.035741,0.035662,0.046211,0.097311,0.033567,0.097072
coat_check,0.036946,0.086021,1.0,0.355267,0.089224,-0.004747,0.025352,0.38796,-0.003132,0.137138,0.121309
dancing,0.011592,0.085708,0.355267,1.0,0.118832,0.120164,0.152704,0.740819,0.166496,0.068914,0.119518
groups,0.229118,0.021451,0.089224,0.118832,1.0,0.114507,0.250621,0.119978,0.082693,0.053757,-0.002577
latenight,0.054062,0.035741,-0.004747,0.120164,0.114507,1.0,0.300432,0.11726,0.117656,0.006977,0.024017
happy_hour,0.059978,0.035662,0.025352,0.152704,0.250621,0.300432,1.0,0.124073,0.237343,0.012284,-0.06853
music_dj,0.03102,0.046211,0.38796,0.740819,0.119978,0.11726,0.124073,1.0,-0.00676,0.075038,0.134402
music_live,0.034241,0.097311,-0.003132,0.166496,0.082693,0.117656,0.237343,-0.00676,1.0,0.001146,0.00746
number_of_reviews,-0.038921,0.033567,0.137138,0.068914,0.053757,0.006977,0.012284,0.075038,0.001146,1.0,0.115621


In [105]:
formula = ("vegas ~ review_count + dancing + coat_check + music_dj + touristy + "
           "groups + dance_floor + bottle_service + hip_hop - 1")

# patsy drops rows with missing values by default, so y and X can be shorter than yelp1
y, X = patsy.dmatrices(formula, data=yelp1, return_type='dataframe')
y = y.values.ravel()

In [106]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xn = ss.fit_transform(X)

In [107]:
y.shape

(278572,)

In [109]:
cross_val_score(LogisticRegression(), Xn, y, cv=5)

array([ 0.73240599,  0.69485776,  0.73409315,  0.7353089 ,  0.73530415])
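
Accuracy scores like these are easier to judge next to a baseline: the accuracy of always predicting the most common class. A minimal sketch follows. The helper name `majority_baseline` and the toy labels are assumptions for illustration; in the notebook the function would be called on `y`.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most common class in a 0/1 target."""
    y = np.asarray(y)
    positive_rate = y.mean()
    return max(positive_rate, 1 - positive_rate)

# Hypothetical labels: 7 Las Vegas rows out of 10.
toy_y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
baseline = majority_baseline(toy_y)
```

If the cross-validated accuracy is not clearly above `majority_baseline(y)`, the model is adding little beyond the class balance.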

In [110]:
lr = LogisticRegression()

lr_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.linspace(0.0001, 50, 20)
}

lr_gs = GridSearchCV(lr, lr_params, cv=5, verbose=1)

lr_gs.fit(Xn, y)

# taking too long to run, so the search was interrupted manually

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:  2.1min


KeyboardInterrupt: 
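
One way to make a search like this finish is to shrink the problem: fewer candidate values of `C`, spaced on a log scale, and a random subsample of rows for a first pass. The sketch below illustrates the idea on synthetic data from `make_classification`. The subsample size, the grid and the names `X_demo`, `y_demo` and `quick_gs` are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (Xn, y); in the notebook these would be the scaled features.
X_demo, y_demo = make_classification(n_samples=2000, n_features=9, random_state=0)

# Search on a random subsample of rows to get a quick first answer.
rng = np.random.RandomState(0)
rows = rng.choice(len(y_demo), size=500, replace=False)

quick_params = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'C': np.logspace(-3, 2, 6),  # 0.001 to 100, evenly spaced on a log scale
}

# n_jobs=1 keeps the sketch simple; n_jobs=-1 would spread the fits over all cores.
quick_gs = GridSearchCV(LogisticRegression(), quick_params, cv=3, n_jobs=1)
quick_gs.fit(X_demo[rows], y_demo[rows])
best_params = quick_gs.best_params_
```

The best setting from the quick pass can then be refit, or searched more finely, on the full data.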

In [86]:
params = {
   'n_neighbors':range(1,101),
   'weights':['uniform','distance']
}

knn = KNeighborsClassifier()

knn_gs = GridSearchCV(knn, params, cv=5, verbose=1)
knn_gs.fit(Xn, y)

print(knn_gs.best_params_)
best_knn = knn_gs.best_estimator_

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.8s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    2.9s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    7.3s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   14.9s


{'n_neighbors': 99, 'weights': 'uniform'}


[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:   19.9s finished


In [87]:
# best_knn.score without parentheses only displays the bound method;
# the grid search stores the best mean cross-validated accuracy directly.
knn_gs.best_score_

In [None]:
# need to do visualizations and write up
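
For the interpretation step, one possible starting point is the sign and size of the logistic regression coefficients on standardized features. The sketch below runs on synthetic data in which `dancing` is deliberately made more common for `vegas == 1`. The data and the feature names are assumptions for illustration, not results from the Yelp data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 'dancing' is constructed to be more common when vegas == 1,
# while 'groups' is unrelated to the target.
rng = np.random.RandomState(0)
vegas = rng.randint(0, 2, size=1000)
features = pd.DataFrame({
    'dancing': (rng.rand(1000) < 0.2 + 0.5 * vegas).astype(int),
    'groups': rng.randint(0, 2, size=1000),
})

scaled = StandardScaler().fit_transform(features)
model = LogisticRegression().fit(scaled, vegas)

# One coefficient per standardized feature, ordered by absolute size.
coefs = pd.Series(model.coef_[0], index=features.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
```

Calling `coefs.plot(kind='barh')` would turn this into a simple chart. Positive bars mark features associated with Las Vegas, negative bars mark features associated with the other cities.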

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Different categories of ratings

---

Yelp is finally ready to admit that their rating system sucks. No one cares about the ratings, they just use the site to find out what's nearby. The ratings are simply too unreliable for people. 

Yelp hypothesizes that this is, in fact, because different people tend to give their ratings based on different things. They believe that perhaps some people always base their ratings on quality of food, others on service, and perhaps other categories as well. 

1. Do some users tend to talk about service more than others in reviews/tips? Divide up the tips/reviews into more "service-focused" ones and those less concerned with service.
2. Create two new ratings for businesses: ratings from just the service-focused reviews and ratings from the non-service reviews.
3. Construct a regression model for each of the two ratings. They should use the same predictor variables (of your choice). 
4. Validate the performance of the models.
5. Do the models' coefficients differ at all? What does this tell you about the hypothesis that there are in fact two different kinds of ratings?
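
Steps 3 to 5 amount to fitting the same predictors to two different targets and comparing the coefficients. A minimal sketch on synthetic data, using ordinary least squares via `np.linalg.lstsq`, is given below. The predictors, the true coefficients and the helper `ols_coefs` are all made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500

# Hypothetical business-level predictors, with an intercept column first.
X = np.column_stack([np.ones(n), rng.rand(n), rng.rand(n)])

# Synthetic ratings: the second column matters only for the service rating.
service_rating = X.dot([3.0, 1.0, 0.0]) + rng.normal(scale=0.1, size=n)
non_service_rating = X.dot([3.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=n)

def ols_coefs(X, y):
    """Least-squares coefficients for y ~ X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Difference between the two models' coefficients, predictor by predictor.
coef_gap = ols_coefs(X, service_rating) - ols_coefs(X, non_service_rating)
```

A gap that is large relative to its uncertainty for some predictor would support the idea that the two ratings respond to different things. Gaps near zero everywhere would not.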

In [33]:
yelp_rev.head()

Unnamed: 0,user_id,review_id,votes_cool,business_id,votes_funny,stars,date,votes_useful,10_minutes,15_minutes,...,service_great,staff_friendly,super_friendly,sweet_potato,tasted_like,time_vegas,try_place,ve_seen,ve_tried,wait_staff
0,o_LCYay4uo5N4eq3U5pbrQ,biEOCicjWlibF26pNLvhcw,0,EmzaQR5hQlF0WIl24NxAZA,0,3,2007-09-14,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,sEWeeq41k4ohBz4jS_iGRw,tOhOHUAS7XJch7a_HW5Csw,3,EmzaQR5hQlF0WIl24NxAZA,12,2,2008-04-21,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1AqEqmmVHgYCuzcMrF4h2g,2aGafu-x7onydGoDgDfeQQ,0,EmzaQR5hQlF0WIl24NxAZA,2,2,2009-11-16,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,pv82zTlB5Txsu2Pusu__FA,CY4SWiYcUZTWS_T_cGaGPA,4,EmzaQR5hQlF0WIl24NxAZA,9,2,2010-08-16,6,0,0,...,0,0,0,0,0,0,0,0,0,0
4,jlr3OBS1_Y3Lqa-H3-FR1g,VCKytaG-_YkxmQosH4E0jw,0,EmzaQR5hQlF0WIl24NxAZA,1,4,2010-12-04,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
yelp_rev.columns.values

array(['user_id', 'review_id', 'votes_cool', 'business_id', 'votes_funny',
       'stars', 'date', 'votes_useful', '10_minutes', '15_minutes',
       '20_minutes', '30_minutes', 'bar_food', 'beer_selection', 'best_ve',
       'bloody_mary', 'bottle_service', 'chicken_waffles',
       'customer_service', 'dance_floor', 'decided_try', 'definitely_come',
       'definitely_recommend', 'didn_want', 'don_know', 'don_like',
       'don_think', 'don_want', 'eggs_benedict', 'fast_food', 'feel_like',
       'felt_like', 'fish_chips', 'food_amazing', 'food_came',
       'food_delicious', 'food_good', 'food_great', 'food_just',
       'food_service', 'french_fries', 'french_toast', 'friday_night',
       'fried_chicken', 'friendly_staff', 'good_food', 'good_place',
       'good_service', 'good_thing', 'good_time', 'great_atmosphere',
       'great_experience', 'great_food', 'great_place', 'great_service',
       'great_time', 'happy_hour', 'hash_browns', 'highly_recommend',
       'hip_hop', 'ice

In [35]:
yelp_rev.columns = yelp_rev.columns.str.replace('.','_')
yelp_rev.columns = yelp_rev.columns.str.replace(' ','_')
yelp_rev.columns.values

array(['user_id', 'review_id', 'votes_cool', 'business_id', 'votes_funny',
       'stars', 'date', 'votes_useful', '10_minutes', '15_minutes',
       '20_minutes', '30_minutes', 'bar_food', 'beer_selection', 'best_ve',
       'bloody_mary', 'bottle_service', 'chicken_waffles',
       'customer_service', 'dance_floor', 'decided_try', 'definitely_come',
       'definitely_recommend', 'didn_want', 'don_know', 'don_like',
       'don_think', 'don_want', 'eggs_benedict', 'fast_food', 'feel_like',
       'felt_like', 'fish_chips', 'food_amazing', 'food_came',
       'food_delicious', 'food_good', 'food_great', 'food_just',
       'food_service', 'french_fries', 'french_toast', 'friday_night',
       'fried_chicken', 'friendly_staff', 'good_food', 'good_place',
       'good_service', 'good_thing', 'good_time', 'great_atmosphere',
       'great_experience', 'great_food', 'great_place', 'great_service',
       'great_time', 'happy_hour', 'hash_browns', 'highly_recommend',
       'hip_hop', 'ice

In [164]:
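# Group the phrase columns by keyword. A column that mentions service or staff
# counts as service even if it also mentions food (e.g. 'food_service').
service_rev = []
food_rev = []

for column_name in yelp_rev.columns:
    
    if 'service' in column_name or 'staff' in column_name:
        service_rev.append(column_name)
    
    elif 'food' in column_name:
        food_rev.append(column_name)
    
print(service_rev)
print('---------------------\n')
print(food_rev)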

['bottle_service', 'customer_service', 'food_service', 'friendly_staff', 'good_service', 'great_service', 'service_excellent', 'service_food', 'service_friendly', 'service_good', 'service_great', 'staff_friendly', 'wait_staff']
---------------------

['bar_food', 'fast_food', 'food_amazing', 'food_came', 'food_delicious', 'food_good', 'food_great', 'food_just', 'good_food', 'great_food', 'quality_food']


In [165]:
# def yelp_class(row, class1=food_rev, class2=service_rev):
#     if np.sum(row[class1]) > np.sum(row[class2]):
#         row['Class'] = 'food'
#     elif np.sum(row[class1]) < np.sum(row[class2]):
#         row['Class'] = 'service'
#     else:
#         row['Class'] = 'other'
#     return row

In [167]:
# Count food and service phrases in each review
food_sum = yelp_rev[food_rev].sum(axis=1)
service_sum = yelp_rev[service_rev].sum(axis=1)

# Label each review by whichever group is mentioned more often;
# ties (including reviews that mention neither) are labelled 'other'.
yelp_rev['Class'] = np.select([food_sum > service_sum, service_sum > food_sum],
                              ['food', 'service'], default='other')

In [169]:
yelp_rev.Class.value_counts()

other      238328
service     45771
food        38299
Name: Class, dtype: int64

In [171]:
yelp_tip.columns.values

array(['user_id', 'business_id', 'likes', 'date', '24_hours',
       'amazing_food', 'animal_style', 'awesome_food', 'awesome_place',
       'awesome_service', 'beef_hash', 'beer_selection', 'best_breakfast',
       'best_burger', 'best_burgers', 'best_place', 'bloody_mary',
       'bottle_service', 'carne_asada', 'cheese_fries', 'chicken_waffles',
       'come_early', 'cool_place', 'corned_beef', 'customer_service',
       'delicious_food', 'don_come', 'don_forget', 'eggs_benedict',
       'excellent_food', 'excellent_service', 'fast_food', 'fast_service',
       'favorite_place', 'feel_like', 'fish_chips', 'food_amazing',
       'food_awesome', 'food_drinks', 'food_good', 'food_great',
       'food_service', 'free_wifi', 'french_toast', 'fried_chicken',
       'friendly_service', 'friendly_staff', 'gluten_free', 'good_food',
       'good_place', 'good_service', 'great_atmosphere', 'great_beer',
       'great_breakfast', 'great_burgers', 'great_customer',
       'great_drinks', 'great

In [172]:
yelp_tip.columns = yelp_tip.columns.str.replace(' ','_')
yelp_tip.columns.values

array(['user_id', 'business_id', 'likes', 'date', '24_hours',
       'amazing_food', 'animal_style', 'awesome_food', 'awesome_place',
       'awesome_service', 'beef_hash', 'beer_selection', 'best_breakfast',
       'best_burger', 'best_burgers', 'best_place', 'bloody_mary',
       'bottle_service', 'carne_asada', 'cheese_fries', 'chicken_waffles',
       'come_early', 'cool_place', 'corned_beef', 'customer_service',
       'delicious_food', 'don_come', 'don_forget', 'eggs_benedict',
       'excellent_food', 'excellent_service', 'fast_food', 'fast_service',
       'favorite_place', 'feel_like', 'fish_chips', 'food_amazing',
       'food_awesome', 'food_drinks', 'food_good', 'food_great',
       'food_service', 'free_wifi', 'french_toast', 'fried_chicken',
       'friendly_service', 'friendly_staff', 'gluten_free', 'good_food',
       'good_place', 'good_service', 'great_atmosphere', 'great_beer',
       'great_breakfast', 'great_burgers', 'great_customer',
       'great_drinks', 'great

In [173]:
# Same keyword grouping as for the reviews, applied to the tip columns
service_tip = []
food_tip = []

for column_name in yelp_tip.columns:
    
    if 'service' in column_name or 'staff' in column_name:
        service_tip.append(column_name)
    
    elif 'food' in column_name:
        food_tip.append(column_name)
    
print(service_tip)
print('---------------------\n')
print(food_tip)

['awesome_service', 'bottle_service', 'customer_service', 'excellent_service', 'fast_service', 'food_service', 'friendly_service', 'friendly_staff', 'good_service', 'great_service', 'great_staff', 'service_food', 'service_good', 'service_great', 'slow_service', 'staff_friendly', 'staff_great']
---------------------

['amazing_food', 'awesome_food', 'delicious_food', 'excellent_food', 'fast_food', 'food_amazing', 'food_awesome', 'food_drinks', 'food_good', 'food_great', 'good_food', 'great_food', 'love_food']


In [174]:
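# Total food-phrase and service-phrase counts per tip
foodcols = yelp_tip[food_tip]
servicecols = yelp_tip[service_tip]
foodsum = foodcols.sum(axis=1)
servesum = servicecols.sum(axis=1)

# Label each tip by whichever group dominates; ties (including all-zero rows) stay 'other'
id_food = foodsum > servesum
id_serve = servesum > foodsum
# Build the labels on yelp_tip's own index so the assignment aligns row by row
user_id_col = pd.Series('other', index=yelp_tip.index)
user_id_col[id_food] = 'food'
user_id_col[id_serve] = 'service'
yelp_tip['Class'] = user_id_col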
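
The labeling rule above (more food phrases than service phrases, the reverse, or otherwise `'other'`) can also be expressed in one step with `np.select`. This is a minimal sketch on a made-up frame; the `toy` frame, its column selection, and its counts are illustrative and not taken from the Yelp data.

```python
import numpy as np
import pandas as pd

# Toy phrase counts (hypothetical values, not from the Yelp data).
toy = pd.DataFrame({
    'good_food':    [2, 0, 1, 0],
    'great_food':   [1, 0, 0, 0],
    'good_service': [0, 3, 1, 0],
})

food_sum = toy[['good_food', 'great_food']].sum(axis=1)
serve_sum = toy[['good_service']].sum(axis=1)

# np.select takes the first matching condition per row; ties and all-zero
# rows match neither condition and receive the default label.
toy['Class'] = np.select(
    [food_sum > serve_sum, serve_sum > food_sum],
    ['food', 'service'],
    default='other',
)
print(toy['Class'].tolist())
```

Because ties and rows with no matching phrases both end up as `'other'`, that class will usually dominate, which matches the `value_counts()` output in the next cell.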

In [175]:
yelp_tip.Class.value_counts()

other      94774
service     3753
food        3464
Name: Class, dtype: int64

In [180]:
yelp_rev2 = yelp_rev.groupby(['user_id', 'Class']).mean().reset_index()

In [181]:
yelp_rev2.head()

| | user_id | Class | votes_cool | votes_funny | stars | votes_useful | 10_minutes | 15_minutes | 20_minutes | 30_minutes | ... | service_great | staff_friendly | super_friendly | sweet_potato | tasted_like | time_vegas | try_place | ve_seen | ve_tried | wait_staff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | --0HEXd4W6bJI8k7E0RxTA | other | 1.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | --2QZsyXGz1OhiD4-0FQLQ | other | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | --4TkB_iDShmg41Y_QW9nw | food | 0.0 | 0.4 | 4.6 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | --4TkB_iDShmg41Y_QW9nw | other | 0.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | --4TkB_iDShmg41Y_QW9nw | service | 0.0 | 0.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [182]:
yelp_tip2 = yelp_tip.groupby(['user_id', 'Class']).mean().reset_index()

In [183]:
yelp_tip2.head()

| | user_id | Class | likes | 24_hours | amazing_food | animal_style | awesome_food | awesome_place | awesome_service | beef_hash | ... | service_good | service_great | slow_service | staff_friendly | staff_great | steak_eggs | super_friendly | sweet_potato | velvet_pancakes | worth_wait |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | --2QZsyXGz1OhiD4-0FQLQ | other | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | --4fX3LBeXoE88gDTK6TKQ | other | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | --65q1FpAL_UQtVZ2PTGew | other | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | --FpxrGL-a82dkgrWZLn5Q | service | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | --JC6ri9lh1-tazGOFA3yg | other | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [184]:
yelp2 = pd.merge(yelp_rev2, yelp_tip2, on=['user_id', 'Class'])

In [190]:
yelp2.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| votes_cool | 22621.0 | 0.569076 | 1.582946 | 0.0 | 0.000000 | 0.0 | 0.666667 | 54.0 |
| votes_funny | 22621.0 | 0.479484 | 1.471428 | 0.0 | 0.000000 | 0.0 | 0.500000 | 52.0 |
| stars | 22621.0 | 3.831058 | 1.138486 | 1.0 | 3.233333 | 4.0 | 5.000000 | 5.0 |
| votes_useful | 22621.0 | 0.954471 | 1.866884 | 0.0 | 0.000000 | 0.5 | 1.000000 | 60.0 |
| 10_minutes | 22621.0 | 0.010811 | 0.085621 | 0.0 | 0.000000 | 0.0 | 0.000000 | 3.0 |
| 15_minutes | 22621.0 | 0.009745 | 0.086029 | 0.0 | 0.000000 | 0.0 | 0.000000 | 4.0 |
| 20_minutes | 22621.0 | 0.009077 | 0.078212 | 0.0 | 0.000000 | 0.0 | 0.000000 | 2.0 |
| 30_minutes | 22621.0 | 0.007784 | 0.073657 | 0.0 | 0.000000 | 0.0 | 0.000000 | 2.0 |
| bar_food | 22621.0 | 0.003028 | 0.049125 | 0.0 | 0.000000 | 0.0 | 0.000000 | 3.0 |
| beer_selection_x | 22621.0 | 0.009022 | 0.077214 | 0.0 | 0.000000 | 0.0 | 0.000000 | 2.0 |
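
The `_x` suffix on `beer_selection_x` comes from `pd.merge`: when both frames share a column that is not a merge key, pandas keeps both copies and by default appends `_x` (left frame) and `_y` (right frame). A small sketch with made-up values shows this and how the `suffixes` argument could make the origin of each column explicit. The `rev` and `tip` frames and the `_rev`/`_tip` suffix names are my own illustrative choices.

```python
import pandas as pd

cols = ['user_id', 'Class', 'beer_selection']

# Two tiny frames that share a non-key column (hypothetical values).
rev = pd.DataFrame({'user_id': ['a', 'b'], 'Class': ['food', 'other'],
                    'beer_selection': [0.5, 0.0]}, columns=cols)
tip = pd.DataFrame({'user_id': ['a', 'c'], 'Class': ['food', 'other'],
                    'beer_selection': [1.0, 0.0]}, columns=cols)

# Default suffixes: the shared column becomes beer_selection_x / beer_selection_y.
default_merge = pd.merge(rev, tip, on=['user_id', 'Class'])
print(list(default_merge.columns))

# Explicit suffixes record which dataset each column came from.
named_merge = pd.merge(rev, tip, on=['user_id', 'Class'], suffixes=('_rev', '_tip'))
print(list(named_merge.columns))

# The default how='inner' keeps only (user_id, Class) pairs present in both frames.
print(len(default_merge))
```

Since the default join is an inner join, users who appear in only one of the two datasets are dropped from the merged frame.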


In [None]:
# My next steps would be to:
#
# - group the datasets on user_id and Class and take the .mean()
# - merge the datasets on user_id and Class
# - run regressions on the service and food datasets, with stars as the target
# - use linear regression, since the variables are continuous
# - visualize the regression results
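
The regression step in the plan above could look roughly like the following. It is a sketch on synthetic data, not the Yelp frames: the feature matrix `X`, the `stars` target, and the coefficients used to generate them are all made up for illustration. Note that averaged star ratings are bounded between 1 and 5, so a linear model is only an approximation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-in for per-user mean phrase counts (hypothetical data).
X = rng.poisson(0.3, size=(200, 3)).astype(float)
# Synthetic mean star rating: rises with column 0, falls with column 1,
# then clipped to the 1-5 range that star ratings live in.
noise = rng.normal(0, 0.5, size=200)
stars = np.clip(3.5 + 0.8 * X[:, 0] - 0.6 * X[:, 1] + noise, 1, 5)

model = LinearRegression()
# 5-fold cross-validated R^2 is a less optimistic estimate than the in-sample fit.
scores = cross_val_score(model, X, stars, cv=5, scoring='r2')
print('mean cross-validated R^2: {:.3f}'.format(scores.mean()))

# Fit on all rows to inspect the direction and size of each coefficient.
model.fit(X, stars)
print('coefficients: {}'.format(np.round(model.coef_, 2)))
```

On the real data, the same pattern would be run once on the rows labeled `food` and once on the rows labeled `service`, so the two sets of coefficients can be compared.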

In [None]:
# The following was my first approach. I realized that this was not accurately separating the reviews into
# service and food. 

In [73]:
# yelp_service = yelp_rev[service_rev]

In [77]:
# yelp_service = yelp_service.drop(['good_service','service_food', 'service_friendly', 'service_great', 
#                                   'staff_friendly'], axis=1)

In [80]:
# yelp_food = yelp_rev[food_rev]

In [86]:
# yelp_food = yelp_food.drop(['food_good','food_great'], axis=1)

In [93]:
# yelp_service2 = yelp_tip[service_tip]

In [99]:
# yelp_service2 = yelp_service2.drop(['service_food','service_good', 'service_great', 'staff_friendly', 
#                                     'staff_great'], axis=1)

In [102]:
# yelp_food2 = yelp_tip[food_tip]

In [107]:
# yelp_food2 = yelp_food2.drop(['food_amazing','food_awesome', 'food_good', 'food_great'], axis=1)

In [151]:
# almost_clean1 = yelp_service.groupby(['user_id', 'business_id']).mean().reset_index()

In [152]:
# almost_clean2 = yelp_service2.groupby(['user_id', 'business_id']).mean().reset_index()

In [153]:
# almost_clean3 = yelp_food.groupby(['user_id', 'business_id']).mean().reset_index()

In [154]:
# almost_clean4 = yelp_food2.groupby(['user_id', 'business_id']).mean().reset_index()

In [155]:
# Merging my datasets after grouping by userid
# yelp2 = pd.merge(almost_clean1, almost_clean2, on='user_id')
# yelp2 = yelp_service.merge(yelp_service2, how='left', on='user_id')

In [156]:
# yelp3 = pd.merge(almost_clean3, almost_clean4, on='user_id')

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Identifying "elite" users

---

Yelp has its own formula for determining whether a user is elite, but it is interested in what differentiates an elite user from a normal user at a broader level.

Use a classification model to predict whether a user is elite or not. Note that users can be elite in some years and not in others.

1. What things predict well whether a user is elite or not?
2. Validate the model.
3. If you were to remove the "counts" metrics for users (reviews, votes, compliments), what distinguishes an elite user, if anything? Validate the model and compare it to the one with the count variables.
4. Think of a way to visually represent your results in a compelling way.
5. Give a brief write-up of your findings.
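
One possible way to compare a model with the count variables against one without them is sketched below. It uses synthetic users, not the Yelp data: the column names (`compliments_total`, `average_stars`, `years_on_yelp`), the `is_elite` label, and the rule that generates it are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n = 500

# Synthetic users (hypothetical columns and values).
users = pd.DataFrame({
    'review_count': rng.poisson(20, n),
    'compliments_total': rng.poisson(2, n),
    'average_stars': rng.uniform(1, 5, n),
    'years_on_yelp': rng.uniform(0, 10, n),
})
# Made-up labeling rule: heavier reviewers are more likely to be elite.
logit = 0.4 * (users['review_count'] - 20) + 1.0 * (users['compliments_total'] - 2)
users['is_elite'] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

count_cols = ['review_count', 'compliments_total']
other_cols = ['average_stars', 'years_on_yelp']
y = users['is_elite']


def cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy of a scaled logistic regression."""
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    X = users[columns].astype(float)
    return cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()


# Accuracy of always guessing the majority class, as a reference point.
baseline = max(y.mean(), 1 - y.mean())
acc_all = cv_accuracy(count_cols + other_cols)
acc_no_counts = cv_accuracy(other_cols)

print('baseline accuracy:      {:.3f}'.format(baseline))
print('with count features:    {:.3f}'.format(acc_all))
print('without count features: {:.3f}'.format(acc_no_counts))
```

Reporting both scores next to the majority-class baseline makes it clear how much of the predictive power comes from the count variables and whether the remaining features beat simple guessing at all.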


In [None]:
# Need to roadmap my plan. # prioritized capstone questions here. 

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Find something interesting on your own

---

You want to impress your superiors at Yelp by doing some investigation into the data on your own. You want to do classification, but you're not sure on what.

1. Create a hypothesis or hypotheses about the data based on whatever you are interested in, as long as it is predicting a category of some kind (classification).
2. Explore the data visually (ideally related to this hypothesis).
3. Build one or more classification models to predict your target variable. **Your modeling should include gridsearching to find optimal model parameters.**
4. Evaluate the performance of your model. Explain why your model may have chosen those specific parameters during the gridsearch process.
5. Write up what the model tells you. Does it validate or invalidate your hypothesis? Write this up as if for a non-technical audience.
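
The gridsearching step in item 3 might be sketched as follows. The data here comes from scikit-learn's `make_classification`, and the choice of a k-nearest-neighbors classifier and this particular parameter grid are assumptions for illustration, not a recommendation for the Yelp data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (stand-in for a real target).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Candidate hyperparameters: neighborhood size and vote weighting.
param_grid = {
    'n_neighbors': [3, 5, 11, 21],
    'weights': ['uniform', 'distance'],
}

# Every combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

# The held-out test set is touched only once, after the search has finished.
test_acc = search.score(X_test, y_test)
print('best parameters: {}'.format(search.best_params_))
print('best CV accuracy: {:.3f}'.format(search.best_score_))
print('test accuracy: {:.3f}'.format(test_acc))
```

For item 4, `search.cv_results_` holds the mean score of every parameter combination, which helps explain the choice: for example, a small `n_neighbors` fits noise, while a very large one smooths away real structure.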

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. ROC and Precision-recall

---

Some categories have fewer overall businesses than others. Choose two categories of businesses to predict: one that makes the proportion of target classes as even as possible, and another that has very few businesses and thus makes the target variable imbalanced.

1. Create two classification models predicting these categories. Optimize the models and choose variables as you see fit.
2. Make confusion matrices for your models. Describe the confusion matrices and explain what they tell you about your models' performance.
3. Make ROC curves for both models. What do the ROC curves describe and what do they tell you about your model?
4. Make Precision-Recall curves for the models. What do they describe? How do they compare to the ROC curves?
5. Explain when Precision-Recall may be preferable to ROC. Is that the case in either of your models?
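
The steps above can be sketched on synthetic data, with one roughly balanced target and one imbalanced target. The `evaluate` helper, the class proportions, and the use of logistic regression are assumptions for illustration; the business categories themselves are not modeled here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_curve, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split


def evaluate(positive_share):
    """Fit a classifier on synthetic data with the given share of positives."""
    X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                               weights=[1 - positive_share, positive_share],
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = LogisticRegression().fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, proba)
    precision, recall, _ = precision_recall_curve(y_test, proba)
    return {
        # Rows are true classes, columns are predicted classes.
        'confusion': confusion_matrix(y_test, clf.predict(X_test)),
        'roc_auc': roc_auc_score(y_test, proba),
        'avg_precision': average_precision_score(y_test, proba),
        'positive_rate': y_test.mean(),
        'roc_points': (fpr, tpr),          # x, y values for a ROC plot
        'pr_points': (recall, precision),  # x, y values for a PR plot
    }


balanced = evaluate(0.5)
imbalanced = evaluate(0.05)

for name, result in [('balanced', balanced), ('imbalanced', imbalanced)]:
    print(name)
    print(result['confusion'])
    print('ROC AUC: {:.3f}  average precision: {:.3f}  positive rate: {:.3f}'.format(
        result['roc_auc'], result['avg_precision'], result['positive_rate']))
```

A classifier that guesses at random scores about 0.5 ROC AUC regardless of class balance, whereas its precision is roughly the positive rate. That is why the Precision-Recall view tends to be more informative when the positive class is rare. The `roc_points` and `pr_points` arrays can be passed to a plotting call such as matplotlib's `plt.plot` to draw the curves.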