# Overview

One of the most challenging parts of this project is measuring success. I define the success metric as $F(Rating, \frac{Number\_Reviews}{Month}, Time)$. The parameterization of this function, however, is far from clear.

## Desired Traits of the Function
The function should have the following traits:

1. A business that has been open for a long time and has a high rating should probably count as successful: it has stayed open, and the people who review it like it. A low review count may just mean that it is a low-volume, high-margin business, or that the people who frequent the establishment do not submit Yelp reviews often.
2. A business with a large number of reviews and high rating should always be rated as successful.
3. A business with a high number of reviews over a short amount of time should count as successful regardless of its rating, because it is clearly making money despite the poor reviews.


## Standardize

It makes sense to standardize the three inputs so that every value is between 1 and 10 with the same standard deviation. The metric then picks the highest two values and multiplies them together.
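
The cells further down implement the standardization as a z-score multiplied by a constant, which centers the values at 0 rather than bounding them to a fixed range. A minimal sketch of that step, assuming a hypothetical helper `standardize` and toy ratings that are not taken from the dataset:

```python
import pandas as pd


def standardize(values: pd.Series, target_std: float = 2.0) -> pd.Series:
    """Rescale to mean 0 and standard deviation `target_std` (hypothetical helper)."""
    # pandas computes the sample standard deviation (ddof=1) by default.
    z_scores = (values - values.mean()) / values.std()
    return z_scores * target_std


# Toy ratings, made up for illustration.
ratings = pd.Series([2.0, 3.0, 3.5, 4.0, 5.0])
scaled = standardize(ratings)
print(scaled.round(3).tolist())
```

Because all three inputs end up with the same spread, no single input dominates the product just because of its units.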

# Doing it

In [23]:
import pandas as pd
%pylab inline
import arrow


Populating the interactive namespace from numpy and matplotlib


In [3]:
combined_data = pd.read_hdf('../data/restaurant_reviews.hdf')

In [4]:
combined_data.columns

Index(['date', 'review_id', 'text', 'user_id', 'city', 'latitude', 'longitude',
       'name', 'neighborhoods', 'stars', 'hours'],
      dtype='object')

## Calculating Reviews/Month

In [41]:
num_reviews = combined_data.pivot_table('stars', index='name', aggfunc=len)

In [42]:
num_reviews.head()

name
#1 Brothers Pizza       26
#1 Hawaiian Barbecue     9
#1 Sushi                 7
#1Brothers Pizza        25
1 Brother's Pizza        8
Name: stars, dtype: int64

In [20]:
earliest_review = combined_data.pivot_table('date', index='name', aggfunc=np.min)

In [43]:
earliest_review.head()

name
#1 Brothers Pizza      2010-01-31
#1 Hawaiian Barbecue   2014-08-21
#1 Sushi               2014-03-15
#1Brothers Pizza       2010-08-09
1 Brother's Pizza      2010-10-18
Name: date, dtype: datetime64[ns]

In [34]:
from datetime import datetime
# Age of each business, measured from its earliest review until now
age = datetime.now() - earliest_review

In [52]:
# Cast the age to months, which is a more meaningful unit. This also converts it to float.
age = age.astype('timedelta64[M]')

In [53]:
reviews_per_month = num_reviews / age
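
The whole reviews-per-month calculation can be sketched end to end on toy data. This is an illustration under stated assumptions, not the notebook's exact code: the data frame `reviews` is made up, the reference date is fixed instead of using the current time so the numbers are reproducible, and months are approximated as 30.4375 days by dividing by a `pd.Timedelta`. As far as I know, recent pandas versions may not accept the cast to `'timedelta64[M]'`, which is why the sketch avoids it.

```python
import pandas as pd

# Toy reviews with the same column names as combined_data (hypothetical values).
reviews = pd.DataFrame({
    "name": ["A", "A", "A", "B"],
    "stars": [5, 4, 3, 2],
    "date": pd.to_datetime(["2015-01-01", "2015-03-01", "2015-06-01", "2015-05-01"]),
})

# Fixed reference date (assumption; the notebook measures age up to "now").
reference_date = pd.Timestamp("2015-11-01")

grouped = reviews.groupby("name")
num_reviews = grouped["stars"].count()    # number of reviews per business
earliest_review = grouped["date"].min()   # date of the first review per business

# Age in months as a float, using an average month length of 30.4375 days.
age_months = (reference_date - earliest_review) / pd.Timedelta(days=30.4375)

reviews_per_month = num_reviews / age_months
print(reviews_per_month)
```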

In [54]:
reviews_per_month.describe()

count    15321.000000
mean         0.964815
std          2.171169
min          0.008772
25%          0.123077
50%          0.320000
75%          0.931034
max         69.921053
dtype: float64

In [205]:
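# Standardize (with an offset of 4 added before dividing), then scale to a standard deviation of 2.
# Vectorized, so the mean and std are computed once rather than once per row.
normalized_review_freq = (reviews_per_month - reviews_per_month.mean() + 4) / reviews_per_month.std() * 2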

In [178]:
normalized_review_freq.describe()

count    15321.000000
mean        59.211628
std          5.000000
min         57.009950
25%         57.273184
50%         57.726679
75%         59.133835
max        218.011423
dtype: float64

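The cells were run out of order (compare the cell numbers), so the summary above comes from an earlier version of the normalization: it shows a standard deviation of 5, a mean of about 59.2, and a maximum of about 218. The cell as written scales to a standard deviation of 2.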

## Getting Average Rating

In [77]:
avg_rating = combined_data.pivot_table('stars', index='name')

In [79]:
avg_rating.describe()

count    15321.000000
mean         3.604453
std          0.671107
min          1.000000
25%          3.200000
50%          3.666667
75%          4.055556
max          5.000000
Name: stars, dtype: float64

In [206]:
# Standardize to mean 0, then scale to a standard deviation of 2
avg_rating = (avg_rating - avg_rating.mean()) / avg_rating.std() * 2

In [176]:
avg_rating.describe()

count    15321.000000
mean        54.000000
std          5.000000
min         34.595844
25%         50.986673
50%         54.463516
75%         57.360885
max         64.397352
Name: stars, dtype: float64

## Getting Age

We already had to get this in order to calculate the reviews per month metric. 

In [84]:
age.head()

name
#1 Brothers Pizza       70
#1 Hawaiian Barbecue    15
#1 Sushi                20
#1Brothers Pizza        63
1 Brother's Pizza       61
Name: date, dtype: float64

In [207]:
# Standardize to mean 0, then scale to a standard deviation of 2.
# Negative values remain: they mark businesses younger than average.
age = (age - age.mean()) / age.std() * 2

In [190]:
age.describe()

count    1.532100e+04
mean     7.897142e-17
std      5.000000e+00
min     -9.620604e+00
25%     -4.077868e+00
50%      7.918323e-02
75%      3.889814e+00
max      1.151108e+01
Name: date, dtype: float64

# Coming up with our success metric

In [92]:
df = pd.DataFrame(dict(age=age, avg_rating=avg_rating, reviews_per_month=reviews_per_month))

In [97]:
df.head()

name,age,avg_rating,reviews_per_month
#1 Brothers Pizza,1.197629,7.040162,0.371429
#1 Hawaiian Barbecue,-17.855524,-5.695525,0.6
#1 Sushi,-16.123419,3.765271,0.35
#1Brothers Pizza,-1.227317,-18.543287,0.396825
1 Brother's Pizza,-1.920159,-7.14421,0.131148


In [100]:
df.values

array([[  1.19762936e+00,   7.04016217e+00,   3.71428571e-01],
       [ -1.78555239e+01,  -5.69552521e+00,   6.00000000e-01],
       [ -1.61234191e+01,   3.76527113e+00,   3.50000000e-01],
       ..., 
       [ -8.84857873e+00,  -1.49671056e+01,   1.21951220e-01],
       [  4.31541808e+00,   2.07947046e+01,   1.26582278e-02],
       [ -2.95942226e+00,   1.43464249e+01,   2.94827586e+00]])

In [99]:
np.argsort(df.values, axis=1)

array([[2, 0, 1],
       [0, 1, 2],
       [0, 2, 1],
       ..., 
       [1, 0, 2],
       [2, 0, 1],
       [0, 2, 1]])

In [101]:
np.sort(df.values, axis=1)

array([[  3.71428571e-01,   1.19762936e+00,   7.04016217e+00],
       [ -1.78555239e+01,  -5.69552521e+00,   6.00000000e-01],
       [ -1.61234191e+01,   3.50000000e-01,   3.76527113e+00],
       ..., 
       [ -1.49671056e+01,  -8.84857873e+00,   1.21951220e-01],
       [  1.26582278e-02,   4.31541808e+00,   2.07947046e+01],
       [ -2.95942226e+00,   2.94827586e+00,   1.43464249e+01]])

array([[ -1.92412078e+01,  -1.92412078e+01,  -1.92412078e+01, ...,
          2.30221504e+01,   2.30221504e+01,   2.30221504e+01],
       [ -3.88083124e+01,  -3.88083124e+01,  -3.88083124e+01, ...,
          2.07947046e+01,   2.07947046e+01,   2.07947046e+01],
       [  8.77192982e-03,   9.09090909e-03,   9.52380952e-03, ...,
          5.68135593e+01,   6.48285714e+01,   6.99210526e+01]])

In [184]:
def define_success(age, avg_rating, reviews_per_month):
    """
    Defines success according to the following metric:

    The product of the highest 2 of the 3 values Age, Average Rating, and
    Reviews/Month, where each value is expected to be standardized beforehand
    to a common scale (mean 0 and the same standard deviation).
    """
    df = pd.DataFrame(dict(age=age, avg_rating=avg_rating,
                           reviews_per_month=reviews_per_month))
    # Sort each row in ascending order so the two largest values are in the last two columns
    sorted_vals = np.sort(df.values, axis=1)
    df['success_metric'] = sorted_vals[:, -1] * sorted_vals[:, -2]

    # Edge case: if the second-highest value is negative, the product is either
    # negative or a misleading positive (two negatives), so set the metric to 0.
    df.loc[sorted_vals[:, -2] < 0, 'success_metric'] = 0

    return df
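
To check the rule against the three desired traits, here is a scalar sketch of the same idea. The function `top_two_product` is a hypothetical stand-in for `define_success` that works on single numbers, and the inputs are made-up standardized values where 0 means an average business.

```python
import numpy as np


def top_two_product(age, avg_rating, reviews_per_month):
    """Product of the two largest inputs, set to 0 when the second-largest is negative."""
    ordered = np.sort([age, avg_rating, reviews_per_month])  # ascending order
    if ordered[-2] < 0:
        return 0.0
    return float(ordered[-1] * ordered[-2])


# Trait 1: old and well liked, but rarely reviewed.
old_and_liked = top_two_product(age=3.0, avg_rating=2.5, reviews_per_month=-0.5)
# Trait 2: heavily reviewed and well liked.
busy_and_liked = top_two_product(age=-1.0, avg_rating=2.0, reviews_per_month=4.0)
# Trait 3: heavily reviewed over a short time, with poor ratings.
busy_but_disliked = top_two_product(age=-1.5, avg_rating=-2.0, reviews_per_month=6.0)

print(old_and_liked, busy_and_liked, busy_but_disliked)
```

With these toy numbers the first two cases score 7.5 and 8.0, while the third scores 0 because both age and rating are below average. This suggests that trait 3 is only met when at least one of those two values is non-negative.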
    

In [208]:
df = define_success(age, avg_rating, reviews_per_month)

In [209]:
df.describe()

,age,avg_rating,reviews_per_month,success_metric
count,15321.0,15321.0,15321.0,15321.0
mean,-3.177552e-17,6.188437000000001e-17,0.964815,1.872049
std,2.0,2.0,2.171169,4.842943
min,-3.848242,-7.761662,0.008772,0.0
25%,-1.631147,-1.205331,0.123077,0.039397
50%,0.03167329,0.1854064,0.32,0.397027
75%,1.555926,1.344354,0.931034,1.799411
max,4.60443,4.158941,69.921053,134.04398


In [196]:
success = df['success_metric']

In [199]:
combined_data = combined_data.join(success, on='name')
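
The join above attaches the per-business metric to every review row of that business. A small sketch on made-up data (the frame `reviews` and the metric values are hypothetical) shows the effect:

```python
import pandas as pd

# One row per review; the business name repeats.
reviews = pd.DataFrame({"name": ["A", "B", "A"], "stars": [5, 2, 4]})

# One value per business, indexed by name. The Series needs a name,
# which becomes the new column in the joined frame.
success = pd.Series({"A": 7.5, "B": 0.0}, name="success_metric")
success.index.name = "name"

# Look up each review's business name in the index of `success`.
labelled = reviews.join(success, on="name")
print(labelled)
```

Since the lookup key is the business name, two businesses that share a name would also share a metric.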

In [211]:
combined_data.to_hdf('../data/d_success', 'df')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['review_id', 'text', 'user_id', 'city', 'name', 'neighborhoods', 'hours']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)


# Building a DataFrame with just business features

In [214]:
businesses = pd.read_hdf('../data/businesses.hdf')

In [215]:
businesses

,business_id,categories,city,full_address,hours,latitude,longitude,name,neighborhoods,open,review_count,stars,state,type
0,UsFtqoBl7naz8AVUBZMjQQ,[Nightlife],Dravosburg,"202 McClure St\nDravosburg, PA 15034",{},40.350519,-79.886930,Clancy's Pub,[],True,4,3.5,PA,business
1,cE27W9VPgO88Qxe4ol6y_g,"[Active Life, Mini Golf, Golf]",Bethel Park,"1530 Hamilton Rd\nBethel Park, PA 15234",{},40.356896,-80.015910,Cool Springs Golf Center,[],False,5,2.5,PA,business
2,HZdLhv6COCleJMo7nPl-RA,"[Shopping, Home Services, Internet Service Pro...",Pittsburgh,"301 S Hills Vlg\nPittsburgh, PA 15241","{'Thursday': {'open': '10:00', 'close': '21:00...",40.357620,-80.059980,Verizon Wireless,[],True,3,3.5,PA,business
3,mVHrayjG3uZ_RLHkLj-AMg,"[Bars, American (New), Nightlife, Lounges, Res...",Braddock,"414 Hawkins Ave\nBraddock, PA 15104","{'Thursday': {'open': '10:00', 'close': '19:00...",40.408735,-79.866351,Emil's Lounge,[],True,11,4.5,PA,business
4,KayYbHCt-RkbGcPdGOThNg,"[Bars, American (Traditional), Nightlife, Rest...",Carnegie,"141 Hawthorne St\nGreentree\nCarnegie, PA 15106",{},40.415517,-80.067534,Alexion's Bar & Grill,[Greentree],True,15,4.0,PA,business
5,b12U9TFESStdy7CsTtcOeg,"[Auto Repair, Automotive]",Carnegie,"718 Hope Hollow Rd\nCarnegie, PA 15106",{},40.394588,-80.084454,Flynn's E W Tire Service Center,[],True,5,1.5,PA,business
6,Sktj1eHQFuVa-M4bgnEh8g,"[Active Life, Mini Golf]",Carnegie,"920 Forsythe Rd\nCarnegie\nCarnegie, PA 15106",{},40.405404,-80.076267,Forsythe Miniature Golf & Snacks,[Carnegie],True,4,4.0,PA,business
7,3ZVKmuK2l7uXPE6lXY4Dbg,"[Home Services, Contractors]",Carnegie,"8 Logan St\nCarnegie\nCarnegie, PA 15106",{},40.406324,-80.090357,Quaker State Construction,[Carnegie],True,3,2.5,PA,business
8,wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{'Thursday': {'open': '08:00', 'close': '02:00...",40.387732,-80.092874,Kings Family Restaurant,[],True,8,3.5,PA,business
9,yXuao0pFz1AxB21vJjDf5w,"[Food, Grocery]",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106",{},40.387732,-80.092874,Shop N'save,[],True,3,3.5,PA,business
