# Final Exam
After a lot of trial and error, I realized that this work needed to be reproducible (something I should have known from the start), so I switched over to Jupyter Notebooks. I hope this is acceptable.

## The idea:

### Suggesting venue improvements:
I figured we could identify areas where low-ranking venues (by number of stars) could improve in order to maximize their star increase. E.g., a coffee shop without Wi-Fi might want to get Wi-Fi. As it turns out, offering paid Wi-Fi is not a good idea for such a coffee shop (go big or go home, I guess).

### Accounting for venue style:
It might be that certain types of places (as seen through the 'Ambience' attribute) get rated lower because of the nature of the ambience. For example, if a yelper goes to a dive bar and finds few if any amenities, (s)he might give the bar a low score. Since the typical user expects dive bars to lack amenities, this review is less informative than reviews from users whose expectations match the venue.

On the other hand, another yelper might go to an amenity-filled "dive bar" and rate it very highly because of the amenities. This latter case suggests that ambience should be voted on and set by the Yelp community rather than by the establishment itself.

Assuming that the categories are set as described above, we might be able to address the misaligned-expectation review problem with a weighted average. If we want to get really crazy, which I do, we could add a bit of natural language processing to the mix. For instance, negative reviews of dive bars that complain about missing amenities would be weighted less, and positive reviews that praise amenities (rather than the lack thereof) might also be weighted less.
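
As a rough illustration of the weighted-average idea, here is a minimal sketch. The review tuples, the mismatch flag, and the 0.5 down-weight are all made up for illustration; in practice the flag would have to come from the NLP step and the weight would need tuning.

```python
def weighted_stars(reviews, mismatch_weight=0.5):
    """Weighted mean of star ratings.

    Each review is a (stars, expectation_mismatch) pair. Reviews flagged as
    mismatched count for `mismatch_weight` instead of 1.0. Both the flag and
    the default weight are hypothetical placeholders.
    """
    total = 0.0
    weight_sum = 0.0
    for stars, mismatched in reviews:
        w = mismatch_weight if mismatched else 1.0
        total += w * stars
        weight_sum += w
    if weight_sum == 0:
        raise ValueError("no reviews with positive weight")
    return total / weight_sum


# Hypothetical dive bar: two of the four reviews complain about missing amenities
dive_bar_reviews = [(4, False), (5, False), (1, True), (2, True)]
plain = sum(s for s, _ in dive_bar_reviews) / len(dive_bar_reviews)
adjusted = weighted_stars(dive_bar_reviews)
print(plain, adjusted)
```

With these made-up numbers the plain mean is pulled down by the two flagged reviews, while the weighted mean sits closer to what the unflagged reviewers reported.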

I suspect this feature would require some A/B testing.

As for extending this feature to more than one category, it could be done through surveys of user expectations. For instance, in the dive bar scenario, Yelp could present users who frequent dive bars in each region with questionnaires asking which features they expect a dive bar to have, and which they expect it not to have.

### Find [bars] within [0.25] miles of [Wi-Fi]:
Have you ever been working late at night and wished there was a bar or pub you could work from? I know I generally do my best work with a beer. The main deterrent is always the lack of Wi-Fi at these fine establishments.

In part 1, we noticed that Wi-Fi availability did not really affect the star rating of bars. So we can use our trusty 2dsphere index to find bars that have other venues close by with free Wi-Fi. The idea is that if the Wi-Fi signal is strong/close enough, one might conceivably be able to use said Wi-Fi while patronizing said bar.
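
To make the 0.25-mile idea concrete without a database, here is a small client-side sketch using the haversine formula. It is only a stand-in for the 2dsphere query, not a replacement for it. The helper names are made up, and the coordinates are copied from the wifiLocs output further down in this notebook, with Dave & Buster's playing the role of the bar.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def wifi_within(bar, wifi_venues, max_miles=0.25):
    """Names of Wi-Fi venues within max_miles of the bar's [lon, lat] point."""
    return [name for name, (lon, lat) in wifi_venues
            if haversine_miles(bar[0], bar[1], lon, lat) <= max_miles]


# GeoJSON order is [longitude, latitude]
dave_and_busters = [-79.9141865, 40.4092932]
wifi_venues = [("Uno Pizzeria & Grill", [-79.9141364, 40.410967]),
               ("McDonald's", [-79.9098858, 40.412012]),
               ("Alexion's Bar & Grill", [-80.0675491, 40.4154859])]
nearby = wifi_within(dave_and_busters, wifi_venues)
print(nearby)
```

Only the closest of the three candidates falls inside the radius here; whether a Wi-Fi signal actually reaches that far is a separate question that distance alone cannot answer.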

## Areas for improvement:

1. Add natural language processing (possibly via nltk available at nltk.org) to classify customer types from the language used in reviews, and suggest amenities that these customers tend to like.

## Areas where I'm exceedingly stupid:
1. Life
1. I should have realized that, since this goal is so vague, I was going to spend most of my time on EDA. So I don't feel as though I've accomplished much beyond eliminating ideas.
2. I should have realized that long ago. I don't think I have too much time to work on number 2.

# Basics:

In [1]:
import numpy as np
import json
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
from pymongo import MongoClient
from bson import SON

In [2]:
db = MongoClient().yelp_aaron

In [3]:
business = db.business

In [4]:
def getDF(table, predicate, preprocessor=lambda x: x, projection=None):
    # Iterate over the cursor once; indexing it (tmp[i]) re-issues the query
    # for every record, and Cursor.count() is gone from recent pymongo releases.
    records = [preprocessor(rec) for rec in table.find(predicate, projection)]
    print("Found {r} records matching this predicate.".format(r=len(records)))
    return pd.DataFrame.from_records(records)

## How does Wi-Fi affect bars as opposed to coffee-shops?

In [5]:
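def getWifi(rec):
    # Flatten the nested Wi-Fi attribute into a top-level 'wifi' column;
    # records without the attribute (or without any attributes) get 'Unknown'.
    rec['wifi'] = rec.get('attributes', {}).get('Wi-Fi', 'Unknown')
    rec.pop('attributes', None)
    return rec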

### Bars:

In [6]:
bars = getDF(business,
             {'categories': {'$in': ['Bars']}},
             projection={'attributes.Wi-Fi': 1,
                         'city': 1,
                         'name': 1,
                         'stars': 1},
             preprocessor=getWifi
            )

Found 4727 records matching this predicate.


In [7]:
bars

Unnamed: 0,_id,city,name,stars,wifi
0,5853170222e2fc4bfd9471ef,Braddock,Emil's Lounge,4.5,no
1,5853170222e2fc4bfd9471f1,Carnegie,Alexion's Bar & Grill,4.0,free
2,5853170222e2fc4bfd9471fd,Carnegie,Rocky's Lounge,4.0,free
3,5853170222e2fc4bfd947201,Carnegie,Paddy's Pour House,3.5,Unknown
4,5853170222e2fc4bfd947210,Homestead,Randy's Beer Barrel Pub,2.5,Unknown
5,5853170222e2fc4bfd947219,Homestead,Duke's Upper Deck Cafe,3.5,free
6,5853170222e2fc4bfd94721a,Homestead,Dave & Buster's,2.5,free
7,5853170222e2fc4bfd947224,West Homestead,Bar Louie,2.5,free
8,5853170222e2fc4bfd947229,Homestead,TGI Fridays,2.5,free
9,5853170222e2fc4bfd94724c,McKees Rocks,Applebee's,3.5,Unknown


In [8]:
bars.groupby(by='wifi').mean()

wifi,stars
Unknown,3.550434
free,3.541452
no,3.49045
paid,3.473684


In [9]:
bars_lm = ols('stars ~ C(wifi)', data=bars).fit()

In [10]:
print(sm.stats.anova_lm(bars_lm, typ=2))

               sum_sq      df         F    PR(>F)
C(wifi)      2.128350     3.0  1.633234  0.179445
Residual  2051.593355  4723.0       NaN       NaN
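
For readers who want to see where the F value comes from, it can be recomputed by hand from the sum_sq and df columns of the table above. This is plain arithmetic, not an extra model fit; the variable names are mine.

```python
# Values copied from the ANOVA table above
ss_wifi, df_wifi = 2.128350, 3.0
ss_resid, df_resid = 2051.593355, 4723.0

ms_wifi = ss_wifi / df_wifi        # mean square between Wi-Fi groups
ms_resid = ss_resid / df_resid     # mean square within groups (residual)
f_stat = ms_wifi / ms_resid        # ratio of the two mean squares

print(round(f_stat, 4))
```

The result should agree with the F column to within rounding of the printed inputs.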


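So we see that Wi-Fi availability is inconsequential with respect to the mean star rating of bars: the group means differ by less than 0.08 stars, and the ANOVA p-value (about 0.18) gives no evidence of a real difference.
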
### General food service industry:

In [11]:
food = getDF(business,
              {'categories': {'$in': ['Food']}},
              projection={'attributes.Wi-Fi': 1,
                          'city': 1,
                          'name': 1,
                          'stars': 1},
              preprocessor=getWifi
             )

Found 10143 records matching this predicate.


In [213]:
food_lm = ols('stars ~ C(wifi)', data=food).fit()
print(sm.stats.anova_lm(food_lm, typ=2))

               sum_sq       df         F    PR(>F)
C(wifi)      8.706789      3.0  4.562756  0.003375
Residual  6449.181902  10139.0       NaN       NaN


In [214]:
food.groupby(by='wifi').mean()

wifi,stars
Unknown,3.725182
free,3.765854
no,3.799819
paid,3.642857


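Here the ANOVA is statistically significant (p ≈ 0.003), but the group means differ by at most about 0.16 stars, so the practical influence is still very small.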
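
One way to put a number on how small these effects are is eta squared, the share of the total sum of squares attributed to the Wi-Fi factor. This measure is my addition rather than something computed in the cells above; the inputs are the sum_sq values from the bars and food ANOVA tables.

```python
def eta_squared(ss_effect, ss_residual):
    """Share of total variance explained by the factor in a one-way ANOVA."""
    return ss_effect / (ss_effect + ss_residual)


# sum_sq values copied from the two ANOVA tables above
eta_bars = eta_squared(2.128350, 2051.593355)
eta_food = eta_squared(8.706789, 6449.181902)
print(eta_bars, eta_food)
```

Both values come out well under one percent, which fits the reading that Wi-Fi explains almost none of the variation in stars for either group, whatever the p-value says.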

### Coffee-Shops:

In [118]:
coffee = getDF(business,
               {'categories': {'$in': ['Coffee & Tea']}},
               projection={'attributes.Wi-Fi': 1,
                           'city': 1,
                           'name': 1,
                           'stars': 1},
               preprocessor=getWifi
              )

Found 2399 records matching this predicate.


In [132]:
coffee.groupby(by='wifi').mean()

wifi,stars
Unknown,3.764012
free,3.709521
no,3.829114
paid,3.25


In [119]:
coffee_lm = ols('stars ~ C(wifi)', data=coffee).fit()
print(sm.stats.anova_lm(coffee_lm, typ=2))

               sum_sq      df         F    PR(>F)
C(wifi)      8.614933     3.0  5.058178  0.001709
Residual  1359.696655  2395.0       NaN       NaN


In [131]:
coffee_lm.pvalues

Intercept          0.000000
C(wifi)[T.free]    0.225360
C(wifi)[T.no]      0.243320
C(wifi)[T.paid]    0.007715
dtype: float64

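So we can see that it's pretty bad to charge for Wi-Fi at your coffee shop: shops with paid Wi-Fi average 3.25 stars, roughly half a star below every other group. The low p-value on the 'paid' coefficient (about 0.008, measured against the 'Unknown' baseline) suggests that this is not just spurious.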
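
A note on reading those p-values: with a single categorical factor, each C(wifi)[T.x] coefficient is the difference between that group's mean and the mean of the reference level, which here is 'Unknown' (the one level missing from the coefficient list). The sketch below recomputes those differences from the groupby output; since the printed means are rounded, the results are only good to about six decimals.

```python
# Group means copied from the coffee groupby output above
means = {'Unknown': 3.764012, 'free': 3.709521, 'no': 3.829114, 'paid': 3.25}
reference = 'Unknown'  # the level absent from coffee_lm.pvalues

# Treatment-coded effect = group mean minus reference-group mean
effects = {level: mean - means[reference]
           for level, mean in means.items() if level != reference}
for level, diff in effects.items():
    print("C(wifi)[T.{}] = {:+.6f}".format(level, diff))
```

So the 'paid' p-value is testing a gap of about half a star against shops with unknown Wi-Fi status, not against shops with free Wi-Fi.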

## Low ranking suggestions:
### Warning: This was a failure, but there are interesting threads I want to explore later, so I'm not deleting them. It might be best to move on to the next section.

In [139]:
lowRankCoffee = coffee[coffee['stars'] < 3]

In [197]:
def getAllAttrs(rec):
    rec = dict(rec, **rec['attributes'])
    rec.pop('attributes', None)
    return rec
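def getAmbience(rec):
    # Flatten the nested Ambience flags into top-level columns when present;
    # always return the record (the old version returned None otherwise).
    if 'Ambience' in rec['attributes']:
        rec = dict(rec, **rec['attributes']['Ambience'])
    rec.pop('attributes', None)
    return rec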

In [146]:
coffee_full = getDF(business,
                    {'categories': {'$in': ['Coffee & Tea']}},
                    projection={'attributes': 1,
                                'city': 1,
                                'stars': 1},
                    preprocessor=getAllAttrs
                   )

Found 2399 records matching this predicate.


In [151]:
coffee_full

Unnamed: 0,Accepts Credit Cards,Accepts Insurance,Ages Allowed,Alcohol,Ambience,Attire,BYOB,BYOB/Corkage,By Appointment Only,Caters,...,Smoking,Take-out,Takes Reservations,Waiter Service,Wheelchair Accessible,Wi-Fi,_id,city,name,stars
0,True,,,none,,,,,False,,...,,True,,,True,free,5848a79739a36819a9b309e4,Homestead,Starbucks,3.5
1,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,False,...,,True,False,False,True,free,5848a79739a36819a9b30a55,Pittsburgh,Tazza D'oro Cafe & Espresso Bar,4.5
2,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,yes_corkage,,True,...,,True,False,True,True,free,5848a79739a36819a9b30a79,Pittsburgh,Quiet Storm Vegetarian & Vegan Cafe,4.0
3,True,,,,,,,,,,...,,,,,,no,5848a79739a36819a9b30abc,Pittsburgh,Cool Beans Coffee,5.0
4,False,,,,,,,,,,...,,,,,True,free,5848a79739a36819a9b30b16,Pittsburgh,Katerbean,1.5
5,True,,,,,,,,,,...,,True,,,True,free,5848a79739a36819a9b30b37,Pittsburgh,Starbucks,4.5
6,True,,,,,,,,,,...,,True,,,,free,5848a79739a36819a9b30b4b,Pittsburgh,Starbucks,3.5
7,True,,,,,,,,,,...,,True,,,True,no,5848a79739a36819a9b30b64,Pittsburgh,Nicholas Coffee,4.0
8,True,,,,,,,,,,...,,True,,,True,no,5848a79739a36819a9b30b79,Pittsburgh,La Prima Espresso Co,4.5
9,False,,,,,,,,,,...,,,,,,no,5848a79739a36819a9b30b7d,Pittsburgh,Fifth Avenue Beanery,3.5


In [201]:
coffeeAmbience = getDF(business,
                       {'categories': {'$in': ['Coffee & Tea']}, 'attributes.Ambience': {'$exists': True}},
                       projection={'attributes.Ambience': 1,
                                   'city': 1,
                                   'stars': 1},
                       preprocessor=getAmbience
                      )

Found 554 records matching this predicate.


In [204]:
coffeeAmbience

Unnamed: 0,_id,casual,city,classy,divey,hipster,intimate,romantic,stars,touristy,trendy,upscale
0,5848a79739a36819a9b30a55,True,Pittsburgh,False,False,False,False,False,4.5,False,False,False
1,5848a79739a36819a9b30a79,False,Pittsburgh,False,False,True,False,False,4.0,False,False,False
2,5848a79739a36819a9b30bc4,False,Pittsburgh,False,False,False,False,False,4.5,False,False,False
3,5848a79739a36819a9b30dd4,False,Charlotte,False,True,False,False,False,4.5,False,False,False
4,5848a79839a36819a9b3106d,True,Middleton,False,False,False,False,False,3.5,False,False,False
5,5848a79839a36819a9b31141,True,Madison,False,False,False,False,False,4.5,False,False,False
6,5848a79839a36819a9b31441,False,Phoenix,False,False,False,False,False,4.0,False,False,False
7,5848a79839a36819a9b31448,True,Phoenix,False,False,False,False,False,4.0,False,False,False
8,5848a79839a36819a9b3173b,False,Phoenix,False,False,False,False,False,4.0,False,False,False
9,5848a79839a36819a9b3176a,False,Phoenix,False,True,False,False,False,4.0,False,False,False


In [210]:
coffee_amb_lm = ols('stars ~ C(city)', data=coffeeAmbience).fit()
print(sm.stats.anova_lm(coffee_amb_lm, typ=2))

              sum_sq     df        F    PR(>F)
C(city)    25.047838   51.0  1.45532  0.025384
Residual  169.412451  502.0      NaN       NaN


In [211]:
coffeeAmbience.groupby(by="city").mean()

city,casual,classy,intimate,romantic,stars,touristy,trendy
Ahwatukee,1.0,0.0,0.0,0.0,4.0,0.0,0.0
Avondale,0.0,0.0,0.0,0.0,4.5,0.0,0.0
Bellevue,0.0,0.0,0.0,0.0,4.5,0.0,0.0
Brossard,0.0,0.0,0.0,0.0,4.5,0.0,0.0
Buckeye,0.5,0.0,0.0,0.0,4.0,0.0,0.25
Carefree,1.0,0.0,0.0,0.0,4.0,0.0,0.0
Carnegie,1.0,0.0,0.0,0.0,4.0,0.0,0.0
Casa Grande,0.0,0.0,0.0,0.0,4.5,0.0,0.0
Cave Creek,0.8,0.0,0.0,0.0,4.4,0.0,0.0
Champaign,1.0,0.0,0.0,0.0,4.0,0.0,0.0


In [189]:
coffee_with_ambience = coffee_full[~ pd.isnull(coffee_full['Ambience'])]
coffee_with_ambience

Unnamed: 0,Accepts Credit Cards,Accepts Insurance,Ages Allowed,Alcohol,Ambience,Attire,BYOB,BYOB/Corkage,By Appointment Only,Caters,...,Smoking,Take-out,Takes Reservations,Waiter Service,Wheelchair Accessible,Wi-Fi,_id,city,name,stars
1,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,False,...,,True,False,False,True,free,5848a79739a36819a9b30a55,Pittsburgh,Tazza D'oro Cafe & Espresso Bar,4.5
2,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,yes_corkage,,True,...,,True,False,True,True,free,5848a79739a36819a9b30a79,Pittsburgh,Quiet Storm Vegetarian & Vegan Cafe,4.0
12,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,True,...,,True,False,False,True,no,5848a79739a36819a9b30bc4,Pittsburgh,Prestogeorge Coffee & Tea,4.5
24,True,,,none,"{'divey': True, 'romantic': False, 'hipster': ...",casual,,,,True,...,,True,False,False,False,no,5848a79739a36819a9b30dd4,Charlotte,Tic Toc Diner & Catering,4.5
31,False,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,False,...,,True,False,False,True,free,5848a79839a36819a9b3106d,Middleton,Prairie Cafe & Bakery,3.5
33,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,True,...,,True,False,False,False,free,5848a79839a36819a9b31141,Madison,Cool Beans Coffee Cafe,4.5
42,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,True,...,,True,False,False,,,5848a79839a36819a9b31441,Phoenix,Cranberry Hills Eatery & Catering,4.0
43,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,True,...,,True,False,False,True,no,5848a79839a36819a9b31448,Phoenix,Bitter Creek Cafe,4.0
52,True,,,none,"{'divey': False, 'romantic': False, 'hipster':...",casual,,,,,...,,True,False,False,True,free,5848a79839a36819a9b3173b,Phoenix,Rainbow Donuts,4.0
54,True,,,none,"{'divey': True, 'romantic': False, 'hipster': ...",casual,,,,False,...,,True,False,False,True,free,5848a79839a36819a9b3176a,Phoenix,Grinders Coffee Company,4.0


In [190]:
coffee_amb_lm = ols('stars ~ C(wifi)', data=coffee_with_ambience).fit()
print(sm.stats.anova_lm(coffee_amb_lm, typ=2))

PatsyError: Error evaluating factor: NameError: name 'wifi' is not defined
    stars ~ C(wifi)
            ^^^^^^^

The formula fails because coffee_with_ambience is derived from coffee_full, where the column is still named 'Wi-Fi' rather than 'wifi'. Renaming the column to 'wifi', or quoting it in the formula as Q("Wi-Fi"), should let the model fit.

In [188]:
tmp.iloc[0]

{'casual': True,
 'classy': False,
 'divey': False,
 'hipster': False,
 'intimate': False,
 'romantic': False,
 'touristy': False,
 'trendy': False,
 'upscale': False}

In [194]:
low_star_coffee = coffee_with_ambience[coffee_with_ambience['stars'] < 2.5]

## [Bars] near [Wi-Fi]:

### Warning: This was a late addition, so it's not working very well. I've got it working in the mongo shell, but that doesn't do you much good.
I'm out of time. Please check back later.

In [226]:
wifiLocs = getDF(business,
                 {'attributes.Wi-Fi': {'$exists': True, '$nin': ['no']}},
                 projection={'name': 1,
                             'attributes.Wi-Fi': 1,
                             'loc': 1},
                 preprocessor=getWifi
                )

Found 11148 records matching this predicate.


In [227]:
wifiLocs

Unnamed: 0,_id,loc,name,wifi
0,5848a79739a36819a9b309c6,"{'coordinates': [-80.0675491, 40.4154859], 'ty...",Alexion's Bar & Grill,free
1,5848a79739a36819a9b309d2,"{'coordinates': [-80.0849416, 40.3964688], 'ty...",Rocky's Lounge,free
2,5848a79739a36819a9b309d5,"{'coordinates': [-80.088557, 40.417419], 'type...",Extended Stay America - Pittsburgh - Carnegie,free
3,5848a79739a36819a9b309e3,"{'coordinates': [-79.9098858, 40.412012], 'typ...",McDonald's,free
4,5848a79739a36819a9b309e4,"{'coordinates': [-79.916958, 40.407091], 'type...",Starbucks,free
5,5848a79739a36819a9b309ef,"{'coordinates': [-79.9141865, 40.4092932], 'ty...",Dave & Buster's,free
6,5848a79739a36819a9b309f1,"{'coordinates': [-79.9017997, 40.4097098], 'ty...",Wendy's,free
7,5848a79739a36819a9b309f9,"{'coordinates': [-79.9173143507142, 40.4070737...",Bar Louie,free
8,5848a79739a36819a9b309fe,"{'coordinates': [-79.9155451418884, 40.4108743...",TGI Fridays,free
9,5848a79739a36819a9b309ff,"{'coordinates': [-79.9141364, 40.410967], 'typ...",Uno Pizzeria & Grill,free


In [230]:
tmp = business.find_one({'categories': {'$in': ['Bars']}},
                        {'loc': 1}
                       )

In [234]:
tmp['loc']

{'coordinates': [-79.8662107, 40.4088301], 'type': 'Point'}

In [239]:
tmpBus = business.find({
        'loc': SON([('$near',
                     {'$geometry': SON([('type', 'Point'),
                                        ('coordinates', tmp['loc']['coordinates'])
                                       ])}),
                    ('$maxDistance', 1)
                   ])
    })

In [243]:
tmpBus1 = list(tmpBus)

OperationFailure: database error: Can't canonicalize query: BadValue geo near accepts just one argument when querying for a GeoJSON point. Extra field found: $maxDistance: 1
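
The error message points at the likely cause: `$maxDistance` sits next to `$near` instead of inside it. With a GeoJSON point, MongoDB expects `$maxDistance` (in meters) in the same sub-document as `$geometry`. Below is a minimal sketch of the query document only; `near_query` is a made-up helper name, and the 0.25-mile default comes from the idea section at the top.

```python
METERS_PER_MILE = 1609.344


def near_query(coordinates, max_miles=0.25, field='loc'):
    """Build a $near filter for a 2dsphere-indexed GeoJSON field.

    `near_query` is a hypothetical helper. $maxDistance is expressed in
    meters and nested inside $near, alongside $geometry.
    """
    return {field: {'$near': {
        '$geometry': {'type': 'Point', 'coordinates': list(coordinates)},
        '$maxDistance': max_miles * METERS_PER_MILE,
    }}}


# Coordinates of the bar printed by tmp['loc'] above
query = near_query([-79.8662107, 40.4088301])
print(query)
# Intended use (needs a live MongoDB with a 2dsphere index on 'loc'):
#     nearby = list(business.find(query))
```

Note that the original cell passed a distance of 1, which in this form would mean one meter.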

In [236]:
db.command({'geoNear': 'academic_business',
            'near': tmp['loc']['coordinates'],
            'spherical': True
           })

OperationFailure: no such cmd: near
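
"no such cmd: near" suggests the server took `near` as the command name. MongoDB reads the first key of a command document as the command, and a plain dict does not guarantee key order in older Python versions, so that is my best guess at what happened. One option that sidesteps the ordering issue is the `$geoNear` aggregation stage. The sketch below only builds the pipeline; the helper name, the `dist_m` field name, and the free-Wi-Fi filter are my assumptions.

```python
def geo_near_pipeline(coordinates, max_meters, wifi='free'):
    """Aggregation pipeline: venues with the given Wi-Fi value near a point.

    Hypothetical helper. 'dist_m' is an arbitrary name for the computed
    distance in meters; $geoNear has to be the first stage of a pipeline.
    """
    return [{'$geoNear': {
        'near': {'type': 'Point', 'coordinates': list(coordinates)},
        'distanceField': 'dist_m',
        'maxDistance': max_meters,
        'query': {'attributes.Wi-Fi': wifi},
        'spherical': True,
    }}]


# Bar coordinates from tmp['loc'] above; 402.336 m is 0.25 miles
pipeline = geo_near_pipeline([-79.8662107, 40.4088301], 402.336)
print(pipeline)
# Intended use (needs a live MongoDB with a 2dsphere index on 'loc'):
#     nearby = list(business.aggregate(pipeline))
```

Running this per bar would give, for each bar, the free-Wi-Fi venues within a quarter mile along with their distances.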

In [146]:
barsNearWifi = getDF(business,
                     {'categories': {'$in': ['Coffee & Tea']}},
                     projection={'attributes': 1,
                                 'city': 1,
                                 'stars': 1},
                     preprocessor=getAllAttrs
                    )

Found 2399 records matching this predicate.
