##Project Preliminary Data Analysis

###Problem Statement and Background (2 points)
A high-level statement of the problem you intend to address, e.g. finding correspondences between neural recordings and DNN layers. Try to translate the high-level into specific questions if you can.
Give background on the problem you are solving: why it is interesting, who is interested, what is known, some references about it, etc.

###The Data Source(s) You Are Using (2 points)
Describe the data source(s) you have. How much data you have now, and how much you expect to use for your final analysis. We will need that information soon so we can get the necessary data to you.

###Data Joining/Cleaning You Did (4 points)
If data is being joined, describe the joining process and any problems with it - explain the metric used for fuzzy joins.
Explain how you will handle missing or duplicate keys. Describe the tools you used to examine/repair/clean the data.
If you found any statistical anomalies last time, explain how you plan to deal with them.


In [2]:
import json
import pandas as pd
import sys
import os
import numpy as np
%matplotlib inline

In [3]:
#Load Biz Dataset into DF
biz_data = []
biz_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json'
with open(biz_fn) as data_file:
    for line in data_file:
        biz_data.append(json.loads(line))
biz_df = pd.DataFrame(biz_data)

In [26]:
#filtering businesses into restaurants
category_csv = 'restaurantcategories.csv'
all_categories = []
restaurants = []
with open(category_csv) as categories:
    for line in categories:
        all_categories.append(line)
biz_dict = biz_df.to_dict()
all_categories = all_categories[0].split('\r')
for index in biz_dict['categories'].keys():
    if len(biz_dict['categories'][index]) == 0:
        for col in biz_dict.keys():
                del biz_dict[col][index] 
    elif len(biz_dict['categories'][index]) > 0:
        allInCats = True
        for elem in biz_dict['categories'][index]:
            if elem not in all_categories:
                allInCats = False
                break;
        if not allInCats:
            for col in biz_dict.keys():
                del biz_dict[col][index] 
                
res_df = pd.DataFrame(biz_dict)    

In [27]:
res_df.shape

(17163, 15)

In [7]:
#Load review dataset into df
review_data = []
review_fn = 'yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json'
with open(review_fn) as data_file:
    for line in data_file:
        review_data.append(json.loads(line))
review_df = pd.DataFrame(review_data)

###Analysis Approach (3 points)
Describe what analysis you are doing: This will probably comprise:
Featurization: Explain how you generated features from the raw data. e.g. thresholding to produce binary features, binning, tf-idf, multinomial -> multiple binary features (one-hot encoding). Describe any value transformations you did, e.g. histogram normalization.
Modeling: Which machine learning models did you try? Which do you plan to try in the future?
Performance measurement: How will you evaluate your model and improve featurization etc.

In [23]:
review_df.shape

(1569264, 8)

In [28]:
#adds a column "distance" which is the distance of this business to its city center. 
#The result is new_res_df which is just res_df with this additional column

from haversine import haversine

cityCenterLocations = dict()

cityCenterLocations['Edinburgh'] = [55.9531, -3.1889]
cityCenterLocations['Karlsruhe'] =  [49.0092, 8.4040]
cityCenterLocations['Montreal'] = [45.5017, -73.5673]
cityCenterLocations['Waterloo'] = [43.4667, -80.5167]
cityCenterLocations['Pittsburgh'] = [40.4397, -79.9764]
cityCenterLocations['Charlotte'] = [35.2269, -80.8433]
cityCenterLocations['Urbana-Champaign'] = [40.1097, -88.2042]
cityCenterLocations['Phoenix'] = [33.4500, -112.0667]
cityCenterLocations['Las Vegas'] = [36.1215, -115.1739]
cityCenterLocations['Madison'] = [43.0667, -89.4000]

dictOfLists = {'business_id': [], 'distance': []}

for index, row in res_df.iterrows():
    location = (row['latitude'], row['longitude'])
    minDistance = float("inf")
    for cityCenter in cityCenterLocations:
        centerlocation = cityCenterLocations[cityCenter]
        distance = haversine(location, (centerlocation[0], centerlocation[1]), miles=True)
        if distance < minDistance:
            minDistance = distance
    dictOfLists['business_id'].append(row['business_id'])
    dictOfLists['distance'].append(minDistance)

distDF = pd.DataFrame(dictOfLists)

new_res_df = pd.merge(res_df, distDF, on='business_id')
new_res_df
    

Unnamed: 0,attributes,business_id,categories,city,full_address,hours,latitude,longitude,name,neighborhoods,open,review_count,stars,state,type,distance
0,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],True,8,3.5,PA,business,7.101866
1,"{u'Alcohol': u'none', u'Noise Level': u'averag...",b9WZJp5L1RZr4F1nxclOoQ,"[Breakfast & Brunch, Restaurants]",Carnegie,"1073 Washington Ave\nCarnegie, PA 15106","{u'Monday': {u'close': u'14:30', u'open': u'06...",40.396744,-80.084800,Gab & Eat,[],True,38,4.5,PA,business,6.428321
2,"{u'Take-out': True, u'Price Range': 1, u'Outdo...",zaXDakTd3RXyOa7sMrUE1g,"[Cafes, Restaurants]",Carnegie,"202 3rd Ave\nCarnegie\nCarnegie, PA 15106",{},40.404638,-80.089985,Barb's Country Junction Cafe,[Carnegie],True,5,4.0,PA,business,6.447019
3,"{u'Take-out': True, u'Accepts Credit Cards': T...",rv7CY8G_XibTx82YhuqQRw,[Restaurants],Carnegie,"Raceway Plz\nCarnegie, PA 15106",{},40.386891,-80.093704,Long John Silver's,[],True,3,3.5,PA,business,7.168961
4,"{u'Take-out': True, u'Alcohol': u'none', u'Noi...",SQ0j7bgSTazkVQlF5AnqyQ,"[Chinese, Restaurants]",Carnegie,"214 E Main St\nCarnegie\nCarnegie, PA 15106",{},40.408343,-80.084861,Don Don Chinese Restaurant,[Carnegie],True,8,2.5,PA,business,6.102425
5,"{u'Take-out': True, u'Accepts Credit Cards': T...",wqu7ILomIOPSduRwoWp4AQ,"[Breakfast & Brunch, American (Traditional), R...",Pittsburgh,"2180 Greentree Rd\nPittsburgh, PA 15220",{},40.391255,-80.073426,Denny's,[],True,7,4.0,PA,business,6.103717
6,"{u'Take-out': True, u'Accepts Credit Cards': T...",P1fJb2WQ1mXoiudj8UE44w,"[Restaurants, Italian]",Carnegie,"200 E Main St\nCarnegie\nCarnegie, PA 15106","{u'Monday': {u'close': u'22:00', u'open': u'11...",40.408257,-80.085458,Papa J's,[Carnegie],True,46,3.5,PA,business,6.133901
7,"{u'Take-out': True, u'Accepts Credit Cards': T...",PK6aSizckHFWk8i0oxt5DA,"[Burgers, Fast Food, Restaurants]",Homestead,"400 Waterfront Dr E\nHomestead\nHomestead, PA ...",{},40.412086,-79.910032,McDonald's,[Homestead],True,5,2.0,PA,business,3.978142
8,"{u'Attire': u'casual', u'Parking': {u'garage':...",sRqB6flj3GtTZIZJQxf_oA,[Restaurants],Homestead,"285 Waterfront Dr E\nHomestead\nHomestead, PA ...",{},40.411692,-79.912343,Eat'n Park Hospitality Group,[Homestead],True,3,2.5,PA,business,3.885432
9,"{u'Take-out': True, u'Accepts Credit Cards': T...",6ilJq_05xRgek_8qUp36-g,"[Burgers, Fast Food, Restaurants]",Munhall,"650 E Waterfront Dr\nHomestead\nMunhall, PA 1...","{u'Monday': {u'close': u'00:00', u'open': u'00...",40.413496,-79.904456,Steak 'n Shake,[Homestead],True,36,2.0,PA,business,4.194822


In [35]:
#Joins res_df and review_df on business_id

import numpy as np

per_biz = dict()
random_indices = np.random.choice(len(review_df), 100000)
RestaurantReviewsJoinedDict = pd.merge(res_df, review_df, on='business_id')

In [36]:
print RestaurantReviewsJoinedDict.shape
RestaurantReviewsJoinedDict

(738425, 22)


Unnamed: 0,attributes,business_id,categories,city,full_address,hours,latitude,longitude,name,neighborhoods,...,stars_x,state,type_x,date,review_id,stars_y,text,type_y,user_id,votes
0,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2009-12-10,XW_RwDoN9StbDt0Y1pc3VA,3,I have never seen a restaurant that has a frow...,review,T_wjLgPOPXry7Bea4MzoVQ,"{u'funny': 1, u'useful': 2, u'cool': 3}"
1,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2010-01-03,LOqF4d657XomJGRKpFKlhg,3,"So... back in the late 90s, there used to be t...",review,LaPatM6c289ClpysmzZpdQ,"{u'funny': 1, u'useful': 1, u'cool': 0}"
2,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2010-11-06,e2YxtsZJE3w6DdNP8yMs7w,4,Ive pretty much been eating at various Kings' ...,review,vGI3dbg5zFRXBg4eVVmGSg,"{u'funny': 0, u'useful': 0, u'cool': 0}"
3,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2011-11-21,RRddfCx_goh5UnEIwx9HMA,2,Hoofah.,review,9MmWbiE7txW_OplDAnoaqA,"{u'funny': 0, u'useful': 0, u'cool': 0}"
4,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2012-05-30,bgLHVU09FpJ-uEOFFor6uA,4,I heart King's. I've always been a fan and thi...,review,q7MrNVt1FE23rwtWmPYWHg,"{u'funny': 0, u'useful': 0, u'cool': 1}"
5,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2014-09-18,pLkbUd2H5ducEeJPi1BPGw,4,I arrived around 10 am on a Saturday morning. ...,review,7KoVg5QMjYu8taLFSE7hNA,"{u'funny': 0, u'useful': 0, u'cool': 0}"
6,"{u'Take-out': True, u'Drive-Thru': False, u'Ou...",wJr6kSA5dchdgOdwH6dZ2w,"[Burgers, Breakfast & Brunch, American (Tradit...",Carnegie,"2100 Washington Pike\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'08...",40.387732,-80.092874,Kings Family Restaurant,[],...,3.5,PA,business,2015-01-01,yJFDmUvxlPRdzG7GPSAXKw,5,"thisis not the closest Kings to us, but we oft...",review,G4PZXgVGd-6zG9jJBQNl5A,"{u'funny': 0, u'useful': 0, u'cool': 0}"
7,"{u'Alcohol': u'none', u'Noise Level': u'averag...",b9WZJp5L1RZr4F1nxclOoQ,"[Breakfast & Brunch, Restaurants]",Carnegie,"1073 Washington Ave\nCarnegie, PA 15106","{u'Monday': {u'close': u'14:30', u'open': u'06...",40.396744,-80.084800,Gab & Eat,[],...,4.5,PA,business,2007-03-31,Phd_OwFhKQptiVL5Tbl-Lw,3,If you want a true understanding of Pittsburgh...,review,PrMlXX6fbMsJie9ausN41g,"{u'funny': 0, u'useful': 2, u'cool': 1}"
8,"{u'Alcohol': u'none', u'Noise Level': u'averag...",b9WZJp5L1RZr4F1nxclOoQ,"[Breakfast & Brunch, Restaurants]",Carnegie,"1073 Washington Ave\nCarnegie, PA 15106","{u'Monday': {u'close': u'14:30', u'open': u'06...",40.396744,-80.084800,Gab & Eat,[],...,4.5,PA,business,2007-08-02,uSoZMwdnhiegEpbXCwWATw,4,"Good Luck getting a seat, that's all I have to...",review,FNbm3ycU2BF8C17UFfWzOg,"{u'funny': 0, u'useful': 0, u'cool': 0}"
9,"{u'Alcohol': u'none', u'Noise Level': u'averag...",b9WZJp5L1RZr4F1nxclOoQ,"[Breakfast & Brunch, Restaurants]",Carnegie,"1073 Washington Ave\nCarnegie, PA 15106","{u'Monday': {u'close': u'14:30', u'open': u'06...",40.396744,-80.084800,Gab & Eat,[],...,4.5,PA,business,2008-04-12,O7qgsnbzKdswwncjdJiyPw,5,Stick to basics and this is the best place in ...,review,EtSjuNdguCE9AcmCOVYqxw,"{u'funny': 0, u'useful': 0, u'cool': 0}"


In [37]:
'''
Outputs a dictionary per_biz which has a key business_id and value a list, where the first element is total number 
of words in reviews for that business, and the second element is number of reviews for that business.

Will use this dict to compute average number of words in a review for a business.
'''

for index, row in RestaurantReviewsJoinedDict.iterrows():
    if row['business_id'] not in per_biz.keys():
        per_biz[row['business_id']] = [0, 0]
    else:
        per_biz[row['business_id']][1] += 1
        wordcount = len(row['text'].encode('utf8').split())
        per_biz[row['business_id']][0] += wordcount
    

adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding business
adding b

{u'OlpyplEJ_c_hFxyand_Wxw': [6149, 58],
 u'_qvxFHGbnbrAPeWBVifJEQ': [1359, 17],
 u'3bwxfBvKABepxYWGz3pHXA': [3437, 34],
 u'ny1T3dzXf8_ySXkFv37qwQ': [790, 16],
 u's5yzZITWU_RcJzWOgjFecw': [10882, 72],
 u'fWQqaAaOon3XkFqPFYgo6Q': [342, 4],
 u'dbMpGh4p9dTxSn5lDSKE3w': [2014, 19],
 u'VZYMInkjRJVHwXVFqeoMWg': [551, 3],
 u'qcylQLL-fXdFHrdXC2jZFw': [9790, 73],
 u'yghonKu_3q4Iu08HbNsyOQ': [3391, 28],
 u'51QMwutUI352Gmb3t1aryw': [347, 5],
 u'DB5rZ9spvyhBJsqOwckncw': [1741, 14],
 u'8buIr1zBCO7OEcAQSZko7w': [164249, 1224],
 u'gq7u9uyOkpLCeh_lxR-WFw': [12009, 88],
 u'nZww1gdBAi9HtRKiQHL2qA': [4619, 29],
 u'Cko88SO9YEsg8jZEXke81w': [2076, 21],
 u'ke3RFq3mHEAoJE_kkRNhiQ': [46660, 315],
 u'SNpVV5viJ2aPylP6bkAx8Q': [2251, 29],
 u'fx4coO0OyW7Qe8vdLnlLiA': [2114, 9],
 u'3m7khDnqH9QOg8gu3Ymumw': [947, 5],
 u'3ntET8y1imh854cMN49WYg': [76, 2],
 u'5TBOg9Rf47SECB8gTNGqeQ': [6004, 56],
 u'7Y1lfzFkwBRoMZKIMMeZMw': [1043, 11],
 u'1HmLi5NNs0_FDYYxBgRxzw': [339, 4],
 u'APzio4blbje5mhMGJqP8Ew': [257, 4],
 u'i_H_yK

In [38]:
per_biz

{u'OlpyplEJ_c_hFxyand_Wxw': [6149, 58],
 u'_qvxFHGbnbrAPeWBVifJEQ': [1359, 17],
 u'3bwxfBvKABepxYWGz3pHXA': [3437, 34],
 u'ny1T3dzXf8_ySXkFv37qwQ': [790, 16],
 u's5yzZITWU_RcJzWOgjFecw': [10882, 72],
 u'fWQqaAaOon3XkFqPFYgo6Q': [342, 4],
 u'dbMpGh4p9dTxSn5lDSKE3w': [2014, 19],
 u'VZYMInkjRJVHwXVFqeoMWg': [551, 3],
 u'qcylQLL-fXdFHrdXC2jZFw': [9790, 73],
 u'yghonKu_3q4Iu08HbNsyOQ': [3391, 28],
 u'51QMwutUI352Gmb3t1aryw': [347, 5],
 u'DB5rZ9spvyhBJsqOwckncw': [1741, 14],
 u'8buIr1zBCO7OEcAQSZko7w': [164249, 1224],
 u'gq7u9uyOkpLCeh_lxR-WFw': [12009, 88],
 u'nZww1gdBAi9HtRKiQHL2qA': [4619, 29],
 u'Cko88SO9YEsg8jZEXke81w': [2076, 21],
 u'ke3RFq3mHEAoJE_kkRNhiQ': [46660, 315],
 u'SNpVV5viJ2aPylP6bkAx8Q': [2251, 29],
 u'fx4coO0OyW7Qe8vdLnlLiA': [2114, 9],
 u'3m7khDnqH9QOg8gu3Ymumw': [947, 5],
 u'3ntET8y1imh854cMN49WYg': [76, 2],
 u'5TBOg9Rf47SECB8gTNGqeQ': [6004, 56],
 u'7Y1lfzFkwBRoMZKIMMeZMw': [1043, 11],
 u'1HmLi5NNs0_FDYYxBgRxzw': [339, 4],
 u'APzio4blbje5mhMGJqP8Ew': [257, 4],
 u'i_H_yK

In [41]:
#This cell gets the 200 most frequent words to compute TFIDF. Not working yet, taking too long to run 

from collections import Counter

counter = Counter()
for index, row in RestaurantReviewsJoinedDict.iterrows():
    words = []
    text = row['text'].encode('utf8')
    line = text.split()
    for elem in line:
        if len(elem) >= 3:
            words.append(elem)
    counter += Counter(words)
    
counter.most_common(10)

KeyboardInterrupt: 

###Preliminary Results (6 Points)
Summarize the results you have so far: