# YELP Data Processing

* Convert `json` files into `csv` files by running `yelp_json_to_csv.py`
* Generate dummy variables when necessary
* Process each `csv` file and rename variables if necessary

The [Yelp](https://www.yelp.com/dataset) dataset consists of six `json` files: `business.json`, `review.json`, `user.json`, `checkin.json`, `tip.json`, and `photo.json`.

* `business.json` contains information about each business, such as its __name__, __location__, __attributes__, and business __hours__.


* `review.json` has information about each posted review, such as the __user id__, the __star__ rating, the __review__ itself, and the number of __useful__ votes.


* `user.json` provides information about each Yelp user, such as first __name__, total number of __reviews__, list of __friends__, and average star __rating__.


* `checkin.json` has information about the check-ins for each business, such as the __business id__ and __date__.


* `tip.json` contains tips, which are shorter than reviews and convey quick suggestions; it provides information such as the __tip__ itself, the number of __compliments__, and the __date__.


* `photo.json` contains information about each photo uploaded to Yelp, such as the __photo id__ and photo __label__.

The main interest of this research is to identify the characteristics that make a review useful and to use those characteristics to decide whether a freshly posted review will be a useful one. For this reason, we are mostly interested in the following data: __businesses__, __reviews__, __users__ and __tips__.

Detailed information and the full list of features for each data file are available in the Yelp [documentation](https://www.yelp.com/dataset).

# Table of Contents

1. [Yelp Business Data](#Yelp-Business-Data)

    1.1. [Data Processing](#Data-Processing)
    
2. [Yelp Review Data](#Yelp-Review-Data)
3. [Yelp User Data](#Yelp-User-Data)

In [1]:
# wider screen
from IPython.display import display, HTML
display(HTML('<style>.container { width:90% !important; }</style>'))
from collections import Counter
import json
import pandas as pd
import numpy as np

# Yelp Business Data

The features in the `business.json` file:

* `business_id`: (`str`) unique id of the business

* `name`: (`str`) the business's name

* `address`: (`str`) the full address of the business

* `city`: (`str`) the city where the business is located

* `state`: (`str`) the state where the business is located

* `postal_code`: (`str`) the postal code of the business

* `latitude`: (`float`) latitude

* `longitude`: (`float`) longitude

* `stars`: (`float`) average star rating, rounded to half-stars

* `review_count`: (`int`) total number of reviews given to the business

* `is_open`: (`int`) (binary) indicates whether the business is still open

* `attributes`: (`json`) attributes of the business

* `categories`: (`str`) comma-separated list of the categories that describe the business

* `hours`: (`json`) the working hours of the business

In [2]:
df = pd.read_json('yelp_data/yelp_academic_dataset_business.json', lines=True)
df.head()

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f9NumwFMBDn751xgFiRbNA | The Range At Lake Norman | 10913 Bailey Rd | Cornelius | NC | 28031 | 35.462724 | -80.852612 | 3.5 | 36 | 1 | {'BusinessAcceptsCreditCards': 'True', 'BikePa... | Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh... | {'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'... |
| 1 | Yzvjg0SayhoZgCljUJRF9Q | Carlos Santo, NMD | 8880 E Via Linda, Ste 107 | Scottsdale | AZ | 85258 | 33.569404 | -111.890264 | 5.0 | 4 | 1 | {'GoodForKids': 'True', 'ByAppointmentOnly': '... | Health & Medical, Fitness & Instruction, Yoga,... | |
| 2 | XNoUzKckATkOD1hP6vghZg | Felinus | 3554 Rue Notre-Dame O | Montreal | QC | H4C 1P4 | 45.479984 | -73.58007 | 5.0 | 5 | 1 | | Pets, Pet Services, Pet Groomers | |
| 3 | 6OAZjbxqM5ol29BuHsil3w | Nevada House of Hose | 1015 Sharp Cir | North Las Vegas | NV | 89030 | 36.219728 | -115.127725 | 2.5 | 3 | 0 | {'BusinessAcceptsCreditCards': 'True', 'ByAppo... | Hardware Stores, Home Services, Building Suppl... | {'Monday': '7:0-16:0', 'Tuesday': '7:0-16:0', ... |
| 4 | 51M2Kk903DFYI6gnB5I6SQ | USE MY GUY SERVICES LLC | 4827 E Downing Cir | Mesa | AZ | 85205 | 33.428065 | -111.726648 | 4.5 | 26 | 1 | {'BusinessAcceptsCreditCards': 'True', 'ByAppo... | Home Services, Plumbing, Electricians, Handyma... | {'Monday': '0:0-0:0', 'Tuesday': '9:0-16:0', '... |


## Data Processing

In [3]:
def get_list(x):
    """Returns a list of categories if applicable, else returns the string 'NA'."""
    try:
        return x.split(', ')
    except AttributeError:
        # missing values (None / NaN) have no split() method
        return 'NA'
        
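def clean_json(text):
    """Cleans the problems with the attributes JSON objects."""
    if not isinstance(text, str):
        text = str(text)
    # quote booleans so they are kept as the strings "True" / "False"
    text = text.replace('True', '"True"')
    text = text.replace('False', '"False"')
    # JSON requires double quotes
    text = text.replace("'", '"')
    # missing values are treated as "False"
    text = text.replace('None', '"False"')
    # keep keys ending in u (DriveThru) balanced after the u" removal below;
    # the key ends up as "DriveThr"
    text = text.replace('"DriveThr', '"DriveThr"')
    # collapse the doubled quotes introduced above
    text = text.replace('""', '"')
    # drop the u prefix of unicode literals such as u'casual'
    text = text.replace('u"', '')
    # unwrap nested dictionaries that were stored as quoted strings
    text = text.replace('"{', '{').replace('}"', '}')
    return text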

def make_json(text):
    """Transform a string object into a JSON object"""
    return json.loads(text)
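
To see what the two helpers do together, here is a minimal sketch that runs the same replacement chain on a single made-up attribute record. The `raw` dictionary is hypothetical and not taken from the dataset, and the function body is repeated from the cell above only so that the snippet runs on its own.

```python
import json


def clean_json(text):
    """Same replacement chain as the notebook's clean_json (repeated here)."""
    if not isinstance(text, str):
        text = str(text)
    text = text.replace('True', '"True"')
    text = text.replace('False', '"False"')
    text = text.replace("'", '"')
    text = text.replace('None', '"False"')
    text = text.replace('"DriveThr', '"DriveThr"')
    text = text.replace('""', '"')
    text = text.replace('u"', '')
    text = text.replace('"{', '{').replace('}"', '}')
    return text


# Hypothetical record: a nested dictionary stored as a string and a
# value carrying a leftover u'' prefix.
raw = {
    'BusinessAcceptsCreditCards': 'True',
    'BusinessParking': "{'garage': False, 'street': True}",
    'RestaurantsAttire': "u'casual'",
}

parsed = json.loads(clean_json(raw))
print(parsed['BusinessParking'])    # {'garage': 'False', 'street': 'True'}
print(parsed['RestaurantsAttire'])  # casual

# A missing attribute record parses to the string 'False', not to a dict.
missing = json.loads(clean_json(None))
print(repr(missing))                # 'False'
```

Two things follow from this sketch: boolean values come back as the strings `'True'` and `'False'` rather than Python booleans, and a business without attributes yields a plain string, which is what the later `isinstance(x, dict)` check relies on.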

In [4]:
# convert categories info from string to list
df.categories = df.categories.apply(get_list)

# https://www.yelp.com/developers/documentation/v3/all_category_list
# identify categories
d_categories = {
    'active': 'Active Life',
    'arts': 'Arts & Entertainment',
    'auto': 'Automotive',
    'beautysvc': 'Beauty & Spas',
    'education': 'Education',
    'eventservices': 'Event Planning & Services',
    'financialservices': 'Financial Services',
    'food': 'Food',
    'health': 'Health & Medical',
    'homeservices': 'Home Services',
    'hotelstravel': 'Hotels & Travel',
    'localflavor': 'Local Flavor',
    'localservices': 'Local Services',
    'massmedia': 'Mass Media',
    'nightlife': 'Nightlife',
    'pets': 'Pets',
    'professional': 'Professional Services',
    'publicservicesgovt': 'Public Services & Government',
    'realestate': 'Real Estate',
    'religiousorgs': 'Religious Organizations',
    'restaurants': 'Restaurants',
    'shopping': 'Shopping'
}

# convert categories into dummy variables
for category in d_categories:
    df[category] = df.categories.apply(
        lambda x: 1 if d_categories[category] in x else 0)
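
The dummy-variable loop can be tried on a small frame first. The sketch below uses a made-up `toy` DataFrame and a two-entry `labels` dictionary (both hypothetical) and follows the same idea: split the category string into a list, then test list membership for each top-level category.

```python
import pandas as pd

# Hypothetical sample rows; the last one mimics a business with no categories.
toy = pd.DataFrame({
    'categories': ['Food, Coffee & Tea',
                   'Fast Food, Restaurants',
                   'Pets, Pet Groomers',
                   None]
})

# Same behaviour as get_list: a list when possible, the string 'NA' otherwise.
toy['categories'] = toy['categories'].apply(
    lambda x: x.split(', ') if isinstance(x, str) else 'NA')

labels = {'food': 'Food', 'pets': 'Pets'}  # shortened stand-in for d_categories
for column, label in labels.items():
    # `label=label` binds the current value inside the lambda
    toy[column] = toy['categories'].apply(
        lambda cats, label=label: int(label in cats))

print(toy[['food', 'pets']].values.tolist())
# [[1, 0], [0, 0], [0, 1], [0, 0]]
```

Because the test is list membership, `'Food'` matches only the exact category `Food`, so a business tagged `Fast Food` is not counted under `food`.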

In [5]:
# extract business attributes and convert them into dummy variables
df.attributes = df.attributes.apply(clean_json)
df.attributes = df.attributes.apply(make_json)
df['has_attributes'] = df.attributes.apply(
    lambda x: 1 if isinstance(x, dict) else 0)
mask = df.has_attributes == 1
df_attributes = pd.json_normalize(df.loc[mask, 'attributes'].values)
df_attributes['business_id'] = df[mask].business_id.values
df = df.merge(df_attributes, on='business_id', how='left')
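
As an illustration of the flatten-and-merge step, the following sketch applies the same sequence to a three-row frame. The `toy` data is invented for the example, and it assumes that `business_id` is unique per row, which the left merge relies on to avoid duplicating rows.

```python
import pandas as pd

# Hypothetical parsed attributes: two dictionaries and one non-dict value
# (what a business without attributes looks like after parsing).
toy = pd.DataFrame({
    'business_id': ['a', 'b', 'c'],
    'attributes': [
        {'WiFi': 'free', 'BusinessParking': {'garage': 'False', 'lot': 'True'}},
        'False',
        {'WiFi': 'no'},
    ],
})

toy['has_attributes'] = toy['attributes'].apply(
    lambda x: int(isinstance(x, dict)))
mask = toy['has_attributes'] == 1

# Nested dictionaries become dotted column names.
flat = pd.json_normalize(toy.loc[mask, 'attributes'].tolist())
flat['business_id'] = toy.loc[mask, 'business_id'].values

# The left merge keeps every business; rows without attributes get NaN.
merged = toy.merge(flat, on='business_id', how='left')
print(sorted(flat.columns))
# ['BusinessParking.garage', 'BusinessParking.lot', 'WiFi', 'business_id']
```

Business `b` keeps its row in `merged` with missing values in the attribute columns, and business `c` has missing values for the parking columns it never declared.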

In [6]:
# Identify businesses in the US
states = [
'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID',
'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS',
'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK',
'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV',
'WI', 'WY']
# generate a dummy variable that indicates whether a business is located in the US
df['in_US'] = df.state.apply(lambda x: 1 if x in states else 0)
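
The same flag can also be computed without a Python-level lambda. This is a sketch of an equivalent vectorized form using `Series.isin`, shown on a made-up `toy` frame with a shortened, hypothetical `us_states` list; it is an alternative, not the cell's own code.

```python
import pandas as pd

toy = pd.DataFrame({'state': ['NC', 'QC', 'AZ', 'ON']})  # hypothetical rows
us_states = ['NC', 'AZ', 'NV']  # shortened stand-in for the full `states` list

# isin returns a boolean Series; astype(int) turns it into a 0/1 dummy
toy['in_US'] = toy['state'].isin(us_states).astype(int)
print(toy['in_US'].tolist())  # [1, 0, 1, 0]
```

Note that the `states` list in the cell above holds the 50 state codes but not `DC`, so under either form a business in Washington, D.C. would get `in_US = 0`.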

In [7]:
# complete list of columns after data processing
', '.join(list(df.columns))

'business_id, name, address, city, state, postal_code, latitude, longitude, stars, review_count, is_open, attributes, categories, hours, active, arts, auto, beautysvc, education, eventservices, financialservices, food, health, homeservices, hotelstravel, localflavor, localservices, massmedia, nightlife, pets, professional, publicservicesgovt, realestate, religiousorgs, restaurants, shopping, has_attributes, BusinessAcceptsCreditCards, BikeParking, GoodForKids, ByAppointmentOnly, RestaurantsPriceRange2, BusinessParking.garage, BusinessParking.street, BusinessParking.validated, BusinessParking.lot, BusinessParking.valet, DogsAllowed, WiFi, RestaurantsAttire, RestaurantsTakeOut, NoiseLevel, RestaurantsReservations, RestaurantsGoodForGroups, BusinessParking, HasTV, Alcohol, RestaurantsDelivery, OutdoorSeating, Caters, WheelchairAccessible, AcceptsInsurance, RestaurantsTableService, HappyHour, Ambience.touristy, Ambience.hipster, Ambience.romantic, Ambience.intimate, Ambience.trendy, Ambien

In [8]:
# Rename features
df.rename(columns={'name': 'business_name',
                   'stars': 'business_stars',
                   'review_count': 'business_review_count'},
          inplace=True)
df.to_csv('yelp_data/yelp_academic_dataset_business.csv', index=False)

# Yelp Review Data

The features in the `review.json` file:

* `review_id`: (`str`) unique id of the review

* `user_id`: (`str`) unique id of the user

* `business_id`: (`str`) unique id of the business

* `stars`: (`int`) star rating

* `date`: (`str`) date formatted `YYYY-MM-DD`

* `text`: (`str`) the review itself

* `useful`: (`int`) number of useful votes received

* `funny`: (`int`) number of funny votes received

* `cool`: (`int`) number of cool votes received

In [9]:
df = []
with open('yelp_data/yelp_academic_dataset_review.json', encoding='utf-8') as fin:
    for jsonline in fin:
        df.append(json.loads(jsonline))
df = pd.json_normalize(df)
# generate dummy features
df['is_useful'] = np.where(df.useful != 0, 1, 0)
df['is_funny'] = np.where(df.funny != 0, 1, 0)
df['is_cool'] = np.where(df.cool != 0, 1, 0)
# Rename features
df.rename(columns={'stars': 'review_stars',
                   'useful': 'review_useful',
                   'funny': 'review_funny',
                   'cool': 'review_cool',
                   'text': 'review',
                   'date': 'review_date'},
          inplace=True)
df.to_csv('yelp_data/yelp_academic_dataset_review.csv', index=False)
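
If holding every review as a list of dictionaries turns out to be too memory-hungry, one possible alternative is to read the file in chunks with `pd.read_json(..., lines=True, chunksize=...)`. The sketch below is not the approach used in this notebook; it uses an in-memory `sample` buffer with three invented records in place of the real file, and a deliberately tiny chunk size.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical JSON-lines content standing in for the review file.
sample = io.StringIO(
    '{"review_id": "r1", "stars": 5, "useful": 2, "funny": 0, "cool": 1}\n'
    '{"review_id": "r2", "stars": 3, "useful": 0, "funny": 0, "cool": 0}\n'
    '{"review_id": "r3", "stars": 1, "useful": 7, "funny": 1, "cool": 0}\n'
)

chunks = []
# With lines=True and chunksize set, read_json yields one DataFrame per chunk.
for chunk in pd.read_json(sample, lines=True, chunksize=2):
    chunk['is_useful'] = np.where(chunk['useful'] != 0, 1, 0)
    chunks.append(chunk)

reviews = pd.concat(chunks, ignore_index=True)
print(reviews['is_useful'].tolist())  # [1, 0, 1]
```

For the real data, the buffer would be replaced by the path to the review file and each chunk could be appended to the output `csv` instead of being collected in a list.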

# Yelp User Data

The features in the `user.json` file:

* `user_id`: (`str`) unique id of the user

* `name`: (`str`) user's first name

* `review_count`: (`int`) total number of reviews of the user

* `yelping_since`: (`str`) when the user joined Yelp, formatted `YYYY-MM-DD`

* `friends`: (`list`) list of user's friends as `user_id`s

* `useful`: (`int`) number of `useful` votes sent by the user

* `funny`: (`int`) number of `funny` votes sent by the user

* `cool`: (`int`) number of `cool` votes sent by the user

* `fans`: (`int`) number of fans the user has

* `elite`: (`list`) list of the years the user was elite

* `average_stars`: (`float`) average star rating of all reviews of the user

* `compliment_hot`: (`int`) number of hot compliments received by the user

* `compliment_more`: (`int`) number of more compliments received by the user

* `compliment_profile`: (`int`) number of profile compliments received by the user

* `compliment_cute`: (`int`) number of cute compliments received by the user

* `compliment_list`: (`int`) number of list compliments received by the user

* `compliment_note`: (`int`) number of note compliments received by the user

* `compliment_plain`: (`int`) number of plain compliments received by the user

* `compliment_cool`: (`int`) number of cool compliments received by the user

* `compliment_funny`: (`int`) number of funny compliments received by the user

* `compliment_writer`: (`int`) number of writer compliments received by the user

* `compliment_photos`: (`int`) number of photo compliments received by the user

In [10]:
df = []
with open('yelp_data/yelp_academic_dataset_user.json', encoding='utf-8') as fin:
    for jsonline in fin:
        df.append(json.loads(jsonline))
df = pd.json_normalize(df)
# Rename features
df.rename(columns={'name': 'user_name',
                   'review_count': 'user_review_count',
                   'useful': 'user_useful',
                   'funny': 'user_funny',
                   'cool': 'user_cool',
                   'average_stars': 'user_average_stars'},
          inplace=True)
df.to_csv('yelp_data/yelp_academic_dataset_user.csv', index=False)