# YELP Data Processing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from collections import Counter
import json
import pandas as pd
import numpy as np
# from IPython.core.display import display, HTML
# display(HTML('<style>.container { width:90% !important; }</style>'))

In [3]:
def clean_json(text):
    """Cleans up an `attributes` value so that it can be parsed as JSON"""
    if not isinstance(text, str):
        text = str(text)
    text = text.replace('True', '"True"')
    text = text.replace('False', '"False"')
    text = text.replace("'", '"')
    text = text.replace('None', '"False"')
    text = text.replace('"DriveThr', '"DriveThr"')
    text = text.replace('""', '"')
    text = text.replace('u"', '')
    text = text.replace('"{', '{').replace('}"', '}')
    return text

def make_json(text):
    """Transform a string object into a JSON object"""
    return json.loads(text)

def get_list(x):
    """Returns a list of categories if applicable else returns a string."""
    try:
        return x.split(', ')
    except AttributeError:  # e.g. `x` is None and has no `split` method
        return 'NA'
    
def get_degree(d):
    """If `d` is a dict, returns 1 plus the number of its values that are
    themselves dicts (1 means no nested objects); otherwise returns 0."""
    if isinstance(d, dict):
        i = 1
        for k, v in d.items():
            if isinstance(v, dict):
                i += 1
        return i
    return 0

def print_key_value(text, indent=0):
    """Prints key and value pair of a JSON object. If a nested JSON object, then
    nested values are indented"""
    if not isinstance(text, dict):
        try:
            text = json.loads(text)
        except (TypeError, ValueError):  # not a string or not valid JSON
            return
    for k, v in text.items():
        if isinstance(v, dict):
            print('\t' * indent + str(k) + ':')
            iterate = indent + 1
            print_key_value(v, iterate)
        else:
            print('\t' * indent + str(k) + ' --> ' + str(v))

The [Yelp](https://www.yelp.com/dataset) dataset consists of six `json` files: `business.json`, `review.json`, `user.json`, `checkin.json`, `tip.json` and `photo.json`.

* `business.json` contains information about each business, such as its __name__, __location__, __attributes__ and business __hours__.

* `review.json` contains information about each posted review, such as the __user id__, the __star__ rating, the __review__ text and the number of __useful__ votes.

* `user.json` contains information about each Yelp user, such as first __name__, total number of __reviews__, list of __friends__ and average star __rating__.

* `checkin.json` contains check-in information for each business, such as the __business id__ and __date__.

* `tip.json` contains tips, which are shorter than reviews and convey quick suggestions; it provides the __tip__ text, the number of __compliments__ and the __date__.

* `photo.json` contains information about each photo uploaded to Yelp, such as the __photo id__ and photo __label__.

The main interest of this research is to identify the characteristics that make a review useful and to use those characteristics to decide whether a freshly posted review will be useful. For this reason, we are mostly interested in the following data: __businesses__, __reviews__, __users__ and __tips__.

Detailed information and the full list of features for each data file are available in the Yelp [documentation](https://www.yelp.com/dataset).

# Business Data

The features in the `business.json` file:

* `business_id`: (`str`) unique id of the business

* `name`: (`str`) the business' name

* `address`: (`str`) the full address of the business

* `city`: (`str`) the city where the business is located

* `state`: (`str`) the state where the business is located

* `postal_code`: (`str`) the postal code of the business

* `latitude`: (`float`) latitude

* `longitude`: (`float`) longitude

* `stars`: (`float`) average star rating, rounded to half-stars

* `review_count`: (`int`) total number of reviews given to the business

* `is_open`: (`int`) (binary) indicates whether the business is still open

* `attributes`: (`json`) attributes of the business

* `categories`: (`list`) description of the business

* `hours`: (`json`) the working hours of the business

* The businesses in `business.json` are located in North America, mainly in the US and Canada. The main subject of this study is restaurants in the US, so we focus on US businesses and discard all others. However, the US businesses in the dataset are clustered around Arizona-Nevada, Ohio-Pennsylvania and North Carolina rather than scattered across the country. Accordingly, 'businesses in the US' refers to the businesses located in those states.
* The feature `is_open` indicates whether a business is still open. Our sample covers businesses that are in the restaurant category and are still operating, so we discard all closed businesses.
* The feature `attributes` contains the attributes of a business. These attributes are not consistent across businesses. We expect businesses in different categories to have different attributes, but the attributes are not consistent even among restaurants. For this reason, we keep only the attributes with 25% or fewer missing values.
* Finally, the `hours` feature gives the business hours of each business. It has a large share of missing values, so it is also discarded from the study.
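
The missing-value rule for `attributes` can be illustrated on a small, made-up table. This is only a sketch: the `toy` frame and its values are invented for illustration, and only the 25% threshold comes from the text above.

```python
import pandas as pd

# Made-up data for illustration; None marks a missing attribute.
toy = pd.DataFrame({
    'business_id': ['a', 'b', 'c', 'd'],
    'BusinessAcceptsCreditCards': ['True', 'True', None, 'False'],  # 25% missing
    'BikeParking': ['True', None, None, 'False'],                   # 50% missing
    'ByAppointmentOnly': [None, None, None, 'False'],               # 75% missing
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isnull().mean()

# Keep only the columns with 25% or fewer missing values.
kept = toy.loc[:, missing_share <= 0.25]
print(list(kept.columns))
```

Only `business_id` and `BusinessAcceptsCreditCards` pass the threshold in this toy example.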

In [4]:
df = pd.read_json('/content/drive/MyDrive/yelp/yelp_academic_dataset_business.json', lines=True)
print('First 5 rows of the business data which provides information about', end='')
print(' {:,} businesses with {} features:'.format(df.shape[0], df.shape[1]))
display(df.head())

First 5 rows of the business data which provides information about 209,393 businesses with 14 features:


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh...","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'..."
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,...",
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",
3,6OAZjbxqM5ol29BuHsil3w,Nevada House of Hose,1015 Sharp Cir,North Las Vegas,NV,89030,36.219728,-115.127725,2.5,3,0,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Hardware Stores, Home Services, Building Suppl...","{'Monday': '7:0-16:0', 'Tuesday': '7:0-16:0', ..."
4,51M2Kk903DFYI6gnB5I6SQ,USE MY GUY SERVICES LLC,4827 E Downing Cir,Mesa,AZ,85205,33.428065,-111.726648,4.5,26,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Home Services, Plumbing, Electricians, Handyma...","{'Monday': '0:0-0:0', 'Tuesday': '9:0-16:0', '..."


In [5]:
df.drop(['postal_code', 'hours'], axis=1, inplace=True)
# remove the erroneous characters in the `attributes` feature and parse it
df.attributes = df.attributes.apply(clean_json)
df.attributes = df.attributes.apply(make_json)
# convert `categories` into a list
df.categories = df.categories.apply(get_list)
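
As an alternative to the string replacements in `clean_json`, the raw attribute values could be parsed with `ast.literal_eval`. The sketch below is not the approach used in this notebook. `parse_attributes` is a hypothetical helper, and it assumes that the raw values are Python-literal strings such as `'True'`, `"u'no'"` or `"{'garage': False}"`. Unlike `clean_json`, it yields real booleans rather than the strings `'True'` and `'False'`.

```python
import ast

def parse_attributes(raw):
    """Hypothetical helper: parse one raw `attributes` dict.

    Assumes each value is either a Python-literal string or plain text.
    Returns None when the business has no attributes.
    """
    if not isinstance(raw, dict):
        return None
    parsed = {}
    for key, value in raw.items():
        if isinstance(value, str):
            try:
                value = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # plain text that is not a literal: keep it unchanged
        parsed[key] = value
    return parsed

# Made-up record in the assumed raw format.
sample = {
    'BusinessAcceptsCreditCards': 'True',
    'WiFi': "u'no'",
    'BusinessParking': "{'garage': False, 'lot': True}",
}
parsed = parse_attributes(sample)
print(parsed)
```

Because the result contains booleans instead of strings, later steps that compare against `'True'` and `'False'` would need to be adjusted if this variant were used.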

## Extract Categories

In [6]:
# classify businesses based on their categories
# all major categories in US
# https://www.yelp.com/developers/documentation/v3/category_list
d_categories = {
    'active': 'Active Life',
    'arts': 'Arts & Entertainment',
    'auto': 'Automotive',
    'beautysvc': 'Beauty & Spas',
    'education': 'Education',
    'eventservices': 'Event Planning & Services',
    'financialservices': 'Financial Services',
    'food': 'Food',
    'health': 'Health & Medical',
    'homeservices': 'Home Services',
    'hotelstravel': 'Hotels & Travel',
    'localflavor': 'Local Flavor',
    'localservices': 'Local Services',
    'massmedia': 'Mass Media',
    'nightlife': 'Nightlife',
    'pets': 'Pets',
    'professional': 'Professional Services',
    'publicservicesgovt': 'Public Services & Government',
    'realestate': 'Real Estate',
    'religiousorgs': 'Religious Organizations',
    'restaurants': 'Restaurants',
    'shopping': 'Shopping'
}

# generates a dummy variable for each category
for category in d_categories:
    df[category] = df.categories.apply(
        lambda x: 1 if d_categories[category] in x else 0)
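
The dummy-variable loop above can be tried on a small example. The `toy` frame and the shortened `labels` mapping below are made up for illustration. The third row uses the `'NA'` placeholder that `get_list` returns when a business has no categories.

```python
import pandas as pd

# Made-up rows: two category lists and the 'NA' placeholder.
toy = pd.DataFrame({
    'categories': [['Shopping', 'Toy Stores'], ['Restaurants', 'Food'], 'NA']
})
labels = {'food': 'Food', 'restaurants': 'Restaurants', 'shopping': 'Shopping'}

for column, label in labels.items():
    # For a list, `in` tests exact membership; for the string 'NA' it is a
    # substring test, which is False for all labels used here.
    toy[column] = toy['categories'].apply(
        lambda cats, label=label: int(label in cats))

dummies = toy[['food', 'restaurants', 'shopping']].values.tolist()
print(dummies)
```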

## Identify Businesses in the US

In [7]:
# generate a dummy variable which indicates if a business is located in the US
states = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID',
    'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS',
    'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK',
    'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV',
    'WI', 'WY'
]

df['in_US'] = df.state.apply(lambda x: 1 if x in states else 0)

In [8]:
df.to_csv('business_data.csv', index=False)

## Extract Business Attributes

In [9]:
df = df.loc[(df.is_open == 1) & (df.categories != 'NA')
            & (df.in_US == 1) & (df.shopping == 1)]
# find the degree of nestedness of each json object in the business attributes
nested_degree = [get_degree(d) for d in df.attributes.values]

# an example without nested objects - only the top-level json object itself
print('Example of the attributes of {}.'.format(
    df.name.values[nested_degree.index(1)]), end='')
print(' The attributes does not have any child.')
print('-'*90, end='\n\n')
print_key_value(list(df.attributes.values)[nested_degree.index(1)])

# an example with the most nested objects - the json object and its children
print('\n\nExample of the attributes of {}.'.format(
    df.name.values[nested_degree.index(max(nested_degree))]), end='')
print(' The attributes have several childs.')
print('-'*81, end='\n\n')
print_key_value(list(df.attributes.values)[
    nested_degree.index(max(nested_degree))])

Example of the attributes of Bloom & Blueprint. The attributes does not have any child.
------------------------------------------------------------------------------------------

BusinessAcceptsCreditCards --> True
RestaurantsDelivery --> True


Example of the attributes of Wild Wing Cafe. The attributes have several childs.
---------------------------------------------------------------------------------

RestaurantsPriceRange2 --> 2
RestaurantsGoodForGroups --> True
WiFi --> no
RestaurantsTakeOut --> True
Caters --> True
OutdoorSeating --> True
BikeParking --> True
GoodForKids --> True
BusinessAcceptsCreditCards --> True
HasTV --> True
NoiseLevel --> loud
Alcohol --> full_bar
BusinessParking:
	garage --> False
	street --> False
	validated --> False
	lot --> True
	valet --> False
RestaurantsAttire --> casual
GoodForMeal:
	dessert --> False
	latenight --> True
	lunch --> True
	dinner --> True
	brunch --> False
	breakfast --> False
BestNights:
	monday --> False
	tuesday --> True
	frida

In [10]:
df['has_attributes'] = df.attributes.apply(lambda x: 1 if isinstance(x, dict) else 0)
mask = df.has_attributes == 1
df_attributes = pd.json_normalize(df.loc[mask, 'attributes'].values)
df_attributes['business_id'] = df[mask].business_id.values
df = df.merge(df_attributes, on='business_id', how='left')
display(df_attributes.head())

Unnamed: 0,BusinessAcceptsCreditCards,BikeParking,GoodForKids,ByAppointmentOnly,RestaurantsPriceRange2,BusinessParking.garage,BusinessParking.street,BusinessParking.validated,BusinessParking.lot,BusinessParking.valet,RestaurantsDelivery,BusinessAcceptsBitcoin,WheelchairAccessible,Alcohol,RestaurantsTakeOut,Caters,OutdoorSeating,WiFi,RestaurantsAttire,NoiseLevel,RestaurantsReservations,RestaurantsGoodForGroups,RestaurantsTableService,HasTV,Ambience.romantic,Ambience.intimate,Ambience.classy,Ambience.hipster,Ambience.divey,Ambience.touristy,Ambience.trendy,Ambience.upscale,Ambience.casual,DogsAllowed,BusinessParking,AcceptsInsurance,HappyHour,GoodForMeal.dessert,GoodForMeal.latenight,GoodForMeal.lunch,GoodForMeal.dinner,GoodForMeal.brunch,GoodForMeal.breakfast,CoatCheck,Smoking,HairSpecializesIn.straightperms,HairSpecializesIn.coloring,HairSpecializesIn.extensions,HairSpecializesIn.africanamerican,HairSpecializesIn.curly,HairSpecializesIn.kids,HairSpecializesIn.perms,HairSpecializesIn.asian,GoodForDancing,BYOB,Corkage,Music.dj,Music.background_music,Music.no_music,Music.jukebox,Music.live,Music.video,Music.karaoke,BestNights.monday,BestNights.tuesday,BestNights.friday,BestNights.wednesday,BestNights.thursday,BestNights.sunday,BestNights.saturday,HairSpecializesIn,Ambience,DriveThr,BYOBCorkage,AgesAllowed,Music,Open24Hours,business_id
0,True,True,False,False,3.0,False,False,False,True,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,f9NumwFMBDn751xgFiRbNA
1,True,,,,,,,,,,True,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,nIEhsGbw0vJuYl05bzzj6Q
2,True,True,,False,2.0,False,False,False,True,False,,False,True,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,UqBTL1dq9QcOISikgeknow
3,True,,,,2.0,False,False,False,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,PslhllUwcQFavRHp-lyMOQ
4,False,False,True,,4.0,False,False,False,False,False,False,,,none,False,True,True,no,formal,very_loud,False,False,False,True,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,lK-wuiq8b1TuU7bfbQZgsg
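
The dotted column names in the table above, such as `BusinessParking.garage`, come from `pd.json_normalize`, which flattens nested dictionaries into one column per leaf key. A minimal sketch with made-up records:

```python
import pandas as pd

# Made-up attribute records; the second business has no parking information.
records = [
    {'BusinessAcceptsCreditCards': 'True',
     'BusinessParking': {'garage': 'False', 'lot': 'True'}},
    {'BusinessAcceptsCreditCards': 'False'},
]

flat = pd.json_normalize(records)
print(sorted(flat.columns))
print(flat)
```

Keys that are absent from a record become missing values in the flattened table, which is why the attribute columns differ so much in their share of missing values.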


In [11]:
temp = pd.DataFrame({'business_attributes': df_attributes.columns,
                     '%_missing': [df[column].isnull().sum() / df.shape[0]
                                   for column in df_attributes.columns]})
temp.sort_values('%_missing', inplace=True)
temp.reset_index(drop=True, inplace=True)
with pd.option_context('display.max_rows', None):
    display(temp)

Unnamed: 0,business_attributes,%_missing
0,business_id,0.0
1,BusinessAcceptsCreditCards,0.098448
2,BusinessParking.garage,0.282615
3,BusinessParking.valet,0.282754
4,BusinessParking.street,0.282801
5,BusinessParking.validated,0.282801
6,BusinessParking.lot,0.282801
7,RestaurantsPriceRange2,0.285495
8,BikeParking,0.348773
9,ByAppointmentOnly,0.615081


In [12]:
to_drop = temp.loc[temp['%_missing'] > 0.25, 'business_attributes'].values
df.drop(to_drop, axis=1, inplace=True)
df = df.loc[:, ['business_id', 'name', 'address', 'city', 'state', 'stars',
                'review_count', 'BusinessAcceptsCreditCards']]
# assign the result back instead of calling replace(..., inplace=True) on a column
df['BusinessAcceptsCreditCards'] = df['BusinessAcceptsCreditCards'].replace(
    {'True': 1, 'False': 0})
display(df.head())

Unnamed: 0,business_id,name,address,city,state,stars,review_count,BusinessAcceptsCreditCards
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,3.5,36,1.0
1,nIEhsGbw0vJuYl05bzzj6Q,Bloom & Blueprint,"2115 E Cedar St, Unit 3",Tempe,AZ,4.5,7,1.0
2,UqBTL1dq9QcOISikgeknow,GK's Vapor Pub,"17058 W Bell Raod, Ste 101",Surprise,AZ,5.0,8,1.0
3,PslhllUwcQFavRHp-lyMOQ,Disney Store,"8111 Concord Mills Blvd, Space 214",Concord,NC,3.5,6,1.0
4,lK-wuiq8b1TuU7bfbQZgsg,Hingetown,,Cleveland,OH,3.0,4,0.0


In [13]:
df.rename(columns={'name': 'business_name', 'stars': 'business_stars', 'review_count': 'business_review_count'}, inplace=True)
df.to_csv('business_processed.csv', index=False)

# Reviews Data

The features in the `review.json` file:

* `review_id`: (`str`) unique id of the review

* `user_id`: (`str`) unique id of the user

* `business_id`: (`str`) unique id of the business

* `stars`: (`int`) star rating

* `date`: (`str`) date formatted `YYYY-MM-DD`

* `text`: (`str`) the review itself

* `useful`: (`int`) number of useful votes received

* `funny`: (`int`) number of funny votes received

* `cool`: (`int`) number of cool votes received

In [14]:
reviews = []
with open('/content/drive/MyDrive/yelp/yelp_academic_dataset_review.json') as fin:
    for jsonline in fin:
        reviews.append(json.loads(jsonline))
reviews = pd.json_normalize(reviews)
business_ids = df.loc[:, 'business_id'].to_frame()
df_reviews = reviews.merge(business_ids, on='business_id', validate='many_to_one')
display(df_reviews.head())
del reviews

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2.0,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16
1,t7xOZF5UKXjSpVcXLOSAgw,owbC7FP8SNAlwv6f9S5Stw,-MhfebM0QIsKt87iDN-FNw,2.0,2,2,0,I have been there. I believe more than once. \...,2014-03-14 08:24:25
2,MimB5Xh85rG7phUMPrShag,v9vGnjphb0Hta0lvtf5haA,-MhfebM0QIsKt87iDN-FNw,3.0,5,3,3,I haven't been to Las Vegas in about 15 years....,2015-10-07 22:16:59
3,sLkT7J06L4TK4PiRUFax2g,AXuHgGQoNPkiSXTxHlQc0A,-MhfebM0QIsKt87iDN-FNw,2.0,0,0,0,One of the few places in town you can view wor...,2015-11-18 22:20:55
4,cnV5xtm6WuyaLfot9uWbDg,LkWNo83Lg92C5V4JEyxOZA,-MhfebM0QIsKt87iDN-FNw,3.0,3,1,2,This is a hard one to review. I was excited to...,2010-10-10 01:27:31
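
The cell above loads every review into memory before filtering. If memory is tight, one possible alternative is to read the file in chunks with `pd.read_json(..., lines=True, chunksize=...)` and filter each chunk as it arrives. The sketch below is not the method used in this notebook. It uses an in-memory stand-in for the file, and `sample` and `wanted_ids` are made-up names and values.

```python
import io
import pandas as pd

# Made-up stand-in for the review file: one JSON object per line.
sample = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "useful": 2}\n'
    '{"review_id": "r2", "business_id": "b2", "useful": 0}\n'
    '{"review_id": "r3", "business_id": "b1", "useful": 1}\n'
)
wanted_ids = {'b1'}  # assumed set of business ids to keep

chunks = []
# With `chunksize`, read_json returns an iterator of DataFrames.
for chunk in pd.read_json(sample, lines=True, chunksize=2):
    chunks.append(chunk[chunk['business_id'].isin(wanted_ids)])

filtered = pd.concat(chunks, ignore_index=True)
print(filtered['review_id'].tolist())
```

With the real data, a file path would replace `sample`, and the ids would come from the processed business table.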


In [15]:
# generate dummy variables
df_reviews['is_useful'] = np.where(df_reviews.useful != 0, 1, 0)
df_reviews['is_funny'] = np.where(df_reviews.funny != 0, 1, 0)
df_reviews['is_cool'] = np.where(df_reviews.cool != 0, 1, 0)
df_reviews.date = pd.to_datetime(df_reviews.date)
df_reviews['review_years'] = df_reviews.date.dt.year
df_reviews['review_months'] = df_reviews.date.dt.month
df_reviews.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,is_useful,is_funny,is_cool,review_years,review_months
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2.0,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16,1,0,0,2015,4
1,t7xOZF5UKXjSpVcXLOSAgw,owbC7FP8SNAlwv6f9S5Stw,-MhfebM0QIsKt87iDN-FNw,2.0,2,2,0,I have been there. I believe more than once. \...,2014-03-14 08:24:25,1,1,0,2014,3
2,MimB5Xh85rG7phUMPrShag,v9vGnjphb0Hta0lvtf5haA,-MhfebM0QIsKt87iDN-FNw,3.0,5,3,3,I haven't been to Las Vegas in about 15 years....,2015-10-07 22:16:59,1,1,1,2015,10
3,sLkT7J06L4TK4PiRUFax2g,AXuHgGQoNPkiSXTxHlQc0A,-MhfebM0QIsKt87iDN-FNw,2.0,0,0,0,One of the few places in town you can view wor...,2015-11-18 22:20:55,0,0,0,2015,11
4,cnV5xtm6WuyaLfot9uWbDg,LkWNo83Lg92C5V4JEyxOZA,-MhfebM0QIsKt87iDN-FNw,3.0,3,1,2,This is a hard one to review. I was excited to...,2010-10-10 01:27:31,1,1,1,2010,10


In [16]:
df_reviews.rename(columns={'stars': 'review_stars', 'useful': 'review_useful', 'funny': 'review_funny',
                           'cool': 'review_cool', 'text': 'review', 'date': 'review_date'}, inplace=True)
df_reviews.to_csv('reviews_data.csv', index=False)

# User Data

The features in the `user.json` file:

* `user_id`: (`str`) unique id of the user

* `name`: (`str`) user's first name

* `review_count`: (`int`) total number of reviews of the user

* `yelping_since`: (`str`) when the user joined Yelp, formatted `YYYY-MM-DD`

* `friends`: (`list`) list of user's friends as `user_id`s

* `useful`: (`int`) number of `useful` votes sent by the user

* `funny`: (`int`) number of `funny` votes sent by the user

* `cool`: (`int`) number of `cool` votes sent by the user

* `fans`: (`int`) number of fans the user has

* `elite`: (`list`) list of the years the user was elite

* `average_stars`: (`float`) average star rating of all reviews of the user

* `compliment_hot`: (`int`) number of hot compliments received by the user

* `compliment_more`: (`int`) number of more compliments received by the user

* `compliment_profile`: (`int`) number of profile compliments received by the user

* `compliment_cute`: (`int`) number of cute compliments received by the user

* `compliment_list`: (`int`) number of list compliments received by the user

* `compliment_note`: (`int`) number of note compliments received by the user

* `compliment_plain`: (`int`) number of plain compliments received by the user

* `compliment_cool`: (`int`) number of cool compliments received by the user

* `compliment_funny`: (`int`) number of funny compliments received by the user

* `compliment_writer`: (`int`) number of writer compliments received by the user

* `compliment_photos`: (`int`) number of photo compliments received by the user

In [17]:
users = []
with open('/content/drive/My Drive/yelp/yelp_academic_dataset_user.json') as fin:
    for jsonline in fin:
        users.append(json.loads(jsonline))
users = pd.json_normalize(users)
user_ids = df_reviews[['user_id']].drop_duplicates()
df_users = users.merge(user_ids, on='user_id', validate='one_to_one')
display(df_users.head())
del users

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,ntlvfPzc8eglqvk92iDIAw,Rafael,553,2007-07-06 03:27:11,628,225,227,,"oeMvJh94PiGQnx_6GlndPQ, wm1z1PaJKvHgSDRKfwhfDg...",14,3.57,3,2,1,0,1,11,15,22,22,10,0
1,FOBRPlBHa3WPHFB5qYDlVg,Michelle,564,2008-04-28 01:29:25,790,316,400,200820092010201120122013,"ly7EnE8leJmyqyePVYFlug, pRlR63iDytsnnniPb3AOug...",27,3.84,36,4,5,2,1,33,37,63,63,21,5
2,f4_MRNHvN-yRn7EA8YWRxg,Jennifer,822,2011-01-17 00:18:23,4127,2446,2878,20112012201320142015201620172018,"c-Dja5bexzEWBufNsHfRrQ, 02HJNyOzzYXvEKVApJb8GQ...",137,3.63,483,81,62,35,24,193,541,623,623,293,172
3,UYACF30806j2mfbB5vdmJA,Justin,14,2007-07-24 23:55:21,68,21,34,,"YwaKGmRNnSa3R3N4Hf9jLw, v9YpDzYkJarRbzvVIY-63g...",4,3.75,0,3,0,0,0,3,4,0,0,2,1
4,q-v8elVPvKz0KvK69QSj1Q,Lisa Marie,666,2009-05-19 01:42:25,2993,1281,1832,20112012201320142015201620172018,"rt1KveqwFMnkN6dXKg5Qyg, NfnKx3z7zFottS3yHabw1g...",197,3.37,212,11,11,13,7,120,150,135,135,42,72
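
The merges in this notebook pass `validate=`, which makes pandas check the key relationship before joining and raise `pandas.errors.MergeError` if the check fails. A small sketch with made-up frames shows both outcomes:

```python
import pandas as pd

# Made-up data: business 'b1' has two reviews, 'b3' has none.
toy_reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'business_id': ['b1', 'b1', 'b2'],
})
toy_businesses = pd.DataFrame({'business_id': ['b1', 'b3']})

# many_to_one: keys may repeat on the left but must be unique on the right.
kept = toy_reviews.merge(toy_businesses, on='business_id',
                         validate='many_to_one')
print(kept['review_id'].tolist())

# one_to_one fails here because 'b1' appears twice on the left.
try:
    toy_reviews.merge(toy_businesses, on='business_id',
                      validate='one_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

The default inner join also drops rows without a match, which is how the review, user and tip tables are restricted to the selected businesses.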


In [18]:
df_users.rename(columns={'name': 'user_name', 'review_count': 'user_review_count',
                         'useful': 'user_useful', 'funny': 'user_funny',
                         'cool': 'user_cool', 'average_stars': 'user_average_stars'}, inplace=True)
df_users.to_csv('users_data.csv', index=False)

# Tips Data

In [19]:
tips = []
with open('/content/drive/My Drive/yelp/yelp_academic_dataset_tip.json') as fin:
    for jsonline in fin:
        tips.append(json.loads(jsonline))
tips = pd.json_normalize(tips)
df_tips = tips.merge(business_ids, on='business_id', validate='many_to_one')
display(df_tips.head())
del tips

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,hf27xTME3EiCp6NL6VtWZQ,UYX5zL_Xj9WEc_Wp-FrqHw,Here for a quick mtg,2013-11-26 18:20:08,0
1,Wi0VgIrbb8vqU6weyVw6tg,UYX5zL_Xj9WEc_Wp-FrqHw,Surprised by the inventory! Was looking for a ...,2011-07-25 02:03:06,0
2,5J4uykvpVCZ3pwceIKf45g,UYX5zL_Xj9WEc_Wp-FrqHw,Thomas train table is a lot of fun!,2012-04-15 16:15:12,0
3,dWkaK0k-5WSY4BJny1BtHw,UYX5zL_Xj9WEc_Wp-FrqHw,Scanning books I want for Lydia and adding the...,2011-10-13 21:35:53,0
4,8HXpvdxGR_yBQuA_T23Cxw,UYX5zL_Xj9WEc_Wp-FrqHw,Mockingjay!!,2012-04-11 03:33:53,0


In [20]:
df_tips.rename(columns={'text': 'tip', 'date': 'tip_date', 'compliment_count': 'tip_compliment_count'}, inplace=True)
df_tips.to_csv('tips_data.csv', index=False)