#  Analyzing Yelp Business Dataset

DATS 6103 - Individual Project 2 - Yushuang Zhao

### Data Source

This dataset comes from the Yelp Open Dataset:

- https://www.yelp.com/dataset

- The Yelp dataset contains 5 JavaScript Object Notation (JSON) files: business, review, checkin, user, tip.
- This analysis uses only the 'business' file.


# 1. Import Packages

In [114]:
!pip install chart-studio



In [115]:
import json 
import csv
import pandas as pd
import numpy as np
import re

import plotly
import plotly.express as px
import plotly.graph_objects as go
import chart_studio.plotly as py


from IPython.display import display

import warnings
warnings.filterwarnings('ignore')

## Sign in Plotly API


In [116]:
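import os

#read the Plotly credentials from environment variables instead of hard-coding them
#(PLOTLY_USERNAME and PLOTLY_API_KEY are assumed variable names, set them before running)
py.sign_in(os.environ['PLOTLY_USERNAME'], os.environ['PLOTLY_API_KEY'])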

# 2. Data Collection and Preprocessing

In [117]:
#Converting JSON file to CSV file 

pd.set_option('display.max_columns', 50)  # display columns
df = pd.read_json(r'yelp_business.json',lines = True)
df.to_csv("yelp_business.csv")

In [118]:
#read CSV file

business = pd.read_csv("yelp_business.csv")

In [119]:
#read yelp business dataset from JSON file 

#business_json_path = 'yelp_business.json'
#business = pd.read_json(business_json_path, lines=True)

In [120]:
#take a brief look at the dataset

business.head(3)

Unnamed: 0.1,Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh...","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'..."
1,1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,...",
2,2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",
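
The extra unnamed column in the preview above is most likely the DataFrame index: `to_csv` writes it by default, and `read_csv` then reads it back as an ordinary column. A minimal sketch with a toy frame (the two rows and the in-memory buffer are illustrative assumptions, not the Yelp file) shows the round trip with and without `index=False`:

```python
import io
import pandas as pd

# Toy stand-in for the business table (made-up rows).
toy = pd.DataFrame({"business_id": ["a1", "b2"], "stars": [3.5, 5.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` to `df.to_csv(...)` in the conversion cell would likely avoid the extra column in the same way.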


In [121]:
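#use the DataFrame parsed from the JSON file directly; the CSV round trip above added an extra index column
business = df.iloc[:,0:]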

In [122]:
business.head(3)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh...","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'..."
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,...",
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",


In [123]:
#remove the irrelevant columns

drop_col = ['hours']
business = business.drop(drop_col, axis = 1)

In [124]:
#remove the businesses that are not open anymore
#only keep the businesses that are still open

business = business[business['is_open']==1]

In [125]:
#check data structure

business.shape

(168903, 13)

In [126]:
#check NA for the needed column - 'categories'

business['categories'].isna().value_counts() 

False    168401
True        502
Name: categories, dtype: int64

In [127]:
#remove NA values

business = business[business['categories'].notna()] #take out missing ones
business.shape

(168401, 13)
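
The three cleaning steps above (drop `hours`, keep open businesses, drop rows without categories) can also be collected into one helper. This is only a sketch: `clean_business` is a hypothetical name and the toy rows are made up for illustration.

```python
import pandas as pd

def clean_business(df: pd.DataFrame) -> pd.DataFrame:
    """Drop 'hours', keep open businesses, and drop rows with no categories."""
    out = df.drop(columns=["hours"], errors="ignore")
    out = out[out["is_open"] == 1]
    # dropna(subset=...) only looks at the 'categories' column
    return out.dropna(subset=["categories"])

# Toy rows (assumed values, not taken from the Yelp file).
toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "is_open": [1, 0, 1, 1],
    "categories": ["Pets", "Food", None, "Shopping, Food"],
    "hours": [None, None, None, None],
})

cleaned = clean_business(toy)
print(cleaned.shape)
```

Like the cells above, the helper keeps the original row index, so later previews still show the row labels from the source file.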

In [128]:
business.head(3)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh..."
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,..."
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers"


# 3. Preprocessing and Visualization

## 3.1 Number of Businesses by States

In [129]:
#sort out the total number of business by state
state_business_count = business[['state', 'business_id']].groupby(['state'])\
['business_id'].agg('count').sort_values(ascending=False)

#make it as a dataframe 
state_business_count = pd.DataFrame(data=state_business_count)

#rename the column
state_business_count.rename(columns={'business_id' : 'total_business'}, inplace=True)

#sort out top 10 states
#state_business_count = state_business_count


In [130]:
# same methods apply to number of reviews
# sort out the total number of review also by state

state_review_count = business[['state', 'review_count', 'stars']].groupby(['state']).\
agg({'review_count': 'sum', 'stars': 'mean'}).sort_values(by='review_count', ascending=False)


In [131]:
#combine the two dataframes together and name the new dataframe "biz_state"

biz_state = pd.concat([state_business_count,state_review_count.iloc[:,0:]],axis=1) 
biz_state.head(3)

Unnamed: 0,total_business,review_count,stars
AZ,49266,2084524,3.693693
NV,31085,2318129,3.68747
ON,28348,725330,3.327501


In [132]:
biz_state = biz_state.reset_index()
biz_state.rename(columns={'index' : 'state'}, inplace=True)
biz_state.head(3)

Unnamed: 0,state,total_business,review_count,stars
0,AZ,49266,2084524,3.693693
1,NV,31085,2318129,3.68747
2,ON,28348,725330,3.327501
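
The same per-state table can be built in a single `groupby` with named aggregation, which avoids the separate `concat` and rename steps. A minimal sketch on made-up rows (the toy values and the name `biz_state_sketch` are assumptions for illustration):

```python
import pandas as pd

# Toy rows with the same column names as the business table.
toy = pd.DataFrame({
    "state": ["AZ", "AZ", "NV", "ON"],
    "business_id": ["a", "b", "c", "d"],
    "review_count": [10, 30, 50, 5],
    "stars": [4.0, 3.0, 5.0, 2.5],
})

biz_state_sketch = (
    toy.groupby("state")
       .agg(total_business=("business_id", "count"),  # businesses per state
            review_count=("review_count", "sum"),     # total reviews per state
            stars=("stars", "mean"))                  # average rating per state
       .sort_values("total_business", ascending=False)
       .reset_index()
)
print(biz_state_sketch)
```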


In [133]:
#A brief look at business geographically

fig = go.Figure(data=go.Choropleth(
    locations=biz_state['state'], # Spatial coordinates
    z = biz_state['total_business'].astype(float), 
    locationmode = 'USA-states', # set of locations 
    colorscale = 'Blues',
    colorbar_title = "Number of Businesses",
))

fig.update_layout(
    title_text = 'Total Number of Businesses by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()


- Arizona has the most businesses in the Yelp dataset, about 49k.
- Nevada has the second most, around 31k businesses.
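
Note that the `state` column also holds Canadian province codes such as `ON`, which a map with `locationmode = 'USA-states'` cannot draw. The sketch below separates US codes from the rest before plotting; `US_STATE_CODES` is a helper set introduced here for illustration, and the three rows reuse the counts shown in the `biz_state` preview.

```python
import pandas as pd

# 50 US states plus DC, as two-letter postal codes.
US_STATE_CODES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)

# First three rows of biz_state as shown above.
toy = pd.DataFrame({
    "state": ["AZ", "NV", "ON"],
    "total_business": [49266, 31085, 28348],
})

is_us = toy["state"].isin(US_STATE_CODES)
us_rows = toy[is_us]       # rows the USA choropleth can display
other_rows = toy[~is_us]   # rows that need a different chart, e.g. a bar chart

print(us_rows)
print(other_rows)
```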


## 3.2 Number of Businesses by Categories

In [134]:
business["categories"].shape

(168401,)

In [135]:
#split the businesses by category

category_sort = ';'.join(business['categories'])
category_sort_2 = re.split(';|,', category_sort)
biz_category_trim = [item.lstrip() for item in category_sort_2]
biz_category = pd.DataFrame(biz_category_trim,columns=['category'])

In [136]:
biz_category_count = biz_category.category.value_counts()
biz_category_count = biz_category_count.sort_values(ascending = False)


In [137]:
biz_category_count

Restaurants                    43965
Shopping                       28480
Food                           24844
Home Services                  20653
Health & Medical               17626
                               ...  
Calabrian                          1
Stonemasons                        1
Sauna Installation & Repair        1
Kiosk                              1
Linens                             1
Name: category, Length: 1324, dtype: int64
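
An alternative to joining everything into one long string is to split each row and let pandas expand the lists with `explode`. The sketch below uses three made-up category strings; it should give the same kind of counts as the `re.split` approach, though it strips whitespace on both sides rather than only on the left.

```python
import pandas as pd

# Toy category strings in the same comma-separated format.
toy = pd.Series(
    ["Pets, Pet Services, Pet Groomers", "Restaurants, Food", "Food"],
    name="categories",
)

category_counts = (
    toy.str.split(",")   # one list of labels per business
       .explode()        # one row per (business, label) pair
       .str.strip()      # remove the space left after each comma
       .value_counts()
)
print(category_counts)
```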

In [138]:
biz_category = biz_category_count

- There are 1,324 business categories in total in the Yelp dataset.

In [139]:
#filter out the TOP 10 categories 

biz_category = biz_category.iloc[0:10]
biz_category = pd.DataFrame(data=biz_category)
biz_category

Unnamed: 0,category
Restaurants,43965
Shopping,28480
Food,24844
Home Services,20653
Health & Medical,17626
Beauty & Spas,17293
Local Services,14319
Automotive,13149
Nightlife,9818
Event Planning & Services,9500


In [140]:
#rename the columns

biz_category = biz_category.reset_index()
biz_category.rename(columns={'category' : 'total_business'}, inplace=True)

In [141]:
biz_category.rename(columns={'index' : 'category'}, inplace=True)
biz_category

Unnamed: 0,category,total_business
0,Restaurants,43965
1,Shopping,28480
2,Food,24844
3,Home Services,20653
4,Health & Medical,17626
5,Beauty & Spas,17293
6,Local Services,14319
7,Automotive,13149
8,Nightlife,9818
9,Event Planning & Services,9500
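
The two renames above depend on `reset_index()` producing columns called `index` and `category`, which is what older pandas versions do; pandas 2.0 and later name the `value_counts` result differently, so the renames may silently do nothing there. A sketch that names both columns explicitly and should not depend on the pandas version (toy labels are made up):

```python
import pandas as pd

# Toy category labels, one per (business, category) pair.
labels = pd.Series(
    ["Restaurants", "Food", "Restaurants", "Shopping", "Restaurants", "Food"]
)

top = (
    labels.value_counts()
          .head(2)                                # keep the top N categories
          .rename_axis("category")                # name the index explicitly
          .reset_index(name="total_business")     # name the count column explicitly
)
print(top)
```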


In [142]:
#sum the business counts across the top 10 categories (a business can be listed in several categories)

total_num_category = biz_category["total_business"].sum()
total_num_category 

199647

In [143]:
#create a new column, "business%", with each category's share of the top 10 total

biz_category["business%"] = 100* biz_category["total_business"]/total_num_category 
biz_category.head(3)

Unnamed: 0,category,total_business,business%
0,Restaurants,43965,22.021368
1,Shopping,28480,14.265178
2,Food,24844,12.443964
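
The percentages above use the top 10 total (199647) as the denominator, and that total is larger than the number of businesses (168401) because one business can carry several categories. The sketch below recomputes the Restaurants share both ways from the figures shown in this notebook; which denominator is appropriate depends on the question being asked.

```python
# Counts copied from the top 10 table above.
top10_counts = {
    "Restaurants": 43965,
    "Shopping": 28480,
    "Food": 24844,
    "Home Services": 20653,
    "Health & Medical": 17626,
    "Beauty & Spas": 17293,
    "Local Services": 14319,
    "Automotive": 13149,
    "Nightlife": 9818,
    "Event Planning & Services": 9500,
}
n_businesses = 168401  # rows left after cleaning, from business.shape

top10_total = sum(top10_counts.values())

# Share of the top 10 category labels (what the pie chart shows).
share_of_top10 = 100 * top10_counts["Restaurants"] / top10_total

# Share of all open businesses that are tagged as Restaurants.
share_of_businesses = 100 * top10_counts["Restaurants"] / n_businesses

print(round(share_of_top10, 2), round(share_of_businesses, 2))
```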


In [144]:
#Total number of business % in top 10 categories  

fig = px.pie(biz_category, 
             values=biz_category["business%"], 
             names="category", 
             color="category")

fig.update_layout(title = "Total Business VS Top 10 Categories") 
fig.show()

In [145]:
#Overall total number of business to top 10 categories

fig = px.bar(biz_category, 
             x = "category", 
             y = "total_business", 
             color = "category")

fig.update_layout(title = "Total Number of Business in Category") 
fig.update_xaxes(title_text = "Business Category" )
fig.update_yaxes(title_text = 'Total Number of Business')              
fig.show()


- Restaurants is the largest category, with more than 43k businesses.
- Restaurants are far from the whole picture, though: Shopping, Food, and Home Services each have more than 20k businesses.
- There are 1,324 business categories in total in the Yelp dataset.

## 3.2 Number of Businesses by Cities

In [146]:
#sort out the total number of business by city
city_business_count = business[['city', 'business_id']].groupby(['city'])\
['business_id'].agg('count').sort_values(ascending=False)

#make it as a dataframe 
city_business_count = pd.DataFrame(data=city_business_count)

#rename the column
city_business_count.rename(columns={'business_id' : 'total_business'}, inplace=True)

#sort out top 10 cities
city_business_count = city_business_count[:10]

In [147]:
#rename the columns

city_business_count = city_business_count.reset_index()
city_business_count
#city_business_count.rename(columns={'categories' : 'total_business'}, inplace=True)

Unnamed: 0,city,total_business
0,Las Vegas,24971
1,Phoenix,16242
2,Toronto,14928
3,Charlotte,8478
4,Scottsdale,7332
5,Calgary,6752
6,Pittsburgh,6090
7,Mesa,5460
8,Montréal,5356
9,Henderson,4279


In [148]:
#Overall total number of business to top 10 cities

fig = px.scatter(city_business_count, 
                 x = "city", 
                 y = "total_business", 
                 color = "city", 
                 size = "total_business")

fig.update_layout(title = "Total Number of Businesses in Top 10 Cities") 
fig.update_xaxes(title_text = "City Name" )
fig.update_yaxes(title_text = 'Total Number of Businesses')              
fig.show()

- Las Vegas has the most businesses among the top 10 cities.
- The total number of businesses in Las Vegas is almost 25k.
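
City names in the raw data are not fully consistent: the preview rows show `Montreal` while the top 10 table shows `Montréal`, so counts for one city may be split across spellings. A rough sketch of normalizing names before grouping follows; `normalize_city` is a hypothetical helper, and the `title()` call is crude (it would also change names such as "McKees Rocks").

```python
import unicodedata
import pandas as pd

def normalize_city(name: str) -> str:
    """Strip accents and extra whitespace, then title-case the name."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split()).title()

# Toy spellings of the same city plus one unrelated city.
toy = pd.Series(["Montréal", "Montreal", "montreal ", "Las Vegas"])

counts = toy.map(normalize_city).value_counts()
print(counts)
```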

## 3.3 Number of Reviews by Cities

In [149]:
#same methods apply to number of reviews
#sort out the total number of review also by city, keep top 10 city

city_review_count = business[['city', 'review_count', 'stars']].groupby(['city']).\
agg({'review_count': 'sum', 'stars': 'mean'}).sort_values(by='review_count', ascending=False)
city_review_count = city_review_count[:10]

In [150]:
#rename the columns

city_review_count = city_review_count.reset_index()
city_review_count

Unnamed: 0,city,review_count,stars
0,Las Vegas,2016814,3.686036
1,Phoenix,732850,3.612886
2,Toronto,467905,3.427954
3,Scottsdale,369028,3.946877
4,Charlotte,321873,3.514508
5,Henderson,228783,3.771325
6,Pittsburgh,224886,3.618144
7,Tempe,196052,3.706746
8,Mesa,174497,3.617582
9,Chandler,155558,3.74249


In [151]:
#Overall total number of review to top 10 cities

fig = px.bar(city_review_count, 
             x = "city", 
             y = "review_count", 
             color = "city")

fig.update_layout(title = "Total Number of Reviews in Top 10 Cities") 
fig.update_xaxes(title_text = "City Name" )
fig.update_yaxes(title_text = 'Total Number of Reviews')              
fig.show()

- Unsurprisingly, Las Vegas not only has the most businesses but also the most reviews, more than 2 million in total.
- However, these charts alone do not show whether the number of businesses and the number of reviews are correlated.
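
One quick way to probe the relationship is a Pearson correlation over the cities that appear in both top 10 tables above. The figures below are copied from those tables; with only eight points and Las Vegas far above the rest, the coefficient is suggestive at best and should not be read as a firm result.

```python
import pandas as pd

# Cities present in both the business top 10 and the review top 10.
both_top10 = pd.DataFrame({
    "city": ["Las Vegas", "Phoenix", "Toronto", "Charlotte",
             "Scottsdale", "Pittsburgh", "Mesa", "Henderson"],
    "total_business": [24971, 16242, 14928, 8478, 7332, 6090, 5460, 4279],
    "review_count": [2016814, 732850, 467905, 321873,
                     369028, 224886, 174497, 228783],
})

# Series.corr defaults to the Pearson coefficient.
r = both_top10["total_business"].corr(both_top10["review_count"])
print(round(r, 2))
```

Running the same calculation on all cities, not just the top 10, would give a more reliable picture.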

## 3.4 Business in Las Vegas

In [152]:
#subset a dataframe that only contains businesses in Las Vegas

bizLV = business[business['city'] == "Las Vegas"]
bizLV.head(3) #check to make sure have the right dataframe

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories
6,oiAlXZPIFm2nBCt0DHLu_Q,Green World Cleaners,"6870 S Rainbow Blvd, Ste 117",Las Vegas,NV,89118,36.063977,-115.241463,3.5,81,1,"{'BusinessParking': '{'garage': False, 'street...","Dry Cleaning & Laundry, Local Services, Laundr..."
51,5XMKDYmMGSKkCkrYoELxzg,Annette Thomas Hair Colorist Specialist,"101 S Rainbow Blvd, Ste 23, Darby's Hairitage",Las Vegas,NV,89107,36.172534,-115.244762,5.0,7,1,"{'GoodForKids': 'True', 'BusinessParking': '{'...","Hair Stylists, Hair Salons, Beauty & Spas"
62,7uYJJpwORUbCirC1mz8n9Q,Gallo Law Office,818 S Casino Center Blvd,Las Vegas,NV,89101,36.161979,-115.150088,2.5,3,1,,"Lawyers, Professional Services, DUI Law, Crimi..."


### 3.4.1 Number of Reviews vs Rating

In [153]:
#Overall business in Las Vegas

fig = px.scatter(bizLV, 
                 x = "stars", 
                 y = "review_count",
                 color = "stars", 
                 size = "review_count")

fig.update_layout(title = "Reviews VS Rating in Las Vegas") 
fig.update_xaxes(title_text = "Rating(star)" )
fig.update_yaxes(title_text = "Number of Review")              
fig.show()

- The businesses with the most reviews are rated around 4 stars.
- Very few businesses in Las Vegas have a 1-star rating, only 78.
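
Figures such as the number of 1-star businesses can be read from a small summary table instead of the scatter plot. The sketch below groups by rating and reports the number of businesses and total reviews per star level; the rows are made up, and applying the same `groupby` to `bizLV` is left as an assumption about how the count above could be checked.

```python
import pandas as pd

# Toy stand-in for bizLV with the two columns used in the plot.
toy_lv = pd.DataFrame({
    "stars": [1.0, 4.0, 4.0, 5.0, 3.5],
    "review_count": [3, 120, 80, 7, 40],
})

by_star = (
    toy_lv.groupby("stars")
          .agg(n_businesses=("review_count", "size"),   # rows per rating
               total_reviews=("review_count", "sum"))   # reviews per rating
          .sort_index()
)
print(by_star)
```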

### 3.4.2 Review vs Average Rating 

In [154]:
#Average rating of businesses 

fig = px.scatter(city_review_count, 
                 x = "city", 
                 y = "stars", 
                 color = "city", 
                 size = "review_count")

fig.update_layout(title = "Average Rating in Top 10 Cities") 
fig.update_xaxes(title_text = "City Name" )
fig.update_yaxes(title_text = 'Average Star')              
fig.show()

- The average rating across the top 10 cities is around 3.7 stars.
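
The figure above can be checked directly from the `city_review_count` table. The sketch below takes the unweighted mean of the ten city averages (values copied from that table) and looks up the highest and lowest rated cities; a mean weighted by number of businesses or reviews could differ slightly.

```python
import pandas as pd

# Average stars per city, copied from the city_review_count output.
city_stars = pd.Series({
    "Las Vegas": 3.686036,
    "Phoenix": 3.612886,
    "Toronto": 3.427954,
    "Scottsdale": 3.946877,
    "Charlotte": 3.514508,
    "Henderson": 3.771325,
    "Pittsburgh": 3.618144,
    "Tempe": 3.706746,
    "Mesa": 3.617582,
    "Chandler": 3.74249,
})

mean_of_city_means = city_stars.mean()   # unweighted mean across the ten cities
best_city = city_stars.idxmax()          # city with the highest average rating
worst_city = city_stars.idxmin()         # city with the lowest average rating

print(round(mean_of_city_means, 2), best_city, worst_city)
```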

# 4. Conclusion

- In Las Vegas, the businesses with the largest number of reviews tend to have high ratings (around 4 stars).
- The average rating among the top 10 cities is around 3.7 stars.
- Very few businesses in Las Vegas have a 1-star rating.
- Scottsdale has the highest average rating among the 10 cities, around 4 stars, while Toronto has the lowest.


