# Notebook 1 - Project Introduction and Data Loading

*Allistair Cota*

*BrainStation Data Science Bootcamp*

*September 2021 Cohort*

This is Notebook 1 of the restaurant rating predictor capstone project, titled **Rate My Restaurant**. This notebook discusses the project goal, data acquisition and exploration of supplementary data sources. The data will then be loaded in from the original files, filtered and then exported for use in the next notebook.

Below is a reminder of the notebook sequence. Please read the notebooks in order.

- **Current Notebook:** NB1-Project_Intro_and_Data_Loading

- **Next Notebook:** NB2-Data_Cleaning_EDA_Feature_Engineering

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Project-Goal" data-toc-modified-id="Project-Goal-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Project Goal</a></span></li><li><span><a href="#Data-Acquisition" data-toc-modified-id="Data-Acquisition-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Acquisition</a></span></li><li><span><a href="#Supplementary-Data-Sources" data-toc-modified-id="Supplementary-Data-Sources-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Supplementary Data Sources</a></span></li><li><span><a href="#Methodology" data-toc-modified-id="Methodology-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Methodology</a></span></li><li><span><a href="#Import-the-Required-Libraries" data-toc-modified-id="Import-the-Required-Libraries-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Import the Required Libraries</a></span></li><li><span><a href="#Load-the-Business-Dataset" data-toc-modified-id="Load-the-Business-Dataset-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Load the Business Dataset</a></span></li><li><span><a href="#Load-the-Reviews-Dataset" data-toc-modified-id="Load-the-Reviews-Dataset-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Load the Reviews Dataset</a></span></li></ul></div>

## Introduction

The restaurant industry in the United States generated \\$833.1 billion in revenue in 2018 and was projected to reach $863 billion in 2019 [1]. Restaurant reviews and ratings on popular review websites are important factors in attracting new customers. Yelp is one of the most popular review websites, hosting over 135 million restaurant and business reviews and attracting over 90 million users per month. A study found that a one-star increase in a Yelp rating leads to a 5% - 9% increase in revenue for a restaurant [2]. This underlines the financial importance of Yelp ratings to restaurants.

## Project Goal

The goal of this project is to help aspiring restaurant owners understand the ratings their planned restaurant might get on a review website, given the amenities they plan to include and the feedback received from soft openings. Using a combination of numerical, categorical, and text data, we apply Natural Language Processing (NLP) and machine learning techniques to build a regression model that predicts the rating a customer assigns to a restaurant. This will give restaurateurs advance awareness of customers' sentiment towards their restaurants, allowing them to make changes prior to opening and avoid receiving low ratings once in operation.

## Data Acquisition

The data for this project was acquired from the Yelp Open Dataset [3], which is published by Yelp on its [website](https://www.yelp.com/dataset) for educational and non-commercial purposes. It consists of a subset of Yelp's businesses, reviews, and user data. The entire dataset is divided among separate JSON files, namely:
- Business dataset, which contains information about each business including ID, name, address, category, attributes and operating hours
- Reviews dataset, which contains reviews by Yelp users on different businesses
- User dataset, which contains information about the Yelp activity of each user
- Tips dataset, which contains tips on different businesses
- Check-ins dataset, which contains the timestamps for each time a Yelp user checks in at a business

For this project, we will use only the business and reviews datasets.
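
Because the files are loaded with `lines=True` later in this notebook, each one is expected to hold one JSON object per line. Under that assumption, the fields of a file can be inspected without loading it in full. The following is a minimal sketch: `peek_json_lines` is a hypothetical helper, and the sample records are made up for illustration rather than taken from the Yelp data.

```python
import json
import os
import tempfile


def peek_json_lines(path, n=1):
    """Return the first n records of a JSON Lines file as dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Made-up sample records standing in for a real dataset file
sample = [
    {"business_id": "b1", "name": "Cafe A", "categories": "Restaurants, Cafes"},
    {"business_id": "b2", "name": "Gym B", "categories": "Gyms, Active Life"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")

    first_records = peek_json_lines(sample_path, n=1)

print(sorted(first_records[0].keys()))
# ['business_id', 'categories', 'name']
```

Pointing the helper at one of the actual dataset files would show its field names after reading only the first line.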

## Supplementary Data Sources

Additional data sources were researched in the early stages of the project to obtain information about restaurant menu items, since this information is not included in the Yelp dataset. The only source found was the [Documenu API](https://documenu.com/), which claims to provide access to over 600,000 US restaurant menus with over 40 million individual menu items. However, a trial of the service showed that it limits menu query results to a maximum of 25 restaurants per API call, and that some restaurants are returned with missing menu information. In addition, the service allows only 25 API calls per month before paid tiers are required. Given these limitations and the unknown value that menu information would add, this data source was abandoned early in the project. With more time and budget, and with foresight about the size of the final filtered dataset, it would have been interesting to supplement the main data source with the Documenu data.
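
To put these limits in perspective, here is a back-of-the-envelope calculation. It assumes the best case, in which every call returns 25 distinct restaurants with complete menus, and it uses the Massachusetts restaurant count obtained later in this notebook.

```python
import math

# Limits reported above for the service before paid tiers apply
restaurants_per_call = 25
calls_per_month = 25

# Massachusetts restaurant count found later in this notebook
ma_restaurants = 10551

# Best case: every call returns 25 new restaurants with complete menus
restaurants_per_month = restaurants_per_call * calls_per_month
months_needed = math.ceil(ma_restaurants / restaurants_per_month)

print(restaurants_per_month)  # 625
print(months_needed)          # 17
```

Even under these optimistic assumptions, covering the filtered dataset would take well over a year of unpaid usage.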

## Methodology

In this notebook, we will load the required JSON files from the Yelp Open Dataset, keep only the records that pertain to restaurants in Massachusetts, and export the data to CSV files for the next step in this project, which is data cleaning and EDA.

## Import the Required Libraries

We will require the Pandas and JSON libraries for this notebook.

In [2]:
import pandas as pd
import json

## Load the Business Dataset

We will begin by loading the business dataset JSON file into a Pandas data frame.

In [3]:
%%time
business_df = pd.read_json('../data/business.json', lines=True)
business_df.head()

CPU times: user 3.09 s, sys: 780 ms, total: 3.86 s
Wall time: 3.94 s


| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6iYb2HFDywm3zjuRg0shjw | Oskar Blues Taproom | 921 Pearl St | Boulder | CO | 80302 | 40.017544 | -105.283348 | 4.0 | 86 | 1 | {'RestaurantsTableService': 'True', 'WiFi': 'u... | Gastropubs, Food, Beer Gardens, Restaurants, B... | {'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'... |
| 1 | tCbdrRPZA0oiIYSmHG3J0w | Flying Elephants at PDX | 7000 NE Airport Way | Portland | OR | 97218 | 45.588906 | -122.593331 | 4.0 | 126 | 1 | {'RestaurantsTakeOut': 'True', 'RestaurantsAtt... | Salad, Soup, Sandwiches, Delis, Restaurants, C... | {'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ... |
| 2 | bvN78flM8NLprQ1a1y5dRg | The Reclaimory | 4720 Hawthorne Ave | Portland | OR | 97214 | 45.511907 | -122.613693 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Restau... | Antiques, Fashion, Used, Vintage & Consignment... | {'Thursday': '11:0-18:0', 'Friday': '11:0-18:0... |
| 3 | oaepsyvc0J17qwi8cfrOWg | Great Clips | 2566 Enterprise Rd | Orange City | FL | 32763 | 28.914482 | -81.295979 | 3.0 | 8 | 1 | {'RestaurantsPriceRange2': '1', 'BusinessAccep... | Beauty & Spas, Hair Salons | |
| 4 | PE9uqAjdw0E4-8mjGl3wVA | Crossfit Terminus | 1046 Memorial Dr SE | Atlanta | GA | 30316 | 33.747027 | -84.353424 | 4.0 | 14 | 1 | {'GoodForKids': 'False', 'BusinessParking': '{... | Gyms, Active Life, Interval Training Gyms, Fit... | {'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0'... |


We can see that there is a `categories` column containing a string description of each business's categories. We will keep the rows whose description contains the word *'restaurant'*, the category of interest for this project.

In [4]:
# Keep businesses whose categories mention 'restaurant' (case-insensitive);
# na=False excludes rows with missing categories. Save to a new data frame.
restaurant_df = business_df[
    business_df['categories'].str.contains('restaurant', case=False, na=False)
].reset_index(drop=True)
restaurant_df.head()

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6iYb2HFDywm3zjuRg0shjw | Oskar Blues Taproom | 921 Pearl St | Boulder | CO | 80302 | 40.017544 | -105.283348 | 4.0 | 86 | 1 | {'RestaurantsTableService': 'True', 'WiFi': 'u... | Gastropubs, Food, Beer Gardens, Restaurants, B... | {'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'... |
| 1 | tCbdrRPZA0oiIYSmHG3J0w | Flying Elephants at PDX | 7000 NE Airport Way | Portland | OR | 97218 | 45.588906 | -122.593331 | 4.0 | 126 | 1 | {'RestaurantsTakeOut': 'True', 'RestaurantsAtt... | Salad, Soup, Sandwiches, Delis, Restaurants, C... | {'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ... |
| 2 | D4JtQNTI4X3KcbzacDJsMw | Bob Likes Thai Food | 3755 Main St | Vancouver | BC | V5V | 49.251342 | -123.101333 | 3.5 | 169 | 1 | {'GoodForKids': 'True', 'Alcohol': 'u'none'', ... | Restaurants, Thai | {'Monday': '17:0-21:0', 'Tuesday': '17:0-21:0'... |
| 3 | jFYIsSb7r1QeESVUnXPHBw | Boxwood Biscuit | 740 S High St | Columbus | OH | 43206 | 39.947007 | -82.997471 | 4.5 | 11 | 1 | | Breakfast & Brunch, Restaurants | {'Saturday': '8:0-14:0', 'Sunday': '8:0-14:0'} |
| 4 | HPA_qyMEddpAEtFof02ixg | Mr G's Pizza & Subs | 474 Lowell St | Peabody | MA | 01960 | 42.541155 | -70.973438 | 4.0 | 39 | 1 | {'RestaurantsGoodForGroups': 'True', 'HasTV': ... | Food, Pizza, Restaurants | {'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'... |
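
To illustrate how the category filter behaves with mixed case and missing values, here is a small sketch on made-up data. It uses `na=False`, which has the same effect as comparing the result with `True`: rows with missing categories are excluded rather than raising an error.

```python
import pandas as pd

# Made-up rows; the None mimics a business with no categories listed
toy_df = pd.DataFrame({
    "name": ["Cafe A", "Gym B", "Diner C", "Shop D"],
    "categories": ["Restaurants, Cafes", "Gyms, Active Life",
                   "Diners, restaurants", None],
})

# case=False matches 'Restaurants' and 'restaurants' alike;
# na=False treats missing categories as non-matches
mask = toy_df["categories"].str.contains("restaurant", case=False, na=False)
toy_restaurants = toy_df[mask].reset_index(drop=True)

print(toy_restaurants["name"].tolist())  # ['Cafe A', 'Diner C']
```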


In [5]:
restaurant_df.shape

(50793, 14)

The resulting shape of our restaurant data frame is 50,793 rows and 14 columns.

We now look at the value counts for the different states in this data frame.

In [6]:
restaurant_df['state'].value_counts()

MA     10551
FL      7711
BC      7508
OR      7402
GA      6142
TX      5452
OH      4380
CO       866
WA       774
KS         1
MN         1
VA         1
WY         1
KY         1
NH         1
ABE        1
Name: state, dtype: int64

We can see that the state of Massachusetts (abbreviated as MA) has the most restaurants in the dataset. We will now filter for only Massachusetts restaurants.

In [7]:
# Extract restaurants in Massachusetts
restaurant_df = restaurant_df[restaurant_df['state']=='MA'].reset_index(drop=True)

# Delete the business data frame from memory
del business_df

# View the info summary
restaurant_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10551 entries, 0 to 10550
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   10551 non-null  object 
 1   name          10551 non-null  object 
 2   address       10551 non-null  object 
 3   city          10551 non-null  object 
 4   state         10551 non-null  object 
 5   postal_code   10551 non-null  object 
 6   latitude      10551 non-null  float64
 7   longitude     10551 non-null  float64
 8   stars         10551 non-null  float64
 9   review_count  10551 non-null  int64  
 10  is_open       10551 non-null  int64  
 11  attributes    10481 non-null  object 
 12  categories    10551 non-null  object 
 13  hours         8701 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 1.1+ MB
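
The non-null counts above imply that `attributes` is missing for 70 restaurants (10,551 - 10,481) and `hours` for 1,850 restaurants (10,551 - 8,701). As a minimal sketch on made-up data, such counts can also be obtained directly per column:

```python
import pandas as pd

# Made-up rows; None marks a missing dictionary
toy_df = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b4"],
    "attributes": [{"WiFi": "free"}, None, {"HasTV": "True"}, {"WiFi": "no"}],
    "hours": [{"Monday": "11:0-21:0"}, None, None, {"Friday": "9:0-17:0"}],
})

# Count the missing values in each column
missing_counts = {col: int(n) for col, n in toy_df.isna().sum().items()}

print(missing_counts)
# {'business_id': 0, 'attributes': 1, 'hours': 2}
```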


## Load the Reviews Dataset

We now read the reviews dataset JSON file into a Pandas data frame. Since the JSON file is large (6.94 GB), we will read the file contents in chunks to keep memory usage and loading time manageable, as proposed by Eve Law in this blog post:
https://towardsdatascience.com/load-yelp-reviews-or-other-huge-json-files-with-ease-ad804c2f1537

We will only add the records for businesses that are present in the restaurant data frame.

In [8]:
%%time
data_lines = []

with open("../data/review.json", "r") as f:
    reader = pd.read_json(f, orient="records", lines=True, 
                          chunksize=1000)
        
    for chunk in reader:
        # Only include records for businesses in restaurant data frame
        chunk = chunk[chunk['business_id'].isin(restaurant_df['business_id'])]
        data_lines.append(chunk)
    
review_df = pd.concat(data_lines, ignore_index=True)

CPU times: user 2min 37s, sys: 8.46 s, total: 2min 45s
Wall time: 2min 47s
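
The chunked filtering pattern used above can be tried on a small synthetic file before running it on the full dataset. The sketch below is only an illustration: the records, the `wanted_ids` set, and the chunk size of 2 are all made up.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up reviews; only b1 and b3 are businesses of interest
toy_reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": 5},
    {"review_id": "r2", "business_id": "b2", "stars": 3},
    {"review_id": "r3", "business_id": "b3", "stars": 4},
    {"review_id": "r4", "business_id": "b2", "stars": 1},
    {"review_id": "r5", "business_id": "b1", "stars": 2},
]
wanted_ids = {"b1", "b3"}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_review.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        for review in toy_reviews:
            f.write(json.dumps(review) + "\n")

    kept_chunks = []
    # chunksize=2 yields data frames of at most two rows each
    for chunk in pd.read_json(toy_path, lines=True, chunksize=2):
        # Keep only the rows for the businesses of interest
        kept_chunks.append(chunk[chunk["business_id"].isin(wanted_ids)])

toy_review_df = pd.concat(kept_chunks, ignore_index=True)
print(toy_review_df["review_id"].tolist())  # ['r1', 'r3', 'r5']
```

Because each chunk is filtered before it is stored, only the matching rows accumulate in memory, not the whole file.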


In [9]:
review_df.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | lWC-xP3rd6obsecCYsGZRg | ak0TdVmGKo4pwqdJSTLwWw | buF9druCkbuXLX526sGELQ | 4 | 3 | 1 | 1 | Apparently Prides Osteria had a rough summer a... | 2014-10-11 03:34:02 |
| 1 | J4a2TuhDasjn2k3wWtHZnQ | RNm_RWkcd02Li2mKPRe7Eg | xGXzsc-hzam-VArK6eTvtw | 1 | 2 | 0 | 0 | This place used to be a cool, chill place. Now... | 2018-01-21 04:41:03 |
| 2 | 28gGfkLs3igtjVy61lh77Q | Q8c91v7luItVB0cMFF_mRA | EXOsmAB1s71WePlQk0WZrA | 2 | 0 | 0 | 0 | The setting is perfectly adequate, and the foo... | 2006-04-16 02:58:44 |
| 3 | KKVFopqzcVfcubIBxmIjVA | 99RsBrARhhx60UnAC4yDoA | EEHhKSxUvJkoPSzeGKkpVg | 5 | 0 | 0 | 0 | I work in the Pru and this is the most afforda... | 2014-05-07 18:10:21 |
| 4 | btNWW2kdJYfwpTDyzJO3Iw | DECuRZwkUw8ELQZfNGef2Q | zmZ3HkVCeZPBefJJxzdJ7A | 4 | 0 | 0 | 0 | Nothing special but good enough. I like anoth... | 2012-12-04 04:29:47 |


In [10]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1365575 entries, 0 to 1365574
Data columns (total 9 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   review_id    1365575 non-null  object        
 1   user_id      1365575 non-null  object        
 2   business_id  1365575 non-null  object        
 3   stars        1365575 non-null  int64         
 4   useful       1365575 non-null  int64         
 5   funny        1365575 non-null  int64         
 6   cool         1365575 non-null  int64         
 7   text         1365575 non-null  object        
 8   date         1365575 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(4)
memory usage: 93.8+ MB


In [11]:
review_df.shape

(1365575, 9)

We can now export our data frames to CSV files for use in upcoming notebooks for this project.

In [40]:
%%time
restaurant_df.to_csv('../data/ma_restaurant.csv', index=False)
review_df.to_csv('../data/ma_review.csv', index=False)


CPU times: user 32.6 s, sys: 1.62 s, total: 34.2 s
Wall time: 35.1 s
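
One point to keep in mind when reading these CSV files back: CSV stores everything as text, so column types are inferred again on load. The sketch below uses made-up rows and an in-memory buffer to show two likely effects, assuming default `pd.read_csv` settings: Massachusetts postal codes such as `01960` lose their leading zero, and dictionary columns such as `attributes` come back as strings. Whether the next notebook needs either fix depends on how those columns are used there.

```python
import ast
import io

import pandas as pd

# Made-up rows with Massachusetts-style postal codes
toy_df = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "postal_code": ["01960", "02116"],
    "attributes": [{"HasTV": "True"}, {"WiFi": "free"}],
})

# Write to an in-memory string instead of a file on disk
csv_text = toy_df.to_csv(index=False)

# Default read: postal codes are parsed as integers, dropping the leading zero
naive_df = pd.read_csv(io.StringIO(csv_text))

# Reading postal_code as text preserves the original values
safe_df = pd.read_csv(io.StringIO(csv_text), dtype={"postal_code": str})

# Dictionaries come back as strings; literal_eval turns them into dicts again
safe_df["attributes"] = safe_df["attributes"].apply(ast.literal_eval)

print(naive_df["postal_code"].tolist())  # [1960, 2116]
print(safe_df["postal_code"].tolist())   # ['01960', '02116']
print(safe_df["attributes"][0])          # {'HasTV': 'True'}
```

Rows with missing `attributes` would need to be skipped before calling `ast.literal_eval`, since a missing value is not a string.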


Please proceed to the next notebook in the sequence - **NB2-Data_Cleaning_EDA_Feature_Engineering**.

# References

[1] R. Ruggless, "U.S. restaurant sales to reach record $863B in 2019, NRA says," Nation's Restaurant News, 5 April 2019. [Online]. Available: [https://www.nrn.com/sales-trends/us-restaurant-sales-reach-record-863b-2019-nra-says](https://www.nrn.com/sales-trends/us-restaurant-sales-reach-record-863b-2019-nra-says).

[2] M. Luca, "Reviews, Reputation, and Revenue: The Case of Yelp.com," Harvard Business School NOM Unit Working Paper No. 12-016, vol. 12, no. 06, 2016.

[3] Yelp Inc., "Yelp Open Dataset," n.d. [Online]. Available: [https://www.yelp.com/dataset](https://www.yelp.com/dataset). [Accessed 21 October 2021].

