## Introduction

The restaurant industry in Canada generated close to **$79 billion** in 2019, which accounted for **1.5%** of the country's GDP. That year, the industry comprised more than **97,000 businesses** employing **6.4% (approximately 1.2 million people)** of the country's total workforce. Given the industry's size and employment numbers, I decided to create a model that could help business owners and investors determine the essential features that predict a restaurant’s success or failure.

Based on this analysis, investors can decide whether to invest in a particular business given the likelihood that it will close in the future; existing businesses can intervene and improve on the features that drive that risk, and new businesses can assess their potential before entering the market.

The dataset for this project comes from the **Yelp Open Dataset**, which is publicly available (as of Mar 16, 2021) for educational and academic purposes. The dataset is 11 GB in size and consists of 5 JSON files covering businesses, reviews, users, tips and check-ins. For this project, I have focused on the business and review data. There are 160,585 businesses and 8,635,403 reviews in the complete dataset, but I have used only the data for restaurants in Vancouver, British Columbia.

In part 1, we start by importing the dataset and extracting the required data from the .json files. In part 2, the data will be cleaned, which primarily consists of handling duplicates and null values by removal or imputation; feature engineering will then be performed to create custom columns that can help predict the success or failure of a restaurant. This is followed by an Exploratory Data Analysis (EDA) to understand the final cleaned dataset. In part 3, models will be built and evaluated on various performance metrics.

## Importing

**Importing libraries:** The first step is to import all modules and libraries which will be used in this notebook.

In [1]:
# Import the relevant libraries and modules which will be utilized in this notebook
import pandas as pd 
import json

**Importing data:** The data will be imported from the input folder, which holds the .json files extracted from the downloaded __[Yelp Dataset](https://www.yelp.com/dataset/download)__.

**1. Restaurants in Vancouver, British Columbia:** As a first step, we will load the business dataset and then keep only the businesses that match on category ('Restaurant'), city ('Vancouver') and state ('BC').

**2. Reviews:** We will load the reviews dataset and keep only the reviews whose `business_id` belongs to a restaurant in Vancouver, BC.

In [2]:
# Data about all 160,585 businesses
business_df = pd.read_json(r'input\yelp_academic_dataset_business.json', lines=True, encoding='utf-8')

# check the count
print(f'Number of businesses: {business_df.shape[0]: ,}')
print(f'Number of features: {business_df.shape[1]: ,}')

# view the data
business_df.head(3)

Number of businesses:  160,585
Number of features:  14


| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6iYb2HFDywm3zjuRg0shjw | Oskar Blues Taproom | 921 Pearl St | Boulder | CO | 80302 | 40.017544 | -105.283348 | 4.0 | 86 | 1 | {'RestaurantsTableService': 'True', 'WiFi': 'u... | Gastropubs, Food, Beer Gardens, Restaurants, B... | {'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'... |
| 1 | tCbdrRPZA0oiIYSmHG3J0w | Flying Elephants at PDX | 7000 NE Airport Way | Portland | OR | 97218 | 45.588906 | -122.593331 | 4.0 | 126 | 1 | {'RestaurantsTakeOut': 'True', 'RestaurantsAtt... | Salad, Soup, Sandwiches, Delis, Restaurants, C... | {'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ... |
| 2 | bvN78flM8NLprQ1a1y5dRg | The Reclaimory | 4720 Hawthorne Ave | Portland | OR | 97214 | 45.511907 | -122.613693 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Restau... | Antiques, Fashion, Used, Vintage & Consignment... | {'Thursday': '11:0-18:0', 'Friday': '11:0-18:0... |


My analysis focuses on restaurants in the city of Vancouver. Therefore, in the next step, the data will be filtered on three conditions: the `categories` column contains 'restaurant', the `city` column contains 'vancouver' (both matched case-insensitively), and the `state` column equals 'BC'.
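To illustrate how these three conditions combine, here is a minimal sketch on a toy DataFrame. The rows and the `toy_businesses` name are hypothetical and not taken from the Yelp data. It also shows two details worth keeping in mind: a substring match on the city also keeps places such as "North Vancouver", and passing `na=False` makes rows with a missing `categories` value count as non-matches.

```python
import pandas as pd

# Hypothetical rows that mimic the relevant columns of the business data
toy_businesses = pd.DataFrame({
    "name": ["Ramen Bar", "Thrift Shop", "Diner", "Cafe", "Bistro"],
    "city": ["Vancouver", "Vancouver", "North Vancouver", "Vancouver", "Portland"],
    "state": ["BC", "BC", "BC", "BC", "OR"],
    "categories": [
        "Ramen, Restaurants",
        "Antiques, Fashion",
        "Diners, Restaurants",
        None,  # missing categories, as can happen in real data
        "French, Restaurants",
    ],
})

# One boolean mask per condition; na=False turns missing values into False
is_vancouver = toy_businesses["city"].str.lower().str.contains("vancouver", na=False)
is_bc = toy_businesses["state"] == "BC"
is_restaurant = toy_businesses["categories"].str.lower().str.contains("restaurant", na=False)

# Keep only rows where all three conditions hold
toy_restaurants = toy_businesses[is_vancouver & is_bc & is_restaurant].reset_index(drop=True)
print(toy_restaurants["name"].tolist())
```

If only the city proper is wanted, an exact comparison on the lower-cased city name would be the stricter alternative; the substring match is the looser choice.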

In [3]:
# extracting the Vancouver restaurants data (na=False treats missing values as non-matches)
vancouver_restaurants = business_df[
    business_df['city'].str.lower().str.contains('vancouver', na=False)
    & (business_df['state'] == 'BC')
    & business_df['categories'].str.lower().str.contains('restaurant', na=False)
].reset_index(drop=True)

# checking number of restaurants
print(f'Number of Restaurants: {vancouver_restaurants.shape[0]: ,}')
print(f'Number of features: {vancouver_restaurants.shape[1]: ,}')

Number of Restaurants:  4,749
Number of features:  14


This gives us data for 4,749 restaurants to analyze. Next, we will import the reviews for these restaurants from the review file downloaded from Yelp.

In [4]:
# collect the business_ids of Vancouver restaurants in a set for fast membership checks
vancouver_ids = set(vancouver_restaurants['business_id'])
data = []

# read the review file line by line and keep only reviews of Vancouver restaurants;
# the with-block closes the file automatically
with open(r'input\yelp_academic_dataset_review.json', encoding='utf8') as data_file:
    for i, line in enumerate(data_file):
        review = json.loads(line)
        if review['business_id'] in vancouver_ids:
            data.append(review)
            print(i, end='\r')  # progress: index of the last matching line

# add data to a dataframe
review_df = pd.DataFrame(data)

# check size of data
print(f'Number of Reviews: {review_df.shape[0]: ,}')
print(f'Number of features: {review_df.shape[1]: ,}')

Number of Reviews:  322,251
Number of features:  9


We will now export these two filtered datasets as CSV files so that they can be used in the next notebook.

In [5]:
# save the filtered content as csv file in data folder
vancouver_restaurants.to_csv(r'data\vancouver_restaurants.csv', encoding='utf-8', index=False)
review_df.to_csv(r'data\vancouver_restaurants_reviews.csv', encoding='utf-8', index=False)
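One thing to keep in mind when these files are read back: columns that hold dictionaries, such as `attributes` and `hours` in the business data, are written to CSV as their string representation. The sketch below, on a hypothetical two-row frame, shows one possible way to recover the dictionaries with `ast.literal_eval`; the `parse_dict` helper is an assumption for illustration, not part of the original notebooks.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with a dictionary-valued column and one missing value
toy = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "attributes": [{"WiFi": "free", "RestaurantsTakeOut": "True"}, None],
})

# Round trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the dictionary is a plain string
raw_type = type(reloaded.loc[0, "attributes"])
print(raw_type)


def parse_dict(value):
    """Turn a stringified dict back into a dict; leave missing values as None."""
    return ast.literal_eval(value) if isinstance(value, str) else None


reloaded["attributes"] = reloaded["attributes"].apply(parse_dict)
print(reloaded.loc[0, "attributes"]["WiFi"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.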

We will perform data cleaning, feature engineering and Exploratory Data Analysis in the _**'2_Data_Cleaning_EDA_and_Features_Engineering'**_ notebook.