## Step 1: Data Preprocessing Notebook
Preprocessing of the Yelp business dataset

In [1]:
# Import the modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import json_normalize

In [21]:
business_df = pd.read_json("../data/raw_data/yelp_academic_dataset_business.json",lines=True)

# Initial view of the data: its attributes, data types, and size

In [22]:
business_df.head(5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


In [23]:
# Size of the dataset: about 150K rows and 14 columns; every business_id is unique
business_df.shape, business_df.business_id.nunique()

((150346, 14), 150346)

In [24]:
# Brief information on the columns
business_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB


In [25]:
# Info for numerical columns of the dataset
business_df.describe()

Unnamed: 0,latitude,longitude,stars,review_count,is_open
count,150346.0,150346.0,150346.0,150346.0,150346.0
mean,36.67115,-89.357339,3.596724,44.866561,0.79615
std,5.872759,14.918502,0.974421,121.120136,0.40286
min,27.555127,-120.095137,1.0,5.0,0.0
25%,32.187293,-90.35781,3.0,8.0,1.0
50%,38.777413,-86.121179,3.5,15.0,1.0
75%,39.954036,-75.421542,4.5,37.0,1.0
max,53.679197,-73.200457,5.0,7568.0,1.0


## Handling Null values
#### Handling missing values is one of the most important parts of data preprocessing, because clean and transformed data is needed for analysis and data modelling.
#### The step below shows the percentage of missing values in each attribute of the dataset.

In [26]:
# Percentage of null values per column (the mean of a boolean mask is the share of True values)
null_perc = business_df.isnull().mean() * 100
null_perc

business_id      0.000000
name             0.000000
address          0.000000
city             0.000000
state            0.000000
postal_code      0.000000
latitude         0.000000
longitude        0.000000
stars            0.000000
review_count     0.000000
is_open          0.000000
attributes       9.141580
categories       0.068509
hours           15.446370
dtype: float64

The output above shows that three columns contain null values: `attributes`, `categories`, and `hours`. The share of nulls in `categories` is close to 0% (about 0.07%), whereas for `attributes` and `hours` it is roughly 9% and 15%. Because these values are business-specific information, any fill method would either skew the data toward a particular class or introduce effectively random values. Since the percentage of nulls is below 25% in every column, we drop the rows that contain null values.
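The per-column percentages do not directly say how many rows a row-wise drop removes, because one row can be missing values in several columns. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are illustrative assumptions, not part of the Yelp data) that compares the per-column share of nulls with the share of rows containing at least one null.

```python
import pandas as pd

# Hypothetical toy data standing in for business_df
toy_df = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "attributes": [{"BikeParking": "True"}, None, {"WiFi": "free"}, None],
    "hours": [None, None, {"Monday": "8:0-22:0"}, {"Monday": "7:0-20:0"}],
})

# Share of null values per column, in percent
null_perc = toy_df.isnull().mean() * 100

# Share of rows with at least one null, i.e. what dropna() would remove
rows_with_null_perc = toy_df.isnull().any(axis=1).mean() * 100

print(null_perc)
print(rows_with_null_perc)
```

Running the same `isnull().any(axis=1)` check on the real frame before dropping would show the actual cost of the drop in rows rather than in cells.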

In [27]:
# Replace the string 'None' with NaN, then drop every row that contains a null using dropna
business_df.replace('None', np.nan, inplace=True)
business_df.dropna(inplace=True)
business_df.shape

(117618, 14)
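As a small illustration of the replace-then-drop step, the sketch below applies the same two calls to a made-up frame (`toy_df` and its values are assumptions for demonstration only). Without `regex=True`, `replace` only matches cells whose entire value equals the string `'None'`, so a row is removed if any of its cells is either a real null or exactly that string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the 'None' strings mimic placeholder text in the raw file
toy_df = pd.DataFrame({
    "name": ["Cafe A", "Shop B", "None", "Bar D"],
    "categories": ["Food, Coffee & Tea", None, "Shopping", "None"],
})

# Turn the placeholder string into a real missing value, then drop incomplete rows
cleaned_df = toy_df.replace("None", np.nan).dropna().reset_index(drop=True)

print(cleaned_df)
```

This version returns a new frame instead of using `inplace=True`, which keeps the original available for comparison; that is a stylistic choice, not a requirement.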

After dropping these rows, 117,618 of the original 150,346 rows remain. The checks below confirm that there are **no null values** left in the data, so we can move on to transforming it.

In [28]:
business_df['attributes'].isnull().sum()

0

In [29]:
null_perc = business_df.isnull().mean() * 100
null_perc

business_id     0.0
name            0.0
address         0.0
city            0.0
state           0.0
postal_code     0.0
latitude        0.0
longitude       0.0
stars           0.0
review_count    0.0
is_open         0.0
attributes      0.0
categories      0.0
hours           0.0
dtype: float64

The `categories` column holds string values that assign each business to one or more categories. Because all categories of a business were stored as a single comma-separated string, it was difficult to break them apart or to examine individual categories. To handle this, we call `apply` on the column to split each string on `", "` and sort the result, converting every record into a list of category names that is easier to traverse and work with.

In [30]:
business_df['categories'] = business_df['categories'].apply(lambda x: sorted(x.split(", ")) if x is not None else list())

In [31]:
business_df['categories'].head()

1    [Local Services, Mailbox Centers, Notaries, Pr...
2    [Department Stores, Electronics, Fashion, Furn...
3    [Bakeries, Bubble Tea, Coffee & Tea, Food, Res...
4                          [Breweries, Brewpubs, Food]
5    [Burgers, Fast Food, Food, Ice Cream & Frozen ...
Name: categories, dtype: object

In [32]:
business_df['categories'][1]

['Local Services',
 'Mailbox Centers',
 'Notaries',
 'Printing Services',
 'Shipping Centers']
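One benefit of the list representation is that per-category counts become straightforward. The sketch below is illustrative only: `toy_df` is made-up data, and `isinstance(x, str)` is used as a slightly more defensive null check than the `x is not None` test in the notebook, since it also covers `NaN`.

```python
import pandas as pd

# Hypothetical toy data with comma-separated category strings
toy_df = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "categories": ["Food, Breweries, Brewpubs", "Food, Coffee & Tea", None],
})

# Split each string into a sorted list; non-string values become an empty list
toy_df["categories"] = toy_df["categories"].apply(
    lambda x: sorted(x.split(", ")) if isinstance(x, str) else []
)

# explode() yields one row per list element; value_counts() ignores the NaN
# produced by empty lists
category_counts = toy_df["categories"].explode().value_counts()

print(toy_df["categories"][0])
print(category_counts)
```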

# Conclusion
* Data preprocessing is the process of handling missing data and making the data clean and transformed, since it acts as the base for decision making. In this notebook, the main focus was to handle nulls and noise in the data and transform it into a standard format.
* The data is stored in a CSV file that can be used further for exploratory data analysis and feature engineering.

In [33]:
business_df.to_csv("../data/processed_data/preprocessed_business.csv", index=False)
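One caveat when reloading this file: CSV has no list or dict type, so columns such as `categories` are written as their string representation and come back as plain strings. The sketch below shows one possible way to restore the lists with `ast.literal_eval`; it uses an in-memory buffer and made-up data (`toy_df`) rather than the real output path, and the same idea should apply to the dict-valued `attributes` and `hours` columns.

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame with a list-valued column, as produced by the split step
toy_df = pd.DataFrame({
    "business_id": ["a"],
    "categories": [["Breweries", "Food"]],
})

# Write to an in-memory buffer instead of a file to keep the example self-contained
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
type_after_read = type(reloaded["categories"][0])  # a string, not a list

# Parse the string representation back into a Python list
reloaded["categories"] = reloaded["categories"].apply(ast.literal_eval)

print(type_after_read, reloaded["categories"][0])
```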