# Data Pre-Processing: Removing Unrelated Business Categories

---

## Problem Description:
* The dataset ```yelp_academic_dataset_business.json``` has many business categories unrelated to food reviews.
* Since our project focuses on food review text, we need to identify and remove unrelated business categories from our dataset during preprocessing.

In [53]:
import pandas as pd

# Do not truncate column content and show all columns
pd.set_option('display.max_colwidth', None)

# Create a dataframe containing all business categories
dataframe = pd.read_json("yelp_academic_dataset_business.json", lines=True)

# Print the category strings of the first 10 businesses (spaces stripped for compact display)
print(dataframe['categories'].head(n=10).to_string(index=False).replace(' ',''))

ActiveLife,Gun/RifleRanges,Guns&Ammo,Shopping
Health&Medical,Fitness&Instruction,Yoga,ActiveLife,Pilates
Pets,PetServices,PetGroomers
HardwareStores,HomeServices,BuildingSupplies,Home&Garden,Shopping
HomeServices,Plumbing,Electricians,Handyman,Contractors
AutoRepair,Automotive,OilChangeStations,TransmissionRepair
DryCleaning&Laundry,LocalServices,LaundryServices
AutoRepair,OilChangeStations,Automotive,Tires
EthnicFood,FoodTrucks,SpecialtyFood,ImportedFood,Argentine,Food,Restaurants,Empanadas
MartialArts,Gyms,Fitness&Instruction,ActiveLife


## Business Categories Documentation:

* The Yelp developer documentation provides a business [category list](https://www.yelp.com/developers/documentation/v3/category_list).


* The category ```Restaurants (restaurants)``` subdivides into cuisine types, e.g.:
    * Soul Food
    * Tapas
    * ...


* The category ```Food (food)``` subdivides into food vendors, e.g.:
    * Gelato
    * Food Trucks
    * ...
    

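* Since these two top-level categories cover both cuisine types and food vendors, it should be sufficient to keep only the businesses whose category string contains 'Food' or 'Restaurants', which we can match with a regular expression.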
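
Before committing to that filter, it can help to see which individual category labels actually occur and how often. The following is a minimal sketch on a small, made-up `sample` dataframe (not the real dataset), assuming the labels inside each `categories` string are separated by commas:

```python
import pandas as pd

# Hypothetical toy rows standing in for the real business dataframe
sample = pd.DataFrame({"categories": [
    "Food, Restaurants, Grocery",
    "Auto Repair, Automotive",
    None,                      # some businesses have no categories at all
    "Desserts, Food",
]})

# Split each category string into single labels, one label per row, then count them
label_counts = (
    sample["categories"]
    .dropna()                  # skip businesses without categories
    .str.split(",")            # "Food, Grocery" -> ["Food", " Grocery"]
    .explode()                 # one row per label
    .str.strip()               # remove the whitespace left over after splitting
    .value_counts()
)

print(label_counts)
```

Running the same chain on `dataframe['categories']` would give a quick overview of the labels that a 'Food' or 'Restaurants' filter keeps or discards.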

---


## Solution 1: 

In [62]:
import re

# Build a boolean mask that is True where the category string contains the whole word
# 'food' or 'restaurants' (case-insensitive). Note: na=False treats NaN categories as
# non-matches, so rows without categories are filtered out as well.
boolMask = dataframe['categories'].str.contains(r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE, na=False)

# Apply the boolean mask to filter out all unrelated business categories
restaurants = dataframe[boolMask]

# Check the results
print(restaurants['categories'].head(n=10).to_string(index=False).replace(' ',''))
print('--'*50)
print(f'\nFound {len(restaurants)} restaurants or food related vendors')

EthnicFood,FoodTrucks,SpecialtyFood,ImportedFood,Argentine,Food,Restaurants,Empanadas
Desserts,Food,IceCream&FrozenYogurt
Food,Restaurants,Grocery,MiddleEastern
Hotels&Travel,Transportation,Taxis,Beer,Wine&Spirits,Food,FoodDeliveryServices
Restaurants,Cheesesteaks,Poutineries
Japanese,FastFood,FoodCourt,Restaurants
Shopping,Food,OrganicStores,SpecialtyFood,HealthMarkets,FoodDeliveryServices,Grocery,FarmersMarket
Persian/Iranian,Turkish,MiddleEastern,Restaurants,Kebab
Food,Grocery
Donuts,JuiceBars&Smoothies,Food,Coffee&Tea
----------------------------------------------------------------------------------------------------

Found 80492 restaurants or food related vendors


---

## Solution 2:

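* Similar to Solution 1: uses `str.contains()` and a boolean mask, but the code may be more readable. Unlike Solution 1, the match is case-sensitive and does not use word boundaries.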

In [64]:
# Filter the original dataframe categories using a bool mask for 'Restaurants' or 'Food' and str.contains()
restaurants_2 = dataframe[
        dataframe['categories'].str.contains('Restaurants', na=False) |
        dataframe['categories'].str.contains('Food', na=False)
]

# Check the results
print(restaurants_2['categories'].head(n=10).to_string(index=False).replace(' ',''))
print('--'*50)
print(f'\nFound {len(restaurants_2)} restaurants or food related vendors')

EthnicFood,FoodTrucks,SpecialtyFood,ImportedFood,Argentine,Food,Restaurants,Empanadas
Desserts,Food,IceCream&FrozenYogurt
Food,Restaurants,Grocery,MiddleEastern
Hotels&Travel,Transportation,Taxis,Beer,Wine&Spirits,Food,FoodDeliveryServices
Restaurants,Cheesesteaks,Poutineries
Japanese,FastFood,FoodCourt,Restaurants
Shopping,Food,OrganicStores,SpecialtyFood,HealthMarkets,FoodDeliveryServices,Grocery,FarmersMarket
Persian/Iranian,Turkish,MiddleEastern,Restaurants,Kebab
Food,Grocery
Donuts,JuiceBars&Smoothies,Food,Coffee&Tea
----------------------------------------------------------------------------------------------------

Found 80492 restaurants or food related vendors
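
The two solutions are not strictly equivalent: Solution 1 matches whole words and ignores case, while Solution 2 matches case-sensitive substrings. The sketch below illustrates the difference on a small, made-up `toy` dataframe; the category strings are hypothetical and chosen only to trigger the edge cases.

```python
import re
import pandas as pd

# Hypothetical category strings, not taken from the Yelp dataset
toy = pd.DataFrame({"categories": [
    "Food, Grocery",          # matched by both
    "fast food, burgers",     # lowercase: only the case-insensitive regex matches
    "Foodies Club",           # 'Food' is only a substring: only Solution 2 matches
    "Restaurants, Kebab",     # matched by both
    None,                     # missing categories: matched by neither
    "Auto Repair, Tires",     # unrelated: matched by neither
]})

# Solution 1 style: whole-word, case-insensitive regular expression
mask_1 = toy["categories"].str.contains(
    r"\bfood\b|\brestaurants\b", flags=re.IGNORECASE, na=False
)

# Solution 2 style: case-sensitive substring matches combined with OR
mask_2 = (
    toy["categories"].str.contains("Restaurants", na=False)
    | toy["categories"].str.contains("Food", na=False)
)

# Show the rows where the two masks disagree
print(toy[mask_1 != mask_2])
```

Applying `mask_1 != mask_2` in the same way to the real dataframe would show whether the two approaches select the same businesses there.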


---

## Export to JSON

* Apply the restaurant filter to all data points in ```yelp_academic_dataset_business.json```, then export the processed data as JSON.

In [65]:
# Export to JSON
restaurants.to_json(r'yelp_academic_dataset_restaurants.json')
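
Note that the input file is read with `lines=True` (one JSON object per line), whereas `to_json()` with default arguments writes a single column-oriented JSON object. If the exported file should keep the line-delimited layout of the input, one possible variant is sketched below on a made-up `sample` dataframe, writing to a temporary directory; the file name `restaurants_sample.json` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical rows standing in for the filtered `restaurants` dataframe
sample = pd.DataFrame({
    "name": ["Hypothetical Deli", "Hypothetical Cafe"],
    "categories": ["Food, Delis", "Restaurants, Cafes"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "restaurants_sample.json")

    # orient="records" with lines=True writes one JSON object per line
    sample.to_json(out_path, orient="records", lines=True)

    # The file can then be read back the same way as the original dataset
    reloaded = pd.read_json(out_path, lines=True)

print(reloaded)
```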

## Summary:
    
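* Because the Yelp developer documentation lists 'Restaurants' and 'Food' as the categories that cover cuisine types and food vendors, matching these two names with regular expressions should be sufficient to remove unrelated business categories from our dataset.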

* If our dataset were not structured and well documented, we would likely need to consider a classification technique for businesses, such as **Named Entity Recognition (NER)**.
---