# Data Pre-Processing: Removing Unrelated Business Categories

---

### Problem Description:
* The dataset ```yelp_academic_dataset_business.json``` has many business categories unrelated to food reviews.
* Since our project focuses on food review text, we need to identify and remove unrelated business categories from our dataset during preprocessing.

In [82]:
import pandas as pd

# Do not truncate the content of wide columns when displaying them
pd.set_option('display.max_colwidth', None)

# Create a dataframe containing the first fifty datapoints
dataframe = pd.read_json("yelp_academic_dataset_business.json", lines=True, nrows=50)

# Print the categories of the first ten businesses, with spaces removed
print(dataframe['categories'].head(n=10).to_string(index=False).replace(' ',''))

ActiveLife,Gun/RifleRanges,Guns&Ammo,Shopping
Health&Medical,Fitness&Instruction,Yoga,ActiveLife,Pilates
Pets,PetServices,PetGroomers
HardwareStores,HomeServices,BuildingSupplies,Home&Garden,Shopping
HomeServices,Plumbing,Electricians,Handyman,Contractors
AutoRepair,Automotive,OilChangeStations,TransmissionRepair
DryCleaning&Laundry,LocalServices,LaundryServices
AutoRepair,OilChangeStations,Automotive,Tires
EthnicFood,FoodTrucks,SpecialtyFood,ImportedFood,Argentine,Food,Restaurants,Empanadas
MartialArts,Gyms,Fitness&Instruction,ActiveLife
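
To get a feel for which categories dominate the data, the comma-separated strings can be split into one category per row and counted. The following is a minimal sketch rather than part of the original notebook: the `sample` dataframe is a hypothetical stand-in for the real `categories` column, and the same chain of calls should work on the full dataframe.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real 'categories' column
sample = pd.DataFrame({
    "categories": [
        "Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping",
        "Pets, Pet Services, Pet Groomers",
        "Ethnic Food, Food Trucks, Food, Restaurants",
        None,  # some businesses have no categories at all
        "Restaurants, Mexican",
    ]
})

# Split each string on commas, put one category per row,
# strip surrounding whitespace, and count the occurrences
category_counts = (
    sample["categories"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(category_counts.head())
```

Counting individual labels this way makes it easier to see how many businesses carry the exact `Food` or `Restaurants` label before any filtering is applied.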


### Regular Expressions Solution

* The Yelp developer documentation provides a business [category list](https://www.yelp.com/developers/documentation/v3/category_list).


* The category ```Restaurants (restaurants)``` subdivides into cuisine types, e.g.:
    * Soul Food
    * Tapas


* The category ```Food (food)``` subdivides into food vendors, e.g.:
    * Gelato
    * Food Trucks


* Ignore unrelated business categories involving food:
    * Food Safety Training (foodsafety)
    * Food Tours (foodtours)
    * Food Banks (foodbanks)
    


* Use a regular expression to select the target categories ```Food (food)``` and ```Restaurants (restaurants)```:

In [83]:
import re

# Create a dataframe containing the first fifty datapoints of the dataset
dataframe = pd.read_json("yelp_academic_dataset_business.json", lines=True, nrows=50)

# Render the column as text, remove spaces, and split on newlines to get one category string per business
businesses = dataframe['categories'].to_string(index=False).replace(' ','').splitlines()

# Construct our regular expression
foodieRegex = re.compile(r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE)

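# Filter our business categories using the regular expression.
# Note: Pattern.match only tests the start of each string, so this keeps only
# businesses whose category string begins with Food or Restaurants.
print(list(filter(foodieRegex.match, businesses)))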

['Food,Restaurants,Grocery,MiddleEastern', 'Restaurants,Cheesesteaks,Poutineries', 'Food,Grocery', 'Food,Pretzels,Bakeries,FastFood,Restaurants', 'Restaurants,Burgers,Food', 'Restaurants,Vietnamese,Soup', 'Restaurants,Lebanese,MiddleEastern', 'Restaurants,FastFood,Burgers']
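
Every string in the output above starts with `Food` or `Restaurants` because `Pattern.match` anchors at the beginning of the string, whereas `Pattern.search` scans the whole string. The sketch below illustrates the difference using three category strings that appear in this notebook's outputs; the variable names are illustrative only.

```python
import re

food_regex = re.compile(r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE)

# Category strings taken from the outputs in this notebook (spaces removed)
businesses = [
    "Restaurants,Cheesesteaks,Poutineries",
    "Japanese,FastFood,FoodCourt,Restaurants",
    "Pets,PetServices,PetGroomers",
]

# match: the pattern must match at position 0 of the string
matched_at_start = [b for b in businesses if food_regex.match(b)]

# search: the pattern may match anywhere in the string
matched_anywhere = [b for b in businesses if food_regex.search(b)]

print(matched_at_start)
print(matched_anywhere)
```

With `search`, the second string is also selected, since `Restaurants` appears at its end. This is the behaviour that `Series.str.contains` provides.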


### Data Reduction: Remove Unrelated Business Categories

In [92]:
# Use the regular expression above to create a boolean mask for each row in the dataframe.
# Note: na=False treats rows with missing (NaN) categories as non-matches, so they are dropped too
boolMask = dataframe['categories'].str.contains(r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE, na=False)

# Apply the boolean mask to keep only food-related businesses
restaurants = dataframe[boolMask]

# Check the results
print(restaurants['categories'].head(n=10).to_string(index=False).replace(' ',''))

EthnicFood,FoodTrucks,SpecialtyFood,ImportedFood,Argentine,Food,Restaurants,Empanadas
Desserts,Food,IceCream&FrozenYogurt
Food,Restaurants,Grocery,MiddleEastern
Hotels&Travel,Transportation,Taxis,Beer,Wine&Spirits,Food,FoodDeliveryServices
Restaurants,Cheesesteaks,Poutineries
Japanese,FastFood,FoodCourt,Restaurants
Shopping,Food,OrganicStores,SpecialtyFood,HealthMarkets,FoodDeliveryServices,Grocery,FarmersMarket
Persian/Iranian,Turkish,MiddleEastern,Restaurants,Kebab
Food,Grocery
Donuts,JuiceBars&Smoothies,Food,Coffee&Tea
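
One caveat is worth checking: the raw `categories` column still contains spaces, so `\bfood\b` also matches the word "Food" inside multi-word labels such as `Food Tours` or `Food Banks`, which were listed earlier as categories to ignore. The following sketch, which assumes such rows may occur in the data, compares the regular-expression mask with an exact-label test. The sample rows and the helper `has_target_category` are hypothetical and not part of the original notebook.

```python
import re
import pandas as pd

# Hypothetical rows; the last three illustrate labels we want to exclude
sample = pd.DataFrame({
    "categories": [
        "Food, Grocery",
        "Mexican, Restaurants",
        "Food Banks, Community Service/Non-Profit",
        "Tours, Food Tours",
        None,
    ]
})

# Regular-expression mask, as used in the notebook
regex_mask = sample["categories"].str.contains(
    r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE, na=False
)

# Alternative: split on commas and require an exact label match
targets = {"food", "restaurants"}

def has_target_category(categories):
    """Return True if any comma-separated label is exactly a target label."""
    if not isinstance(categories, str):
        return False
    labels = {label.strip().lower() for label in categories.split(",")}
    return not targets.isdisjoint(labels)

exact_mask = sample["categories"].apply(has_target_category)

print(pd.DataFrame({"regex": regex_mask, "exact": exact_mask}))
```

If the two masks disagree on the full dataset, the differing rows are the ones to inspect by hand before deciding which filter to keep.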


## Export to JSON

* Apply the restaurant filter to all datapoints in ```yelp_academic_dataset_business.json```, then export the processed data as a JSON file.

In [149]:
# Load the data
dataframe = pd.read_json("yelp_academic_dataset_business.json", lines=True)

# Use the regular expression above to create a boolean mask for each row in the dataframe.
# Note: na=False treats rows with missing (NaN) categories as non-matches, so they are dropped too
boolMask = dataframe['categories'].str.contains(r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE, na=False)

# Apply the boolean mask to keep only food-related businesses
restaurants = dataframe[boolMask]

# Export to JSON
restaurants.to_json(r'yelp_academic_dataset_restaurants.json')
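
Called without further arguments, `to_json` writes a single column-oriented JSON object, which differs from the one-record-per-line layout that `read_json(..., lines=True)` reads from the source file. If the exported file should keep the original layout, one option is to pass `orient="records"` and `lines=True`. The sketch below shows a round trip on a small hypothetical dataframe written to a temporary directory; the file name is illustrative only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the filtered 'restaurants' dataframe
sample = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "categories": ["Food, Grocery", "Mexican, Restaurants"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "restaurants_sample.json")

    # Write one JSON record per line, mirroring the source file's layout
    sample.to_json(out_path, orient="records", lines=True)

    # Read the file back the same way the notebook reads the original
    reloaded = pd.read_json(out_path, lines=True)

print(reloaded)
```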

## Check the updated dataframe for restaurants and food vendor locations in the state of Arizona

In [195]:
import matplotlib.pyplot as plt
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Apply a filter to select all restaurants in the state of Arizona
arizonaFilter = restaurants['state'].str.contains(r'\bAZ\b', na=False)
arizonaRestaurants = restaurants[arizonaFilter]

# How many restaurants / food related vendors are there in our dataset for the state of Arizona?
print(f"This dataset contains {arizonaRestaurants.shape[0]} restaurants or food related vendors in the state of Arizona.\n\n")

# Build a Point geometry from each (longitude, latitude) pair
geometry = [Point(xy) for xy in zip(arizonaRestaurants['longitude'], arizonaRestaurants['latitude'])]
geo_df = gpd.GeoDataFrame(arizonaRestaurants, geometry=geometry)
geo_df.head(n=3)

This dataset contains 15706 restaurants or food related vendors in the state of Arizona.




Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,geometry
32,DCsS3SgVFO56F6wRO_ewgA,Missy Donuts & Coffee,1255 W Main St,Mesa,AZ,85201,33.414409,-111.858378,2.5,7,0,"{'BikeParking': 'True', 'BusinessParking': '{'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}', 'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '1'}","Donuts, Juice Bars & Smoothies, Food, Coffee & Tea","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'Wednesday': '0:0-0:0', 'Thursday': '0:0-0:0', 'Friday': '0:0-0:0', 'Saturday': '0:0-0:0', 'Sunday': '0:0-0:0'}",POINT (-111.85838 33.41441)
33,vjTVxnsQEZ34XjYNS-XUpA,Wetzel's Pretzels,"4550 East Cactus Rd, #KSFC-4",Phoenix,AZ,85032,33.602822,-111.983533,4.0,10,1,"{'GoodForKids': 'True', 'RestaurantsTakeOut': 'True', 'RestaurantsPriceRange2': '1', 'BusinessAcceptsCreditCards': 'True', 'OutdoorSeating': 'False', 'BikeParking': 'True', 'RestaurantsAttire': 'u'casual'', 'RestaurantsReservations': 'False', 'Ambience': '{'romantic': False, 'intimate': False, 'touristy': False, 'hipster': False, 'divey': False, 'classy': False, 'trendy': False, 'upscale': False, 'casual': False}', 'RestaurantsGoodForGroups': 'True', 'Alcohol': 'u'none'', 'RestaurantsDelivery': 'False', 'HasTV': 'False', 'BusinessParking': '{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}'}","Food, Pretzels, Bakeries, Fast Food, Restaurants","{'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0', 'Wednesday': '10:0-21:0', 'Thursday': '10:0-21:0', 'Friday': '10:0-21:0', 'Saturday': '10:0-21:0', 'Sunday': '11:0-18:0'}",POINT (-111.98353 33.60282)
44,Ga2Bt7xfqoggTypWD5VpoQ,Amando's Bros,2602 W Southern Ave,Tempe,AZ,85282,33.393199,-111.97627,4.0,9,0,"{'Caters': 'False', 'RestaurantsGoodForGroups': 'True', 'RestaurantsAttire': 'u'casual'', 'Alcohol': 'u'none'', 'Ambience': '{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}', 'RestaurantsTakeOut': 'True', 'NoiseLevel': 'u'quiet'', 'BusinessAcceptsCreditCards': 'True', 'WiFi': 'u'no'', 'RestaurantsReservations': 'False', 'HasTV': 'True', 'RestaurantsPriceRange2': '1', 'GoodForKids': 'True', 'RestaurantsDelivery': 'False', 'OutdoorSeating': 'False', 'BusinessParking': '{'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}'}","Mexican, Restaurants","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-20:0', 'Saturday': '7:0-20:0', 'Sunday': '7:0-20:0'}",POINT (-111.97627 33.39320)
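
A quick way to sanity-check the coordinates without plotting is to test whether each point falls inside a rough bounding box for Arizona. This is a minimal sketch under stated assumptions: the box limits are approximate values chosen for illustration, the first two rows reuse coordinates from the output above, and the third row is a hypothetical outlier.

```python
import pandas as pd

# Approximate bounding box for Arizona (an assumption for this sanity check)
AZ_LAT_RANGE = (31.3, 37.0)
AZ_LON_RANGE = (-114.9, -109.0)

sample = pd.DataFrame({
    "name": ["Missy Donuts & Coffee", "Wetzel's Pretzels", "Hypothetical Outlier"],
    "state": ["AZ", "AZ", "AZ"],
    "latitude": [33.414409, 33.602822, 45.0],
    "longitude": [-111.858378, -111.983533, -73.0],
})

# True where both coordinates fall inside the bounding box
in_box = (
    sample["latitude"].between(*AZ_LAT_RANGE)
    & sample["longitude"].between(*AZ_LON_RANGE)
)

# Rows labelled AZ whose coordinates fall outside the box deserve a closer look
outliers = sample[~in_box]
print(outliers[["name", "latitude", "longitude"]])
```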


## Summary:
    
* Because the Yelp developer documentation defines a fixed list of categories, regular expressions should be sufficient to remove unrelated business categories from our dataset.

* If our dataset were not structured and well documented, we would likely need to consider a classification technique for businesses, such as **Named Entity Recognition (NER)**.
---