# Data Pre-Processing: Removing Unrelated Business Categories

---

## Problem Description:
* The dataset ```yelp_academic_dataset_business.json``` has many business categories unrelated to food reviews.
* Since our project focuses on food review text, we need to identify and remove unrelated business categories from our dataset during preprocessing.

---

In [64]:
import pandas as pd
import re

# Do not truncate column content and show all columns
pd.set_option('display.max_colwidth', None)

In [16]:
prefix = '../../data/'
files = {
    "reviewsPath": prefix + 'yelp_academic_dataset_review.json',
    "businessesPath": prefix + 'yelp_academic_dataset_business.json',
    "checkinsPath": prefix + 'yelp_academic_dataset_checkin.json',
    "tipsPath": prefix + 'yelp_academic_dataset_tip.json',
    "usersPath": prefix + 'yelp_academic_dataset_user.json'
}

In [56]:
# Load all businesses (one JSON object per line) into a dataframe
df = pd.read_json(
    files["businessesPath"], 
    lines=True
)

## Business Categories Documentation:

* The Yelp developer documentation provides a business [category list](https://www.yelp.com/developers/documentation/v3/category_list).


* The category ```Restaurants (restaurants)``` subdivides into cuisine types, e.g.:
    * Soul Food
    * Tapas
    * ...


* The category ```Food (food)``` subdivides into food vendors, e.g.:
    * Gelato
    * Food Trucks
    * ...
    

* It should therefore be sufficient to filter out every business whose ```categories``` string contains neither 'Food' nor 'Restaurants', which a regular expression can do.

---
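The filtering idea above can be sketched on a few rows before running it on the full dataset. The toy dataframe and its category strings below are hypothetical stand-ins for the real ```categories``` column; only the pattern and the ```str.contains()``` arguments come from the notebook.

```python
import re

import pandas as pd

# Hypothetical sample rows; the real data comes from the business JSON file
toy = pd.DataFrame({
    "categories": [
        "Restaurants, Soul Food",   # matches both words
        "Food, Gelato",             # matches 'food'
        "Automotive, Auto Repair",  # unrelated
        "Seafood Markets",          # 'food' is not a whole word here
        None,                       # missing categories
    ]
})

# Whole-word, case-insensitive match; na=False turns missing values into False
pattern = r"\bfood\b|\brestaurants\b"
mask = toy["categories"].str.contains(pattern, flags=re.IGNORECASE, na=False)

print(toy[mask])
```

Because of the ```\b``` word boundaries, a string such as "Seafood Markets" is not matched by this pattern, which is worth keeping in mind when checking the filtered output.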


## Solution 1: 

In [72]:
def regexMask():
    # Build a boolean mask: True where 'categories' contains the whole word
    # 'food' or 'restaurants' (case-insensitive). Note: na=False maps NaN values to False
    boolMask = df['categories'].str.contains(r'\bfood\b|\brestaurants\b', flags=re.IGNORECASE, na=False)
#     print(boolMask[:5])

    # Apply the boolean mask to filter out all unrelated business categories
    restaurants = df[boolMask]
    return restaurants

# Check the results
# restaurants = regexMask()
# print(restaurants['categories'].head(n=10).to_string(index=False).replace(' ',''))
# print('--'*50)
# print(f'\nFound {len(restaurants)} restaurants or food related vendors')

---

## Solution 2:

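* Similar to Solution 1: build a boolean mask with ```str.contains()```, but with one plain pattern per category, which may be more readable. These patterns are case-sensitive and have no word boundaries, so the result can differ slightly from Solution 1.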

In [73]:
def containsMask():
    # Keep rows whose 'categories' contain 'Restaurants' or 'Food' (case-sensitive substring match).
    # na=False treats missing categories as non-matches
    restaurants_2 = df[
            df['categories'].str.contains('Restaurants', na=False) |
            df['categories'].str.contains('Food', na=False)
    ]
    return restaurants_2

# Check the results
# restaurants_2 = containsMask()
# print(restaurants_2['categories'].head(n=10).to_string(index=False).replace(' ',''))
# print('--'*50)
# print(f'\nFound {len(restaurants_2)} restaurants or food related vendors')
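
Since the two solutions use different patterns, it can help to see where they disagree. The sketch below applies both masks to a hypothetical series of category strings (invented for illustration, not taken from the dataset) and lists the rows on which the results differ.

```python
import re

import pandas as pd

# Hypothetical category strings chosen to expose the differences
toy = pd.Series(
    [
        "Restaurants, Tapas Bars",  # matched by both approaches
        "fast food",                # lowercase: only the case-insensitive regex matches
        "Seafood Markets",          # matched by neither
        "Foodies Club",             # substring 'Food' matches, whole word 'food' does not
        None,                       # missing value
    ],
    name="categories",
)

# Solution 1 style: whole words, case-insensitive
regex_mask = toy.str.contains(r"\bfood\b|\brestaurants\b", flags=re.IGNORECASE, na=False)

# Solution 2 style: case-sensitive substrings
substr_mask = toy.str.contains("Restaurants", na=False) | toy.str.contains("Food", na=False)

# Rows where the two approaches give different answers
disagree = toy[regex_mask != substr_mask]
print(disagree)
```

Running the same comparison on the real ```categories``` column would show whether the difference matters in practice for this dataset.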

### Performance Comparison

`%timeit -n100 -r3 <statement>`
> Each statement is executed 100 loops per run * 3 runs = 300 times

In [63]:
%timeit -n100 -r3 regexMask()
%timeit -n100 -r3 containsMask()

341 ms ± 3.2 ms per loop (mean ± std. dev. of 3 runs, 100 loops each)
214 ms ± 1.48 ms per loop (mean ± std. dev. of 3 runs, 100 loops each)
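
Outside a notebook, a comparable measurement can be sketched with the standard-library ```timeit``` module. The function and toy dataframe below are hypothetical stand-ins for the notebook's functions and data, so the resulting numbers are not comparable to the timings above.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data; much smaller than the real business file
toy = pd.DataFrame({"categories": ["Restaurants, Tapas", "Automotive", None] * 1000})


def contains_mask():
    # Same idea as containsMask(), applied to the toy dataframe
    return toy[toy["categories"].str.contains("Restaurants|Food", na=False)]


loops = 100
# repeat=3 and number=100 mirror the -r3 and -n100 options of %timeit
runs = timeit.repeat(contains_mask, number=loops, repeat=3)

# Each entry in runs is the total time for 100 loops; convert to time per loop
per_loop = [total / loops for total in runs]
print(f"best of {len(runs)} runs: {min(per_loop) * 1e3:.3f} ms per loop")
```

Note that ```%timeit``` reports the mean and standard deviation across runs, whereas this sketch prints only the fastest run.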


---

## Export to JSON

* Apply the restaurant filter to all data points in ```yelp_academic_dataset_business.json```, then export the processed data as a JSON file.

In [65]:
# Apply the filter, then export to JSON
restaurants = regexMask()
restaurants.to_json(r'yelp_academic_dataset_restaurants.json')
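
The export above uses the default layout of ```to_json```. If the output should keep the same one-object-per-line layout as the input file (the one read with ```lines=True```), a possible variant is sketched below. The toy dataframe, the file name, and the temporary directory are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical filtered result standing in for the real restaurants dataframe
toy = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "categories": ["Restaurants, Tapas Bars", "Food, Gelato"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "restaurants_sample.json")

    # orient='records' with lines=True writes one JSON object per line
    toy.to_json(out_path, orient="records", lines=True)

    # The file can then be read back the same way as the original dataset
    restored = pd.read_json(out_path, lines=True)

print(restored)
```

With this layout, the exported file can be loaded with the same ```pd.read_json(..., lines=True)``` call used for the business file.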

## Summary:
    
* Based on the category structure in the Yelp developer documentation, a regular-expression filter on 'Food' and 'Restaurants' should be sufficient to remove unrelated business categories from our dataset.

* If our dataset were not structured and well documented, we would likely need to consider a classification technique for businesses, such as **Named Entity Recognition (NER)**.

---