# Capstone Project : Yelp's Customized Recommender (Part 1) - Data Transformation I

**Prepared by:** Daniel Han<br>
**Prepared for:** BrainStation

## Contents

Below is a list of contents showing how this and the following notebooks documenting the project execution are structured:

1. Introduction<br>
2. Procedure<br>
3. Limitations and Assumptions<br>
4. Disclaimer<br>
5. Project Execution
    - 5.1 Data Acquisition
    - 5.2 Data Transformation & Cleaning
        - 5.2.1 File Type Conversion
        - 5.2.2 Preliminary Data Cleaning and Combination of Multiple Datasets
        - 5.2.3 Further Data Cleaning
    - 5.3 Exploratory Data Analysis (EDA) & Preparation for Modelling
        - 5.3.1 General Understanding of Data
        - 5.3.2 Text Data Processing for Calibration of User Rating
        - 5.3.3 Datetime Data & Time Series Analysis
        - 5.3.4 Multi-Variate Relationships
        - 5.3.5 Final Dataframe to Model
    - 5.4 Baseline Recommenders
        - 5.4.1 Cluster Review with t-SNE Decomposition
        - 5.4.2 Content-Based Filtering : Finding Most Similar Businesses to Favourite
        - 5.4.3 Collaborative Filtering : User-Based KNN Approach
        - 5.4.4 Collaborative Filtering : Model-Based SVD Approach
        - 5.4.5 Baseline Model Comparison
    - 5.5 Neural Network
        - 5.5.1 Train-Test Split
        - 5.5.2 Scaling
        - 5.5.3 Model Construction
        - 5.5.4 Model Performance Evaluation
        - 5.5.5 NN Recommendations
    - 5.6 Final Model Selection & Optimization
        - 5.6.1 Model Comparison
        - 5.6.2 Model Optimization (Hybrid Recommendation)
6. Summary

## 1. Introduction

Yelp is an online platform where users can rate and write reviews on local businesses. Users can also browse the website and/or search for local businesses that interest them.

To make browsing on Yelp's website easier, a system is built that provides each user with customized recommendations of local businesses by predicting how satisfied a given user will be with a given business.

In this hypothetical project, approximately 300k Yelp consumer reviews on local businesses are analyzed to study the general attributes of successful local businesses and understand the consumer preferences.

Following the exploratory analyses, multiple recommenders are constructed and their performances are compared. The model construction begins with relatively simple recommenders such as content-based and collaborative filtering models, followed by a more computationally intensive model employing a neural network.

## 2. Procedure

While the process of analysis and modelling was iterative, the following general procedure was taken throughout:

**1. Data Acquisition:** Several datasets in json file format are downloaded from the Yelp website. Insights are drawn from these datasets and models are trained on them.

**2. Data Transformation & Cleaning:** The obtained raw datasets are converted to a comma-separated values (csv) format, combined, and transformed into a wrangled dataset prepared for analysis.

**3. Exploratory Data Analysis (EDA):** The general attributes of businesses and consumer info associated with high ratings are reviewed. Additional variables are created through feature engineering. Some of the findings are illustrated.

**4. Baseline Recommenders:** Three baseline recommenders are created: 1) content-based filtering, 2) user-based collaborative filtering, and 3) collaborative filtering with matrix factorization. 50 recommendations from each recommender are reviewed and compared against each other. While content-based filtering takes into account the attributes and categories of every business, collaborative filtering uses only the ratings from every available reviewer-business pair to make the recommendations.

**5. Neural Network:** A neural network model trained on all the business and user data is used to find the mathematical relationship between the rating and features of businesses and user profiles, taking the regression approach. Using this relationship, the rating of every user-business pair is predicted and the top 50 businesses are recommended for a given user.

**6. Final Model Selection & Optimization:** All the constructed recommenders are compared against each other and the model with the highest score for a given metric is selected as the final model. Subsequently, the final model is optimized to provide the best results (i.e., most relevant recommendations).
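
As an illustration of the last part of step 5, the sketch below picks the top-N businesses for one user from a table of predicted ratings. The dataframe, its column names, and the helper `top_n_for_user` are hypothetical and are not taken from the project code; the real predictions come from the models described in later notebooks.

```python
import pandas as pd

# Hypothetical predicted ratings for a few user-business pairs
predictions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "business_id": ["b1", "b2", "b3", "b1", "b3"],
    "predicted_rating": [4.2, 3.1, 4.8, 2.5, 3.9],
})


def top_n_for_user(preds, user_id, n=50):
    """Return the n business IDs with the highest predicted rating for one user."""
    user_preds = preds[preds["user_id"] == user_id]
    return user_preds.nlargest(n, "predicted_rating")["business_id"].tolist()


# With only three candidates, ask for the top two instead of the top 50
top_two = top_n_for_user(predictions, "u1", n=2)
print(top_two)
```

In the project itself `n` would be 50, matching the number of recommendations reviewed for each recommender.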

## 3. Limitations and Assumptions

- This project was conducted without knowledge of the accuracy/relevance metrics of the recommendation algorithm currently in place at Yelp. It is advised that any such metrics be compared between the existing algorithm and the one proposed here.


- When combining the existing star ratings with the sentiment scores to create a continuous scale from 0 to 5, 70% of the new rating was taken from the star ratings and the remaining 30% from the sentiment scores. The reasoning is that the star rating is still the better indication of one's experience with the business. Note, however, that choosing the 7:3 proportion introduces a bias; the proportion should be adjusted as required later in the process to better reflect true customer satisfaction.


- Business attributes and categories with more than 75% unknown values were deemed not indicative of the business rating and were therefore dropped.
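
The 7:3 weighting described in the list above can be sketched as follows. This is only an illustration: the sample values are made up, and the sentiment scores are assumed to lie between 0 and 1 before being rescaled to 0 to 5, which may differ from the scale actually used later in the project.

```python
import pandas as pd

# Hypothetical sample of star ratings and sentiment scores.
# ASSUMPTION: sentiment scores lie in [0, 1]; the real scale may differ.
reviews = pd.DataFrame({
    "stars": [5, 3, 1],
    "sentiment": [0.9, 0.5, 0.2],
})

STAR_WEIGHT = 0.7       # share of the new rating taken from the star rating
SENTIMENT_WEIGHT = 0.3  # share taken from the sentiment score

# Rescale sentiment to 0-5 so that both terms share the same scale
sentiment_0_to_5 = reviews["sentiment"] * 5

# Weighted combination of the two signals
reviews["blended_rating"] = (
    STAR_WEIGHT * reviews["stars"] + SENTIMENT_WEIGHT * sentiment_0_to_5
)
print(reviews)
```

Keeping the two weights as named constants makes the proportion easy to revisit if 7:3 turns out not to reflect customer satisfaction well.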

Other limitations and assumptions are discussed throughout the analysis.

## 4. Disclaimer

This project was conducted for academic purposes, using the dataset available on the Yelp website at the time this document was prepared. It was conducted as the capstone project for the BrainStation Data Science Diploma program. This document was not created for commercial purposes and is not to be redistributed for such uses.

The dataset was downloaded at the following URL:

https://www.yelp.com/dataset/download

To the best of the writer's knowledge and understanding, this document does not violate the terms of use for these datasets issued by Yelp.

Also, this report is not intended to publicly criticize Yelp or any of its services and/or features. This is a hypothetical project conducted as the capstone project for the BrainStation Data Science Diploma program.

## 5. Project Execution

This section of the notebook documents the execution of this project.

### 5.1 Data Acquisition

Four datasets downloaded from Yelp were used for this project:
   - business.json: Contains business data including location data, attributes, and categories
   - review.json: Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
   - user.json: User data including the user's friend mapping and all the metadata associated with the user.
   - checkin.json: Checkins on a business.

The data dictionary for each dataset is available at the following URL: https://www.yelp.com/dataset/documentation/main

### 5.2 Data Transformation and Cleaning

After having downloaded the data as json files, the data were then converted to csv, combined, and wrangled into a more useable format.

### 5.2.1 File Type Conversion

In this subsection, each dataset in json format was converted to a comma-separated values (csv) file.

#### Import Libraries

First, the libraries required to read in the datasets and convert them to csv are imported. In this case, pandas is the only library needed.

In [2]:
# Import pandas
import pandas as pd

#### Read in datasets

Each dataset is read in using the *pandas.read_json* function with `lines = True`, so that every line of the file is parsed as a separate json record. Then, the first five rows are displayed.
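
To illustrate what reading line-delimited json looks like, here is a minimal sketch using two made-up records held in memory rather than the actual Yelp files. The record contents and the name `df_demo` are hypothetical.

```python
from io import StringIO

import pandas as pd

# Two hypothetical records: one json object per line
json_lines = (
    '{"business_id": "a1", "stars": 4.5, "attributes": {"WiFi": "free"}}\n'
    '{"business_id": "b2", "stars": 3.0, "attributes": null}\n'
)

# lines=True tells pandas to parse each line as its own record
df_demo = pd.read_json(StringIO(json_lines), lines=True)

print(df_demo.shape)
print(df_demo.head())
```

Nested objects such as `attributes` are kept as Python dictionaries inside a single column, which is why they appear as `{...}` in the tables displayed in this notebook.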

##### Business

The `business` dataset contains the following fields:

- business_id : unique Business ID
- name : name of business
- address : business address
- city : city in which business is located
- state : state in which business is located
- postal_code : postal code of business
- latitude : latitude of business
- longitude : longitude of business
- stars : average star rating of the business
- review_count : number of reviews business received
- is_open : whether business is currently open
- attributes : business attributes
- categories : business categories
- hours : business hours

In [4]:
# Read in business.json
df_business = pd.read_json('yelp_dataset/yelp_academic_dataset_business.json', lines = True)

# Display business dataframe
df_business.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."
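
As the output above shows, `attributes` holds one dictionary per business. One possible way to expand such a column into one column per attribute is sketched below. The sample rows are made up, and this is not necessarily how the later cleaning notebooks handle the column.

```python
import pandas as pd

# Hypothetical rows mimicking the nested `attributes` column
df_demo = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "attributes": [
        {"BikeParking": "True", "BusinessAcceptsCreditCards": "True"},
        {"BusinessAcceptsCreditCards": "False"},
        None,  # some businesses list no attributes at all
    ],
})

# Replace missing dictionaries with empty ones so every row can be expanded
attr_dicts = [a if isinstance(a, dict) else {} for a in df_demo["attributes"]]

# One column per attribute key; keys a business does not have become NaN
df_attrs = pd.DataFrame(attr_dicts)

# Put the business ID back next to the expanded attributes
df_flat = pd.concat([df_demo[["business_id"]], df_attrs], axis=1)
print(df_flat)
```

Note that the attribute values are strings such as `'True'` rather than booleans, so a further conversion step would be needed before using them numerically.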


##### Checkin

The `checkin` dataset contains the following fields:

- business_id : Unique Business ID
- date : dates of all check-ins at the business on social media

In [6]:
# Read in checkin.json
df_checkin = pd.read_json('yelp_dataset/yelp_academic_dataset_checkin.json', lines = True)

# Display Checkin Dataframe
df_checkin.head()

Unnamed: 0,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012..."
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014..."
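
In the output above, each `date` value is a single string of comma-separated timestamps. A minimal sketch of turning that string into one row per check-in is shown below, using the third row of the output as sample input. The names `checkins_long` and `checkin_counts` are hypothetical, and this step is not part of the original notebook.

```python
import pandas as pd

# Sample row taken from the output above
df_demo = pd.DataFrame({
    "business_id": ["--30_8IhuyMHbSOcNWd6DQ"],
    "date": ["2013-06-14 23:29:17, 2014-08-13 23:20:22"],
})

# Split the string on ", " and give each timestamp its own row
checkins_long = (
    df_demo.assign(date=df_demo["date"].str.split(", "))
    .explode("date")
    .reset_index(drop=True)
)

# Convert the text timestamps to proper datetime values
checkins_long["date"] = pd.to_datetime(checkins_long["date"])

# Number of check-ins per business
checkin_counts = checkins_long.groupby("business_id").size()
print(checkin_counts)
```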


##### User

The `user` dataset contains the following fields:

- user_id : unique User ID
- name : name of user
- review_count : number of reviews written by user
- yelping_since : when the user joined Yelp, formatted like YYYY-MM-DD HH:MM:SS
- useful : number of useful votes sent by the user
- funny : number of funny votes sent by the user
- cool : number of cool votes sent by the user
- elite : the years the user was elite
- friends : an array of the user's friends as user_ids
- fans : number of fans the user has
- average_stars : average rating of all reviews
- compliment_hot : number of hot compliments received by the user
- compliment_more : number of more compliments received by the user
- compliment_profile : number of profile compliments received by the user
- compliment_cute : number of cute compliments received by the user
- compliment_list : number of list compliments received by the user
- compliment_note : number of note compliments received by the user
- compliment_plain : number of plain compliments received by the user
- compliment_cool : number of cool compliments received by the user
- compliment_funny : number of funny compliments received by the user
- compliment_writer : number of writer compliments received by the user
- compliment_photos : number of photo compliments received by the user

Due to the size of the file, the `user` dataset is read in chunks to avoid a **MemoryError**.

In [8]:
# Read in user.json

MyList = [] # Create an empty List
ChunkSize = 100000 # Read in and append 100,000 rows at a time to MyList

# Loop over chunks and append
for chunk in pd.read_json('yelp_dataset/yelp_academic_dataset_user.json', lines = True, chunksize = ChunkSize):
    MyList.append(chunk)

In [9]:
# Vertically concatenate all individual chunks
df_user = pd.concat(MyList, axis = 0)

# Display User Dataframe
df_user.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,...,65,55,56,18,232,844,467,467,239,180
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,...,264,184,157,251,1847,7054,3131,3131,1521,1946
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,20092010201120122013,"LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,...,13,10,17,3,66,96,119,119,35,18
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,200920102011,"enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,...,4,1,6,2,12,16,26,26,10,9
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,...,1,0,0,0,1,1,0,0,0,0
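
Collecting every chunk in a list and concatenating them still places the full table in memory at the end. If memory remains tight, one alternative (not the approach used in this notebook) is to write each chunk to csv as soon as it is read. The sketch below uses a small in-memory stand-in for the json file and for the output file; all names and values in it are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a large line-delimited json file
json_lines = "\n".join(
    '{"user_id": "u%d", "review_count": %d}' % (i, i * 10) for i in range(5)
)

# Stand-in for an output file; with a real path, pass mode="a" to append
csv_buffer = StringIO()

reader = pd.read_json(StringIO(json_lines), lines=True, chunksize=2)
for i, chunk in enumerate(reader):
    # Write the header only for the first chunk, then append the rows
    chunk.to_csv(csv_buffer, header=(i == 0), index=False)

# Read the result back to confirm that all rows were written
csv_buffer.seek(0)
df_roundtrip = pd.read_csv(csv_buffer)
print(df_roundtrip)
```

The trade-off is that the full dataframe is never available in the session, so this only suits cases where the csv file is the sole goal of the step.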


##### Review

The `review` dataset contains the following fields:

- review_id : unique review id
- user_id : unique user id
- business_id : unique business id
- stars : star rating
- useful : number of useful votes received
- funny : number of funny votes received
- cool : number of cool votes received
- text : the review itself
- date : date and time of the review, formatted YYYY-MM-DD HH:MM:SS

Due to the size of the file, the `review` dataset is read in chunks to avoid a **MemoryError**.

In [10]:
# Read in review.json

MyList = [] # Create an empty List
ChunkSize = 100000 # Read in and append 100,000 rows at a time to MyList

# Loop over chunks and append
for chunk in pd.read_json('yelp_dataset/yelp_academic_dataset_review.json', lines = True, chunksize = ChunkSize):
    MyList.append(chunk)

In [11]:
# Vertically concatenate all individual chunks
df_review = pd.concat(MyList, axis = 0)

# Display Review Dataframe
df_review.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


#### Save Datasets as csv

As a checkpoint, each dataframe is saved as a csv file in the local directory.

In [12]:
# Save df_business as csv
df_business.to_csv('business.csv')

# Save df_checkin as csv
df_checkin.to_csv('checkin.csv')

# Save df_user as csv
df_user.to_csv('user.csv')

# Save df_review as csv
df_review.to_csv('review.csv')
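
One detail worth keeping in mind when these files are read back: `to_csv` writes the dataframe index by default, and `pandas.read_csv` then loads it as an extra column named `Unnamed: 0`. The sketch below demonstrates this on a made-up two-row dataframe written to memory instead of disk; it is an illustration only and does not change the files saved above.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row dataframe
df_demo = pd.DataFrame({"business_id": ["a1", "b2"], "stars": [4.5, 3.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the files are kept as saved above, passing `index_col = 0` to `pandas.read_csv` is another way to avoid the extra column.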