# 1. Capstone Project: Customized Recommendation System with Yelp Dataset
## Data Transformation and Cleaning (Part 1)

**Prepared by:** Daniel Han<br>
**Prepared for:** BrainStation

## 0. Contents

Below is a list of contents showing how this and the following notebooks documenting the project execution are structured:

1. Business Brief<br>
2. Procedure<br>
3. Limitations and Assumptions<br>
4. Disclaimer<br>
5. Project Execution
    - 5.1 Data Acquisition
    - 5.2 Data Transformation & Cleaning
        - 5.2.1 File Type Conversion
        - 5.2.2 Preliminary Data Wrangling and Combination of Multiple Datasets
        - 5.2.3 Further Data Wrangling
    - 5.3 Exploratory Data Analysis (EDA)
        - 5.3.1 General Understanding of Data
        - 5.3.2 Text Data Processing & Analysis
        - 5.3.3 Time Data Distribution & Time Series Analysis
        - 5.3.4 Multi-Variate Relationships
        - 5.3.5 Final Dataframe to Model
    - 5.4 Preliminary Prediction Model
    - 5.5 Advanced Machine Learning Models
    - 5.6 Customized User Recommendations

## 1. Business Brief

Yelp is an online platform where users can rate and review local businesses. Users can also browse the website and search for local businesses of interest.

To make browsing easier, this project aims to give each user customized recommendations of local businesses by predicting how satisfied that user would be with each business.

In this hypothetical project, upwards of 120,000 consumer reviews on Yelp are analyzed to study the attributes of successful local businesses and to understand consumer preferences.

Following exploratory and statistical analyses, a given consumer's satisfaction with a business is predicted on a continuous scale from 0 to 5.

As the final product of this project, a customized recommendation system is built on the prediction model above so that every user is recommended 10 nearby business locations. This and the following notebooks summarize the findings and results.

## 2. Procedure

While the process of analysis and modelling was iterative, the following general procedure was followed throughout:

**1. Data Acquisition:** Several datasets available in json format were downloaded from the Yelp website. This project was executed on these datasets.

**2. Data Transformation & Cleaning:** The raw datasets were converted to comma-separated values (csv) format, then combined, transformed, and wrangled using Python and its libraries into a cleaner dataset, i.e., a dataframe, ready for analysis. The Pandas and NumPy libraries were used throughout the process.

**3. Exploratory Data Analysis (EDA)**: Key findings on the attributes of the most-liked businesses and on consumer behaviours are reported. Feature engineering was performed in this process to create other meaningful variables. Several noticeable patterns were illustrated using Python visualization libraries, including Matplotlib, Seaborn, and Plotly.

**4. Preliminary Prediction Model:** (To be completed - Currently in Progress)

**5. Advanced Machine Learning Models:** (To be Completed)

**6. Customized User Recommendations**: (To be Completed)

## 3. Limitations and Assumptions

To score a user's satisfaction with a business more precisely, a continuous scale from 0 to 5 was established based on a sentiment analysis of the reviews, instead of using the five-star rating.
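As a rough illustration of how a sentiment score could be placed on a 0 to 5 scale, the sketch below linearly rescales a polarity value. It assumes the sentiment analysis yields a polarity in the range -1 to 1; the function name `sentiment_to_score` and the linear mapping are hypothetical and are not necessarily the approach used later in this project.

```python
def sentiment_to_score(polarity: float) -> float:
    """Linearly map a polarity in [-1, 1] onto [0, 5] (illustrative assumption)."""
    # Clip out-of-range inputs so the result always stays within 0 to 5.
    clipped = max(-1.0, min(1.0, polarity))
    return (clipped + 1.0) * 2.5


# Made-up polarity values, for demonstration only.
for polarity in (-1.0, 0.0, 0.5, 1.0):
    print(polarity, sentiment_to_score(polarity))
```

Under this assumed mapping, a neutral review (polarity 0) lands at the midpoint of the scale, 2.5.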

Other limitations and assumptions are discussed throughout the analysis.

## 4. Disclaimer

As mentioned earlier, this hypothetical project was conducted for academic purposes using the dataset available on the Yelp website at the time this document was prepared. This project is solely for academic and personal purposes, namely to submit as a project at an academic institute (BrainStation) and to create a personal portfolio. This document was not created for commercial purposes, and is not to be used or redistributed for generating any form of income.

The dataset was downloaded at the following URL:

https://www.yelp.com/dataset/download

To the best of the author's knowledge and understanding, this document does not violate the terms of use for these datasets issued by Yelp.

While the datasets used belong to Yelp, the entire execution shown in this document is the original work of the author of this document (i.e., Daniel Han) and shall not be published elsewhere without appropriately citing the source.

## 5. Project Execution

This section of the notebook documents the entire project execution process and discusses the findings.

### 5.1 Data Acquisition

Four datasets downloaded from Yelp were used for this project:
   - business.json: Contains business data, including location data, attributes, and categories.
   - review.json: Contains full review text data, including the user_id that wrote the review and the business_id the review is written for.
   - user.json: Contains user data, including the user's friend mapping and all the metadata associated with the user.
   - checkin.json: Contains check-ins on a business.

The data dictionary for each dataset is available at the following URL: https://www.yelp.com/dataset/documentation/main

### 5.2 Data Transformation and Cleaning

After being downloaded as json files, the data were converted to csv, combined, and wrangled into a more usable format.

### 5.2.1 File Type Conversion

In this subsection, each dataset in json format was converted to a comma-separated values (csv) file.

#### Import Libraries

First, the libraries required to read in the datasets and convert them to csv are imported. In this case, Pandas is the only one needed.

In [2]:
# Import pandas
import pandas as pd

#### Read in datasets

Each dataset is read in using the *pandas.read_json* function. Then, the first five rows are displayed.
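The cells below pass `lines = True`, which tells pandas to treat every line of the file as a separate JSON object. The following minimal sketch shows the effect on a small in-memory sample; the two records are made up for illustration and are not taken from the Yelp files.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4.5}\n'
    '{"business_id": "b2", "stars": 3.0}\n'
)

# lines=True parses each line as one record (one dataframe row).
df_sample = pd.read_json(sample, lines=True)
print(df_sample)
```

Each input line becomes one row, and the keys of the JSON objects become the columns.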

##### Business

In [4]:
# Read in business.json
df_business = pd.read_json('yelp_dataset/yelp_academic_dataset_business.json', lines = True)

# Display business dataframe
df_business.head()

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |


##### Checkin

In [6]:
# Read in checkin.json
df_checkin = pd.read_json('yelp_dataset/yelp_academic_dataset_checkin.json', lines = True)

# Display Checkin Dataframe
df_checkin.head()

| | business_id | date |
|---|---|---|
| 0 | ---kPU91CF4Lq2-WlRu9Lw | 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020... |
| 1 | --0iUa4sNDFiZFrAdIWhZQ | 2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011... |
| 2 | --30_8IhuyMHbSOcNWd6DQ | 2013-06-14 23:29:17, 2014-08-13 23:20:22 |
| 3 | --7PUidqRWpRSpXebiyxTg | 2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012... |
| 4 | --7jw19RH9JKXgFohspgQw | 2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014... |


Due to the size of the files, the 'user' and 'review' datasets are read in chunks to avoid a **MemoryError**.
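The sketch below shows what chunked reading looks like on a tiny in-memory sample, so the chunk boundaries are easy to see. The five records and the chunk size of 2 are made up for illustration; the notebook itself uses the real Yelp files with 100,000 rows per chunk.

```python
import io

import pandas as pd

# Five made-up user records in JSON Lines form.
sample = io.StringIO(
    "".join(f'{{"user_id": "u{i}", "review_count": {i}}}\n' for i in range(5))
)

# With chunksize set, read_json returns an iterator of dataframes
# instead of a single dataframe.
chunks = []
for chunk in pd.read_json(sample, lines=True, chunksize=2):
    chunks.append(chunk)

chunk_sizes = [len(c) for c in chunks]
print(chunk_sizes)  # [2, 2, 1]

# Stack the chunks vertically into one dataframe with a fresh 0..n-1 index.
df_all = pd.concat(chunks, axis=0, ignore_index=True)
print(df_all.shape)  # (5, 2)
```

Note that chunking mainly lowers the peak memory needed while parsing; the concatenated dataframe still has to fit in memory as a whole.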

##### User

In [8]:
# Read in user.json

MyList = [] # Create an empty List
ChunkSize = 100000 # Read in and append 100,000 rows at a time to MyList

# Loop over chunks and append
for chunk in pd.read_json('yelp_dataset/yelp_academic_dataset_user.json', lines = True, chunksize = ChunkSize):
    MyList.append(chunk)

In [9]:
# Vertically concatenate all individual chunks
df_user = pd.concat(MyList, axis = 0)

# Display User Dataframe
df_user.head()

| | user_id | name | review_count | yelping_since | useful | funny | cool | elite | friends | fans | ... | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | qVc8ODYU5SZjKXVBgXdI7w | Walker | 585 | 2007-01-25 16:47:26 | 7217 | 1259 | 5994 | 2007 | NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA... | 267 | ... | 65 | 55 | 56 | 18 | 232 | 844 | 467 | 467 | 239 | 180 |
| 1 | j14WgRoU_-2ZE1aw1dXrJg | Daniel | 4333 | 2009-01-25 04:35:42 | 43091 | 13066 | 27281 | 2009,2010,2011,2012,2013,2014,2015,2016,2017,2... | ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A... | 3138 | ... | 264 | 184 | 157 | 251 | 1847 | 7054 | 3131 | 3131 | 1521 | 1946 |
| 2 | 2WnXYQFK0hXEoTxPtV2zvg | Steph | 665 | 2008-07-25 10:41:00 | 2086 | 1010 | 1003 | 20092010201120122013 | LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA... | 52 | ... | 13 | 10 | 17 | 3 | 66 | 96 | 119 | 119 | 35 | 18 |
| 3 | SZDeASXq7o05mMNLshsdIA | Gwen | 224 | 2005-11-29 04:38:33 | 512 | 330 | 299 | 200920102011 | enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg... | 28 | ... | 4 | 1 | 6 | 2 | 12 | 16 | 26 | 26 | 10 | 9 |
| 4 | hA5lMy-EnncsH4JoR-hFGQ | Karen | 79 | 2007-01-05 19:40:59 | 29 | 15 | 7 | | PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA... | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


##### Review

In [10]:
# Read in review.json

MyList = [] # Create an empty List
ChunkSize = 100000 # Read in and append 100,000 rows at a time to MyList

# Loop over chunks and append
for chunk in pd.read_json('yelp_dataset/yelp_academic_dataset_review.json', lines = True, chunksize = ChunkSize):
    MyList.append(chunk)

In [11]:
# Vertically concatenate all individual chunks
df_review = pd.concat(MyList, axis = 0)

# Display Review Dataframe
df_review.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow!  Yummy, different,  delicious.   Our favo... | 2015-01-04 00:01:03 |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |


#### Save Datasets as csv

As a checkpoint, each dataframe is saved as a csv file in the local directory.

In [12]:
# Save df_business as csv
df_business.to_csv('business.csv')

# Save df_checkin as csv
df_checkin.to_csv('checkin.csv')

# Save df_user as csv
df_user.to_csv('user.csv')

# Save df_review as csv
df_review.to_csv('review.csv')
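When a checkpoint saved this way is loaded again, it helps to know that `to_csv` writes the dataframe index as an unnamed first column by default. The sketch below round-trips a small made-up dataframe through a temporary file to show the difference between a plain `read_csv` and one that uses `index_col = 0`; the file name `demo.csv` and the data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small made-up dataframe standing in for one of the Yelp dataframes.
df_demo = pd.DataFrame({"business_id": ["b1", "b2"], "stars": [4.5, 3.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")

    # Same call style as above: the index is written as the first column.
    df_demo.to_csv(csv_path)

    # A plain read turns the saved index into an 'Unnamed: 0' column.
    df_plain = pd.read_csv(csv_path)

    # Telling pandas that column 0 is the index restores the original layout.
    df_indexed = pd.read_csv(csv_path, index_col=0)

print(list(df_plain.columns))    # ['Unnamed: 0', 'business_id', 'stars']
print(list(df_indexed.columns))  # ['business_id', 'stars']
```

An alternative, if the index carries no information, is to pass `index = False` to `to_csv` so that no extra column is written in the first place.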