# Capstone Project : Yelp's Customized Recommender (Part 1) - Data Transformation I

**Prepared by:** Daniel Han<br>
**Prepared for:** Brainstation

## Executive Summary Report

### 1. Business Brief

#### Background

<img src =https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Yelp_Logo.svg/1200px-Yelp_Logo.svg.png width = "600">

Yelp is an online platform which allows users to interactively share information about local businesses. Users can rate their local businesses based on their experience, on a scale of 1 star to 5 stars, which can in turn be helpful to other users on the platform who might use the service of the businesses.

A feature of Yelp reviews, common to other platforms, is that raters must also write a review of each business they rate. The review explains their experience so that other users can weigh factors such as price, atmosphere, customer service, and parking availability before deciding whether to use the business (Figure 1).

Users on Yelp can also vote on reviews using criteria such as 'useful', 'funny', and 'cool' to share their impressions of them. They can also become fans or friends of other users to follow their activity.

**Figure 1 - Yelp Reviews and Ratings**

<img src = https://www.wordstream.com/wp-content/uploads/2021/07/yelp-reviews-filtered-1.png width = "800">

Another feature of Yelp is that each user has a profile, and other users can give that user compliments such as 'funny', 'hot', 'cute', and 'plain'.

#### Problem Statement

On Yelp, recommendations are provided on the main page under a tab reading "Your Next Review Awaits" (Figure 2). While this gives users some guidance as to which places they might try, each user may have unique preferences, and it may be unclear why those local businesses are recommended to them.

**Figure 2 - Current Yelp Recommendations**


<img src="Your%20Next%20Review%20Swaits.jpg" width="600">


While some users may take up the suggestions, others may perceive these recommendations as "randomly drawn" from the pool of businesses and may choose to bypass them.

The relevance of recommendations is crucial to the user experience, which in turn affects the reliability and reputation of the service. It may also be more effective to inform the users as to the **"why"** - that is, why they are recommended such businesses.

Many reputable organizations known for their recommendation algorithms, such as Netflix, provide users with a short explanation for their recommendations, which can add credibility and interest users in trying the suggested videos (Figure 3). In light of this, a potential area of improvement for Yelp's recommender is to study each user's preferences precisely and to productionize a recommender that convinces users that these recommendations are indeed for them. That is, to convey the message that **Yelp knows the places the users themselves did not know they wanted to visit**.

**Figure 3. Netflix Recommendations**
<img src="https://miro.medium.com/max/1024/1*jlQxemlP9Yim_rWTqlCFDQ.png" width="600">

Based on the hypothesis that highly relevant recommendations accompanied by a description of the reasons will lead to a higher customer click-through rate (CTR), a new customer analysis project is proposed and demonstrated throughout the remainder of this report, which entails:

- an exploratory data analysis to determine whether user profile information and business attributes are related to the user preferences or business rating,   
- developing the prototype of a new customized recommender with high performance metrics (i.e., accuracy & relevance), and
- translating the algorithms behind the recommender to brief descriptions that are easily understood by end users and provide credibility of the recommendations.

The recommendation algorithm is to be such that, given a user id unique to each user, ten recommendations are made along with the reason(s) for such recommendations.

#### Challenges

The current rating system poses multiple challenges for the execution of the project. Firstly, the star rating is discrete, which makes it difficult to rank businesses in order of satisfaction. Also, every user has a different perception of how to rate businesses. To address these two problems, sentiment is analyzed from the users' reviews and combined with the star rating to create a new rating on a continuous scale from 0 to 5.
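One way the star rating and the review sentiment could be blended is sketched below. The exact formula is not stated here, so the function name `blended_rating`, the weight `star_weight`, and the assumption that sentiment arrives as a polarity score in [-1, 1] are all illustrative.

```python
def blended_rating(stars: float, polarity: float, star_weight: float = 0.5) -> float:
    """Blend a 1-5 star rating with a sentiment polarity in [-1, 1].

    Hypothetical formula: the polarity is rescaled to 0-5 and averaged
    with the star rating using `star_weight`. The result lies in [0, 5].
    """
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if not -1 <= polarity <= 1:
        raise ValueError("polarity must be between -1 and 1")
    if not 0 <= star_weight <= 1:
        raise ValueError("star_weight must be between 0 and 1")
    sentiment_0_to_5 = (polarity + 1) / 2 * 5
    return star_weight * stars + (1 - star_weight) * sentiment_0_to_5


# Two 4-star reviews with different tones receive different scores,
# which makes it possible to rank them against each other.
print(blended_rating(4, 0.8))   # enthusiastic review
print(blended_rating(4, -0.2))  # lukewarm review
```

With a blend like this, two users who both click "4 stars" can still be told apart by how positively they wrote about the business.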

#### Values

Upon successful deployment of the new recommender, which makes relevant recommendations customized for each individual, the user experience and public reputation of Yelp are expected to improve, both of which can lead to higher user retention and lower churn.

#### Scope

While the prototype is created such that the performance metrics are maximized as allowed by the natural and computational limitations, it is to be noted that this project is proposed without the knowledge of the accuracy/relevance metrics of any existing recommendation algorithm in place.

Also, it is important to validate the effectiveness of the proposed change before deployment. Therefore, in this project, a subset of the data, namely the data for the municipality of Santa Clara, CA, is used to create the database behind the algorithms. Conducting an A/B test between this subset and another mutually exclusive subset is strongly recommended.

### 2. Data Preprocessing

The original dataset was obtained from [Yelp's dataset download page](https://www.yelp.com/dataset/download), where it is made available in JSON format for academic, educational, and personal purposes. The data contains information about businesses, users, check-ins, and user reviews, which was leveraged to create the dataset on which the exploratory data analysis was conducted and the recommendation models were built.

#### Data Dictionary

The `business` dataset contains the following fields:

- business_id : unique Business ID
- name : name of business
- address : business address
- city : city in which business is located
- state : state in which business is located
- postal_code : postal code of business
- latitude : latitude of business
- longitude : longitude of business
- stars : average star rating of the business
- review_count : number of reviews business received
- is_open : whether business is currently open
- attributes : business attributes
- categories : business categories
- hours : business hours

The `checkin` dataset contains the following fields:

- business_id : Unique Business ID
- date : dates of all check-ins at the business on social media

The `user` dataset contains the following fields:

- user_id : unique User ID
- name : name of user
- review_count : number of reviews written by user
- yelping_since : when the user joined Yelp, formatted like YYYY-MM-DD
- useful : number of useful votes sent by the user
- funny : number of funny votes sent by the user
- cool : number of cool votes sent by the user
- elite : the years the user was elite
- friends : an array of the user's friends as user_ids
- fans : number of fans the user has
- average_stars : average rating of all reviews
- compliment_hot : number of hot compliments received by the user
- compliment_more : number of more compliments received by the user
- compliment_profile : number of profile compliments received by the user
- compliment_cute : number of cute compliments received by the user
- compliment_list : number of list compliments received by the user
- compliment_note : number of note compliments received by the user
- compliment_plain : number of plain compliments received by the user
- compliment_cool : number of cool compliments received by the user
- compliment_funny : number of funny compliments received by the user
- compliment_writer : number of writer compliments received by the user
- compliment_photos : number of photo compliments received by the user

The `review` dataset contains the following fields:

- review_id : unique review id
- user_id : unique user id
- business_id : unique business id
- stars : star rating
- useful : number of useful votes received
- funny : number of funny votes received
- cool : number of cool votes received
- text : the review itself
- date : date formatted YYYY-MM-DD

The overall data preprocessing was executed as illustrated in Figure 4.

1. The `attributes`, `categories`, and `hours` fields of the `business` table are stored as dictionary-like strings with somewhat inconsistent formatting. They are expanded such that each attribute, category, and opening hour is a separate column.

2. Key features in the `business` table are joined with the `review` table on the `business_id` field. 

3. Key features in the `user` table are joined with the `review` table on the `user_id` field.

4. The discrete 5-star rating system is combined with the sentiment score of the review (i.e. the `text` field of the `review` table) to be a continuous score.

5. Unnecessary columns, namely `review_id` and `text` of the `review` table, are dropped. For the regression algorithm using a neural network, `user_id` and `business_id` are additionally dropped.
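The joins in steps 2 and 3 and the column drop in step 5 could be sketched with pandas roughly as follows. The toy frames, the chosen feature columns, and the `business_`/`user_` prefixes are assumptions made for illustration; only the table and key names come from the data dictionary.

```python
import pandas as pd

# Toy stand-ins for the Yelp tables; the real tables have many more columns.
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "stars": [4.5, 3.0],
    "review_count": [120, 15],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "average_stars": [3.8, 4.6],
    "fans": [2, 0],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 4, 2],
    "text": ["Great!", "Good.", "Meh."],
})

# `stars` appears in more than one table, so prefix the business and
# user columns before joining to keep them apart (keys stay unprefixed).
business_feats = business.add_prefix("business_").rename(
    columns={"business_business_id": "business_id"}
)
user_feats = user.add_prefix("user_").rename(columns={"user_user_id": "user_id"})

# Left joins keep every review even if a lookup row is missing.
combined = (
    review
    .merge(business_feats, on="business_id", how="left")
    .merge(user_feats, on="user_id", how="left")
)

# Step 5: drop the columns that are no longer needed for modelling.
model_table = combined.drop(columns=["review_id", "text"])
print(model_table)
```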

In addition to the steps above, the `date` field of the `checkin` table is feature engineered such that the total number of check-ins is calculated from the individual check-in dates for each business. Then, the check-in counts are added as a new field to the `review` table.
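A minimal pandas sketch of this check-in feature is shown below. It assumes the `date` field holds one comma-separated string of timestamps per business; the new column name `checkin_count` and the choice to fill missing businesses with 0 are illustrative.

```python
import pandas as pd

checkin = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": [
        "2016-04-26 19:49:16, 2016-08-30 18:36:57, 2017-01-02 12:00:00",
        "2018-05-01 09:15:00",
    ],
})
review = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b2", "b3"],
    "stars": [5, 3, 4],
})

# Count the timestamps in each comma-separated string.
checkin["checkin_count"] = checkin["date"].str.split(",").str.len()

# Attach the counts to the reviews; a business with no check-in record
# (here "b3") is assumed to have zero check-ins.
review = review.merge(
    checkin[["business_id", "checkin_count"]], on="business_id", how="left"
)
review["checkin_count"] = review["checkin_count"].fillna(0).astype(int)
print(review)
```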

Furthermore, the individual fields in the table are converted to the appropriate types. Also, the table is checked for duplicated rows, and null values (i.e. NaN) are imputed or dropped as appropriate.

Finally, feature selection is conducted. For the business attributes and categories, those containing more than 75% of unknown values are not considered indicative of the target variable (i.e., the business rating) and are dropped from the dataset.
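The 75% rule could be applied along these lines. This is a sketch that assumes unknown values are encoded as NaN/None; the helper name `drop_sparse_columns` and the sample attribute columns are hypothetical.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.75) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `max_missing`."""
    missing_share = df.isna().mean()  # fraction of NaN/None per column
    return df.loc[:, missing_share <= max_missing]


features = pd.DataFrame({
    "BikeParking": [True, None, True, False],     # 25% unknown -> kept
    "Smoking": [None, None, None, "outdoor"],     # exactly 75% unknown -> kept
    "DogsAllowed": [None, None, None, None],      # 100% unknown -> dropped
})

selected = drop_sparse_columns(features)
print(list(selected.columns))
```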

Note that steps 4. and 5. are demonstrated in the following section, which is 5.3 Exploratory Data Analysis & Preparation for Modelling.

**Figure 4. Data Preprocessing Overview**
<div>
<img src="data_cleaning_process.jpg" width="600">
</div>

The data saved as csv files are read-in as dataframes and cleaned. Also, the four datasets are combined into a single dataset.

### 3. Exploratory Data Analysis

#### Procedure

Following the majority of data preprocessing, an exploratory data analysis was performed to obtain key insights about the customer preferences and successful businesses, which formed the basis for the subsequent modelling process. In particular, the following questions were addressed:

- What are the general distributions of business and user features? Do most people tend to behave in a certain way? Do most businesses tend to have a certain characteristic? If so, what are those tendencies?


- What are the words most commonly found in the positive reviews? What are commonly found in the negative reviews?


- Is there any relationship between the time a review was written and the rating?


- Which features about a business or user are related to a high user rating? What are related to a low rating?


- Is there a relationship between the geographic location of a business and the user rating on the business?


- Are some of the independent features correlated with other features, creating redundancy in the independent variables (i.e., is there any multicollinearity)?

#### Insights

The insights drawn from the exploratory data analysis are summarized as below.

| Item | Problem | Findings / Insights |
| :--- | :--- | :--- |
| 1 | General Distribution | Users mostly give ratings of 2 to 4. Most businesses are still open, provide bike parking, accept credit cards, and do not have a parking garage or other parking options (street, valet, validated parking) but do have parking lots. Business average stars are mostly normally distributed. |
| 2 | Key Words in Positive/Negative Reviews | Quality of the food (for restaurants), customer service, communications (i.e. wrong orders, unpleasant dialogues, etc.) |
| 3 | Review Time vs Rating | Recent ratings tend to be around 4.1. Ratings are lower on Mondays and Sundays, and at 11 a.m. |
| 4 | Features Most Related to Rating | **Business reputation (i.e., average rating) & user average rating**. Aside from these, ratings are highly dependent on **personal preferences** rather than certain features/attributes. |
| 5 | Geographic Location vs Rating | Low ratings are mostly found in the urban, crowded areas. High ratings on the outskirts. |
| 6 | Multicollinear Features | Opening/closing hours among business days, types of compliments, `bars` and `night life` |

### 4. Modelling

From the exploratory data analysis (EDA), several insights were drawn, the most notable of which were:

- Average business/user ratings are the two best predictors of the individual ratings. In other words, the higher the reputation of a business, or the more generous a user is to rate, the higher the rating user would give a business. (This is rather unsurprising because an average business or user rating is the result of all the individual ratings for the business/user.)


- Apart from the two features above, no feature is a good predictor of the rating. Rather, the personal preference of each user plays a bigger role in deciding the final rating.


The above insights were used to form the basis of the recommendation algorithms. Several algorithms considered and the logic behind the algorithms are as follows: 

- **Content-based filtering** : Of all the businesses a user rated, select the user's favourite (i.e., most highly rated). If the favourite business is sufficiently highly rated, then recommend the n businesses most similar to the favourite (Figure 5). The content-based filtering approach attempts to answer the following question - ***What are the places most similar to your favourite?***

**Figure 5. Content-Based Filtering**

<div>
<img src="contentbased.jpg" width="500"/>
</div>
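The content-based logic described above can be sketched in a few lines. This is an illustration rather than the project's implementation: the function names, the binary feature vectors, and the `min_rating` threshold of 4 are assumptions, and cosine similarity is used as the similarity measure.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_similar(user_ratings, features, n=10, min_rating=4.0):
    """Return up to n unrated businesses most similar to the user's favourite.

    Returns an empty list when the favourite is rated below `min_rating`.
    """
    if not user_ratings:
        return []
    favourite = max(user_ratings, key=user_ratings.get)
    if user_ratings[favourite] < min_rating:
        return []
    candidates = [b for b in features if b not in user_ratings]
    candidates.sort(
        key=lambda b: cosine_similarity(features[favourite], features[b]),
        reverse=True,
    )
    return candidates[:n]


# Hypothetical binary business features (e.g. category/attribute flags).
features = {
    "sushi_a": [1, 0, 1, 0],
    "sushi_b": [1, 0, 1, 1],
    "burger_c": [0, 1, 0, 1],
    "ramen_d": [1, 0, 0, 0],
}
ratings = {"sushi_a": 5.0, "burger_c": 2.0}
print(recommend_similar(ratings, features, n=2))  # ['sushi_b', 'ramen_d']
```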

- **User-based Collaborative Filtering (using K Nearest Neighbor)** : Given a user U, all the businesses rated by the user, and a business (or item) I not yet rated by U, select all other users who have rated I and at least one of the businesses rated by U. Then, based on the ratings shared between U and the other users, select the *K* users most similar to user U. Subsequently, user U's likely rating on I is estimated as the **average** of the K users' ratings on business I. In this way, predict ratings for all items not yet reviewed by U, sort them in order, then make n recommendations (Figure 6). The user-based collaborative filtering approach attempts to answer the following question - ***What are the places people most similar to you enjoyed?***

**Figure 6. User-Based Collaborative Filtering**

<div>
<img src="userbased.jpg" width="500"/>
</div>
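A small sketch of the user-based prediction step follows. The similarity measure is not specified above, so the inverse mean squared difference used here (`msd_similarity`) is an assumption, as are the function names and the toy ratings.

```python
def msd_similarity(ratings_a, ratings_b):
    """Similarity from the mean squared difference over shared businesses.

    Returns None when the two users have no business in common.
    """
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return None
    msd = sum((ratings_a[b] - ratings_b[b]) ** 2 for b in shared) / len(shared)
    return 1 / (1 + msd)


def predict_rating(ratings, user, business, k=2):
    """Average the ratings on `business` from the k users most similar to `user`."""
    neighbours = []
    for other, other_ratings in ratings.items():
        # Only users who rated the target business can contribute.
        if other == user or business not in other_ratings:
            continue
        similarity = msd_similarity(ratings[user], other_ratings)
        if similarity is not None:
            neighbours.append((similarity, other))
    if not neighbours:
        return None
    top_k = sorted(neighbours, reverse=True)[:k]
    return sum(ratings[other][business] for _, other in top_k) / len(top_k)


ratings = {
    "U": {"cafe": 5, "diner": 2},
    "A": {"cafe": 5, "diner": 1, "bistro": 4},  # rates much like U
    "B": {"cafe": 3, "diner": 2, "bistro": 5},  # fairly close to U
    "C": {"cafe": 1, "diner": 5, "bistro": 1},  # opposite taste
    "D": {"bistro": 3},                         # nothing in common with U
}
print(predict_rating(ratings, "U", "bistro", k=2))  # 4.5
```

Repeating this prediction for every business U has not rated and sorting the results gives the ranked list from which the n recommendations are taken.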

- **Model-Based Collaborative Filtering (Singular Value Decomposition)** : Based on all the user-business ratings, train a machine to learn p **arbitrary** business characteristics to measure every business against (i.e., how much a given business corresponds to each of the p characteristics), and how much every user prefers those p characteristics. With this extensive information on business characteristics and user preferences, predict the pairwise rating between every user and business. Then, make n recommendations for a given user based on the predicted ratings. The model-based filtering approach attempts to answer the following question - ***What are your preferences for a business, and which businesses most closely meet those preferences?***

- **Regression (with Neural Network)** : Given all users, businesses, and the ratings between them, train a machine to learn a complex mathematical relationship between the users and businesses. Then, make n recommendations for a given user based on the predicted ratings. The regression approach attempts to answer the following question - ***Assuming there is a complex mathematical relationship between your behaviour (i.e. your Yelp profile info), business features, and your rating of the business, which businesses are you most likely to enjoy based on that relationship?***

Both the model-based collaborative filtering and regression approaches can be illustrated as shown in Figure 7.

**Figure 7. SVD Decomposition & NN Regression**

<div>
<img src="regression.jpg" width="800"/>
</div>
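The idea of learning p latent characteristics can be illustrated with a plain truncated SVD in NumPy. This is a simplified sketch, not necessarily how the project's model was fitted: recommender libraries often learn the factors by gradient descent on observed ratings only, whereas here the missing entries are filled with each user's mean rating (an assumption) before decomposing.

```python
import numpy as np

# Rows are users, columns are businesses; np.nan marks "not yet rated".
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 5.0, 1.0],
    [1.0, 2.0, 1.0, 5.0],
    [np.nan, 1.0, 2.0, 4.0],
])

# Fill the gaps with each user's mean rating, then centre every row on it.
user_means = np.nanmean(ratings, axis=1, keepdims=True)
centred = np.where(np.isnan(ratings), user_means, ratings) - user_means

# Decompose and keep only the p strongest latent characteristics.
p = 2
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
predicted = U[:, :p] @ np.diag(s[:p]) @ Vt[:p, :] + user_means
predicted = np.clip(predicted, 0, 5)

# Rank the businesses that user 0 has not rated by predicted rating.
unrated = np.flatnonzero(np.isnan(ratings[0]))
ranking = unrated[np.argsort(-predicted[0, unrated])]
print(ranking, predicted[0, unrated].round(2))
```

Each row of `U[:, :p]` scaled by `s[:p]` plays the role of a user's preference for the p characteristics, and each column of `Vt[:p, :]` describes how strongly a business expresses them.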

#### Performance Metrics

To evaluate the performances of these algorithms, or to compare algorithms to one another, it was important to decide the performance metrics. In this process, it was determined that two considerations were necessary : accuracy and relevance.

- Accuracy : how accurately did the model predict the businesses a given user would enjoy?
- Relevance : how similar are the recommended businesses to the user's favourite businesses?

For accuracy, **Root Mean Square Error (RMSE)** and **Mean Absolute Error (MAE)** were used, which measure the deviation of predictions from the actual ratings in the unit of rating.

For relevance, especially important for the content-based algorithm, the **cosine similarity** between the recommended businesses and a given user's favourite business(es) was reviewed.
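For reference, the two accuracy metrics can be computed as in the short sketch below. The sample ratings are made up purely to show the arithmetic.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error: penalizes large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]          # errors: 1, 0, -1, -2
print(round(rmse(actual, predicted), 2))  # 1.22
print(mae(actual, predicted))             # 1.0
```

Because RMSE squares each error before averaging, it is always at least as large as MAE, and the gap widens when a few predictions are far off.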

### 5. Insights and Final Model

Based on the review of each recommendation algorithm, insights on each model's accuracy and relevance were drawn. In particular, a **T-SNE plot**, which is a 2D representation of every business based on its business features, was used to see how close the recommended businesses were to the other businesses a given user has tried, including their favourite.

The biggest red translucent dot represents a user's most highly rated business, orange represents the other businesses already visited by the user, blue not yet visited, and green recommended to the user.

#### Most Relevant & Customized Recommender - Content-Based Filtering

As can be seen in the T-SNE plot (Figure 8), all businesses recommended by the content-based filtering algorithm were located close to the user's favourite.

The average cosine distance of the recommended businesses to 1000 randomly drawn users' 50 most liked businesses was calculated to be **0.19**, which was close to zero and indicated that the recommendations were indeed similar to the users' favourites. 

The main difference between content-based filtering and the other methods was that the content-based model only selected businesses based on similarity and did not predict ratings. As such, it was impossible to calculate the RMSE and MAE accuracy metrics. That said, the model was found to be effective in selecting businesses similar to one's favourite, thereby making the recommendations truly customized.

While this would be helpful in recommending based on the users' favourites, a few caveats were identified and needed to be addressed:
- If a given user's highest rating is low (i.e., the user has not been satisfied with any businesses they have tried), selecting businesses most similar to such business would not be helpful, and


- if a given user wanted to try something new, recommending businesses similar to the ones already tried would not be helpful.

**Figure 8. T-SNE Plot for Content-Based Filtering**<br><br>
<div>
<img src="tsne content based.jpg" width="500"/>
</div>

#### Most Accurate Recommender - Neural Network Regression

The other three algorithms, namely user-based collaborative filtering, model-based filtering, and regression, were compared based on their accuracy metrics measured on the test set.

|Model | RMSE | MAE |
| --- | --- | --- |
|**User-Based**| 1.21 | 0.93 |
|**SVD**| 1.08 | 0.84 |
|**Regression**| 0.96 | 0.68 |

In terms of model accuracy, the regression model using a neural network was found to perform best at predicting ratings.

However, reviewing the T-SNE plot for regression (Figure 9), it was found that this algorithm did not focus much on the user's preferences compared to content-based filtering.

**Figure 9. T-SNE Plot for Regression using Neural Network**<br><br>
<div>
<img src="tsne nn.jpg" width="500"/>
</div>

Most of these recommended businesses had very high ratings on Yelp, and it was found that Neural Network tended to recommend businesses that were popular. This is consistent with the insight drawn from the exploratory data analysis, where a rating is likely to be high if a business' average rating is high.

#### Accuracy vs Relevance - Need for Combination

It has been seen that the most accurate recommendations (i.e., NN regression) cannot be unique to each user, while the most relevant recommendations (i.e., content-based) do not always guarantee that the person will like the business (Figure 10). To mitigate the downside of each model and maximize benefit, a **hybrid** of these two models is created.

**Figure 10. Accuracy vs Relevance**
<div>
<img src="scale.jpg" width="800"/>
</div>

The hybrid algorithm was created considering three cases:

1. A new user without prior record of visiting businesses on Yelp,
2. A user not satisfied with any businesses visited (i.e. most highly rated business of less than 4), and
3. A user satisfied with at least one business they have tried (i.e. most highly rated business of 4 or greater).

#### Case 1. New User

New users are simply recommended the businesses on Yelp that were most highly rated (i.e., highest average business rating). These recommendations are made with a message ***Welcome to Yelp, try these most popular places on Yelp!***.

#### Case 2. User Not Satisfied with any Business on Yelp

For users not satisfied with any businesses on Yelp, it was decided that there was no need for trying to select businesses similar to the user's most highly rated. Therefore in this case, all ten recommendations were made from Neural Network regression with a message saying ***You can't go wrong with these places. Why don't you give them a try!***.

Not only do these recommendations have the highest probability of satisfaction while considering a given user's preferences to some degree, but the message saying that they "can't go wrong" may also give them the comforting impression that they are still valued customers after the unsatisfactory experiences of their previous visits.

Figure 11 illustrates this case.

**Figure 11. Case when User Not Satisfied with any Business on Yelp**
<div>
<img src="unhappy_user.jpg" width="400"/>
</div>

#### Case 3. User Satisfied with at Least One Business

The hybrid model is deployed when a user was satisfied with at least one of the businesses they have tried. In this case, the 5 most highly ranked businesses appearing in both the content-based and regression results are selected, followed by 3 other businesses based solely on content-based filtering in case the user is more interested in visiting places similar to the one they liked the most regardless of reputation, followed by 2 more businesses based solely on regression to give them the opportunity to try other places. These recommendations are made with a message ***Because you liked*** *(your favourite place),* ***these popular places similar to*** *(your favourite place)* ***are recommended***.

Figure 12 illustrates this case.

**Figure 12. Case when User Satisfied with at Least One Business on Yelp**
<div>
<img src="happy_user.jpg" width="400"/>
</div>
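The three cases can be tied together as in the sketch below. The function name, its inputs (best-first shortlists from each model plus a popularity ranking), and the reading of "in both" as the intersection of the two shortlists are assumptions made for illustration.

```python
def hybrid_recommend(user_ratings, content_ranked, regression_ranked,
                     most_popular, n=10, min_favourite=4.0):
    """Return (recommendations, message) for the three user cases.

    content_ranked / regression_ranked: hypothetical best-first shortlists
    of unvisited business ids from each model.
    most_popular: business ids ordered by average business rating.
    """
    # Case 1: new user with no rating history.
    if not user_ratings:
        return (most_popular[:n],
                "Welcome to Yelp, try these most popular places on Yelp!")

    # Case 2: the user's best rating is below the satisfaction threshold.
    if max(user_ratings.values()) < min_favourite:
        return (regression_ranked[:n],
                "You can't go wrong with these places. "
                "Why don't you give them a try!")

    # Case 3: 5 picks found in both shortlists, then 3 content-based,
    # then 2 regression-based, skipping anything already picked.
    picks = []

    def take(candidates, count):
        added = 0
        for business in candidates:
            if added == count:
                break
            if business not in picks:
                picks.append(business)
                added += 1

    in_regression = set(regression_ranked)
    take([b for b in content_ranked if b in in_regression], 5)
    take(content_ranked, 3)
    take(regression_ranked, 2)

    favourite = max(user_ratings, key=user_ratings.get)
    message = (f"Because you liked {favourite}, these popular places "
               f"similar to {favourite} are recommended")
    return picks, message


content = ["c1", "x1", "c2", "x2", "x3", "c3", "x4", "x5", "c4", "c5"]
regression = ["x3", "r1", "x1", "x5", "r2", "x2", "x4", "r3"]
popular = ["p1", "p2", "p3"]

recs, msg = hybrid_recommend({"fav_cafe": 5.0}, content, regression, popular)
print(recs)
print(msg)
```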

As can be seen in the T-SNE representation of this third case (Figure 13), the hybrid recommender provides recommendations that are mostly similar to the user's favourite, while providing a few other recommendations for a variety of choice.

**Figure 13. T-SNE Plot for Hybrid Recommender**
<div>
<img src="tsne hybrid.jpg" width="500"/>
</div>

### 6. Summary

To improve the current recommendation system of Yelp, an online platform for sharing experiences with local businesses, a new recommender is developed with two purposes. First, to create a customized recommender unique to each user that makes predictions that are accurate and relevant to the user's past experiences, and second, to provide the users with the reasons behind recommendations to build credibility.

Based on the analysis of user and business features, it was found that the rating was highly dependent on each user's preferences, and that the only other good predictors of ratings were the business/user average ratings. In other words, the general popularity and reputation of a business were the best way to ensure that a given user would like the business.

With this insight in mind, several recommendation algorithms were considered, of which content-based filtering and regression using neural network were found to be most effective. While content-based filtering focused on finding businesses similar to the users' favourites, the regression algorithm was found to recommend businesses that were generally popular. The content-based model scored a cosine distance of 0.19 while regression scored a root mean square error and mean absolute error of 0.96 and 0.68 respectively.

To utilize the strength of each algorithm, a hybrid recommender combining content-based filtering and regression was created such that each user was recommended either businesses similar to their favourites, or generally popular businesses, based on their user activity on Yelp.

In particular, the recommender was programmed such that the reason for the recommendations was provided for a higher credibility in the recommendations.

Implementation of this new recommender is expected to improve user experience and public reputation of Yelp thereby retaining current users while lowering churn rate.

As for future action items, it is recommended to run an A/B test of this new recommender, which was built on the dataset for Santa Clara, CA, against a control group to determine the statistical significance of its effectiveness. Upon validating the improvement, it is recommended to productionize the recommender such that scheduled updates are automated an