- Project Introduction
- Technologies Used
- Methods Used
- Project Description
- Feature Notebooks and Deliverables
- Most Important Findings
- 1. Which categories of businesses are getting top reviews?
- 2. How often do businesses get reviewed over time?
- 3. Which business categories get bad reviews?
- 4. Accuracy results of Sentiment Analysis using VADER
- 5. Predicting the sentiment using different classifiers
- 6. Conclusion and Future Recommendations
- Acknowledgments
- Licences
- Contact
The aim of the project is to predict the sentiment of a Yelp review, and make actionable recommendations to businesses that will help them understand customer needs, and monitor customer feedback.
- Data Processing / Data Cleaning
- Data Analysis
- Descriptive Statistics
- Feature Engineering
- Data Visualization
- Text Preprocessing
- Sentiment Analysis
- Predictive Modeling and Hyperparameter Tuning
- Evaluating Model Results
- Reporting
The Yelp dataset is a collection of businesses, reviews, and user data, intended for learning purposes, published by Yelp. It contains over 8 million reviews for 200 thousand businesses in 10 metropolitan areas of the US. The dataset that will be used for the purposes of sentiment analysis and prediction will contain businesses from the Phoenix, AZ metropolitan area.
I will explore the dataset to gain valuable insights into businesses operating in this area, then perform sentiment analysis of the reviews and make actionable recommendations to businesses that will help them understand customer needs and monitor customer feedback. The last step is predicting the sentiment of a Yelp review.
A very important step is asking the questions we need to answer. As part of my mentorship I was tasked with the following:
- Gain different insights by exploring the dataset
- What businesses are getting top reviews?
- Which categories of businesses are getting top reviews?
- How often do businesses get reviewed over time?
- How do the categories of trending and top reviewed businesses differ?
- Which business categories get bad reviews?
- What are the most common words in bad reviews?
- Build a machine learning model to predict the sentiment of Yelp reviews.
- Predict or recommend something else.
The dataset used for this project can be found on the data.world website: yelp_reviews.csv
The complete Yelp Dataset can be found on their website: Yelp Dataset
- Data - folder containing processed data
- Images - folder containing assets such as images
- 1. Data Preprocessing and Basic EDA - Notebook which contains the process of Basic Data Exploration and Preprocessing
- 2. Business Case Data Analysis - Notebook which contains the analysis of relevant questions to the business case
- 3. Sentiment Analysis - Notebook which contains the Sentiment Analysis of the Yelp reviews
- 4. Modeling and Evaluation - Notebook dedicated to Text preprocessing, Predictive modeling, and Model evaluation
- Classification.py - Python script which performs model fitting using said classifiers, as well as model evaluation
- Blog post on Yelp Reviews Sentiment Analysis: project | Yelp Reviews Sentiment Analysis
1. Data Preprocessing and Basic EDA
1. Imports
2. Data
2.1 Business Dataset
2.2 Review Dataset
2.3 User Dataset
3. Early EDA and Data Cleaning
3.1 Missing values
3.2 Duplicate rows
3.3 Removing unnecessary features
4. Saving data for the next stage
2. Business Case Data Analysis
1. Imports
2. Data
3. Business Case Data Analysis
3.1 What businesses are getting top reviews?
3.2 Which categories of businesses are getting top reviews?
3.3 How often do businesses get reviewed over time?
3.4 How do the categories of trending and top reviewed businesses differ?
3.5 Which business categories get bad reviews?
3.6 What are the most common words in bad reviews?
3. Sentiment Analysis
1. Imports
2. Data
3. Sentiment Analysis
3.1 Testing VADER with a random review
3.2 Computing polarity scores
3.3 Comparison Analysis of the compound score and the original label
4. Modeling and Evaluation
1. Imports
2. Data
3. Preparing Text
3.1 Removing Missing values
3.2 Creating three categories of labels from ratings
3.3 Train/Test Split
3.4 Vectorizing the text
4. Classification
4.1 Further splitting data into a train and validation set
4.2 Logistic Regression
4.3 Multinomial Naive Bayes
4.4 Random Forest
4.5 Decision Tree
4.6 K Neighbors
4.7 AdaBoost
4.8 XGBoost
5. Evaluation
5.1 Comparing scores from all models
5.2 Fitting the best model with test data
5.3 Additional model metrics and tuning
Link to the presentation: Yelp Sentiment Analysis Presentation.pdf
The top 10 categories of most reviewed businesses are as follows:
Almost a third of all categories among the top 30 reviewed businesses belong to Restaurants, followed by American (New), Bars, and Nightlife. Restaurants are clearly the dominant category among top-rated businesses, as well as across all categories in the dataset. Across all categories, however, Restaurants have a slightly smaller share and are followed by Shopping and Food. This leaves room for an assumption that people tend to review eating out more than other consumer experiences. However, Restaurants and similar categories have a dominant share on Yelp overall, so this assumption might not hold entirely.
Once I analyzed the dataset, I noticed that the year 2013 only has reviews for the first 5 days of the year, so I decided not to take it into consideration. The year 2005 contains reviews from March 2005, and will be excluded as well. Let's see the trend of reviews per year:
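The filtering and yearly count described here can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the column names `business_id` and `date` and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook works on the Phoenix review data.
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "b", "c", "c"],
    "date": ["2005-03-15", "2010-06-01", "2011-07-04",
             "2012-01-20", "2012-11-30", "2013-01-03"],
})
reviews["date"] = pd.to_datetime(reviews["date"])
reviews["year"] = reviews["date"].dt.year

# Drop the two partial years (2005 and 2013) before looking at the trend.
full_years = reviews[~reviews["year"].isin([2005, 2013])]

# Number of reviews per remaining year, indexed by year.
reviews_per_year = full_years.groupby("year").size()
```

Grouping on `["business_id", "year"]` instead would give the per-business trend used for the individual establishments.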
Let us see the frequency of reviews over time for a couple of the businesses with the most reviews:
Highly reviewed businesses such as the Phoenix Sky Airport and Pizzeria Bianco show a positive trend over the years. The two randomly selected businesses from the list, Joe's Farm Grill and Postino Arcadia, show a slightly different story: the number of reviews grew steadily through 2010 and 2011, and then dropped in 2012 for both establishments.
This steady growth in reviews can potentially show us that these businesses value their customers' feedback, and are creating, as well as actively pursuing, a good business environment.
I defined a bad review as a review with a rating of 1. These are the top 7 categories of businesses that contain the highest number of bad reviews:
When looking at all categories of businesses that get less than 2 stars per review, the highest number of these reviews, almost 50%, goes to Restaurants, followed by Shopping, and Food.
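One way to count bad reviews per category is sketched below. It assumes each review row carries a `stars` column and a comma-separated `categories` string; both the column names and the sample rows are illustrative assumptions, not the project's actual schema.

```python
import pandas as pd

# Hypothetical sample; a review can belong to several categories at once.
reviews = pd.DataFrame({
    "stars": [1, 1, 5, 1, 2, 1],
    "categories": [
        "Restaurants, Food",
        "Restaurants, Bars",
        "Shopping",
        "Shopping",
        "Restaurants",
        "Automotive",
    ],
})

# A bad review is a review with a rating of 1.
bad = reviews[reviews["stars"] == 1]

# Split the category string, give each category its own row, then count.
category_counts = (
    bad["categories"]
    .str.split(", ")
    .explode()
    .value_counts()
)

# Share of bad reviews that mention each category.
category_share = category_counts / len(bad)
```

Because one review can list several categories, the shares computed this way can add up to more than 100%.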
Now let's see which businesses have the highest number of bad reviews:
Looking at the categories of the top 30 businesses with an aggregated star rating of less than 2, there is no difference in categories, except for the presence of the Automotive category. The highest number of bad reviews goes to US Airways with 95 1-star reviews, whereas the mean number of 1-star reviews is 43. Again, this indicates that Restaurants is the dominant and most often reviewed category in the dataset.
The goal of our Sentiment Analysis of Yelp reviews is to determine whether a review is positive, negative, or neutral. NLTK VADER proved to be fairly good at this: I achieved an accuracy score of 0.71.
This score was calculated by comparing the true label of each review (positive/negative/neutral) with a predicted label, also called the compound label, which is derived from the compound score in the VADER polarity scores. I treated reviews rated 4 or 5 as positive, reviews rated 3 as neutral, and reviews rated 1 or 2 as negative. I have also based my compound label as negative being less than 0.5, neutral if the score is less than 0.5, and positive for all other compound scores. The confusion matrix I got from the analysis looks like this:
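The labelling and comparison behind that matrix can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: the cutoffs are passed in as parameters, and the values used here (-0.5 and 0.5) along with the sample pairs are assumptions chosen only to make the example run.

```python
from collections import Counter


def label_from_rating(stars):
    """Map a star rating to the three-class true label."""
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"


def label_from_compound(compound, neg_cutoff, pos_cutoff):
    """Map a VADER compound score (range -1 to 1) to a compound label.

    Both cutoffs are supplied by the caller; they are assumptions here.
    """
    if compound < neg_cutoff:
        return "negative"
    if compound < pos_cutoff:
        return "neutral"
    return "positive"


def compound_score(text):
    """Return VADER's compound score (needs nltk and its vader_lexicon data)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


# Hypothetical (stars, compound) pairs standing in for already scored reviews.
scored = [(5, 0.92), (4, 0.10), (3, 0.00), (1, -0.80), (2, 0.60), (5, 0.75)]
NEG_CUTOFF, POS_CUTOFF = -0.5, 0.5  # illustrative cutoffs only

true_labels = [label_from_rating(s) for s, _ in scored]
pred_labels = [label_from_compound(c, NEG_CUTOFF, POS_CUTOFF) for _, c in scored]

# Confusion counts keyed by (true label, predicted label), plus accuracy.
confusion = Counter(zip(true_labels, pred_labels))
accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(scored)
```

Keeping the cutoffs as parameters makes it easy to check how sensitive the accuracy is to where the neutral band starts and ends.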
Here is the Classification report that helps us clarify the multi-class matrix we are seeing:
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| negative | 0.68 | 0.39 | 0.49 | 38245 |
| neutral | 0.23 | 0.10 | 0.14 | 35268 |
| positive | 0.76 | 0.93 | 0.84 | 155617 |
| accuracy | | | 0.71 | 229130 |
| macro avg | 0.55 | 0.47 | 0.49 | 229130 |
| weighted avg | 0.66 | 0.71 | 0.67 | 229130 |
It is clear that the "positive" class outnumbers the "negative" and "neutral" classes, which explains its fairly good precision, recall, and f1-score. Keeping in mind that the neutral class was formed from only one rating (3), its prediction result is understandable. I find the precision and overall f1-score for negative reviews to be satisfying.
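To see how much the class imbalance drives the summary rows, the averages can be recomputed from the per-class figures in the report. The sketch below uses only the rounded numbers shown in the table, so recomputed values can differ from the reported ones in the last digit.

```python
# Per-class (precision, recall, f1, support) copied from the report above.
report = {
    "negative": (0.68, 0.39, 0.49, 38245),
    "neutral": (0.23, 0.10, 0.14, 35268),
    "positive": (0.76, 0.93, 0.84, 155617),
}

total = sum(row[3] for row in report.values())


def macro_avg(col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col):
    """Mean weighted by support: large classes dominate."""
    return sum(row[col] * row[3] for row in report.values()) / total


macro_recall = macro_avg(1)
weighted_recall = weighted_avg(1)  # support-weighted recall equals accuracy
positive_share = report["positive"][3] / total
```

The positive class makes up roughly two thirds of the reviews, which is why the weighted recall (about 0.71) sits well above the macro recall (about 0.47).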
As part of the project, I performed prediction and evaluation of the reviews using the following classifiers: Logistic Regression, Multinomial Naive Bayes, Decision Tree, Random Forest, KNN, AdaBoost, and XGBoost. I chose these algorithms to cover a set of different classifiers, from simple to complex, and then evaluate the performance of each one. Prior to fitting the models, I vectorized the data using TfidfVectorizer and used StandardScaler as my feature scaling tool. Here are the results of said classifiers:
XGBoost and Logistic Regression displayed the best results in the predictive modeling, so I would use these two algorithms when enhancing the results with hyperparameter tuning.
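A possible shape for the vectorize, scale, and classify steps with a tuning pass is sketched below using scikit-learn. It is an illustration under stated assumptions, not the project's Classification.py: the toy texts, the parameter grid, and the choice of Logistic Regression as the tuned model are all made up for the example. `StandardScaler` is created with `with_mean=False` because TF-IDF output is a sparse matrix, which cannot be centered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy reviews; the real project uses the Yelp review text.
texts = [
    "great food and friendly staff", "loved the pizza, will return",
    "amazing service and great coffee", "best tacos in town, loved it",
    "it was okay, nothing special", "average meal, average price",
    "okay service, average food", "nothing special but okay",
    "terrible service and cold food", "rude staff, awful experience",
    "awful pizza, terrible wait", "cold coffee and rude service",
]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Sparse input: scale by variance only, without centering.
    ("scale", StandardScaler(with_mean=False)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over the regularization strength C.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_c = search.best_params_["clf__C"]
predictions = search.predict(["great pizza and friendly service"])
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the validation fold never leaks vocabulary or document frequencies into training.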
The Yelp Dataset is rich in potential to help businesses understand customers and their needs. As such, I would definitely continue working on getting better results with sentiment prediction. The prediction results in this project have shown there is room to enhance the performance of the algorithms used, or to use a Neural Network and compare results. As far as the exploration of the dataset goes, I can conclude that customers are more than willing to rate and review their experience with an establishment, especially restaurants and shopping establishments. The trend of sharing experiences is on the rise, and businesses can benefit greatly from such detailed analysis of the reviews and ratings their customers leave.
Thanks so much to awesomeahi95 and their Classification.py script. Based on their work, I was able to learn a great deal about creating a usable and bug-free script that can be reused for every classification problem.
Database Contents License (DbCL) v1.0
Find me on LinkedIn, Twitter or adzictanja.com.