Predict Ratings for Chinese Restaurants using Sentiment Analysis

[image credit]: https://leitesculinaria.com/103294/writings-how-to-pair-wine-and-chinese-food.html

1. Read Yelp Data 🍜

JSON file reading script:
read_yelp_data_business.R, read_yelp_data_checkin.R, read_yelp_data_photo.R, read_yelp_data_tip.R, read_yelp_data_user.R
[JSON file reading coding credit]: https://github.com/dpliublog/yelp_data_challenge_R10
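
For a feel of what the reading scripts have to do, here is a minimal sketch in Python (the project's own scripts are in R). It assumes each Yelp file stores one JSON object per line; the field names and sample records are made up for illustration.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per non-empty line and return a list of dicts."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Tiny in-memory stand-in for a business file (hypothetical records).
sample = io.StringIO(
    '{"business_id": "b1", "name": "Golden Wok", "stars": 4.0}\n'
    '{"business_id": "b2", "name": "Dim Sum House", "stars": 3.5}\n'
)
businesses = read_json_lines(sample)
print(len(businesses), businesses[0]["name"])
```

With a real file, the same function could be called on `open(path, encoding="utf-8")` instead of the in-memory sample.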

SQL file reading script:
yelp_review_Chinese.R

2. Load Data 🍚

[data files]: https://drive.google.com/drive/folders/1_4749ED32FJuprWEspjkIkFQJDiH1yAX?usp=sharing

3. Data Manipulation 🍲

R script:
yelp.analysis.R

4. Restaurants EDA 📈

How are Chinese restaurants rated?
Just out of curiosity, where are these restaurants located? Plotly link below 👇

[Restaurants Location Interactive Map]: https://plot.ly/~angelayuanyuan/1/

5. Text Analysis 📚

What words are frequently used when reviewing a Chinese Restaurant?

CHICKEN!!🍗🍗🍗 Of course.... What else would we expect😂😂😂😂

What else?

Looks like dim sum and fried rice are popular dishes
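
The word counts behind a word cloud can be sketched as follows. This is an illustrative Python version, not the R code in yelp.analysis.R; the stop-word list and the two reviews are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "and", "is", "was", "of", "to", "it", "i", "we", "but"}


def word_counts(reviews):
    """Count lowercase word tokens across all reviews, ignoring stop words."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


reviews = [
    "The orange chicken was great and the fried rice was great",
    "We loved the dim sum, but the chicken was dry",
]
counts = word_counts(reviews)
print(counts.most_common(3))
```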

6. Sentiment Analysis 😋😃😄😑😞

Is a restaurant's rating related to the average sentiment score of a person's review?

YASSSSSSS👏👏👏

How are sentiment words related to ratings?

Okay, we see something weird here: some words with negative sentiment scores actually go with pretty decent ratings.

How did that happen?

Negative words like "die" and "disappoint" don't necessarily mean dissatisfaction. People say things like "the food is to die for!!" and "the food really doesn't disappoint us..."

Are we putting sentiment words into context? Not yet!

As mentioned above, sentiment words might follow, or be followed by, words that completely change their meaning.

So let's take a look at the contexts

We can see from the graph that although some bigrams contain sentiment words (gluten free, egg drop, etc.), they are not used to describe anything related to restaurant quality.
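
Looking at sentiment words in context amounts to counting the bigrams that contain them. Here is a minimal Python sketch (the project itself uses R); the mini-lexicon and the reviews are hypothetical and only serve to show the idea.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; a real analysis would use a full sentiment lexicon.
SENTIMENT_WORDS = {"free", "drop", "die", "disappoint", "bad", "pretty"}


def sentiment_bigrams(reviews):
    """Count adjacent word pairs in which at least one word is a sentiment word."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        for first, second in zip(tokens, tokens[1:]):
            if first in SENTIMENT_WORDS or second in SENTIMENT_WORDS:
                counts[(first, second)] += 1
    return counts


reviews = [
    "They have gluten free options",
    "The egg drop soup is to die for",
    "Gluten free noodles",
]
bigrams = sentiment_bigrams(reviews)
print(bigrams.most_common(3))
```

Bigrams such as ("gluten", "free") show up because of the lexicon hit on "free", even though they say nothing about quality.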

Next, we take a step further and explore positive and negative sentiment words in their contexts separately.

Positive sentiments first 👍👍👍

It seems that we interpret most of the positive sentiment words correctly. However, bigrams like "overly sweet" and "pretty bad" do not actually express positive sentiment.

How about negative sentiments 👎👎👎?

What's wrong with hard boiled, earl grey, jerk chicken and so on??
They are food names, but unfortunately they contain negative sentiment words!! 💔

Keep in mind that these situations will cause inaccuracy when we try to predict ratings using sentiment.

Can we predict ratings using sentiment score?

Seems promising 😎

7. Users info EDA 📊

Before going into prediction, let's take a step back and look at what the users' rating data looks like.

How many reviews do users usually write?

Wow... we can't tell anything from it.
Let's try removing the outliers so we can actually see something.

Most users don't give a lot of reviews

What are the average ratings given by users?

Users don't tend to give really low ratings

8. Regression Analysis 💡

sentiment polarity model
Regression models and results are in the yelp.analysis.R script.

In this model, we use the [sentimentr](https://cran.r-project.org/web/packages/sentimentr/sentimentr.pdf) package to assign a sentiment score to each review text.
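
To give an idea of what a lexicon-based score with negation handling looks like, here is a deliberately simplified Python sketch. It is not sentimentr's algorithm and not the code in yelp.analysis.R; the lexicon, the negator list, the two-word window and the length normalisation are all assumptions chosen for illustration.

```python
import re

# Hypothetical mini-lexicon and negator list.
LEXICON = {"good": 1.0, "great": 1.0, "delicious": 1.0,
           "bad": -1.0, "disappoint": -1.0, "bland": -1.0}
NEGATORS = {"not", "never", "no", "doesn't", "don't", "didn't"}


def review_score(text):
    """Sum word polarities, flipping a word's sign when a negator appears in
    the two tokens before it, then divide by the number of tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    total = 0.0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token)
        if polarity is None:
            continue  # not a sentiment word
        window = tokens[max(0, i - 2):i]
        if any(word in NEGATORS for word in window):
            polarity = -polarity
        total += polarity
    return total / len(tokens)


print(review_score("The food was great"))
print(review_score("The food really doesn't disappoint"))
```

In this sketch "doesn't disappoint" comes out positive because the negator flips the sign of "disappoint".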

One more question: are ratings related to the text length of the review?

It doesn't look like it.
Review length seems to be related to personal habits rather than restaurants' quality

Before fitting any model, how is our response variable distributed?

The ratings are not normally distributed, so we might want to use logit or multinomial models.

logistic models

Try splitting the data into ratings higher than 3 stars and ratings lower than 3 stars.
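
The split and a one-predictor logistic fit can be sketched in plain Python as below. This is an illustration only (the actual models are in the R script): the data points are made up, 3-star reviews are assumed to be dropped, and the simple gradient-ascent fitter stands in for a proper GLM routine.

```python
import math


def to_binary(stars):
    """1 for ratings above 3 stars, 0 for ratings below 3; None for exactly 3."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical (sentiment score, stars) pairs.
data = [(-0.6, 1), (-0.3, 2), (0.1, 2), (0.0, 3), (-0.1, 4), (0.2, 4), (0.5, 5)]
kept = [(x, to_binary(s)) for x, s in data if to_binary(s) is not None]
xs = [x for x, _ in kept]
ys = [y for _, y in kept]
b0, b1 = fit_logistic(xs, ys)
print(b0, b1)
```

One common cause of very large coefficients in a logistic regression is that a predictor separates the two classes almost perfectly, in which case the slope keeps growing as the fit runs longer.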

In our regression output, the log odds are extremely large, which suggests that we didn't include sufficient information in the model or that there are outliers in the data.

multinomial models

Fitting multinomial models with the same predictors causes the same problem, so we consider what other information we can add to our model.

From the users' perspective, different users have different standards when giving ratings. Some users have strict requirements for dining, so the ratings they give on Yelp are generally low. Other users are more tolerant: even when the quality of a restaurant is not that satisfying, they still give quite decent ratings. So each user's underlying standard is a factor that influences the outcome. Therefore, we go back to the Users dataset, calculate the average rating per user, and add that information to our regression models.
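
The per-user average is a simple group-by. Here is a minimal Python sketch of the idea (not the project's R code); the field names `user_id` and `stars` and the sample rows are assumptions.

```python
from collections import defaultdict


def average_stars_by_user(reviews):
    """Return a dict mapping each user id to the mean of the stars they gave."""
    totals = defaultdict(lambda: [0.0, 0])  # user id -> [sum of stars, count]
    for review in reviews:
        entry = totals[review["user_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {user: total / count for user, (total, count) in totals.items()}


# Hypothetical review rows.
reviews = [
    {"user_id": "u1", "stars": 5},
    {"user_id": "u1", "stars": 4},
    {"user_id": "u2", "stars": 2},
]
user_avg = average_stars_by_user(reviews)

# Attach the user's average to every review so it can serve as a predictor.
rows = [dict(r, user_avg=user_avg[r["user_id"]]) for r in reviews]
print(rows[0])
```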

From the businesses' perspective, ratings are definitely related to their own quality. Since we only have data for a limited period of time, we might not get a full picture of how the businesses perform over the years. However, we do have their ratings on Yelp, which are cumulative results over a longer period. So we go ahead and add those in too.

The plots below show the relationship between users' average ratings and their ratings for a particular restaurant, and the relationship between restaurants' Yelp ratings and their ratings in reviews.

multilevel models

Still, linear multilevel models don't suit our data.
Therefore, we try fitting multilevel logit models.

multilevel logit models

1) model building

In this model, we split the ratings into two categories: users' ratings lower than 3 stars, and users' ratings of 3 stars or higher.

Our main objective is to find out whether the sentiment users show in their reviews can predict the ratings they give to a restaurant. Besides the sentiment score of the review, our predictors include indicators for the restaurant's price range and parking availability, and the user's average rating (the sum of all ratings they have given on Yelp divided by the number of reviews they have posted). However, both the outcome data and the residual plots from a linear regression show separate trends, because the data contain repeated measurements for each restaurant. Therefore, we use the restaurant's public rating (the single rating shown on its Yelp business page) as our group-level predictor, which captures the restaurant-level random effect on the model outcome.
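
One possible way to write such a model down is a varying-intercept logit. This is only one reading of the description above, not necessarily the exact specification in yelp.analysis.R; all symbols are introduced here for illustration, with $i$ indexing reviews and $j[i]$ the restaurant of review $i$.

$$
\begin{aligned}
\Pr(y_i = 1) &= \operatorname{logit}^{-1}\!\left(\alpha_{j[i]} + \beta_1\,\text{sentiment}_i + \beta_2\,\text{price}_{j[i]} + \beta_3\,\text{parking}_{j[i]} + \beta_4\,\text{useravg}_i\right) \\
\alpha_j &\sim \mathcal{N}\!\left(\gamma_0 + \gamma_1\,\text{stars}_j,\ \sigma_\alpha^2\right)
\end{aligned}
$$

Here $y_i = 1$ means the review gives 3 stars or more, and $\text{stars}_j$ is the restaurant's public Yelp rating acting as the group-level predictor.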

2) regression output

After running the regression, we find two predictors with a relatively large influence on users' ratings: the sentiment score of the review and the user's average rating on Yelp. On the other hand, whether the restaurant has parking and its price range don't matter much. The results are quite intuitive: people who express positive emotions in their reviews tend to give higher ratings, and people who habitually give decent ratings (although we don't know exactly why) tend to give higher ratings. Of these two factors, the sentiment score plays the more important role in predicting ratings.

3) model checking

How does our model perform when used to predict ratings?

To find out, we run predictive checks. The results are shown below.

On the left are the predicted values; on the right is our original data. Honestly, the distributions are similar, but our model definitely overestimates the difference between the two categories.

Hence, we run a chi-square test to see whether the two distributions are really different.

Well, they're not! 💃
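
A chi-square comparison of predicted and observed category counts can be sketched as follows. This Python version is illustrative only: the counts are made up, and it computes the plain Pearson statistic without a continuity correction, which may differ from what the R script does.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# 5% critical value of the chi-square distribution with 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841

# Hypothetical counts: rows are predicted and observed, columns are low and high.
table = [[180, 820],
         [200, 800]]
stat = chi_square_statistic(table)
print(round(stat, 4), stat > CRITICAL_5PCT_DF1)
```

A statistic below the critical value means the test does not detect a difference between the two distributions at the 5% level.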

4) discussion and implication

Using sentiment scores to predict ratings can be fun, but it is not that accurate. While it can tell whether a restaurant is of above or below average quality, it struggles to predict small differences in ratings. After all, different people have different language habits and rating habits. People tend to write reviews when the service they receive is remarkably pleasant or remarkably unpleasant. Some people swear when they really hate something and some people swear when they really love something. Some do both. All of this might affect our model, but we are not taking it into account right now. Not to mention the cases we discussed before, where sentiment words are not adjectives but nouns.
