Predicting Star Ratings on Yelp

Summary

Which business attributes have the greatest impact on a Yelp business's star rating? Using a variety of regression models, I was able to predict a business's star rating to within about 0.75 stars given only its business attributes. Some of the attributes that proved most predictive include:

  • Karaoke music
  • Catering
  • Bike parking
  • Street parking
  • Intimate ambiance

You can walk through the analysis in the Predicting Star Ratings.ipynb notebook.

Dataset

This project uses the Yelp Open Dataset, which includes 5 files:

  • business.json: Contains business data including location data, attributes, and categories.
  • review.json: Contains full review text data, including the user_id of the user who wrote the review and the business_id of the business the review is written for.
  • user.json: User data including the user's friend mapping and all the metadata associated with the user.
  • checkin.json: Checkins on a business.
  • tip.json: Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
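
As a rough illustration of how these files can be read: the dataset stores one JSON object per line, so a file can be streamed record by record. The sketch below assumes that layout and flattens a nested attributes dictionary into flat feature names. The helper names (read_json_lines, flatten_attributes) and the sample record are hypothetical and are not taken from this repo.

```python
import json


def read_json_lines(lines):
    """Yield one dict per non-empty line of JSON-lines text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def flatten_attributes(attributes, prefix=""):
    """Flatten nested attribute dicts into {"Parent_child": value} pairs."""
    flat = {}
    for key, value in (attributes or {}).items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested groups such as an ambience or parking dict.
            flat.update(flatten_attributes(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical record shaped like a line of business.json (values invented).
sample_lines = [
    '{"business_id": "abc123", "city": "Cleveland", "stars": 4.0, '
    '"attributes": {"BikeParking": true, "Ambience": {"intimate": false}}}',
]

businesses = list(read_json_lines(sample_lines))
features = flatten_attributes(businesses[0]["attributes"])
print(features)
```

With a real file, an open file handle can be passed to read_json_lines in place of sample_lines, since iterating over a handle yields one line at a time. Flat names like these are one possible starting point for building regression features; the notebook may prepare its features differently.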

I've filtered the dataset to include only businesses in Cleveland, Ohio, and included the filtered data in the data directory of this repo. To create your own subset of the Yelp dataset, you can use create_sample_data.py and alter the line below based on your own desired filter:

business_sample = business_df[business_df['city'].str.contains('leveland')]
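
Note that the filter above keeps any city whose name contains "leveland", so it matches "Cleveland" and "cleveland" but also names such as "Cleveland Heights". As a sketch of one stricter alternative (this is not the code in create_sample_data.py), the example below builds a small made-up DataFrame and compares the substring filter with a case-insensitive exact match. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for business.json; the values are invented.
business_df = pd.DataFrame({
    "business_id": ["a1", "b2", "c3", "d4"],
    "city": ["Cleveland", "cleveland", "Cleveland Heights", None],
})

# Substring filter as in the line above; na=False treats missing cities as non-matches.
substring_sample = business_df[business_df["city"].str.contains("leveland", na=False)]

# Stricter alternative: normalize whitespace and case, then compare exactly.
normalized_city = business_df["city"].str.strip().str.lower()
exact_sample = business_df[normalized_city == "cleveland"]

print(substring_sample["business_id"].tolist())
print(exact_sample["business_id"].tolist())
```

Which variant is preferable depends on whether neighboring municipalities should count as part of the sample; the substring version is the more inclusive of the two.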