## Summary

1. The number of reviews per business ranges from 5 to >9,000. 
2. The businesses with more reviews tend to be older, some going back to 2004.
3. About 55% of the businesses have been reviewed at least once during the pandemic. 
4. About 40% of all of the reviews are tagged as useful by at least one person.   
5. About 99% of businesses in this data have at least one review that has been tagged as useful.
6. The share of a given business's reviews that are tagged as useful can be anywhere from 0% to 100%.
7. I was worried that there might be a correlation between useful reviews and age (existing longer would give businesses more time to rack up useful reviews). The correlation coefficient is positive but very low.
8. Among reviews tagged as useful, "useful" scores range from 1 to 411.
9. There is a very slight negative correlation (-0.05) between the star rating and the usefulness score.



In [None]:
import json
import pandas as pd
import plotly.express as px
import datetime
import re

In [None]:
reviews = pd.read_json("processed_data/yelp_team7_dataset_review.json")
reviews.head()

### How many reviews does an individual business have?

In [None]:
reviews_per_biz = reviews.groupby("business_id").review_id.count()
rperbiz_fig = px.histogram(reviews_per_biz,
                           title="Reviews per Business",
                           labels={'value': 'Number of reviews'})
rperbiz_fig.show()

**Note:** I'd like to see where these restaurants are. Let's also see when the first review was and how many reviews the restaurants had pre-pandemic.

That is a pretty wide range of review counts. Let's look at when each business was first reviewed to see if this makes sense.

In [None]:
first_review = reviews.groupby("business_id").agg({'review_id': 'count', 
                                                  'date': 'min'})
first_review.sort_values(["review_id"], ascending = False)

In [None]:
reviews["review_year"] = reviews.date.dt.year
first_review_year = reviews.groupby("business_id").agg({'review_id': 'count',
                                                        'review_year': 'min'})
first_review_fig = px.scatter(first_review_year, 
                             x = "review_year", 
                             y = "review_id", 
                             title = "Year of First Review x Number of Reviews")

first_review_fig.show()

Let's make sure that a majority of these businesses have reviews from the pandemic period (February 2020 onward). Otherwise, we won't get any info about adaptations due to COVID.

In [None]:
pandemic_reviews = reviews[reviews.date >= datetime.datetime(2020, 2, 1)]
print("%s percent of reviews happened during the pandemic" %round(pandemic_reviews.shape[0]/reviews.shape[0]*100))

In [None]:
reviews_per_biz_pan = pandemic_reviews.groupby("business_id").review_id.count()
rperbiz_pandemic_fig = px.histogram(reviews_per_biz_pan,
                                    title="Reviews per Business During the Pandemic",
                                    labels={'value': 'Number of reviews'})
rperbiz_pandemic_fig.show()

In [None]:
print("%s percent of businesses have reviews during the pandemic" %round(reviews_per_biz_pan[reviews_per_biz_pan > 0].shape[0]/
                                                                         reviews_per_biz.shape[0]*100))
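
The two percentages above can also be computed from a single boolean flag. This is a minimal sketch on a made-up `toy_reviews` table (the rows, the `in_pandemic` column, and `PANDEMIC_START` are illustrative names, not part of the real dataset); it assumes the same February 1, 2020 cutoff used above.

```python
import datetime

import pandas as pd

# Hypothetical toy data standing in for the real reviews table.
toy_reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "c", "c", "d"],
    "review_id": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "date": pd.to_datetime([
        "2019-05-01", "2020-06-15", "2018-01-20",
        "2020-03-02", "2021-01-10", "2020-01-31",
    ]),
})

PANDEMIC_START = datetime.datetime(2020, 2, 1)  # same cutoff as the cell above

# Flag each review, then ask whether each business has any flagged review.
toy_reviews["in_pandemic"] = toy_reviews["date"] >= PANDEMIC_START
has_pandemic_review = toy_reviews.groupby("business_id")["in_pandemic"].any()

# The mean of a boolean Series is the fraction of True values.
share_reviews = toy_reviews["in_pandemic"].mean() * 100
share_businesses = has_pandemic_review.mean() * 100

print(f"{share_reviews:.0f} percent of reviews happened during the pandemic")
print(f"{share_businesses:.0f} percent of businesses have reviews during the pandemic")
```

Grouping with `.any()` keeps businesses without pandemic reviews in the result (as `False`), so the denominator is always the full set of businesses.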

### Let's look at the usefulness of reviews. Specifically, let's look at:

1. What percentage of reviews are useful?
2. What is the distribution of restaurants with useful reviews?
3. Are some reviewers more useful than others?
4. What is the overall distribution of useful scores?
5. How does the distribution of useful scores relate to the overall distribution of reviews?


In [None]:
print("%s percent of reviews are tagged as useful" %round(reviews[reviews.useful > 0].shape[0]/reviews.shape[0]*100))

In [None]:
reviews["review_is_useful"] = reviews.useful > 0
useful_reviews_per_biz = reviews.groupby("business_id").agg({"review_is_useful": "sum", "review_id": "count"})
useful_reviews_per_biz["percent_useful"] = round(
    useful_reviews_per_biz.review_is_useful/useful_reviews_per_biz.review_id*100)

percent_biz_with_reviews = round(useful_reviews_per_biz[useful_reviews_per_biz.review_is_useful > 0].shape[0]/
    reviews_per_biz.shape[0]*100)

print("%s percent of businesses have at least one useful review" %percent_biz_with_reviews)
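
To make the per-business calculation easier to check by hand, here is a small sketch on invented data. The `toy` frame and the `n_useful` / `n_reviews` column names are assumptions for illustration only; the logic mirrors the cell above (a review counts as useful when `useful > 0`).

```python
import pandas as pd

# Hypothetical toy data: business "a" has 2 useful reviews out of 4,
# "b" has none out of 2, and "c" has 1 out of 1.
toy = pd.DataFrame({
    "business_id": ["a", "a", "a", "a", "b", "b", "c"],
    "useful": [0, 3, 0, 1, 0, 0, 7],
})

per_biz = (
    toy.assign(review_is_useful=toy["useful"] > 0)
       .groupby("business_id")
       .agg(n_useful=("review_is_useful", "sum"),
            n_reviews=("review_is_useful", "count"))
)

# Percent of each business's reviews that were tagged useful at least once.
per_biz["percent_useful"] = (per_biz["n_useful"] / per_biz["n_reviews"] * 100).round()

# Percent of businesses with at least one useful review.
pct_with_useful = (per_biz["n_useful"] > 0).mean() * 100

print(per_biz)
print(f"{pct_with_useful:.0f} percent of businesses have at least one useful review")
```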

Does the percent of useful reviews correlate with the age of the business (years since its first review)?

In [None]:
first_review["first_review_year"] = first_review.date.dt.year
first_review['age'] = 2021 - first_review.first_review_year
useful_age = first_review.merge(useful_reviews_per_biz,
                               how='inner',
                               left_index=True,
                               right_index=True)

useful_age_fig = px.box(useful_age, 
                            x = "age", 
                            y = "percent_useful", 
                            title = "Percent of reviews that are useful X age")

useful_age_fig.show()

In [None]:
useful_age.age.corr(useful_age.percent_useful)
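
For reference, `Series.corr` computes the Pearson correlation coefficient by default. The sketch below recomputes it by hand on invented numbers (the `toy` values are not from the dataset) to show what the single figure above measures: the covariance of the two columns scaled by their spreads.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the calculation.
toy = pd.DataFrame({
    "age": [1, 3, 5, 8, 12],
    "percent_useful": [30.0, 45.0, 35.0, 50.0, 40.0],
})

# pandas: Pearson correlation is the default method.
r_pandas = toy["age"].corr(toy["percent_useful"])

# By hand: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations.
x = toy["age"] - toy["age"].mean()
y = toy["percent_useful"] - toy["percent_useful"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

A value near 0 means there is little linear relationship; it does not rule out a non-linear one, which is why the box plot above is still worth looking at.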

In [None]:
reviews.useful.sort_values().unique()

Is the "usefulness" of a review correlated with the star rating?

In [None]:
useful_review_stars = px.scatter(reviews, 
                                x = "stars",
                                y = "useful")
useful_review_stars.show()

In [None]:
reviews.stars.corr(reviews.useful)

Let's look at a few of the most useful reviews:

In [None]:
list(reviews[reviews.useful == 313].text)
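
Filtering on one exact score returns only the reviews with that score. If the goal is to read the top few reviews whatever their scores happen to be, `DataFrame.nlargest` is one option. This is a sketch on invented rows; the `toy_reviews` frame and its text are placeholders, not real reviews.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews table.
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4", "r5"],
    "useful": [0, 12, 40, 5, 97],
    "text": [
        "Fine.",
        "Good tips on parking.",
        "Detailed notes on the menu and wait times.",
        "Okay service.",
        "Long write-up covering takeout and safety measures.",
    ],
})

# Take the three rows with the highest "useful" scores, highest first.
top_useful = toy_reviews.nlargest(3, "useful")[["useful", "text"]]
top_texts = top_useful["text"].tolist()

for score, text in zip(top_useful["useful"], top_texts):
    print(score, text)
```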