<a href="https://colab.research.google.com/github/gabecoelho/CS410-CourseProject/blob/main/CS410_Sentiment_Analysis_Course_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Yelp Data - Business Review Sentiment Analysis

## 1. Topics to be explored

1. Business categories
2. Number of stars in reviews
3. Number of businesses open and closed
4. Sentiment analysis on all reviews

## 2. Start by downloading files and importing libraries for reading the data

In [16]:
# Download files

# Business data
!wget -q --show-progress --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1vCSG6KgUYo0wpn2_8-tb0YZhMgY4wk1h' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1vCSG6KgUYo0wpn2_8-tb0YZhMgY4wk1h" -O yelp_businesses.json && rm -rf /tmp/cookies.txt



In [15]:
# Reviews data
!wget -q --show-progress --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1GuxASYHSWWtGhaCJeh57WCA3RKsjJzaO' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1GuxASYHSWWtGhaCJeh57WCA3RKsjJzaO" -O yelp_reviews.json && rm -rf /tmp/cookies.txt


In [17]:
# General data analysis libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

# Reviews data types
review_dtypes = {
    'stars': np.float16
}

business_dtypes = {
    'is_open': bool
}

reviews_data = pd.read_json('yelp_reviews.json',
                            lines=True,
                            chunksize=2000,
                            dtype=review_dtypes,
                            orient='records')

business_df = pd.read_json('yelp_businesses.json',
                           lines=True,
                           dtype=business_dtypes)

### Only looking at data from 2018 and beyond

In [18]:
# Reviews chunk storage
reviews_chunks = []

for chunk in reviews_data:
  # Drop unused columns and start from 2018 onward
  min_chunk = chunk.drop(columns=['user_id', 'review_id']).query("`date` >= '2018-01-01'")
  reviews_chunks.append(min_chunk)

reviews_df = pd.concat(reviews_chunks, ignore_index=True)

ValueError: ignored
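
The chunked filter can be tried on a small in-memory sample before running it against the full file. The sketch below is an illustration only: the three records are made up, and `io.StringIO` stands in for `yelp_reviews.json`.

```python
import io
import pandas as pd

# Made-up records standing in for yelp_reviews.json (one JSON object per line)
sample_file = io.StringIO(
    '{"review_id": "r1", "user_id": "u1", "stars": 5, "date": "2017-06-01 10:00:00", "text": "old review"}\n'
    '{"review_id": "r2", "user_id": "u2", "stars": 2, "date": "2018-03-15 09:30:00", "text": "newer review"}\n'
    '{"review_id": "r3", "user_id": "u3", "stars": 4, "date": "2019-11-20 18:45:00", "text": "newest review"}\n'
)

# chunksize makes read_json return an iterator of DataFrames instead of one large frame
reader = pd.read_json(sample_file, lines=True, chunksize=2)

kept_chunks = []
for chunk in reader:
    # Same steps as the notebook cell: drop unused columns, keep 2018 onward
    filtered = chunk.drop(columns=['user_id', 'review_id']).query("`date` >= '2018-01-01'")
    kept_chunks.append(filtered)

sample_reviews = pd.concat(kept_chunks, ignore_index=True)
print(sample_reviews[['stars', 'text']])
```

Only the two records dated 2018 or later should remain, and the `user_id` and `review_id` columns should be gone.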

Sample the business data

In [19]:
business_df.head(5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,False,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,True,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,False,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,True,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,True,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


Sample the review data

In [20]:
reviews_df.head(5)

NameError: ignored

In [None]:
# Join all categories into one large string
joint_categories = ', '.join(business_df['categories'].dropna())

# Make a list with each separate category as an entry
categories = pd.DataFrame(joint_categories.split(', '), columns=['category'])

# Get a series count of unique values
categories_series = categories.category.value_counts()

# Use default index and name the columns explicitly, since the names
# produced by reset_index() differ between pandas versions
categories_df = categories_series.rename_axis('category').reset_index(name='count')

# Build plot for visualization
plt.figure(figsize=(12,10))
categories_axis = sns.barplot(y = 'category', x = 'count', data = categories_df.iloc[0:20], palette='hls')
categories_axis.set_ylabel('Category')
categories_axis.set_xlabel('Number of businesses')
categories_axis.set_title('Number of businesses by category')

for p in categories_axis.patches:
    categories_axis.annotate(int(p.get_width()),
                ((p.get_x() + p.get_width()),
                 p.get_y()),
                 xytext=(2, -16),
                fontsize=16,
                color='#00244d',
                textcoords='offset points',
                horizontalalignment='right')
 
plt.show()   
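
The join-and-split counting above can be checked on a handful of rows using only the standard library. The category strings below are made up for illustration, and `Counter` is a stand-in for `value_counts()`.

```python
from collections import Counter

# Made-up 'categories' values; None mimics a business with no categories
category_strings = [
    "Restaurants, Food, Pizza",
    "Shopping, Fashion",
    None,
    "Food, Coffee & Tea",
]

# Split each comma-separated string and count every individual category
category_counts = Counter(
    category
    for entry in category_strings if entry is not None
    for category in entry.split(', ')
)

# most_common(n) returns the n largest counts, similar to value_counts().head(n)
print(category_counts.most_common(3))
```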

## 3. Get a count of only restaurant- or food-related categories

In [None]:
related_cats = ['Restaurants', 'Food', 'Bars', 'Sandwiches', 'American (Traditional)', 'Pizza', 'Coffee & Tea', 'Fast Food', 'Breakfast & Brunch', 'American (New)']

# Filter for all the related categories
related_df = categories.where(lambda category : category.isin(related_cats)).dropna()

len(related_df)

## 4. Filter only businesses that are food related

In [None]:
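# 'categories' is a comma-separated string, so keep businesses that list at least one related category
has_related = lambda cats: any(cat in related_cats for cat in str(cats).split(', '))
food_businesses = business_df[business_df['categories'].apply(has_related)]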
food_businesses.head(5)
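
Because `categories` stores a comma-separated string, testing the whole string against the list only matches businesses with exactly one category. A small sketch on hypothetical rows (the names and the shortened `related` list are made up) shows the difference between an exact match and a per-category match:

```python
import pandas as pd

related = ['Restaurants', 'Food', 'Pizza']  # shortened, hypothetical list

# Made-up businesses; 'categories' is a comma-separated string as in the Yelp data
toy_businesses = pd.DataFrame({
    'name': ['Pizzeria', 'Grocer', 'Tailor', 'Unknown'],
    'categories': ['Restaurants, Pizza', 'Food', 'Shopping, Fashion', None],
})

# Exact match: only rows whose whole string equals one entry in the list
exact_match = toy_businesses[toy_businesses['categories'].isin(related)]

# Per-category match: split the string and look for any overlap with the list
def lists_related_category(cats):
    return any(cat in related for cat in str(cats).split(', '))

overlap_match = toy_businesses[toy_businesses['categories'].apply(lists_related_category)]

print(list(exact_match['name']))
print(list(overlap_match['name']))
```

The exact match misses the pizzeria because its string contains two categories, while the per-category match keeps it.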

## 5. Plot business star ratings count

In [None]:
plt.figure(figsize=(10,7))
axis = sns.countplot(x='stars', data = food_businesses, palette='Paired')
axis.set_ylabel('Count')
axis.set_xlabel('Stars')
axis.set_title('Restaurant-related businesses by rating')

for p in axis.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        axis.text(x+width-.4, 
                y+height+.3,
                '{:.0f}'.format(height),
                weight='bold',
                horizontalalignment='center')
                
plt.show()

## 6. Filter to see how many of the restaurants are still open 

In [None]:
# Boolean indexing drops closed businesses; where() would keep them as NaN rows
open_businesses = food_businesses[food_businesses['is_open'] == True]

len(open_businesses)
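
When counting rows after a filter, it matters whether the filter removes rows or only masks them. The sketch below, on a made-up three-row frame, compares `DataFrame.where` with boolean indexing:

```python
import pandas as pd

# Made-up businesses, two open and one closed
toy = pd.DataFrame({'name': ['A', 'B', 'C'], 'is_open': [True, False, True]})

# where() keeps every row and replaces the non-matching ones with NaN
masked = toy.where(toy['is_open'])

# Boolean indexing drops the non-matching rows
open_only = toy[toy['is_open']]

print(len(masked), len(open_only))
```

Only the boolean-indexed frame has a length equal to the number of open businesses.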

## 7. Breakdown by Top 10 States

In [None]:
top10 = open_businesses['state'].value_counts(ascending=True).tail(10).to_frame()

In [None]:
plt.figure(figsize=(10,7))
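# Pass the counts by position so the plot does not depend on the column name chosen by value_counts()
axis = sns.barplot(x=top10.index, y=top10.iloc[:, 0], palette='Blues')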
axis.set_ylabel('Count')
axis.set_xlabel('States')
axis.set_title('Number of Businesses by State (Top 10)')

for patch in axis.patches:
        width, height = patch.get_width(), patch.get_height()
        x, y = patch.get_xy() 
        axis.text(x+width-.4, 
                y+height+0.3,
                '{:.0f}'.format(height),
                weight='bold',
                horizontalalignment='center',
                size='medium') 
                
plt.show()

### There are more open food-related businesses in PA than in any other state

## **8. Perform Sentiment Analysis**

### Some assumptions to keep in mind

1. Negative reviews will be anything from 1 to 3 stars
2. Positive reviews will be anything with 4 or 5 stars
3. We will work with a subset of the dataset, a sample of 125k entries
4. The expected result will be a dataframe with the review ('text' in the dataset) as well as the 'sentiment' boolean variable
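
Assumptions 1 and 2 amount to a threshold at 4 stars. A minimal sketch on made-up ratings, assuming whole-star values:

```python
import pandas as pd

# Made-up star ratings covering the full 1-5 range
stars = pd.Series([1, 2, 3, 4, 5], name='stars')

# 4 stars and above count as positive (1); everything else as negative (0)
sentiment = (stars >= 4).astype(int)

print(pd.DataFrame({'stars': stars, 'sentiment': sentiment}))
```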

In [None]:
# Import libraries
import nltk
import re, string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('stopwords')

from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Re-map review stars as 0 (negative) or 1 (positive) as described in the assumptions above

reviews_df['sentiment'] = reviews_df['stars'].replace({
    1: 0, # negative
    2: 0, # negative
    3: 0, # negative
    4: 1, # positive
    5: 1  # positive
}).astype(int)

In [None]:
# Sample 125k entries

sample_reviews_df = reviews_df.sample(125000).reset_index(drop=True)

In [None]:
# Create the two columns as described in the assumptions above

analysis_df = sample_reviews_df[
    ['text', 'sentiment']
].copy()  # copy so that new columns can be added without a SettingWithCopyWarning

analysis_df.head(5)

In [None]:
# Convert English contractions into separate words to standardize
# For instance: "wouldn't" becomes "would not"
# There is a comprehensive list of contractions accessible at https://www.sjsu.edu/writingcenter/docs/handouts/Contractions.pdf
# and a python object can be found at https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
# For purposes of this project, we shall use a simple function to take care of the most common patterns

def get_canonized_contractions(word):
    # [’'] matches both the curly and the straight apostrophe
    word = re.sub(r"couldn[’']t", "could not", word)
    word = re.sub(r"wouldn[’']t", "would not", word)
    word = re.sub(r"won[’']t", "will not", word)
    word = re.sub(r"can[’']t", "can not", word)
    word = re.sub(r"[’']d",  " would", word)
    word = re.sub(r"[’']ve", " have", word)
    word = re.sub(r"[’']ll", " will", word)
    word = re.sub(r"[’']re", " are", word)
    word = re.sub(r"n[’']t", " not", word)
    word = re.sub(r"[’']t", " not", word)
    word = re.sub(r"[’']m", " am", word)
    word = re.sub(r"[’']s", " is", word)
    return word

In [None]:
# Because we will use bag of words, we convert all strings to lower case
analysis_df['processed'] = analysis_df['text'].apply(lambda x: ' '.join(x.lower() for x in str(x).split()))

# Substitute each contraction in the already lowercase review strings
analysis_df['processed'] = analysis_df['processed'].apply(lambda x: get_canonized_contractions(x))

# Keep only alphabetical chars
alpha = '[^A-Za-z]+'
analysis_df['processed'] = analysis_df['processed'].apply(lambda x: ' '.join(
    [re.sub(alpha, '', x) for x in nltk.word_tokenize(x)]
))

# Remove any extra spaces between words
analysis_df['processed'] = analysis_df['processed'].apply(lambda x: re.sub(' +', ' ', x))

# Remove stop words based on the ones downloaded from nltk
stop_words = stopwords.words('english')
analysis_df['processed'] = analysis_df['processed'].apply(lambda x: ' '.join(
    [x for x in x.split() if x not in stop_words]
))

# As a final processing step, use a lemmatizer
# This will transform words in the following way:
# dogs -> dog
# abaci -> abacus
# churches -> church
lemmatizer = WordNetLemmatizer()
analysis_df['processed'] = analysis_df['processed'].apply(lambda x: ' '.join(
    [lemmatizer.lemmatize(word) for word in nltk.word_tokenize(x)]
))
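
Any new text passed to the vectorizer would ideally go through the same steps as the training data, so it can help to wrap the preprocessing in one function. The sketch below is a simplified, standard-library version of the cell above. The function names and the small stop-word set are hypothetical, only a few contraction patterns are included, and lemmatization is left out because it needs the NLTK data.

```python
import re

# Small hypothetical stop-word set; the notebook uses NLTK's English list instead
SAMPLE_STOP_WORDS = {'i', 'the', 'a', 'it', 'is', 'was', 'and', 'to'}

def expand_contractions(text):
    # Handle both straight (') and curly (’) apostrophes
    text = re.sub(r"won['’]t", "will not", text)
    text = re.sub(r"can['’]t", "can not", text)
    text = re.sub(r"n['’]t", " not", text)
    text = re.sub(r"['’]ve", " have", text)
    text = re.sub(r"['’]m", " am", text)
    return text

def preprocess_review(text):
    # Lowercase first so the contraction patterns match
    text = expand_contractions(text.lower())
    # Replace anything that is not a letter with a space, then split on whitespace
    tokens = re.sub(r'[^a-z]+', ' ', text).split()
    # Drop stop words and join the remaining tokens back into one string
    return ' '.join(token for token in tokens if token not in SAMPLE_STOP_WORDS)

print(preprocess_review("I can't complain, the food wasn't bad!"))
# → can not complain food not bad
```

With a helper like this, a custom review could be passed through `preprocess_review` before `vectorizer.transform`, so that training and prediction see text in the same form.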

In [None]:
# Separate into features (A, the processed text) and labels (B, the sentiment)

A = analysis_df['processed']
B = analysis_df['sentiment']

# Use sklearn to split arrays into random train and test subsets
# test_size = the proportion of the dataset to include in the test split
# random_state = controls the shuffling applied to the data before applying the split
A_train, A_test, B_train, B_test = train_test_split(A, B, test_size=0.2, stratify=B, random_state=45)

print('A - Training step shape', A_train.shape)
print('A - Test step shape', A_test.shape)

print('B - Training step shape', B_train.shape)
print('B - Test step shape', B_test.shape)
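
The `stratify` argument keeps the share of positive and negative labels the same in both subsets. A minimal sketch on made-up data (20 placeholder texts, 15 positive and 5 negative) shows the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 20 placeholder documents, 15 positive (1) and 5 negative (0)
texts = [f'review {i}' for i in range(20)]
labels = [1] * 15 + [0] * 5

# stratify=labels preserves the 75/25 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=45
)

print(Counter(y_train), Counter(y_test))
```

Both subsets keep three positives for every negative, which matters when one class is much larger than the other.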

In [None]:
# Convert a collection of raw documents to a matrix of TF-IDF features
# that can be used as input to an estimator

vectorizer = TfidfVectorizer()
tf_A_train = vectorizer.fit_transform(A_train)
tf_A_test = vectorizer.transform(A_test)
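
To make the TF-IDF weighting more concrete, the sketch below computes it by hand for a made-up three-document corpus. It assumes the smoothed idf formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, and it is not a replacement for the vectorizer itself.

```python
import math
from collections import Counter

# Made-up, already preprocessed documents
docs = ['good food good service', 'bad food', 'good place']

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalise so every document vector has unit length
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}

vector = tfidf_vector(tokenized[1])
print(vector)
```

In the second document, "bad" appears in fewer documents than "food", so it receives the larger weight.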

In [None]:
# Build a report

classification = LinearSVC(random_state = 0)

# Fit the model according to the given training data
classification.fit(tf_A_train, B_train)

B_test_prediction = classification.predict(tf_A_test)

report = classification_report(B_test, B_test_prediction, output_dict = True)

pd.DataFrame(report)
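
One way to keep vectorization and classification in step is to combine them in a scikit-learn pipeline, so the same fitted vectorizer is applied automatically at prediction time. This is a sketch on a tiny made-up corpus and not the setup used in this notebook; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, already preprocessed reviews with their sentiment labels
train_texts = [
    'great food friendly staff',
    'terrible service cold food',
    'loved pizza great place',
    'awful experience never again',
]
train_labels = [1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
predictions = model.predict(['great pizza', 'terrible experience'])
print(predictions)
```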

### **8.1 Let's run some tests and evaluations**

### Positive Review

In [None]:
# Positive review from the original dataset

analysis_df['text'][1]

In [None]:
# The same positive review, from the processed column

analysis_df['processed'][1]

Notice that the stop words are gone while the main content words remain.

---



In [None]:
positive_review = analysis_df['processed'][1]

positive_review_transformed = vectorizer.transform([positive_review])

positive_review_prediction = classification.predict(positive_review_transformed)

positive_review_prediction

**The value '1' means we predicted positive, which is correct**

### Custom Positive Review

In [None]:
# Very positive

custom_pos_review = '''
Excellent soondubu jjigae I had many coworkers from Korea recommend this restaurant, so that is how you know it is legit. I had pretty high expectations. 
I decided to try out their not spicy option. This is my first time ever trying soft tofu soup without any gochujang. It is definitely on the more salty and savory side.
I got the seafood one (Haemul) and it came with squid, mini mussels, tiny shrimp, and one prawn. Not a ton of seafood.
The soup was definitely dominated by the silken tofu, but that is how I prefer it. Of the side dishes, the pickled radish was my favorite.
Normally I see this served with fried chicken, so I was pretty delighted to see it and it was cut into strips. The kimchi was tasty, but more spicy than your typical kimchi.
'''

custom_pos_review_transformed = vectorizer.transform([custom_pos_review])

custom_pos_review_prediction = classification.predict(custom_pos_review_transformed)

custom_pos_review_prediction

**The value '1' means we predicted positive, which is correct**

In [None]:
# Dubiously positive (4 stars or more)

custom_pos_review = '''
Fish and chips for $10.50 in the bay area. I can't complain.
Just across the street there is a disgusting seafood restaurant.
They don't seem to even wash their hands, and seafood is not fresh.
But this place lives up to the expectations. Good food, gotta love it
'''

custom_pos_review_transformed = vectorizer.transform([custom_pos_review])

custom_pos_review_prediction = classification.predict(custom_pos_review_transformed)

custom_pos_review_prediction

**The value '1' means we predicted positive, which is correct**

### Negative Review

In [None]:
# Negative review from the original dataset

analysis_df['text'][6]

In [21]:
# The same negative review, from the processed column

analysis_df['processed'][6]

NameError: ignored

In [None]:
negative_review = analysis_df['processed'][6]

negative_review_transformed = vectorizer.transform([negative_review])

negative_review_prediction = classification.predict(negative_review_transformed)

negative_review_prediction

**The expected value here is '0' (negative); a '1' would mean the model misclassified this review as positive**



### Custom Negative Review

In [None]:
# Very negative

custom_negative_review = '''
I was highly disappointed on my last visit there last night.
May have a talk with the manager. Use to love this place.
Now thinking I don't need to go here again. They were very busy.
Very disorganized and need stronger management leadership.
Food was overcooked and Very Small portions.
Added roasted asparagus for $3 and only got 3 stocks!
Just overall a bad experience.
'''

custom_negative_review_transformed = vectorizer.transform([custom_negative_review])

custom_negative_review_prediction = classification.predict(custom_negative_review_transformed)

custom_negative_review_prediction

**The value '0' means we predicted negative, which is correct**


In [None]:
# Dubiously negative (3 stars or fewer)

custom_negative_review = '''
Updated my reviews. Food is still good. But they keep missing side orders.
At least twice this happened to me. It's annoying but I keep coming anyway
'''

custom_negative_review_transformed = vectorizer.transform([custom_negative_review])

custom_negative_review_prediction = classification.predict(custom_negative_review_transformed)

custom_negative_review_prediction

**The value '0' means we predicted negative, which is correct**


### Custom Neutral Review

In [None]:
# Neutral

custom_neutral_review = '''
This place is alright. Price is okay and food is normal. Not bad, not great. Service is pretty good though.
'''

custom_neutral_review_transformed = vectorizer.transform([custom_neutral_review])

custom_neutral_review_prediction = classification.predict(custom_neutral_review_transformed)

custom_neutral_review_prediction

Although the review was neutral and leaned towards positive, the model predicted '0' (negative).
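
For borderline reviews it can help to look at how far a sample lies from the decision boundary, not only at the predicted label. `LinearSVC.decision_function` returns a signed score per sample. The sketch below uses a tiny made-up corpus, so the scores say nothing about the model trained in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up, already preprocessed reviews with their sentiment labels
train_texts = [
    'great food friendly staff',
    'terrible service cold food',
    'loved pizza great place',
    'awful experience never again',
]
train_labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LinearSVC(random_state=0).fit(vec.fit_transform(train_texts), train_labels)

new_texts_matrix = vec.transform(['food okay', 'great food great staff'])

# Positive scores map to class 1, negative scores to class 0;
# scores close to zero indicate a borderline prediction
scores = clf.decision_function(new_texts_matrix)
print(scores)
print(clf.predict(new_texts_matrix))
```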

## Analysis




In [None]:
pd.DataFrame(report)

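1.  This model has an overall accuracy of ~92%

2.  Negative reviews were more difficult to predict than positive ones, based on the per-class scores in the classification report (which lists precision, recall, and F1 per class rather than accuracy)

    Positive - ~93%

    Negative - ~89%

3. This model correctly predicted both custom positive reviews and both custom negative reviews, and it leaned towards negative when given a neutral review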
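
Overall accuracy and per-class scores measure different things. The sketch below computes them by hand on made-up labels; the numbers are illustrative and unrelated to the report above.

```python
# Made-up true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1]

def per_class_metrics(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == label and p == label for t, p in pairs)
    false_pos = sum(t != label and p == label for t, p in pairs)
    false_neg = sum(t == label and p != label for t, p in pairs)
    # Precision: share of predictions for this class that were right
    precision = true_pos / (true_pos + false_pos)
    # Recall: share of actual members of this class that were found
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Accuracy: share of all predictions that were right, regardless of class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print('accuracy', accuracy)
print('negative (precision, recall)', per_class_metrics(y_true, y_pred, 0))
print('positive (precision, recall)', per_class_metrics(y_true, y_pred, 1))
```

Here the overall accuracy is 0.8, while precision and recall for the negative class are both 0.75, so a single accuracy figure can hide a weaker class.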

## 9. Environment teardown


In [None]:
# Uncomment and run the code below to remove downloaded files from the environment

# !rm yelp_businesses.json
# !rm yelp_reviews.json