# Enhanced Review Analytics: Integrating Sentiment Analysis with Yelp Ratings

This notebook walks through every step, from downloading the Yelp Dataset to saving the trained model locally for use in the TR-API.

## Step 1: Download the Yelp Dataset
The Yelp Dataset has already been converted from the original JSON to a readable CSV.
It can be downloaded here: https://drive.google.com/file/d/1zPROW4EuuVXwLP8qtwbkuKY3PhFq0rTo/view?usp=sharing

See steps to download the original dataset here: [ link ]

## Step 1.5: Import Dependencies

In [3]:
import re

import nltk
import pandas as pd
from joblib import dump, load
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

## Step 2: Dataset Partitioning (up to 4,724,472 records)

In [5]:
# Load the dataset
file_path = 'restaurant_reviews.csv'
data = pd.read_csv(file_path)

# Select a random subset of partition_size records
partition_size = 100000
subset_data = data.sample(n=partition_size, random_state=42)

# Save to a new CSV file
output_file_path = f'reviews{partition_size/1000}k.csv'
subset_data.to_csv(output_file_path, index=False)

print(f"Random subset of {partition_size} records saved to {output_file_path}")

Random subset of 100000 records saved to reviews100.0k.csv
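
The `100.0k` in the file name above comes from `partition_size/1000`, which is float division in Python 3. If an integer-style name is preferred, floor division is one option. The helper name `subset_filename` below is hypothetical and not part of the original notebook; this is only a small sketch of the idea:

```python
def subset_filename(partition_size: int) -> str:
    """Build the subset CSV name with floor division, e.g. 100000 -> 'reviews100k.csv'."""
    return f"reviews{partition_size // 1000}k.csv"


print(subset_filename(100000))  # reviews100k.csv
```

Floor division truncates, so a partition of 2,500 records would be named `reviews2k.csv`. Later cells read the file through `output_file_path`, so any change to the naming scheme carries over to them automatically.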


## Step 3: Dataset Preprocessing
This includes the following preprocessing steps:

#### General Preprocessing
1. Lowercasing
2. Removing non-word characters
3. Removing extra spaces
4. Tokenizing
5. Removing stop words and lemmatizing

#### NLTK Resource Loading
6. Loading the punkt, wordnet, and stopwords resources and creating the WordNet lemmatizer

#### Final Steps
7. Loading the dataset into memory
8. Binarizing labels based on star rating (each star rating maps to one of five sentiment categories)

In [6]:
# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(r'\W', ' ', text)  # Remove non-word characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    words = nltk.word_tokenize(text)  # Tokenize
    # Drop stop words, then lemmatize the remaining tokens
    return ' '.join([lemmatizer.lemmatize(word) for word in words if word not in stop_words])

# Load NLTK resources
print("Loading NLTK resources...")
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Load the dataset
print("Loading dataset...")
data = pd.read_csv(output_file_path)

# Preprocess the text
print("Preprocessing text data...")
data['text'] = data['text'].apply(preprocess_text)

# Map each star rating to one of five sentiment categories
def binarize_label(star_rating):
    if star_rating == 1:
        return 'Very Negative'
    elif star_rating == 2:
        return 'Negative'
    elif star_rating == 3:
        return 'Neutral'
    elif star_rating == 4:
        return 'Positive'
    else:  # star_rating == 5
        return 'Very Positive'

print("Binarizing labels into five categories...")
data['sentiment'] = data['stars'].apply(binarize_label)

Loading NLTK resources...
Loading dataset...


[nltk_data] Downloading package punkt to /Users/wnr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/wnr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/wnr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocessing text data...
Binarizing labels into five categories...
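
To make the first three steps of `preprocess_text` concrete, the sketch below applies the same lowercasing and regular expressions to a single string, leaving out the NLTK tokenizing and lemmatizing so that it runs without any downloads. The function name `normalize` and the sample sentence are made up for illustration:

```python
import re


def normalize(text: str) -> str:
    """Apply the lowercasing and regex cleanup used at the start of preprocess_text."""
    text = text.lower()                       # Lowercasing
    text = re.sub(r'\W', ' ', text)           # Replace every non-word character with a space
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace and trim the ends
    return text


print(normalize("The decor is GORGEOUS -- wasn't it?"))
# the decor is gorgeous wasn t it
```

One side effect worth knowing about: because the apostrophe is a non-word character, contractions such as "wasn't" are split into two tokens ("wasn" and "t") before tokenization.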


## Step 4: Vectorizer & Model (LR) Initialization
This step initializes the vectorizer and the logistic regression model used in this application.

In [7]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Initialize the Logistic Regression model
model = LogisticRegression()

## Step 5: Dataset Split & Training

In [8]:
print("Processing...")

X = vectorizer.fit_transform(data['text'])
y = data['sentiment']  # Using the new five-category sentiment labels

# Splitting, training, and evaluating
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"Finished Training. Accuracy: {accuracy_score(y_test, predictions)}")
print(classification_report(y_test, predictions))

Processing...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Finished Training. Accuracy: 0.64265
               precision    recall  f1-score   support

     Negative       0.43      0.31      0.36      1766
      Neutral       0.47      0.35      0.40      2319
     Positive       0.51      0.45      0.48      4769
Very Negative       0.69      0.77      0.73      2366
Very Positive       0.74      0.86      0.80      8780

     accuracy                           0.64     20000
    macro avg       0.57      0.55      0.55     20000
 weighted avg       0.62      0.64      0.63     20000
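
The two summary rows of the report can be reproduced by hand, which helps when reading it. The macro average is the plain mean of the per-class values, while the weighted average weights each class by its support. The sketch below recomputes both from the per-class numbers copied out of the report above:

```python
# Per-class (precision, recall, support), copied from the classification report above
report = {
    "Negative":      (0.43, 0.31, 1766),
    "Neutral":       (0.47, 0.35, 2319),
    "Positive":      (0.51, 0.45, 4769),
    "Very Negative": (0.69, 0.77, 2366),
    "Very Positive": (0.74, 0.86, 8780),
}

total_support = sum(support for _, _, support in report.values())

# Macro average: every class counts equally
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

print(total_support)                                            # 20000
print(round(macro_precision, 2), round(macro_recall, 2))        # 0.57 0.55
print(round(weighted_precision, 2), round(weighted_recall, 2))  # 0.62 0.64
```

The gap between the macro and weighted rows reflects the class imbalance: "Very Positive" makes up 8,780 of the 20,000 test reviews and has the strongest scores, so it pulls the weighted averages up.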



## Step 6: Saving the vectorizer and model

In [9]:
# Save the final model
model_file_path = 'LRM.joblib'
dump(model, model_file_path)

# Save the vectorizer to a file
vectorizer_file_path = 'vectorizer.joblib'
dump(vectorizer, vectorizer_file_path)

print(f"Vectorizer saved to {vectorizer_file_path}")

print(f"Final model saved to {model_file_path}")

Vectorizer saved to vectorizer.joblib
Final model saved to LRM.joblib


## Step 7: Testing the Saved Model

In [10]:
# Load the saved model and vectorizer
model_file_path = 'LRM.joblib'
vectorizer_file_path = 'vectorizer.joblib'  # Ensure this path is correct
model = load(model_file_path)
vectorizer = load(vectorizer_file_path)

## Step 8: Sample Data

In [24]:
data = pd.read_csv(output_file_path)

# Ask the user to specify the number of reviews they want to analyze
num_reviews = int(input("Enter the number of reviews to analyze: "))

# Select a random subset of reviews
sample_data = data.sample(n=num_reviews, random_state=42)
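
Because the sample size comes from free-form user input, it may help to validate it before sampling: `int()` raises `ValueError` on non-numeric text, and `DataFrame.sample` raises an error when `n` exceeds the number of rows (unless `replace=True`). A minimal sketch of such a check follows; the helper name `parse_sample_size` is hypothetical, and clamping to the available row count is one design choice among several:

```python
def parse_sample_size(raw: str, available: int) -> int:
    """Convert user input to a sample size between 1 and the number of available rows."""
    n = int(raw.strip())  # raises ValueError for non-numeric input
    if n < 1:
        raise ValueError("The number of reviews must be at least 1.")
    return min(n, available)  # never request more rows than the dataset holds


print(parse_sample_size(" 25 ", available=100000))     # 25
print(parse_sample_size("500000", available=100000))   # 100000
```

In the cell above, the result could be passed as `n` to `data.sample`, with `len(data)` as `available`.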

## Step 9: Results
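
The cell in this step combines the predicted sentiment score (-2 to 2), the star rating centered on 3, and a TextBlob subjectivity score into a single adjusted aggregate score. The arithmetic can be checked in isolation; the sketch below restates the formula from that cell and applies it to a few inputs. Only the first input matches a review shown in the output; the others are invented for illustration:

```python
def calculate_adjusted_aggregate_score(sentiment_score, star_rating, subjectivity_score):
    normalized_star = star_rating - 3  # 1..5 stars become -2..2
    # Positive predictions are scaled down by subjectivity; all others keep full weight
    weight = 1 - subjectivity_score if sentiment_score > 0 else 1
    return (sentiment_score * weight + normalized_star) / 2


# Neutral prediction, 3 stars: (0 * 1 + 0) / 2
print(calculate_adjusted_aggregate_score(0, 3.0, 0.6092592592592593))  # 0.0
# Very positive prediction, 5 stars, half subjective: (2 * 0.5 + 2) / 2
print(calculate_adjusted_aggregate_score(2, 5, 0.5))                   # 1.5
# Very negative prediction, 1 star: subjectivity is ignored, (-2 * 1 - 2) / 2
print(calculate_adjusted_aggregate_score(-2, 1, 0.9))                  # -2.0
```

The score therefore stays within -2 to 2, and only positive predictions are discounted for subjectivity, so a fully subjective five-star rave (subjectivity 1.0) scores 1.0 rather than 2.0.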

In [25]:
from textblob import TextBlob  # Assuming TextBlob is used for subjectivity analysis

def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Calculate subjectivity scores
sample_data['subjectivity'] = sample_data['text'].apply(get_subjectivity)

def get_sentiment_score(sentiment):
    sentiment_scores = {'Very Negative': -2, 'Negative': -1, 'Neutral': 0, 'Positive': 1, 'Very Positive': 2}
    return sentiment_scores[sentiment]

# Preprocess, predict sentiment, and calculate aggregate score
sample_data['processed_text'] = sample_data['text'].apply(preprocess_text)
X_sample = vectorizer.transform(sample_data['processed_text'])
sample_data['predicted_sentiment_label'] = model.predict(X_sample)
sample_data['predicted_sentiment_score'] = sample_data['predicted_sentiment_label'].apply(get_sentiment_score)

def calculate_adjusted_aggregate_score(sentiment_score, star_rating, subjectivity_score):
    normalized_star = star_rating - 3  # Center the star rating on 3, giving a range of -2 to 2
    weight = 1 - subjectivity_score if sentiment_score > 0 else 1  # Scale down positive predictions by subjectivity; others keep full weight
    return (sentiment_score * weight + normalized_star) / 2


sample_data['adjusted_aggregate_score'] = sample_data.apply(
    lambda row: calculate_adjusted_aggregate_score(
        row['predicted_sentiment_score'], row['stars'], row['subjectivity']
    ),
    axis=1,
)

# Display results
for index, row in sample_data.iterrows():
    print(f"Review: {row['text']}")
    print(f"Predicted Sentiment: {row['predicted_sentiment_score']}")
    print(f"Star Rating: {row['stars']}")
    print(f"Subjectivity Score: {row['subjectivity']}")
    print(f"Adjusted Aggregate Score: {row['adjusted_aggregate_score']}\n------------------------\n")

Review: This is more of a quiet relaxing place. My friends and I were looking for something more loud fun and lively so we weren't as impressed. We only ordered hookah which was pretty okay. The decor is gorgeous. The waiters are all wearing costume-like Moroccan uniforms and  are professional but seemed a little tired and distant. If you are looking for a quiet place and willing to pay not-so-cheap prices then this is a nice place. closes by 2am fridays/saturdays.
Predicted Sentiment: 0
Star Rating: 3.0
Subjectivity Score: 0.6092592592592593
Adjusted Aggregate Score: 0.0
------------------------

Review: If I could give this place negative stars I would. I ordered carry out, rib bones, Mac and cheese and green beans. The green beans and mac were flavorless as well as the bones. The only way I could get the bones down was to drown them in BBQ sauce. Everything was so bland except for the bread. Good BBQ doesn't require any sauce whatsoever. If you think this is good BBQ I suggest you t