
Code Cell 1: Import Libraries and Load Dataset


In [2]:
# Install necessary libraries
!pip install pandas scikit-learn vaderSentiment

# Import libraries
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
# Description: Load the first 10,000 rows of the Yelp review dataset.
file_path = '/content/sample_data/restaurant_reviews_az.csv'  # Replace with your dataset path
data = pd.read_csv(file_path, nrows=10000)  # Read only the first 10,000 rows
print(f"Loaded dataset with {len(data)} rows.")
print(data.head())


Loaded dataset with 10000 rows.
                review_id                 user_id             business_id  \
0  IVS7do_HBzroiCiymNdxDg  fdFgZQQYQJeEAshH4lxSfQ  sGy67CpJctjeCWClWqonjA   
1  QP2pSzSqpJTMWOCuUuyXkQ  JBLWSXBTKFvJYYiM-FnCOQ  3w7NRntdQ9h0KwDsksIt5Q   
2  oK0cGYStgDOusZKz9B1qug  2_9fKnXChUjC5xArfF8BLg  OMnPtRGmbY8qH_wIILfYKA   
3  E_ABvFCNVLbfOgRg3Pv1KQ  9MExTQ76GSKhxSWnTS901g  V9XlikTxq0My4gE8LULsjw   
4  Rd222CrrnXkXukR2iWj69g  LPxuausjvDN88uPr-Q4cQA  CA5BOxKRDPGJgdUQ8OUOpw   

   stars  useful  funny  cool  \
0      3       1      1     0   
1      5       1      1     1   
2      5       1      0     0   
3      5       0      0     0   
4      4       1      0     0   

                                                text             date  \
0  OK, the hype about having Hatch chili in your ...  1/27/2020 22:59   
1  Pandemic pit stop to have an ice cream.... onl...   4/19/2020 5:33   
2  I was lucky enough to go to the soft opening a...  2/29/2020 19:43   
3  I've gone t

Code Cell 2: Data Processing and Lexicon-Based Sentiment Analysis

In [3]:
# Perform lexicon-based sentiment analysis using VaderSentiment
# Description: Apply VaderSentiment to classify reviews as 'positive' or 'negative'.

analyzer = SentimentIntensityAnalyzer()

def vader_sentiment(review):
    # 'compound' is a normalized score in [-1, 1]; a neutral score of exactly 0 is counted as positive
    sentiment = analyzer.polarity_scores(review)
    return 'positive' if sentiment['compound'] >= 0 else 'negative'

# Apply VaderSentiment on the 'text' column
data['vader_sentiment'] = data['text'].apply(vader_sentiment)
print(data[['text', 'vader_sentiment']].head())


                                                text vader_sentiment
0  OK, the hype about having Hatch chili in your ...        positive
1  Pandemic pit stop to have an ice cream.... onl...        positive
2  I was lucky enough to go to the soft opening a...        positive
3  I've gone to claim Jumpers all over the US and...        positive
4  If you haven't been  to Maynard's kitchen, it'...        positive
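
The function above treats a compound score of exactly 0 as positive, so neutral reviews land in the positive class. A minimal sketch of a three-way split follows, assuming the +/-0.05 cutoffs suggested in the VADER documentation. `label_from_compound` and its `threshold` parameter are hypothetical names, not part of vaderSentiment.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Illustrative scores only, not taken from the dataset
for score in (0.62, 0.0, -0.31):
    print(score, label_from_compound(score))
```

Keeping a separate neutral class (or dropping neutral reviews) would change the class balance seen by the classifiers trained later in the notebook.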


Code Cell 3: Data Preparation for Machine Learning


In [4]:
# Data preparation for machine learning
# Description: Convert text to numerical features using CountVectorizer and split the dataset
# into training and testing sets (80% training, 20% testing).

from sklearn.preprocessing import LabelEncoder

# Convert review text into Bag of Words representation
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['text'].fillna('')).toarray()

# Encode sentiment labels into numerical values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['vader_sentiment'])  # positive -> 1, negative -> 0

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}, Testing set size: {len(X_test)}")


Training set size: 8000, Testing set size: 2000
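
The `positive -> 1, negative -> 0` mapping is not configured anywhere in the cell; it follows from `LabelEncoder` assigning integer codes to the class labels in sorted order. The sketch below mimics that rule in plain Python (it does not call scikit-learn), and `encode_labels` is a hypothetical helper.

```python
def encode_labels(labels):
    """Assign integer codes to labels in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return classes, [code_of[label] for label in labels]

classes, codes = encode_labels(['positive', 'negative', 'positive', 'positive'])
print(classes)  # ['negative', 'positive']
print(codes)    # [1, 0, 1, 1]
```

This is also why `target_names=['negative', 'positive']` in the later classification reports lines up with codes 0 and 1.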


Code Cell 4: Describe Processed Data

In [5]:
# Describe the processed data
# Description: Print statistics about the processed data.

# Vocabulary size; capped at 1000 by max_features, so this is not the true count of unique words
num_tokens = len(vectorizer.get_feature_names_out())
print(f"Number of tokens (unique words): {num_tokens}")

# Unique customers (if user_id exists)
unique_customers = data['user_id'].nunique() if 'user_id' in data.columns else 'N/A'
print(f"Number of unique customers: {unique_customers}")


Number of tokens (unique words): 1000
Number of unique customers: 6830
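
The count of 1000 reported above is the `max_features` cap rather than the number of distinct words in the reviews: `CountVectorizer` keeps only the most frequent terms across the corpus. A rough plain-Python sketch of that selection follows, assuming lowercase tokens of two or more word characters (the vectorizer's default tokenization). `top_k_vocabulary` and the sample reviews are made up for illustration, and tie-breaking may differ from scikit-learn.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")

def top_k_vocabulary(texts, k):
    """Return (number of distinct tokens, the k most frequent tokens)."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    kept = [token for token, _ in counts.most_common(k)]
    return len(counts), kept

reviews = [
    "The food was great and the service was great",
    "The food was cold",
    "Great patio, slow service",
]
n_distinct, kept = top_k_vocabulary(reviews, k=3)
print(n_distinct, kept)
```

The first returned value is the true number of distinct tokens, which is the figure a "unique words" statistic would normally report.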


Code Cell 5: Train and Evaluate Naive Bayes Classifier

In [6]:
# Train and evaluate Naive Bayes classifier
# Description: Train a Multinomial Naive Bayes model and evaluate its performance.

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)

print("Naive Bayes Performance:")
print(classification_report(y_test, nb_predictions, target_names=['negative', 'positive']))


Naive Bayes Performance:
              precision    recall  f1-score   support

    negative       0.55      0.75      0.63       302
    positive       0.95      0.89      0.92      1698

    accuracy                           0.87      2000
   macro avg       0.75      0.82      0.78      2000
weighted avg       0.89      0.87      0.88      2000
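
Accuracy in the report above should be read against the class balance: 1698 of the 2000 test reviews are labelled positive. A quick check of what a classifier that always predicts the majority class would score, using only the support counts printed in the report:

```python
# Support counts copied from the classification report above
support = {'negative': 302, 'positive': 1698}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Always predicting '{majority_class}': accuracy = {baseline_accuracy:.3f}")
```

That baseline is about 0.85, so an accuracy of 0.87 is a modest gain, and the per-class precision and recall rows are the more informative part of the report.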



Code Cell 6: Train and Evaluate SVM Classifier


In [7]:
# Train and evaluate SVM classifier
# Description: Train a Support Vector Machine (SVM) classifier with scikit-learn's default settings (RBF kernel) and evaluate its performance.

svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)

print("SVM Performance:")
print(classification_report(y_test, svm_predictions, target_names=['negative', 'positive']))


SVM Performance:
              precision    recall  f1-score   support

    negative       0.73      0.37      0.49       302
    positive       0.90      0.98      0.93      1698

    accuracy                           0.88      2000
   macro avg       0.81      0.67      0.71      2000
weighted avg       0.87      0.88      0.87      2000
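
The `macro avg` and `weighted avg` rows differ because the classes are imbalanced: macro averaging treats both classes equally, while weighted averaging scales each class by its support. The sketch below recomputes both from the rounded per-class figures printed above, so results can be off by about 0.01 from the report. `macro_and_weighted` is a hypothetical helper.

```python
def macro_and_weighted(values, supports):
    """Return (unweighted mean, support-weighted mean) of per-class scores."""
    macro = sum(values) / len(values)
    weighted = sum(v * s for v, s in zip(values, supports)) / sum(supports)
    return macro, weighted

supports = [302, 1698]      # negative, positive
precision = [0.73, 0.90]    # SVM per-class precision from the report
recall = [0.37, 0.98]       # SVM per-class recall from the report

print("precision:", macro_and_weighted(precision, supports))
print("recall:   ", macro_and_weighted(recall, supports))
```

Support-weighted recall is the same quantity as accuracy, which is why it tracks the 0.88 accuracy row, while macro recall sits near 0.67 and exposes the weak recall on the negative class.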



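Code Cell 7: Perform TF-IDF Vectorization

In [8]:
# Apply TF-IDF vectorization (the data is not reloaded; the same DataFrame is reused)
# Description: Convert text data to TF-IDF representation for Logistic Regression.
# The same random_state gives the same train/test rows as the bag-of-words split.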

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf_vectorizer.fit_transform(data['text'].fillna('')).toarray()
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42)
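
The cell above fits the TF-IDF vocabulary and IDF weights on all 10,000 reviews before splitting, so the test reviews influence the features. One common alternative, sketched here on a tiny made-up corpus rather than the Yelp data, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` so the vectorizer is fitted on the training split only. The texts, labels, and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

texts = [
    "great food and friendly staff", "delicious tacos, will return",
    "loved the patio and the drinks", "amazing service every time",
    "terrible service and cold food", "rude staff, never again",
    "bland food and dirty tables", "awful wait and bad coffee",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The vectorizer is fitted inside the pipeline, on the training texts only
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_y)

predictions = model.predict(test_texts)
print(len(train_texts), len(test_texts), predictions.shape)
```

A pipeline also accepts raw strings at prediction time and keeps the feature matrix sparse, so the `.toarray()` call is not needed.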


Code Cell 8: Train and Evaluate Logistic Regression Classifier


In [9]:
# Train and evaluate Logistic Regression classifier
# Description: Train a Logistic Regression model using TF-IDF features and evaluate its performance.

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train_tfidf)
lr_predictions = lr_model.predict(X_test_tfidf)

print("Logistic Regression Performance:")
print(classification_report(y_test_tfidf, lr_predictions, target_names=['negative', 'positive']))


Logistic Regression Performance:
              precision    recall  f1-score   support

    negative       0.78      0.53      0.63       302
    positive       0.92      0.97      0.95      1698

    accuracy                           0.91      2000
   macro avg       0.85      0.75      0.79      2000
weighted avg       0.90      0.91      0.90      2000



Text Cell 9: Compare and Conclude Observations

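Comparison:
1. VaderSentiment is fast and needs no training, but its fixed lexicon cannot adapt to domain-specific wording. Every classifier here is trained and scored on VADER's labels, so the metrics measure agreement with VADER, not with human-annotated sentiment.
2. Naive Bayes and SVM on bag-of-words counts reach similar accuracy (0.87 and 0.88) but differ sharply on the minority negative class: Naive Bayes recalls 0.75 of negatives at 0.55 precision, while SVM recalls only 0.37 at 0.73 precision.
3. Logistic Regression with TF-IDF features has the highest accuracy (0.91) and a negative-class F1 of 0.63, equal to Naive Bayes (0.63) and ahead of SVM (0.49). Because it also uses different features, the gain cannot be attributed to the model alone.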
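
Since the labels in this notebook come from VADER, the dataset's `stars` column offers an independent signal to check them against. The sketch below assumes a hypothetical rule (4 to 5 stars positive, 1 to 2 stars negative, 3 stars skipped) and uses made-up rows, not results from the dataset.

```python
def star_label(stars):
    """Hypothetical rule: 4-5 stars positive, 1-2 negative, 3 is ambiguous."""
    if stars >= 4:
        return 'positive'
    if stars <= 2:
        return 'negative'
    return None

def agreement_with_stars(rows):
    """rows: iterable of (stars, vader_label). Returns (agreeing, compared)."""
    agree = compared = 0
    for stars, vader_label in rows:
        expected = star_label(stars)
        if expected is None:
            continue  # skip 3-star reviews
        compared += 1
        agree += (expected == vader_label)
    return agree, compared

# Made-up (stars, vader_sentiment) pairs for illustration
rows = [(5, 'positive'), (1, 'positive'), (3, 'positive'),
        (2, 'negative'), (4, 'positive')]
agree, compared = agreement_with_stars(rows)
print(f"{agree} of {compared} comparable reviews agree")
```

On the real data the same comparison could be run over `zip(data['stars'], data['vader_sentiment'])`.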


Code Cell 10: Predict Sentiments for Input 2





In [25]:
# Predict sentiments for new reviews
# Description: Predict sentiment on new input reviews using the trained Logistic Regression model.

# Input new reviews for prediction
input_2_reviews = [
    "The service is good, but location is hard to find. Sanitation is not very good with old facilities. Food served tasted bad and extremely fishy, making us difficult to even finish it.",
    "The restaurant is definitely one of my favorites and of my family as well. The food is absolutely delicious!",
    "I appreciated the friendly staff. The food was good, not amazing. The service was not prompt but acceptable."
]

# Preprocess the new reviews and transform them using the TF-IDF vectorizer
input_2_features = tfidf_vectorizer.transform(input_2_reviews).toarray()

# Predict sentiments using the Logistic Regression model
input_2_predictions = lr_model.predict(input_2_features)

# Output the predictions with proper interpretation
print("Input 2 Predictions:")
for review, sentiment in zip(input_2_reviews, input_2_predictions):
    # LabelEncoder sorted the classes alphabetically: 0 = negative, 1 = positive
    mapped_sentiment = "Positive" if sentiment == 1 else "Negative"
    print(f"Review: {review}\nPredicted Sentiment: {mapped_sentiment}\n")


Input 2 Predictions:
Review: The service is good, but location is hard to find. Sanitation is not very good with old facilities. Food served tasted bad and extremely fishy, making us difficult to even finish it.
Predicted Sentiment: Negative

Review: The restaurant is definitely one of my favorites and of my family as well. The food is absolutely delicious!
Predicted Sentiment: Positive

Review: I appreciated the friendly staff. The food was good, not amazing. The service was not prompt but acceptable.
Predicted Sentiment: Positive



Code Cell 11: Classify and Comment

In [26]:

print("\nClassification Results and Reasoning:")
input_2_reviews = [
    "The service is good, but location is hard to find. Sanitation is not very good with old facilities. Food served bad and tasted extremely fishy, making us difficult to even finish it.",
    "The restaurant is definitely one of my favorites and of my family as well. The food is absolutely delicious!",
    "I appreciated the friendly staff. The food was good, not amazing. The service was not prompt but acceptable."
]

# Re-vectorize the reviews defined in this cell so the predictions match the printed text
input_2_features = tfidf_vectorizer.transform(input_2_reviews).toarray()
input_2_predictions = lr_model.predict(input_2_features)

# Hand-written commentary keyed by (review index, predicted label); printed only when the model agrees
reasoning = {
    (0, 'negative'): "The review highlights significant negative aspects such as poor sanitation, unappealing food, and hard-to-find location. These factors outweighed any positive comments.",
    (1, 'positive'): "The review is entirely positive, focusing on favorite status and delicious food, reflecting a strong positive sentiment.",
    (2, 'positive'): "Despite minor criticisms, the review's overall tone is positive, emphasizing friendly staff and acceptable service.",
}

# Decode numeric predictions back to 'negative' / 'positive'
predicted_labels = label_encoder.inverse_transform(input_2_predictions)

for i, (review, label) in enumerate(zip(input_2_reviews, predicted_labels)):
    print(f"Review {i+1}: {review}")
    print(f"Predicted Sentiment: {label.capitalize()}")
    if (i, label) in reasoning:
        print(f"Reasoning: {reasoning[(i, label)]}")
    print("\n")



Classification Results and Reasoning:
Review 1: The service is good, but location is hard to find. Sanitation is not very good with old facilities. Food served bad and tasted extremely fishy, making us difficult to even finish it.
Predicted Sentiment: Negative
Reasoning: The review highlights significant negative aspects such as poor sanitation, unappealing food, and hard-to-find location. These factors outweighed any positive comments.


Review 2: The restaurant is definitely one of my favorites and of my family as well. The food is absolutely delicious!
Predicted Sentiment: Positive
Reasoning: The review is entirely positive, focusing on favorite status and delicious food, reflecting a strong positive sentiment.


Review 3: I appreciated the friendly staff. The food was good, not amazing. The service was not prompt but acceptable.
Predicted Sentiment: Positive
Reasoning: Despite minor criticisms, the review's overall tone is positive, emphasizing friendly staff and acceptable service.

Code Cell 12: Acknowledgment

In [12]:
# Description: Acknowledge the use of GenAI tools and collaboration during the assignment.

print("\nAcknowledgment:")
print("This assignment was completed with the support of GenAI tools and collaboration with peers. GenAI assisted in refining coding practices, debugging, and clarifying explanations.")
print("Special thanks to the instructional team for their guidance and to classmates for their insightful discussions, which significantly contributed to the success of this project.")



Acknowledgment:
This assignment was completed with the support of GenAI tools and collaboration with peers. GenAI assisted in refining coding practices, debugging, and clarifying explanations.
Special thanks to the instructional team for their guidance and to classmates for their insightful discussions, which significantly contributed to the success of this project.
