# Lab Assignment 2 - Text Analytics CISB5123

Sentiment analysis is the process of classifying the content of documents as positive, negative,
or neutral. In this assignment, we explore sentiment classification using the Amazon
Fine Food Reviews dataset.

#### Team members:
1. Abdul Hakiim bin Ahmad Rosli (SW01081337)
2. Muhammad Bazly bin Burhan (SW01081224)

### Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [2]:
# Load the dataset
df = pd.read_csv('Reviews.csv')

# Preview the head of the data
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


#### Removing HTML tags and Unwanted characters
It is important to clean the text data by removing the HTML tags and any unwanted characters before proceeding with further analysis.

In [3]:
def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

# Apply the clean_text function to the 'Text' column
df['Text'] = df['Text'].apply(clean_text)

# View the cleaned text data
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | i have bought several of the vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | product arrived labeled as jumbo salted peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | this is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | if you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | great taffy at a great price there was a wide ... |
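
One side effect of the cleaning function above is worth checking. Because HTML tags are replaced with an empty string, words separated only by a tag such as `<br />` are fused into a single token. The sketch below compares the notebook's steps with a hypothetical variant, `clean_text_spaced` (not part of the original notebook), that substitutes a space instead. The sample review string is made up for illustration.

```python
import re

def clean_text_original(text: str) -> str:
    """Same steps as the notebook's clean_text (tags replaced with '')."""
    text = re.sub(r'<.*?>', '', text)        # remove HTML tags
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # keep letters and whitespace only
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_spaced(text: str) -> str:
    """Hypothetical variant: tags and URLs become a space so neighbours stay apart."""
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample containing an HTML line break between two sentences
sample = "Great taffy!<br />Would buy again."

fused = clean_text_original(sample)
separated = clean_text_spaced(sample)
print(fused)      # great taffywould buy again
print(separated)  # great taffy would buy again
```

Switching to the spaced variant would change the vocabulary and therefore every downstream figure in this notebook, so the later cells would need to be rerun before comparing results.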


In [4]:
# Check for missing values
df.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               26
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [5]:
# Select relevant columns for sentiment analysis
# (.copy() avoids pandas' SettingWithCopyWarning when new columns are added later)
df = df[['Score', 'Text']].copy()

#### Tokenizing Text

In [6]:
nltk.download('punkt')

# Tokenize the text into individual words
df['Tokens'] = df['Text'].apply(lambda x: nltk.word_tokenize(x))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
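# Download the WordNet corpus used by the lemmatizer
nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens. With no POS tag, lemmatize() treats every word as a noun,
# which is why "as", "has" and "was" appear as "a", "ha" and "wa" in the preview below.
df['Tokens'] = df['Tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])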

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# Join the tokens back into sentences
df['Preprocessed_Text'] = df['Tokens'].apply(lambda x: ' '.join(x))

# Save the preprocessed data to a new CSV file
df.to_csv('preprocessed_amazon_reviews.csv', index=False)

# Preview the preprocessed data
df.head()

| | Score | Text | Tokens | Preprocessed_Text |
|---|---|---|---|---|
| 0 | 5 | i have bought several of the vitality canned d... | [i, have, bought, several, of, the, vitality, ... | i have bought several of the vitality canned d... |
| 1 | 1 | product arrived labeled as jumbo salted peanut... | [product, arrived, labeled, a, jumbo, salted, ... | product arrived labeled a jumbo salted peanuts... |
| 2 | 4 | this is a confection that has been around a fe... | [this, is, a, confection, that, ha, been, arou... | this is a confection that ha been around a few... |
| 3 | 2 | if you are looking for the secret ingredient i... | [if, you, are, looking, for, the, secret, ingr... | if you are looking for the secret ingredient i... |
| 4 | 5 | great taffy at a great price there was a wide ... | [great, taffy, at, a, great, price, there, wa,... | great taffy at a great price there wa a wide a... |


### Feature Extraction
This section extracts features from the preprocessed text using two techniques: Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### Bag-of-Words (BoW)

In [10]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the preprocessed text data
bow_features = vectorizer.fit_transform(df['Preprocessed_Text'])

# Get the vocabulary (unique words)
vocabulary = vectorizer.get_feature_names_out()

# Print the shape of the BoW features and the vocabulary size
print("BoW feature shape:", bow_features.shape)
print("Vocabulary size:", len(vocabulary))

BoW feature shape: (568454, 298598)
Vocabulary size: 298598


#### Term Frequency-Inverse Document Frequency (TF-IDF)

In [11]:
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the preprocessed text data
tfidf_features = tfidf_vectorizer.fit_transform(df['Preprocessed_Text'])

# Get the vocabulary (unique words)
tfidf_vocabulary = tfidf_vectorizer.get_feature_names_out()

# Print the shape of the TF-IDF features and the vocabulary size
print("TF-IDF feature shape:", tfidf_features.shape)
print("Vocabulary size:", len(tfidf_vocabulary))

TF-IDF feature shape: (568454, 298598)
Vocabulary size: 298598
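
The TF-IDF matrix has the same shape as the BoW matrix because only the cell values change. As a sketch of what those values are, the code below recomputes them by hand on a made-up toy corpus, assuming scikit-learn's default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. It then checks the result against `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for this illustration
docs = ["great taffy great price", "not as advertised", "great price"]

# Step 1: raw term counts (the BoW matrix)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Step 2: smoothed inverse document frequency per term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Step 3: weight the counts and L2-normalize every document row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library's result under default parameters
library = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, library)
print("Manual computation matches TfidfVectorizer:", matches)
```

Words that occur in many documents (here "great" and "price") receive a smaller IDF than words confined to one document, so they contribute less to each row after weighting.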


### Model Selection

In [12]:
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

#### 1. Lexicon-based approach using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner):

In [13]:
# Download the VADER lexicon
nltk.download('vader_lexicon')

# Assign sentiment labels based on the 'Score' column
def assign_sentiment(score):
    if score >= 4:
        return 'Positive'
    elif score <= 2:
        return 'Negative'
    else:
        return 'Neutral'

df['Sentiment'] = df['Score'].apply(assign_sentiment)

# Initialize the sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Calculate sentiment scores for each review
df['Lexicon_Sentiment'] = df['Preprocessed_Text'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Map sentiment scores to labels
df['Lexicon_Sentiment_Label'] = df['Lexicon_Sentiment'].apply(lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral'))

# Evaluate the lexicon-based approach
lexicon_accuracy = accuracy_score(df['Sentiment'], df['Lexicon_Sentiment_Label'])
print("Lexicon-based Approach Accuracy:", lexicon_accuracy)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Lexicon-based Approach Accuracy: 0.7992308964313735
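
The cell above labels any compound score above 0 as Positive and any score below 0 as Negative, so Neutral is assigned only when the score is exactly 0. VADER's documentation suggests a wider neutral band, with cutoffs at +0.05 and -0.05. The sketch below shows both mappings side by side. The helper names `notebook_rule` and `compound_to_label` are hypothetical, and whether the wider band improves accuracy on this dataset has not been tested here.

```python
def notebook_rule(compound: float) -> str:
    """Mapping used in the cell above: Neutral only when the score is exactly 0."""
    if compound > 0:
        return 'Positive'
    if compound < 0:
        return 'Negative'
    return 'Neutral'

def compound_to_label(compound: float, threshold: float = 0.05) -> str:
    """Alternative mapping with a neutral band of +/- threshold around 0."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

# Example compound scores, chosen by hand to show where the two rules differ
for score in (0.02, -0.03, 0.05, -0.6):
    print(score, notebook_rule(score), compound_to_label(score))
```

VADER also uses punctuation and capitalization as intensity cues, so scoring the raw review text rather than `Preprocessed_Text` is another variation that may be worth trying.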


#### 2. Machine learning-based approaches:

In [14]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['Preprocessed_Text'], df['Sentiment'], test_size=0.2, random_state=42)

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit and transform the training data
X_train_tfidf = tfidf.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = tfidf.transform(X_test)

# Train and evaluate Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)
nb_predictions = nb_classifier.predict(X_test_tfidf)
nb_accuracy = accuracy_score(y_test, nb_predictions)
print("Naive Bayes Accuracy:", nb_accuracy)
print(classification_report(y_test, nb_predictions))

# Train and evaluate SVM classifier
svm_classifier = LinearSVC()
svm_classifier.fit(X_train_tfidf, y_train)
svm_predictions = svm_classifier.predict(X_test_tfidf)
svm_accuracy = accuracy_score(y_test, svm_predictions)
print("SVM Accuracy:", svm_accuracy)
print(classification_report(y_test, svm_predictions))

Naive Bayes Accuracy: 0.7877052713055563
              precision    recall  f1-score   support

    Negative       0.95      0.03      0.06     16181
     Neutral       0.50      0.00      0.00      8485
    Positive       0.79      1.00      0.88     89025

    accuracy                           0.79    113691
   macro avg       0.75      0.34      0.32    113691
weighted avg       0.79      0.79      0.70    113691

SVM Accuracy: 0.8920407068281571
              precision    recall  f1-score   support

    Negative       0.78      0.75      0.76     16181
     Neutral       0.71      0.30      0.43      8485
    Positive       0.92      0.97      0.94     89025

    accuracy                           0.89    113691
   macro avg       0.80      0.68      0.71    113691
weighted avg       0.88      0.89      0.88    113691
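
Because the classes are heavily imbalanced, accuracy is easier to read next to a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The sketch below derives that baseline from the support counts printed in the reports above. The accuracy values are copied from the same output.

```python
# Test-set support counts copied from the classification reports above
support = {"Negative": 16181, "Neutral": 8485, "Positive": 89025}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Accuracies copied from the notebook output above
nb_accuracy = 0.7877052713055563
svm_accuracy = 0.8920407068281571

print(f"Majority-class baseline ({majority_class}): {baseline_accuracy:.4f}")
print(f"Naive Bayes gain over baseline: {nb_accuracy - baseline_accuracy:+.4f}")
print(f"SVM gain over baseline:         {svm_accuracy - baseline_accuracy:+.4f}")
```

Read this way, Naive Bayes improves on the baseline by less than half a percentage point, while the SVM improves on it by roughly eleven points.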



### Discussion

In this discussion, we will analyze the strengths and weaknesses of the selected models for sentiment classification based on the experimental results and the characteristics of the Amazon Fine Food Reviews dataset.

1. Lexicon-based Approach (VADER):
    - Accuracy: The lexicon-based approach using VADER achieved an accuracy of 0.7992, which means it correctly classified approximately 79.92% of the reviews. This figure is computed over all 568,454 reviews, whereas the two machine learning models are scored on a 20% held-out test set, so the numbers are not measured on identical data.
    - Strengths:
        - Interpretability: The lexicon-based approach is easily interpretable as it relies on predefined sentiment lexicons and rules to assign sentiment scores to words and phrases.
        - Efficiency: Lexicon-based methods are computationally efficient and can provide quick sentiment predictions without the need for extensive training.
    - Weaknesses:
        - Limited coverage: The performance of lexicon-based methods depends on the comprehensiveness of the sentiment lexicon used. If the lexicon does not cover domain-specific or colloquial language, the sentiment predictions may be less accurate.
        - Inability to capture context: Lexicon-based approaches often struggle with understanding the context and complex language patterns, such as sarcasm, irony, or negation, which can lead to misclassifications.

2. Naive Bayes Classifier:
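    - Accuracy: The Naive Bayes classifier achieved an accuracy of 0.7877, correctly classifying approximately 78.77% of the reviews. This is only marginally above the 78.30% that would be obtained by always predicting Positive (89,025 of the 113,691 test reviews).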
    - Strengths:
        - Simplicity: Naive Bayes is a simple and intuitive probabilistic model that is easy to implement and understand.
        - Efficiency: Naive Bayes is computationally efficient and can handle large datasets with high-dimensional features.
    - Weaknesses:
        - Independence assumption: Naive Bayes assumes that the features (words) are conditionally independent given the sentiment class, which may not always hold true in natural language.
        - Imbalanced performance: As seen in the classification report, Naive Bayes performs well in predicting the majority class (Positive, recall 1.00) but almost never predicts the minority classes, with a recall of 0.03 for Negative and 0.00 for Neutral and F1-scores of 0.06 and 0.00 respectively.

3. Support Vector Machine (SVM) Classifier:
    - Accuracy: The SVM classifier achieved an accuracy of 0.8920, correctly classifying approximately 89.20% of the reviews, outperforming the other two models.
    - Strengths:
        - High accuracy: SVM demonstrates superior performance in sentiment classification, achieving the highest accuracy among the selected models.
        - Handles high-dimensional data: SVM can effectively handle high-dimensional feature spaces, making it suitable for text classification tasks.
        - More balanced performance: The classification report shows that SVM reaches an F1-score of 0.76 on Negative and 0.94 on Positive, far ahead of Naive Bayes on the minority classes. The Neutral class remains weak, however, with a recall of 0.30 and an F1-score of 0.43.
    - Weaknesses:
        - Interpretability: SVM predictions are harder to explain than lexicon scores. A linear SVM such as LinearSVC does expose one weight per word, but with a very large vocabulary these weights are not easy to inspect.
        - Computational complexity: Training an SVM model can be computationally expensive on large datasets, especially with non-linear kernel functions. The LinearSVC used here is a linear model, which keeps training tractable at this scale.

Based on the experimental results, the SVM classifier emerges as the best-performing model for sentiment classification on the Amazon Fine Food Reviews dataset. It achieves the highest accuracy and is clearly more balanced across sentiment classes than the Naive Bayes classifier, although its Neutral recall of 0.30 shows that the class imbalance still affects it.

However, it is important to note that the choice of the best model depends on various factors such as the specific requirements of the application, the trade-off between accuracy and interpretability, and the computational resources available.

The lexicon-based approach, despite its lower accuracy, offers the advantage of interpretability and efficiency, making it suitable for scenarios where quick sentiment predictions are needed, and the underlying reasoning is important.

On the other hand, if the primary goal is to achieve high accuracy and handle complex language patterns, the SVM classifier would be the preferred choice, especially if computational resources are not a constraint.

It is also worth mentioning that the performance of these models can be further improved by experimenting with different feature extraction techniques, hyperparameter tuning, and ensemble methods.

Additionally, the dataset's characteristics should be considered. The Amazon Fine Food Reviews dataset contains a large number of reviews, with an imbalance in the sentiment class distribution. Addressing class imbalance through techniques like oversampling, undersampling, or class weighting can potentially enhance the models' performance, particularly for the minority classes.
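
Class weighting is the simplest of these options to try with the classifier already used in this notebook. The sketch below is a minimal illustration on a tiny made-up corpus, not a result for the real dataset. It computes the "balanced" weights by hand with the formula n_samples / (n_classes * class_count), then passes `class_weight="balanced"` to `LinearSVC` so that scikit-learn applies the same weighting during training.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, imbalanced corpus invented for this illustration
texts = [
    "great taffy great price", "love this coffee", "delicious and fresh",
    "best dog food ever", "tasty snack will buy again", "wonderful flavor",
    "stale and terrible", "arrived broken and awful",
    "it is okay nothing special",
]
labels = ["Positive"] * 6 + ["Negative"] * 2 + ["Neutral"]

# Balanced weight per class: n_samples / (n_classes * class_count)
counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)
balanced_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(balanced_weights)  # rarer classes receive larger weights

# Same idea applied inside the classifier
model = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(class_weight="balanced", random_state=42),
)
model.fit(texts, labels)
prediction = model.predict(["stale and terrible"])[0]
print(prediction)
```

On the real data, the effect should be judged on per-class recall and macro-averaged F1 rather than on accuracy alone, since reweighting typically trades some majority-class accuracy for better minority-class recall.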

In conclusion, the SVM classifier demonstrates the best performance for sentiment classification on the given dataset, offering the highest accuracy and more balanced per-class results than Naive Bayes, with the Neutral class remaining its main weakness. However, the choice of the most suitable model depends on the specific requirements and constraints of the application, and further improvements can be explored through advanced techniques and dataset-specific considerations.