<a href="https://colab.research.google.com/github/carlos-alves-one/-Amazon-Review-NLP/blob/main/Sentiment_Analysis_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Goldsmiths University of London
### MSc. Data Science and Artificial Intelligence
### Module: Natural Language Processing
### Author: Carlos Manuel De Oliveira Alves
### Student: cdeol003
### Coursework Project

# Data Collection

### Load the data

In [1]:
# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


Dataset source: https://www.kaggle.com/datasets/akudnaver/amazon-reviews-dataset

License: Unknown

In [2]:
# Import the pandas library and give it the alias 'pd' for data manipulation and analysis
import pandas as pd

# Load the dataset Amazon Review Details from Google Drive
data_path = '/content/drive/MyDrive/amazon_project/amazon-review-details.csv'
data = pd.read_csv(data_path)

# Display the first few rows of the dataframe
data.head(3).T


Unnamed: 0,0,1,2
report_date,2019-01-02,2019-01-03,2019-01-03
online_store,FRESHAMAZON,FRESHAMAZON,FRESHAMAZON
upc,8718114216478,5000184201199,5000184201199
retailer_product_code,B0142CI6FC,B014DFNNRY,B014DFNNRY
brand,Dove Men+Care,Marmite,Marmite
category,Personal Care,Foods,Foods
sub_category,Deos,Savoury,Savoury
product_description,Dove Men+Care Extra Fresh Anti-perspirant Deod...,Marmite Spread Yeast Extract 500g,Marmite Spread Yeast Extract 500g
review_date,2019-01-01,2019-01-02,2019-01-02
review_rating,5,5,4


# Data Preprocessing

The dataset contains multiple columns, but for our sentiment analysis, we will primarily focus on the 'review_rating' as our target variable and the text of the reviews for our feature.

**Tasks :**

- Select relevant columns ('review_rating' and the review text column).

- Handle missing values if necessary.

- Convert ratings to a binary sentiment (positive or negative).

- Preprocess the text data (tokenization, lowercasing, removing stop words, etc.).


## Import Libraries and Packages

In [3]:
# Importing the 'stopwords' collection from the nltk.corpus module
from nltk.corpus import stopwords

# Imports the regular expression module for pattern matching in strings
import re

# Importing the 'word_tokenize' and 'sent_tokenize' functions from nltk.tokenize for splitting text into words and sentences
from nltk.tokenize import word_tokenize, sent_tokenize

# Importing the nltk module, which is a suite of libraries for natural language processing
import nltk

# Downloading the 'punkt' tokenizer models, used by nltk for sentence tokenization
nltk.download('punkt')

# Downloading the 'stopwords' dataset, which contains lists of common stopwords in various languages
nltk.download('stopwords')

# Importing lemmatizer and stemmer for text normalization
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Importing WordNet, a lexical database for the English language
from nltk.corpus import wordnet

# Import Word2Vec model from gensim library
from gensim.models import Word2Vec

# Import NumPy for numerical and array operations
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Function for Cleaning & Preprocessing

In [4]:
# Build the English stopword set once so it is not rebuilt on every call
STOP_WORDS = set(stopwords.words('english'))

# Declare function for data cleaning and preprocessing
def preprocess_text(text):

    # Lowercasing
    text = text.lower()

    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Return the remaining tokens joined into a single space-separated string
    return ' '.join(tokens)
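
One detail of the cleaning order above is easier to see in isolation: because punctuation is stripped before tokenization, contractions and slash-joined words are fused rather than split, and stopword removal also discards negation words. The sketch below reproduces that behaviour with the standard library only. The tiny `STOP_WORDS_DEMO` set and the whitespace split are assumptions standing in for NLTK's English stopword list and `word_tokenize`, and `clean_then_split` is a hypothetical helper, not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stand-in for stopwords.words('english')
STOP_WORDS_DEMO = {"i", "no", "the", "it", "don't", "do", "not"}

def clean_then_split(text):
    """Mimic the cell above: lowercase, drop non-letters, split, filter stopwords."""
    text = text.lower()
    # Removing punctuation first turns "don't" into "dont" and "a/pits" into "apits"
    text = re.sub(r"[^a-z\s]", "", text)
    # Whitespace split used here as a stand-in for word_tokenize
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS_DEMO]

print(clean_then_split("I don't like it. No smelly a/pits!"))
# ['dont', 'like', 'smelly', 'apits']
```

The fused token `dont` survives the filter because the stopword set stores `don't`, while `no` is removed entirely. If negations matter for the downstream model, one option is to apply negation handling before stopword removal.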


## Preprocessing the Review Text

In [5]:
# Apply preprocessing to the review text
data['processed_reviews'] = data['review_text'].apply(preprocess_text)


## Create Column Binary Sentiment

In [6]:
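# Convert ratings to binary sentiment: 4-5 stars -> 1 (positive), 1-3 stars -> 0
# Note that neutral 3-star reviews are grouped with the negative class here
data['sentiment'] = data['review_rating'].apply(lambda x: 1 if x > 3 else 0)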
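
The rule above assigns 3-star reviews to class `0`. A minimal sketch of an alternative labelling, assuming neutral reviews should be excluded rather than counted as negative, is shown below. `rating_to_label` is a hypothetical helper introduced only for illustration.

```python
def rating_to_label(rating, neutral=3):
    """Return 1 for ratings above `neutral`, 0 for ratings below it, None for neutral."""
    if rating > neutral:
        return 1
    if rating < neutral:
        return 0
    return None  # the caller can drop these rows

ratings = [5, 4, 3, 2, 1]
labels = [rating_to_label(r) for r in ratings]

# Keep only the reviews that received a binary label
kept = [(r, label) for r, label in zip(ratings, labels) if label is not None]

print(labels)  # [1, 1, None, 0, 0]
print(kept)    # [(5, 1), (4, 1), (2, 0), (1, 0)]
```

Whether to drop neutral reviews or fold them into one of the classes is a modelling choice. The sketch only makes that choice explicit.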


## Display Columns Preprocessed

In [7]:
# Set the display option for max column width
pd.set_option('display.max_colwidth', None)

# Display the columns relevant to check results
print(data[['review_rating', 'review_text', 'processed_reviews', 'sentiment']].head(3).T)


                                                                                                                                                                                                                            0  \
review_rating                                                                                                                                                                                                               5   
review_text        As you get older, you know what you like and what is suitable for your body. I like all Dove products. Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin. No smelly a/pits!   
processed_reviews                                                                                       get older know like suitable body like dove products gives fresh wide awake feeling dandruff flakey skin smelly apits   
sentiment                                                                                           

The output shows that preprocessing reduces each review to its content words: the first review becomes `get older know like suitable body like dove products ...`. The three sampled reviews carry high ratings (5, 5 and 4), so each receives the sentiment label `1`, consistent with the `x > 3` rule. Two side effects are visible in the processed text: removing punctuation fuses `a/pits` into `apits`, and stopword removal drops negation words such as "no" (for example, "no dandruff" is reduced to `dandruff`). Negations are addressed separately in the Handling Negations section.

## Extensive Data Inspection

### Check Missing Values

> Check for missing values or inconsistent data entries

In [8]:
# Checking for missing values in 'review_rating' and 'review_text' columns
missing_values = data[['review_rating', 'review_text']].isnull().sum()

# Printing results in an aligned manner
print("Missing values in selected columns:")
for column, value in missing_values.items():
    print(f"{column:15}= {value}")


Missing values in selected columns:
review_rating  = 0
review_text    = 0


The check confirms that neither `review_rating` nor `review_text` contains missing values. No imputation or row removal is therefore needed before text cleaning and tokenization, and every review can be used for model training and evaluation.

In [9]:
# Assuming 'review_rating' should be between 1 and 5
# Checking for any ratings outside this range
invalid_ratings = data[(data['review_rating'] < 1) | (data['review_rating'] > 5)]

# Printing only the relevant columns: 'review_text' and 'review_rating'
print("Invalid ratings:\n", invalid_ratings[['review_text', 'review_rating']])


Invalid ratings:
 Empty DataFrame
Columns: [review_text, review_rating]
Index: []


All `review_rating` values fall within the expected range of 1 to 5, so no rows need to be corrected or removed. The ratings can be used directly or mapped to categorical sentiment labels in the following steps.

### Data Distribution

> Explore data distribution, such as the number of positive vs negative reviews.

In [10]:
# Define positive (e.g., ratings 4 and 5) and negative (e.g., ratings 1 and 2) reviews
data['review_sentiment'] = data['review_rating'].apply(lambda x: 'Positive' if x > 3 else ('Negative' if x < 3 else 'Neutral'))

# Count the number of positive vs. negative reviews
sentiment_distribution = data['review_sentiment'].value_counts()

print(sentiment_distribution)


Positive    2167
Negative     227
Neutral      107
Name: review_sentiment, dtype: int64


The dataset is heavily skewed toward positive reviews: 2,167 positive (about 86.6%), 227 negative (about 9.1%) and 107 neutral (about 4.3%) out of 2,501. This may reflect genuine customer satisfaction, a bias in how the reviews were collected, or both. Because of this imbalance, a model can score well simply by favouring the positive class, so the imbalance needs to be accounted for during training and evaluation to avoid a bias toward positive predictions.
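
The imbalance can be quantified directly from the counts printed above. The sketch below computes each class's share and a "balanced" class weight using the common heuristic `total / (n_classes * class_count)`. Whether class weighting is the right remedy for this project is an assumption, not something the notebook establishes.

```python
# Counts copied from the value_counts() output above
counts = {"Positive": 2167, "Negative": 227, "Neutral": 107}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset
proportions = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:9} share={proportions[label]:.3f} weight={class_weights[label]:.3f}")
```

With these counts the positive class receives a weight below 1, while the negative and neutral classes receive weights well above 1, which is the intended effect of the heuristic.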

## Text Normalization


### Lemmatization and Stemming

- Adding lemmatization and stemming. Lemmatization converts a word to its base form with a proper dictionary meaning, whereas stemming trims words to their root form, which might not be a valid word itself.

In [11]:
# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')  # Open Multilingual WordNet data used alongside WordNet
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [12]:
# Initialize the Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


In [13]:
# Defines a function to map NLTK part-of-speech tags to WordNet part-of-speech tags
def nltk_tag_to_wordnet_tag(nltk_tag):

    # Tags starting with 'J' are adjectives
    if nltk_tag.startswith('J'):
        return wordnet.ADJ

    # Tags starting with 'V' are verbs
    elif nltk_tag.startswith('V'):
        return wordnet.VERB

    # Tags starting with 'N' are nouns
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN

    # Tags starting with 'R' are adverbs
    elif nltk_tag.startswith('R'):
        return wordnet.ADV

    # Any other tag has no matching WordNet category
    else:
        return None


In [14]:
# Defines a function to lemmatize each word in a sentence
def lemmatize_sentence(sentence):

    # Tokenize the sentence into individual words
    words = word_tokenize(sentence)

    # Collect the lemmatized words in this list
    lemmatized_words = []

    # Loop through each word and its part-of-speech tag
    for word, tag in nltk.pos_tag(words):

        # Convert the NLTK POS tag into a WordNet POS tag
        wordnet_tag = nltk_tag_to_wordnet_tag(tag)

        if wordnet_tag is None:
            # If there's no corresponding WordNet tag, keep the word as is
            lemmatized_words.append(word)
        else:
            # If there is a corresponding WordNet tag, lemmatize the word
            lemmatized_words.append(lemmatizer.lemmatize(word, wordnet_tag))

    # Join the lemmatized words into a single string and return it
    return ' '.join(lemmatized_words)


In [15]:
# Defines a function to stem each word in a sentence
def stem_sentence(sentence):

    # Tokenize the sentence into individual words
    words = word_tokenize(sentence)

    # Apply the stemmer to each word
    stemmed_words = [stemmer.stem(word) for word in words]

    # Join the stemmed words into a single string and return it
    return ' '.join(stemmed_words)


In [16]:
# Apply Lemmatization and Stemming to the review_text column
data['lemmatized_review'] = data['review_text'].apply(lemmatize_sentence)
data['stemmed_review'] = data['review_text'].apply(stem_sentence)

# Display the first few rows to verify
print(data[['review_text', 'lemmatized_review', 'stemmed_review']].head())


                                                                                                                                                                                                  review_text  \
0  As you get older, you know what you like and what is suitable for your body. I like all Dove products. Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin. No smelly a/pits!   
1                             Three gigantic marmite jars that will last probably a whole life! What else would you possibly wish for? Order came in time, when mentioned, safely packed. Very happy with it.   
2                                                                                                                                                                                                   Excellent   
3                                                                                                                                                                  A

The results contrast lemmatization and stemming on the review texts. Lemmatization maps each word to its dictionary form, so the output stays readable and preserves the original meaning. Stemming strips suffixes more aggressively and often produces non-words, but it consolidates more word variants. Lemmatization is preferable for tasks requiring semantic accuracy, while stemming suits search and indexing applications where speed and matching word variations are prioritized. Both techniques reduce the number of unique words, which simplifies later text analysis.
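
The difference between the two approaches can be illustrated with a toy contrast. The code below is a sketch only: `toy_stem` is a crude suffix stripper and not the Porter algorithm, and `LEMMAS` is a hypothetical hand-written dictionary standing in for a WordNet lookup.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of rule-based trimming
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, provided at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary: a hypothetical stand-in for a WordNet lookup
LEMMAS = {"gives": "give", "products": "product", "feeling": "feel", "ponies": "pony", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["gives", "products", "feeling", "ponies", "was"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based stemmer produces fragments such as `giv` and `pon`, whereas the dictionary lookup returns valid words and can handle irregular forms such as `was`.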

### Handling Negations

Sometimes, negations (like "not bad") can be crucial for sentiment analysis. Define a strategy to handle such cases.

In [17]:
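# Define a function to handle negations
def handle_negations(text):

    # Pattern to identify a negation cue ("not", "no", "never", or a contraction
    # ending in "n't" such as "don't") followed by an alphabetic word.
    # \w+n't is needed because \b cannot match in the middle of "don't",
    # and IGNORECASE lets sentence-initial cues such as "No" match as well.
    negation_pattern = re.compile(r"\b(not|no|never|\w+n't)\s+([a-zA-Z]+)", re.IGNORECASE)

    # Replace the identified pattern with the combined form (e.g., "not_good")
    modified_text = negation_pattern.sub(lambda x: x.group(1) + '_' + x.group(2), text)

    return modified_text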


In [18]:
# Apply the function to the 'review_text' column
data['handled_negations'] = data['review_text'].apply(handle_negations)

# Display the first few rows to verify the changes
print(data[['review_text', 'handled_negations']].head())


                                                                                                                                                                                                  review_text  \
0  As you get older, you know what you like and what is suitable for your body. I like all Dove products. Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin. No smelly a/pits!   
1                             Three gigantic marmite jars that will last probably a whole life! What else would you possibly wish for? Order came in time, when mentioned, safely packed. Very happy with it.   
2                                                                                                                                                                                                   Excellent   
3                                                                                                                                                                  A

The `handle_negations` function merges each negation cue with the word that follows it (for example, "no dandruff" becomes "no_dandruff") and leaves the rest of the review unchanged. This matters for sentiment analysis because a negation can reverse the polarity of the following word: treating "not_bad" as a single token keeps it distinct from "bad". Only the word immediately after the cue is joined, so longer negated phrases are only partly captured.
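
A different and commonly used strategy is to mark every word inside the scope of a negation, rather than joining only the next word. The sketch below is an alternative technique, not the notebook's method: it appends `_NEG` to each word between a negation cue and the next punctuation mark. The function name, the cue list and the simple regex tokenizer are all assumptions chosen for illustration.

```python
import re

NEGATION_CUES = {"not", "no", "never"}
CLAUSE_END = re.compile(r"^[.,;:!?]$")

def mark_negation_scope(text):
    """Append _NEG to every word between a negation cue and the next punctuation mark."""
    # Simple tokenizer: words (optionally with an apostrophe part) and clause punctuation
    tokens = re.findall(r"[A-Za-z]+(?:'[a-z]+)?|[.,;:!?]", text)
    out, in_scope = [], False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            in_scope = False          # punctuation closes the negation scope
            out.append(tok)
        elif tok.lower() in NEGATION_CUES or tok.lower().endswith("n't"):
            in_scope = True           # a cue opens a new scope
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return " ".join(out)

print(mark_negation_scope("I don't like the smell, but the price is not bad at all."))
# I don't like_NEG the_NEG smell_NEG , but the price is not bad_NEG at_NEG all_NEG .
```

Scope marking gives the model separate features for negated and non-negated uses of a word, at the cost of a larger vocabulary.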

## Text Vectorization

Text vectorization converts text data into a numeric format suitable for machine learning models. Common approaches are TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings from models such as Word2Vec and BERT (🤗 Transformers).

### TF-IDF Vectorization


In [19]:
# Import TF-IDF Vectorizer from scikit-learn for text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract the 'review_text' column from the DataFrame to a variable for processing
texts = data['review_text']

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the text data and transform the texts into TF-IDF vectors
tfidf_vectors = tfidf_vectorizer.fit_transform(texts)
# tfidf_vectors is a sparse matrix with TF-IDF values. It can be used for machine learning models
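
To make the TF-IDF weights less opaque, the sketch below computes them by hand for three toy documents. It follows the smoothed formula documented for scikit-learn's defaults, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. The toy documents and the whitespace tokenization are assumptions, and the code is an illustration rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

docs = [
    "fresh smell fresh feeling",
    "no smell",
    "great flavour",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents in which each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Smoothed TF-IDF with L2 normalisation for one tokenised document."""
    tf = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in tokenized:
    print({term: round(weight, 3) for term, weight in tfidf(doc).items()})
```

In the first document, "fresh" receives the largest weight because it occurs twice and appears in no other document, while "smell" is down-weighted because it also appears in the second document.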


### BERT (🤗 Transformers) Embedding

In [20]:
# Import BERT tokenizer and model from the Hugging Face Transformers library for NLP tasks
from transformers import BertTokenizer, BertModel

# Import PyTorch, a deep learning library used for working with BERT and other transformers models
import torch

# Initialize BERT tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


> **BERT:** The operation can be very slow for large datasets due to BERT's computational complexity. Consider using a GPU available with Google Colab for faster results.

> **NVIDIA A100:** When available in Google Colab, the A100 is one of the most powerful GPUs offered on cloud platforms. Its higher compute throughput, larger memory, and higher memory bandwidth make it well suited to training and fine-tuning BERT models.

In [21]:
# Define a function to get BERT embeddings for a piece of text
def get_bert_embedding(text, tokenizer, model):

    # Tokenize the input text, converting it to tensors, with truncation and padding applied as needed
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True)

    # Disable gradient calculation to save memory and computations during inference
    with torch.no_grad():

        # Pass the tokenized inputs to the BERT model to obtain embeddings
        outputs = model(**inputs)

    # Calculate the mean of the last hidden state across the input sequence dimension to get a single embedding vector per input
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

    # Return the computed embeddings as a NumPy array
    return embeddings

# Apply the function to each row in the review_text column
data['bert_embedding'] = data['review_text'].apply(lambda x: get_bert_embedding(x, bert_tokenizer, bert_model))
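
The function above embeds one review at a time, so no padding tokens enter the mean. If reviews were batched instead, padded positions would be averaged in unless they are masked out. The sketch below shows mask-aware mean pooling on plain Python lists. `masked_mean` and the toy values are hypothetical, and with real model outputs the same idea would be applied to `last_hidden_state` using the tokenizer's `attention_mask`.

```python
def masked_mean(hidden_states, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            kept += 1
            total = [t + v for t, v in zip(total, vector)]
    if kept == 0:
        return total  # every position masked: return the zero vector
    return [t / kept for t in total]

# Toy sequence: two real tokens followed by one padding position
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(masked_mean(hidden_states, attention_mask))  # padding ignored: [2.0, 3.0]
print(masked_mean(hidden_states, [1, 1, 1]))       # padding included, mean is distorted
```

Batching with masked pooling is also one way to reduce the runtime of the row-by-row `.apply()` call, since the model then processes several reviews per forward pass.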


In [22]:
# Import the display function from IPython for displaying various types of content (e.g., DataFrames, images) in Jupyter notebooks
from IPython.display import display

# 'data' is already a pandas DataFrame, so it can be displayed directly
# Display the first few rows of the selected columns, transposed for readability
display(data[['review_text', 'bert_embedding']].head().transpose())


Unnamed: 0,0,1,2,3,4
review_text,"As you get older, you know what you like and what is suitable for your body. I like all Dove products. Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin. No smelly a/pits!","Three gigantic marmite jars that will last probably a whole life! What else would you possibly wish for? Order came in time, when mentioned, safely packed. Very happy with it.",Excellent,A great flavour top - up for slow cooking.,Does what is says it does
bert_embedding,"[-0.012894374, -0.033466015, 0.5775001, -0.008548661, 0.4159554, ...]","[0.13458304, 0.11581127, 0.3441332, -0.058032148, 0.22975887, ...]","[0.29201838, 0.2998735, -0.24443285, -0.016476939, -0.024466446, ...]","[-0.07421097, -0.40000704, 0.27019957, 0.16700542, -0.053395998, ...]","[0.31448126, -0.121554114, 0.08247549, -0.2760619, 0.3733873, ...]"


### Train Word2Vec model

In [23]:
# Extract the 'review_text' column from the DataFrame to a variable for processing
text_data = data['review_text']

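# Define a function to preprocess text for Word2Vec training
# Note: this redefines the earlier preprocess_text and returns a list of tokens instead of a string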
def preprocess_text(text):

    # Convert all characters in the text to lowercase to ensure uniformity
    text = text.lower()

    # Remove all non-alphabetic characters, keeping spaces
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize the cleaned text into individual words
    tokens = word_tokenize(text)

    # Retrieve the set of English stopwords
    stop_words = set(stopwords.words('english'))

    # Filter out stopwords from tokens
    tokens = [word for word in tokens if word not in stop_words]

    # Return the list of filtered tokens
    return tokens

# Apply the preprocessing function to each sentence in the text data
preprocessed_text = [preprocess_text(sentence) for sentence in text_data]

# Train a Word2Vec model with the preprocessed text
word2vec_model = Word2Vec(sentences=preprocessed_text, vector_size=100, window=5, min_count=1, workers=4)
# word2vec_model can now be used to vectorize text, as shown in the next cell


### Use Word2Vec model

In [24]:
# Define a function to vectorize text using a trained Word2Vec model
def vectorize_text_with_word2vec(text, word2vec_model):
    """
    This function vectorizes a piece of text using a trained Word2Vec model by averaging the vectors of the words in the text
    """
    # Check if the input text is a string
    if not isinstance(text, str):

        # Return None if input is not a string to indicate error
        return None

    # Tokenize the text after converting it to lowercase
    tokens = word_tokenize(text.lower())

    # Filter tokens to ensure they are in the Word2Vec model's vocabulary
    tokens = [token for token in tokens if token in word2vec_model.wv.key_to_index]

    # Check if there are no valid tokens after filtering
    if not tokens:

        # Return a zero vector of the model's vector size
        return np.zeros(word2vec_model.vector_size)

    # Calculate the mean vector by averaging the vectors of the tokens
    vector = np.mean([word2vec_model.wv[token] for token in tokens], axis=0)

    # Return the averaged vector representation of the text
    return vector

# Extract the 'review_text' column from the DataFrame to a variable for processing
texts = data['review_text']

# Use .apply() to vectorize each review in the Series
vectorized_texts = texts.apply(lambda t: vectorize_text_with_word2vec(t, word2vec_model))
# `vectorized_texts` now contains the vectorized representation of each review in the Series.

print(vectorized_texts)  # Print the vectorized texts for review


0                                 [-0.17983025, 0.20900096, 0.18007076, -0.05505056, 0.13689703, -0.56372046, 0.05483028, 0.79243857, -0.0848952, -0.19619247, -0.2027682, -0.53418994, -0.015119323, 0.24313831, 0.011011173, -0.16380066, 0.028381906, -0.33081722, 0.020336207, -0.524195, 0.18564872, 0.07513388, 0.008626162, -0.08220255, -0.17024049, 0.08738792, -0.31457472, -0.09444319, -0.2913075, -0.045058113, 0.18455759, 0.02857143, 0.24496627, -0.11922023, -0.19600087, 0.3067287, 0.035166055, -0.34513077, -0.3535444, -0.5681062, 0.007466544, -0.23729226, -0.15130335, -0.09110387, 0.43542117, -0.09348773, -0.328953, -0.026732627, 0.16996785, 0.21089205, 0.10342223, -0.27777818, -0.20209214, -0.04069887, -0.11681548, 0.12199128, 0.19435221, 0.032245312, -0.33414918, 0.3375416, 0.014825983, 0.08989349, -0.1278139, 0.06448219, -0.44195303, 0.2058267, 0.16518809, 0.19056793, -0.40702692, 0.5773715, -0.30081323, 0.13477391, 0.2655276, -0.06968811, 0.3601037, 0.17044002, -0.050536197, -0.138
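
Averaged vectors like these are typically compared with cosine similarity. The sketch below uses a hypothetical three-dimensional embedding table in place of `word2vec_model.wv` to show the averaging step, the zero-vector fallback for unknown words, and the comparison. All names and values are made up for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for word2vec_model.wv
TOY_VECTORS = {
    "fresh": [0.9, 0.1, 0.0],
    "clean": [0.8, 0.2, 0.1],
    "late":  [0.0, 0.1, 0.9],
}

def average_vector(tokens, vectors, size=3):
    """Mean of the known tokens' vectors; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(a, b):
    """Cosine similarity, defined here as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

review_a = average_vector(["fresh", "clean"], TOY_VECTORS)
review_b = average_vector(["clean"], TOY_VECTORS)
review_c = average_vector(["late", "unknownword"], TOY_VECTORS)

print(cosine(review_a, review_b))  # similar wording, similarity close to 1
print(cosine(review_a, review_c))  # different wording, similarity close to 0
```

The explicit zero check matters because reviews with no in-vocabulary tokens are mapped to a zero vector, for which cosine similarity is otherwise undefined.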

# Advanced Text Preprocessing and Feature Engineering



## Custom Stopword Removal

In [25]:
# Define a function that removes custom stopwords from a given text
def remove_custom_stopwords(text, custom_stopwords):

    # Tokenize the input text into individual words
    tokens = word_tokenize(text)

    # Create a list of tokens that are not in the custom stopwords list, case-insensitively
    filtered_tokens = [word for word in tokens if word.lower() not in custom_stopwords]

    # Join the filtered tokens back into a string and return it
    return ' '.join(filtered_tokens)

# Define a list of additional stopwords based on domain knowledge or frequent but less informative words
custom_stopwords = ['amazon', 'product', 'really', 'like', 'would', 'buy']

# Apply the custom stopword removal function to each review in the dataset, updating the 'processed_reviews' column
data['processed_reviews'] = data['review_text'].apply(lambda x: remove_custom_stopwords(x, custom_stopwords))
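
To show what the filter does to a single review, here is a small self-contained sketch. It assumes a simple regex tokenizer in place of `word_tokenize` and stores the stopwords in a set rather than a list; the helper name `remove_stopwords_sketch` is hypothetical.

```python
import re

def remove_stopwords_sketch(text, stopwords):
    """Drop tokens whose lowercase form is in `stopwords`."""
    # Simple regex tokenizer: an assumption, standing in for word_tokenize
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tok for tok in tokens if tok.lower() not in stopwords)

# A set gives constant-time membership checks, unlike a list
custom_stopwords = {"amazon", "product", "really", "like", "would", "buy"}

print(remove_stopwords_sketch("I really like this Amazon product!", custom_stopwords))
```

Note that removing 'like' and 'would' can strip sentiment-bearing phrases such as "would not buy", so the stopword list may deserve a second look before modelling.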


## Synonym Handling

In [26]:
# Define a function that replaces each word with the first lemma of its first WordNet synset to reduce the feature space
def replace_synonyms(text):

    # Tokenize the input text into individual words
    tokens = word_tokenize(text)

    # Initialize an empty list to hold the new tokens after synonym replacement
    new_tokens = []

    for word in tokens:

        # Retrieve the list of WordNet synsets for the current word
        synonyms = wordnet.synsets(word)

        if synonyms:
            # Use the first lemma of the first synset as the canonical replacement
            most_common_synonym = synonyms[0].lemmas()[0].name()

            # Add the replacement to the new tokens list
            new_tokens.append(most_common_synonym)

        else:
            # If no synsets are found, keep the original word
            new_tokens.append(word)

    # Join the new tokens back into a string and return it
    return ' '.join(new_tokens)

# Apply synonym replacement to the already filtered 'processed_reviews' column so the custom stopword removal is not overwritten
data['processed_reviews'] = data['processed_reviews'].apply(replace_synonyms)
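
The effect of collapsing synonyms on vocabulary size can be seen on a toy example. This sketch assumes a hypothetical hand-written lookup `toy_canonical` in place of WordNet and a hypothetical helper `replace_with_canonical`; it illustrates the idea only, not WordNet's actual output.

```python
# Hypothetical canonical-form lookup standing in for WordNet (assumption)
toy_canonical = {
    "great": "good",
    "excellent": "good",
    "awful": "bad",
    "terrible": "bad",
}

def replace_with_canonical(text, canonical):
    """Map each whitespace-separated token to its canonical form, if it has one."""
    tokens = text.lower().split()
    return " ".join(canonical.get(tok, tok) for tok in tokens)

reviews = ["Great taste", "Excellent taste", "Awful smell", "Terrible smell"]
mapped = [replace_with_canonical(r, toy_canonical) for r in reviews]

# Compare vocabulary size before and after the mapping
vocab_before = {tok for r in reviews for tok in r.lower().split()}
vocab_after = {tok for r in mapped for tok in r.split()}

print(mapped)
print(len(vocab_before), "->", len(vocab_after))
```

With WordNet, the first synset is chosen without regard to part of speech or context, so some replacements may change the meaning of a review. Spot-checking a sample of rewritten reviews is a reasonable safeguard.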


## N-Grams

In [27]:
# Import CountVectorizer for text feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with n-gram range to include both unigrams and bigrams for richer text representation
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Transform the processed reviews into a sparse matrix of token counts, capturing both unigram and bigram frequencies
X = vectorizer.fit_transform(data['processed_reviews'])
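
To make concrete what `ngram_range=(1, 2)` counts, the sketch below builds unigram and bigram counts by hand for one sentence. It is an approximation under stated assumptions: the regex mimics CountVectorizer's default lowercasing and its rule of keeping tokens with two or more word characters, and the helper name `unigram_bigram_counts` is hypothetical.

```python
import re
from collections import Counter

def unigram_bigram_counts(text):
    """Count unigrams and bigrams, roughly mirroring ngram_range=(1, 2)."""
    # Lowercase and keep tokens of two or more word characters, similar to
    # CountVectorizer's default token pattern
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Pair each token with its successor to form bigrams
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

counts = unigram_bigram_counts("not good, not good at all")
print(counts)
```

The bigram 'not good' becomes its own feature, which lets a linear model treat negation differently from the unigram 'good' alone. The cost is a much larger feature space.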


## Feature Scaling

In [30]:
# Import StandardScaler for feature scaling
from sklearn.preprocessing import StandardScaler

# Initialize a StandardScaler that scales features to unit variance; with_mean=False skips centering, which is not supported for sparse matrices
scaler = StandardScaler(with_mean=False)

# Apply the scaler to the TF-IDF vectors, scaling each feature to unit variance without centering it
tfidf_scaled = scaler.fit_transform(tfidf_vectors)
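
The effect of `with_mean=False` can be sketched in plain Python: each column is divided by its standard deviation, but the mean is not subtracted, so zero entries stay zero. The helper `scale_columns_without_centering` and the matrix `toy_matrix` are hypothetical and only approximate what the scaler does on a dense toy input.

```python
import math

def scale_columns_without_centering(rows):
    """Divide each column by its population standard deviation; do not subtract the mean."""
    n_rows = len(rows)
    scaled_columns = []
    for column in zip(*rows):
        mean = sum(column) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_rows)
        # Leave constant columns unchanged to avoid dividing by zero
        divisor = std if std > 0 else 1.0
        scaled_columns.append([v / divisor for v in column])
    # Transpose back to row-major layout
    return [list(row) for row in zip(*scaled_columns)]

toy_matrix = [
    [0.0, 2.0],
    [0.0, 0.0],
    [4.0, 0.0],
    [0.0, 2.0],
]
scaled = scale_columns_without_centering(toy_matrix)
print(scaled)
```

Because zeros are preserved, a sparse TF-IDF matrix stays sparse after scaling. Subtracting the mean would turn almost every entry non-zero.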


# Model Building and Evaluation

In [None]:
# Import train_test_split to divide data into training and testing sets
from sklearn.model_selection import train_test_split

# Import LogisticRegression for logistic regression modeling
from sklearn.linear_model import LogisticRegression

# Import SVC for Support Vector Machine classification
from sklearn.svm import SVC

# Import RandomForestClassifier for random forest modeling
from sklearn.ensemble import RandomForestClassifier

# Import metrics for model evaluation
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Split the standardized TF-IDF vectors and sentiment labels into training and testing sets with a 20% test size
X_train, X_test, y_train, y_test = train_test_split(tfidf_scaled, data['sentiment'], test_size=0.2, random_state=42)

# Initialize a list of tuples where each tuple contains a model name and its corresponding initialized object
models = [

    # Logistic regression model
    ('Logistic Regression', LogisticRegression()),

    # Linear Support Vector Machine model
    ('Support Vector Machine', SVC(kernel='linear')),

    # Random Forest model
    ('Random Forest', RandomForestClassifier())
]

