<a href="https://colab.research.google.com/github/carlos-alves-one/-Amazon-Review-NLP/blob/main/Sentiment_Analysis_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Goldsmiths University of London
### MSc. Data Science and Artificial Intelligence
### Module: Natural Language Processing
### Author: Carlos Manuel De Oliveira Alves
### Student: cdeol003
### Coursework Project

# Data Collection

### Load the data

In [18]:
# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Dataset source: https://www.kaggle.com/datasets/akudnaver/amazon-reviews-dataset

License: Unknown

In [19]:
# Import the pandas library and give it the alias 'pd' for data manipulation and analysis
import pandas as pd

# Load the dataset Amazon Review Details from Google Drive
data_path = '/content/drive/MyDrive/amazon_project/amazon-review-details.csv'
data = pd.read_csv(data_path)

# Display the first few rows of the dataframe
data.head(3).T


Unnamed: 0,0,1,2
report_date,2019-01-02,2019-01-03,2019-01-03
online_store,FRESHAMAZON,FRESHAMAZON,FRESHAMAZON
upc,8718114216478,5000184201199,5000184201199
retailer_product_code,B0142CI6FC,B014DFNNRY,B014DFNNRY
brand,Dove Men+Care,Marmite,Marmite
category,Personal Care,Foods,Foods
sub_category,Deos,Savoury,Savoury
product_description,Dove Men+Care Extra Fresh Anti-perspirant Deodorant Aerosol 250ml,Marmite Spread Yeast Extract 500g,Marmite Spread Yeast Extract 500g
review_date,2019-01-01,2019-01-02,2019-01-02
review_rating,5,5,4


# Data Preprocessing

The dataset contains multiple columns, but for our sentiment analysis, we will primarily focus on the 'review_rating' as our target variable and the text of the reviews for our feature.

**Tasks:**

- Select relevant columns ('review_rating' and the review text column).

- Handle missing values if necessary.

- Convert ratings to a binary sentiment (positive for ratings above 3, negative otherwise).

- Preprocess the text data (tokenization, lowercasing, removing stop words, etc.).


## Import Libraries and Packages

In [20]:
# Importing the 'stopwords' collection from the nltk.corpus module
from nltk.corpus import stopwords

# Importing the 're' module for regular expression operations
import re

# Importing the 'word_tokenize' function from nltk.tokenize for tokenizing strings into words
from nltk.tokenize import word_tokenize

# Importing the nltk module, which is a suite of libraries for natural language processing
import nltk

# Downloading the 'punkt' tokenizer models, used by nltk for sentence tokenization
nltk.download('punkt')

# Downloading the 'stopwords' dataset, which contains lists of common stopwords in various languages
nltk.download('stopwords')

# Importing lemmatizer and stemmer for text normalization
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Importing WordNet, a lexical database for the English language
from nltk.corpus import wordnet


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Function for Cleaning & Preprocessing

In [21]:
# Build the English stop-word set once, instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Declare function for data cleaning and preprocessing
def preprocess_text(text):

    # Lowercasing
    text = text.lower()

    # Remove every character that is not a lowercase letter or whitespace
    # (punctuation, digits, and any non-ASCII letters)
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Join the remaining tokens into a single space-separated string
    return ' '.join(tokens)
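
One side effect of this pipeline is worth noting: NLTK's English stop-word list includes negations such as "no" and "not", so removing stop words can change the polarity of a phrase. In the first review of the dataset, "no dandruff or flakey skin" ends up as "dandruff flakey skin". The sketch below is one possible variant that keeps negations. It is not part of the original notebook: the function name `preprocess_keep_negations`, the small `DEMO_STOP_WORDS` set, and the `NEGATIONS` set are all assumptions made for illustration, and a whitespace split stands in for `word_tokenize` so the example runs without NLTK data.

```python
import re

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's English list)
DEMO_STOP_WORDS = {"you", "that", "all", "over", "and", "no", "not", "or"}

# Negation words we might choose to keep (a hypothetical choice, not from the notebook)
NEGATIONS = {"no", "not", "nor", "never"}


def preprocess_keep_negations(text, stop_words=DEMO_STOP_WORDS, keep=NEGATIONS):
    """Same steps as preprocess_text, but words in `keep` are never removed."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    # Whitespace split stands in for word_tokenize in this dependency-free sketch
    tokens = text.split()
    effective_stop_words = set(stop_words) - set(keep)
    return " ".join(tok for tok in tokens if tok not in effective_stop_words)


sample = "Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin."

# Behaviour equivalent to the notebook: negations are dropped with the other stop words
without_negations = preprocess_keep_negations(sample, keep=set())

# Variant that preserves negations
with_negations = preprocess_keep_negations(sample)

print(without_negations)
print(with_negations)
```

Whether keeping negations helps depends on the downstream model; with bag-of-words features it may only matter when n-grams are used.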


## Preprocessing the Review Text

In [22]:
# Apply preprocessing to the review text
data['processed_reviews'] = data['review_text'].apply(preprocess_text)


## Create Column Binary Sentiment

In [23]:
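# Convert ratings to binary sentiment: 1 (positive) for ratings of 4 or 5,
# 0 (negative) otherwise, so 3-star reviews are labelled negative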
data['sentiment'] = data['review_rating'].apply(lambda x: 1 if x > 3 else 0)
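
The rule above places 3-star reviews in the negative class. As a minimal sketch of the alternatives, the plain-Python function below reproduces that rule by default and also allows neutral reviews to be marked and dropped. The name `rating_to_sentiment` and its `neutral_label` parameter are hypothetical and do not appear in the notebook.

```python
def rating_to_sentiment(rating, neutral_label=0):
    """Map a 1-5 star rating to 1 (positive) or 0 (negative).

    `neutral_label` is what a 3-star rating maps to; the default of 0
    reproduces the notebook's `1 if x > 3 else 0` rule.
    """
    if rating > 3:
        return 1
    if rating < 3:
        return 0
    return neutral_label


ratings = [1, 2, 3, 4, 5]

# Notebook rule: 3-star reviews are labelled negative
notebook_labels = [rating_to_sentiment(r) for r in ratings]

# Alternative: mark 3-star reviews as None and drop them before training
labelled = [(r, rating_to_sentiment(r, neutral_label=None)) for r in ratings]
without_neutral = [(r, label) for r, label in labelled if label is not None]

print(notebook_labels)
print(without_neutral)
```

In pandas, the second option would amount to filtering with `data[data['review_rating'] != 3]` before applying the lambda.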


## Display Columns Preprocessed

In [24]:
# Set the display option for max column width
pd.set_option('display.max_colwidth', None)

# Display the columns relevant to check results
print(data[['review_rating', 'review_text', 'processed_reviews', 'sentiment']].head(3).T)


                   0
review_rating      5
review_text        As you get older, you know what you like and what is suitable for your body. I like all Dove products. Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin. No smelly a/pits!
processed_reviews  get older know like suitable body like dove products gives fresh wide awake feeling dandruff flakey skin smelly apits
sentiment          1

## Extensive Data Inspection

### Check Missing Values

> Check for missing values or inconsistent data entries

In [25]:
# Checking for missing values in 'review_rating' and 'review_text' columns
missing_values = data[['review_rating', 'review_text']].isnull().sum()

# Printing results in an aligned manner
print("Missing values in selected columns:")
for column, value in missing_values.items():
    print(f"{column:15}= {value}")


Missing values in selected columns:
review_rating  = 0
review_text    = 0


Neither `review_rating` nor `review_text` contains missing values, so no rows need to be dropped or imputed before modelling. This also explains why the earlier `preprocess_text` call ran without errors: `text.lower()` would raise an `AttributeError` on a `NaN` entry. Note that `isnull()` only detects `NaN`/`None`; empty or whitespace-only reviews would not be counted by this check.
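
Since `isnull()` does not flag empty strings, a complementary check for blank reviews can be useful. The following is a minimal sketch in plain Python; the helper `count_blank_texts` and the sample values are hypothetical and not taken from the dataset.

```python
def count_blank_texts(texts):
    """Count entries that are not strings or contain only whitespace."""
    return sum(1 for t in texts if not isinstance(t, str) or not t.strip())


# Hypothetical sample values, not taken from the dataset
sample_texts = ["Great product", "", "   ", None, float("nan"), "ok"]

blank_count = count_blank_texts(sample_texts)
print(blank_count)
```

Because a pandas Series is iterable, the same helper could be called as `count_blank_texts(data['review_text'])`.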

In [26]:
# Assuming 'review_rating' should be between 1 and 5
# Checking for any ratings outside this range
invalid_ratings = data[(data['review_rating'] < 1) | (data['review_rating'] > 5)]

# Printing only the relevant columns: 'review_text' and 'review_rating'
print("Invalid ratings:\n", invalid_ratings[['review_text', 'review_rating']])


Invalid ratings:
 Empty DataFrame
Columns: [review_text, review_rating]
Index: []


All values in `review_rating` fall within the expected range of 1 to 5, so no rows need to be corrected or removed. The ratings can therefore be used directly or mapped to categorical sentiment labels.

### Data Distribution

> Explore data distribution, such as the number of positive vs negative reviews.

In [27]:
# Label reviews as Positive (ratings 4 and 5), Negative (ratings 1 and 2) or Neutral (rating 3)
data['review_sentiment'] = data['review_rating'].apply(lambda x: 'Positive' if x > 3 else ('Negative' if x < 3 else 'Neutral'))

# Count the number of reviews in each sentiment class
sentiment_distribution = data['review_sentiment'].value_counts()

print(sentiment_distribution)


Positive    2167
Negative     227
Neutral      107
Name: review_sentiment, dtype: int64


Of the 2,501 reviews, 2,167 (86.6%) are positive, 227 (9.1%) are negative, and 107 (4.3%) are neutral. The skew may reflect genuine customer satisfaction, a bias in how the reviews were collected, or both. Because the binary `sentiment` column created earlier assigns 3-star reviews to class 0, its split is 2,167 positive to 334 negative. With this imbalance, a model that always predicts "positive" would already reach about 86.6% accuracy, so accuracy alone is a misleading metric. Per-class precision and recall, stratified train/test splits, and class weighting or resampling are worth considering to keep the model from favouring the positive class.
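
To make the imbalance concrete, the sketch below derives the class shares, the accuracy of a majority-class baseline, and a set of inverse-frequency class weights from the counts printed above. The weighting formula, `n_samples / (n_classes * class_count)`, is the heuristic scikit-learn uses for `class_weight='balanced'`; applying it here is a suggestion, not something the notebook does. The variable names are illustrative.

```python
# Class counts reported by value_counts() above
counts = {"Positive": 2167, "Negative": 227, "Neutral": 107}
total = sum(counts.values())

# Share of each three-way class
shares = {label: n / total for label, n in counts.items()}

# Binary labelling used for the `sentiment` column: 3-star reviews fall into class 0
binary_counts = {1: counts["Positive"], 0: counts["Negative"] + counts["Neutral"]}

# Accuracy of a baseline that always predicts the majority class
majority_baseline = max(binary_counts.values()) / total

# "Balanced" class weights: n_samples / (n_classes * class_count)
class_weights = {
    label: total / (len(binary_counts) * n) for label, n in binary_counts.items()
}

for label, share in shares.items():
    print(f"{label:9}{share:.1%}")
print(f"Majority baseline accuracy: {majority_baseline:.3f}")
print({label: round(w, 3) for label, w in class_weights.items()})
```

The minority class receives a weight roughly six and a half times that of the majority class, which is one way to counteract the skew during training.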

## Text Normalization

  - Lemmatization/Stemming: Lemmatization maps a word to its dictionary base form (its lemma), usually with the help of its part of speech, whereas stemming strips suffixes by rule and can produce a root that is not a valid word. Both reduce vocabulary size by merging inflected forms of the same word.
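
The difference between the two approaches can be illustrated with a deliberately simplified toy example. The code below is not the Porter stemmer or the WordNet lemmatizer used later in the notebook: `toy_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` table are hypothetical stand-ins that only show the idea of rule-based suffix stripping versus dictionary lookup.

```python
# Toy lookup table standing in for a dictionary such as WordNet (hypothetical entries)
TOY_LEMMAS = {"studies": "study", "products": "product", "better": "good", "was": "be"}


def toy_stem(word):
    """Strip one common suffix by rule; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


words = ["studies", "products", "better", "was"]
comparison = [(w, toy_stem(w), toy_lemmatize(w)) for w in words]

for word, stem, lemma in comparison:
    print(f"{word:10}{stem:10}{lemma}")
```

The rule-based stem of "studies" is the non-word "studi", while the lookup returns "study"; irregular forms such as "better" and "was" are only resolved by the dictionary.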

In [28]:
# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')  # Open Multilingual WordNet data, required alongside 'wordnet' by some NLTK versions
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [29]:
# Initialize the Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


In [31]:
# Defines a function to map NLTK part-of-speech tags to WordNet part-of-speech tags
def nltk_tag_to_wordnet_tag(nltk_tag):

    # If the tag starts with 'J', it's an adjective in NLTK, so return the WordNet tag for adjective
    if nltk_tag.startswith('J'):
        return wordnet.ADJ

    # If the tag starts with 'V', it's a verb, so return the WordNet tag for verb
    elif nltk_tag.startswith('V'):
        return wordnet.VERB

    # If the tag starts with 'N', it's a noun, so return the WordNet tag for noun
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN

    # If the tag starts with 'R', it's an adverb, so return the WordNet tag for adverb
    elif nltk_tag.startswith('R'):
        return wordnet.ADV

    # If the NLTK tag doesn't start with J, V, N, or R, return None as it doesn't match any WordNet tag categories
    else:
        return None
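
To see what this mapping yields for concrete tags, here is a minimal dictionary-based sketch of the same prefix rule. It uses plain strings instead of the `wordnet` constants so that it runs without NLTK; the names `WORDNET_POS_BY_PREFIX` and `penn_to_wordnet` are hypothetical and not part of the notebook.

```python
# WordNet POS constants are single characters: wordnet.ADJ == 'a', wordnet.VERB == 'v',
# wordnet.NOUN == 'n', wordnet.ADV == 'r'. Plain strings are used here so the
# sketch runs without NLTK installed.
WORDNET_POS_BY_PREFIX = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if there is none."""
    return WORDNET_POS_BY_PREFIX.get(tag[:1])


# JJ = adjective, VBD = past-tense verb, NNS = plural noun, RB = adverb,
# DT = determiner, RP = particle
examples = ["JJ", "VBD", "NNS", "RB", "DT", "RP"]
mapping = {tag: penn_to_wordnet(tag) for tag in examples}
print(mapping)
```

One edge case of matching on the first letter only: the particle tag `RP` also starts with 'R' and is therefore treated as an adverb.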


In [None]:
# Defines a function to lemmatize each word in a sentence
def lemmatize_sentence(sentence):

    # Tokenizes the sentence into individual words
    words = word_tokenize(sentence)

    # Initializes an empty list to store the lemmatized words
    lemmatized_words = []

    # Loops through each word and its part-of-speech tag
    for word, tag in nltk.pos_tag(words):

        # Converts the POS tag into a WordNet POS tag
        wordnet_tag = nltk_tag_to_wordnet_tag(tag)

        if wordnet_tag is None:
            # If there's no corresponding WordNet tag, keep the word as is
            lemmatized_words.append(word)
        else:
            # If there is a corresponding WordNet tag, lemmatize the word
            lemmatized_words.append(lemmatizer.lemmatize(word, wordnet_tag))

    # Joins the list of lemmatized words into a single string and returns it
    return ' '.join(lemmatized_words)
