
### Goldsmiths University of London
### MSc. Data Science and Artificial Intelligence
### Module: Natural Language Processing
### Author: Carlos Manuel De Oliveira Alves
### Student: cdeol003
### Coursework Project

# Data Collection

### Load the data

In [1]:
# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


Dataset source: https://www.kaggle.com/datasets/akudnaver/amazon-reviews-dataset

License: Unknown

In [2]:
# Import the pandas library and give it the alias 'pd' for data manipulation and analysis
import pandas as pd

# Load the dataset Amazon Review Details from Google Drive
data_path = '/content/drive/MyDrive/amazon_project/amazon-review-details.csv'
amazon_data = pd.read_csv(data_path)

# Display the first three rows, transposed so that each column is shown as a row
amazon_data.head(3).T


| Unnamed: 0 | 0 | 1 | 2 |
|---|---|---|---|
| report_date | 2019-01-02 | 2019-01-03 | 2019-01-03 |
| online_store | FRESHAMAZON | FRESHAMAZON | FRESHAMAZON |
| upc | 8718114216478 | 5000184201199 | 5000184201199 |
| retailer_product_code | B0142CI6FC | B014DFNNRY | B014DFNNRY |
| brand | Dove Men+Care | Marmite | Marmite |
| category | Personal Care | Foods | Foods |
| sub_category | Deos | Savoury | Savoury |
| product_description | Dove Men+Care Extra Fresh Anti-perspirant Deod... | Marmite Spread Yeast Extract 500g | Marmite Spread Yeast Extract 500g |
| review_date | 2019-01-01 | 2019-01-02 | 2019-01-02 |
| review_rating | 5 | 5 | 4 |


In [3]:
# Extracting the first review text from the 'review_text' column of the amazon_data dataframe
data = amazon_data['review_text'].values[0]
data


'As you get older, you know what you like and what is suitable for your body. I like all Dove products. Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin. No smelly a/pits!'

# Lexical Analysis

## Word Tokenization

> The code snippet below first imports NLTK, a widely used Python library for natural language processing. It then downloads the 'punkt' package, which contains pre-trained models for splitting text into sentences, an essential step in many text-processing tasks.

In [4]:
# Import the Natural Language Toolkit (nltk) library. This library is used for working with human language data (text) in Python
import nltk

# Download the 'punkt' tokenizer models. This is necessary for sentence tokenization, which is a part of the nltk library's functionality
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

> The `word_tokenize` function takes a string (here, the review stored in `data`) and splits it into its constituent words and punctuation marks, a process known as tokenization. This is often a preliminary step in text-processing tasks such as language modelling, sentiment analysis, or information extraction. For this code to work, `data` must be a string containing some text.

In [5]:
# From the nltk.tokenize module, import the sent_tokenize and word_tokenize functions
# sent_tokenize is used for splitting text into sentences, and word_tokenize is used for splitting sentences into words
from nltk.tokenize import sent_tokenize, word_tokenize

# Use the word_tokenize function to split the content of the variable 'data' into individual words
# This assumes 'data' is a string containing natural language text
# The result, a list of tokenized words, is then printed to the console
print(word_tokenize(data))


['As', 'you', 'get', 'older', ',', 'you', 'know', 'what', 'you', 'like', 'and', 'what', 'is', 'suitable', 'for', 'your', 'body', '.', 'I', 'like', 'all', 'Dove', 'products', '.', 'Gives', 'you', 'that', 'fresh', 'all', 'over', ',', 'wide', 'awake', 'feeling', 'and', 'no', 'dandruff', 'or', 'flakey', 'skin', '.', 'No', 'smelly', 'a/pits', '!']
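
> As a rough illustration of what word tokenization does, the sketch below splits a string into word and punctuation tokens using only Python's `re` module. The pattern and the helper name `simple_word_tokenize` are assumptions made for this example. It is not NLTK's algorithm and will differ from `word_tokenize` on contractions, abbreviations, and other edge cases.

```python
import re

# Hypothetical pattern: a run of characters that are neither whitespace nor one of
# the listed punctuation marks is a token, and each listed mark is its own token.
TOKEN_PATTERN = re.compile(r"[^\s.,!?;:]+|[.,!?;:]")


def simple_word_tokenize(text):
    """Return word and punctuation tokens found in `text` (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


sample = "I like all Dove products. No smelly a/pits!"
print(simple_word_tokenize(sample))
# ['I', 'like', 'all', 'Dove', 'products', '.', 'No', 'smelly', 'a/pits', '!']
```

> On this sample the sketch happens to agree with the NLTK output above, including keeping 'a/pits' as one token, because '/' is not among the punctuation marks that the pattern splits on.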


## Sentence Tokenization

> This code snippet differs from the previous one by using `sent_tokenize` instead of `word_tokenize`. The `sent_tokenize` function takes a string of text (provided in the variable `data`) and splits it into its constituent sentences. This is a common first step in text analysis when dealing with larger documents, as it breaks the text into more manageable units for further processing, such as language modelling, sentiment analysis, or summarization. Again, `data` must be a string containing text for this code to function as intended.

In [6]:
# Use the sent_tokenize function to split the content of the variable 'data' into individual sentences
# This assumes 'data' is a string containing natural language text
# The result, a list of tokenized sentences, is then printed to the console
print(sent_tokenize(data))


['As you get older, you know what you like and what is suitable for your body.', 'I like all Dove products.', 'Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin.', 'No smelly a/pits!']
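
> To show the idea behind sentence splitting, and why a trained model is preferable, here is a naive sketch that breaks text after '.', '!' or '?' when whitespace follows. The helper name `simple_sent_tokenize` and the rule itself are assumptions for illustration. This is not how `sent_tokenize` works internally.

```python
import re


def simple_sent_tokenize(text):
    """Split after sentence-final punctuation followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the empty string that re.split returns for empty input
    return [part for part in parts if part]


print(simple_sent_tokenize("I like all Dove products. No smelly a/pits!"))
# ['I like all Dove products.', 'No smelly a/pits!']

# The naive rule breaks on abbreviations, which is the kind of case
# a trained sentence tokenizer is meant to handle.
print(simple_sent_tokenize("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```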


## Store Words and Sentences

> In this code snippet, we perform two types of tokenisation on the same text (`data`): sentence tokenisation and word tokenisation. First, the text is split into sentences and stored in the variable `phrases`. Then, the same text is split into words and stored in the variable `words`. Finally, both the tokenised sentences and the words are printed. This approach is practical when we need to analyse the text at both the sentence and word levels.


In [7]:
# Use the sent_tokenize function to split the content of the variable 'data' into individual sentences
# The result is stored in the variable 'phrases'
# This operation assumes 'data' is a string containing natural language text
phrases = sent_tokenize(data)

# Use the word_tokenize function to split the content of the same 'data' variable into individual words
# The result is stored in the variable 'words'
# This also assumes 'data' is a string containing natural language text
words = word_tokenize(data)

# Print the list of tokenized sentences stored in 'phrases'
print(phrases)

# Print the list of tokenized words stored in 'words'
print(words)


['As you get older, you know what you like and what is suitable for your body.', 'I like all Dove products.', 'Gives you that fresh all over, wide awake feeling and no dandruff or flakey skin.', 'No smelly a/pits!']
['As', 'you', 'get', 'older', ',', 'you', 'know', 'what', 'you', 'like', 'and', 'what', 'is', 'suitable', 'for', 'your', 'body', '.', 'I', 'like', 'all', 'Dove', 'products', '.', 'Gives', 'you', 'that', 'fresh', 'all', 'over', ',', 'wide', 'awake', 'feeling', 'and', 'no', 'dandruff', 'or', 'flakey', 'skin', '.', 'No', 'smelly', 'a/pits', '!']


# Stop-Word Removal

> This code snippet prepares for stop-word removal by importing `stopwords` from NLTK's corpus module. It provides lists of common words that are often excluded in text analysis so that the focus falls on more meaningful words. The last line ensures that the stopwords data is downloaded and available. The `sent_tokenize` and `word_tokenize` functions imported earlier are reused for tokenizing text into sentences and words.

In [8]:
# Import the stopwords collection from the nltk.corpus module
# Stopwords are common words like 'the', 'is', 'in', etc., that are often filtered out in natural language processing tasks
from nltk.corpus import stopwords

# Download the 'stopwords' resource from the NLTK data repository
# This resource is a collection of stopwords for various languages and is necessary for stopwords filtering tasks
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

> This code snippet filters out common stopwords from a given text (`data`). It first lowercases the text and tokenizes it into words, then checks each word against a predefined set of stopwords (common words that usually do not carry significant meaning in text analysis). The words not in the stopword set are collected in `wordsFiltered`. The output shows the number of stopwords, the stopword set itself, and the filtered words from the input text. This process is helpful in many text analysis tasks where the focus is on the meaningful content of the text.

In [9]:
# Create a set of English stopwords by using the stopwords.words function from the NLTK library
# This set contains common words like 'the', 'is', 'in', etc., that are typically removed in language processing
stopWords = set(stopwords.words('english'))

# Convert the 'data' string to lowercase and then tokenize it into words
# Lowercasing lets capitalised tokens such as 'As' and 'No' match the stopword entries, which are all lowercase
words = word_tokenize(data.lower())

# Initialize an empty list to store the filtered words
wordsFiltered = []

# Iterate over each word in the tokenized 'words' list
for w in words:
    # Check if the word is not in the set of stopwords
    if w not in stopWords:
        # If the word is not a stopword, append it to the 'wordsFiltered' list
        wordsFiltered.append(w)

# Print the total number of stopwords in the stopWords set
print(len(stopWords))

# Print the set of English stopwords
print(stopWords)

# Print the words that remain from the 'data' string after the stopwords have been removed
print(wordsFiltered)


179
{'s', 'an', 'if', 'and', 'yourselves', 'our', 'have', 'y', 'below', "don't", 'against', 'having', "mustn't", 'were', 'does', 'herself', 'than', 'such', 'now', "weren't", 'through', 'above', 'how', 'ourselves', "won't", 'myself', 'for', 'didn', "shouldn't", "shan't", 'these', "should've", 're', "didn't", "it's", "wasn't", "you're", 'as', "mightn't", 'needn', 'do', 'some', 'or', 'will', 'again', 'where', 'aren', 'theirs', 'any', 'up', 'o', 'off', 'a', 'to', 'between', "you'd", 'while', 'into', 'about', 'under', 'what', 'they', 'your', 'own', 'too', 'no', "doesn't", 'was', 'not', 'by', 'been', "couldn't", "haven't", 'more', 'which', 'is', 'there', 'when', 'wouldn', 'he', 'during', 've', 'then', 'ain', 'but', 'don', 'ma', 'should', "you'll", 'further', 'shouldn', 'their', 'down', 'few', 'haven', 'its', 'had', 'nor', 'out', 'd', 'in', 'i', 'here', 'same', 'before', 'doing', 't', "aren't", "isn't", "hasn't", 'am', 'his', 'me', 'it', 'she', 'other', 'hadn', 'did', 'her', "that'll", 'whom'
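
> The filtering loop can also be packaged as a small function. The sketch below is self-contained and uses a deliberately tiny, hypothetical stop-word set (`TINY_STOP_WORDS`) instead of NLTK's English list. The `drop_punctuation` option is an addition for illustration and is not part of the notebook's code, which keeps punctuation tokens.

```python
import string

# Hypothetical, deliberately small stop-word set for illustration only;
# the notebook itself uses NLTK's English stopword list.
TINY_STOP_WORDS = {"i", "you", "all", "no", "and", "or", "that", "what", "is", "for", "your", "as"}


def remove_stop_words(tokens, stop_words=TINY_STOP_WORDS, drop_punctuation=True):
    """Lowercase each token and keep it unless it is a stop word (or pure punctuation)."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stop_words:
            continue
        # Optionally skip tokens made up only of punctuation characters
        if drop_punctuation and all(ch in string.punctuation for ch in lowered):
            continue
        kept.append(lowered)
    return kept


tokens = ["I", "like", "all", "Dove", "products", ".", "No", "smelly", "a/pits", "!"]
print(remove_stop_words(tokens))
# ['like', 'dove', 'products', 'smelly', 'a/pits']
```

> Note that 'no' is treated as a stop word here, as it is in the NLTK set printed above. For sentiment analysis, removing negations can change the meaning of a review, so the stop-word set may need adjusting.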

# Stemming

> This code snippet sets up stemming with the Natural Language Toolkit (NLTK). It imports the `PorterStemmer` class, an implementation of the widely used Porter stemming algorithm, which reduces words to a base or root form (for example, "running" to "run"). The resulting stem is not always a dictionary word. Stemming is useful in natural language processing applications where the specific form of a word is less important than its core meaning. The snippet also re-imports `sent_tokenize` and `word_tokenize`, which break text down into sentences and words, respectively.


In [10]:
# Import the PorterStemmer class from the nltk.stem module
# PorterStemmer is used for stemming, which is the process of reducing words to their word stem or root form
from nltk.stem import PorterStemmer

# Import the sent_tokenize and word_tokenize functions from the nltk.tokenize module
# sent_tokenize is used for splitting text into sentences, while word_tokenize is used for splitting sentences into words
from nltk.tokenize import sent_tokenize, word_tokenize


## Stem the Words

> In this code snippet, we create an instance of the `PorterStemmer` class from the NLTK library, which implements the Porter stemming algorithm. This algorithm is widely used for English and reduces words to a root form; for example, it reduces "running" to "run". The for loop iterates over the `words` list and prints the stemmed version of each token. This process is helpful in natural language processing tasks where we want to consider word usage in a more general sense, without the specifics of tense, plurality, etc.

In [11]:
# Initialize an instance of the PorterStemmer class
# This stemmer is used to reduce words to their base or root form
ps = PorterStemmer()

# Iterate through each word in the 'words' list
# At this point 'words' holds the lowercased tokens produced in the stop-word removal cell
for word in words:
    # Print the stemmed version of each word
    # The stem() method of the PorterStemmer instance reduces the word to its root form
    print(ps.stem(word))


as
you
get
older
,
you
know
what
you
like
and
what
is
suitabl
for
your
bodi
.
i
like
all
dove
product
.
give
you
that
fresh
all
over
,
wide
awak
feel
and
no
dandruff
or
flakey
skin
.
no
smelli
a/pit
!


## Compare Words with Their Stems

> In this code snippet, we iterate over each word in the `words` list. For each word, we print the original word and its stemmed version, separated by a colon (":"). This output format makes it easy to compare each word with its stem and to see how stemming changes it. The `ps.stem(word)` call applies the Porter stemming algorithm to reduce each word to its root form. Stemming is common practice in natural language processing as a way to generalize and simplify text data.

In [12]:
# Iterate through each word in the 'words' list
# 'words' still holds the lowercased tokens produced in the stop-word removal cell
for word in words:
    # For each word, print the original word and its stemmed version, separated by a colon
    # The stem() method of the PorterStemmer instance 'ps' reduces the word to its root form
    print(word + ":" + ps.stem(word))


as:as
you:you
get:get
older:older
,:,
you:you
know:know
what:what
you:you
like:like
and:and
what:what
is:is
suitable:suitabl
for:for
your:your
body:bodi
.:.
i:i
like:like
all:all
dove:dove
products:product
.:.
gives:give
you:you
that:that
fresh:fresh
all:all
over:over
,:,
wide:wide
awake:awak
feeling:feel
and:and
no:no
dandruff:dandruff
or:or
flakey:flakey
skin:skin
.:.
no:no
smelly:smelli
a/pits:a/pit
!:!
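
> To give a feel for how rule-based stemming produces pairs such as 'products:product' and 'gives:give', the sketch below implements only the plural-stripping rules of step 1a as described in Porter's original algorithm (SSES to SS, IES to I, SS to SS, S removed). The function name `porter_step_1a` is hypothetical. The rule that leaves words of one or two characters unchanged is an assumption chosen to match the 'as:as' and 'is:is' lines in the output above. The full algorithm has several further steps (which is where 'body:bodi' comes from), and NLTK's implementation may apply additional rules, so this is not a substitute for `ps.stem`.

```python
def porter_step_1a(word):
    """Apply only the step 1a plural rules; the first matching rule wins."""
    # Assumption: very short words are left untouched
    if len(word) <= 2:
        return word
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for w in ["products", "gives", "a/pits", "caresses", "ponies", "is"]:
    print(w + ":" + porter_step_1a(w))
# products:product
# gives:give
# a/pits:a/pit
# caresses:caress
# ponies:poni
# is:is
```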


# N-grams

## Word N-grams

> The `ngrams` function in the NLTK library generates n-grams from a sequence. N-grams are contiguous sequences of 'n' items (where 'n' is a specified number). In text processing, these items are usually words, so a 2-gram (or bigram) is a pair of consecutive words from the text. N-grams are used in various natural language processing tasks, such as text analysis and linguistic research, and as part of algorithms for text prediction, search, and machine learning models in language processing.

In [13]:
# Import the ngrams function from the nltk module
# The ngrams function is used to generate n-grams from a given sequence of items (like words in a text)
# An n-gram is a contiguous sequence of n items from a given sample of text or speech
from nltk import ngrams


In [14]:
# Print the first element (sentence) from the list 'phrases'
# This assumes that 'phrases' is a list of sentences obtained from a previous sentence tokenization process
print(phrases[0])


As you get older, you know what you like and what is suitable for your body.


> This code snippet generates 6-grams from the first sentence in the `phrases` list. The `ngrams` function from NLTK takes a sequence of items (in this case, words from the first sentence) and an integer `n`, producing an iterator over n-grams. With `n = 6`, it yields tuples of six consecutive words from the sentence. Because the sentence is split with `split()` rather than `word_tokenize`, punctuation stays attached to the neighbouring word (for example, 'older,' and 'body.'). Word n-grams are helpful in natural language processing and computational linguistics for analysing the context in which words appear and understanding language patterns. The for loop then iterates over these 6-grams, printing each in turn.

In [15]:
# Set the number 'n' for the n-grams. In this case, n is set to 6 for generating 6-grams
n = 6

# Create 6-grams from the first sentence in the 'phrases' list
# The sentence is first split into words using the split() method, then ngrams() function is applied
# This results in an iterator over 6-word tuples from the sentence
w_6grams = ngrams(phrases[0].split(), n)

# Iterate over each 6-gram in the w_6grams
for grams in w_6grams:
    # Print each 6-gram
    # Each 6-gram is a tuple of 6 words that occur consecutively in the sentence
    print(grams)


('As', 'you', 'get', 'older,', 'you', 'know')
('you', 'get', 'older,', 'you', 'know', 'what')
('get', 'older,', 'you', 'know', 'what', 'you')
('older,', 'you', 'know', 'what', 'you', 'like')
('you', 'know', 'what', 'you', 'like', 'and')
('know', 'what', 'you', 'like', 'and', 'what')
('what', 'you', 'like', 'and', 'what', 'is')
('you', 'like', 'and', 'what', 'is', 'suitable')
('like', 'and', 'what', 'is', 'suitable', 'for')
('and', 'what', 'is', 'suitable', 'for', 'your')
('what', 'is', 'suitable', 'for', 'your', 'body.')


## Character N-grams

In [16]:
# Set 'n' to 6 for generating 6-grams
n = 6

# Create 6-grams from the first sentence in the 'phrases' list, but this time treating the sentence as a sequence of characters
# The ngrams() function is applied directly to the string, resulting in an iterator over 6-character tuples from the sentence
c_6grams = ngrams(phrases[0], n)

# Iterate over each 6-gram in the c_6grams
for grams in c_6grams:
    # Join the tuple of characters into a string and print it
    # Each 6-gram here is a sequence of 6 consecutive characters (including spaces and punctuation) from the sentence
    print(''.join(grams))


As you
s you 
 you g
you ge
ou get
u get 
 get o
get ol
et old
t olde
 older
older,
lder, 
der, y
er, yo
r, you
, you 
 you k
you kn
ou kno
u know
 know 
know w
now wh
ow wha
w what
 what 
what y
hat yo
at you
t you 
 you l
you li
ou lik
u like
 like 
like a
ike an
ke and
e and 
 and w
and wh
nd wha
d what
 what 
what i
hat is
at is 
t is s
 is su
is sui
s suit
 suita
suitab
uitabl
itable
table 
able f
ble fo
le for
e for 
 for y
for yo
or you
r your
 your 
your b
our bo
ur bod
r body
 body.
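
> For character n-grams, string slicing gives the same result without building tuples first. The sketch below is a minimal pure-Python version under that assumption. The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Return every substring of length n, moving one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("As you get", 6))
# ['As you', 's you ', ' you g', 'you ge', 'ou get']
```

> Spaces and punctuation count as characters, so a string of length `k` yields `k - n + 1` n-grams. For the 76-character sentence above, that is 71 6-grams.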


# Exploring Corpora

In [17]:
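# Import FreqDist, which counts how often each item occurs in a sequence
from nltk import FreqDist
# word_tokenize was already imported in an earlier cell; repeating the import is harmless
from nltk.tokenize import word_tokenize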


In [18]:
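# Build a frequency distribution over 'words', the lowercased tokens (punctuation included) from the stop-word removal cell
fdist1 = FreqDist(words)
print(fdist1.most_common(2))  # Prints the two most common tokens with their counts
print(fdist1.hapaxes())       # Prints the tokens that occur exactly once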


[('you', 4), ('.', 3)]
['as', 'get', 'older', 'know', 'is', 'suitable', 'for', 'your', 'body', 'i', 'dove', 'products', 'gives', 'that', 'fresh', 'over', 'wide', 'awake', 'feeling', 'dandruff', 'or', 'flakey', 'skin', 'smelly', 'a/pits', '!']
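
> The same two queries can be sketched with `collections.Counter` from the standard library. This is an illustrative equivalent on a small made-up token list, not the notebook's data. `Counter` has no `hapaxes()` method, so tokens with a count of 1 are collected by hand.

```python
from collections import Counter

# Small made-up token list for illustration
tokens = ["you", "like", "you", ".", "you", "dove", "."]
counts = Counter(tokens)

# The two most frequent tokens with their counts
print(counts.most_common(2))  # [('you', 3), ('.', 2)]

# Hapaxes: tokens that occur exactly once, in order of first appearance
hapaxes = [token for token, count in counts.items() if count == 1]
print(hapaxes)  # ['like', 'dove']
```

> As in the output above, punctuation tokens are counted like any other token, so filtering them out first may give a more informative ranking.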
