**AIM:** To analyze the reasons for customer dissatisfaction and develop insights into areas for improvement.

This notebook is prepared to answer the second research question (“Sentiment Analysis to Analyse Customer Reviews and Identify Areas for Improvement in The Product”).

Name of the file is: **Amazon Echo Dot 2 Reviews.csv**

The dataset used is dataset (b) (Yu 2022), available on the online platform “Kaggle” in CSV (Comma Separated Values) format. It consists of 52,942 data points and 12 features.

This dataset can be downloaded from the given link: https://www.kaggle.com/datasets/linzey/amazon-echo-dot-2-reviews

To answer the proposed research question, the following steps are carried out:

1.   Importing the required libraries.
2.   Loading the dataset.
3.   Data Analysis
4.   Preprocessing of Textual data
5.   Polarity Distribution
6.   Analysis of different sentiments of reviews
7.   Implementing LDA model

**Importing the required Libraries**

All the libraries required for efficient execution of the code are imported in the following code cell.

In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
from textblob import TextBlob
import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from wordcloud import WordCloud
import gensim
from gensim import corpora
from matplotlib.patches import Wedge

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Loading the Dataset**

The dataset discussed above is downloaded from the website in CSV format and uploaded to this notebook using the upload feature of Google Colab.

In [None]:
# To read the dataset
ds = pd.read_csv('/content/Amazon Echo Dot 2 Reviews.csv')                       # Here "ds" is defined as the dataset provided

**Dataset Analysis**

The data consists of several columns, each storing different information.
The columns are: 'Uniq Id', 'Crawl Timestamp', 'Pageurl', 'Title', 'Review Text',
       'Review Color', 'User Verified', 'Review Date', 'Review Useful Count',
       'Configuration Text', 'Ratting', 'Declaration Text'.
Of these columns, the information required for this analysis is stored in 'Review Text', so all other columns are dropped. Rows with missing values in 'Review Text' are then dropped as well.

In [None]:
# To display the first few rows of the dataset
ds.head()

Unnamed: 0,Uniq Id,Crawl Timestamp,Pageurl,Title,Review Text,Review Color,User Verified,Review Date,Review Useful Count,Configuration Text,Ratting,Declaration Text
0,d63583450415a20094950528ffb4d955,2017-10-26T15:57:14Z,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars,Love the Echo Dot.,Black,Verified Purchase,2017-07-03,,Echo Dot,5.0 out of 5 stars,
1,dc8e5ca6b44bea1006c8bb85cdca3816,2017-10-26T15:57:14Z,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars,Working just fine.,Black,Verified Purchase,2017-07-12,,Echo Dot,5.0 out of 5 stars,
2,f3f823996e2317dd65a6235011492b42,2017-10-26T15:57:14Z,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars,I love my Echo Dot,Black,Verified Purchase,2017-08-01,,Echo Dot,5.0 out of 5 stars,
3,3b6c928e62707a1530c591b897b864d6,2017-10-26T15:57:14Z,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Three Stars,Not great speakers,Black,Verified Purchase,2017-10-03,,Echo Dot,3.0 out of 5 stars,
4,275af85c81c1be55efd706f51a6c7cbe,2017-10-26T15:57:14Z,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars,Great assistant !!,Black,Verified Purchase,2017-07-22,,Echo Dot,5.0 out of 5 stars,


In [None]:
# To get column names
ds.columns

Index(['Uniq Id', 'Crawl Timestamp', 'Pageurl', 'Title', 'Review Text',
       'Review Color', 'User Verified', 'Review Date', 'Review Useful Count',
       'Configuration Text', 'Ratting', 'Declaration Text'],
      dtype='object')

In [None]:
# To drop all the columns other than the customer's feedback (Review Text), as only the text data is used to perform sentiment analysis
fb_ds = ds.drop(['Uniq Id', 'Crawl Timestamp', 'Pageurl', 'Title',
       'Review Color', 'User Verified', 'Review Date', 'Review Useful Count',
       'Configuration Text', 'Ratting', 'Declaration Text'], axis=1)              # Here 'axis=1' is to specify to drop columns(as opposed to rows)
# To display the first 10 rows of the resulting dataset
fb_ds.head(10)

Unnamed: 0,Review Text
0,Love the Echo Dot.
1,Working just fine.
2,I love my Echo Dot
3,Not great speakers
4,Great assistant !!
5,Works like a charm
6,Great little gagit
7,It needs some work
8,Neat little helper
9,Just what I needed


In [None]:
# To analyse the data in the text dataframe(fb_ds)
print(fb_ds['Review Text'].iloc[10],"\n")                                       # Used "\n" for new line
print(fb_ds['Review Text'].iloc[100],"\n")
print(fb_ds['Review Text'].iloc[200],"\n")
print(fb_ds['Review Text'].iloc[300],"\n")
print(fb_ds['Review Text'].iloc[400],"\n")

Still learning. This is amazing. 

The best toy I bought myself in a long time. 

Awesome 

Echo Dot andAmazon Echo are great!! 

For the price and what it can do, I'm impressed.  I now have three of them set up throughout the house.  They sometimes have trouble hearing my children, but they seem to hear me fine.  I will probably pick up a few more in the future for a few other rooms in the house.The kids like to listen to music with them, but the built in speaker really isn't up to the task and the sound gets distorted with the volume over half.My wife and I use it for information and the occasional trivia game. 



In [None]:
# Code to check missing values in the "Review Text" column of the dataframe
print(fb_ds['Review Text'].isna().sum())

5


In [None]:
# Code to drop the missing values from the column
fb_ds = fb_ds.dropna(subset=['Review Text'])

In [None]:
# To check the information of the data
fb_ds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10973 entries, 0 to 10976
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  10973 non-null  object
dtypes: object(1)
memory usage: 171.5+ KB


**Preprocessing for Textual data**

In this section the data is cleaned to support better computation by removing uninformative content from the dataset. The steps include removing URLs, hashtags and duplicate text, and performing tokenization using the NLTK library.

In [None]:
# Create a function to do preprocessing for textual data
def pre_processing(text):
  text = text.lower()                                                           # To convert text into lower case
  text = re.sub(r"http\S+|www\.\S+", '', text)                                  # To remove URLs (http, https or www) from the reviews using regex
  text = re.sub(r'@\w+|#', '', text)                                            # To remove @mentions and the '#' symbol using regex
  text = re.sub(r'[^\w\s]', '', text)                                           # To remove all other non-word and non-space characters using regex
  text_tokens = word_tokenize(text)                                             # Tokenization to split the text into individual words using nltk library
  new_text = [w for w in text_tokens if w not in stop_words]                    # To remove stopwords from the text as they have no use in sentiment analysis
  return " ".join(new_text)                                                     # To return the words joined into single string by 'join()'.
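The effect of each cleaning step can be checked on a single made-up review. The following is a minimal sketch using only the standard library: the sample sentence, the function name `clean_text`, the tiny stop word set and the whitespace tokenizer are all assumptions made for illustration (the notebook itself uses NLTK's `word_tokenize` and English stop word list), and the regex patterns are simplified stand-ins for the ones in `pre_processing`.

```python
import re

# Hypothetical miniature stop word list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "my", "at", "it", "and"}

def clean_text(text):
    """Apply cleaning steps similar to pre_processing, without NLTK."""
    text = text.lower()                              # lower-case the text
    text = re.sub(r"http\S+|www\.\S+", "", text)     # drop URLs
    text = re.sub(r"@\w+|#", "", text)               # drop @mentions and '#'
    text = re.sub(r"[^\w\s]", "", text)              # drop remaining punctuation
    tokens = text.split()                            # whitespace tokenization
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

# Made-up review text used only for this illustration
sample = "The Echo Dot is GREAT!! Setup guide at https://example.com #alexa @amazon"
print(clean_text(sample))
```

Running such a check on a few representative reviews before applying the function to the whole column makes it easier to spot a pattern that removes too much or too little.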

In [None]:
# To apply the preprocessing function on the dataset.
fb_ds['Review Text'] = fb_ds['Review Text'].fillna('')                          # To fill missing values in the column with the empty string
nltk.download('punkt')                                                          # To download punkt module from nltk
fb_ds['Review Text'] = fb_ds['Review Text'].apply(pre_processing)               # To apply the preprocessing on the 'Review Text' column and store the updated text in the 'Review Text'.

In [None]:
# To check and drop duplicate data by using duplicate method
print("Number of duplicates available, before removal:", fb_ds.duplicated().sum())  # Print total sum of duplicates available
fb_ds = fb_ds.drop_duplicates('Review Text')                                        # To drop the duplicates from the dataset
print("Number of duplicates available, after removal:", fb_ds.duplicated().sum())   # Print total sum of duplicates available

Number of duplicates available, before removal: 1627
Number of duplicates available, after removal: 0


In [None]:
# To perform stemming using PorterStemmer so that each word is reduced to its stem
stemmer = PorterStemmer()
fb_ds['Review Text'] = fb_ds['Review Text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [None]:
# To display data in the head function
fb_ds.head()

Unnamed: 0,Review Text
0,love echo dot
1,work fine
3,great speaker
4,great assist
5,work like charm


In [None]:
# To check the implementation of preprocessing on the data in the dataframe(fb_ds)
print(fb_ds['Review Text'].iloc[0],"\n")                                       # Used "\n" for new line
print(fb_ds['Review Text'].iloc[1],"\n")
print(fb_ds['Review Text'].iloc[2],"\n")
print(fb_ds['Review Text'].iloc[3],"\n")
print(fb_ds['Review Text'].iloc[4],"\n")

love echo dot 

work fine 

great speaker 

great assist 

work like charm 



In [None]:
# To check updated information of the column
fb_ds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9346 entries, 0 to 10976
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  9346 non-null   object
dtypes: object(1)
memory usage: 146.0+ KB



**Polarity Of Data**

Sentiment analysis requires understanding the emotion expressed in a text. NLP libraries make this possible for a machine by assigning a polarity to the text. Here the TextBlob library is used to assign a polarity to each review, and on that basis the reviews are divided into different sentiments.

In [None]:
# Define a function to calculate the polarity of the data using text blob
def ds_polarity(text):
  return TextBlob(text).sentiment.polarity

In [None]:
# To apply the polarity function on the dataframe
fb_ds['polarity'] = fb_ds['Review Text'].apply(ds_polarity)
fb_ds.head(10)                                                                  # To display the data using head function

Unnamed: 0,Review Text,polarity
0,love echo dot,0.5
1,work fine,0.416667
3,great speaker,0.8
4,great assist,0.8
5,work like charm,0.0
6,great littl gagit,0.8
7,need work,0.0
8,neat littl helper,0.0
9,need,0.0
10,still learn amaz,0.0
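Several rows above that read as positive, such as "still learn amaz", receive a polarity of 0.0. One plausible reason, not verified against TextBlob's lexicon here, is that a lexicon-based scorer only recognises dictionary words, so a stemmed form like "amaz" finds no match. The sketch below illustrates the idea with a hypothetical three-word lexicon and a plain average; `TOY_LEXICON` and `toy_polarity` are invented for this example and do not reproduce TextBlob's actual scoring.

```python
# Hypothetical toy lexicon; TextBlob's real lexicon is far larger and its
# scoring rules are more involved than this plain average.
TOY_LEXICON = {"amazing": 0.6, "great": 0.8, "love": 0.5}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [TOY_LEXICON[w] for w in text.split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("still learning amazing"))  # "amazing" is found in the lexicon
print(toy_polarity("still learn amaz"))        # the stemmed form is not found
```

If this is indeed the cause, computing polarity on the cleaned but unstemmed text and stemming only for the later topic modelling step is one option worth comparing.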


In [None]:
# To add sentiment column to the dataframe

# Define a function to calculate the sentiment
def ds_sentiment(label):
  if label < 0:
    return "Negative"
  elif label == 0:
    return "Neutral"
  elif label > 0:
    return "Positive"
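Because the comparison with zero is exact, a polarity such as -1.387779e-17 (which appears in the negative reviews later in this notebook) is labelled "Negative" even though it is effectively zero. The following is a minimal sketch of a tolerance-based variant; the function name `label_sentiment` and the tolerance value `tol=1e-9` are assumptions chosen for illustration, not part of the original analysis.

```python
def label_sentiment(polarity, tol=1e-9):
    """Map a polarity score to a label, treating |polarity| <= tol as neutral."""
    if abs(polarity) <= tol:
        return "Neutral"
    return "Positive" if polarity > 0 else "Negative"

# The last value is floating-point noise rather than a real negative score
for score in (0.8, 0.0, -0.25, -1.387779e-17):
    print(score, label_sentiment(score))
```

Whether such a tolerance is appropriate depends on how borderline reviews should be treated; with `tol=0` the function reproduces the exact comparison for ordinary numeric scores.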

In [None]:
# To apply the sentiment function to the dataframe
fb_ds['sentiment'] = fb_ds['polarity'].apply(ds_sentiment)
fb_ds.head(10)                                                                    # Display to check the implementation of the sentiment function

Unnamed: 0,Review Text,polarity,sentiment
0,love echo dot,0.5,Positive
1,work fine,0.416667,Positive
3,great speaker,0.8,Positive
4,great assist,0.8,Positive
5,work like charm,0.0,Neutral
6,great littl gagit,0.8,Positive
7,need work,0.0,Neutral
8,neat littl helper,0.0,Neutral
9,need,0.0,Neutral
10,still learn amaz,0.0,Neutral


In [None]:
# Code to check missing values in the "Review Text" column of the dataframe
print(fb_ds['Review Text'].isna().sum())
print(fb_ds['sentiment'].isna().sum())

0
0


In [None]:
# Code to drop the missing values from the column
fb_ds = fb_ds.dropna(subset=['Review Text'])
fb_ds = fb_ds.dropna(subset=['sentiment'])

**Visualization**

In this section the data is visualized using a bar plot and a pie chart to understand the distribution of sentiments. The visualization shows that the dataset is imbalanced, as a large proportion of the reviews are positive.

In [None]:
# To visualize the data in the countplot
ds_fig = plt.figure(figsize=(7,7))                                              # To create a figure with a specified size
sns.countplot(x='sentiment', data = fb_ds)                                      # To create a countplot of the 'sentiment' column in the dataframe
plt.show()                                                                      # To display the plot

In [None]:
# To represent the data in a pie chart
ds_fig = plt.figure(figsize=(7,7))                                              # To create a figure of specific size
ds_colors = ("green", "orange", "red")                                          # To set different colors for the sentiments
ds_wp = {'linewidth':1, 'edgecolor': "white"}                                   # To define line width and edge color
tags = fb_ds['sentiment'].value_counts()                                        # To specify value of different sentiments in the chart
explode = (0.05,0.05,0.05)                                                      # To provide separation between the slices for better visibility
tags.plot(kind='pie', autopct='%1.1f%%', shadow= True, colors= ds_colors,
          startangle=90, wedgeprops = ds_wp, explode = explode, label='')       # For implementation of the defined input on the chart
plt.title('Distribution of sentiments')                                         # To provide heading to the chart
plt.show()                                                                      # To display the plot
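The imbalance visible in the charts can also be expressed as numbers. The following is a minimal sketch using only the standard library; the list `labels` is hypothetical data standing in for the values of `fb_ds['sentiment']`, so the printed shares are not the notebook's results.

```python
from collections import Counter

# Hypothetical labels standing in for fb_ds['sentiment']
labels = ["Positive"] * 7 + ["Neutral"] * 2 + ["Negative"] * 1

counts = Counter(labels)                      # number of reviews per sentiment
total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

# Print the classes from most to least frequent with their percentage share
for label, n in counts.most_common():
    print(f"{label:<8} {n:>3} {shares[label]:5.1f}%")
```

On the real dataframe, `fb_ds['sentiment'].value_counts(normalize=True)` gives the same kind of summary directly.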

**Analysis of reviews from different sentiment section**

This section shows the words that occur in each sentiment as word clouds, so that the differences between the sentiments can be identified.


In [None]:
# For Positive Reviews
pos_review = fb_ds[fb_ds.sentiment == 'Positive']                               # To select the rows where sentiment column is equal to positive
pos_review = pos_review.sort_values(['polarity'], ascending= False)             # To sort the positive reviews in descending order of polarity scores
pos_review.head()                                                               # To display the value using head function

Unnamed: 0,Review Text,polarity,sentiment
5124,want listen podcast perfect,1.0,Positive
8475,perfect hook audio system get voicecontrol mus...,1.0,Positive
8275,probabl best version echoalexa,1.0,Positive
1246,perfect apart easi instal use,1.0,Positive
5151,best amazon devic ive ever bought use play mus...,1.0,Positive


In [None]:
# Visualize words in all the positive reviews (WordCloud is already imported at the top of the notebook)
text = ' '.join(pos_review['Review Text'])                                      # All positive reviews are joined into a single string
plt.figure(figsize = (16,16), facecolor='None')                                 # To specify the size of figure
pos_wordcloud = WordCloud(max_words=500, width=1600, height=800).generate(text) # Height, width and limit of words for word cloud is specified here.
                                                                                # Generate is called to generate the word cloud image
plt.imshow(pos_wordcloud, interpolation='bilinear')                             # "imshow" to display the image
plt.axis('off')                                                                 # To hide the axes
plt.title('Most frequent words in Positive Reviews', fontsize=20)               # To specify title and font size of the title

In [None]:
# For Negative Reviews
neg_review = fb_ds[fb_ds.sentiment == 'Negative']                               # To select the rows where sentiment column is equal to negative
neg_review = neg_review.sort_values(['polarity'], ascending= False)             # To sort the negative reviews in descending order of polarity scores
neg_review.head()                                                               # To display the value using head function

Unnamed: 0,Review Text,polarity,sentiment
554,get close sometim doesnt ask pandora mess ever...,-1.387779e-17,Negative
6979,husband realli enjoy echo purchas one two gift...,-0.002777778,Negative
5772,appar amazon appli filter content access 6 yea...,-0.004166667,Negative
3791,set go wellat tech savvi follow instruct well ...,-0.005008418,Negative
10891,dot great use listen book set timer host thing...,-0.006666667,Negative


In [None]:
# Visualize words in all the negative reviews
text = ' '.join([word for word in neg_review['Review Text']])                   # The words from negative reviews will be extracted and joined in a single string
plt.figure(figsize = (16,16), facecolor='None')                                 # To specify the size of figure
wordcloud = WordCloud(max_words=500, width=1600, height=800). generate(text)    # Height, width and limit of words for word cloud is specified here.
                                                                                # Generate is called to generate the word cloud image
plt.imshow(wordcloud, interpolation='bilinear')                                 # "imshow" to display the image
plt.axis('off')                                                                 # To specify axis
plt.title('Most frequent words in Negative Reviews', fontsize=20)               # To specify title and font size of the title

In [None]:
# For Neutral Reviews
neu_review = fb_ds[fb_ds.sentiment == 'Neutral']                                # To select the rows where sentiment column is equal to neutral
neu_review = neu_review.sort_values(['polarity'], ascending= False)             # To sort the neutral reviews in descending order of polarity scores
neu_review.head()                                                               # To display the value using head function

Unnamed: 0,Review Text,polarity,sentiment
5,work like charm,0.0,Neutral
7042,addict,0.0,Neutral
7135,must devic everi home system breez set even ea...,0.0,Neutral
7127,use portabl handi daili alarm check weather,0.0,Neutral
7108,gave christma present favorit gift year,0.0,Neutral


In [None]:
# Visualize words in all the neutral reviews
text = ' '.join([word for word in neu_review['Review Text']])                   # The neutral reviews will be joined into a single string
plt.figure(figsize = (16,16), facecolor='None')                                 # To specify the size of figure
wordcloud = WordCloud(max_words=500, width=1600, height=800). generate(text)    # Height, width and limit of words for word cloud is specified here.
                                                                                # Generate is called to generate the word cloud image
plt.imshow(wordcloud, interpolation='bilinear')                                 # "imshow" to display the image
plt.axis('off')                                                                 # To specify axis
plt.title('Most frequent words in Neutral Reviews', fontsize=20)                # To specify title and font size of the title

In [None]:
# To suppress warning messages using the warnings filter
import warnings
warnings.filterwarnings('ignore')

**Implementation of Latent Dirichlet Allocation (LDA):**

LDA is a popular approach for topic modelling that treats a text as a set of topics and each topic as a set of words. This approach helps in extracting the areas for product improvement.

After gaining insights from the word cloud of the negative reviews, the topics, each a set of words, are extracted along with the frequency of their words.
Based on a comparison of these frequencies, improving connectivity appears to be a reliable way to improve the quality of the product; the quality of the speaker can also be improved.

In [None]:
# Creating list of lists of tokens
reviews = [doc.split() for doc in neg_review['Review Text']]
# Creating dictionary of words
dictReviews = corpora.Dictionary(reviews)
# Converting corpus to the bag-of-words format
corpus = [dictReviews.doc2bow(text) for text in reviews]
# Training LDA model
num_topics = 10
model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            num_topics=num_topics,
                                            id2word=dictReviews,
                                            passes=50)
# Print the topics and their top words
for idx, topic in model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.014*"amazon" + 0.012*"music" + 0.012*"work" + 0.009*"ask" + 0.008*"return" + 0.008*"question" + 0.008*"echo" + 0.008*"one" + 0.008*"thing" + 0.007*"answer"
Topic: 1 
Words: 0.015*"amazon" + 0.014*"get" + 0.013*"use" + 0.011*"ask" + 0.011*"alexa" + 0.011*"devic" + 0.011*"like" + 0.010*"would" + 0.010*"work" + 0.010*"dot"
Topic: 2 
Words: 0.017*"alexa" + 0.013*"wifi" + 0.012*"question" + 0.011*"devic" + 0.010*"dot" + 0.010*"time" + 0.009*"one" + 0.008*"understand" + 0.008*"even" + 0.007*"answer"
Topic: 3 
Words: 0.019*"like" + 0.014*"use" + 0.012*"alexa" + 0.011*"play" + 0.009*"sometim" + 0.009*"thing" + 0.009*"time" + 0.009*"amazon" + 0.009*"work" + 0.008*"game"
Topic: 4 
Words: 0.020*"work" + 0.018*"speaker" + 0.017*"echo" + 0.015*"alexa" + 0.012*"connect" + 0.010*"dot" + 0.009*"time" + 0.009*"use" + 0.009*"hard" + 0.008*"like"
Topic: 5 
Words: 0.028*"dot" + 0.027*"echo" + 0.017*"music" + 0.014*"one" + 0.013*"play" + 0.013*"amazon" + 0.011*"use" + 0.010*"alexa" + 0.0

In [None]:
#  Displaying topics without frequency
topTopics = model.show_topics(num_topics, formatted=False, num_words=10)
for topic in topTopics:
  print('\nTopic:', topic[0])
  topWords = [word[0] for word in topic[1]]
  print('Top words:', topWords)


Topic: 0
Top words: ['amazon', 'music', 'work', 'ask', 'return', 'question', 'echo', 'one', 'thing', 'answer']

Topic: 1
Top words: ['amazon', 'get', 'use', 'ask', 'alexa', 'devic', 'like', 'would', 'work', 'dot']

Topic: 2
Top words: ['alexa', 'wifi', 'question', 'devic', 'dot', 'time', 'one', 'understand', 'even', 'answer']

Topic: 3
Top words: ['like', 'use', 'alexa', 'play', 'sometim', 'thing', 'time', 'amazon', 'work', 'game']

Topic: 4
Top words: ['work', 'speaker', 'echo', 'alexa', 'connect', 'dot', 'time', 'use', 'hard', 'like']

Topic: 5
Top words: ['dot', 'echo', 'music', 'one', 'play', 'amazon', 'use', 'alexa', 'get', 'like']

Topic: 6
Top words: ['get', 'use', 'alexa', 'know', 'dot', 'echo', 'dont', 'like', 'would', 'googl']

Topic: 7
Top words: ['use', 'dot', 'product', 'set', 'app', 'work', 'get', 'echo', 'alexa', 'never']

Topic: 8
Top words: ['use', 'echo', 'cant', 'alexa', 'dot', 'music', 'amazon', 'one', 'doesnt', 'household']

Topic: 9
Top words: ['echo', 'alexa', '

In [None]:
# Create a dictionary with the summed weight ("frequency") of each word across all the topics in topTopics
wordFreq = {}
for topic in topTopics:
    for word, freq in topic[1]:
        if word in wordFreq:
            wordFreq[word] += freq
        else:
            wordFreq[word] = freq

# Sort the dictionary in descending order based on the frequency
sorted_wordFreq = sorted(wordFreq.items(), key=lambda x: x[1], reverse=True)

# Extract the top 10 words and their frequencies from the sorted dictionary
top_Words = [word[0] for word in sorted_wordFreq[:10]]
top_Freqs = [word[1] for word in sorted_wordFreq[:10]]

# Define a fixed list of colors, one for each bar
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k', 'orange', 'blue', 'green']

# Create a bar chart with different colors for each bar
plt.bar(top_Words, top_Freqs, color=colors)
plt.xticks(rotation=45)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Words Across All Topics')

plt.show()
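The same aggregation can be written more compactly with `collections.Counter`. The sketch below is an optional alternative, not the notebook's method: `toy_topics` is a small hand-made list in the shape returned by `model.show_topics(formatted=False)`, with weights copied from the leading words of Topics 2 and 4 in the printed output above, and the names `toy_topics` and `word_weight` are chosen for this example.

```python
from collections import Counter

# Same shape as model.show_topics(formatted=False): (topic_id, [(word, weight), ...]).
# Weights copied from the leading words of Topics 2 and 4 in the printed output.
toy_topics = [
    (2, [("alexa", 0.017), ("wifi", 0.013), ("question", 0.012)]),
    (4, [("work", 0.020), ("speaker", 0.018), ("echo", 0.017),
         ("alexa", 0.015), ("connect", 0.012)]),
]

# Sum the weight of each word over every topic it appears in
word_weight = Counter()
for _topic_id, word_weights in toy_topics:
    for word, weight in word_weights:
        word_weight[word] += weight

# most_common sorts by summed weight in descending order
for word, weight in word_weight.most_common(3):
    print(f"{word:<8} {weight:.3f}")
```

Words that appear in many topics, such as product names, accumulate weight quickly under this scheme, which is worth keeping in mind when reading the bar chart.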

In [None]:
# Set the size of the figure
plt.figure(figsize=(7, 7))

# Create a pie chart with different colors for each slice
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'orange', 'purple', 'gray', 'pink']
plt.pie(top_Freqs, labels=top_Words, colors=colors, autopct='%1.1f%%', startangle=90)

plt.axis('equal')
plt.title('Top 10 Words Across All Topics')
plt.show()
