<a href="https://colab.research.google.com/github/aakankshch/NLP/blob/main/NLP_Kindle_Review_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sentiment analysis is an NLP technique for determining the emotional tone or attitude expressed in a piece of text. It classifies the text into categories such as positive, negative, or neutral.

In [1]:
#Import the packages
import pandas as pd

This is a small dataset of book reviews from the Amazon Kindle Store.

In [2]:
#Load the Dataset
data=pd.read_csv('/content/all_kindle_review.csv')
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [3]:
# Keep only the columns needed for sentiment analysis
df=data[['reviewText','rating']]
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reviewText  12000 non-null  object
 1   rating      12000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 187.6+ KB


In [5]:
df.shape

(12000, 2)

In [6]:
df.isnull().sum()

Unnamed: 0,0
reviewText,0
rating,0


The dataframe contains no null values.

In [7]:
#unique rating values
df['rating'].unique()

array([3, 5, 4, 2, 1])

In [8]:
#Unique value counts
df['rating'].value_counts()

Unnamed: 0_level_0,count
rating,Unnamed: 1_level_1
5,3000
4,3000
3,2000
2,2000
1,2000


The five rating classes are roughly balanced, with 2,000 to 3,000 reviews each.

The goal is to classify each review as positive or negative, so ratings below 3 are mapped to 0 (negative) and ratings of 3 or higher are mapped to 1 (positive).

In [9]:
#Preprocessing and cleaning
#Positive reviews are 1 and negative reviews are 0
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)
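The warning above appears because `df` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to modify the original frame. Below is a minimal sketch of one common way to avoid it, assuming an explicit `.copy()` at selection time is acceptable. The tiny `data` frame is made up for illustration and is not the Kindle dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (illustration only)
data = pd.DataFrame({
    "reviewText": ["Loved it", "Not for me", "It was fine", "Awful", "Great read"],
    "rating": [5, 2, 3, 1, 4],
    "asin": ["a", "b", "c", "d", "e"],
})

# .copy() makes df an independent frame, so later assignments are unambiguous
df = data[["reviewText", "rating"]].copy()

# Ratings below 3 become 0 (negative); ratings of 3 or more become 1 (positive)
df["rating"] = (df["rating"] >= 3).astype(int)

labels = df["rating"].tolist()
print(labels)
```

The boolean comparison followed by `astype(int)` produces the same labels as the `apply` with a lambda, and the original `data` frame keeps its 1 to 5 ratings.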


In [10]:
df['rating'].value_counts()

Unnamed: 0_level_0,count
rating,Unnamed: 1_level_1
1,8000
0,4000


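After binarization the classes are split 2:1 (8,000 positive to 4,000 negative), which is treated here as balanced enough to proceed without resampling.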

Preprocessing steps:
* Convert the text to lowercase
* Remove special characters, stopwords, URLs, HTML tags, and extra whitespace
* Lemmatize the remaining words

In [11]:
#Step1: convert all text to lowercase
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x:x.lower()) # equivalent to df['reviewText'].str.lower()

In [12]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [13]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
from bs4 import BeautifulSoup #To remove HTML tags

In [15]:
#Step2: Removing Special characters,Stopwords,URLs,HTML Tags and Extra Spaces
#Note: special characters are stripped first, so the URL and HTML steps below have little left to match
stop_words=set(stopwords.words('english')) #build the stopword set once instead of once per word
#Keep only lowercase letters, digits, spaces and hyphens (the text is already lowercase)
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-z0-9 -]+', '',x))
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))



In [16]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1
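The order of the cleaning steps matters: once characters such as `:`, `/`, `<`, `>` and `&` have been stripped, a URL pattern or an HTML parser has almost nothing left to recognize. The sketch below shows one possible ordering that decodes entities and removes URLs and tags before dropping special characters. It is an illustration under stated assumptions, not the notebook's pipeline: it uses only the standard library (`html.unescape` and a simple tag regex in place of BeautifulSoup), and `STOP_WORDS`, `URL_RE`, `TAG_RE` and `clean` are hypothetical names with a deliberately tiny stopword list.

```python
import html
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "a", "is", "it", "this", "at", "and", "i"}

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")  # simplified URL pattern
TAG_RE = re.compile(r"<[^>]+>")                   # simplified HTML tag pattern

def clean(text: str) -> str:
    text = text.lower()
    text = html.unescape(text)                # decode entities such as &#34; first
    text = URL_RE.sub(" ", text)              # URLs still contain ':' and '/' here
    text = TAG_RE.sub(" ", text)              # tags still contain '<' and '>' here
    text = re.sub(r"[^a-z0-9 -]+", "", text)  # now drop remaining special characters
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)                    # split/join also collapses extra spaces

cleaned = clean('I read <b>&#34;Freak City&#34;</b> at https://example.com/book?id=1 and LOVED it!')
print(cleaned)
```

With this ordering the entity `&#34;` is decoded to a quotation mark and then removed, rather than leaving a stray `34` attached to neighbouring words.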


In [17]:
#Step3 Lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [18]:
lemmatizer=WordNetLemmatizer()
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x:" ".join([lemmatizer.lemmatize(y) for y in x.split()]))

Preprocessing and cleaning are complete.

In [19]:
#Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],
                                              test_size=0.20)

In [20]:
X_train.head()

Unnamed: 0,reviewText
4006,bit peace of34 freak city34the character real ...
6422,look book author hope find book author read book
2735,found already read story bought vision given c...
9274,amazing loved didnt want end good must read ot...
624,ok stopped kidnapped castle missing way much w...


Next, the text is converted into numeric vectors using Bag of Words (BOW) and TF-IDF.

In [21]:
#Bag Of Words
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

In [22]:
#TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()

In [23]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [24]:
X_train_bow.shape,X_test_bow.shape


((9600, 35790), (2400, 35790))

In [25]:
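#GaussianNB needs dense input, which is why the vectorized matrices were converted with .toarray()
#MultinomialNB is the Naive Bayes variant usually recommended for sparse word-count or TF-IDF features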
from sklearn.naive_bayes import GaussianNB
nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)

In [26]:
#Making Predictions
y_pred_bow=nb_model_bow.predict(X_test_bow)
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)
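`GaussianNB` models each feature as a continuous, normally distributed value and requires dense arrays, which for this notebook means holding a 9,600 x 35,790 training matrix in memory. A commonly used alternative for word counts is `MultinomialNB`, which accepts the sparse matrices returned by the vectorizers directly. The sketch below shows that pattern on a made-up toy corpus. The texts and labels are assumptions for illustration, and nothing here claims a particular accuracy on the Kindle reviews.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration only (1 = positive, 0 = negative)
train_texts = ["great book loved it", "loved the story", "boring book", "terrible boring plot"]
train_labels = [1, 1, 0, 0]
test_texts = ["loved it", "boring plot"]

vectorizer = CountVectorizer()
X_train_sparse = vectorizer.fit_transform(train_texts)  # stays a sparse matrix
X_test_sparse = vectorizer.transform(test_texts)        # reuse the fitted vocabulary

# MultinomialNB works on sparse count features without calling .toarray()
model = MultinomialNB()
model.fit(X_train_sparse, train_labels)
preds = model.predict(X_test_sparse)
print(preds)
```

Keeping the matrices sparse avoids materializing the mostly zero entries, and the same `fit` and `predict` calls apply to TF-IDF features.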

In [35]:
compare_scores=pd.DataFrame({'Actual':y_test,'Predicted_BOW':y_pred_bow,'Predicted_TFIDF':y_pred_tfidf})
compare_scores.head()

Unnamed: 0,Actual,Predicted_BOW,Predicted_TFIDF
6483,1,1,1
7324,1,0,0
9944,0,0,0
11703,1,1,1
4392,1,0,0


In [27]:
# Importing the Accuracy Metrics
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [28]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))
print("TF-IDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))

BOW accuracy:  0.5770833333333333
TF-IDF accuracy:  0.5816666666666667


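Accuracy is about 58% with either Bag of Words or TF-IDF. Since 1,609 of the 2,400 test reviews are positive, always predicting the positive class would score about 67%, so both models fall below that baseline.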

In [29]:
confusion_matrix(y_test,y_pred_bow)

array([[529, 262],
       [753, 856]])

In [30]:
confusion_matrix(y_test,y_pred_tfidf)

array([[516, 275],
       [729, 880]])
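The confusion matrices follow scikit-learn's layout: rows are the true classes (0, then 1) and columns are the predicted classes, so each matrix reads `[[TN, FP], [FN, TP]]`. The sketch below derives a few summary metrics from the two matrices above using plain Python. `summarize` is a hypothetical helper written for this illustration.

```python
def summarize(cm):
    """Compute basic metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_negative": tn / (tn + fp),      # share of true negatives found
        "recall_positive": tp / (tp + fn),      # share of true positives found
        "precision_positive": tp / (tp + fp),   # share of predicted positives that are correct
        "majority_baseline": max(tn + fp, fn + tp) / total,  # always predict the larger class
    }

bow_cm = [[529, 262], [753, 856]]
tfidf_cm = [[516, 275], [729, 880]]

bow_summary = summarize(bow_cm)
tfidf_summary = summarize(tfidf_cm)

for name, summary in [("BOW", bow_summary), ("TF-IDF", tfidf_summary)]:
    print(name, {k: round(v, 3) for k, v in summary.items()})
```

For the BOW matrix this gives a positive-class recall of about 0.53 against a majority-class baseline of about 0.67, which shows that most of the errors are positive reviews predicted as negative.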