# Kindle Review Sentiment Analysis
This notebook demonstrates sentiment analysis on Amazon Kindle reviews using multiple feature extraction methods: Bag of Words, TF-IDF, and Word2Vec.

## 1. Data Loading and Initial Exploration
Load the dataset and check for missing values and rating distribution.

In [1]:
# Load the dataset
import pandas as pd
data=pd.read_csv('kindle_review_dataset/all_kindle_review.csv')
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [45]:
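# Keep only the text and label columns; .copy() gives an independent DataFrame,
# so the later column assignments do not raise SettingWithCopyWarning
df=data[['reviewText','rating']].copy()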
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [46]:
df.shape

(12000, 2)

In [47]:
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [48]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [49]:

df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [50]:
## Preprocessing and cleaning
## Binarize the label: ratings 1-2 become 0 (negative), ratings 3-5 become 1 (positive)
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)



In [51]:
df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64

In [52]:
## 1. Lowercase all review text
df['reviewText']=df['reviewText'].str.lower()



In [53]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [54]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
from bs4 import BeautifulSoup

In [56]:
%pip install lxml




In [57]:
## Build the stop-word set once instead of reloading the list for every token
stop_words=set(stopwords.words('english'))
## Remove URLs first, while '://' and '.' are still present for the pattern to match
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove HTML tags using BeautifulSoup with the built-in 'html.parser' instead of 'lxml'
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
## Remove special characters (use 'A-Z'; the range 'A-z' would also keep [ \ ] ^ _ and `)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-z A-Z 0-9-]+', '',x))
## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))


In [58]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1


In [59]:
## Lemmatizer (needs the WordNet corpus, available via nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer

In [60]:
lemmatizer=WordNetLemmatizer()

In [61]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [62]:
df['reviewText']=df['reviewText'].apply(lemmatize_words)



In [63]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


## 2. Data Preprocessing and Cleaning
The cells above lowercase each review and remove URLs, HTML tags, special characters, stopwords, and extra whitespace.
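
The cleaning steps can be sketched as one function using only the standard library. This is a simplified illustration, not the notebook's code: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and the tag regex is a crude substitute for `BeautifulSoup(...).get_text()`. The point it shows is ordering: URLs and tags are removed while their punctuation is still intact.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "is", "it", "this", "at", "and"}

URL_RE = re.compile(r"(?:http|https|ftp|ssh)://\S+")
TAG_RE = re.compile(r"<[^>]+>")            # crude stand-in for an HTML parser
NON_ALNUM_RE = re.compile(r"[^a-z0-9 -]+")  # keep lowercase letters, digits, spaces, hyphens

def clean_review(text):
    """Lowercase, drop URLs and tags, strip punctuation, remove stop words, squeeze spaces."""
    text = text.lower()
    text = URL_RE.sub("", text)        # must run before punctuation is stripped
    text = TAG_RE.sub("", text)        # must run before '<' and '>' are stripped
    text = NON_ALNUM_RE.sub("", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)            # joining split tokens also collapses extra spaces

print(clean_review("This book is <b>GREAT</b>! See http://example.com/review?id=1 at once."))
# book great see once
```

If punctuation were stripped first, the URL pattern would no longer find `://` and the URL text would stay in the review.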

## 3. Lemmatization
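The lemmatization cells above reduce each token to its WordNet lemma (for example, `books` becomes `book`). `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is supplied.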

## 4. Train-Test Split
Split the cleaned reviews and binary ratings into training and testing sets.

In [64]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],
                                              test_size=0.20)
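
The split above is random on every run, so the scores reported later will vary slightly between executions. The sketch below shows the idea of a seeded shuffle split in plain Python; `split_indices` is a hypothetical helper, not scikit-learn's implementation. In the notebook itself, passing a `random_state` value to `train_test_split` should have the same effect.

```python
import random

def split_indices(n_samples, test_size=0.20, seed=0):
    """Shuffle row indices with a fixed seed and cut off the test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # same seed, same shuffle, same split
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(12000)
print(len(train_idx), len(test_idx))       # 9600 2400

# Repeating the call with the same seed reproduces the split exactly.
assert split_indices(12000) == (train_idx, test_idx)
```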

## 5. Bag of Words Feature Extraction
Convert text to numerical features using Bag of Words.

In [65]:
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train)
X_test_bow=bow.transform(X_test)
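
To make the fit/transform distinction concrete, here is a minimal Bag of Words sketch in plain Python. It assumes simple whitespace tokenization, which is cruder than `CountVectorizer`'s default tokenizer, and the function names are illustrative. The vocabulary is learned from the training documents only, and words that first appear at transform time are ignored.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}

def to_count_vector(doc, vocabulary):
    """Count occurrences of known words; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(doc.split()).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

train_docs = ["great short read", "short boring read", "great great book"]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
# {'book': 0, 'boring': 1, 'great': 2, 'read': 3, 'short': 4}
print([to_count_vector(doc, vocabulary) for doc in train_docs])
# [[0, 0, 1, 1, 1], [0, 1, 0, 1, 1], [1, 0, 2, 0, 0]]
print(to_count_vector("great unseen book", vocabulary))
# [1, 0, 1, 0, 0]
```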

## 6. TF-IDF Feature Extraction
Convert text to numerical features using TF-IDF.

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf  = tfidf.transform(X_test)
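
The weighting itself can be sketched by hand. The version below assumes the smoothed formula that, to my understanding, `TfidfVectorizer` uses by default: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. Tokenization is plain whitespace splitting and the function names are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every word in the training documents."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Term count times idf for known words, scaled to unit L2 length."""
    counts = Counter(word for word in doc.split() if word in idf)
    weights = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()} if norm else {}

train_docs = ["great short read", "short boring read", "great great book"]
idf = fit_idf(train_docs)
vec = tfidf_vector("great great book", idf)
print(vec)
```

Rarer words receive a larger idf (`book` appears in one document, `great` in two), and every non-empty vector has length 1, so long and short reviews are put on the same scale.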


## 7. Train Multinomial Naive Bayes Classifier
Train one Multinomial Naive Bayes model on the BOW features and another on the TF-IDF features.

In [67]:
from sklearn.naive_bayes import MultinomialNB
# MultinomialNB accepts the sparse matrices from the vectorizers directly, so the
# features never have to be converted to a dense array (which risks a MemoryError)
nb_model_bow = MultinomialNB().fit(X_train_bow, y_train)
nb_model_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)
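
For intuition about what the classifier fits, here is a from-scratch sketch of multinomial Naive Bayes on word counts with Laplace smoothing (`alpha=1.0`). It is a teaching sketch under simplifying assumptions (whitespace tokens, unseen words skipped at prediction time), not scikit-learn's implementation, and all names and the toy reviews are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy data, invented for this sketch.
docs = ["great read loved", "loved great story", "boring slow plot", "slow boring waste"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "great story"), predict_nb(model, "boring plot"))  # 1 0
```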

## 8. Model Evaluation: BOW and TF-IDF
Report accuracy, the classification report, and the confusion matrix for both models on the held-out test set.

In [68]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [69]:
# Predict and evaluate BOW model
y_pred_bow = nb_model_bow.predict(X_test_bow)
print("BOW accuracy: ", accuracy_score(y_test, y_pred_bow))
print(classification_report(y_test, y_pred_bow))
print("BOW Confusion Matrix:\n", confusion_matrix(y_test, y_pred_bow))

BOW accuracy:  0.8325
              precision    recall  f1-score   support

           0       0.80      0.70      0.75       849
           1       0.85      0.90      0.87      1551

    accuracy                           0.83      2400
   macro avg       0.82      0.80      0.81      2400
weighted avg       0.83      0.83      0.83      2400

BOW Confusion Matrix:
 [[ 595  254]
 [ 148 1403]]
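
The figures in the report can be recomputed from the confusion matrix alone. The sketch below takes the BOW matrix printed above, with rows as true classes and columns as predicted classes, and derives accuracy plus precision, recall, and F1 for the positive class. `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(cm):
    """Accuracy and positive-class precision/recall/F1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

bow_cm = [[595, 254], [148, 1403]]   # BOW confusion matrix from the output above
acc, prec, rec, f1 = binary_metrics(bow_cm)
print(round(acc, 4), round(prec, 2), round(rec, 2), round(f1, 2))
# 0.8325 0.85 0.9 0.87
```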


In [70]:
# Predict and evaluate TFIDF model
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)
print("TFIDF accuracy: ", accuracy_score(y_test, y_pred_tfidf))
print(classification_report(y_test, y_pred_tfidf))
print("TFIDF Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tfidf))

TFIDF accuracy:  0.6708333333333333
              precision    recall  f1-score   support

           0       1.00      0.07      0.13       849
           1       0.66      1.00      0.80      1551

    accuracy                           0.67      2400
   macro avg       0.83      0.53      0.46      2400
weighted avg       0.78      0.67      0.56      2400

TFIDF Confusion Matrix:
 [[  59  790]
 [   0 1551]]
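
The TF-IDF matrix shows a model that predicts the positive class for almost every review. A useful reference point is the accuracy of always predicting class 1, which the sketch below computes from the same matrix; the variable names are illustrative.

```python
tfidf_cm = [[59, 790], [0, 1551]]   # rows: true class 0/1, columns: predicted 0/1

total = sum(sum(row) for row in tfidf_cm)
n_positive = sum(tfidf_cm[1])

majority_baseline = n_positive / total                       # always predict class 1
tfidf_accuracy = (tfidf_cm[0][0] + tfidf_cm[1][1]) / total   # diagonal over total
negative_recall = tfidf_cm[0][0] / sum(tfidf_cm[0])          # negatives actually caught

print(round(majority_baseline, 3), round(tfidf_accuracy, 3), round(negative_recall, 3))
# 0.646 0.671 0.069
```

On this split the TF-IDF model is only about 2.5 percentage points above the always-positive baseline, and it recovers 59 of the 849 negative reviews.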


## 9. Word2Vec Feature Extraction
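Train a Word2Vec model on the tokenized reviews and represent each review by the average of its word vectors. Note that the embedding model is trained on every review, including those that later land in the test set.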

In [71]:
# Tokenize reviews for Word2Vec
from gensim.utils import simple_preprocess
reviews_tokenized = [simple_preprocess(text) for text in df['reviewText']]

In [72]:
# Train Word2Vec model
from gensim.models import Word2Vec
w2v_model = Word2Vec(sentences=reviews_tokenized, vector_size=100, window=5, min_count=2, workers=4)

In [73]:
# Function to get average word2vec for a review
import numpy as np
def avg_word2vec(tokens, model):
    # key_to_index is a dict, so membership checks are O(1) (index_to_key is a list)
    vectors = [model.wv[word] for word in tokens if word in model.wv.key_to_index]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

In [74]:
# Compute average word2vec for all reviews
X_word2vec = np.vstack([avg_word2vec(tokens, w2v_model) for tokens in reviews_tokenized])
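
The averaging step can be illustrated without gensim or NumPy. In the sketch below, `toy_vectors` is a made-up two-dimensional embedding table standing in for the trained model, and `average_vector` mirrors the logic of `avg_word2vec`: skip unknown tokens, average the rest element-wise, and fall back to a zero vector when nothing is known.

```python
def average_vector(tokens, vectors, size):
    """Element-wise mean of the vectors of known tokens, or zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * size
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 2-dimensional embeddings, invented for this sketch.
toy_vectors = {"great": [1.0, 0.0], "read": [0.0, 1.0]}

print(average_vector(["great", "read", "unknownword"], toy_vectors, 2))  # [0.5, 0.5]
print(average_vector(["unknownword"], toy_vectors, 2))                   # [0.0, 0.0]
```

Averaging discards word order and gives every known token equal weight, which is one plausible reason this representation trails the count-based features here.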

## 10. Train-Test Split for Word2Vec Features
Split the Word2Vec features and labels for classification.

In [75]:
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(X_word2vec, df['rating'], test_size=0.20)

## 11. Random Forest Classifier on Word2Vec Features
Train and evaluate RandomForest on average Word2Vec features.

In [76]:
from sklearn.ensemble import RandomForestClassifier
rf_model_w2v = RandomForestClassifier()
rf_model_w2v.fit(X_train_w2v, y_train_w2v)
y_pred_w2v = rf_model_w2v.predict(X_test_w2v)
print("Word2Vec (Avg) accuracy: ", accuracy_score(y_test_w2v, y_pred_w2v))
print(classification_report(y_test_w2v, y_pred_w2v))
print("Word2Vec Confusion Matrix:\n", confusion_matrix(y_test_w2v, y_pred_w2v))

Word2Vec (Avg) accuracy:  0.7666666666666667
              precision    recall  f1-score   support

           0       0.66      0.54      0.60       760
           1       0.80      0.87      0.84      1640

    accuracy                           0.77      2400
   macro avg       0.73      0.71      0.72      2400
weighted avg       0.76      0.77      0.76      2400

Word2Vec Confusion Matrix:
 [[ 413  347]
 [ 213 1427]]


## 12. Accuracy Matrix Comparison
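Compare all three pipelines (BOW, TFIDF, AvgWord2Vec) on accuracy and positive-class F1-score. The comparison is indicative only: the Word2Vec features come from a separate random split and are paired with a Random Forest rather than Naive Bayes, so the rows do not share a test set or a classifier.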

In [77]:
# Collect accuracy and F1-score for all methods
from sklearn.metrics import f1_score
results = {
    'BOW': {
        'accuracy': accuracy_score(y_test, y_pred_bow),
        'f1': f1_score(y_test, y_pred_bow)
    },
    'TFIDF': {
        'accuracy': accuracy_score(y_test, y_pred_tfidf),
        'f1': f1_score(y_test, y_pred_tfidf)
    },
    'AvgWord2Vec': {
        'accuracy': accuracy_score(y_test_w2v, y_pred_w2v),
        'f1': f1_score(y_test_w2v, y_pred_w2v)
    }
}
import pandas as pd
accuracy_matrix = pd.DataFrame(results).T
print("\nAccuracy Matrix (All Methods):\n", accuracy_matrix)


Accuracy Matrix (All Methods):
              accuracy        f1
BOW          0.832500  0.874688
TFIDF        0.670833  0.797020
AvgWord2Vec  0.766667  0.835970
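
The `f1` column above is the F1-score of the positive class, which is what `f1_score` reports for binary labels by default. Because the classes are imbalanced, the macro-averaged F1 gives a less flattering picture. The sketch below recomputes it from the three confusion matrices printed earlier; the helper name is illustrative.

```python
def f1_per_class(cm):
    """F1 for class 0 and class 1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    f1_negative = 2 * tn / (2 * tn + fn + fp)
    f1_positive = 2 * tp / (2 * tp + fp + fn)
    return f1_negative, f1_positive

# Confusion matrices copied from the outputs earlier in the notebook.
confusion_matrices = {
    "BOW": [[595, 254], [148, 1403]],
    "TFIDF": [[59, 790], [0, 1551]],
    "AvgWord2Vec": [[413, 347], [213, 1427]],
}

macro_f1 = {name: sum(f1_per_class(cm)) / 2 for name, cm in confusion_matrices.items()}
for name, score in macro_f1.items():
    print(f"{name:<12} macro F1 = {score:.2f}")
# BOW          macro F1 = 0.81
# TFIDF        macro F1 = 0.46
# AvgWord2Vec  macro F1 = 0.72
```

These match the `macro avg` rows of the classification reports and show how far the TF-IDF model falls once the negative class is weighted equally.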
