# Kindle Review Sentiment Analysis 

## Kindle Review Dataset — Description for Sentiment Analysis

### Dataset overview
- Collection of customer reviews for Kindle products (review text, title, star rating) and optional metadata (reviewer ID, product ID, review time, verified purchase, helpfulness votes).
- Typical fields: `reviewText`, `summary` (title), `overall` (rating 1–5), `reviewerID`, `asin` (product), `unixReviewTime`, `helpful`.

### Typical preprocessing & label setup
- Clean text: remove HTML, URLs, extra whitespace, non‑informative punctuation; normalize case; optionally remove emojis or map them to tokens.
- Tokenization, stopword removal, and lemmatization/stemming as needed.
- Label mapping examples:
    - Binary: ratings 4–5 → positive, 1–2 → negative; 3-star reviews are usually dropped (or kept as neutral in the ternary setup).
    - Ternary: 1–2 negative, 3 neutral, 4–5 positive.
    - Regression: predict numeric rating (1–5).
- Handle class imbalance with resampling, class weights, or focal loss.
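
The label mappings above can be sketched as two small helper functions. This is a minimal illustration: the names `to_binary` and `to_ternary` are hypothetical, and dropping 3-star reviews in the binary case is only one of the options listed.

```python
def to_binary(rating):
    """Map a 1-5 star rating to 1 (positive) or 0 (negative).

    Returns None for 3-star reviews so the caller can drop them.
    """
    if rating >= 4:
        return 1
    if rating <= 2:
        return 0
    return None


def to_ternary(rating):
    """Map a 1-5 star rating to 'negative', 'neutral', or 'positive'."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


ratings = [5, 3, 1, 4, 2]
# Binary setup: 3-star reviews are filtered out before training.
binary_labels = [to_binary(r) for r in ratings if to_binary(r) is not None]
# Ternary setup: every review keeps a label.
ternary_labels = [to_ternary(r) for r in ratings]
```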

### Features to use
- Textual: `reviewText`, `summary`.
- Metadata (useful signals): `overall` (target), `helpful` counts, verified purchase flag, product/category, time features.
- Derived features: review length, punctuation counts, sentiment lexicon scores, TF-IDF / embeddings.
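
A few of the derived features can be computed with the standard library alone. The sketch below is illustrative: `derive_features` and `TOY_LEXICON` are hypothetical names, and the six-word lexicon stands in for an established sentiment lexicon.

```python
import string

# Hypothetical toy lexicon for illustration only.
TOY_LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "boring": -1, "awful": -1}


def derive_features(review_text):
    """Return simple hand-crafted features for one review."""
    tokens = review_text.lower().split()
    # Strip surrounding punctuation so "ending!" matches "ending".
    words = [t.strip(string.punctuation) for t in tokens]
    return {
        "n_words": len(tokens),
        "n_chars": len(review_text),
        "n_exclaim": review_text.count("!"),
        "n_question": review_text.count("?"),
        "lexicon_score": sum(TOY_LEXICON.get(w, 0) for w in words),
    }


features = derive_features("Great plot, good pacing, but a boring ending!")
```

Features like these can be placed alongside TF-IDF or embedding vectors as extra columns.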

### Tasks & evaluation
- Tasks: binary/ternary classification, rating regression, aspect-based sentiment, sentiment trend analysis over time.
- Metrics:
    - Classification: accuracy, precision, recall, F1 (macro for imbalanced classes), confusion matrix.
    - Regression: MAE, RMSE.
    - Use stratified train/validation/test splits or cross‑validation.
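
To make the classification metrics concrete, here is a worked computation of accuracy and macro F1 from a 2x2 confusion matrix. The counts are illustrative, and `per_class_f1` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_f1(tp, fp, fn):
    """F1 for one class from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative confusion matrix: rows are true classes, columns are
# predicted classes (the layout scikit-learn's confusion_matrix uses).
cm = [[559, 275],
      [724, 842]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# For class k: TP is the diagonal entry, FP is the rest of column k,
# FN is the rest of row k.
f1_neg = per_class_f1(tp=cm[0][0], fp=cm[1][0], fn=cm[0][1])
f1_pos = per_class_f1(tp=cm[1][1], fp=cm[0][1], fn=cm[1][0])

# Macro F1 is the unweighted mean of the per-class scores.
macro_f1 = (f1_neg + f1_pos) / 2

print(f"accuracy={accuracy:.3f} f1_neg={f1_neg:.2f} f1_pos={f1_pos:.2f} macro_f1={macro_f1:.2f}")
```

Because macro F1 weights both classes equally, it drops when the minority class is predicted poorly, even if accuracy alone looks acceptable.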

### Practical considerations
- Short reviews and sarcasm make labels noisy; rating ≠ pure sentiment (e.g., 3-star ambiguous).
- Reviewers and products introduce dependence — consider splitting by product/reviewer to test generalization.
- Remove personally identifiable info and respect dataset licensing.
- Baselines: logistic regression / SVM on TF‑IDF, then fine-tune transformer models (BERT) for stronger performance.
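
Splitting by product can be sketched without extra dependencies by hashing the product ID, so every review of a product lands on the same side of the split. This is a minimal sketch under stated assumptions: `assign_split` is a hypothetical helper, the review texts are made up, and the resulting test fraction is only approximate.

```python
import hashlib


def assign_split(group_id, test_fraction=0.2):
    """Deterministically assign a group (e.g. an asin) to 'train' or 'test'.

    Hashing the ID keeps every review of the same product on one side.
    """
    digest = hashlib.md5(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "test" if bucket < test_fraction * 100 else "train"


# Hypothetical (asin, reviewText) pairs for illustration.
reviews = [
    ("B0033UV8HI", "entertaining but average"),
    ("B0033UV8HI", "liked it"),
    ("B002HJV4DE", "great short read"),
    ("B002ZG96I4", "not what i expected"),
    ("B002ZG96I4", "slow start"),
]

train = [r for r in reviews if assign_split(r[0]) == "train"]
test = [r for r in reviews if assign_split(r[0]) == "test"]

# No product appears in both splits.
train_products = {asin for asin, _ in train}
test_products = {asin for asin, _ in test}
```

scikit-learn's `GroupShuffleSplit` provides a comparable group-aware split; the same idea applies to `reviewerID`.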

### Best practices
1. Preprocess and clean the text
2. Split the data into train and test sets
3. Vectorize with BoW, TF-IDF, and Word2Vec
4. Train ML algorithms

In [1]:
import pandas as pd
data = pd.read_csv('all_kindle_review.csv')

In [2]:
data.head()

| | Unnamed: 0.1 | Unnamed: 0 | asin | helpful | rating | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11539 | B0033UV8HI | [8, 10] | 3 | Jace Rankin may be short, but he's nothing to ... | 09 2, 2010 | A3HHXRELK8BHQG | Ridley | Entertaining But Average | 1283385600 |
| 1 | 1 | 5957 | B002HJV4DE | [1, 1] | 5 | Great short read. I didn't want to put it dow... | 10 8, 2013 | A2RGNZ0TRF578I | Holly Butler | Terrific menage scenes! | 1381190400 |
| 2 | 2 | 9146 | B002ZG96I4 | [0, 0] | 3 | I'll start by saying this is the first of four... | 04 11, 2014 | A3S0H2HV6U1I7F | Merissa | Snapdragon Alley | 1397174400 |
| 3 | 3 | 7038 | B002QHWOEU | [1, 3] | 3 | Aggie is Angela Lansbury who carries pocketboo... | 07 5, 2014 | AC4OQW3GZ919J | Cleargrace | very light murder cozy | 1404518400 |
| 4 | 4 | 1776 | B001A06VJ8 | [0, 1] | 4 | I did not expect this type of book to be in li... | 12 31, 2012 | A3C9V987IQHOQD | Rjostler | Book | 1356912000 |


In [3]:
data = data[['reviewText', 'rating']].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
data.head()

| | reviewText | rating |
|---|---|---|
| 0 | Jace Rankin may be short, but he's nothing to ... | 3 |
| 1 | Great short read. I didn't want to put it dow... | 5 |
| 2 | I'll start by saying this is the first of four... | 3 |
| 3 | Aggie is Angela Lansbury who carries pocketboo... | 3 |
| 4 | I did not expect this type of book to be in li... | 4 |


In [4]:
## missing values

data.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [5]:
## unique rating

data['rating'].unique()

array([3, 5, 4, 2, 1])

In [6]:
data['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

#### Processing and cleaning

In [7]:
## ratings >= 3 -> 1 (positive), ratings 1-2 -> 0 (negative); 3-star reviews count as positive here
data['rating'] = data['rating'].apply(lambda x : 1 if x >=3 else 0)

In [8]:
data.head()

| | reviewText | rating |
|---|---|---|
| 0 | Jace Rankin may be short, but he's nothing to ... | 1 |
| 1 | Great short read. I didn't want to put it dow... | 1 |
| 2 | I'll start by saying this is the first of four... | 1 |
| 3 | Aggie is Angela Lansbury who carries pocketboo... | 1 |
| 4 | I did not expect this type of book to be in li... | 1 |


In [9]:
data['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
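
The 8000/4000 split above is imbalanced. One of the remedies listed earlier, class weights, can be computed by hand as a sketch. `balanced_class_weights` is a hypothetical helper; the formula `n_samples / (n_classes * class_count)` is the same one scikit-learn applies for `class_weight='balanced'`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Same class counts as the value_counts() output above.
labels = [1] * 8000 + [0] * 4000
weights = balanced_class_weights(labels)
print(weights)  # {1: 0.75, 0: 1.5}
```

The minority (negative) class gets twice the weight of the majority class, which is what a classifier that accepts class weights would use to rebalance its loss.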

In [10]:
## lowercase the review text
data['reviewText']=data['reviewText'].str.lower()

In [11]:
data.head()

| | reviewText | rating |
|---|---|---|
| 0 | jace rankin may be short, but he's nothing to ... | 1 |
| 1 | great short read. i didn't want to put it dow... | 1 |
| 2 | i'll start by saying this is the first of four... | 1 |
| 3 | aggie is angela lansbury who carries pocketboo... | 1 |
| 4 | i did not expect this type of book to be in li... | 1 |


In [12]:

import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\chandan
[nltk_data]     kumar/nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [13]:
!pip install bs4



In [14]:
!pip install lxml



In [15]:
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [16]:
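### remove special characters (keep only letters, digits, and spaces)
### note: this also strips ':', '/', '<' and '>', so the URL and HTML cells that follow have little left to match;
### running those two steps before this one would make them effective
data['reviewText'] = data['reviewText'].apply(lambda x : re.sub(r'[^a-zA-Z0-9 ]+', '', x))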

In [17]:
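## remove stopwords
## build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))
data['reviewText'] = data['reviewText'].apply(lambda x : " ".join([y for y in x.split() if y not in stop_words]))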

In [18]:
## remove url
data['reviewText']= data['reviewText'].apply(lambda x : re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', str(x)))

In [19]:
## remove html tags
data['reviewText']= data['reviewText'].apply(lambda x : BeautifulSoup(x, 'lxml').get_text())

In [20]:
## remove any additional spaces
data['reviewText']=data['reviewText'].apply(lambda x : ' '.join(x.split()))
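
The cleaning steps above can also be collected into a single function that removes URLs and HTML tags before stripping special characters, so those patterns are still intact when they are matched. This is a minimal sketch, not the notebook's exact pipeline: `clean_review` is a hypothetical name, and a simple regex is swapped in for BeautifulSoup to keep the example dependency-free.

```python
import html
import re

URL_RE = re.compile(r'(?:https?|ftp|ssh)://[^\s<>]+')
TAG_RE = re.compile(r'<[^>]+>')          # simple stand-in for BeautifulSoup
NON_ALNUM_RE = re.compile(r'[^a-z0-9 ]+')


def clean_review(text):
    """Lowercase, drop URLs and HTML tags, then strip special characters."""
    text = text.lower()
    text = URL_RE.sub(' ', text)          # 1. URLs, while '://' is still present
    text = TAG_RE.sub(' ', text)          # 2. HTML tags, while '<' and '>' are still present
    text = html.unescape(text)            #    decode entities such as &amp;
    text = NON_ALNUM_RE.sub('', text)     # 3. everything except letters, digits, spaces
    return ' '.join(text.split())         # 4. collapse extra whitespace


cleaned = clean_review("<b>Great</b> read! See http://example.com/x?y=1 now")
print(cleaned)  # great read see now
```

With pandas, such a function could be applied as `data['reviewText'].apply(clean_review)`, followed by stopword removal.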

In [21]:
data.head()

| | reviewText | rating |
|---|---|---|
| 0 | jace rankin may short hes nothing mess man hau... | 1 |
| 1 | great short read didnt want put read one sitti... | 1 |
| 2 | ill start saying first four books wasnt expect... | 1 |
| 3 | aggie angela lansbury carries pocketbooks inst... | 1 |
| 4 | expect type book library pleased find price right | 1 |


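In [22]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus
lemmatizer = WordNetLemmatizer()

In [23]:
def lemmatize_words(text):
    # lemmatize each whitespace-separated token (the default part of speech is noun)
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [24]:
data['reviewText'] = data['reviewText'].apply(lemmatize_words)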

In [25]:
data.head()

| | reviewText | rating |
|---|---|---|
| 0 | jace rankin may short he nothing mess man haul... | 1 |
| 1 | great short read didnt want put read one sitti... | 1 |
| 2 | ill start saying first four book wasnt expecti... | 1 |
| 3 | aggie angela lansbury carry pocketbook instead... | 1 |
| 4 | expect type book library pleased find price right | 1 |


### Train Test Splits

In [26]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['reviewText'], data['rating'], test_size=0.20)
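
The split above is unseeded and unstratified, so the class ratio in the test set and any downstream scores can vary between runs. A minimal sketch of a seeded, stratified split follows; the toy `texts` and `labels` are hypothetical and only mirror the notebook's 2:1 class ratio.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 2:1 positive-to-negative ratio.
texts = [f"review {i}" for i in range(120)]
labels = [1] * 80 + [0] * 40

x_tr, x_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.20,
    stratify=labels,   # keep the class ratio the same in both splits
    random_state=42,   # fixed seed so the split is reproducible
)
```

Applying the same two arguments to the notebook's call would change which rows land in each split, so the outputs shown in this notebook would no longer match exactly.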

In [27]:
x_train.shape

(9600,)

In [29]:
x_train

6422      look book author hope find book author read book
10217    review novel main theme presented detailed pro...
7566     real fan scifi frank herbert book dune one gre...
2361     welome negative review book everyone else love...
7848     like cover wish reading light version like sma...
                               ...                        
1417     great book would never believed would ended wa...
5492     would several year road run someone meant worl...
9776     bad read one id recommend either humorous mome...
2170     normally like author book wasnt good female we...
9223     nothing new nothing particularly interesting a...
Name: reviewText, Length: 9600, dtype: object

In [None]:
## BOW
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
x_train_bow = bow.fit_transform(x_train).toarray()
x_test_bow = bow.transform(x_test).toarray()

In [None]:
## TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x_train_tfidf = tfidf.fit_transform(x_train).toarray()
x_test_tfidf = tfidf.transform(x_test).toarray()
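
As a hedged sketch of the logistic-regression-on-TF-IDF baseline mentioned in the overview, the vectorizer and classifier can be chained in a pipeline that keeps the features sparse instead of calling `.toarray()`. The toy reviews and the names `baseline`, `toy_texts`, and `toy_labels` are hypothetical; `class_weight="balanced"` and `max_iter=1000` are illustrative settings, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews; 1 = positive, 0 = negative.
toy_texts = [
    "great book loved it",
    "terrible plot boring",
    "loved the characters",
    "boring and terrible",
    "great read",
    "awful boring book",
]
toy_labels = [1, 0, 1, 0, 1, 0]

# The vectorizer output stays sparse; LogisticRegression accepts sparse input.
baseline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(toy_texts, toy_labels)
toy_pred = baseline.predict(["loved this great book", "boring terrible"])
```

Fitting the vectorizer inside the pipeline also guarantees it only ever sees training text, which matters when the pipeline is used with cross-validation.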

In [None]:
## word2vec
from gensim.models import Word2Vec

sentences = [doc.split() for doc in x_train]  # tokenization

model = Word2Vec(
    sentences,
    vector_size=100,  # output vector size
    window=5,
    min_count=1,
    workers=4
)
sentences

[['look', 'book', 'author', 'hope', 'find', 'book', 'author', 'read', 'book'],
 ['review', 'novel', 'main', 'theme', 'presented', 'detailed', 'product', 'description', ...],
 ['real', 'fan', 'scifi', 'frank', 'herbert', 'book', 'dune', 'one', ...],
 ...]

In [42]:
import numpy as np

def avg_word2vec(doc):
    words = doc.split()
    # key_to_index is a dict, so membership checks are O(1); index_to_key is a list and scans linearly
    vectors = [model.wv[word] for word in words if word in model.wv.key_to_index]
    
    if len(vectors) == 0:
        return np.zeros(model.vector_size)

    return np.mean(vectors, axis=0)

x_train_w2v = np.array([avg_word2vec(doc) for doc in x_train])
x_test_w2v = np.array([avg_word2vec(doc) for doc in x_test])

In [34]:
x_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(9600, 37661))

In [43]:
x_train_w2v

array([[-0.57415354,  0.9301917 ,  0.5852097 , ..., -0.76463246,
         0.7280891 ,  0.3502022 ],
       [-0.24445315,  0.64238113,  0.13964525, ..., -0.39817154,
         0.08862659, -0.05314998],
       [-0.34773916,  0.56000435,  0.20745116, ..., -0.53800964,
         0.25102553, -0.04247489],
       ...,
       [-0.4961017 ,  0.6145534 ,  0.1079912 , ..., -0.49376476,
         0.23932122, -0.12244054],
       [-0.41073385,  0.3999131 ,  0.09031245, ..., -0.66799676,
         0.36819297, -0.12371196],
       [-0.24319752,  0.5262648 ,  0.16961849, ..., -0.44157913,
         0.20396978, -0.09141967]], shape=(9600, 100), dtype=float32)

In [44]:
x_train_w2v[0].shape

(100,)

### Use Gaussian Naive Bayes

In [46]:
from sklearn.naive_bayes import GaussianNB

nb_model_bow = GaussianNB().fit(x_train_bow, y_train)
nb_model_tfidf = GaussianNB().fit(x_train_tfidf, y_train)
nb_model_w2v = GaussianNB().fit(x_train_w2v, y_train)

In [47]:
y_pred_bow = nb_model_bow.predict(x_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(x_test_tfidf)
y_pred_w2v = nb_model_w2v.predict(x_test_w2v)

In [48]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# report the same three metrics for each feature representation
for name, y_pred in [('BOW', y_pred_bow), ('TF-IDF', y_pred_tfidf), ('word2vec', y_pred_w2v)]:
    print('='*20, name, "="*20)
    print("Accuracy : ", accuracy_score(y_test, y_pred))
    print("Confusion matrix : \n", confusion_matrix(y_test, y_pred))
    print("Classification report : \n", classification_report(y_test, y_pred))

==================== BOW ====================
Accuracy :  0.58375
Confusion matrix : 
 [[559 275]
 [724 842]]
Classification report : 
               precision    recall  f1-score   support

           0       0.44      0.67      0.53       834
           1       0.75      0.54      0.63      1566

    accuracy                           0.58      2400
   macro avg       0.59      0.60      0.58      2400
weighted avg       0.64      0.58      0.59      2400

==================== TF-IDF ====================
Accuracy :  0.5858333333333333
Confusion matrix : 
 [[549 285]
 [709 857]]
Classification report : 
               precision    recall  f1-score   support

           0       0.44      0.66      0.52       834
           1       0.75      0.55      0.63      1566

    accuracy                           0.59      2400
   macro avg       0.59      0.60      0.58      2400
weighted avg       0.64      0.59      0.60      2400

==================== word2vec ====================
Accuracy :  0.6858333333333333
Confusion matrix : 
 [[ 604  230]
 [ 524 1042]]
Classification report : 
               precision    recall  f1-score   supp