### Context

This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

### Content

5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

### Columns

* asin - ID of the product, like B000FA64PK
* helpful - helpfulness rating of the review - example: 2/3.
* overall - rating of the product (this column is named rating in the CSV loaded below).
* reviewText - text of the review (body).
* reviewTime - time of the review (raw).
* reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
* reviewerName - name of the reviewer.
* summary - summary of the review (headline).
* unixReviewTime - unix timestamp.
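
In the CSV preview further down, the helpful column shows up as a string such as "[8, 10]" (helpful votes, total votes) rather than the 2/3 form listed above, and unixReviewTime is given in seconds since the epoch. A minimal sketch of how these two columns might be decoded, assuming that string format; the helper names `helpful_ratio` and `review_date` are hypothetical and not part of the dataset or the notebook:

```python
import ast
from datetime import datetime, timezone


def helpful_ratio(raw):
    """Turn a string such as "[8, 10]" into the fraction of helpful votes."""
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return None  # no votes were cast, so the ratio is undefined
    return helpful_votes / total_votes


def review_date(unix_seconds):
    """Convert a unix timestamp (seconds) to an ISO date string in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date().isoformat()


# Values copied from the first row of the preview.
print(helpful_ratio("[8, 10]"))
print(helpful_ratio("[0, 0]"))
print(review_date(1283385600))
```

With pandas, helpers like these could be applied column-wise, for example `df['helpful'].apply(helpful_ratio)`.
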

### Acknowledgements

This dataset is taken from the Amazon product data collection on Julian McAuley's UCSD website: http://jmcauley.ucsd.edu/data/amazon/

The license for the data files belongs to the original authors.

### Inspiration

* Sentiment analysis on reviews.
* Understanding how people rate the usefulness of a review, and which factors influence a review's helpfulness.
* Detecting fake reviews and outliers.
* Finding the best-rated product IDs, or measuring similarity between products based on reviews alone (admittedly not the best idea).
* Any other interesting analysis.


### Best Practices

1. Preprocessing and cleaning
2. Train/test split
3. BOW, TF-IDF, Word2Vec
4. Train ML algorithms
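
The steps listed above can be sketched end to end on a handful of made-up reviews. This is only an illustration under stated assumptions: the toy `texts` and `labels` are invented, and the sketch uses `TfidfVectorizer` and `MultinomialNB` for brevity, whereas the cells below use `CountVectorizer`, `TfidfTransformer` and `GaussianNB`. The Word2Vec step is left out here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy reviews (1 = positive, 0 = negative), already lowercased and cleaned.
texts = [
    "loved this book great story",
    "wonderful characters and a fun plot",
    "great read could not put it down",
    "enjoyable and well written romance",
    "boring plot and flat characters",
    "terrible writing waste of money",
    "disappointing ending very dull",
    "poorly edited and confusing story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 2: hold out a test set before fitting anything.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Steps 3 and 4: the vectorizer is fitted on the training texts only,
# then the classifier is trained on the resulting TF-IDF features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary from being learned on test data, which is the same reason the cells below call `fit_transform` on the training split and only `transform` on the test split.
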

In [47]:
import numpy as np
import pandas as pd 

In [48]:
df = pd.read_csv('all_kindle_review.csv')
df.head(10)  # Display the first few rows of the DataFrame

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000
5,5,3744,B0021L9YDK,"[6, 6]",5,Aislinn is a little girl with big dreams. Afte...,"12 7, 2009",A3J5NN6MJK4M4A,"Aubrie A. Dionne ""Fantasy, Sci Fi Author""",A story of a little girl with big dreams.,1260144000
6,6,13641,B0038NN38W,"[1, 1]",2,This has the makings of a good story... unfort...,"08 18, 2011",A531QY5K7JVXI,Chicano,This story has potential but ultimately disapp...,1313625600
7,7,4448,B002AJ7X2C,"[1, 1]",4,I got this because I like collaborated short s...,"03 8, 2010",AN8ELR6AHMMQ,"Jessss ""I read to find stories that inspire m...",Good thriller,1268006400
8,8,2797,B001L5T22U,"[0, 0]",5,"Loved this book, I am hooked on this series an...","09 30, 2013",AMSWCFSQ8SLK9,Amazon Customer,Loved it!,1380499200
9,9,5294,B002F3PPVE,"[0, 1]",4,"And that's a good thing. Short, sweet tease th...","07 29, 2009",AB53C7GYZHYIE,"A. Williams ""blkkat""",I was scared...,1248825600


In [49]:
df = df[['reviewText', 'rating']].copy()  # copy so later column assignments do not write to a view
df.head(1)  # Display the first row of the DataFrame after selecting specific columns

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3


In [50]:
df.shape

(12000, 2)

In [51]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [52]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [53]:
df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)

In [54]:
df['rating'].head(10)

0    1
1    1
2    1
3    1
4    1
5    1
6    0
7    1
8    1
9    1
Name: rating, dtype: int64

In [55]:
df['rating'].value_counts() 

rating
1    8000
0    4000
Name: count, dtype: int64
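
The mapping above treats ratings of 3 and higher as positive (1) and ratings below 3 as negative (0), which turns the 3000/3000/2000/2000/2000 distribution into an 8000/4000 split. A minimal sketch of an equivalent vectorized mapping, together with the weights that scikit-learn's `class_weight="balanced"` heuristic would assign to such a split; the toy `ratings` series and the variable names are illustrative assumptions:

```python
import pandas as pd

# Toy ratings standing in for df['rating'] (illustrative values only).
ratings = pd.Series([3, 5, 4, 2, 1], name="rating")

# Boolean comparison plus a cast gives the same 0/1 labels as the lambda version.
labels = (ratings >= 3).astype(int)
print(labels.tolist())

# Class counts reported by value_counts() above.
class_counts = {1: 8000, 0: 4000}
n_samples = sum(class_counts.values())

# "balanced" heuristic: n_samples / (n_classes * count_of_class).
balanced_weights = {
    label: n_samples / (len(class_counts) * count)
    for label, count in class_counts.items()
}
print(balanced_weights)
```

Weights like these can be passed to estimators that accept `class_weight`, such as `RandomForestClassifier`. Whether that helps on this data would need to be checked empirically.
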

In [56]:
df['reviewText'] = df['reviewText'].str.lower()

In [57]:
df['reviewText'].head(5)

0    jace rankin may be short, but he's nothing to ...
1    great short read.  i didn't want to put it dow...
2    i'll start by saying this is the first of four...
3    aggie is angela lansbury who carries pocketboo...
4    i did not expect this type of book to be in li...
Name: reviewText, dtype: object

In [58]:
import re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [59]:
## Removing special characters (note: the range A-z also lets [ \ ] ^ _ ` through)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-z A-z 0-9-]+', '',x))
## Remove the stopwords (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words('english'))
df['reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
## Remove URLs (':' and '/' were already stripped above, so this pattern can no longer match)
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove HTML tags ('<' and '>' were already stripped above as well)
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))



In [60]:
df['reviewText'].head(5)

0    jace rankin may short hes nothing mess man hau...
1    great short read didnt want put read one sitti...
2    ill start saying first four books wasnt expect...
3    aggie angela lansbury carries pocketbooks inst...
4    expect type book library pleased find price right
Name: reviewText, dtype: object
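
The order of the cleaning steps matters: once punctuation has been stripped, a URL or an HTML tag no longer looks like one. A hedged sketch of an alternative ordering that removes URLs and tags first and punctuation afterwards. It is not the pipeline that produced the results in this notebook. The regular expressions, the tiny `STOP_WORDS` set (a stand-in for the NLTK list) and the `clean_review` name are all assumptions made for illustration, and the tag regex is a simplification of what BeautifulSoup does.

```python
import re

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALNUM_RE = re.compile(r"[^a-z0-9 -]+")

# Tiny stand-in list; the notebook uses stopwords.words('english') from NLTK.
STOP_WORDS = {"the", "a", "is", "it", "this", "at", "and"}


def clean_review(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)       # URLs first, while ':' '/' '.' still exist
    text = TAG_RE.sub(" ", text)       # then HTML tags, while '<' '>' still exist
    text = NON_ALNUM_RE.sub("", text)  # now drop remaining punctuation and symbols
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


cleaned = clean_review(
    "This is <b>GREAT</b>, see http://example.com/x_y?id=1 it's a must-read!"
)
print(cleaned)

# The character class used in the notebook keeps '_' and '^',
# because the range A-z spans the symbols between 'Z' and 'a'.
loose = re.sub('[^a-z A-z 0-9-]+', '', 'a_b^c!')
print(loose)
```

Switching the notebook to an ordering like this would change the cleaned text, so the scores reported further down would need to be recomputed.
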

In [61]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [62]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])  

In [63]:
df['reviewText'] = df['reviewText'].apply(lambda x :lemmatize_words(x) )

In [64]:
df.head(5)

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [65]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df['reviewText'], df['rating'], test_size=0.3, random_state=42)
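
The split above is purely random, so the class ratio in the test set can drift from the overall 2:1 ratio (the reports further down show 1244 negative and 2356 positive test reviews, against 1200 and 2400 for an exact 2:1 split). A hedged sketch of a stratified split on toy data; the `toy_texts` and `toy_labels` names are made up for illustration:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with the same 2:1 positive/negative ratio as the notebook's data.
toy_labels = [1] * 8 + [0] * 4
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

# stratify= keeps the class proportions equal in the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

In the notebook this would correspond to passing `stratify=df['rating']`. Doing so changes which rows land in each split, so the reported scores would change as well.
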

In [80]:
from sklearn.naive_bayes import GaussianNB

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()  

In [72]:
x_train_bow = vectorizer.fit_transform(x_train).toarray()
x_test_bow = vectorizer.transform(x_test).toarray()

In [70]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

In [73]:
x_train_tfidf = tfidf_transformer.fit_transform(x_train_bow).toarray()
x_test_tfidf = tfidf_transformer.transform(x_test_bow).toarray()
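
To make the TF-IDF step less of a black box, the weighting can be reproduced by hand. The sketch below assumes the default `TfidfTransformer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The toy corpus and the `tfidf_row` helper are illustrative assumptions.

```python
import math
from collections import Counter

# Toy tokenized corpus (illustrative only).
docs = [["good", "book", "good"], ["bad", "book"], ["good", "story"]]
n_docs = len(docs)

# Document frequency: the number of documents each term occurs in.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(doc):
    """Return the L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row = tfidf_row(docs[1])
print({term: round(weight, 3) for term, weight in row.items()})
```

In the second toy document, "bad" gets a larger weight than "book" because it appears in fewer documents, which is the effect TF-IDF adds on top of plain bag-of-words counts.
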

In [81]:
nb_model_bow = GaussianNB().fit(x_train_bow, y_train)
nb_model_tfidf = GaussianNB().fit(x_train_tfidf, y_train)

In [82]:
nb_model_bow

In [83]:
nb_model_tfidf

In [84]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [85]:
y_pred_bow = nb_model_bow.predict(x_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(x_test_tfidf)

In [86]:
accuracy_score_bow = accuracy_score(y_test, y_pred_bow)
accuracy_score_tfidf = accuracy_score(y_test, y_pred_tfidf)

In [87]:
print(accuracy_score_bow)
print(accuracy_score_tfidf)

0.5927777777777777
0.5938888888888889


In [88]:
classification_report_bow = classification_report(y_test, y_pred_bow)
classification_report_tfidf = classification_report(y_test, y_pred_tfidf)

In [89]:
print(f"Accuracy for BOW model: {accuracy_score_bow}")
print(f"Accuracy for TF-IDF model: {accuracy_score_tfidf}") 
print("Classification Report for BOW model:\n", classification_report_bow)
print("Classification Report for TF-IDF model:\n", classification_report_tfidf)

Accuracy for BOW model: 0.5927777777777777
Accuracy for TF-IDF model: 0.5938888888888889
Classification Report for BOW model:
               precision    recall  f1-score   support

           0       0.44      0.64      0.52      1244
           1       0.75      0.57      0.65      2356

    accuracy                           0.59      3600
   macro avg       0.59      0.60      0.58      3600
weighted avg       0.64      0.59      0.60      3600

Classification Report for TF-IDF model:
               precision    recall  f1-score   support

           0       0.44      0.62      0.51      1244
           1       0.74      0.58      0.65      2356

    accuracy                           0.59      3600
   macro avg       0.59      0.60      0.58      3600
weighted avg       0.64      0.59      0.60      3600
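
A useful sanity check for these numbers is the accuracy of a trivial model that always predicts the majority class. The short calculation below uses only the support counts and the (rounded) per-class recall of the BOW model from the report above; the variable names are illustrative.

```python
# Support counts copied from the classification reports above.
support = {0: 1244, 1: 2356}
# Per-class recall of the BOW model, as rounded in the report.
recall_bow = {0: 0.64, 1: 0.57}

total = sum(support.values())

# Accuracy of a baseline that always predicts the majority class.
majority_baseline = max(support.values()) / total

# Accuracy equals the support-weighted mean of the per-class recall.
accuracy_from_recall = sum(recall_bow[c] * support[c] for c in support) / total

print(f"majority baseline: {majority_baseline:.3f}")
print(f"accuracy rebuilt from recall: {accuracy_from_recall:.3f}")
```

The baseline comes out at roughly 0.654, so both Naive Bayes models (about 0.59) score below what always predicting the positive class would achieve on this test set. The rebuilt accuracy differs slightly from the reported 0.5928 only because the recall values are rounded to two decimals.
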



### Training with Word2Vec and average Word2Vec

In [90]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

In [91]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [92]:
x_train_tokens = [text.split() for text in x_train]
x_test_tokens = [text.split() for text in x_test]

In [99]:
x_train_tokens

[['different',
  'new',
  'good',
  'cute',
  'story',
  'way',
  'wont',
  'lie',
  'little',
  'bridle',
  'part',
  'faint',
  'heart',
  'good',
  'read'],
 ...]

In [118]:
w2v_model = Word2Vec(sentences=x_train_tokens, vector_size=100, window=12, min_count=5, workers=4)

In [119]:
w2v_model

<gensim.models.word2vec.Word2Vec at 0x22bcd154980>

In [120]:
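def avg_word2vec(tokens, model, vector_size):
    """Average the Word2Vec vectors of the tokens found in the model's vocabulary."""
    vec = np.zeros(vector_size)
    count = 0
    for word in tokens:
        if word in model.wv:  # skip words that are not in the trained vocabulary
            vec += model.wv[word]
            count += 1
    # Reviews without any known word keep the all-zero vector
    return vec / count if count > 0 else vec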

In [121]:
x_train_w2v = np.array([avg_word2vec(tokens, w2v_model, 100) for tokens in x_train_tokens])
x_test_w2v = np.array([avg_word2vec(tokens, w2v_model, 100) for tokens in x_test_tokens])
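
To see what the averaging does, including its handling of out-of-vocabulary words, the function can be run against a tiny stand-in model. The `toy_model` below is a hypothetical substitute for the trained gensim model: it only mimics the `.wv` lookup with a plain dictionary of two-dimensional vectors.

```python
from types import SimpleNamespace

import numpy as np


def avg_word2vec(tokens, model, vector_size):
    """Average the vectors of the tokens that the model knows."""
    vec = np.zeros(vector_size)
    count = 0
    for word in tokens:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count > 0 else vec


# Hypothetical stand-in for a trained model: .wv behaves like a word -> vector mapping.
toy_model = SimpleNamespace(
    wv={"good": np.array([1.0, 0.0]), "book": np.array([0.0, 1.0])}
)

known = avg_word2vec(["good", "book", "unseen"], toy_model, 2)
unknown = avg_word2vec(["unseen"], toy_model, 2)
print(known)
print(unknown)
```

Unknown words are skipped rather than counted, so "unseen" does not pull the average toward zero, and a review made up entirely of unknown words is represented by the zero vector.
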

In [122]:
rf_model_w2v = RandomForestClassifier(random_state=42)
rf_model_w2v.fit(x_train_w2v, y_train)

In [123]:
y_pred_w2v = rf_model_w2v.predict(x_test_w2v)
accuracy_score_w2v = accuracy_score(y_test, y_pred_w2v)
classification_report_w2v = classification_report(y_test, y_pred_w2v)

print(f"Accuracy for Average Word2Vec model: {accuracy_score_w2v}")
print("Classification Report for Average Word2Vec model:\n", classification_report_w2v)

Accuracy for Average Word2Vec model: 0.7536111111111111
Classification Report for Average Word2Vec model:
               precision    recall  f1-score   support

           0       0.67      0.56      0.61      1244
           1       0.79      0.85      0.82      2356

    accuracy                           0.75      3600
   macro avg       0.73      0.71      0.72      3600
weighted avg       0.75      0.75      0.75      3600

