## About Dataset
Context
This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

Content
A 5-core dataset of product reviews from the Amazon Kindle Store category, spanning May 1996 to July 2014. It contains 982,619 entries in total. Each reviewer has at least 5 reviews, and each product has at least 5 reviews.
Columns

- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review - example: 2/3.
- overall - rating of the product (this column is named `rating` in the CSV loaded below).
- reviewText - text of the review (body).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer.
- summary - summary of the review (heading).
- unixReviewTime - unix timestamp.
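
In the CSV loaded below, the `helpful` field is stored as a string such as `"[8, 10]"`. A minimal sketch of turning it into a ratio, assuming the two numbers are helpful votes and total votes (consistent with the 2/3 example above); `helpful_ratio` is a hypothetical helper name, not part of the dataset or the notebook:

```python
import ast

def helpful_ratio(raw):
    """Parse a string like "[8, 10]" and return helpful/total, or None if there are no votes."""
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return None  # no votes cast, so the ratio is undefined
    return helpful_votes / total_votes

# Sample values copied from the first rows of the CSV
samples = ["[8, 10]", "[1, 1]", "[0, 0]", "[1, 3]"]
ratios = [helpful_ratio(s) for s in samples]
print(ratios)
```

Returning `None` for `[0, 0]` keeps reviews without votes distinguishable from reviews that were voted unhelpful.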

Acknowledgements
This dataset is taken from the Amazon product data published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license for the data files belongs to them.

Inspiration
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review, and which factors influence helpfulness.
- Fake reviews and outliers.
- Best-rated product IDs, or similarity between products based on reviews alone (not the best idea, I know).
- Any other interesting analysis.

#### Best Practices
1. Preprocessing And Cleaning
2. Train Test Split
3. BOW, TF-IDF, Word2Vec
4. Train ML algorithms

In [10]:
### Read the data
import pandas as pd

df = pd.read_csv("Kindle Reviews/all_kindle_review.csv")

In [11]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [14]:
df = df[['reviewText', 'rating']]

In [15]:
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [16]:
df.shape

(12000, 2)

In [17]:
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [19]:
df.duplicated().sum()

0

In [20]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [21]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [22]:
### Preprocessing and cleaning

In [24]:
### Ratings of 3 and above are positive (1); ratings below 3 are negative (0)
df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)
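
The warning above arises because `df` was created by selecting columns from another DataFrame, so pandas cannot tell whether the assignment targets a copy. One common way to avoid it, sketched here on a toy frame (`raw`, `subset`, and their values are made up for illustration), is to take an explicit `.copy()` right after the column selection:

```python
import pandas as pd

# Toy stand-in for the loaded CSV; values are illustrative only
raw = pd.DataFrame({
    "reviewText": ["Loved it", "Not for me", "It was fine"],
    "rating": [5, 1, 3],
    "summary": ["Great", "Bad", "Okay"],
})

# .copy() makes the subset an independent DataFrame, so later assignments are unambiguous
subset = raw[["reviewText", "rating"]].copy()

# Ratings below 3 become 0 (negative); 3 and above become 1 (positive)
subset["rating"] = (subset["rating"] >= 3).astype(int)
print(subset)
```

The original frame is left untouched, which can be handy if the raw ratings are needed again later.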


In [25]:
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",1
1,Great short read. I didn't want to put it dow...,1
2,I'll start by saying this is the first of four...,1
3,Aggie is Angela Lansbury who carries pocketboo...,1
4,I did not expect this type of book to be in li...,1


In [26]:
df['rating'].unique()

array([1, 0], dtype=int64)

In [27]:
df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
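
These counts follow from the rating distribution shown earlier: ratings 3, 4, and 5 map to 1, and ratings 1 and 2 map to 0. A quick check of the arithmetic, with the per-rating counts copied from the earlier `value_counts()` output:

```python
# Per-rating counts copied from the value_counts() output before binarization
rating_counts = {5: 3000, 4: 3000, 3: 2000, 2: 2000, 1: 2000}

# Same rule as the notebook: ratings below 3 are negative (0), the rest positive (1)
label_counts = {0: 0, 1: 0}
for rating, count in rating_counts.items():
    label_counts[0 if rating < 3 else 1] += count

positive_share = label_counts[1] / sum(label_counts.values())
print(label_counts, round(positive_share, 3))
```

Two thirds of the labels are positive, so a classifier that always predicts 1 would score about 0.667 on a representative sample. That is a useful floor to keep in mind when reading accuracy figures.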

In [30]:
## 1. Lowercase all text
df['reviewText'] = df['reviewText'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewText'] = df['reviewText'].str.lower()


In [31]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [35]:
## 2. Clean the data 
import re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [36]:
## Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
## Remove special characters (the range is A-Z; A-z would also keep [ \ ] ^ _ and `)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-z A-Z 0-9-]+', '',x))
## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
## Remove URLs
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove HTML tags
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))
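
Because the special-character filter runs first in the cell above, characters such as `:`, `/`, `<`, and `>` are already gone by the time the URL and HTML steps run, so those steps have little left to match. A sketch of an alternative ordering, under these assumptions: `clean_review` is a hypothetical helper, tags are stripped with a simple regular expression rather than BeautifulSoup, and `STOP_WORDS` is a tiny illustrative set rather than NLTK's English list:

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "at", "of", "and", "i"}

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s-]+")

def clean_review(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs while ':' and '/' are still present
    text = TAG_RE.sub(" ", text)        # remove HTML tags while '<' and '>' are still present
    text = NON_ALNUM_RE.sub("", text)   # then drop the remaining special characters
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)              # split/join also collapses extra whitespace

cleaned = clean_review("I <b>loved</b> this book! See https://example.com/review?id=1 for more.")
print(cleaned)
```

Wrapping the steps in one function also means each review is traversed by a single `apply` call rather than five.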


In [37]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1


In [38]:
### Lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [39]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])


In [40]:
df['reviewText']=df['reviewText'].apply(lambda x:lemmatize_words(x))

In [41]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [42]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],
                                              test_size=0.20)

In [43]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((9600,), (2400,), (9600,), (2400,))
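
The split above is unseeded, so each run produces a different partition, and the class mix of the test set can drift from the overall 2:1 ratio. A sketch using the `random_state` and `stratify` parameters of `train_test_split` on toy data (the texts and labels are made up, and the seed value 42 is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Toy data with the same 2:1 positive-to-negative ratio as the binarized ratings
texts = [f"review {i}" for i in range(30)]
labels = [1] * 20 + [0] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.20,
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep the class ratio the same in both partitions
)

print(len(X_tr), len(X_te), sum(y_te), len(y_te) - sum(y_te))
```

With stratification, both partitions keep the 2:1 ratio, which makes scores comparable across reruns.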

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()

In [88]:
X_train_bow.shape, X_test_bow.shape

((9600, 36186), (2400, 36186))

In [90]:
X_train_tfidf.shape, X_test_tfidf.shape

((9600, 36186), (2400, 36186))
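
Calling `.toarray()` turns the sparse vectorizer output into a dense array, which is large at these shapes. A back-of-the-envelope estimate using the shapes printed above and 8 bytes per element (`int64` for the counts, `float64` for TF-IDF):

```python
# Shapes taken from the outputs above
n_train, n_test, n_features = 9600, 2400, 36186
bytes_per_element = 8  # int64 and float64 both use 8 bytes

train_bytes = n_train * n_features * bytes_per_element
test_bytes = n_test * n_features * bytes_per_element

print(f"train matrix: {train_bytes / 1e9:.2f} GB")
print(f"test matrix:  {test_bytes / 1e9:.2f} GB")
```

That is roughly 2.78 GB for each dense training matrix, and the notebook holds two of them (BOW and TF-IDF). Keeping the sparse output avoids this cost for estimators that accept sparse input.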

In [71]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

In [68]:
X_train

10077    joedandonigijim richards sixtieth birthday loo...
5307     read book found harder harder believe heroine ...
1        great short read didnt want put read one sitti...
4388     unfortunately able connect character story alt...
6388     good story line like dragonriders theme defina...
                               ...                        
8874     read see non hollywood version story get littl...
9213     although one main character emotionally challe...
277      story short think novella well-written plot mo...
6247     simply subject wish explore spam amight want r...
2698     love romance little spice however like read mm...
Name: reviewText, Length: 9600, dtype: object

In [72]:
[word_tokenize(sentence) for sentence in X_train]

[['joedandonigijim',
  'richards',
  'sixtieth',
  'birthday',
  'loom',
  'get',
  'email',
  'woman',
  'discovers',
  'murdered',
  'late',
  'save',
  'trusty',
  'biker',
  'sidekick',
  'try',
  'stop',
  'rapidly',
  'escalating',
  'murder',
  'get',
  'woman',
  'admits',
  'loved',
  'long',
  'ago',
  '40',
  'year',
  'never',
  'knew',
  'figure',
  'mystery',
  'time',
  'save',
  'herif',
  'like',
  'good',
  'thrillermystery',
  'youll',
  'enjoy',
  'onephotoman35mm'],
 ['read',
  'book',
  'found',
  'harder',
  'harder',
  'believe',
  'heroine',
  'chosen',
  'queen',
  'nayla',
  'heroine',
  'came',
  'spineless',
  'childish',
  'like',
  'kid',
  'playing',
  'dress',
  'court',
  'soon',
  'see',
  'potential',
  'lover',
  'one',
  'night',
  'sex',
  'shes',
  'already',
  'breaking',
  'rule',
  'capitulating',
  'demandsbut',
  'isnt',
  'suspension',
  'belief',
  'story',
  'would',
  'woman',
  'who',
  'family',
  'killed',
  'tortured',
  'werewolf',


In [73]:
# Tokenize the sentences
X_train_tokens = [word_tokenize(sentence) for sentence in X_train]
X_test_tokens = [word_tokenize(sentence) for sentence in X_test]

# Train Word2Vec model
w2v = Word2Vec(X_train_tokens)

In [85]:
### Related to Word2Vec

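def word2Vec():
    """Collect a few diagnostics from the trained Word2Vec model."""
    return {
        "vocabulary": w2v.wv.index_to_key,            # all words in the vocabulary
        "corpus_count": w2v.corpus_count,             # same as the number of rows in X_train
        "epochs": w2v.epochs,                         # number of training epochs
        "similar_to_book": w2v.wv.similar_by_word("book"),
        "book_vector": w2v.wv['book'],
        "book_vector_shape": w2v.wv['book'].shape,
    }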

In [96]:
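def avg_word2vec(doc):
    # Keep only words the model knows; key_to_index is a dict, so lookups are fast
    vectors = [w2v.wv[word] for word in doc if word in w2v.wv.key_to_index]
    # Fall back to a zero vector when no word is in the vocabulary
    if not vectors:
        return np.zeros(w2v.wv.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)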
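
To make the averaging concrete, here is a toy sketch with a hand-written lookup table standing in for `w2v.wv` (`toy_vectors` and `average_vector` are hypothetical names, and the 3-dimensional vectors are made up). It also shows one way to handle a document with no in-vocabulary words, where `np.mean` over an empty list would otherwise return `nan`:

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for a trained model's word vectors
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "book": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size=3):
    known = [vectors[t] for t in tokens if t in vectors]  # skip out-of-vocabulary tokens
    if not known:
        return np.zeros(size)  # fallback keeps every document vector the same shape
    return np.mean(known, axis=0)

doc_vec = average_vector(["great", "book", "unseenword"], toy_vectors)
empty_vec = average_vector(["unseenword"], toy_vectors)
print(doc_vec)
print(empty_vec)
```

The zero-vector fallback is one choice among several; dropping such documents before training is another.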

In [102]:
from tqdm import tqdm

### Apply the averaged Word2Vec embedding to every review
X_train_w2v = [avg_word2vec(tokens) for tokens in tqdm(X_train_tokens)]
X_test_w2v = [avg_word2vec(tokens) for tokens in tqdm(X_test_tokens)]

100%|██████████| 9600/9600 [00:11<00:00, 821.39it/s]
100%|██████████| 2400/2400 [00:03<00:00, 790.29it/s]


In [103]:
len(X_train_w2v), len(X_test_w2v)

(9600, 2400)

In [106]:
X_train_w2v = np.array(X_train_w2v)
X_test_w2v = np.array(X_test_w2v)

In [107]:
X_train_w2v.shape, X_test_w2v.shape

((9600, 100), (2400, 100))

In [51]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [52]:
X_train_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [108]:
X_train_w2v

array([[-2.0983480e-01,  3.0762085e-01,  8.4424518e-02, ...,
        -3.3683756e-01,  3.9656937e-02, -3.9684776e-02],
       [-2.2158721e-01,  2.9790702e-01,  3.0191787e-02, ...,
        -2.2551228e-01,  1.7536542e-01, -8.2067341e-02],
       [-2.4911409e-04,  2.2587113e-01, -1.7674379e-01, ...,
        -2.3603600e-01,  1.4806579e-01,  8.7271869e-02],
       ...,
       [-7.5369659e-03,  1.5804595e-01, -1.1266444e-01, ...,
        -2.5790337e-01,  8.0230445e-02,  1.3069437e-01],
       [-1.9126002e-01,  2.1931860e-01,  6.9443421e-03, ...,
        -2.7784047e-01,  1.3225399e-01, -3.9408572e-02],
       [ 3.4967999e-03,  2.3180741e-01, -1.7278820e-01, ...,
        -2.3632197e-01,  1.3344739e-01,  4.2125251e-02]], dtype=float32)

In [109]:
from sklearn.naive_bayes import GaussianNB
nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)
nb_model_w2v=GaussianNB().fit(X_train_w2v, y_train)
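
`GaussianNB` models each feature as normally distributed and needs dense input, which is an awkward fit for sparse word counts. scikit-learn's `MultinomialNB` is designed for count-like features and accepts sparse matrices directly. A sketch on a made-up toy corpus, offered as an alternative to try rather than a claim about how it would score on this dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus; 1 = positive, 0 = negative
train_texts = ["great book loved it", "wonderful story great read",
               "boring plot hated it", "terrible boring waste"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_texts)     # stays a SciPy sparse matrix

model = MultinomialNB().fit(X_counts, train_labels)  # no .toarray() needed

preds = model.predict(vectorizer.transform(["great story", "boring waste"]))
print(preds)
```

Because the features stay sparse, this variant also sidesteps the memory cost of the dense arrays.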

In [54]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [55]:
y_pred_bow=nb_model_bow.predict(X_test_bow)

In [56]:
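# Predict with the model trained on TF-IDF features.
# The TF-IDF scores logged below came from nb_model_bow applied to TF-IDF inputs.
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)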

In [110]:
y_pred_w2v=nb_model_w2v.predict(X_test_w2v)

In [58]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))
print("BOW Confusion Matrix: \n",confusion_matrix(y_test,y_pred_bow))

BOW accuracy:  0.5933333333333334
BOW Confusion Matrix: 
 [[508 274]
 [702 916]]


In [59]:
print("TF-IDF accuracy: ",accuracy_score(y_test, y_pred_tfidf))
print("TF-IDF Confusion Matrix: \n",confusion_matrix(y_test, y_pred_tfidf))

TF-IDF accuracy:  0.59375
TF-IDF Confusion Matrix: 
 [[503 279]
 [696 922]]


In [111]:
print("Word2Vec accuracy: ",accuracy_score(y_test, y_pred_w2v))
print("Word2Vec Confusion Matrix: \n",confusion_matrix(y_test, y_pred_w2v))

Word2Vec accuracy:  0.6983333333333334
Word2Vec Confusion Matrix: 
 [[ 562  220]
 [ 504 1114]]
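
Accuracy alone hides how each class fares. Since 1618 of the 2400 test reviews are positive, a model that always predicts 1 would reach about 0.674 accuracy, which the runs labelled BOW and TF-IDF do not reach. A sketch that derives accuracy and per-class recall from the matrices printed above (rows are true labels and columns are predictions, following scikit-learn's `confusion_matrix` convention; `summarize` is a hypothetical helper):

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
matrices = {
    "BOW": [[508, 274], [702, 916]],
    "TF-IDF": [[503, 279], [696, 922]],
    "Word2Vec": [[562, 220], [504, 1114]],
}

def summarize(cm):
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_negative": tn / (tn + fp),   # share of actual negatives predicted as 0
        "recall_positive": tp / (fn + tp),   # share of actual positives predicted as 1
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

for name, cm in matrices.items():
    stats = summarize(cm)
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

In all three runs a large share of positive reviews is predicted as negative (the bottom-left cell), which is what pulls accuracy toward or below the majority baseline.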
