### Data Classes

In [88]:
import random
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'
    POSITIVE = 'POSITIVE'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

### Load Data

In [77]:
# scikit-learn is great for traditional, algorithmic models
# TensorFlow is for neural networks
import json

file_name = 'books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[5].sentiment

'POSITIVE'

### Data Prep

In [89]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)

In [103]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

train_y.count(Sentiment.POSITIVE)
train_y.count(Sentiment.NEGATIVE)


436

In **scikit-learn** you specify what fraction of the data is held out for testing (`test_size=0.33` above); the rest is used for training. This leaves the model some data it has not seen before, so its results can be checked.

When `random_state` is not set, the training data changes on every run, so accuracy may change from run to run. When `random_state` is set to a constant integer, the training data stays the same on every run, which makes debugging easier.

The `random_state` is simply the seed for the shuffle, like a lot number for the randomly generated set. We can pass the same number whenever we want the same set again.
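A minimal sketch of that behaviour, using a list of integers as a stand-in for the reviews (the data and the second seed `7` are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))  # stand-in for the list of Review objects

# Same seed -> identical split on every call
train_a, test_a = train_test_split(data, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.33, random_state=42)
same_split = (train_a == train_b and test_a == test_b)

# A different seed is free to shuffle the data differently
train_c, test_c = train_test_split(data, test_size=0.33, random_state=7)

print(same_split)
print(len(train_a), len(test_a))
```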

In [51]:
print(training[0].text)

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an

In [52]:
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

#### Bag of words vectorization

In [113]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# TfidfVectorizer - term frequency-inverse document frequency. If a word occurs in many documents, its weight is lessened
# This book is great !
# This book was so bad

#vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer()

train_x_vectors = vectorizer.fit_transform(train_x)

test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0])

These four books were free :) and I was hoping to discover a new author.  Alas, the only reason I can give them two stars is because I did read the first one (waiting for the writing to improve) and skim through three.  The plots are very thin and silly, with many places where abrupt shifts illustrate the illogical storylines and truly dumb dialogue between the characters.  These books are really Too Stupid to Read.
  (0, 7594)	0.1547073294028731
  (0, 8079)	0.084283727513289
  (0, 6411)	0.07657430792937012
  (0, 1362)	0.0737675348506117
  (0, 874)	0.11479882721841433
  (0, 2220)	0.1457786206229173
  (0, 2494)	0.16387494328087113
  (0, 8205)	0.12781052935335702
  (0, 7536)	0.2037834454653299
  (0, 3982)	0.19211095684640156
  (0, 3985)	0.2037834454653299
  (0, 7090)	0.2037834454653299
  (0, 153)	0.17215670575417216
  (0, 8687)	0.1073467446215586
  (0, 5902)	0.16048421713524377
  (0, 4891)	0.0917461154258429
  (0, 8760)	0.05389204686933516
  (0, 7150)	0.1547073294028731
  (0, 7965)	0.160
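To see the down-weighting on something small, here is a sketch using the two example sentences from the comments in the cell above. The dictionary `weights` is a helper introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This book is great", "This book was so bad"]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)

# Map each vocabulary word to its TF-IDF weight in the first document
row = vectors[0].toarray()[0]
weights = {word: row[idx] for word, idx in vectorizer.vocabulary_.items()}

# "book" appears in both documents, "great" only in the first,
# so "great" should carry the larger weight
print(weights["book"], weights["great"])
```

Words that do not occur in the first document ("bad", "so", "was") get a weight of zero in that row.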

### Classification

#### Linear SVM

In [114]:
from sklearn import svm

clf_svm = svm.SVC(kernel = 'linear')

clf_svm.fit(train_x_vectors, train_y)

test_x[0]
clf_svm.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Decision Tree

In [115]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [95]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors, train_y)

clf_gnb.predict(test_x_vectors[0])

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
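The error is raised because `GaussianNB` needs a dense array, while the vectorizer returns a sparse matrix. A minimal sketch of the workaround the message suggests, on a toy corpus made up for illustration (not the book reviews):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy corpus and labels, invented for this example
texts = ["great book", "loved this book", "bad book", "boring and bad"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
x_sparse = vectorizer.fit_transform(texts)      # scipy sparse matrix

clf_gnb = GaussianNB()
clf_gnb.fit(x_sparse.toarray(), labels)         # densify before fitting

new_x = vectorizer.transform(["bad boring book"])
prediction = clf_gnb.predict(new_x.toarray())   # densify before predicting too
print(prediction)
```

Converting to a dense array stores every zero explicitly, so with a large vocabulary it can use a lot of memory.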

#### Logistic Regression

In [96]:
from sklearn.linear_model import LogisticRegression

clf_lg = LogisticRegression()
clf_lg.fit(train_x_vectors, train_y)

clf_lg.predict(test_x_vectors[0])


array(['POSITIVE'], dtype='<U8')

## Evaluation

In [116]:
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_lg.score(test_x_vectors, test_y))

0.8076923076923077
0.6346153846153846
0.8100961538461539


In [117]:
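# F1 scores per class, in the order POSITIVE, NEUTRAL, NEGATIVE
# (NEUTRAL has no samples after evenly_distribute(), so its score is 0)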
from sklearn.metrics import f1_score

f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
f1_score(test_y, clf_lg.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])

  average, "true nor predicted", 'F-score is', len(true_sum)


array([0.80963855, 0.        , 0.81055156])
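The zero in the middle belongs to `NEUTRAL`, which has no samples once the data is balanced. One way to avoid it is to pass only the labels that actually occur. A sketch on hand-made labels (the four true and predicted values are assumptions for illustration):

```python
from sklearn.metrics import f1_score

y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

# Only score the classes that are present in the true labels
present = sorted(set(y_true), reverse=True)   # ['POSITIVE', 'NEGATIVE']
scores = f1_score(y_true, y_pred, average=None, labels=present)

print(dict(zip(present, scores)))
```

Here POSITIVE has precision 1 and recall 0.5 (F1 = 2/3), while NEGATIVE has precision 2/3 and recall 1 (F1 = 0.8).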

In [102]:
test_y.count(Sentiment.POSITIVE)

2767

In [101]:
train_y.count(Sentiment.NEUTRAL)

0

In [100]:
test_y.count(Sentiment.NEGATIVE)

208

In [111]:
test_set = ['this is total bulshit', 'well i thought that nothing can be more gloriously stupid that his previous book. but this one just amazed me of how increadibly boring this book can be.', 'wow that was great, instant read']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['NEGATIVE', 'NEGATIVE', 'POSITIVE'], dtype='<U8')

### Tuning our model (with Grid Search)

In [119]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)



GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [120]:
print(clf.score(test_x_vectors, test_y))

0.8197115384615384
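After fitting, the search object also records which combination won. A small sketch on synthetic data standing in for the TF-IDF vectors (the dataset from `make_classification` and the reduced grid are assumptions, chosen to keep the example quick):

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized reviews and their labels
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4)}
search = GridSearchCV(svm.SVC(), parameters, cv=5)
search.fit(X, y)

print(search.best_params_)   # the winning kernel / C combination
print(search.best_score_)    # its mean cross-validated accuracy
```

With the default `refit=True`, calling `predict` or `score` on the search object uses the best estimator retrained on the full training set.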


In [126]:
# you can also try to get rid of punctuation to improve it and use other algorithms instead of bag of words

## Saving Model

In [122]:
import pickle

with open('sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

#### Load Model

In [125]:
# you can just load the model from the pickle and use the already trained model
with open('sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [124]:
print(test_x[0])

loaded_clf.predict(test_x_vectors[0])

This is a no brainer for anyone who loves paranormal romance.  An excellent assortment of authors; I would have made the purchase just for Ava Catori or Selena Kit but I enjoyed every book in the assortment.I am not sure how these box sets are put together other than by genre, but I really liked that there was a consistent infusion of humor amongst these stories.  I have already looked up several of the authors that were new to me because I liked their stories so well.  That is saying a lot from a reader who has a TBR mountain so high that there are clouds circling the top.Without giving a summary of each story I recommend readers begin with Blinded By Magic by Ava Catori and make your way from page one right on through to Vampire Lords of Blacknail: Trinity by Shirl Anders.  There are lots and lots of hours of enjoyable reading here.I was given an ARC of this book for an honest review.


array(['POSITIVE'], dtype='<U8')
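The pickled classifier expects vectorized input, so classifying raw text in a fresh session also needs the fitted vectorizer. One option, which is an assumption here and not something this notebook does, is to wrap both steps in a pipeline and pickle that single object. A sketch on a made-up toy corpus:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus and labels, invented for this example
texts = ["great book", "loved this book", "bad book", "boring and bad"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it
model = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
model.fit(texts, labels)

# Round-trip through pickle (in memory here; a file works the same way)
blob = pickle.dumps(model)
restored = pickle.loads(blob)

sample = ["what a great book"]
print(restored.predict(sample))
```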

## Confusion Matrix

In [127]:
from sklearn.metrics import confusion_matrix
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

# predict on the vectorized test set; passing the raw strings in test_x
# raises "ValueError: could not convert string to float"
y_pred = clf.predict(test_x_vectors)

labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]

# rows are true labels, columns are predicted labels, both in the order of `labels`
cm = confusion_matrix(test_y, y_pred, labels=labels)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)

sn.heatmap(df_cm, annot=True, fmt='d')
plt.show()
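In scikit-learn's `confusion_matrix`, rows correspond to the true labels and columns to the predicted labels, both in the order given by `labels`. A small sketch with hand-made values (not results from the review data) makes the layout easier to read:

```python
from sklearn.metrics import confusion_matrix

y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
labels = ["POSITIVE", "NEGATIVE"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[1 1]   true POSITIVE: 1 predicted POSITIVE, 1 predicted NEGATIVE
#  [0 2]]  true NEGATIVE: 0 predicted POSITIVE, 2 predicted NEGATIVE
```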