# Amazon Books Sentiment Analysis

## Sentiment and Review Class

In [1]:
class Sentiment:
    #Sentiment labels used as classification targets
    POSITIVE = 'POSITIVE'
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        #Conditions for assigning sentiments
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            #Score of 4 or 5
            return Sentiment.POSITIVE


## Load Data From JSON File

In [2]:
import json

file_name = 'Books_10000.json'

#Load file items into review list
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))

reviews[30].text

"My Reflections:The well developed characters, the slow steady build all work together to deliver a tidy little package where mystery and history entwine.I loved the idea of a story centering on the decision, Franklyn Roosevelt devised to help people destroyed by the depression of 1935. His idea was to send families to a remote area of Alaska to colonies and grow the Matanuska Valley. The really superb thing about this book is these two authors use real and fictional characters to develop their narrative.Dr Jeremiah Vaughan's life is destroyed by allegations of abuse. When he uses a ground-breaking IV sedation technique with an influential patient, and the patient dies, the authorities are out for blood. This causes his license to be stripped away. Because of this, his intended and her mother want nothing to do with the shame. A has-been doctor is not what a high society woman wants on her arm. Fleeing from the hurt and rejection, from not only his fiance but also his own parents Jerem

In [3]:
reviews[30].score

3.0

In [4]:
reviews[30].sentiment

'NEUTRAL'
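
The score thresholds in `Review.get_sentiment` can be checked in isolation. The sketch below restates the same rule as a standalone function; `sentiment_for` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
def sentiment_for(score):
    # Same thresholds as Review.get_sentiment:
    # 2 or lower is negative, exactly 3 is neutral, anything else is positive.
    if score <= 2:
        return 'NEGATIVE'
    elif score == 3:
        return 'NEUTRAL'
    return 'POSITIVE'

# Map every whole-star rating to its label.
labels = {score: sentiment_for(score) for score in [1.0, 2.0, 3.0, 4.0, 5.0]}
print(labels)
```

One consequence of the final `else` branch is that any score other than 3 that is above 2 (for example a fractional 2.5, should the data ever contain one) would be labelled POSITIVE.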

## Review Container Class

In [5]:
import random

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def distribute(self):
        #Drop neutral reviews and trim the positives to the number of negatives
        #so that neither sentiment is favoured during training
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        shrunk_positive = positive[:len(negative)]
        self.reviews = shrunk_positive + negative
        random.shuffle(self.reviews)
        

## Data Preparation

In [6]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

#Wrapping the training and test data in ReviewContainer objects
train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

In [7]:
train_container.distribute()

#Split training data into x (features) and y (targets)
train_x = train_container.get_text()
train_y = train_container.get_sentiment()


test_container.distribute()

#Split test data into x (features) and y (targets)
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436
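
`distribute` trims the positive reviews to the number of negative ones by slicing, which only balances the classes when positives are at least as numerous as negatives (as they are here). A minimal sketch of a more general variant follows: it downsamples every class to the size of the smallest one and picks samples at random rather than taking the first ones. The function name `balance`, the `seed` parameter, and the toy labels are assumptions made for illustration.

```python
import random

def balance(labels, seed=0):
    # Group sample indices by their label.
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    # Keep as many samples per class as the smallest class contains.
    smallest = min(len(indices) for indices in by_label.values())
    rng = random.Random(seed)
    kept = []
    for indices in by_label.values():
        kept.extend(rng.sample(indices, smallest))
    # Shuffle so the classes are not grouped together.
    rng.shuffle(kept)
    return kept

# Toy data: 7 positive and 3 negative labels.
labels = ['POSITIVE'] * 7 + ['NEGATIVE'] * 3
kept = balance(labels)
counts = {label: sum(labels[i] == label for i in kept) for label in sorted(set(labels))}
print(counts)
```

The function returns indices rather than objects, so the same selection can be applied to the texts and the labels alike.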


## Bag of Words Vectorization (TF-IDF)

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())

What kind of a society have we become that casually using the F-word in our stories is now mainstream? I hate it! It is a vulgar, nasty word. This is the 4th book this week I have started to read that casually uses the word. Why? For dramatic flair? To show your readers your unlimited vocabulary? To me, it just proves that the author wasn't able to come up with something more intelligent. I now hear foul language in the grocery store, in restaurants, walking down the streets, in our schools, it's everywhere! Why can't I sit down and enjoy a good story without it? We are bombarded with it everyday. Please keep it out of books! Please.
[[0. 0. 0. ... 0. 0. 0.]]
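
The printed vector is almost entirely zeros because each review uses only a small part of the vocabulary. To make the non-zero weights less opaque, here is a hand-rolled sketch of TF-IDF weighting on a made-up three-document corpus. It assumes the commonly documented `TfidfVectorizer` defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 row normalisation) and uses plain whitespace splitting instead of the vectorizer's own tokenizer.

```python
import math

# Toy corpus, already lowercased; tokens are split on whitespace.
docs = [
    "good book good story",
    "bad book",
    "good story",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    # Raw term count times idf, then L2-normalise the row.
    weights = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for term, weight in zip(vocabulary, vectors[1]):
    print(f"{term}: {weight:.3f}")
```

In the second document, "bad" receives a higher weight than "book" because it appears in fewer documents, and terms absent from the document stay at zero.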


## Classification

In [9]:
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [10]:
#One scikit-learn pipeline per classifier
pipelines = {
    'rf': make_pipeline(RandomForestClassifier(random_state=42)),
    'svm': make_pipeline(svm.SVC(random_state=42)),
    'lr': make_pipeline(LogisticRegression(random_state=42)),
}

In [11]:
RandomForestClassifier().get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [12]:
svm.SVC().get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [13]:
LogisticRegression().get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [14]:
#Hyperparameter grids per pipeline; keys follow the <step name>__<parameter> convention
hyparamgrid = {
    'rf': {
        'randomforestclassifier__min_samples_split':[2,4,6],
        'randomforestclassifier__min_samples_leaf':[1,2,3]
    },
    'svm':{
        'svc__kernel':['linear', 'rbf'],
        'svc__C':[1,4,8,16,32]
    },
    'lr':{
        'logisticregression__solver':['liblinear','lbfgs'],
        'logisticregression__C':[1,4,8,16,32]
    }
}
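
A grid search tries every combination of the listed values, so the size of each grid determines the training cost. The sketch below enumerates the combinations with `itertools.product`; the grid is copied from the cell above, and `cv_folds = 10` is taken from the `cv=10` argument used in the training cell.

```python
from itertools import product

# Copy of the hyperparameter grids defined above.
hyparamgrid = {
    'rf': {
        'randomforestclassifier__min_samples_split': [2, 4, 6],
        'randomforestclassifier__min_samples_leaf': [1, 2, 3],
    },
    'svm': {
        'svc__kernel': ['linear', 'rbf'],
        'svc__C': [1, 4, 8, 16, 32],
    },
    'lr': {
        'logisticregression__solver': ['liblinear', 'lbfgs'],
        'logisticregression__C': [1, 4, 8, 16, 32],
    },
}
cv_folds = 10  # number of cross-validation folds per candidate

counts = {}
for algo, grid in hyparamgrid.items():
    names = list(grid)
    # One dict per combination of parameter values.
    candidates = [dict(zip(names, values)) for values in product(*grid.values())]
    counts[algo] = len(candidates)
    print(algo, len(candidates), 'candidates,', len(candidates) * cv_folds, 'cross-validation fits')
```

This gives 9 candidates for the random forest and 10 each for the SVM and logistic regression, so roughly 90 to 100 cross-validation fits per algorithm before the best candidate is refit on the full training set.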

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.exceptions import NotFittedError

In [16]:
#Dictionary to store fitted models 
fit_models = {}

#Loop through the pipelines and tune hyperparameters with GridSearchCV
for algo, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, hyparamgrid[algo], cv=10, n_jobs=-1)
    #Logging results while training models
    try:
        print('Starting training for {}.'.format(algo))
        model.fit(train_x_vectors, train_y)
        fit_models[algo] = model
        print('{} has been successfully fit.'.format(algo))
    except NotFittedError as e:
        print(repr(e))

Starting training for rf.
rf has been successfully fit.
Starting training for svm.
svm has been successfully fit.
Starting training for lr.
lr has been successfully fit.
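
With an integer `cv` and a classifier, `GridSearchCV` splits the training data into stratified folds, so each of the 10 folds keeps roughly the same class balance as the full set. The sketch below is a simplified round-robin stand-in for that idea, not scikit-learn's own splitting algorithm; the labels are synthetic and only reproduce the 436/436 class counts printed earlier.

```python
# Synthetic labels with the same class counts as the balanced training set.
labels = ['POSITIVE'] * 436 + ['NEGATIVE'] * 436
n_folds = 10

folds = [[] for _ in range(n_folds)]
for label in sorted(set(labels)):
    indices = [i for i, y in enumerate(labels) if y == label]
    # Deal this class's samples out to the folds one at a time.
    for position, index in enumerate(indices):
        folds[position % n_folds].append(index)

fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```

Each fold ends up with 43 or 44 samples of each class, so every candidate is validated on data that is as balanced as the data it was trained on.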


### Making Some Predictions

In [17]:
#A few hand-written example reviews to classify
testing_set = ['very fun', 'bad book do not buy', 'horrible waste of time']
new_test = vectorizer.transform(testing_set)

In [18]:
fit_models['svm'].predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

In [19]:
fit_models['rf'].predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

In [20]:
fit_models['lr'].predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

## Evaluation

In [26]:
from sklearn import metrics

#Evaluate each fitted model's accuracy on the balanced test set
for algo, model in fit_models.items():
    yhat = model.predict(test_x_vectors)
    print('{} scores - Accuracy: {}'.format(algo, metrics.accuracy_score(test_y, yhat)))

rf scores - Accuracy: 0.7764423076923077
svm scores - Accuracy: 0.8197115384615384
lr scores - Accuracy: 0.8197115384615384


### Random Forest Classification Report and Confusion Matrix

In [28]:
rf_predict_test = fit_models['rf'].predict(test_x_vectors)

print('Confusion Matrix')
print(metrics.confusion_matrix(test_y, rf_predict_test))
print('')

print('Classification Report')
print(metrics.classification_report(test_y, rf_predict_test))

Confusion Matrix
[[177  31]
 [ 62 146]]

Classification Report
              precision    recall  f1-score   support

    NEGATIVE       0.74      0.85      0.79       208
    POSITIVE       0.82      0.70      0.76       208

    accuracy                           0.78       416
   macro avg       0.78      0.78      0.78       416
weighted avg       0.78      0.78      0.78       416
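
The per-class figures in the report can be derived directly from the confusion matrix. The sketch below recomputes precision, recall and F1 from the Random Forest matrix printed above, reading rows as true labels and columns as predicted labels in the order NEGATIVE, POSITIVE.

```python
# Random Forest confusion matrix from the output above.
# Rows are true labels, columns are predicted labels (NEGATIVE, POSITIVE).
matrix = [[177, 31],
          [62, 146]]
labels = ['NEGATIVE', 'POSITIVE']

report = {}
for i, label in enumerate(labels):
    correct = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column total: predicted as this class
    actual = sum(matrix[i])                    # row total: truly this class
    precision = correct / predicted
    recall = correct / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(report)
```

The rounded values match the report: the model recovers 85% of the negative reviews but only 70% of the positive ones, because 62 positive reviews were predicted as negative.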



### Support Vector Machine Classification Report and Confusion Matrix

In [30]:
svm_predict_test = fit_models['svm'].predict(test_x_vectors)

print('Confusion Matrix')
print(metrics.confusion_matrix(test_y, svm_predict_test))
print('')

print('Classification Report')
print(metrics.classification_report(test_y, svm_predict_test))

Confusion Matrix
[[167  41]
 [ 34 174]]

Classification Report
              precision    recall  f1-score   support

    NEGATIVE       0.83      0.80      0.82       208
    POSITIVE       0.81      0.84      0.82       208

    accuracy                           0.82       416
   macro avg       0.82      0.82      0.82       416
weighted avg       0.82      0.82      0.82       416



### Logistic Regression Classification Report and Confusion Matrix

In [31]:
lr_predict_test = fit_models['lr'].predict(test_x_vectors)

print('Confusion Matrix')
print(metrics.confusion_matrix(test_y, lr_predict_test))
print('')

print('Classification Report')
print(metrics.classification_report(test_y, lr_predict_test))

Confusion Matrix
[[166  42]
 [ 33 175]]

Classification Report
              precision    recall  f1-score   support

    NEGATIVE       0.83      0.80      0.82       208
    POSITIVE       0.81      0.84      0.82       208

    accuracy                           0.82       416
   macro avg       0.82      0.82      0.82       416
weighted avg       0.82      0.82      0.82       416
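
Accuracy is the share of predictions on the diagonal of the confusion matrix. As a quick cross-check, the sketch below recomputes it for all three models from the matrices printed in this section.

```python
# Confusion matrices copied from the outputs above.
matrices = {
    'rf': [[177, 31], [62, 146]],
    'svm': [[167, 41], [34, 174]],
    'lr': [[166, 42], [33, 175]],
}

accuracy = {}
for algo, matrix in matrices.items():
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
    total = sum(sum(row) for row in matrix)
    accuracy[algo] = correct / total
    print(f'{algo}: {correct}/{total} = {accuracy[algo]:.4f}')
```

The SVM and logistic regression models both classify 341 of the 416 test reviews correctly, which is why their accuracy scores are identical even though their confusion matrices differ by one review in each row. Random Forest gets 323 correct.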



#### The Support Vector Machine and Logistic Regression models performed almost identically (accuracy of about 0.82 each), while Random Forest performed somewhat worse (about 0.78).