## Data Class

In [32]:
# Install scikit-learn
!pip install scikit-learn





In [158]:
# Import Libraries
import random
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


#Create classes to be used
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
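    def evenly_distribute(self):
        # Balance the classes by downsampling POSITIVE to the NEGATIVE count.
        # NEUTRAL reviews are dropped; this assumes there are at least as many
        # POSITIVE reviews as NEGATIVE ones.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)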
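
The balancing step can also be sketched outside the class. The helper below is hypothetical (`balance_labels` is not part of the notebook) and differs from `evenly_distribute` in two ways: it downsamples every class present to the size of the smallest one, and it samples at random instead of keeping the first reviews in the list.

```python
import random

def balance_labels(items, label_of, seed=0):
    """Downsample every class to the size of the smallest one (hypothetical helper)."""
    by_label = {}
    for item in items:
        by_label.setdefault(label_of(item), []).append(item)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        # Random sample without replacement, so no review is duplicated
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data: five positive and two negative (text, label) pairs
toy = [("good", "POSITIVE")] * 5 + [("bad", "NEGATIVE")] * 2
balanced = balance_labels(toy, label_of=lambda pair: pair[1])
labels = [label for _, label in balanced]
```

To reproduce the notebook's behavior of discarding NEUTRAL reviews, the input would need to be filtered before calling the helper.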

## Load Data

In [134]:
# Create variable for the data to load
file_name = './data/sentiment/books_small_10000.json'

# Create empty list and append the data from the file
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[5].text

'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\'s trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\'s.'
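
The loading loop treats the file as JSON Lines: one JSON object per line. A minimal sketch of the same parsing, using an in-memory file with two made-up records (only the field names `reviewText` and `overall` come from the notebook):

```python
import io
import json

# Two made-up records using the same field names as the notebook's data file
sample_file = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

pairs = []
for line in sample_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines instead of failing in json.loads
    record = json.loads(line)
    pairs.append((record["reviewText"], record["overall"]))
```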

## Prep Data

In [135]:
# How many rows
len(reviews)

10000

In [136]:
# Split the reviews into a training set (67%) and a test set (33%)
# The X (text) and y (sentiment) variables are extracted from each set further below
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

In [137]:
# How many rows after split
len(training)
#len(test)

6700
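
The 6700 above follows from the split arithmetic. The sketch below assumes the test count is rounded up and the remainder goes to training, which matches the row count shown. The hand-rolled shuffle is only an illustration and will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

n_samples = 10000
test_size = 0.33

# Test rows are rounded up; the remaining rows form the training set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Illustrative shuffle-and-slice split over row indices (not sklearn's shuffle)
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```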

In [138]:
# display first text
print(training[0].text)
# display first sentiment
print(training[0].sentiment)

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an

In [139]:
# Training variables
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()
# Display results
print(train_x[0])
print(train_y[0])

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

More and more Wolfe seems derivative, or perhaps he was all along.  "I Am Charlotte Simmons" reads like he thought he could out-JC-Oates JC Oates.  Yet it lacks the soul of the female lead that JCO renders in such vivid hues (for example, Anellia in "I'll Take You There").Now "Back to Blood" would seem to be Wolfe's riff on Carl Hiaasen.  He must have thought he could do it better.  But Wolfe does not come close to the Miami Herald reporter's assiduously plotted romps through the land- and cityscape of rogues, crooked politicos and good (if unbalanced) guys who hunger for justice.Where Hiaasen's work rarely contains a wasted word, let alone plot twist, Wolfe delivers Miami's elite and vile mixing it up in a rambling foray some 700 pages long (I hope Little, Brown recouped their $10,000 per page investment).  The story is told primarily through the eyes of two Cubanos desperate to escape their home town Haileah, though for different reasons. It is punctuated with many lurid sex scenes i

In [140]:
# Test variables
test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()
# Display results
print(test_x[0])
print(test_y[0])

Hard to put this book down. I found I really cared about the characters and was drawn into the story.
POSITIVE


## Bag of words vectorization

In [159]:
# Vectorize the training text (bag of words)

# Example sentences that a bag-of-words model turns into word vectors:
# This book is great !
# This book was so bad

#vectorizer = CountVectorizer() # Raw word counts
vectorizer = TfidfVectorizer() # Counts reweighted by TF-IDF, which downweights words common to many reviews
train_x_vectors = vectorizer.fit_transform(train_x) # Learn the vocabulary and vectorize the training data in one step
# The line above is equivalent to the following two lines:
# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x)

# Only transform the test data; the vocabulary comes from the training data
test_x_vectors = vectorizer.transform(test_x)
print(train_x[0])
print(train_x_vectors[0])

More and more Wolfe seems derivative, or perhaps he was all along.  "I Am Charlotte Simmons" reads like he thought he could out-JC-Oates JC Oates.  Yet it lacks the soul of the female lead that JCO renders in such vivid hues (for example, Anellia in "I'll Take You There").Now "Back to Blood" would seem to be Wolfe's riff on Carl Hiaasen.  He must have thought he could do it better.  But Wolfe does not come close to the Miami Herald reporter's assiduously plotted romps through the land- and cityscape of rogues, crooked politicos and good (if unbalanced) guys who hunger for justice.Where Hiaasen's work rarely contains a wasted word, let alone plot twist, Wolfe delivers Miami's elite and vile mixing it up in a rambling foray some 700 pages long (I hope Little, Brown recouped their $10,000 per page investment).  The story is told primarily through the eyes of two Cubanos desperate to escape their home town Haileah, though for different reasons. It is punctuated with many lurid sex scenes i
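
To make the bag-of-words idea concrete, here is a hand-rolled sketch on the two example sentences from the cell above. It is not scikit-learn's implementation: the tokenizer only imitates the default behavior of lowercasing and keeping tokens of two or more word characters, and the IDF line uses the smoothed formula `ln((1 + n) / (1 + df)) + 1` without the final vector normalization.

```python
import math
import re
from collections import Counter

docs = ["This book is great !", "This book was so bad"]

def tokenize(text):
    # Lowercase and keep tokens of at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary word
count_vectors = [[Counter(toks)[word] for word in vocabulary] for toks in tokenized]

# Smoothed inverse document frequency: words found in every document get weight 1.0
n_docs = len(docs)
doc_freq = {word: sum(word in toks for toks in tokenized) for word in vocabulary}
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}
```

Words shared by both sentences ("this", "book") end up with the lowest weight, while the words that distinguish them ("great", "bad") are weighted higher.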

## Classification

#### Linear SVM

In [160]:
# Import svm library
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x[0]

clf_svm.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Decision Tree

In [161]:
# Import Library
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [144]:
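# Import Library
from sklearn.naive_bayes import GaussianNB

# GaussianNB requires dense input. The TF-IDF vectors are a sparse matrix,
# so the fit call below raises the TypeError shown in the output.
# Converting with train_x_vectors.toarray() avoids it.
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors, train_y)

clf_gnb.predict(test_x_vectors[0])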

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
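
The error message itself suggests the fix: convert the sparse matrix to a dense array before passing it to `GaussianNB`. A minimal sketch on made-up toy data (the `toy_*` names are illustrative, not from the notebook) might look like this. Note that a dense array of the full training set can use a lot of memory when the vocabulary is large.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up toy data standing in for train_x / train_y
toy_texts = ["great story", "loved this book", "terrible plot", "waste of time"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_texts)  # sparse matrix

toy_gnb = GaussianNB()
toy_gnb.fit(toy_vectors.toarray(), toy_labels)  # dense array, so no TypeError

# New text must be transformed and converted to dense the same way
toy_pred = toy_gnb.predict(toy_vectorizer.transform(["great book"]).toarray())
```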

#### Logistic Regression

In [162]:
# Import Library
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Evaluation

In [163]:
# Compare the classifiers on the test set
# (Naive Bayes is omitted because it failed to fit above)
# Mean accuracy
print(clf_svm.score(test_x_vectors, test_y)) # Linear SVM
print(clf_dec.score(test_x_vectors, test_y)) # Decision Tree
print(clf_log.score(test_x_vectors, test_y)) # Logistic Regression

0.8076923076923077
0.6947115384615384
0.8052884615384616


In [164]:
# F1 Score
# Import Library
from sklearn.metrics import f1_score
# Linear SVM, Decision Tree, Logistic Regression
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

[0.80582524 0.80952381]
[0.69099757 0.69833729]
[0.80291971 0.80760095]
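
With `average=None`, `f1_score` returns one F1 value per label, in the order given by `labels`. The per-class calculation can be sketched by hand as follows; `f1_per_class` is a hypothetical helper and the four labels are made up for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall (hypothetical helper)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: one POSITIVE review is misclassified as NEGATIVE
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

f1_pos = f1_per_class(y_true, y_pred, "POSITIVE")  # precision 1.0, recall 0.5
f1_neg = f1_per_class(y_true, y_pred, "NEGATIVE")  # precision 2/3, recall 1.0
```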


In [149]:
# Check the class balance: count the POSITIVE labels in the training data
train_y.count(Sentiment.POSITIVE)

436

In [165]:
# Test the quality using the Linear SVM model
test_set = ['very fun', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

In [166]:
# Test the quality using the Decision Tree model
test_set = ['not great', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_dec.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'POSITIVE'], dtype='<U8')
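
One plausible reason "not great" is hard for these models is that single-word features treat "not" and "great" independently, so the negation is lost. The sketch below builds unigrams and bigrams by hand to show the difference; `ngrams` is a hypothetical helper. `TfidfVectorizer` accepts an `ngram_range=(1, 2)` argument for the same purpose, though whether that changes the predictions here is untested.

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max (hypothetical helper)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

unigrams = ngrams("not great", 1, 1)      # the negation is split into separate words
with_bigrams = ngrams("not great", 1, 2)  # "not great" survives as its own feature
```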

In [167]:
# Test the quality using the Logistic Regression model
test_set = ['very fun', "great book to buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_log.predict(new_test)

array(['POSITIVE', 'POSITIVE', 'NEGATIVE'], dtype='<U8')

#### Tuning our model (with Grid Search)

In [169]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
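
The grid above expands to every combination of `kernel` and `C`, and each combination is cross-validated on 5 folds. A short sketch of that bookkeeping, using only the parameter grid from the cell above:

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}

# Expand the grid into one dict per candidate setting
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

cv_folds = 5
total_fits = len(candidates) * cv_folds  # excludes the final refit on all training data
```

After fitting, `clf.best_params_` and `clf.best_score_` report which combination won and its mean cross-validated score.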

In [171]:
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y)) # Linear SVM
print(clf_dec.score(test_x_vectors, test_y)) # Decision Tree
print(clf_log.score(test_x_vectors, test_y)) # Logistic Regression
print(clf.score(test_x_vectors, test_y)) # Tuned model w/ GridSearch

0.8076923076923077
0.6947115384615384
0.8052884615384616
0.8197115384615384


## Saving Model

In [180]:
# Import Library
import pickle

with open('./models/sentiment_classifier_EP.pkl', 'wb') as f:
    pickle.dump(clf, f)
    
# Save Vectorizer
with open('./models/sentiment_vectorizer_EP.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

## Load Model

In [181]:
with open('./models/sentiment_classifier_EP.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

# Load Vectorizer
with open('./models/sentiment_vectorizer_EP.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
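
The save and load cells form a round trip: `pickle.dump` serializes the object and `pickle.load` rebuilds an equal copy. A minimal sketch using an in-memory buffer and a plain dict as a stand-in for the fitted model:

```python
import io
import pickle

# Stand-in object; the notebook pickles the fitted GridSearchCV and the vectorizer
stand_in_model = {"kernel": "linear", "C": 4}

buffer = io.BytesIO()  # in-memory replacement for the .pkl file
pickle.dump(stand_in_model, buffer)

buffer.seek(0)  # rewind before reading, like reopening the file in 'rb' mode
restored = pickle.load(buffer)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. A pickled scikit-learn model is also best loaded with the same library version that saved it.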

In [179]:
print(test_x[0])

loaded_clf.predict(test_x_vectors[0])

Hard to put this book down. I found I really cared about the characters and was drawn into the story.


array(['POSITIVE'], dtype='<U8')