In [1]:
import pandas as pd
import json

In [2]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    POSITIVE = "POSITIVE"
    NEUTRAL = "NEUTRAL"


class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # 1-2 stars -> negative, 3 stars -> neutral, 4-5 stars -> positive
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE


class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def evenly_distrubute(self):
        # Balance the classes: keep every negative review and an equal number
        # of positive ones; neutral reviews are dropped. If there are fewer
        # positives than negatives, the result stays unbalanced.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
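
The labelling and balancing logic above can be tried on a handful of made-up reviews. The sketch below is only an illustration: the `toy` data and the `label` helper are hypothetical, and it uses `min()` and a seeded shuffle so that it is repeatable, which the classes above do not do.

```python
import random

# Hypothetical toy data: (text, star rating) pairs, not taken from the dataset.
toy = [("great", 5), ("loved it", 4), ("fine", 3),
       ("bad", 1), ("awful", 2), ("superb", 5)]


def label(score):
    # Same thresholds as Review.get_sentiment.
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"


labelled = [(text, label(score)) for text, score in toy]
negative = [r for r in labelled if r[1] == "NEGATIVE"]
positive = [r for r in labelled if r[1] == "POSITIVE"]

# Downsample both classes to the size of the smaller one; neutral rows are dropped.
n = min(len(negative), len(positive))
balanced = negative[:n] + positive[:n]
random.Random(0).shuffle(balanced)  # seeded only so the sketch is repeatable

counts = {s: sum(1 for _, lab in balanced if lab == s)
          for s in ("NEGATIVE", "POSITIVE", "NEUTRAL")}
print(counts)
```

After balancing, both classes have the same number of reviews and no neutral review remains, which is why a `NEUTRAL` label has no samples later in the notebook.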

In [3]:
file_name = "C:/Users/meher/Books_small_10000.json"
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
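
The loop above treats the file as JSON Lines: one JSON object per line. The sketch below shows the same pattern on an in-memory stand-in for the file. The sample records are hypothetical, and skipping blank lines or records with missing fields is an assumption added here for robustness, not something the cell above does.

```python
import io
import json

# Hypothetical stand-in for the reviews file: one JSON object per line.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '\n'
    '{"overall": 2.0}\n'
    '{"reviewText": "Too slow", "overall": 2.0}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:
        continue  # skip blank lines instead of letting json.loads raise
    row = json.loads(line)
    # .get() avoids a KeyError when a record lacks one of the two fields.
    text, score = row.get("reviewText"), row.get("overall")
    if text is None or score is None:
        continue
    records.append((text, score))

print(records)
```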

In [4]:
reviews[5].text


'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\'s trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\'s.'

In [5]:
len(reviews)

10000

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

In [8]:
train_cont = ReviewContainer(training)
test_cont = ReviewContainer(test)


In [9]:
train_cont.evenly_distrubute()
train_x = train_cont.get_text()
train_y = train_cont.get_sentiment()
test_cont.evenly_distrubute()
test_x = test_cont.get_text()
test_y = test_cont.get_sentiment()

In [10]:
# A notebook cell displays only its last expression, so the output below is the NEGATIVE count
train_y.count(Sentiment.POSITIVE)
train_y.count(Sentiment.NEGATIVE)


436

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vec = vectorizer.fit_transform(train_x)
test_x_vec = vectorizer.transform(test_x)
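
With its default settings, `TfidfVectorizer` weights each term count by `idf = ln((1 + n_docs) / (1 + df)) + 1` and then scales every row to unit L2 length. The sketch below reproduces that arithmetic in plain Python on a hypothetical three-document corpus. It uses a simplified whitespace tokenizer, so it is an illustration of the weighting, not a replacement for the vectorizer.

```python
import math
from collections import Counter

docs = ["good book", "bad book", "good good story"]  # hypothetical toy corpus
tokenized = [d.lower().split() for d in docs]         # simplified tokenizer
vocab = sorted({t for doc in tokenized for t in doc})

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalised row


matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

Terms that occur in fewer documents get a larger `idf`, so a rare word such as "story" counts for more than a common one such as "book".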

In [12]:
print(train_x[0])
print(train_x_vec[0].toarray())


Long before the term behavioural finance there was someone writing about the significance of identity. Long before the witty Buffett-isms, someone wrote those same words as part of his Irregular Rules. And long before Michael Lewis carved out his own position as the Wall Street storyteller du jour, someone else did so with similar eloquent finesse. Today, nearly three months have passed since George J.W. Goodman died on January 3, 2014 - or, as the financial after world knew him by, Adam Smith. A name created for him by the publisher of New York Magazine so as to keep his weekly Wall Street columns anonymous.Rarely have the need to quote from a book been greater, and a good way to start is with Paul Samuelson's front-cover phrase "a modern classic", as it embraces the book beautifully. On one hand the book is eons ahead of its time, crafting the mindset-house that practitioners like Warren Buffett, Peter Lynch and the behavioural finance entourage would furnish. But on the other hand i

In [13]:
from sklearn import svm

In [14]:
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vec,train_y)

In [15]:
# Accuracy on the balanced test set, as a percentage
svm_s = clf_svm.score(test_x_vec, test_y) * 100

In [16]:
test_x[15]

'A chilling look at cyber romance.  Female character a bit flawed, but a winner in the end. Not something to read in the dark!'

In [17]:
clf_svm.predict(test_x_vec[15])

array(['POSITIVE'], dtype='<U8')

In [18]:
train_x_vec

<872x8906 sparse matrix of type '<class 'numpy.float64'>'
	with 53647 stored elements in Compressed Sparse Row format>

In [19]:
import numpy as np

# Predict a label for every review in the test set
predictions = clf_svm.predict(test_x_vec)

# predict() already returns a NumPy array; np.array() just makes that explicit
predictions = np.array(predictions)

# Count how many reviews were predicted as each class
positive_count = np.count_nonzero(predictions == 'POSITIVE')
negative_count = np.count_nonzero(predictions == 'NEGATIVE')


In [20]:
print(f"Number of positive predictions: {positive_count}")
print(f"Number of negative predictions: {negative_count}")


Number of positive predictions: 204
Number of negative predictions: 212


In [21]:
# Pair each test review with the label predicted by the SVM
data = {'Text': test_x, 'Sentiment': predictions}
df = pd.DataFrame(data)

# Map predictions to more readable labels
df['Sentiment'] = df['Sentiment'].map({'POSITIVE': 'Positive', 'NEGATIVE': 'Negative'})

# Display the DataFrame
print(df)


                                                  Text Sentiment
0    I wanted to try a different genre from what I ...  Negative
1    I have been waiting six months for book 3 and ...  Positive
2    All you ever wanted to know about surrendering...  Negative
3    I expected personal stories of women marrying ...  Negative
4    If this novel is to represent contemporary tim...  Negative
..                                                 ...       ...
411  Originally posted on The Canon! [...]The gorge...  Negative
412  If you like short stories or vignettes you are...  Positive
413  Unlike some heroines not to be named here, I c...  Positive
414  This book is twisted in the abuse put forth ag...  Positive
415  It may be a well written book, could use a bet...  Negative

[416 rows x 2 columns]


### DECISION TREE

In [22]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vec , train_y)

In [23]:
clf_s = clf_dec.score(test_x_vec,test_y)*100

In [24]:
test_x[18]

"The non downloading issue was a real bummer. I was really looking forward to reading it. Oh well...clear some more space. I did delete some books, but it still wouldn't fit."

In [25]:
clf_dec.predict(test_x_vec[18])

array(['POSITIVE'], dtype='<U8')

### NAIVE BAYES

In [26]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

# GaussianNB does not accept sparse input, so convert the TF-IDF matrix to a dense array
a = train_x_vec.toarray()

# Fit the Gaussian Naive Bayes classifier
gnb.fit(a, train_y)



In [27]:
# Take one training row and reshape it to (1, n_features), the 2-D shape predict() expects
input_sample = a[12].reshape(1, -1)

# Make prediction on the reshaped input
prediction = gnb.predict(input_sample)

print(prediction)


['NEGATIVE']


In [28]:
gnb_s = gnb.score(test_x_vec.toarray(), test_y) * 100


### LOGISTIC REGRESSION

In [29]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(train_x_vec, train_y)


In [30]:
logreg_s = logreg.score(test_x_vec, test_y)*100


In [31]:
logreg.predict(test_x_vec[0])

array(['NEGATIVE'], dtype='<U8')

### EVALUATION

In [32]:
print('svm score : ' , svm_s)
print('logistic score : ' , logreg_s)
print('gnb score : ' , gnb_s)
print('decision tree score : ' , clf_s)

svm score :  80.76923076923077
logistic score :  80.52884615384616
gnb score :  66.10576923076923
decision tree score :  62.74038461538461


In [33]:
from sklearn.metrics import f1_score

In [48]:
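# SVM F1 score per class. NEUTRAL reviews were dropped by evenly_distrubute(),
# so that label has no samples and its F1 is reported as 0 (with a warning).

f1_score(test_y, clf_svm.predict(test_x_vec), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])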

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


array([0.80582524, 0.        , 0.80952381])

In [35]:
# Decision tree F1 score per class (POSITIVE, NEUTRAL, NEGATIVE)

f1_score(test_y, clf_dec.predict(test_x_vec), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


array([0.61538462, 0.        , 0.63869464])

In [36]:
# Logistic regression F1 score per class (POSITIVE, NEUTRAL, NEGATIVE)

f1_score(test_y, logreg.predict(test_x_vec), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


array([0.80291971, 0.        , 0.80760095])
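
Each entry in the arrays above is the F1 score of one class: the harmonic mean of that class's precision and recall. The sketch below computes it by hand on hypothetical labels. The `f1_per_class` helper is an illustration, not scikit-learn's implementation, and it returns 0 for a label that never occurs, which mirrors the zero reported for `NEUTRAL` above.

```python
def f1_per_class(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: F1 is 0, including the undefined case where
        # the label is never true and never predicted.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

scores = {lab: f1_per_class(y_true, y_pred, lab)
          for lab in ("POSITIVE", "NEUTRAL", "NEGATIVE")}
print(scores)
```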

In [37]:
train_x[0]

'Long before the term behavioural finance there was someone writing about the significance of identity. Long before the witty Buffett-isms, someone wrote those same words as part of his Irregular Rules. And long before Michael Lewis carved out his own position as the Wall Street storyteller du jour, someone else did so with similar eloquent finesse. Today, nearly three months have passed since George J.W. Goodman died on January 3, 2014 - or, as the financial after world knew him by, Adam Smith. A name created for him by the publisher of New York Magazine so as to keep his weekly Wall Street columns anonymous.Rarely have the need to quote from a book been greater, and a good way to start is with Paul Samuelson\'s front-cover phrase "a modern classic", as it embraces the book beautifully. On one hand the book is eons ahead of its time, crafting the mindset-house that practitioners like Warren Buffett, Peter Lynch and the behavioural finance entourage would furnish. But on the other hand

In [38]:
test_y.count(Sentiment.NEGATIVE)

208

In [39]:
test_y[4]

'NEGATIVE'

### Tuning our model (with grid search)

In [47]:
from sklearn.model_selection import GridSearchCV

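# Try both kernels with several values of C, scoring each combination with 5-fold cross-validation
parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vec, train_y)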
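
Once fitted, a `GridSearchCV` object exposes the winning combination through `best_params_`, its mean cross-validated score through `best_score_`, and the refitted model through `best_estimator_`. The sketch below shows how these might be inspected. It runs on a hypothetical toy corpus with a smaller grid and `cv=2`, so its numbers say nothing about the book reviews. Vectorizing before cross-validation, as done here and in the notebook, is a simplification.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, for illustration only.
texts = [
    "loved this wonderful story", "great characters and great plot",
    "a wonderful and moving book", "loved the great ending",
    "boring plot and flat characters", "terrible writing and boring story",
    "awful ending and terrible pacing", "a boring and awful book",
]
labels = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

X = TfidfVectorizer().fit_transform(texts)

# Smaller grid and fewer folds than in the notebook, to suit the tiny corpus.
search = GridSearchCV(svm.SVC(), {"kernel": ("linear", "rbf"), "C": (1, 4)}, cv=2)
search.fit(X, labels)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best parameters:", search.best_score_)
print("refitted model:", search.best_estimator_)
```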


In [51]:
import pickle

with open('./sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)


In [52]:
with open('sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)
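
The pickled object holds only the classifier, so classifying new raw text after loading also needs the fitted vectorizer. One possible way to keep the two together, which is not what this notebook does, is to wrap them in a scikit-learn `Pipeline` and pickle that instead. The sketch below assumes a hypothetical toy corpus and serializes to memory with `pickle.dumps` rather than to a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, for illustration only.
texts = [
    "great story loved it", "wonderful characters great ending",
    "terrible plot boring", "boring and awful writing",
]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)

# Round-trip through pickle; the vectorizer's vocabulary travels with the model.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

new_text = ["a great and wonderful book"]
prediction = restored.predict(new_text)  # raw strings go straight in
print(prediction)
```

As with any pickle, the file should only be loaded from a trusted source and with a compatible scikit-learn version.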


In [54]:
print(test_x[0])

loaded_clf.predict(test_x_vec[0])


I wanted to try a different genre from what I normally read. So I decided to try this christian romance novel. I chose this book on its many good reviews. I found it too predictable, and there are lots of unnecessary scenes that slowed the story down. Too much scripture for one thing. I liked Lilly; she's a strong woman. The thing that really bothered me and seemed so out of character for Lilly was her hot-over-heels love for Paul. But then--suddenly--without enough story or explanation, she's google-eyed over Tern after looking into his eyes. I thought maybe I'd accidentally jumped ahead too far in my Kindle, so I went back. But no I hadn't. And I was disappointed in the ending. Again, the crucial part of the story was rushed. Too much of what the characters wore, etc. I would've liked more meat of the story itself. (I don't want to give too much of the story away.) Overall, I've got to admit this genre is not for me.


array(['NEGATIVE'], dtype='<U8')