## Data Class

In [None]:
# as data gets messier it gets difficult to go through code so lets make a class 
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
         

In [47]:
%pwd

'G:\\code\\sklearn_tuts'

## Load Data

In [78]:
import json

file_name = 'G:\\code\\sklearn_tuts\Scikit_tutorial\data\sentiment\Books_small.json'

reviews = []

with open(file_name) as f:
    for line in f:
        review  = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall'])) #note how we use class to pass an object over here
        
reviews[4].text  #way to add up see list and then later in depth tuple


'With the government knowing this could happen and not telling anyone then this is a good story of what would happen if an emp attack happened.'

Text data is not good for machine learning models, it majorly take data in matrices, so we convert string to text data, like for every word appearing make it a key of dictionary and use 1 and 0 for every thing sentance appearing

## Prep Data

In [79]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

In [80]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

47
47


In [81]:
len(Training)

6700

In [82]:
len(Testing)

3300

In [83]:
Training[1] #Output after class being made is a reference

<__main__.Review at 0x1ce441b3e10>

In [84]:
Training[1].sentiment

'POSITIVE'

Always understand what are your input and outputs are

In [85]:
train_x = [Text.text for Text in Training]
train_y = [Text.sentiment for Text in Training]

test_x = [Text.text for Text in Testing]
test_y = [Text.sentiment for Text in Testing]

In [86]:
train_x[5]

'I enoyed reading Stormy Montana Sky by Debra Holland, a part of the Montana Sky Series. In fact, the book was so good I am now reading another book in the series and have the third waiting on my kindle.'

In [87]:
train_y[5]

'POSITIVE'

## Bag of word Vectorisation

In [88]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer()

train_x_vectors = vectorizer.fit_transform(train_x) #earlier it was fit and trasform seperately

test_x_vectors = vectorizer.fit_transform(test_x)

print(train_x[5])
print(train_x_vector[5]) #now a vector
print(train_x_vector[5].toarray())




I enoyed reading Stormy Montana Sky by Debra Holland, a part of the Montana Sky Series. In fact, the book was so good I am now reading another book in the series and have the third waiting on my kindle.
  (0, 23767)	4
  (0, 16595)	1
  (0, 16683)	1
  (0, 11974)	2
  (0, 1326)	1
  (0, 25742)	1
  (0, 15895)	1
  (0, 21943)	1
  (0, 10960)	1
  (0, 21105)	2
  (0, 3137)	2
  (0, 8133)	1
  (0, 19160)	2
  (0, 22703)	1
  (0, 15625)	2
  (0, 21705)	2
  (0, 3706)	1
  (0, 6261)	1
  (0, 11387)	1
  (0, 17233)	1
  (0, 8750)	1
  (0, 10283)	1
  (0, 1206)	1
  (0, 16417)	1
  (0, 1439)	1
  (0, 23874)	1
  (0, 25654)	1
  (0, 13352)	1
[[0 0 0 ... 0 0 0]]


## Classification

There are lot of classifiers in sklearn.

Which one to choose??

Go through various resourses like mit lecture and other stuff to know better what is what

A lot of time its about simply testing and knowing which is better

#### Linear SVM

In [89]:
from sklearn import svm
clf_svm = svm.SVC(kernel = 'linear')
clf_svm.fit(train_x_vector, train_y)

clf_svm.predict(train_x_vector[0])

array(['POSITIVE'], dtype='<U8')

#### Decision Tree

In [90]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(train_x_vector[0])

array(['POSITIVE'], dtype='<U8')

#### Naive Bayes

In [91]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(train_x_vector, train_y)

clf_gnb.predict(train_x_vector[0])

array(['POSITIVE'], dtype='<U8')

#### Logistic Regression

In [92]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(train_x_vector[0])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array(['POSITIVE'], dtype='<U8')

## Evaluation

In [93]:
#mean accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))
# i dont know why is this not working as i havent have any difference in training data and testing data



ValueError: X has 18643 features, but SVC is expecting 26615 features as input.

In [94]:
# F1 Scores
from sklearn.metrics import f1_score

f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])
#f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])


ValueError: X has 18643 features, but SVC is expecting 26615 features as input.

In [102]:
test_set = ['very fun', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

ValueError: X has 18643 features, but SVC is expecting 26615 features as input.

## Tuning our Model (with Grid Search)

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()

clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)


## Saving model

#### Save

In [None]:
import pickle

with open('./models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)
