# Machine Learning
### Machine Learning process

1. Decide what question we are trying to answer.
2. Find data to answer the question.
3. Process the data.
4. Build a model.
5. Evaluate the model.
6. Improve the model further.

Scikit-learn is used for data mining and data analysis.
It is built on NumPy, SciPy, and matplotlib.

### Question: Classifying comments as either positive or negative.

### Data Class

In [142]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"



class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE
        
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
        
    def evenly_distribute(self):
        # Balance the classes: drop neutral reviews and keep only as many
        # positives as there are negatives (assumes positives >= negatives).
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        
        random.shuffle(self.reviews)
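
`evenly_distribute` trims the positive reviews to the number of negatives, so it only balances the data when positives outnumber negatives. The following is a minimal sketch of the same downsampling idea that works whichever class is larger. The `balance` helper and the toy `(text, label)` tuples are hypothetical and not part of the notebook.

```python
import random

def balance(items, label_of, labels=("NEGATIVE", "POSITIVE"), seed=None):
    """Downsample every listed label to the size of the smallest one."""
    groups = {lab: [x for x in items if label_of(x) == lab] for lab in labels}
    smallest = min(len(group) for group in groups.values())
    balanced = [x for group in groups.values() for x in group[:smallest]]
    # Shuffle so the result is not all negatives followed by all positives.
    random.Random(seed).shuffle(balanced)
    return balanced

# Made-up data: one negative, one neutral, three positives.
toy = [("bad", "NEGATIVE"), ("meh", "NEUTRAL"), ("good", "POSITIVE"),
       ("great", "POSITIVE"), ("fine", "POSITIVE")]

balanced = balance(toy, label_of=lambda r: r[1], seed=0)
labels_out = [lab for _, lab in balanced]
print(labels_out.count("NEGATIVE"), labels_out.count("POSITIVE"))  # 1 1
```

Labels not listed in `labels` (here `NEUTRAL`) are dropped, which matches what the notebook's method does.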
        

### Load Data

In [113]:
import json

# Raw string so the backslashes in the Windows path are not read as escape sequences.
file_name = r'D:\Python\Data Science\sklearn datasets\Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[3].text

'I really enjoyed this adventure and look forward to reading more of Robert Spire. I especially liked all the info on global warming. You did a good job on the research.'
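
The loader reads the file one line at a time and parses each line on its own, which implies one JSON object per line with at least the keys `reviewText` and `overall`. A small sketch of that format follows, using an in-memory file instead of the real dataset. The two sample records are made up for illustration.

```python
import io
import json

# Two made-up records in the shape the loader expects.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of failing inside json.loads
        continue
    row = json.loads(line)
    records.append((row["reviewText"], row["overall"]))

print(records)
```

Calling `json.load` on the whole file would fail here, because the file as a whole is not a single JSON document.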

### Prep Data

In [143]:
from sklearn.model_selection import train_test_split

# Hold out 33% of the reviews for testing; random_state makes the split repeatable.
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)


In [144]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


### Bag-of-words vectorization

In [157]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Example sentences that a bag-of-words model turns into word-count vectors:
#   "This book is great"
#   "This book was so bad"

vectorizer = TfidfVectorizer()

train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)


print(train_x[2])
print(train_x_vectors[2].toarray())



This book depicts women as they really do not want to be depicted. As sex toys to be used and abused by men.Women deserve respect, just try and remember and think about the wholesomeness about that.  Don't let men gang up on you women, just be the wonderful people you are; but never be used.
[[0. 0. 0. ... 0. 0. 0.]]
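
To make the idea concrete, here is a hand-rolled bag-of-words sketch for the two example sentences from the cell above. It uses plain word counts and a naive whitespace tokenizer, so it is an illustration only and not how scikit-learn tokenizes. `TfidfVectorizer` goes one step further and down-weights words that appear in many documents.

```python
from collections import Counter

docs = ["This book is great", "This book was so bad"]

# Lowercase and split on whitespace (a deliberately simple tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocab = sorted({token for doc in tokenized for token in doc})

def to_counts(tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [to_counts(doc) for doc in tokenized]

print(vocab)       # ['bad', 'book', 'great', 'is', 'so', 'this', 'was']
print(vectors[0])  # [0, 1, 1, 1, 0, 1, 0]
print(vectors[1])  # [1, 1, 0, 0, 1, 1, 1]
```

With thousands of reviews the vocabulary becomes large and each vector is mostly zeros, which is why the printed row above shows only `0.` entries.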


### Classification

#### Linear SVM

In [158]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

test_x[2]
clf_svm.predict(test_x_vectors[2])

array(['NEGATIVE'], dtype='<U8')

#### Decision Tree

In [159]:
from sklearn.tree import DecisionTreeClassifier

In [160]:
clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)


test_x[2]
clf_dec.predict(test_x_vectors[2])

array(['NEGATIVE'], dtype='<U8')

#### Logistic Regression

In [161]:
from sklearn.linear_model import LogisticRegression

In [162]:
clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)


test_x[2]
clf_log.predict(test_x_vectors[2])

array(['NEGATIVE'], dtype='<U8')

### Evaluation

In [163]:
# Check performance using mean accuracy

print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6370192307692307
0.8052884615384616


In [164]:
# Check performance using F1 scores

from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]))

[0.80582524 0.80952381]
[0.63438257 0.63961814]
[0.80291971 0.80760095]
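
With `average=None`, `f1_score` reports one score per label, in the order given by `labels`. The sketch below computes the same quantity by hand on a made-up set of five predictions, using F1 = 2 * precision * recall / (precision + recall). The helper name `f1_for_label` is hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for a single label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels, only to show the arithmetic.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

f1_pos = f1_for_label(y_true, y_pred, "POSITIVE")
f1_neg = f1_for_label(y_true, y_pred, "NEGATIVE")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f1_pos)    # 0.5
print(f1_neg)    # 2/3, roughly 0.667
print(accuracy)  # 0.6
```

Because the notebook balances both classes before scoring, accuracy and the per-class F1 scores land close together. On imbalanced data they can differ sharply.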


In [154]:
test_set = ['I thoroughly enjoyed this, 5 stars', 'bad book, do not buy', 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_log.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')
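
New text has to pass through the same fitted `vectorizer` before it reaches the classifier, as the cell above does. One way to make that hard to forget is scikit-learn's `make_pipeline`, which chains the two steps into a single estimator. This is an alternative to the notebook's approach, sketched on a tiny made-up training set that only demonstrates the call pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; far too small to give meaningful predictions.
texts = ["great book, loved it", "wonderful and fun read",
         "terrible, waste of time", "bad book, do not buy"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# fit() vectorizes the text and then trains the classifier on the vectors.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# predict() accepts raw strings; the vectorizer step runs automatically.
preds = model.predict(["loved this wonderful book", "horrible waste of time"])
print(preds)
```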

### Tuning the model using Grid Search

In [168]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('linear', 'rbf'), 'C':(1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)



In [169]:
print(clf.score(test_x_vectors, test_y))

0.8197115384615384
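
Grid search tries every combination in `parameters` and scores each one with cross-validation. The sketch below enumerates the same grid in plain Python to show how much work that is. It does not train anything.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
cv = 5  # number of cross-validation folds, as in the cell above

# Cartesian product of all parameter values: one dict per candidate model.
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

print(len(candidates))       # 10 parameter combinations
print(len(candidates) * cv)  # 50 model fits during the search
print(candidates[0])         # {'kernel': 'linear', 'C': 1}
```

After the search, `GridSearchCV` refits the best combination on the full training set by default. The fitted object exposes the winner as `clf.best_params_` and its mean cross-validated score as `clf.best_score_`.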


### Save Model

In [170]:
import pickle

In [179]:
with open('C:/Users/Muange/MACHINE LEARNING/Models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

### Load Model

In [180]:
with open('C:/Users/Muange/MACHINE LEARNING/Models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [186]:
print(test_x[2])
loaded_clf.predict(test_x_vectors[2])

While the attempt to write something new and worthwhile in an abused subgenre is more than welcome, the end result, at least in this case, fails exactly in the literary moments that should  embody its novelty.First, beside the steady use of an atrocious 'baristO' in place of the correct 'baristA', there are several grammar mistakes, which is quite rich in a work that tries to use a new, lyrical language.Second, the French locutions are out of place.Last, but absolutely not least, the pervasive poetic quotations appear -to me at least- far too long and hardly related with the feeling they are supposed to convey.Plot does not make a lot of sense and the overlong scene with Eden at the cafe is hardly conducive of the lead's growth.


array(['NEGATIVE'], dtype='<U8')
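
The pickle file above holds only the classifier. The prediction works here because `test_x_vectors` is still in memory, but a fresh session would also need the fitted `vectorizer` to turn new text into vectors. One possible approach is to pickle both objects together. The helpers `save_bundle` and `load_bundle` are hypothetical, and the stand-in dictionaries take the place of the real fitted objects so the sketch runs on its own.

```python
import os
import pickle
import tempfile

def save_bundle(path, **objects):
    """Pickle several named objects into one file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)

def load_bundle(path):
    """Return the dict of named objects written by save_bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-ins for the fitted objects. In the notebook this call would be
# save_bundle(path, vectorizer=vectorizer, model=clf).
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
save_bundle(path, vectorizer={"vocab": ["bad", "good"]}, model={"C": 4})

bundle = load_bundle(path)
print(sorted(bundle))  # ['model', 'vectorizer']
```

Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code while loading.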