# Loading JSON Data
This file contains 10,000 Amazon book reviews with the following attributes: reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, and reviewTime.

Tutorial by [Keith Galli](https://youtu.be/M9Itm95JzL0)

In [198]:
import json  # to parse our JSON file

In [199]:
file_name=r"C:\Users\yesmi\OneDrive\Desktop\Data Analytics Projects\Projects Set 1\Books_small_10000.json"
with open(file_name) as f:
    for line in f:
        print(line)  # each line is one JSON record; print only the first to inspect it
        break

{"reviewerID": "A1F2H80A1ZNN1N", "asin": "B00GDM3NQC", "reviewerName": "Connie Correll", "helpful": [0, 0], "reviewText": "I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.", "overall": 5.0, "summary": "Can't stop reading!", "unixReviewTime": 1390435200, "reviewTime": "01 23, 2014"}



From each line we want to extract the "reviewText" and the "overall" score, then assign each score a sentiment:
* 1-2: negative sentiment
* 3: neutral
* 4-5: positive sentiment

In [206]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # score of 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

The evenly_distribute() method balances the container so that it holds an equal number of positive and negative reviews.

The method filters the reviews into two lists by sentiment, positive and negative; neutral reviews are discarded. It then shrinks the positive list to the length of the negative list by keeping only the first n positive reviews, where n is the number of negative reviews. Finally, it concatenates the two lists and shuffles the result. The truncation only balances the classes when positive reviews outnumber negative ones.
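The balancing step can be sketched on its own with toy data. This is only an illustration: the `(text, label)` tuples, the `balance` helper, and the fixed seed are assumptions made for the example and do not appear in the notebook.

```python
import random

# Hypothetical toy data: (text, sentiment) pairs standing in for Review objects.
samples = [
    ("great", "POSITIVE"), ("loved it", "POSITIVE"), ("superb", "POSITIVE"),
    ("fine", "NEUTRAL"), ("awful", "NEGATIVE"), ("boring", "NEGATIVE"),
]

def balance(items, seed=0):
    """Keep all negatives plus an equal number of positives, then shuffle."""
    negative = [s for s in items if s[1] == "NEGATIVE"]
    positive = [s for s in items if s[1] == "POSITIVE"]
    balanced = negative + positive[:len(negative)]  # neutral items are dropped
    random.Random(seed).shuffle(balanced)           # seeded only for repeatability
    return balanced

balanced = balance(samples)
labels = [label for _, label in balanced]
print(labels.count("POSITIVE"), labels.count("NEGATIVE"))
```

Both counts come out equal because the positive list is cut to the size of the negative list before the two are merged.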

In [207]:
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))
len(reviews)

10000

In [208]:
print("Text:",reviews[1].text)
print("Score:",reviews[1].score)
print("Sentiment:",reviews[1].sentiment)

Text: I enjoyed this short book. But it was way way to short ....I can see how easily it would have been to add several chapters.
Score: 3.0
Sentiment: NEUTRAL


# Data Preprocessing

In [209]:
from sklearn.model_selection import train_test_split

In [255]:
#splitting the data into training and test set
training_set, test_set = train_test_split(reviews, test_size=0.33,random_state=40)

train_container = ReviewContainer(training_set)

test_container = ReviewContainer(test_set)

In [256]:
#The target here is the sentiment (y), and the predictor is the text (x)
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

423
423


## CountVectorizer
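CountVectorizer converts the text to a matrix of token counts. It creates a vocabulary of all unique words in the text corpus and assigns each word a unique integer index, which becomes a column of the matrix. The code in this section uses TfidfVectorizer instead, which builds the same kind of vocabulary but weights each count by inverse document frequency, so words that appear in many reviews contribute less.

This vectorization step turns the text into numeric features that a model can be trained on.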
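To make the idea of a count matrix concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the two-document corpus and the simplified `tokenize` and `to_counts` helpers are assumptions made for illustration, and scikit-learn's default tokenization rules differ.

```python
import re
from collections import Counter

docs = ["Great book, great story", "Terrible book"]  # hypothetical mini-corpus

def tokenize(text):
    # Simplified tokenizer: lowercase runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

# Give each unique word an integer index, in alphabetical order.
unique_words = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(unique_words)}

def to_counts(text):
    # One count per vocabulary word, in index order.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Each row of `matrix` corresponds to one document and each column to one vocabulary word.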

In [257]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer() 

# Equivalent two-step form:
# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x)
train_x_vectors = vectorizer.fit_transform(train_x)  # learn the vocabulary and convert each review into a numeric vector
test_x_vectors = vectorizer.transform(test_x)  # transform only: the test data must not be used to fit the vectorizer

# Model Selection

## 1. Linear SVM
A **Support Vector Machine** (SVM) with a linear kernel is a binary classifier that works best when the two classes are (close to) linearly separable. Its goal is to find the hyperplane that separates the two classes with the maximum margin. The margin is the distance between the hyperplane and the closest data points from each class; those closest points are the support vectors.
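The geometry of the margin can be illustrated with a small sketch. The hyperplane `w`, `b` and the four labelled points below are hypothetical values chosen by hand; nothing here is fitted, and this is not how scikit-learn finds the hyperplane.

```python
import math

# Hypothetical hyperplane w.x + b = 0 (here the vertical line x1 = 2).
w, b = (1.0, 0.0), -2.0
# Hypothetical 2-D points with class labels -1 / +1.
points = [((1.0, 0.0), -1), ((0.0, 1.0), -1), ((3.0, 0.0), 1), ((4.0, 2.0), 1)]

def signed_distance(x):
    # Positive on one side of the hyperplane, negative on the other.
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(*w)

distances = {x: abs(signed_distance(x)) for x, _ in points}
margin = min(distances.values())  # distance to the closest point(s)
support_vectors = [x for x, d in distances.items() if math.isclose(d, margin)]
correctly_separated = all(signed_distance(x) * y > 0 for x, y in points)
print(margin, support_vectors, correctly_separated)
```

An SVM searches over all separating hyperplanes for the one that makes this smallest distance as large as possible.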

In [258]:
from sklearn import svm
svm_model = svm.SVC(kernel='linear')
svm_model.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [259]:
#prediction test
print(test_x[1])
print('Predicted sentiment:',svm_model.predict(test_x_vectors[1]))

This is an engaging, interesting, thoughtful book by a former atheist who gradually, kicking and screaming, converts to Catholicism.  A well-told conversion story with depth and humor.
Predicted sentiment: ['POSITIVE']


## 2. Decision Tree

In [284]:
from sklearn.tree import DecisionTreeClassifier
Dec_model = DecisionTreeClassifier()
Dec_model.fit(train_x_vectors,train_y)

DecisionTreeClassifier()

In [285]:
Dec_model.predict(test_x_vectors[1])

array(['POSITIVE'], dtype='<U8')

## 3. Naive Bayes

In [286]:
from sklearn.naive_bayes import GaussianNB

NB_model = GaussianNB()
NB_model.fit(train_x_vectors.toarray(),train_y) #"toarray" converts a sparse matrix to a dense numpy array

GaussianNB()

**Why use "toarray()"?**

When dealing with *high-dimensional* data, such as text data with a *large number of features*, the resulting feature vectors can be very sparse. In other words, most of the elements in the feature vectors are zero. In such cases, representing the feature vectors as dense numpy arrays can be computationally expensive and memory-intensive.

To avoid this issue, it is common to represent sparse feature vectors as sparse matrices. A sparse matrix only stores the non-zero elements in a compressed format, which can significantly reduce memory usage and computation time. However, not all machine learning algorithms can handle sparse matrices.

**GaussianNB** is one of them: scikit-learn's implementation expects a dense array and raises an error when it is given a sparse matrix. The model estimates a *mean and variance for every feature* in each class, and those estimates depend on the zero entries as well. Therefore, when using GaussianNB, we need to convert the sparse feature vectors to dense numpy arrays before fitting the model or predicting with it.
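The difference between the two representations can be sketched in plain Python. The vocabulary size, the stored weights, and the `to_dense` helper are hypothetical; scikit-learn uses SciPy sparse matrices rather than dictionaries, so this only mirrors the idea behind `toarray()`.

```python
vocabulary_size = 10_000  # hypothetical number of features
# Sparse form: only the nonzero entries are stored, keyed by column index.
sparse_row = {12: 0.41, 873: 0.27, 9051: 0.87}  # hypothetical weights

def to_dense(row, size):
    # Dense form: one slot per feature, zeros included.
    dense = [0.0] * size
    for index, value in row.items():
        dense[index] = value
    return dense

dense_row = to_dense(sparse_row, vocabulary_size)
print(len(sparse_row), len(dense_row))
```

The dense row stores every feature explicitly, which is why the conversion becomes costly as the vocabulary and the number of reviews grow.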

In [287]:
NB_model.predict(test_x_vectors[1].toarray())

array(['POSITIVE'], dtype='<U8')

## 4. Logistic Regression

In [288]:
from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression()
LR_model.fit(train_x_vectors,train_y)

LogisticRegression()

In [289]:
LR_model.predict(test_x_vectors[1])

array(['POSITIVE'], dtype='<U8')

# Model Analysis and Evaluation
## 1. Compare Accuracy
The **mean accuracy** of a classifier measures how often it predicts the correct class label on the data it is evaluated on (here, the test set). It is the ratio of the number of correctly predicted samples to the total number of samples.
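As a small worked example of that ratio, the sketch below computes accuracy by hand. The label lists and the `accuracy` helper are made up for illustration; they are not the notebook's data.

```python
# Hypothetical true labels and predictions.
y_true = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

def accuracy(true_labels, predicted_labels):
    # Fraction of positions where the prediction matches the true label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

result = accuracy(y_true, y_pred)
print(result)  # 3 of the 4 predictions match
```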

In [290]:
print("SVM Classifier Accuracy:",svm_model.score(test_x_vectors,test_y)) 
print("Decision Tree Classifier Accuracy:",Dec_model.score(test_x_vectors,test_y))
print("Naive Bayes Classifier Accuracy:",NB_model.score(test_x_vectors.toarray(),test_y))
print("Logistic Regression Classifier Accuracy:",LR_model.score(test_x_vectors,test_y))

SVM Classifier Accuracy: 0.8552036199095022
Decision Tree Classifier Accuracy: 0.6764705882352942
Naive Bayes Classifier Accuracy: 0.6176470588235294
Logistic Regression Classifier Accuracy: 0.8506787330316742


## 2. F1 Score
The F1 score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). **Precision** is the number of true positives divided by the number of true positives plus false positives, and **recall** is the number of true positives divided by the number of true positives plus false negatives. With `average=None`, `f1_score` returns one score per label, in the order given by `labels`.
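The per-label computation can be written out by hand. In this sketch F1 is taken as 2 * precision * recall / (precision + recall); the label lists and the `f1_for_label` helper are hypothetical, and returning 0.0 for an undefined ratio is a choice made here for simplicity.

```python
# Hypothetical true labels and predictions.
y_true = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

def f1_for_label(true_labels, predicted_labels, label):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_positive = f1_for_label(y_true, y_pred, "POSITIVE")
f1_negative = f1_for_label(y_true, y_pred, "NEGATIVE")
print(f1_positive, f1_negative)
```

For POSITIVE, precision is 2/3 and recall is 1, giving an F1 of 0.8; for NEGATIVE, precision is 1 and recall is 1/2, giving an F1 of 2/3.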

In [291]:
from sklearn.metrics import f1_score

# f1_score(test_y, svm_model.predict(test_x_vectors), average= None, 
#          labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
#SVM 
f1_score(test_y, svm_model.predict(test_x_vectors), average= None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.85253456, 0.85777778])

In [292]:
#Decision tree
f1_score(test_y, Dec_model.predict(test_x_vectors), average= None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.66510539, 0.68708972])

In [293]:
#Naive Bayes
f1_score(test_y, NB_model.predict(test_x_vectors.toarray()), average= None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])

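  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))

array([0.60788863, 0.        , 0.62693157])

The middle score is 0 because Sentiment.NEUTRAL was passed in `labels`, but evenly_distribute() removed every neutral review, so that label appears in neither test_y nor the predictions. scikit-learn warns that the F-score is undefined for it and reports 0. The first and last values are the POSITIVE and NEGATIVE scores.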

In [294]:
#Logistic Regression
f1_score(test_y, LR_model.predict(test_x_vectors), average= None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.84862385, 0.85267857])

## 3. Qualitative Testing

In [295]:
test = ["very good. excellent book","terrible book, do not recommend", "wouldn't recommend it to my worst enenmies"]

vect_test = vectorizer.transform(test)

NB_model.predict(vect_test.toarray())

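array(['NEGATIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

Note that NB_model, the weakest of the four models by accuracy, labels even the first, clearly positive sentence as NEGATIVE.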

# Tuning the Model

In [296]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)

clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})

In [297]:
clf.score(test_x_vectors, test_y)

0.8484162895927602

# Saving Model

In [298]:
import pickle

with open(r'C:\Users\yesmi\OneDrive\Desktop\Data Analytics Projects\Projects Set 1\Models\SVM_sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

This code saves the trained clf model (the fitted GridSearchCV object) using Python's pickle module. The model is written in binary format to the file SVM_sentiment_classifier.pkl in the Models folder.

The with open(...) block opens the file in 'wb' mode, that is, for writing in binary mode. The pickle.dump() function then writes the clf object to the file.
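The save-and-load round trip can be tried in isolation with a stand-in object. In this sketch the `bundle` dictionary, the temporary directory, and the file name are all hypothetical; in the notebook the pickled object is the fitted clf.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object behaves the same way.
bundle = {"model": "stand-in for clf", "best_kernel": "linear"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle.pkl")  # hypothetical file name
    with open(path, "wb") as f:  # 'wb': write in binary mode
        pickle.dump(bundle, f)
    with open(path, "rb") as f:  # 'rb': read in binary mode
        restored = pickle.load(f)

print(restored == bundle)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.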

## Loading the Model
A loaded model can make predictions without being trained again. It still expects input that was transformed by the same fitted vectorizer.

In [299]:
with open(r'C:\Users\yesmi\OneDrive\Desktop\Data Analytics Projects\Projects Set 1\Models\SVM_sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [302]:
test_x[0]

"Constance Cherry's book is a wealth of information.  What a fresh way to look at worship planning!  I highly recommend to worship planners and worshipers alike!"

In [301]:
loaded_clf.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')