### Define Classes

In [17]:
import random  # needed by ReviewContainer.evenly_dist below

#Create data classes for all the data being loaded

class Sentiment:
    negative = "negative"
    neutral = "neutral"
    positive = "positive"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    #Derive the sentiment label from the star rating (used as the target for the NLP ML models later)
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.negative
        elif self.score == 3:
            return Sentiment.neutral
        else:
            return Sentiment.positive
        
        
#Create a container class that evens out positive and negative training data so as not to introduce bias into the ML model
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
      
    #create method to extract text from each review
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_dist(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.negative, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.positive, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
        print("No. of negative: " + str(len(negative)))
        print("No. of positive: " + str(len(positive_shrunk)))
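
`evenly_dist` keeps the first `len(negative)` positive reviews and shuffles with the global `random` state, so the result changes from run to run. A minimal sketch of a seeded alternative is below. The helper name `balance_by_label`, the `seed` argument and the toy tuples are all assumptions for illustration and are not part of the notebook.

```python
import random
from collections import defaultdict

def balance_by_label(items, get_label, labels, seed=0):
    """Downsample every label in `labels` to the size of the rarest one."""
    rng = random.Random(seed)  # local generator: reproducible, no global state
    groups = defaultdict(list)
    for item in items:
        label = get_label(item)
        if label in labels:  # anything else (e.g. neutral) is dropped
            groups[label].append(item)
    target = min(len(groups[label]) for label in labels)
    balanced = []
    for label in labels:
        # Random sample rather than "the first N" of each class
        balanced.extend(rng.sample(groups[label], target))
    rng.shuffle(balanced)
    return balanced

# Toy (text, sentiment) pairs standing in for Review objects
toy = [("great", "positive")] * 6 + [("awful", "negative")] * 2 + [("ok", "neutral")] * 3
balanced = balance_by_label(toy, lambda r: r[1], ["negative", "positive"], seed=42)
print(len(balanced))
```

With the same `seed`, the same reviews are selected every time, which makes later accuracy numbers easier to compare.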

### Load Data

In [5]:
import json

#Opening file in same directory
file_name = './Books_small_10000.json'

with open(file_name) as f:
    for line in f:
        print(line)
        break

{"reviewerID": "A1F2H80A1ZNN1N", "asin": "B00GDM3NQC", "reviewerName": "Connie Correll", "helpful": [0, 0], "reviewText": "I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.", "overall": 5.0, "summary": "Can't stop reading!", "unixReviewTime": 1390435200, "reviewTime": "01 23, 2014"}



In [6]:
# Use the keys seen in the print above to pull out the corresponding values. Each line is a JSON string, not a dict, so it needs to be parsed first

with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        print(review['reviewText'])
        print(review['overall'])
        break

I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.
5.0


In [11]:
#The code above works for individual reviews and ratings, so use it to create a list of every review and rating.
#Use the Review class to append each one as a Review object

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))

#Each Review object can be selected by integer index, with .text, .score or .sentiment giving the body of the review, its star rating or its sentiment label
print(reviews[5].text)
print(reviews[5].sentiment)

I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.
positive
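
The loading loop assumes every line is a complete JSON record with non-empty text. A slightly more defensive sketch is below. The `load_reviews` helper and the in-memory sample are assumptions for illustration, and skipping blank or empty reviews is my choice rather than something the notebook does. Only the field names `reviewText` and `overall` come from the file shown above.

```python
import io
import json

def load_reviews(lines):
    """Yield (text, score) pairs from an iterable of JSON-encoded lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("reviewText", "")
        score = record.get("overall")
        if text and score is not None:  # skip records with no usable text or rating
            yield text, float(score)

# In-memory stand-in for the JSON-lines file
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '\n'
    '{"reviewText": "", "overall": 1.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)
pairs = list(load_reviews(sample))
print(pairs)
```

The same generator could be fed an open file handle, with each pair passed to `Review(text, score)`.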


### Prep Data

In [6]:
len(reviews)

10000

In [18]:
from sklearn.model_selection import train_test_split
import random


#split review data to use to train and test ML algorithm
train, test = train_test_split(reviews, test_size = 0.2, random_state = 42)

train_container = ReviewContainer(train)
train_container.evenly_dist()

test_container = ReviewContainer(test)
test_container.evenly_dist()

No. of negative: 513
No. of positive: 513
No. of negative: 131
No. of positive: 131


In [22]:
# split training data into text and sentiment (X and y in ML algorithm)
X_train = train_container.get_text()
y_train = train_container.get_sentiment()

# likewise split test data into text and sentiment (X and y)
X_test = test_container.get_text()
y_test = test_container.get_sentiment()

#### Bag of words vectorisation

In [23]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#Use a TF-IDF weighted bag of words to create a numeric version of the review text, e.g. each word is assigned a column and given a weight in a review's vector when it appears in that review

vectorizer = TfidfVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)

X_test_vectors = vectorizer.transform(X_test)

X_train_vectors[0]
X_test_vectors[0]

<1x9625 sparse matrix of type '<class 'numpy.float64'>'
	with 75 stored elements in Compressed Sparse Row format>
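
To make the weighting concrete, here is a small hand-rolled TF-IDF calculation on three invented sentences. It is only a sketch: it splits on whitespace, whereas `TfidfVectorizer` uses its own tokenizer, and it uses the smoothed idf formula with L2 normalisation, which as far as I know matches the vectorizer's default settings.

```python
import math
from collections import Counter

# Three made-up reviews (not from the dataset)
docs = [
    "the book was good",
    "the book was bad",
    "the plot was good and the ending was good",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenised for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: rarer words get a larger value
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    raw = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 length of the vector
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenised[1])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:5s} {weight:.3f}")
```

In the second sentence, "bad" occurs in one document only and ends up with the largest weight, while "the" and "was" occur in all three and end up with the smallest.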

### Classification

#### Linear SVM

In [24]:
from sklearn import svm

clf_svm = svm.SVC(kernel = 'linear')

clf_svm.fit(X_train_vectors, y_train)
# Show prediction for first review
clf_svm.predict(X_test_vectors[0])

array(['positive'], dtype='<U8')

In [39]:
#actual text & sentiment from first review
print(X_test[0])
print(y_test[0])

If you can, read the first 4-5 pages, you will immediately if you want to read more or not. As for me, I've rated The Witness two stars because I had trouble relating to the main character (girl, 16, she is a genious - no problem with that - but she can do anything, dancing like shakira, finding out what the FBI needed years of hard work to know by just googling; she is also very beautiful, she has great insight, she is completely good - no grey shade, she'll go straight to heaven)  and clich&eacute;d (girl raised by a strict mother rebels). The story is no too complex either so... Two stars.
negative


#### Decision tree

In [26]:
from sklearn.tree import DecisionTreeClassifier

clf_decision = DecisionTreeClassifier()
clf_decision.fit(X_train_vectors, y_train)
# Show prediction for first review
clf_decision.predict(X_test_vectors[0])

array(['negative'], dtype='<U8')

#### Naive Bayes

In [27]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
clf_gnb.fit(X_train_vectors.toarray(), y_train)
# Show prediction for first review
clf_gnb.predict(X_test_vectors[0].toarray())

array(['positive'], dtype='<U8')

#### Logistic regression

In [28]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(X_train_vectors, y_train)
# Show prediction for first review
clf_log.predict(X_test_vectors[0])

array(['positive'], dtype='<U8')

### Evaluation

In [29]:
#Pass the training data back through the ML models to see how well they score on mean accuracy (these are training scores, so they overstate performance on unseen reviews)

print(clf_svm.score(X_train_vectors, y_train))
print(clf_decision.score(X_train_vectors, y_train))
print(clf_gnb.score(X_train_vectors.toarray(), y_train))
print(clf_log.score(X_train_vectors, y_train))

0.9892787524366472
1.0
0.9844054580896686
0.9668615984405458


In [55]:
# Pass test data through ml models and evaluate their f1_scores

from sklearn.metrics import f1_score

print(f1_score(y_test, clf_svm.predict(X_test_vectors), average = None, labels = [Sentiment.positive, Sentiment.negative]))
print(f1_score(y_test, clf_decision.predict(X_test_vectors), average = None, labels = [Sentiment.positive, Sentiment.negative]))
print(f1_score(y_test, clf_log.predict(X_test_vectors), average = None, labels = [Sentiment.positive, Sentiment.negative]))

[0.82442748 0.82442748]
[0.67896679 0.65612648]
[0.83333333 0.83076923]
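
For reference, the per-class F1 score is the harmonic mean of precision and recall for that class. The sketch below computes it by hand. The function name `f1_per_class` and the six labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not taken from the dataset.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted
    fp = sum(1 for t, p in pairs if t != label and p == label)  # predicted but wrong
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true_demo = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred_demo = ["positive", "positive", "negative", "negative", "negative", "positive"]

for label in ("positive", "negative"):
    print(label, round(f1_per_class(y_true_demo, y_pred_demo, label), 3))
```

Reporting one score per class, as `average = None` does, shows whether a model is noticeably better at one sentiment than the other.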


In [56]:
#Qualitative test

test_set = ['Great book, would recommend', 'I thouroughly enjoyed this purchase', 'The was a poor read']
test_set_vect = vectorizer.transform(test_set)

clf_log.predict(test_set_vect)

array(['positive', 'positive', 'negative'], dtype='<U8')

#### Use grid search to tune the ML algorithm

In [67]:
# Grid search allows us to test different outputs using different parameters within our prediction models
# We can then choose the parameters with the most accurate results
from sklearn.model_selection import GridSearchCV

#parameters = { 'C': (1,2,3,4,8)}
parameters = {'C': (1,2,3,4,8), 'kernel': ('linear', 'rbf')}
#log = LogisticRegression()
svc = svm.SVC()
#clf = GridSearchCV(log, parameters, cv=5)
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train_vectors, y_train)
clf.cv_results_

{'mean_fit_time': array([0.41761489, 0.4894989 , 0.4553153 , 0.52656617, 0.45564137,
        0.47928734, 0.43006668, 0.46594052, 0.49164701, 0.52843781]),
 'std_fit_time': array([0.03012064, 0.02166058, 0.01635343, 0.04539253, 0.02144911,
        0.00689959, 0.00464009, 0.00909492, 0.03748035, 0.05264262]),
 'mean_score_time': array([0.10853539, 0.11447353, 0.10506654, 0.13060327, 0.09372711,
        0.11356897, 0.09373569, 0.11343985, 0.09875398, 0.13222017]),
 'std_score_time': array([1.28396386e-02, 3.64179497e-03, 2.20242901e-02, 1.03787136e-02,
        1.39653672e-02, 5.83483432e-03, 1.19276028e-05, 9.17844476e-03,
        1.35660096e-02, 1.85099437e-02]),
 'param_C': masked_array(data=[1, 1, 2, 2, 3, 3, 4, 4, 8, 8],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['linear', 'rbf', 'linear', 'rbf', 'linear', 'rbf',
                  

In [72]:
#Highest accuracy came from the SVC algorithm with C=1, kernel = 'rbf'

clf_new = svm.SVC(kernel = 'rbf', C =1)
clf_new.fit(X_train_vectors, y_train)
print(clf_new.score(X_test_vectors, y_test))

0.8358778625954199
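
To see where the ten entries in `cv_results_` come from, the parameter grid from the grid search cell can be expanded by hand. This is a minimal standard-library sketch that reproduces the ordering visible in `param_C`. It is an illustration of the idea, not scikit-learn's internal implementation.

```python
from itertools import product

# Same grid as passed to GridSearchCV earlier
parameters = {"C": (1, 2, 3, 4, 8), "kernel": ("linear", "rbf")}
cv = 5

names = sorted(parameters)  # fixed key order: C, then kernel
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

print(len(candidates))       # number of parameter combinations
print(len(candidates) * cv)  # number of model fits during cross-validation
print(candidates[:3])
```

Every extra parameter value multiplies the number of fits, so the grid is worth keeping small on larger datasets.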


### Conclusion

#### Results
In this project, I used several supervised ML models to predict whether a product review from a customer is positive or negative. 

A dataset of 10,000 book reviews was used to train several ML models:
 - Support Vector Machine
 - Decision Tree
 - Naive Bayes
 - Logistic Regression
 
These models were then refined to obtain the most accurate predictions. The dataset was split 80:20 into training and testing sets. The models were trained on reviews and their ratings (1-5 stars, translated into positive, neutral and negative; only the positive and negative reviews were kept, in equal numbers), and were then used to make predictions on the remainder of the dataset. These predictions were compared with the actual review ratings in order to determine the accuracy of the ML models.

Additionally, a grid search was used to refine the ML model by trying different values for specific parameters within the model. The best combination of these parameters was chosen in order to obtain the most accurate model.

The final model chosen was an SVM, using the values 'rbf' and 1 for the parameters 'kernel' and 'C'. The accuracy of this model on the test set was **84%**.

Although the model accuracy was relatively high, there is room for improvement.

#### Future steps

- Improve vectorizer: Currently a TF-IDF vectorizer is being used. This weights key words like "bad", "good", "excellent" and "terrible" higher than neutral words like "the", "this" and "I" by giving a smaller weighting to words that appear in many reviews. However, this method does not distinguish between the less frequent words to determine whether they influence how positive or negative a review is. For example, if the word "paramount" is used infrequently it is weighted highly, but this gives no insight into whether the review is positive or negative. A better model would identify the words that do affect the sentiment of a review, for example adjectives, and weight these higher than similarly infrequent words that don't convey a positive or negative opinion.
<br>   
  
- Refine the data used to train the model: The model uses the entire review submitted by a customer, counting every word and every variation of a word (e.g. "enjoy", "enjoyed" and "enjoying") separately. One way to improve the model's accuracy would be to refine these reviews to contain mainly key words. This could be achieved by selectively stripping words that can be confidently disregarded.
<br>
  
- Increase data used to train the model: An obvious way to increase the accuracy of the model would be to increase the number of reviews used to train the model, thus allowing the ML algorithm to create a more accurate and detailed model of positive and negative words.
<br>
  
- The model could also be used to classify the type of product being reviewed: As well as using the review to predict a negative and positive rating, the ML model could be used to classify the product of a review, for example, if the product was clothing, electrical, home etc.
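
As a rough sketch of the "refine the data" step above, the snippet below lowercases a review, splits it into words and drops a small set of words that carry no sentiment. The `STOP_WORDS` set and the `normalise` helper are hypothetical and hand-picked for illustration; a real list would need to be chosen with care so that words like "not" are kept.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a recommended set)
STOP_WORDS = {"the", "this", "i", "a", "an", "and", "was", "is", "it"}

def normalise(text, stop_words=STOP_WORDS):
    """Lowercase, tokenise on letters/digits/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(normalise("Good! This was a GOOD book."))
```

`TfidfVectorizer` also accepts a `stop_words` argument, so a similar effect may be achievable without a separate preprocessing step.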