Author: WenJun Cen  
Date: 01/24/2020  
File: Github  

## Creating Classes

In [2]:
"""Two classes are defined. Score is an enum-like class whose values are POSITIVE, NEGATIVE, and NEUTRAL.
Review stores a review's text and rating and derives its sentiment: ratings of 4 or higher are positive,
a rating of 3 is neutral, and ratings below 3 are negative. Ratings range from 1 to 5."""

# Enum-like class holding the three sentiment labels
class Score:
    POSITIVE = 'POSITIVE'
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'

# Class that receives a review's text and rating and derives its sentiment
class Review:
    
    def __init__(self, text, rating):
        self.text = text
        self.rating = rating
        self.sentiment = self.get_sentiment()
    
    # return positive, neutral, or negative based on rating
    def get_sentiment(self):
        if self.rating >= 4:
            return Score.POSITIVE
        elif self.rating == 3:
            return Score.NEUTRAL
        else:
            return Score.NEGATIVE
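
A quick check of the thresholds, as a minimal sketch. The two classes are restated so the block runs on its own, and the sample ratings and placeholder text are made up for illustration.

```python
class Score:
    POSITIVE = 'POSITIVE'
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'


class Review:
    def __init__(self, text, rating):
        self.text = text
        self.rating = rating
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # >= 4 is positive, exactly 3 is neutral, everything else is negative
        if self.rating >= 4:
            return Score.POSITIVE
        elif self.rating == 3:
            return Score.NEUTRAL
        else:
            return Score.NEGATIVE


# Hypothetical ratings covering the whole 1-5 range
samples = [Review('placeholder text', rating) for rating in (1.0, 2.0, 3.0, 4.0, 5.0)]
labels = [sample.sentiment for sample in samples]
print(labels)
```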

## Importing JSON

In [3]:
import json
import gzip


filename = 'reviews_Toys_and_Games_5.json.gz'  # Sample file.

reviewText = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # The file holds one JSON object per line.
        line = line.rstrip()
        if line:  # Skip blank lines.
            review = json.loads(line)
            # Keep only the review text and the star rating
            reviewText.append(Review(review['reviewText'], review['overall']))
            
print(reviewText[123].text, reviewText[123].rating)

Tested by my 3 years & 2 months young grandson. I picked out 2 corners for him as starting points & he went on from there with very little help from me. His almost 2 years young sister wasn't interested in trying this puzzle.The pieces are huge! I measured two of them to give you a general idea of their size...one was 7" x 4 1/2". A corner piece measured 5 5/8" x 4 1/2".  Each sturdy piece has a nice shine & is approximately 1/8" thick.Another plus is the storage box which has a corded handle for carrying & even more importantly...the box is roomy! Meaning your child won't have to be extra careful getting the pieces back into the box. Oh my gosh, I remember the days when I'd put my kids' puzzles away & have to keep redoing it so the lid would close. Not so with this storage box, my grandson put the pieces away in no time flat & he didn't have to fiddle around getting them in just so.Nice quality puzzle...5 stars 5.0
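
The loading cell relies on the file holding one JSON object per line inside a gzip archive. The sketch below reproduces that layout with a tiny temporary file so the read loop can be tried without the real dataset. The two records and the file name `demo_reviews.json.gz` are invented for illustration. Only the `reviewText` and `overall` keys come from the notebook.

```python
import gzip
import json
import os
import tempfile

# Made-up records using the same two keys the notebook reads
records = [
    {'reviewText': 'Great puzzle', 'overall': 5.0},
    {'reviewText': 'Broke in a day', 'overall': 1.0},
]

# Write one JSON object per line into a gzip file (hypothetical demo file)
path = os.path.join(tempfile.mkdtemp(), 'demo_reviews.json.gz')
with gzip.open(path, 'wt', encoding='utf-8') as gzip_file:
    for record in records:
        gzip_file.write(json.dumps(record) + '\n')

# Read it back line by line, keeping only text and rating
loaded = []
with gzip.open(path, 'rt', encoding='utf-8') as gzip_file:
    for line in gzip_file:
        line = line.strip()
        if line:
            record = json.loads(line)
            loaded.append((record['reviewText'], record['overall']))

print(loaded)
```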


## Preparing Data

In [4]:
# Count the reviews in each sentiment class (negative, positive, neutral)
negative = list(filter(lambda x: x.sentiment == Score.NEGATIVE, reviewText))
positive = list(filter(lambda x: x.sentiment == Score.POSITIVE, reviewText))
neutral = list(filter(lambda x: x.sentiment == Score.NEUTRAL, reviewText))
list1 = [negative, positive, neutral]
[len(listobj) for listobj in list1]

[11005, 140235, 16357]

In [5]:
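# Balance the classes: truncate positive and neutral to the size of the smallest class (negative).
# Note that the slices keep the first len(negative) reviews in file order, not a random sample.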
import random
positive_prop = positive[:len(negative)]
neutral_prop = neutral[:len(negative)]
reviews = negative + positive_prop + neutral_prop
random.shuffle(reviews)
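
The slices above keep the first reviews in file order. If that order carries any pattern, which the notebook does not establish either way, drawing a random sample from each class may be a safer way to balance. A minimal sketch on toy lists follows. `downsample` is a hypothetical helper and the `*_demo` lists are stand-ins, not the notebook's data.

```python
import random


def downsample(items, n, seed=123):
    """Return n items drawn at random without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(items, n)


# Toy stand-ins for the three sentiment lists
negative_demo = ['neg'] * 3
positive_demo = [f'pos{i}' for i in range(10)]
neutral_demo = [f'neu{i}' for i in range(5)]

# Shrink every class to the size of the smallest one
n = min(len(negative_demo), len(positive_demo), len(neutral_demo))
balanced = (downsample(negative_demo, n)
            + downsample(positive_demo, n)
            + downsample(neutral_demo, n))
random.Random(123).shuffle(balanced)
print(len(balanced))
```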

In [6]:
listprop = [negative, positive_prop, neutral_prop]
[len(listobj) for listobj in listprop]

[11005, 11005, 11005]

### Splitting Data into Training and Testing

In [7]:
from sklearn.model_selection import train_test_split

training, testing = train_test_split(reviews, train_size = 0.7, random_state=123)

In [8]:
# Look at length of both training and testing data
print(len(training))
print(len(testing))

23110
9905


In [9]:
# List comprehensions to pull out the text (features) and the sentiment (labels)
train_x = [train.text for train in training]
train_y = [train.sentiment for train in training]

test_x = [test.text for test in testing]
test_y = [test.sentiment for test in testing]

In [10]:
# View the text, rating, and sentiment of the first training review
train_x[0], training[0].rating, train_y[0]

("They're nothing spectacular, but the assortment of poses and colors are nice.  Most would not stay standing due to excess plastic on the feet, and the paint was pretty chipped up, but at this price I can't really complain.",
 3.0,
 'NEUTRAL')

### Transform corpus into count vectors

In [11]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

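# Start from scikit-learn's built-in English stop words; add custom words to the union() call as needed.
# The result is passed as a list, which works across scikit-learn versions.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union([]))

vectorizer = CountVectorizer(stop_words=my_stop_words, lowercase=True)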
train_vector_x = vectorizer.fit_transform(train_x)

test_vector_x = vectorizer.transform(test_x)

In [12]:
print(train_x[0])
print(train_vector_x[0].toarray())

They're nothing spectacular, but the assortment of poses and colors are nice.  Most would not stay standing due to excess plastic on the feet, and the paint was pretty chipped up, but at this price I can't really complain.
[[0 0 0 ... 0 0 0]]
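
The row printed above is almost entirely zeros because each review uses only a handful of the words in the vocabulary. The sketch below shows the bag-of-words idea on two made-up sentences in plain Python. It is a simplified illustration rather than scikit-learn's implementation, and the two-word stop list is invented. The token pattern mirrors `CountVectorizer`'s default of two or more word characters.

```python
import re
from collections import Counter

docs = ["The puzzle pieces are huge", "huge box, huge pieces"]
stop_words = {"the", "are"}  # tiny illustrative stop list, not scikit-learn's


def tokenize(doc):
    # Lowercase, keep tokens of 2+ word characters, drop stop words
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [token for token in tokens if token not in stop_words]


# Vocabulary: every distinct token, sorted alphabetically (one column per token)
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document: how often each vocabulary token occurs in it
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[token] for token in vocabulary])

print(vocabulary)
print(rows)
```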


## Predict customer review sentiment

### Decision Tree

In [16]:
from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier(random_state=123)
clf_tree.fit(train_vector_x, train_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=123, splitter='best')

In [17]:
clf_tree.predict(test_vector_x[0])

array(['NEGATIVE'], dtype='<U8')

In [18]:
clf_tree.score(test_vector_x, test_y)

0.5125694093891974

### Random Forest

In [21]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators=10, random_state=123)
clf_rf = clf_rf.fit(train_vector_x, train_y)

In [22]:
clf_rf.predict(test_vector_x[14])

array(['NEGATIVE'], dtype='<U8')

In [23]:
test_x[14], test_y[14]

("I couldn't believe all the good reviews and top rating for this toy after actually seeing it work.  All it does is make a single note sound with each squeeze of its stomach.  What could(n't) be more exciting?!  It goes on top of an already huge mountain of stuffed animals.  And they aren't even animals so there goes the educational factor.  Don't buy it just to have it because it is one of the top toys for the year, it is not even worth the $9.99 sale price.",
 'NEGATIVE')

In [24]:
clf_rf.score(test_vector_x, test_y)

0.564967188288743

### Support Vector Machine

In [25]:
from sklearn import svm

clf_svm = svm.SVC(kernel="linear", random_state=123)
clf_svm = clf_svm.fit(train_vector_x, train_y)

In [27]:
clf_svm.predict(test_vector_x[14])

array(['NEGATIVE'], dtype='<U8')

In [28]:
clf_svm.score(test_vector_x, test_y)

0.6339222614840989

### Neural Network 

In [30]:
from sklearn.neural_network import MLPClassifier

clf_nn = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 2), random_state=1)

clf_nn.fit(train_vector_x, train_y)

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(5, 2), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

In [31]:
clf_nn.predict(test_vector_x[14])

array(['NEGATIVE'], dtype='<U8')

In [32]:
clf_nn.score(test_vector_x, test_y)

0.6324078748107017

## Evaluation 

In [41]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Weighted F1 per model, printed in this order:
# neural network, decision tree, random forest, support vector machine
print(f1_score(test_y, clf_nn.predict(test_vector_x), average='weighted'))
print(f1_score(test_y, clf_tree.predict(test_vector_x), average='weighted'))
print(f1_score(test_y, clf_rf.predict(test_vector_x), average='weighted'))
print(f1_score(test_y, clf_svm.predict(test_vector_x), average='weighted'))

0.632711008735242
0.5119965678714209
0.5644032988980122
0.6334646392365183


### Confusion Matrix for Neural Network

In [42]:
confusion_matrix(test_y, clf_nn.predict(test_vector_x))

array([[2114,  978,  249],
       [ 972, 1734,  587],
       [ 238,  617, 2416]], dtype=int64)
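
The weighted F1 reported for the neural network in the Evaluation cell can be recovered from this matrix alone. In the sketch below, rows are true classes and columns are predicted classes. With string labels, `confusion_matrix` sorts them alphabetically, so the order is NEGATIVE, NEUTRAL, POSITIVE. The variable names are illustrative.

```python
# Neural network confusion matrix, copied from the output above
cm = [[2114, 978, 249],
      [972, 1734, 587],
      [238, 617, 2416]]

n_classes = len(cm)

# Row sums: how many test reviews truly belong to each class (the "support")
support = [sum(row) for row in cm]

# Column sums: how many test reviews were predicted as each class
predicted = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]

# Per-class F1 = 2 * TP / (true count + predicted count)
f1_per_class = [2 * cm[i][i] / (support[i] + predicted[i]) for i in range(n_classes)]

# Weighted average: each class's F1 weighted by its support
weighted_f1 = sum(f1 * s for f1, s in zip(f1_per_class, support)) / sum(support)
print(round(weighted_f1, 4))
```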

In [56]:
# Accuracy = correct predictions (matrix diagonal) / number of test reviews
print(f'Total accuracy: {(2114 + 1734 + 2416) / len(test_y):.2%}')

Total accuracy: 63.24%


### Confusion Matrix for Random Forest

In [45]:
confusion_matrix(test_y, clf_rf.predict(test_vector_x))

array([[1942, 1000,  399],
       [1058, 1500,  735],
       [ 410,  707, 2154]], dtype=int64)

In [55]:
# Accuracy = correct predictions (matrix diagonal) / number of test reviews
print(f'Total accuracy: {(1942 + 1500 + 2154) / len(test_y):.2%}')

Total accuracy: 56.50%


### Confusion Matrix for Decision Tree

In [48]:
confusion_matrix(test_y, clf_tree.predict(test_vector_x))

array([[1684, 1112,  545],
       [ 983, 1402,  908],
       [ 485,  795, 1991]], dtype=int64)

In [54]:
# Accuracy = correct predictions (matrix diagonal) / number of test reviews
print(f'Total accuracy: {(1684 + 1402 + 1991) / len(test_y):.2%}')

Total accuracy: 51.26%


### Confusion Matrix for Support Vector Machine

In [50]:
confusion_matrix(test_y, clf_svm.predict(test_vector_x))

array([[2143,  954,  244],
       [ 998, 1703,  592],
       [ 287,  551, 2433]], dtype=int64)

In [53]:
# Accuracy = correct predictions (matrix diagonal) / number of test reviews
print(f'Total accuracy: {(2143 + 1703 + 2433) / len(test_y):.2%}')

Total accuracy: 63.39%
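
Accuracy is the sum of a confusion matrix's diagonal divided by the sum of all its entries, which here is the 9,905 test reviews. The sketch below applies that to the four matrices above and adds per-class recall. `accuracy` and `recall_per_class` are hypothetical helpers, and the label order assumes the alphabetical sorting that `confusion_matrix` applies to string labels.

```python
labels = ['NEGATIVE', 'NEUTRAL', 'POSITIVE']

# Confusion matrices copied from the outputs above (rows = true, columns = predicted)
matrices = {
    'Neural Network': [[2114, 978, 249], [972, 1734, 587], [238, 617, 2416]],
    'Random Forest': [[1942, 1000, 399], [1058, 1500, 735], [410, 707, 2154]],
    'Decision Tree': [[1684, 1112, 545], [983, 1402, 908], [485, 795, 1991]],
    'Support Vector Machine': [[2143, 954, 244], [998, 1703, 592], [287, 551, 2433]],
}


def accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def recall_per_class(cm):
    """For each true class, the share of its reviews that were labelled correctly."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


for name, cm in matrices.items():
    recalls = ', '.join(f'{label}={value:.3f}'
                        for label, value in zip(labels, recall_per_class(cm)))
    print(f'{name}: accuracy={accuracy(cm):.4f}  recall: {recalls}')
```

In all four matrices the NEUTRAL row has the lowest recall, so the 3-star reviews are the ones each model confuses most often.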
