# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [19]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('Data/reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased
with open('Data/labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [20]:
len(reviews)

25000

In [21]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [22]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [23]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation

In [24]:
from collections import Counter
import numpy as np

In [25]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [26]:
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
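
One side effect of `split(" ")` is worth knowing: consecutive spaces produce empty-string tokens, which is why `''` shows up as a "word" in these counters. The sketch below reproduces the counting loop on a made-up two-review dataset (the `toy_*` names and the review text are hypothetical, for illustration only) and uses `Counter.update` as a compact alternative to incrementing word by word.

```python
from collections import Counter

# Hypothetical two-review dataset, for illustration only
toy_reviews = ["great  fun", "dull  dull plot"]   # note the double spaces
toy_labels = ["POSITIVE", "NEGATIVE"]

toy_positive = Counter()
toy_negative = Counter()
toy_total = Counter()

for review, label in zip(toy_reviews, toy_labels):
    words = review.split(" ")          # consecutive spaces yield '' tokens
    target = toy_positive if label == "POSITIVE" else toy_negative
    target.update(words)               # add 1 per occurrence of each word
    toy_total.update(words)

print(toy_total.most_common())
```

Whether to filter out the empty token is a design choice; the notebook keeps it, so the sketch does too.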

In [27]:
positive_counts.most_common()

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('g

In [28]:
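pos_neg_ratios = Counter()

# Only consider words seen more than 100 times; rarer words give less reliable ratios
for term, cnt in total_counts.most_common():
    if cnt > 100:
        # +1 in the denominator avoids division by zero
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio

# Take logs so positive and negative words sit symmetrically around 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # +0.01 keeps the log finite when the ratio is 0
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))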
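
To see what the log transform does, here is a minimal sketch that applies the same two-branch formula to made-up counts. The helper `log_ratio` and the numbers in `toy_counts` are hypothetical and not taken from the dataset. Words used mostly in positive reviews should land above 0, words used mostly in negative reviews below 0, and neutral words close to 0.

```python
import numpy as np

def log_ratio(pos_count, neg_count):
    """Same formula as the cell above: smoothed ratio, then a signed log."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return np.log(ratio)
    return -np.log(1 / (ratio + 0.01))

# Hypothetical (positive, negative) counts, chosen only for illustration
toy_counts = {
    "great": (300, 99),    # ratio 3.0  -> clearly positive
    "the": (1000, 999),    # ratio 1.0  -> neutral
    "awful": (99, 299),    # ratio 0.33 -> clearly negative
}

for word, (pos, neg) in toy_counts.items():
    print(word, round(log_ratio(pos, neg), 3))
```

Without the log, a word three times more common in positive reviews would score 3.0 while its negative counterpart would score about 0.33, which makes the two hard to compare by magnitude.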

In [29]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()

[('edie', 4.6913478822291435),
 ('paulie', 4.0775374439057197),
 ('felix', 3.1527360223636558),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.8067217286092401),
 ('victoria', 2.6810215287142909),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.5389738710582761),
 ('flawless', 2.451005098112319),
 ('superbly', 2.2600254785752498),
 ('perfection', 2.1594842493533721),
 ('astaire', 2.1400661634962708),
 ('captures', 2.0386195471595809),
 ('voight', 2.0301704926730531),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.9783454248084671),
 ('brosnan', 1.9547990964725592),
 ('lily', 1.9203768470501485),
 ('bakshi', 1.9029851043382795),
 ('lincoln', 1.9014583864844796),
 ('refreshing', 1.8551812956655511),
 ('breathtaking', 1.8481124057791867),
 ('bourne', 1.8478489358790986),
 ('lemmon', 1.8458266904983307),
 ('delightful', 1.8002701588959635),
 ('flynn', 1.7996646487351682),
 ('andrews', 1.7764919970972666),
 ('homer', 1.7692866133759964),
 ('beautifully', 1.7626953362841438),
 ('socc

In [30]:
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('boll', -4.0778152602708904),
 ('uwe', -3.9218753018711578),
 ('seagal', -3.3202501058581921),
 ('unwatchable', -3.0269848170580955),
 ('stinker', -2.9876839403711624),
 ('mst', -2.7753833211707968),
 ('incoherent', -2.7641396677532537),
 ('unfunny', -2.5545257844967644),
 ('waste', -2.4907515123361046),
 ('blah', -2.4475792789485005),
 ('horrid', -2.3715779644809971),
 ('pointless', -2.3451073877136341),
 ('atrocious', -2.3187369339642556),
 ('redeeming', -2.2667790015910296),
 ('prom', -2.2601040980178784),
 ('drivel', -2.2476029585766928),
 ('lousy', -2.2118080125207054),
 ('worst', -2.1930856334332267),
 ('laughable', -2.172468615469592),
 ('awful', -2.1385076866397488),
 ('poorly', -2.1326133844207011),
 ('wasting', -2.1178155545614512),
 ('remotely', -2.111046881095167),
 ('existent', -2.0024805005437076),
 ('boredom', -1.9241486572738005),
 ('miserably', -1.9216610938019989),
 ('sucks', -1.9166645809588516),
 ('uninspired', -1.9131499212248517),
 ('lame', -1.9117232884159072),

# Project 2: Creating the Input/Output Data

In [31]:
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)

74074


In [32]:
list(vocab)

['',
 'fawn',
 'tsukino',
 'raining',
 'nunnery',
 'deferment',
 'sonja',
 'shaye',
 'tilton',
 'gag',
 'woods',
 'spiders',
 'bedknob',
 'francesco',
 'woody',
 'trawling',
 'fopish',
 'comically',
 'localized',
 'sevens',
 'disobeying',
 'yougoslavia',
 'ingratiating',
 'canet',
 'scola',
 'acurately',
 'scold',
 'gteborg',
 'cycling',
 'originality',
 'mutinies',
 'unnecessarily',
 'hermann',
 'rumbustious',
 'benvolio',
 'familiarness',
 'fyodor',
 'wracked',
 'staffed',
 'gandolphini',
 'donger',
 'eugenics',
 'dongen',
 'appropriation',
 'transvestism',
 'blodgett',
 'strictest',
 'screaming',
 'seamier',
 'bendan',
 'four',
 'wooded',
 'receiving',
 'liaisons',
 'grueling',
 'broiler',
 'wooden',
 'bucatinsky',
 'tambin',
 'broiled',
 'altagracia',
 'circuitry',
 'crotch',
 'stereotypical',
 'path',
 'shows',
 'burgade',
 'spoilerish',
 'thrace',
 'gaskets',
 'snuggles',
 'hanging',
 'scrapes',
 'feasibility',
 'miniatures',
 'snuggest',
 'zaniacs',
 'mortgages',
 'sustaining',


In [33]:
import numpy as np

layer_0 = np.zeros((1,vocab_size))
layer_0

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [34]:
word2index = {}

for i,word in enumerate(vocab):
    word2index[word] = i
word2index

{'': 0,
 'fawn': 1,
 'tsukino': 2,
 'saimin': 35970,
 'nunnery': 4,
 'deferment': 5,
 'sonja': 6,
 'tilton': 8,
 'vani': 48362,
 'woods': 10,
 'spiders': 11,
 'bedknob': 12,
 'hanging': 71,
 'woody': 14,
 'trawling': 15,
 'fopish': 16,
 'comically': 17,
 'localized': 18,
 'sevens': 19,
 'disobeying': 20,
 'yougoslavia': 21,
 'gad': 48363,
 'canet': 23,
 'scola': 24,
 'acurately': 25,
 'scold': 26,
 'gteborg': 27,
 'originality': 29,
 'grindhouses': 56014,
 'enchelada': 170,
 'gaa': 48366,
 'hermann': 32,
 'workaday': 58270,
 'rumbustious': 33,
 'benvolio': 34,
 'familiarness': 35,
 'hahahah': 67696,
 'wracked': 37,
 'diehard': 54199,
 'gandolphini': 39,
 'donger': 40,
 'eugenics': 41,
 'dongen': 42,
 'appropriation': 43,
 'transvestism': 44,
 'taj': 73069,
 'llbean': 43111,
 'strictest': 46,
 'screaming': 47,
 'seamier': 48,
 'bendan': 49,
 'revelers': 35976,
 'wooded': 51,
 'spacy': 23768,
 'liaisons': 53,
 'grueling': 54,
 'broiler': 55,
 'wooden': 56,
 'nightingale': 70788,
 'bucati

In [35]:
def update_input_layer(review):
    
    global layer_0
    
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])

In [36]:
layer_0

array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])
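
The 18 in the first position is the count of the empty string `''` (index 0 in `word2index`), which `split(" ")` produces wherever the review contains consecutive spaces. To make the encoding easier to inspect, here is a minimal sketch on a hypothetical three-review corpus. The `toy_*` names and `count_vector` are illustrative, and the vocabulary is sorted only so that the indices are reproducible.

```python
import numpy as np

# Hypothetical three-review corpus, for illustration only
toy_reviews = ["good movie good cast", "bad movie", "good plot"]

# Sort the vocabulary so the word-to-index mapping is reproducible
toy_vocab = sorted({word for review in toy_reviews for word in review.split(" ")})
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

def count_vector(review):
    """Return a (1, vocab_size) array of word counts for one review."""
    vec = np.zeros((1, len(toy_vocab)))
    for word in review.split(" "):
        vec[0][toy_word2index[word]] += 1
    return vec

print(toy_vocab)
print(count_vector(toy_reviews[0]))
```

Unlike `update_input_layer`, this version allocates a fresh array per review instead of resetting a shared global, which is simpler to reason about but slower for a 74,074-word vocabulary.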

In [37]:
def get_target_for_label(label):
    if(label == 'POSITIVE'):
        return 1
    else:
        return 0

In [38]:
labels[0]

'POSITIVE'

In [39]:
get_target_for_label(labels[0])

1

In [40]:
labels[1]

'NEGATIVE'

In [41]:
get_target_for_label(labels[1])

0

# Project 3: Building our Neural Network

- Start with your neural network from the last chapter
- 3-layer neural network
- No non-linearity in the hidden layer
- Use our functions to create the training data
- Create a "pre_process_data" function that builds the vocabulary used by our training-data-generating functions
- Modify "train" to train over the entire corpus
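
The bullets above fix the shape of the network. A minimal sketch of the forward pass under those constraints might look like the following. The layer sizes, the random weights, and the input vector are hypothetical and kept tiny so the shapes are easy to follow.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tiny sizes; the real network uses a vocabulary-sized input
input_nodes, hidden_nodes, output_nodes = 5, 3, 1

weights_0_1 = rng.normal(0.0, 0.1, (input_nodes, hidden_nodes))
weights_1_2 = rng.normal(0.0, 0.1, (hidden_nodes, output_nodes))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

layer_0 = np.array([[0., 1., 2., 1., 0.]])   # word counts for one review
layer_1 = layer_0.dot(weights_0_1)           # hidden layer: linear, no activation
layer_2 = sigmoid(layer_1.dot(weights_1_2))  # output layer: score in (0, 1)

print(layer_1.shape, layer_2.shape)
```

A score above 0.5 would be read as POSITIVE and anything else as NEGATIVE.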

### Where to Get Help if You Need it
- Re-watch previous week's Udacity Lectures
- Chapters 3-5 - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) - (40% Off: **traskud17**)

In [50]:
import time
import sys
import numpy as np

# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
       
        # set our random number generator 
        np.random.seed(1)
    
        self.pre_process_data(reviews, labels)
        
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
        
        
    def pre_process_data(self, reviews, labels):
        
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
         
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
    
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1,input_nodes))
    
        
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            if word in self.word2index:  # skip words not seen during pre-processing
                self.layer_0[0][self.word2index[word]] += 1
                
    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def train(self, training_reviews, training_labels):
        
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0
        
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            ### Forward pass ###

            # Input Layer
            self.update_input_layer(review)

            # Hidden layer
            layer_1 = self.layer_0.dot(self.weights_0_1)

            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

            ### Backward pass ###

            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the actual output minus the desired target.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1
            
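            # Guard against a zero elapsed time on the first iteration
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0.0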
            
            if(i % 250 == 0):
                print("Progress:" + str(100 * i/float(len(training_reviews)))[:4] \
                      + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                      + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                      + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
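            # Guard against a zero elapsed time on the first iteration
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0.0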
            
            if(i%250 == 0):
                print("Progress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                      + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                      + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                      + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        
        # Input Layer
        self.update_input_layer(review.lower())

        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)

        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        
        return "POSITIVE" if layer_2[0] > 0.5 else "NEGATIVE"
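
The backward pass in `train` can be read as one step of gradient descent. The following is a sketch of the math, assuming the loss is the squared error $E = \tfrac{1}{2}(y - t)^2$ (mean squared error appears in the prerequisites, but the code never names its loss, so this is an inference). The symbols are shorthand introduced here: $x$ is `layer_0`, $h$ is `layer_1`, $y$ is `layer_2`, $t$ is the 0/1 target, $\alpha$ is `learning_rate`, and $W_{01}$, $W_{12}$ are `weights_0_1` and `weights_1_2`.

$$
\begin{aligned}
h &= x W_{01}, \qquad y = \sigma(h W_{12}) \\
\delta_2 &= (y - t)\, y\,(1 - y) \\
\delta_1 &= \delta_2 W_{12}^{\top} \\
W_{12} &\leftarrow W_{12} - \alpha\, h^{\top} \delta_2 \\
W_{01} &\leftarrow W_{01} - \alpha\, x^{\top} \delta_1
\end{aligned}
$$

Because the hidden layer has no non-linearity, $\delta_1$ is just the backpropagated error with no extra derivative factor, which matches `layer_1_delta = layer_1_error` in the code.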
        

In [51]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

In [52]:
# evaluate our model before training (just to show how horrible it is)
mlp.test(reviews[-1000:],labels[-1000:])

Progress:0.0% Speed(reviews/sec):0.0% #Correct:0 #Tested:1 Testing Accuracy:0.0%
Progress:10.0% Speed(reviews/sec):2.246% #Correct:50 #Tested:101 Testing Accuracy:49.5%
Progress:20.0% Speed(reviews/sec):2.486% #Correct:100 #Tested:201 Testing Accuracy:49.7%
Progress:30.0% Speed(reviews/sec):2.468% #Correct:150 #Tested:301 Testing Accuracy:49.8%
Progress:40.0% Speed(reviews/sec):2.415% #Correct:200 #Tested:401 Testing Accuracy:49.8%
Progress:50.0% Speed(reviews/sec):2.344% #Correct:250 #Tested:501 Testing Accuracy:49.9%
Progress:60.0% Speed(reviews/sec):2.323% #Correct:300 #Tested:601 Testing Accuracy:49.9%
Progress:70.0% Speed(reviews/sec):2.360% #Correct:350 #Tested:701 Testing Accuracy:49.9%
Progress:80.0% Speed(reviews/sec):2.293% #Correct:400 #Tested:801 Testing Accuracy:49.9%
Progress:90.0% Speed(reviews/sec):2.181% #Correct:450 #Tested:901 Testing Accuracy:49.9%


In [53]:
# train the network
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:0 #Trained:1 Training Accuracy:0.0%
Progress:1.04% Speed(reviews/sec):1.997 #Correct:125 #Trained:251 Training Accuracy:49.8%
Progress:2.08% Speed(reviews/sec):1.958 #Correct:250 #Trained:501 Training Accuracy:49.9%
Progress:3.12% Speed(reviews/sec):2.071 #Correct:375 #Trained:751 Training Accuracy:49.9%
Progress:4.16% Speed(reviews/sec):2.143 #Correct:500 #Trained:1001 Training Accuracy:49.9%
Progress:5.20% Speed(reviews/sec):2.214 #Correct:625 #Trained:1251 Training Accuracy:49.9%
Progress:6.25% Speed(reviews/sec):2.287 #Correct:750 #Trained:1501 Training Accuracy:49.9%
Progress:7.29% Speed(reviews/sec):2.350 #Correct:875 #Trained:1751 Training Accuracy:49.9%
Progress:8.33% Speed(reviews/sec):2.405 #Correct:1000 #Trained:2001 Training Accuracy:49.9%
Progress:9.37% Speed(reviews/sec):2.420 #Correct:1125 #Trained:2251 Training Accuracy:49.9%
Progress:10.4% Speed(reviews/sec):2.419 #Correct:1250 #Trained:2501 Training Accuracy:49.9%
Progress

KeyboardInterrupt: 

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.01)

In [None]:
# train the network
mlp.train(reviews[:-1000],labels[:-1000])

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)

In [None]:
# train the network
mlp.train(reviews[:-1000],labels[:-1000])