# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [3]:
def pretty_print_review_and_label(i):
    """Print the label and the first 80 characters of review i."""
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line.
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased.
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [4]:
len(reviews)

25000

In [9]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [10]:
labels[0]

'POSITIVE'
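
Before building a theory on the data, it can help to check how the labels are distributed. The counts printed later in Project 1 (12500 reviews per class) suggest the two classes are balanced. A minimal sketch of such a check follows; `toy_labels` is a hypothetical stand-in for the `labels` list loaded above.

```python
from collections import Counter

# Hypothetical stand-in for the `labels` list read from labels.txt.
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

# Count how often each label occurs and print its share of the dataset.
class_counts = Counter(toy_labels)
for label, count in class_counts.most_common():
    print(f"{label}\t{count}\t{count / len(toy_labels):.0%}")
```

Replacing `toy_labels` with `labels` should give the same kind of summary for the full dataset.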

# Lesson: Develop a Predictive Theory

In [12]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation #

In [88]:
import numpy as np

# Print 20 randomly chosen reviews together with their labels.
n = len(reviews)
ni = np.random.randint(0, n, size=20)
for i in ni:
    pretty_print_review_and_label(i)

NEGATIVE	:	something very strange happens when you talk about global warming science goes o...
NEGATIVE	:	i  m   years old and i  ve spent as much of my life as i can remember  a fan of ...
POSITIVE	:	i only watched this because it starred josie lawrence  who i knew  br    br   fr...
NEGATIVE	:	woosh  man what can i say . . .   br    br   the opening  scene  maybe  we see a...
NEGATIVE	:	this movie is one of the most awful i  ve ever seen . not only is the dialogue a...
POSITIVE	:	this was an adorable movie . a real feel  good movie when you need one . the sto...
NEGATIVE	:	well done al gore  you have become the first person to have made  billion dollar...
NEGATIVE	:	this show was absolutely terrible . for one george isn  t funny  and his kids ar...
POSITIVE	:	very good drama about a young girl who attempts to unravel a series of horrible ...
NEGATIVE	:	this film was so amateurish i could hardly believe what i was seeing . it is sho...
NEGATIVE	:	this was an atrocious waste of my time 

In [87]:

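# Hand-picked sentiment words. Matching below is by substring, so 'bad' also matches 'badly'.
bad = ('boring','lousy', 'wasted','suck','poor', 'terrible', 'bad', 'preposterous', 'worst', 'hated', 'awful', 'horrid', 'stupid', 'unoriginal', 'disappointing')
good = ('admire','wonderful', 'excellent', 'great', 'best', 'enjoyed', 'loved', 'impressed', 'original','fantastic')

ngood = nbad = 0      # reviews with a good/bad word whose true label agrees
fgood = fbad = 0      # reviews with a good/bad word whose true label disagrees
tgood = tbad = 0      # true POSITIVE / NEGATIVE totals
tboth = 0             # reviews containing both good and bad words
mygood = mybad = 0    # reviews the heuristic labels POSITIVE / NEGATIVE
mylabels = []

for i, r in enumerate(reviews):
    gg = len([w for w in good if w in r])   # number of distinct good words found
    gb = len([w for w in bad if w in r])    # number of distinct bad words found

    # Heuristic label: whichever list has more hits; ties stay undecided.
    if gg > gb:
        mygood += 1
        mylabels.append('POSITIVE')
    elif gb > gg:
        mybad += 1
        mylabels.append('NEGATIVE')
    else:
        mylabels.append('NONE')

    if gg and gb:
        tboth += 1
    if gg and labels[i] == 'POSITIVE':
        ngood += 1
    if gb and labels[i] == 'NEGATIVE':
        nbad += 1
    if gg and labels[i] == 'NEGATIVE':
        fgood += 1
    if gb and labels[i] == 'POSITIVE':
        fbad += 1
    if labels[i] == 'POSITIVE':
        tgood += 1
    else:
        tbad += 1

# Compare the heuristic labels with the true labels, skipping undecided reviews.
matches = 0
nnone = 0
for i in range(len(mylabels)):
    if mylabels[i] == 'NONE':
        nnone += 1
        continue
    if mylabels[i] == labels[i]:
        matches += 1

print('1', matches, nnone)        # heuristic agrees with label, undecided
print('2', mygood, mybad)         # heuristic POSITIVE, heuristic NEGATIVE
print('3', nbad, fbad, tbad)      # bad word & NEGATIVE, bad word & POSITIVE, all NEGATIVE
print('4', ngood, fgood, tgood)   # good word & POSITIVE, good word & NEGATIVE, all POSITIVE
print('5', tboth)                 # reviews with both good and bad words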

1 13985 7865
2 9685 7450
3 8635 3078 12500
4 8795 5433 12500
5 6144
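
The first output row is easier to judge as rates. The sketch below only redoes arithmetic on the numbers printed above (`matches`, `nnone`, and the 25000 reviews); the variable names are illustrative. It suggests the word lists decide on roughly 69% of reviews and agree with the true label on about 82% of those, which is about 56% of all reviews.

```python
# Values copied from the printed output above.
total_reviews = 25000
matches = 13985      # heuristic label agrees with the true label
undecided = 7865     # ties between good and bad hits, labelled 'NONE'

decided = total_reviews - undecided            # reviews that received a heuristic label
coverage = decided / total_reviews             # share of reviews the heuristic decides on
accuracy_when_decided = matches / decided      # agreement among decided reviews
accuracy_overall = matches / total_reviews     # agreement if undecided counts as wrong

print(f"coverage:              {coverage:.1%}")
print(f"accuracy when decided: {accuracy_when_decided:.1%}")
print(f"accuracy overall:      {accuracy_overall:.1%}")
```

Note that `decided` equals `mygood + mybad` from the second output row (9685 + 7450), which is a quick consistency check on the counters.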


In [36]:
# Always False: this asks whether the whole tuple `good` is an element of the word list.
good in reviews[0].split(" ")

False

In [65]:
[i for i in good if i in reviews[3]]

['great', 'best', 'original']

In [50]:
print(any(x in reviews[4] for x in good))

True
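
One caveat about the checks above: `word in review` on a string tests for a substring, not a whole word. Since `good` contains 'original' and `bad` contains 'unoriginal', a review that says "unoriginal" registers a hit in both lists. A small sketch of the difference, using a made-up sentence and a subset of the `good` tuple:

```python
# Subset of the `good` tuple defined above; the sentence is invented for illustration.
good_subset = ('great', 'best', 'original')
text = "the script is unoriginal and the acting is the greatest letdown"

# Substring matching: 'original' is found inside 'unoriginal', 'great' inside 'greatest'.
substring_hits = [w for w in good_subset if w in text]

# Token matching: split into words first, then test membership in the set of words.
tokens = set(text.split())
token_hits = [w for w in good_subset if w in tokens]

print(substring_hits)
print(token_hits)
```

Token matching is stricter, so switching to it would change the counts reported in Project 1.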


In [54]:
labels[0] == 'POSITIVE'

True

In [90]:
from collections import Counter


In [92]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()


In [93]:
for i in range(len(reviews)):
    # Pick the counter for this review's label once, then count every word.
    counts_for_label = positive_counts if labels[i] == 'POSITIVE' else negative_counts
    for word in reviews[i].split(" "):
        counts_for_label[word] += 1
        total_counts[word] += 1



In [115]:
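pos_neg_ratios = Counter()

# Ratio of positive to negative occurrences, for words seen more than 200 times.
# The +1 in the denominator avoids division by zero.
for term, cnt in total_counts.most_common():
    if cnt > 200:
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

# Take logs so positive and negative words end up on a comparable scale around 0.
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # The 0.01 keeps the log finite when ratio is 0.
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))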
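
Why take the log at all? The raw ratio is lopsided: words that lean positive can range from 1 upward without bound, while words that lean negative are squeezed between 0 and 1. After the log, a word that is four times more common in positive reviews and one that is four times more common in negative reviews get scores of similar magnitude and opposite sign. The sketch below mirrors the two-branch transform above for a single word; `signed_log_ratio` and the counts passed to it are hypothetical.

```python
import math

def signed_log_ratio(pos_count, neg_count):
    """Apply the same transform as the cell above to one word's counts."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

# Made-up counts: strongly positive, strongly negative, and roughly neutral.
for pos, neg in [(400, 99), (99, 400), (200, 199)]:
    print(pos, neg, round(signed_log_ratio(pos, neg), 3))
```

The neutral case lands close to 0, which is why words near the middle of the ranking carry little sentiment signal.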


In [175]:
# Most negative words: the 400 lowest log-ratios, most negative first.
neg_most = list(reversed(pos_neg_ratios.most_common()))[0:400]
neg_most[-10:]  # the 10 least extreme of those 400

[('spent', -0.65155245096848746),
 ('low', -0.64892927799573152),
 ('hour', -0.64745553340521167),
 ('saved', -0.64691036448575012),
 ('reason', -0.64625116218498069),
 ('makeup', -0.64491739746429821),
 ('lack', -0.64431614250687563),
 ('makers', -0.6435343096390499),
 ('empty', -0.64153525562181912),
 ('holes', -0.63738670049736545)]

In [180]:
# Most positive words: the 400 highest log-ratios, most positive first.
pos_most = pos_neg_ratios.most_common()[0:400]
pos_most[-10:]  # the 10 least extreme of those 400

[('roles', 0.49839716550752178),
 ('james', 0.49837216269470402),
 ('brought', 0.49783842823917956),
 ('hilarious', 0.49714551986191058),
 ('brutal', 0.49681488669639234),
 ('appropriate', 0.49643688631389105),
 ('dance', 0.49581998314812048),
 ('thoroughly', 0.49414593456733524),
 ('depicted', 0.49317068852726992),
 ('combination', 0.49247648509779424)]

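# Project 2: Creating the Input/Output Data #

Make two functions:

1. Take a review, count the number of times each word occurs, and put the counts into a vector.
2. Take a label and convert it to a binary value (1 for POSITIVE, 0 for NEGATIVE).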

In [189]:


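def wordvect(review, vocab_counter):
    """Add the count of each in-vocabulary word of `review` to `vocab_counter`, in place."""
    for word in review.split():
        if word in vocab_counter:
            vocab_counter[word] += 1


def review_label_to_bin(label):
    """Map 'POSITIVE' to 1 and any other label to 0."""
    return 1 if label == 'POSITIVE' else 0


# Vocabulary: the 400 most positive plus the 400 most negative words.
posneg = pos_most + neg_most
vocabulary = sorted([t[0] for t in posneg])
vocab_counter = Counter()

# Show the two most frequent vocabulary words for 5 random reviews.
for i in np.random.randint(0, len(reviews), 5):
    # Reset every vocabulary word to 0 so it stays a key of the counter.
    for w in vocabulary:
        vocab_counter[w] = 0
    wordvect(reviews[i], vocab_counter)
    print(review_label_to_bin(labels[i]), labels[i], vocab_counter.most_common()[0:2])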
    

1 POSITIVE [('young', 2), ('received', 1)]
0 NEGATIVE [('explanation', 2), ('bad', 1)]
1 POSITIVE [('series', 2), ('worse', 1)]
0 NEGATIVE [('best', 2), ('beautiful', 2)]
1 POSITIVE [('great', 2), ('best', 1)]
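
The `Counter` above holds the counts, but a network input needs every word at a fixed position. One possible way to get there, sketched under assumptions: `toy_vocabulary`, `word2index`, and `review_to_vector` are hypothetical names, and a plain list is used to keep the example dependency-free (an array from `np.zeros` would work the same way).

```python
# Hypothetical mini-vocabulary; the real one would be the `vocabulary` list above.
toy_vocabulary = sorted(["bad", "best", "great", "terrible"])

# Give every vocabulary word a fixed position in the vector.
word2index = {word: i for i, word in enumerate(toy_vocabulary)}

def review_to_vector(review, word2index):
    """Return a list of word counts, one slot per vocabulary word."""
    vector = [0] * len(word2index)
    for word in review.split():
        index = word2index.get(word)
        if index is not None:      # ignore words outside the vocabulary
            vector[index] += 1
    return vector

vec = review_to_vector("great cast great story but a terrible ending", word2index)
print(toy_vocabulary)
print(vec)
```

Because the vocabulary is sorted before indexing, the same word always maps to the same slot, so vectors from different reviews are directly comparable.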
