# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help If You Need It

- Re-watch previous Udacity lectures
- Read the recommended course material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [2]:
len(reviews)

25000

In [3]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
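
The six samples above suggest a theory: individual words such as "terrible" and "excellent" correlate with the label. A minimal sketch of how one might check that idea on a handful of reviews follows. The `toy_reviews` and `toy_labels` lists and the `label_counts_for_word` helper are hypothetical stand-ins (the toy reviews are shortened from the samples printed above) and are not part of the original notebook.

```python
from collections import Counter

# Hypothetical toy data, shortened from the samples printed above.
toy_reviews = [
    "this movie is terrible but it has some good effects .",
    "adrian pasdar is excellent is this film .",
    "comment this movie is impossible . is terrible very improbable",
    "excellent episode movie ala pulp fiction .",
]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]


def label_counts_for_word(word, reviews, labels):
    """Count how many reviews containing `word` carry each label."""
    counts = Counter()
    for review, label in zip(reviews, labels):
        if word in review.split():
            counts[label] += 1
    return counts


for word in ("terrible", "excellent", "movie"):
    print(word, dict(label_counts_for_word(word, toy_reviews, toy_labels)))
```

In this toy sample, "terrible" and "excellent" each appear under a single label, while "movie" appears under both, which hints that some words carry sentiment and others do not.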


### Count the words in positive and negative reviews using Counter

In [6]:
from collections import Counter

# Filter on the label first, then split only the matching reviews into words
word_count_positive = Counter(word for review, label in zip(reviews, labels)
                              if label == 'POSITIVE' for word in review.split())
word_count_negative = Counter(word for review, label in zip(reviews, labels)
                              if label == 'NEGATIVE' for word in review.split())

### Add the two Counter objects to get the total count per word

In [7]:
# Adding two Counters sums the counts for each word
word_count_total = word_count_positive + word_count_negative
print(word_count_positive['movie'], ' + ' , word_count_negative['movie'], ' = ', word_count_total['movie'])

19074  +  24965  =  44039
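
As a small illustration of what `+` does on `Counter` objects, the sketch below uses made-up counts (the names `pos`, `neg`, `total` and all numbers are hypothetical). Counts for words present in both operands are summed, and words present in only one operand are carried over.

```python
from collections import Counter

# Hypothetical counts, for illustration only.
pos = Counter({"great": 3, "movie": 2})
neg = Counter({"awful": 4, "movie": 5})

total = pos + neg

print(total["movie"])                  # present in both: 2 + 5
print(total["great"], total["awful"])  # present in only one operand
print(total["missing"])                # a Counter returns 0 for absent keys
```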


### Compute the log of the positive-to-negative count ratio for each word and show the most positive words

In [8]:
import math

pos_neg_ratios = Counter()

# Only score words seen more than 100 times; raising this threshold from 10
# yields more sentiment-type words.
for term, count in word_count_total.most_common():
    if count > 100:
        # The +1 in the denominator avoids division by zero
        ratio = word_count_positive[term] / (word_count_negative[term] + 1)
        # The tiny constant avoids log(0) for words never seen in positive reviews
        pos_neg_ratios[term] = math.log(ratio + .000000000001)

pos_neg_ratios.most_common(40) # most positive words

[('edie', 4.691347882229153),
 ('paulie', 4.077537443905737),
 ('felix', 3.1527360223636984),
 ('polanski', 2.823361047613264),
 ('matthau', 2.8067217286093),
 ('victoria', 2.6810215287143593),
 ('mildred', 2.602689685444458),
 ('gandhi', 2.538973871058355),
 ('flawless', 2.451005098112405),
 ('superbly', 2.260025478575354),
 ('perfection', 2.1594842493534876),
 ('astaire', 2.1400661634963885),
 ('captures', 2.038619547159711),
 ('voight', 2.0301704926731845),
 ('wonderfully', 2.0218960560333676),
 ('powell', 1.9783454248086054),
 ('brosnan', 1.954799096472701),
 ('lily', 1.920376847050295),
 ('bakshi', 1.9029851043384285),
 ('lincoln', 1.9014583864846288),
 ('refreshing', 1.8551812956657077),
 ('breathtaking', 1.8481124057793443),
 ('bourne', 1.847848935879256),
 ('lemmon', 1.8458266904984886),
 ('delightful', 1.800270158896129),
 ('flynn', 1.7996646487353336),
 ('andrews', 1.7764919970974358),
 ('homer', 1.7692866133761669),
 ('beautifully', 1.7626953362843152),
 ('soccer', 1.7578579
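
Why take the logarithm? A raw ratio squeezes negative-leaning words into the interval between 0 and 1, while positive-leaning words range from 1 upward, so the two sides are hard to compare. The log puts neutral words near 0 and gives equally lopsided words roughly equal magnitudes with opposite signs. The sketch below applies the same formula as the cell above to hypothetical counts (the `log_ratio` helper, the word names, and all numbers are made up for illustration).

```python
import math


def log_ratio(pos_count, neg_count):
    # Same formula as above: +1 in the denominator avoids division by zero,
    # and the tiny constant avoids log(0) when pos_count is 0.
    return math.log(pos_count / (neg_count + 1) + 1e-12)


# Hypothetical (word, positive count, negative count) triples.
examples = [
    ("mostly_positive", 400, 99),   # ratio 4
    ("neutral", 200, 199),          # ratio 1
    ("mostly_negative", 100, 399),  # ratio 0.25
]

for word, pos_count, neg_count in examples:
    print(word, round(log_ratio(pos_count, neg_count), 3))
```

A ratio of 4 and a ratio of 0.25 land at about the same distance from zero on opposite sides, which is what makes the two `most_common` listings directly comparable.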

### Show the words most likely to be negative

In [9]:
list(reversed(pos_neg_ratios.most_common()[-30:]))

[('boll', -4.969813299432),
 ('uwe', -4.624972813182271),
 ('seagal', -3.6441435602342946),
 ('unwatchable', -3.258096537995482),
 ('stinker', -3.2088254889899495),
 ('mst', -2.9502698994581227),
 ('incoherent', -2.9368917735122),
 ('unfunny', -2.692239595060803),
 ('waste', -2.6193845640028264),
 ('blah', -2.5704288232130907),
 ('horrid', -2.484906649776),
 ('pointless', -2.4553061800000595),
 ('atrocious', -2.425908309014732),
 ('redeeming', -2.368239063204804),
 ('prom', -2.3608540011074215),
 ('drivel', -2.3470368555544248),
 ('lousy', -2.3075726344950347),
 ('worst', -2.2869878961705323),
 ('laughable', -2.2643638801642227),
 ('awful', -2.227194247018161),
 ('poorly', -2.220755074737199),
 ('wasting', -2.2046046846247753),
 ('remotely', -2.197224577327219),
 ('existent', -2.079441541671836),
 ('boredom', -1.995100393238732),
 ('miserably', -1.9924301646828728),
 ('sucks', -1.9870682215415267),
 ('uninspired', -1.983297681119667),
 ('lame', -1.9817674589389105),
 ('insult', -1.9783