# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

with open('reviews.txt', 'r') as g:  # What we know!
    reviews = list(map(lambda x: x[:-1], g.readlines()))

with open('labels.txt', 'r') as g:  # What we WANT to know!
    labels = list(map(lambda x: x[:-1].upper(), g.readlines()))

In [2]:
len(reviews)

25000

In [3]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [7]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
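One theory the six snippets above suggest is that single words such as "excellent" and "terrible" carry much of the sentiment signal. The sketch below tests that idea on those six snippets only, copied by hand from the output. The `CUES` table and the `guess_label` helper are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
# The six (label, snippet) pairs printed above, copied by hand.
samples = [
    ("NEGATIVE", "this movie is terrible but it has some good effects ."),
    ("POSITIVE", "adrian pasdar is excellent is this film . he makes a fascinating woman ."),
    ("NEGATIVE", "comment this movie is impossible . is terrible  very improbable  bad interpretat"),
    ("POSITIVE", "excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more"),
    ("NEGATIVE", "if you haven  t seen this  it  s terrible . it is pure trash . i saw this about "),
    ("POSITIVE", "this schiffer guy is a real genius  the movie is of excellent quality and both e"),
]

# Hypothetical cue words chosen by eye; an assumption, not the notebook's method.
CUES = {"excellent": "POSITIVE", "terrible": "NEGATIVE"}

def guess_label(text):
    """Return the label of the first cue word found in text, or None."""
    for word in text.split():
        if word in CUES:
            return CUES[word]
    return None

guesses = [guess_label(text) for _, text in samples]
agreement = sum(g == label for g, (label, _) in zip(guesses, samples))
print(agreement, "of", len(samples), "snippets match their label")
```

Six hand-picked snippets prove little on their own, which is why the next step counts words across all 25,000 reviews.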


# Project 1: Quick Theory Validation

In [6]:
from collections import Counter
import numpy as np

In [7]:
positive_counts = Counter()
negative_counts = Counter()
total_counts    = Counter()

Count how often each word appears in positive reviews, in negative reviews, and overall.

In [12]:
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] +=1
            total_counts[word] +=1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] +=1
            total_counts[word] +=1


In [15]:
positive_counts.most_common()

[('', 1100954),
 ('the', 346657),
 ('.', 319335),
 ('and', 179445),
 ('a', 167380),
 ('of', 153714),
 ('to', 133498),
 ('is', 114494),
 ('in', 100432),
 ('br', 98470),
 ('it', 96052),
 ('i', 81491),
 ('that', 71264),
 ('this', 70160),
 ('s', 67631),
 ('as', 52618),
 ('with', 46494),
 ('for', 44832),
 ('was', 43834),
 ('film', 41874),
 ('but', 41644),
 ('movie', 38148),
 ('his', 34454),
 ('on', 34016),
 ('you', 33362),
 ('he', 32564),
 ('are', 29614),
 ('not', 28544),
 ('t', 27441),
 ('one', 27311),
 ('have', 25174),
 ('be', 24832),
 ('by', 23994),
 ('all', 23885),
 ('who', 22929),
 ('an', 22588),
 ('at', 22470),
 ('from', 21534),
 ('her', 20948),
 ('they', 19790),
 ('has', 18372),
 ('so', 18308),
 ('like', 18076),
 ('about', 16627),
 ('very', 16610),
 ('out', 16268),
 ('there', 16114),
 ('she', 15558),
 ('what', 15475),
 ('or', 15464),
 ('good', 15440),
 ('more', 15042),
 ('when', 14913),
 ('some', 14883),
 ('if', 14570),
 ('just', 14304),
 ('can', 14003),
 ('story', 13560),
 ('time', 
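The empty string at the top of the counts above is most likely a side effect of `split(" ")`: the reviews contain runs of two or more spaces, and splitting on a single space yields an empty token between each pair. A minimal sketch on a fragment of `reviews[0]` shows the effect. The comparison with `split()` is only an observation; the notebook itself keeps `split(" ")`, and switching would change every count downstream.

```python
from collections import Counter

# A fragment of reviews[0], including its double spaces.
review = "bromwell high  s satire is much closer to reality than is  teachers  ."

by_single_space = Counter(review.split(" "))  # what the counting loop does
by_whitespace = Counter(review.split())       # splits on runs of whitespace

print(by_single_space[""])  # empty tokens created by consecutive spaces
print(by_whitespace[""])    # no empty tokens with split()
print(by_single_space["is"], by_whitespace["is"])  # real words are unaffected
```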

In [18]:
pos_neg_ratio = Counter()

Compute each word's positive-to-negative ratio, then convert it to a log scale so that positive and negative words sit roughly symmetrically around zero.

In [20]:
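# Ratio of positive to negative uses, for words seen more than 100 times.
for term, cnt in list(total_counts.most_common()):
    if cnt > 100:
        # +1 in the denominator avoids division by zero
        pos_neg_ratio[term] = positive_counts[term] / float(negative_counts[term] + 1)

# Convert the raw ratios to a log scale centered near zero.
for word, ratio in list(pos_neg_ratio.most_common()):
    if ratio > 1:
        pos_neg_ratio[word] = np.log(ratio)
    else:
        # +0.01 keeps the log finite when the ratio is 0
        pos_neg_ratio[word] = -np.log(1 / (ratio + 0.01))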
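To see why the log step helps, the sketch below applies the same two steps to a single word's counts. The `log_ratio` helper and the counts passed to it are hypothetical, chosen only for illustration, and `math.log` stands in for `np.log` so the example has no dependencies.

```python
import math

def log_ratio(pos, neg):
    """Apply the ratio step and the log step to one word's counts."""
    ratio = pos / float(neg + 1)           # +1 avoids division by zero
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))   # +0.01 keeps the log finite at ratio 0

# Hypothetical counts, not taken from the dataset.
mostly_positive = log_ratio(pos=400, neg=99)   # raw ratio 4.0
mostly_negative = log_ratio(pos=100, neg=399)  # raw ratio 0.25
neutral = log_ratio(pos=200, neg=199)          # raw ratio 1.0

print(mostly_positive, mostly_negative, neutral)
```

The raw ratios 4.0 and 0.25 are lopsided around 1, but their log scores come out at roughly +1.39 and -1.35, so a word four times more common in positive reviews and a word four times more common in negative reviews get scores of similar magnitude and opposite sign. The neutral word lands close to zero.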
        

In [21]:
pos_neg_ratio.most_common()

[('edie', 5.3844950627890888),
 ('antwone', 5.1704839950381514),
 ('din', 5.0998664278241987),
 ('gunga', 4.8828019225863706),
 ('goldsworthy', 4.8675344504555822),
 ('gypo', 4.7874917427820458),
 ('yokai', 4.7874917427820458),
 ('flavia', 4.6249728132842707),
 ('visconti', 4.6249728132842707),
 ('paulie', 4.3652195163575005),
 ('gundam', 3.6054978451748854),
 ('iturbi', 3.5648268054439574),
 ('felix', 3.2580965380214821),
 ('capote', 3.1945831322991562),
 ('excellently', 3.1612467120315646),
 ('fido', 3.1416861861770706),
 ('deanna', 3.1090609588609941),
 ('hayworth', 3.1090609588609941),
 ('vance', 3.077970371790963),
 ('kolchak', 3.0647251450409425),
 ('sammo', 3.0349529867072724),
 ('laputa', 3.0155349008501706),
 ('sox', 2.9667447366807385),
 ('alvin', 2.9519296508955981),
 ('biko', 2.9385738597140425),
 ('polanski', 2.9103724246028344),
 ('mathieu', 2.9061201148643039),
 ('matthau', 2.8638801424491884),
 ('gilliam', 2.8581108956779433),
 ('harriet', 2.7725887222397811),
 ('chavez

In [22]:
list(reversed(pos_neg_ratio.most_common()))

[('boll', -4.0763882786651227),
 ('uwe', -3.9194396030831955),
 ('thunderbirds', -3.6397022181378524),
 ('beowulf', -3.6295196084548165),
 ('dahmer', -3.5511512728628007),
 ('wayans', -3.5131804380804752),
 ('seagal', -3.3178813486436498),
 ('gamera', -3.0783307970967213),
 ('dreck', -3.0320527378989506),
 ('unwatchable', -3.0231580865348979),
 ('stinker', -2.9836231447642416),
 ('mst', -2.7729386376804541),
 ('incoherent', -2.7609457389729286),
 ('flimsy', -2.7480279574971846),
 ('dillinger', -2.667782815794761),
 ('yawn', -2.6571141133625016),
 ('unfunny', -2.5527880775662997),
 ('waste', -2.490427936954779),
 ('ugh', -2.4841729740397565),
 ('turd', -2.4841729740397565),
 ('blah', -2.4451592049923065),
 ('slugs', -2.4341766208007418),
 ('tripe', -2.3777546363727264),
 ('horrid', -2.3674337397349707),
 ('wtf', -2.3641029707876955),
 ('pointless', -2.3441458135812723),
 ('atrocious', -2.3162514610431946),
 ('damme', -2.2993644782940037),
 ('redeeming', -2.2652667164185862),
 ('prom', -