# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

In [11]:
def pretty_print_review_and_lable(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# IMDB reviews, each labeled either positive or negative
with open('reviews.txt', 'r') as g:  # what we know
    reviews = [line[:-1] for line in g.readlines()]  # drop the trailing newline

with open('labels.txt', 'r') as g:  # what we want to know
    labels = [line[:-1].upper() for line in g.readlines()]

In [12]:
len(reviews)

25000

In [13]:
reviews[1]

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [14]:
labels[1]

'NEGATIVE'

## Lesson: Develop a Predictive Theory

In [15]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_lable(2137)
pretty_print_review_and_lable(12816)
pretty_print_review_and_lable(6267)
pretty_print_review_and_lable(21934)
pretty_print_review_and_lable(5297)
pretty_print_review_and_lable(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


Looking at individual letters isn't going to help us, and treating the full review as a single unit isn't going to help either. We see words like "terrible" and "impossible" more often in negative reviews, and "excellent" more often in positive ones. So the counts of individual words may be a better indicator of sentiment. Let's do a quick validation to see whether this theory holds.
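As a rough illustration of this theory, the sketch below counts words per label using only the opening fragments of the six sample reviews printed above. The fragments are shortened by hand and the names `samples` and `counts_by_label` are made up for this example, so treat it as a sanity check rather than a result.

```python
from collections import Counter

# Opening fragments of the six sample reviews printed above (shortened by hand).
samples = [
    ("NEGATIVE", "this movie is terrible but it has some good effects ."),
    ("POSITIVE", "adrian pasdar is excellent is this film . he makes a fascinating woman ."),
    ("NEGATIVE", "comment this movie is impossible . is terrible  very improbable"),
    ("POSITIVE", "excellent episode movie ala pulp fiction ."),
    ("NEGATIVE", "if you haven  t seen this  it  s terrible . it is pure trash ."),
    ("POSITIVE", "this schiffer guy is a real genius  the movie is of excellent quality"),
]

# One Counter per label; split() with no argument skips runs of spaces.
counts_by_label = {"POSITIVE": Counter(), "NEGATIVE": Counter()}
for label, text in samples:
    counts_by_label[label].update(text.split())

# Compare how often each candidate word shows up under each label.
for word in ("terrible", "excellent"):
    print(word,
          "positive:", counts_by_label["POSITIVE"][word],
          "negative:", counts_by_label["NEGATIVE"][word])
```

Even in this tiny sample, "terrible" only appears under NEGATIVE and "excellent" only under POSITIVE, which is the pattern the theory predicts.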

## Project 1: Quick Theory Validation

In [17]:
from collections import Counter
import numpy as np

In [19]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [21]:
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] +=1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] +=1
            total_counts[word] +=1
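
One detail worth knowing about `split(" ")`: the reviews contain runs of two spaces (for example "insane  violent" in `reviews[1]`), and splitting on a single space produces an empty-string token for each extra space. The short sketch below, using a hand-copied fragment, shows the difference from `split()` with no argument. Whether to filter out the empty token is a judgment call; the loop above keeps it.

```python
# A fragment of reviews[1]; note the double space after "insane".
text = "an insane  violent mob"

# Splitting on exactly one space keeps an empty string between the two spaces.
tokens_single_space = text.split(" ")

# Splitting with no argument treats any run of whitespace as one separator.
tokens_whitespace = text.split()

print(tokens_single_space)  # ['an', 'insane', '', 'violent', 'mob']
print(tokens_whitespace)    # ['an', 'insane', 'violent', 'mob']
```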
        

In [22]:
positive_counts.most_common()

[('', 550486),
 ('the', 173333),
 ('.', 159681),
 ('and', 89723),
 ('a', 83692),
 ('of', 76859),
 ('to', 66752),
 ('is', 57249),
 ('in', 50217),
 ('br', 49235),
 ('it', 48027),
 ('i', 40748),
 ('that', 35634),
 ('this', 35080),
 ('s', 33816),
 ('as', 26310),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13721),
 ('one', 13656),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11943),
 ('who', 11465),
 ('an', 11294),
 ('at', 11236),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8314),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7738),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7457),
 ('some', 7442),
 ('if', 7285),
 ('just', 7152),
 ('can', 7002),
 ('story', 6780),
 ('time', 6516),
 ('my', 6490),
 ('g

This is great, but it doesn't tell us what we want to know. The most common words, such as "the" and "and", are common in any review. What we want to know is which words are the most positive, meaning they appear more often in positive reviews than in negative reviews.
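One way to measure this is a positive-to-negative ratio per word. The sketch below uses hypothetical toy counts (`toy_positive`, `toy_negative`, and the numbers in them are invented for illustration, not taken from the dataset) to show how such a ratio separates sentiment words from filler words.

```python
from collections import Counter

# Hypothetical toy counts, invented for illustration only.
toy_positive = Counter({"excellent": 90, "the": 1000, "terrible": 10})
toy_negative = Counter({"excellent": 9, "the": 999, "terrible": 99})

def toy_pos_neg_ratio(word):
    # The +1 in the denominator avoids division by zero for words
    # that never occur in negative reviews.
    return toy_positive[word] / float(toy_negative[word] + 1)

for word in ("excellent", "the", "terrible"):
    print(word, toy_pos_neg_ratio(word))
# excellent 9.0
# the 1.0
# terrible 0.1
```

A ratio well above 1 suggests a positive word, a ratio near 1 a neutral word, and a ratio well below 1 a negative word.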

In [29]:
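pos_neg_ratios = Counter()

# Ratio of positive to negative occurrences, only for words seen more than 50 times
for term,cnt in list(total_counts.most_common()):
    if(cnt > 50):
        # +1 in the denominator avoids division by zero
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

# Convert the ratios to a log scale so neutral words land near 0
for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # +0.01 keeps the log finite when the ratio is 0
        pos_neg_ratios[word] = -np.log((1 / (ratio+0.01)))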
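
To see what the log step does, the sketch below applies the same transform to a few hand-picked ratios. It uses `math.log` in place of `np.log` so it runs without NumPy, and the function name `log_ratio` is made up for this example. Raw ratios are lopsided (positive words range from 1 upward, negative words are squeezed between 0 and 1), while the log values are roughly symmetric around 0.

```python
import math

def log_ratio(ratio):
    # Same transform as the cell above, with math.log standing in for np.log.
    if ratio > 1:
        return math.log(ratio)
    # -log(1 / (ratio + 0.01)) equals log(ratio + 0.01); the 0.01 keeps
    # the result finite when a word never appears in positive reviews.
    return -math.log(1 / (ratio + 0.01))

for ratio in (9.0, 1.0, 0.1, 0.0):
    print(ratio, round(log_ratio(ratio), 3))
# 9.0 2.197
# 1.0 0.01
# 0.1 -2.207
# 0.0 -4.605
```

The last line explains a floor in the scores: any word with a ratio of 0 gets log(0.01), about -4.605.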

In [27]:
# Words most frequently seen in positive reviews
pos_neg_ratios.most_common()

[('edie', 4.6913478822291435),
 ('antwone', 4.4773368144782069),
 ('din', 4.4067192472642533),
 ('gunga', 4.1896547420264252),
 ('goldsworthy', 4.1743872698956368),
 ('gypo', 4.0943445622221004),
 ('yokai', 4.0943445622221004),
 ('paulie', 4.0775374439057197),
 ('visconti', 3.9318256327243257),
 ('flavia', 3.9318256327243257),
 ('blandings', 3.8712010109078911),
 ('kells', 3.8712010109078911),
 ('brashear', 3.8501476017100584),
 ('gino', 3.8286413964890951),
 ('deathtrap', 3.8066624897703196),
 ('harilal', 3.713572066704308),
 ('panahi', 3.713572066704308),
 ('ossessione', 3.6635616461296463),
 ('tsui', 3.6375861597263857),
 ('caruso', 3.6375861597263857),
 ('sabu', 3.6109179126442243),
 ('ahmad', 3.6109179126442243),
 ('khouri', 3.5835189384561099),
 ('dominick', 3.5835189384561099),
 ('aweigh', 3.5553480614894135),
 ('mj', 3.5553480614894135),
 ('mcintire', 3.5263605246161616),
 ('kriemhild', 3.5263605246161616),
 ('blackie', 3.4965075614664802),
 ('daisies', 3.4965075614664802),
 ('

In [28]:
#Words most frequently seen in negative reviews
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('rosarios', -4.6051701859880918),
 ('frewer', -4.6051701859880918),
 ('manu', -4.6051701859880918),
 ('borel', -4.6051701859880918),
 ('swinton', -4.6051701859880918),
 ('sagemiller', -4.6051701859880918),
 ('summersisle', -4.6051701859880918),
 ('qi', -4.6051701859880918),
 ('redline', -4.6051701859880918),
 ('slipstream', -4.6051701859880918),
 ('bolo', -4.6051701859880918),
 ('emraan', -4.6051701859880918),
 ('geico', -4.6051701859880918),
 ('cato', -4.6051701859880918),
 ('liliom', -4.6051701859880918),
 ('rajni', -4.6051701859880918),
 ('mayeda', -4.6051701859880918),
 ('crapfest', -4.6051701859880918),
 ('tmtm', -4.6051701859880918),
 ('sued', -4.6051701859880918),
 ('keyes', -4.6051701859880918),
 ('nichole', -4.6051701859880918),
 ('straightheads', -4.6051701859880918),
 ('aluminium', -4.6051701859880918),
 ('groaning', -4.6051701859880918),
 ('templars', -4.6051701859880918),
 ('krista', -4.6051701859880918),
 ('spandex', -4.6051701859880918),
 ('unisols', -4.605170185988091