# Sentiment Classification & How To "Frame Problems" for a Neural Network

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits


### Project Outline:

- Curating a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    """Print the label and the first 80 characters of review i."""
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line.
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, aligned with the reviews.
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [2]:
len(reviews)

25000

In [3]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]

'POSITIVE'
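
Before building a theory on this data, it can help to confirm that the two files line up and to see how many reviews carry each label. The following is a minimal sketch, not part of the original notebook: `summarize_labels` is a hypothetical helper, and the toy lists stand in for the `reviews` and `labels` loaded above.

```python
from collections import Counter

def summarize_labels(reviews, labels):
    """Check that reviews and labels are aligned, then count each label."""
    if len(reviews) != len(labels):
        raise ValueError(
            "reviews and labels differ in length: %d vs %d"
            % (len(reviews), len(labels))
        )
    return Counter(labels)

# Toy stand-ins for the real lists (hypothetical data).
toy_reviews = ["great film", "waste of time", "loved it"]
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

label_counts = summarize_labels(toy_reviews, toy_labels)
print(label_counts)
```

Calling `summarize_labels(reviews, labels)` on the real lists would raise an error if the files were misaligned, which is cheaper to catch here than after training.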

# Lesson: Develop a Predictive Theory

In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
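
The examples above hint at a possible theory: individual words such as "terrible" and "excellent" may correlate with the label. One rough way to probe that for a single word is to look at the reviews containing it and measure what share are positive. This is only a sketch under assumptions: `positive_share_for_word` is a hypothetical helper, and the toy reviews are shortened stand-ins rather than the real dataset.

```python
def positive_share_for_word(word, reviews, labels):
    """Fraction of reviews containing `word` that are labeled POSITIVE.

    Returns None when no review contains the word.
    """
    matching = [
        label
        for review, label in zip(reviews, labels)
        if word in review.split(" ")
    ]
    if not matching:
        return None
    return matching.count("POSITIVE") / float(len(matching))

# Toy stand-ins (hypothetical data, not the real files).
toy_reviews = [
    "this movie is terrible but it has some good effects",
    "excellent episode movie",
    "it is terrible pure trash",
    "the movie is of excellent quality",
]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

excellent_share = positive_share_for_word("excellent", toy_reviews, toy_labels)
terrible_share = positive_share_for_word("terrible", toy_reviews, toy_labels)
print(excellent_share, terrible_share)
```

A share near 1 or near 0 suggests the word carries signal; a share near 0.5 suggests it does not.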


In [6]:
from collections import Counter
import numpy as np

positive_words = []
negative_words = []
# Collect every token from each class. Splitting on a single space keeps an
# empty string wherever the text has consecutive spaces, so '' is counted too.
for review, label in zip(reviews, labels):
    if label == "POSITIVE":
        positive_words += review.split(" ")
    else:
        negative_words += review.split(" ")

positive_words = Counter(positive_words)
negative_words = Counter(negative_words)
total_words = positive_words + negative_words

total_words.most_common()

[('', 1111930),
 ('the', 336713),
 ('.', 327192),
 ('and', 164107),
 ('a', 163009),
 ('of', 145864),
 ('to', 135720),
 ('is', 107328),
 ('br', 101872),
 ('it', 96352),
 ('in', 93968),
 ('i', 87623),
 ('this', 76000),
 ('that', 73245),
 ('s', 65361),
 ('was', 48208),
 ('as', 46933),
 ('for', 44343),
 ('with', 44125),
 ('movie', 44039),
 ('but', 42603),
 ('film', 40155),
 ('you', 34230),
 ('on', 34200),
 ('t', 34081),
 ('not', 30626),
 ('he', 30138),
 ('are', 29430),
 ('his', 29374),
 ('have', 27731),
 ('be', 26957),
 ('one', 26789),
 ('all', 23978),
 ('at', 23513),
 ('they', 22906),
 ('by', 22546),
 ('an', 21560),
 ('who', 21433),
 ('so', 20617),
 ('from', 20498),
 ('like', 20276),
 ('there', 18832),
 ('her', 18421),
 ('or', 18004),
 ('just', 17771),
 ('about', 17374),
 ('out', 17113),
 ('if', 16803),
 ('has', 16790),
 ('what', 16159),
 ('some', 15747),
 ('good', 15143),
 ('can', 14654),
 ('more', 14251),
 ('she', 14223),
 ('when', 14182),
 ('very', 14069),
 ('up', 13291),
 ('time', 127

In [7]:
ratios = Counter()

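for word, count in total_words.most_common():
    # Only score words seen more than 550 times in total.
    if count > 550:
        # A word that is equally common in positive and negative reviews has a
        # ratio near 1, so it is unlikely to be a deciding factor.
        # The +1 avoids division by zero for words absent from negative reviews.
        ratio = positive_words[word] / float(negative_words[word] + 1)
        # Improved solution by Andrew Trask: take the log so positive words
        # score above 0 and negative words score below 0.
        if ratio > 1:
            ratios[word] = np.log(ratio)
        else:
            ratios[word] = -np.log(1 / (ratio + 0.01))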
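
To see what the scores from this cell mean, it may help to run the same formula on a few hand-picked counts. The sketch below mirrors the cell above; `pos_neg_log_ratio` is a hypothetical wrapper name, and the counts are invented purely for illustration.

```python
import numpy as np

def pos_neg_log_ratio(pos_count, neg_count):
    """Log of the smoothed positive-to-negative ratio, as in the cell above."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return np.log(ratio)
    return -np.log(1 / (ratio + 0.01))

# Hypothetical counts, chosen so the ratios come out to round numbers.
mostly_positive = pos_neg_log_ratio(800, 199)  # ratio = 4.0
mostly_negative = pos_neg_log_ratio(200, 799)  # ratio = 0.25
balanced = pos_neg_log_ratio(500, 499)         # ratio = 1.0

print(mostly_positive, mostly_negative, balanced)
```

A word four times as common in positive reviews scores about 1.39, a word four times as common in negative reviews scores about -1.35, and a balanced word scores close to 0. The small asymmetry comes from the `0.01` added in the negative branch.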

In [8]:
# Words most strongly associated with positive reviews (highest scores first)
ratios.most_common()

[('superb', 1.7091514458966952),
 ('wonderful', 1.5645425925262093),
 ('fantastic', 1.5048433868558566),
 ('excellent', 1.4647538505723599),
 ('amazing', 1.3919815802404802),
 ('powerful', 1.2999662776313934),
 ('favorite', 1.2668956297860055),
 ('perfect', 1.246742480713785),
 ('brilliant', 1.2287554137664785),
 ('perfectly', 1.1971931173405572),
 ('loved', 1.1563661500586044),
 ('highly', 1.1420208631618658),
 ('today', 1.1050431789984001),
 ('unique', 1.0881409888008142),
 ('beauty', 1.050410186850232),
 ('greatest', 1.0248947127715422),
 ('incredible', 1.0061677561461084),
 ('sweet', 0.98966110487955483),
 ('oscar', 0.98721905111049713),
 ('beautiful', 0.97326301262841053),
 ('heart', 0.95238806924516806),
 ('masterpiece', 0.94155039863339296),
 ('season', 0.89827222637147675),
 ('great', 0.88810470901464589),
 ('enjoyed', 0.87070195951624607),
 ('moving', 0.85566611005772031),
 ('memorable', 0.84801189112086062),
 ('episodes', 0.84223712084137292),
 ('strong', 0.84167135777060931)

In [9]:
# Words most strongly associated with negative reviews (lowest scores first)
list(reversed(ratios.most_common()))[0:30]

[('waste', -2.4907515123361046),
 ('worst', -2.1930856334332267),
 ('awful', -2.1385076866397488),
 ('poorly', -2.1326133844207011),
 ('lame', -1.9117232884159072),
 ('horrible', -1.844894301366784),
 ('wasted', -1.7773187850732606),
 ('crap', -1.7677639636718392),
 ('badly', -1.6974777654416224),
 ('worse', -1.6820086052689358),
 ('terrible', -1.6742829939664696),
 ('mess', -1.6373136300125852),
 ('stupid', -1.6042380193725321),
 ('dull', -1.5370486519863287),
 ('avoid', -1.5285820410306898),
 ('ridiculous', -1.4643037391798193),
 ('boring', -1.4475226133603798),
 ('dumb', -1.3830343745326832),
 ('annoying', -1.3279306437993517),
 ('bad', -1.3181383703873577),
 ('unless', -1.3136186458671415),
 ('fails', -1.3117952384452776),
 ('supposed', -1.2447538467688914),
 ('poor', -1.2354574363960786),
 ('joke', -1.2084501667221239),
 ('cheap', -1.2024325652844348),
 ('oh', -1.060145138351082),
 ('predictable', -0.99614466563369874),
 ('sorry', -0.98361286453760466),
 ('save', -0.96543738528113