# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

with open('reviews.txt', 'r') as g:  # What we know!
    reviews = [line.rstrip('\n') for line in g]

with open('labels.txt', 'r') as g:  # What we WANT to know!
    labels = [line.rstrip('\n').upper() for line in g]

In [2]:
len(reviews)

25000

In [3]:
reviews[1]

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [4]:
labels[1]

'NEGATIVE'

# Lesson: Develop a Predictive Theory

In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)
pretty_print_review_and_label(0)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
POSITIVE	:	bromwell high is a cartoon comedy . it ran at the same time as some other progra...


In [7]:
# hypothesis: words contained in a review text are correlated with the label

In [16]:
# using the magic of the Counter() object
from collections import Counter
import numpy as np
# to filter out common words (stop words); requires a one-time nltk.download('stopwords')
from nltk.corpus import stopwords

In [18]:
# initializing the Counter objects
pos_count = Counter()
neg_count = Counter()
total_count = Counter()

In [19]:
stop_words = set(stopwords.words("english"))

In [20]:
for i in range(len(reviews)):
    if labels[i] == "POSITIVE":
        for word in reviews[i].split(" "):
            # a Counter behaves like a dictionary whose missing keys default to 0,
            # so a key can be incremented without being initialized first
            if word not in stop_words:
                # add only the more distinguishing words to the pos/neg count
                pos_count[word] += 1
            # count all words including stopwords
            total_count[word] += 1
    elif labels[i] == "NEGATIVE":
        for word in reviews[i].split(" "):
            if word not in stop_words:
                neg_count[word] += 1
            total_count[word] += 1
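
The loop above splits on a single space, which yields empty-string tokens wherever a review contains consecutive spaces. A minimal sketch of the difference, using a made-up sample string (not taken from the dataset):

```python
from collections import Counter

# Hypothetical sample text with a double space, similar in style to the reviews
sample = "it  s terrible . it is pure trash ."

# split(" ") keeps an empty string between consecutive spaces
tokens_single_space = sample.split(" ")

# split() with no argument splits on runs of whitespace and drops empty strings
tokens_whitespace = sample.split()

counts_single_space = Counter(tokens_single_space)
counts_whitespace = Counter(tokens_whitespace)

print(counts_single_space[""])  # number of empty-string tokens
print(counts_whitespace[""])
```

Whether to switch to `split()` is a judgment call: it removes the `''` token from the counts, but it would also change the numbers this notebook reports.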

In [36]:
# useful Counter functions!
pos_count.most_common()[:10]

[('', 550468),
 ('.', 159654),
 ('br', 49235),
 ('film', 20937),
 ('movie', 19074),
 ('one', 13655),
 ('like', 9038),
 ('good', 7720),
 ('story', 6780),
 ('time', 6515)]

In [37]:
neg_count.most_common()[:10]

[('', 561462),
 ('.', 167538),
 ('br', 52637),
 ('movie', 24965),
 ('film', 19218),
 ('one', 13134),
 ('like', 11238),
 ('even', 7687),
 ('good', 7423),
 ('bad', 7401)]

But this doesn't really tell us which words are **indicative** of positive or negative reviews. Most of these words appear very often in both positive and negative reviews.

In [38]:
# we need NORMALIZATION: which words are relatively more common in positive than in negative reviews
pos_neg_ratios = Counter()

for word, count in total_count.most_common():  # most_common() already returns a list, so list() is not needed
    if count > 100:
        # the +1 avoids division by zero for words that never appear in negative reviews;
        # in Python 3, / is already true division, so float() is not required
        pos_neg_ratio = pos_count[word] / (neg_count[word] + 1)
        pos_neg_ratios[word] = pos_neg_ratio
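
To see what the `+ 1` in the denominator does, here is a small sketch with made-up counts. The words, the numbers, and the helper name `pos_neg_ratio` are illustrative assumptions, not values from the dataset.

```python
from collections import Counter

# Hypothetical counts for illustration only
pos_count = Counter({"superb": 40, "movie": 100, "awful": 2})
neg_count = Counter({"movie": 99, "awful": 49})  # "superb" never appears in negative reviews

def pos_neg_ratio(word):
    # Adding 1 to the denominator keeps the division defined when neg_count[word] == 0
    return pos_count[word] / (neg_count[word] + 1)

ratios = {word: pos_neg_ratio(word) for word in pos_count}
print(ratios)
```

Words with a ratio well above 1 lean positive, words near 1 are roughly neutral, and words close to 0 lean negative.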

In [None]:
# take logarithms so that ratios above and below 1 are comparable in magnitude:
# log(2) ≈ 0.69 and log(0.5) ≈ -0.69, whereas the raw ratios 2 and 0.5 are not symmetric around 1
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # -log(1 / (ratio + 0.01)) equals log(ratio + 0.01);
        # the 0.01 avoids log(0) for words whose ratio is 0
        pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))
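
The effect of the log step can be seen in a small sketch: raw ratios for positive words range from 1 upward, while negative words are squeezed between 0 and 1, and the logarithm puts both sides on a comparable scale. The helper name `log_ratio` is made up for this example, and it uses `math.log` in place of `np.log` so that it runs without third-party packages.

```python
import math

def log_ratio(ratio):
    # Mirrors the two branches of the loop above: plain log for ratios > 1,
    # otherwise log(ratio + 0.01) so that a ratio of 0 does not hit log(0)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

for ratio in [4.0, 2.0, 1.0, 0.5, 0.25, 0.0]:
    print(ratio, round(log_ratio(ratio), 3))
```

A ratio of 0 maps to about -4.605, which is the value that appears for every word in the negative list further down.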

In [39]:
# words most strongly associated with positive reviews (many are actor or character names)
pos_neg_ratios.most_common()[:30]

[('edie', 109.0),
 ('paulie', 59.0),
 ('felix', 23.4),
 ('polanski', 16.833333333333332),
 ('matthau', 16.555555555555557),
 ('victoria', 14.6),
 ('mildred', 13.5),
 ('gandhi', 12.666666666666666),
 ('flawless', 11.6),
 ('superbly', 9.583333333333334),
 ('perfection', 8.666666666666666),
 ('astaire', 8.5),
 ('captures', 7.68),
 ('voight', 7.615384615384615),
 ('wonderfully', 7.552631578947368),
 ('powell', 7.230769230769231),
 ('brosnan', 7.0625),
 ('lily', 6.823529411764706),
 ('bakshi', 6.705882352941177),
 ('lincoln', 6.695652173913044),
 ('refreshing', 6.392857142857143),
 ('breathtaking', 6.3478260869565215),
 ('bourne', 6.346153846153846),
 ('lemmon', 6.333333333333333),
 ('delightful', 6.051282051282051),
 ('flynn', 6.0476190476190474),
 ('andrews', 5.909090909090909),
 ('homer', 5.866666666666666),
 ('beautifully', 5.828125),
 ('soccer', 5.8)]

In [35]:
# and what does the other end look like?
list(reversed(pos_neg_ratios.most_common()))[:30]
# interestingly, my results here differ from his; see the note below the output

[('its', -4.6051701859880918),
 ('were', -4.6051701859880918),
 ('in', -4.6051701859880918),
 ('where', -4.6051701859880918),
 ('our', -4.6051701859880918),
 ('under', -4.6051701859880918),
 ('ve', -4.6051701859880918),
 ('more', -4.6051701859880918),
 ('him', -4.6051701859880918),
 ('further', -4.6051701859880918),
 ('has', -4.6051701859880918),
 ('her', -4.6051701859880918),
 ('any', -4.6051701859880918),
 ('of', -4.6051701859880918),
 ('myself', -4.6051701859880918),
 ('few', -4.6051701859880918),
 ('whom', -4.6051701859880918),
 ('doing', -4.6051701859880918),
 ('shouldn', -4.6051701859880918),
 ('you', -4.6051701859880918),
 ('over', -4.6051701859880918),
 ('once', -4.6051701859880918),
 ('how', -4.6051701859880918),
 ('haven', -4.6051701859880918),
 ('re', -4.6051701859880918),
 ('hadn', -4.6051701859880918),
 ('when', -4.6051701859880918),
 ('because', -4.6051701859880918),
 ('aren', -4.6051701859880918),
 ('not', -4.6051701859880918)]

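^ Something's not quite right here: the negative end of `most_common()` differs from the original lesson, while the positive end looks plausible. The cause appears to be the stop-word filtering. `pos_count` and `neg_count` skip stop words, but `total_count` includes them, so every stop word with more than 100 occurrences gets a ratio of 0 / (0 + 1) = 0, which the log step maps to -log(1 / 0.01) ≈ -4.605. All 30 entries above are stop words with exactly that value. Note also that the positive list shows raw ratios (e.g. 109.0) rather than logarithms, apparently because it was produced without the log cell having been run.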
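
One possible way to keep stop words out of the ratio table is to skip them when building `pos_neg_ratios`, so that only words that were actually counted in `pos_count` and `neg_count` get a score. The sketch below uses tiny made-up counts, a hand-written `stop_words` set in place of the NLTK list, and `math.log` in place of `np.log`. It is an assumption about how a fix might look, not the original lesson's code.

```python
from collections import Counter
import math

# Hypothetical stand-ins for the notebook's variables
stop_words = {"the", "of", "not"}
pos_count = Counter({"flawless": 120, "film": 300, "awful": 3})
neg_count = Counter({"flawless": 9, "film": 299, "awful": 150})
total_count = Counter({"the": 5000, "of": 3000, "not": 900,
                       "flawless": 129, "film": 599, "awful": 153})

pos_neg_ratios = Counter()
for word, count in total_count.most_common():
    # Skip stop words: they were never added to pos_count / neg_count,
    # so their ratio would always be 0
    if count > 100 and word not in stop_words:
        pos_neg_ratios[word] = pos_count[word] / (neg_count[word] + 1)

# Same log transform as in the notebook
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = math.log(ratio)
    else:
        pos_neg_ratios[word] = -math.log(1 / (ratio + 0.01))

print(pos_neg_ratios.most_common())
```

An alternative would be to count stop words in `pos_count` and `neg_count` as well, which is closer to counting every word without a filter. Either way, the three counters then cover the same vocabulary.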