# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

# Lesson: Curate a Dataset

In [2]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

In [3]:
len(reviews)

25000

In [4]:
reviews[1]

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [5]:
labels[1]

'NEGATIVE'

# Lesson: Develop a Predictive Theory

In [6]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)
pretty_print_review_and_label(0)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
POSITIVE	:	bromwell high is a cartoon comedy . it ran at the same time as some other progra...


In [7]:
# hypothesis: words contained in a review text are correlated with the label

In [9]:
# PART 1 - what are the words?

# collect the words of all reviews into one flat list
tokens = list()

for r in reviews:
    tokens.extend(r.split(" "))

In [12]:
tokens[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', '.', 'it', 'ran', 'at']

In [None]:
# there must be a better, more computationally efficient way of doing the above.
# it's taking forever

word_count
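
One candidate for the "better way" the cell above asks for is `collections.Counter` from the standard library, which tallies tokens in a single pass. This is only a minimal sketch: `sample_reviews` is a made-up stand-in for the `reviews` list, and the split on a single space mirrors the tokenization used earlier in the notebook.

```python
from collections import Counter

# Hypothetical stand-in for the `reviews` list loaded above
sample_reviews = [
    "this movie is terrible .",
    "this film is excellent .",
]

word_count = Counter()
for review in sample_reviews:
    # update() adds one to the tally of every token in the list
    word_count.update(review.split(" "))

# most_common(n) returns the n most frequent (word, count) pairs
print(word_count.most_common(3))
```

Because each token is hashed and counted once, the run time should grow roughly linearly with the total number of tokens.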

In [44]:
################## DIFFERENT TRY ##################
neg_coms, pos_coms = list(), list()

for s, t in zip(labels, reviews):
    if s == "NEGATIVE":
        neg_coms.append(t)
    elif s == "POSITIVE":
        pos_coms.append(t)
    else:
        # unexpected label: look deeper
        print("unexpected label:", s)

In [45]:
len(neg_coms)

12500

In [48]:
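# NOTE: this is a no-op. neg_coms holds whole reviews, not single words,
# so no element matches a stop word and nothing is removed.
[neg_coms.remove(w) for w in neg_coms if w in stop_words]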

[]

In [47]:
len(neg_coms)

12500
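
The length stays at 12500 because `neg_coms` contains whole reviews, and a whole review is never equal to a single stop word. A hedged sketch of filtering at the word level instead: `demo_stop_words` and `demo_reviews` are invented for illustration, whereas the notebook takes its stop words from NLTK.

```python
# Small hypothetical stop-word set; the notebook uses NLTK's English list instead
demo_stop_words = {"this", "is", "a", "the", "it"}

# Made-up reviews standing in for neg_coms
demo_reviews = ["this movie is terrible", "it is a waste of time"]

# Filter word by word, building new lists rather than mutating while iterating
filtered = [
    [w for w in review.split() if w not in demo_stop_words]
    for review in demo_reviews
]

print(filtered)
```

Building new lists avoids calling `remove()` on a list that is being iterated over.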

In [15]:
neg_dict, pos_dict = dict(), dict()

# counting words in negative reviews
for c in neg_coms:
    for w in c.split(" "):
        neg_dict[w] = neg_dict.get(w, 0) + 1

In [17]:
# counting words in positive reviews
for c in pos_coms:
    for w in c.split(" "):
        pos_dict[w] = pos_dict.get(w, 0) + 1

In [49]:
len(pos_dict)

55214

In [52]:
pos_dict = {key: value for key, value in pos_dict.items() 
             if key not in stop_words}

In [54]:
neg_dict = {key: value for key, value in neg_dict.items() 
             if key not in stop_words}

In [53]:
len(pos_dict)

55061

In [55]:
# sort the dictionaries by most common words
import operator
sorted_neg = sorted(neg_dict.items(), key=operator.itemgetter(1), reverse=True)
sorted_pos = sorted(pos_dict.items(), key=operator.itemgetter(1), reverse=True)

In [56]:
{k:v for k, v in list(neg_dict.items())[:5]}

{'': 561462, 'benvolio': 2, 'dwarfed': 2, 'highly': 277, 'jobbers': 1}
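
The `''` key with its very large count most likely comes from splitting on a single space: the review text contains runs of two spaces (see `reviews[1]` above), and `split(" ")` yields an empty string between them. A small illustration with a made-up sentence:

```python
text = "an insane  violent mob"  # note the double space

# Splitting on exactly one space keeps an empty string between the two spaces
with_empty = text.split(" ")

# Splitting on any run of whitespace drops empty strings
without_empty = text.split()

print(with_empty)
print(without_empty)
```

Calling `split()` with no argument would keep the empty token out of the dictionaries in the first place.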

In [63]:
sorted_neg[30:50]

[('watch', 3550),
 ('character', 3506),
 ('better', 3358),
 ('know', 3334),
 ('seen', 3265),
 ('ever', 3263),
 ('never', 3259),
 ('two', 3173),
 ('little', 3096),
 ('films', 3077),
 ('nothing', 2990),
 ('say', 2982),
 ('end', 2954),
 ('something', 2942),
 ('many', 2909),
 ('thing', 2866),
 ('show', 2862),
 ('scene', 2816),
 ('scenes', 2785),
 ('go', 2717)]

In [62]:
sorted_pos[30:50]

[('movies', 3586),
 ('characters', 3560),
 ('character', 3514),
 ('man', 3460),
 ('show', 3432),
 ('watch', 3424),
 ('seen', 3414),
 ('little', 3341),
 ('still', 3340),
 ('make', 3303),
 ('could', 3237),
 ('never', 3226),
 ('know', 2833),
 ('years', 2758),
 ('ever', 2734),
 ('end', 2696),
 ('real', 2568),
 ('scene', 2567),
 ('back', 2547),
 ('though', 2475)]

In [66]:
# let's pull out the top 200 and remove the words that appear in both lists
cn, cp = sorted_neg[:200], sorted_pos[:200]

# list of just the words
pos_200 = [l[0] for l in cp]
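# remove those that are also in the positive words
# NOTE: removing from cn while iterating over it skips the element right after
# each removal, so some shared words (e.g. 'one') survive
for line in cn:
    if line[0] in pos_200:
        cn.remove(line)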

In [67]:
len(cn)

113

In [68]:
neg_200 = [l[0] for l in cn]

# NOTE: same caveat as above, removing from cp while iterating can skip shared words
for line in cp:
    if line[0] in neg_200:
        cp.remove(line)

In [69]:
len(cp)

147

In [70]:
# top 10 uniquely negative
cn[:10]

[('.', 167538),
 ('movie', 24965),
 ('one', 13134),
 ('even', 7687),
 ('bad', 7401),
 ('really', 6262),
 ('see', 5452),
 ('much', 5078),
 ('people', 4806),
 ('could', 4686)]

In [71]:
# top 10 uniquely positive
cp[:10]

[('', 550468),
 ('br', 49235),
 ('film', 20937),
 ('one', 13655),
 ('like', 9038),
 ('good', 7720),
 ('story', 6780),
 ('time', 6515),
 ('well', 6405),
 ('also', 5550)]
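
Words such as `'one'` show up in both "unique" lists because the loops above remove items from a list while iterating over it, which skips the element right after each removal. A possible alternative, sketched with small made-up lists in place of `sorted_neg[:200]` and `sorted_pos[:200]`, computes the shared words once as a set intersection and then filters both lists against it.

```python
# Hypothetical (word, count) lists standing in for the two top-200 lists
top_neg = [("movie", 5), ("bad", 4), ("one", 3), ("worst", 2)]
top_pos = [("movie", 6), ("one", 4), ("great", 3), ("love", 2)]

# Words that occur in both lists
shared = {w for w, _ in top_neg} & {w for w, _ in top_pos}

# Build new lists instead of removing items during iteration
only_neg = [(w, c) for w, c in top_neg if w not in shared]
only_pos = [(w, c) for w, c in top_pos if w not in shared]

print(only_neg)
print(only_pos)
```

Both filtered lists keep their original order by count, and neither depends on the other having been filtered first.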

In [33]:
# load NLTK's English stop words (needs a one-time nltk.download("stopwords"))
from nltk.corpus import stopwords

In [36]:
stop_words = set(stopwords.words("english"))

In [15]:
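# words under test (the values shown in the printed results below)
neg, pos = "terrible", "excellent"

check_neg = []
check_pos = []
fail_neg = []
fail_pos = []

# collect the examples in separate lists, depending on whether the hypothesis holds for them
# note: `neg in t` is a substring test on the raw review text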
for s, t in zip(labels, reviews):
    if neg in t and s == "NEGATIVE":
        check_neg.append([s, t])
    elif neg in t and s != "NEGATIVE":
        fail_neg.append([s, t])
    if pos in t and s == "POSITIVE":
        check_pos.append([s, t])
    elif pos in t and s != "POSITIVE":
        fail_pos.append([s, t])

In [18]:
# doing some stats and looking at the results
print("NEGATIVE with '{0}': correct: {1}, incorrect: {2}".format(neg, len(check_neg), len(fail_neg)))
print("That's a ratio of {0}:1".format(len(check_neg)/len(fail_neg)))
print()
print("POSITIVE with '{0}': correct: {1}, incorrect: {2}".format(pos, len(check_pos), len(fail_pos)))
print("That's a ratio of {0}:1".format(len(check_pos)/len(fail_pos)))

NEGATIVE with 'terrible': correct: 1122, incorrect: 217
That's a ratio of 5.170506912442396:1

POSITIVE with 'excellent': correct: 1476, incorrect: 352
That's a ratio of 4.193181818181818:1


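Taken together, the mere presence of one of these two words shifts the odds considerably: a review containing "terrible" is about 5 times as likely to be NEGATIVE as POSITIVE (1122 of 1339, roughly 84%), and a review containing "excellent" is about 4 times as likely to be POSITIVE as NEGATIVE (1476 of 1828, roughly 81%).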
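
To try the same check on other words, the counting could be wrapped in a small helper. This is a sketch only: `label_ratio` is a hypothetical name, the demo data is invented, and the substring test mirrors the `in` check used in the notebook.

```python
def label_ratio(word, labels, reviews, target):
    """Count reviews containing `word`, split by whether their label equals `target`."""
    hits = misses = 0
    for label, text in zip(labels, reviews):
        if word in text:  # substring test, as in the notebook
            if label == target:
                hits += 1
            else:
                misses += 1
    # Guard against division by zero when the word never occurs with another label
    ratio = hits / misses if misses else float("inf")
    return hits, misses, ratio


# Tiny made-up sample, not the real dataset
demo_labels = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
demo_reviews = [
    "terrible plot",
    "terrible acting",
    "not terrible at all",
    "excellent cast",
]

print(label_ratio("terrible", demo_labels, demo_reviews, "NEGATIVE"))
print(label_ratio("excellent", demo_labels, demo_reviews, "POSITIVE"))
```

Returning the raw counts alongside the ratio makes it easier to spot words whose ratio looks impressive but rests on only a handful of reviews.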