# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# rstrip('\n') drops only the trailing newline, so the last line is not
# truncated if the file does not end with a newline
with open('reviews.txt', 'r') as g:  # What we know!
    reviews = [line.rstrip('\n') for line in g]

with open('labels.txt', 'r') as g:  # What we WANT to know!
    labels = [line.rstrip('\n').upper() for line in g]

In [2]:
len(reviews)

25000

In [6]:
reviews[1]

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [7]:
labels[1]

'NEGATIVE'

# Lesson: Develop a Predictive Theory

In [8]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)
pretty_print_review_and_label(0)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
POSITIVE	:	bromwell high is a cartoon comedy . it ran at the same time as some other progra...


In [9]:
# hypothesis: reviews that contain 'terrible' are NEGATIVE,
#             those containing 'excellent' are POSITIVE
neg = "terrible"
pos = "excellent"

In [15]:
check_neg = []
check_pos = []
fail_neg = []
fail_pos = []

# collect the examples in separate lists, depending on whether the hypothesis holds for them
for s, t in zip(labels, reviews):
    if neg in t and s == "NEGATIVE":
        check_neg.append([s, t])
    elif neg in t and s != "NEGATIVE":
        fail_neg.append([s, t])
    if pos in t and s == "POSITIVE":
        check_pos.append([s, t])
    elif pos in t and s != "POSITIVE":
        fail_pos.append([s, t])
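
The same bookkeeping could be wrapped in a small helper so that other candidate words can be tried quickly. This is only a sketch: `count_hits` is a hypothetical name that does not appear in the notebook, and the toy lists stand in for the real `labels` and `reviews`. Like the loop above, it uses a substring test, so "terrible" would also match inside a longer word.

```python
def count_hits(word, target_label, labels, reviews):
    """Return (correct, incorrect) counts for reviews containing `word`.

    `count_hits` is a hypothetical helper, not part of the original notebook.
    """
    correct = incorrect = 0
    for label, review in zip(labels, reviews):
        if word in review:  # substring test, same as the loop above
            if label == target_label:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect


# Toy data standing in for the real labels/reviews lists
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
toy_reviews = [
    "this movie is terrible",
    "terrible title but an excellent film",
    "pure trash",
    "excellent episode",
]

neg_hits = count_hits("terrible", "NEGATIVE", toy_labels, toy_reviews)
pos_hits = count_hits("excellent", "POSITIVE", toy_labels, toy_reviews)
print(neg_hits, pos_hits)
```

Returning counts instead of full lists keeps the helper cheap to call for many words; the list-based version above is still the better choice when the misclassified reviews themselves need to be inspected.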

In [18]:
# doing some stats and looking at the results
print("NEGATIVE with '{0}': correct: {1}, incorrect: {2}".format(neg, len(check_neg), len(fail_neg)))
print("That's a ratio of {0}:1".format(len(check_neg)/len(fail_neg)))
print()
print("POSITIVE with '{0}': correct: {1}, incorrect: {2}".format(pos, len(check_pos), len(fail_pos)))
print("That's a ratio of {0}:1".format(len(check_pos)/len(fail_pos)))

NEGATIVE with 'terrible': correct: 1122, incorrect: 217
That's a ratio of 5.170506912442396:1

POSITIVE with 'excellent': correct: 1476, incorrect: 352
That's a ratio of 4.193181818181818:1


Taken together, the mere presence of one of these two words is a strong signal: reviews containing "terrible" are labeled NEGATIVE about 5 times as often as not, and reviews containing "excellent" are labeled POSITIVE about 4 times as often as not.
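
The ratios can also be read as fractions of matching reviews, which may be easier to compare. The sketch below reuses the counts printed above; the `counts` and `fractions` dictionaries are introduced here for illustration only.

```python
# Counts copied from the output above
counts = {
    "terrible": (1122, 217),   # (NEGATIVE, not NEGATIVE)
    "excellent": (1476, 352),  # (POSITIVE, not POSITIVE)
}

fractions = {}
for word, (correct, incorrect) in counts.items():
    # share of reviews containing the word that carry the expected label
    fractions[word] = correct / (correct + incorrect)
    print("'{0}': {1:.1%} of matching reviews carry the expected label".format(
        word, fractions[word]))
```

That works out to roughly 84% for "terrible" and 81% for "excellent".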