In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

with open('reviews.txt', 'r') as g:  # What we know!
    reviews = [line[:-1] for line in g]  # drop the trailing newline

with open('labels.txt', 'r') as g:  # What we WANT to know!
    labels = [line[:-1].upper() for line in g]

In [2]:
len(reviews)

25000

In [4]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [5]:
labels[0]

'POSITIVE'

In [6]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [7]:
from collections import Counter
import numpy as np

In [14]:
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

# TODO: Loop over all the words in all the reviews and increment the counts in the appropriate counter objects
for review, label in zip(reviews, labels):
    for word in review.split(" "):
        if label == "POSITIVE":
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1
        # Every word also counts toward the overall total
        total_counts[word] += 1


In [30]:
# Create Counter object to store positive/negative ratios
pos_neg_ratios = Counter()

# Only words seen in both positive and negative reviews get a ratio
for word in positive_counts:
    if word in negative_counts:
        pos_neg_ratios[word] = positive_counts[word] / float(negative_counts[word] + 1)



In [31]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 1.0607993145235326
Pos-to-neg ratio for 'amazing' = 4.022813688212928
Pos-to-neg ratio for 'terrible' = 0.17744252873563218


Looking closely at the values you just calculated, we see the following:

* Words that you would expect to see more often in positive reviews – like "amazing" – have a ratio greater than 1. The more skewed a word is toward positive, the farther from 1 its positive-to-negative ratio will be.
* Words that you would expect to see more often in negative reviews – like "terrible" – have positive values that are less than 1. The more skewed a word is toward negative, the closer to zero its positive-to-negative ratio will be.
* Neutral words, which don't really convey any sentiment because you would expect to see them in all sorts of reviews – like "the" – have values very close to 1. A perfectly neutral word – one that was used in exactly the same number of positive reviews as negative reviews – would be almost exactly 1. The `+1` we suggested you add to the denominator slightly biases words toward negative, but it won't matter because it will be a tiny bias and later we'll be ignoring words that are too close to neutral anyway.
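
The points above can be checked on a tiny example. This is a minimal sketch using made-up data: `toy_reviews`, `toy_labels`, and `toy_ratio` are hypothetical names that do not appear elsewhere in the notebook. It applies the same `positive / (negative + 1)` formula.

```python
from collections import Counter

# Hypothetical toy data, not taken from reviews.txt or labels.txt
toy_reviews = ["great great fun", "great plot", "awful plot", "awful awful fun"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_pos, toy_neg = Counter(), Counter()
for review, label in zip(toy_reviews, toy_labels):
    target = toy_pos if label == "POSITIVE" else toy_neg
    target.update(review.split(" "))

def toy_ratio(word):
    # Same formula as the notebook: positive count over (negative count + 1)
    return toy_pos[word] / float(toy_neg[word] + 1)

print(toy_ratio("great"))  # 3 / (0 + 1) = 3.0
print(toy_ratio("awful"))  # 0 / (3 + 1) = 0.0
print(toy_ratio("plot"))   # 1 / (1 + 1) = 0.5

# For a frequent word the +1 barely matters: the result is just under 1
frequent = 1000 / float(1000 + 1)
print(frequent)
```

Note that "plot" is perfectly balanced yet scores 0.5, because with counts this small the `+1` dominates. The bias only becomes negligible for words that appear many times, which is the case for a word like "the" in the real data.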

Ok, the ratios tell us which words are used more often in positive or negative reviews, but the specific values we've calculated are a bit difficult to work with. A very positive word like "amazing" has a value above 4, whereas a very negative word like "terrible" has a value around 0.18. Those values aren't easy to compare for a couple of reasons:

* Right now, 1 is considered neutral, but the ratios of very positive words lie much farther from 1 than the ratios of very negative words, which are squeezed between 0 and 1. So there is no way to directly compare two numbers and see whether one word conveys the same magnitude of positive sentiment as another word conveys negative sentiment. We should center all the values around neutral, so that a value's distance from neutral indicates how much sentiment (positive or negative) the word conveys.
* When comparing absolute values, it's easier to do that around zero than around one.

To fix these issues, we'll convert all of our ratios to new values using logarithms.

**TODO:** Go through all the ratios you calculated and convert them to logarithms. (i.e. use `np.log(ratio)`)

In the end, extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs.
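
To see why the logarithm helps, here is a small sketch. The first part uses made-up ratios (the `raw` dictionary and its keys are purely illustrative); the second part applies `np.log` to the three ratios printed earlier for "the", "amazing", and "terrible".

```python
import numpy as np

# Illustrative ratios: equally strong sentiment in opposite directions
raw = {"4x more positive": 4.0, "neutral": 1.0, "4x more negative": 0.25}
logs = {name: np.log(ratio) for name, ratio in raw.items()}

for name, value in logs.items():
    print("{}: {:.4f}".format(name, value))

# log(4) and log(1/4) have the same magnitude and opposite signs,
# and the neutral ratio 1 maps to exactly 0
print(np.isclose(logs["4x more positive"], -logs["4x more negative"]))

# The ratios printed earlier in this notebook
log_the = np.log(1.0607993145235326)
log_amazing = np.log(4.022813688212928)
log_terrible = np.log(0.17744252873563218)
print(log_the, log_amazing, log_terrible)
```

After the transform, the distance from zero measures the strength of the sentiment and the sign gives its direction.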

In [26]:
for key,value in pos_neg_ratios.items():
    pos_neg_ratios[key] = np.log(value)

In [27]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 0.05902269426102881
Pos-to-neg ratio for 'amazing' = 1.3919815802404802
Pos-to-neg ratio for 'terrible' = -1.7291085042663878


If everything worked, now you should see neutral words with values close to zero. In this case, "the" is near zero but slightly positive, so it was probably used in more positive reviews than negative reviews. But look at "amazing"'s ratio - it's above `1`, showing it is clearly a word with positive sentiment. And "terrible" has a similar score, but in the opposite direction, so it's below `-1`. It's now clear that both of these words are associated with specific, opposing sentiments.

Now run the following cells to see more ratios. 

The first cell displays all the words, ordered by how associated they are with positive reviews. (Your notebook will most likely truncate the output so you won't actually see *all* the words in the list.)

The second cell displays the 30 words most associated with negative reviews by reversing the order of the first list and then looking at the first 30 words. (If you want the second cell to display all the words, ordered by how associated they are with negative reviews, you could just write `reversed(pos_neg_ratios.most_common())`.)
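
As a quick illustration of how these two cells work, the sketch below uses a hypothetical `toy_scores` Counter with made-up log-style values. `most_common()` returns `(word, value)` pairs sorted from the highest value to the lowest, so reversing that list puts the most negative words first.

```python
from collections import Counter

# Hypothetical scores for illustration only
toy_scores = Counter({"superb": 1.4, "the": 0.06, "dreck": -3.3, "awful": -1.7})

# Highest value first
ranked = toy_scores.most_common()
print(ranked)
# [('superb', 1.4), ('the', 0.06), ('awful', -1.7), ('dreck', -3.3)]

# Reverse the ranking and slice to get the most negative entries
most_negative = list(reversed(ranked))[0:2]
print(most_negative)
# [('dreck', -3.3), ('awful', -1.7)]
```

Wrapping `reversed(...)` in `list(...)` is what makes the slice possible, since `reversed` returns an iterator rather than a list.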

You should continue to see values similar to the earlier ones we checked – neutral words will be close to `0`, words will get more positive as their ratios approach and go above `1`, and words will get more negative as their ratios approach and go below `-1`. That's why we decided to use the logs instead of the raw ratios.

In [32]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()

[('paulie', 59.0),
 ('gundam', 30.666666666666668),
 ('iturbi', 26.5),
 ('giovanna', 24.0),
 ('feinstone', 24.0),
 ('felix', 23.4),
 ('luzhin', 22.5),
 ('philo', 21.5),
 ('lindy', 21.0),
 ('capote', 20.333333333333332),
 ('fido', 20.25),
 ('excellently', 19.666666666666668),
 ('clutter', 19.5),
 ('anchors', 19.0),
 ('vance', 19.0),
 ('kolchak', 18.75),
 ('deanna', 18.666666666666668),
 ('hayworth', 18.666666666666668),
 ('korda', 17.5),
 ('sammo', 17.333333333333332),
 ('sox', 17.0),
 ('laputa', 17.0),
 ('biko', 17.0),
 ('callahan', 17.0),
 ('polanski', 16.833333333333332),
 ('alvin', 16.75),
 ('matthau', 16.555555555555557),
 ('aiello', 16.5),
 ('mathieu', 16.0),
 ('adele', 15.666666666666666),
 ('gackt', 15.666666666666666),
 ('myrtle', 15.666666666666666),
 ('bake', 15.5),
 ('gilliam', 15.25),
 ('bernsen', 15.0),
 ('hickock', 15.0),
 ('aviv', 15.0),
 ('ashraf', 15.0),
 ('gannon', 15.0),
 ('eustache', 15.0),
 ('victoria', 14.6),
 ('beery', 14.5),
 ('partition', 14.5),
 ('dench', 14.5

In [33]:
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('boll', 0.006944444444444444),
 ('uwe', 0.00980392156862745),
 ('thunderbirds', 0.016129032258064516),
 ('beowulf', 0.01639344262295082),
 ('dahmer', 0.018518518518518517),
 ('wayans', 0.0196078431372549),
 ('ajay', 0.020833333333333332),
 ('grendel', 0.020833333333333332),
 ('awfulness', 0.02564102564102564),
 ('seagal', 0.026143790849673203),
 ('steaming', 0.02631578947368421),
 ('welch', 0.02702702702702703),
 ('deathstalker', 0.02857142857142857),
 ('sabretooth', 0.03225806451612903),
 ('interminable', 0.03225806451612903),
 ('forwarding', 0.03333333333333333),
 ('devgan', 0.034482758620689655),
 ('gamera', 0.03571428571428571),
 ('varma', 0.037037037037037035),
 ('picker', 0.037037037037037035),
 ('razzie', 0.037037037037037035),
 ('dreck', 0.0379746835443038),
 ('unwatchable', 0.038461538461538464),
 ('nada', 0.04),
 ('stinker', 0.04040404040404041),
 ('kirkland', 0.041666666666666664),
 ('nostril', 0.041666666666666664),
 ('giamatti', 0.041666666666666664),
 ('aag', 0.04347826